Wireguard, ZeroTier, ControlD and NextDNS

This will be another soul-crushing length and techie laden post, so it definitely deserves a TL;DR to let most of you off the hook early.

TL;DR

When you use two or more services, each of which is awesome on its own, the interactions between them are sometimes unexpected and can cause headaches (followed by a hopefully improved system). I was bitten recently by one such interaction that led me down a deep rabbit hole.

Caveat

This tale of woe, followed by triumph, should not be taken as a knock on any of the services mentioned in the title. In fact, I think all four are absolutely amazing services. So, when you read that I’ve replaced one of them with another, it is not because there was any failing whatsoever in the one that was replaced (I’m not being politically correct here, I mean that 100%).

Brief History

(most people don’t consider anything I write to be brief, but I promise to try in this section…)

ZeroTier is one of the most amazing things I’ve ever used. I’ve been using it since 2017! It’s an SD-WAN (Software-Defined Wide Area Network).

Wireguard is a peer-to-peer VPN (Virtual Private Network) that is leaner than most VPNs. I’ve been using it (sparingly) for the past few years, mostly as a potential emergency backup to ZeroTier (that I’ve thankfully never really needed!).

NextDNS is a wonderful service for cloud-based DNS (Domain Name System) that is fully customizable. It can block ads (like a Pi-hole), override domain lookups and do many other things. I’ve been a paying customer for years now and have zero complaints about the service. Highly recommended.

ControlD is also a wonderful service for cloud-based DNS. It’s a competitor of NextDNS. I switched to ControlD from NextDNS sometime last year. This had nothing to do with anything negative about NextDNS (in fact, I renewed for another year in case I wanted to switch back!). Instead, ControlD is made by the same company that makes my VPN of choice (Windscribe). When they introduced ControlD I was intrigued enough to check it out, and only after it was in the market for a while and stabilized, I switched to it.

For non-technical people, I would probably recommend NextDNS over ControlD. While they can both pretty much do the same things, NextDNS has a simpler nomenclature and perhaps an easier web navigation. ControlD feels a bit more powerful (though I’m not really sure that it is), but they use terminology that even I struggle to understand (I know what the words mean, but not why they chose those words to describe various parts of the system).

Still, I highly recommend all of the above services with no reservations!

The issue that kicked all of this off…

Five nights ago, we were watching TV (around 8pm) when I noticed that my backup phone couldn’t connect to the Internet. It had 5 bars (I have an Extender in the bedroom), but the 5 bars had an exclamation mark after it, indicating no Internet access. I glanced at my main phone and it had 5 bars with no exclamation mark, and was definitely connected to the Internet, so I figured something was stuck on the backup phone.

While still watching TV (we were watching something off of a USB stick, so I didn’t need the Internet to watch TV) I started messing with the backup phone. I toggled WiFi off/on, eventually rebooting the phone. Nope, no Internet connectivity, even with 5 bars of WiFi.

Then I noticed that our Android TV Stick (which was still working locally) flashed up a notification that it couldn’t connect to the Internet. I looked at my main phone again, and it was still connected and working. Hmmm.

Within a few minutes, my main phone lost the connection as well. OK, this was clearly a problem on our Extender (for those who care, a FiOS E3200 unit connected to the main router, a G3100, via coax cable). I figured a reboot of the Extender was in order to get everything back up and running. Unfortunately, there are few devices in the world that take longer to come back up after a reboot than either a G3100 or an E3200.

Once it was back up, I still couldn’t get to the Internet from any device in the bedroom. I walked to the main router and still had 5 bars, but still no Internet. The problem wasn’t with the Extender after all. It appeared that my FiOS service was down.

There was quite the rain storm going on at the time. While we were grateful that we didn’t lose power (not the most uncommon thing for us during a storm), I figured that something took down FiOS, possibly in our entire neighborhood.

I switched both phones (and Lois’ as well) to use cellular data (when we’re home, we keep them on AirPlane mode 100% of the time, with WiFi on, since we have WiFi calling/texting and we get better battery life without connecting to a weak cell signal). We both had good enough connections to be able to feel better about being connected to the world.

I texted a friend who lives two doors down and also has FiOS service to ask if he was out. If he was, I could ignore the issue, knowing that it was a neighborhood-wide outage and Verizon would eventually fix it. Fortunately for him, and unfortunately for me, he had service. Uh oh.

I went through an automated FiOS troubleshooting system (It was easier, more helpful and effective/efficient than I ever would have guessed!). After running some tests, the system suggested rebooting my router, which it offered to do. Instead of getting out of bed to do it manually, I allowed the automated system to reboot the main router, which it did. Again, I had to wait 5-10 minutes for it to come back up.

Unfortunately, that didn’t solve the problem. When the automated system asked if the problem was solved, and I answered no, it offered to connect me to an actual person. The wait was very short (again, I was very impressed).

That person was incredibly helpful and he tried a variety of things, saying that everything looked perfectly normal on his end after each test. He even had me manually reboot the router and disconnect the Ethernet cable as well, and again said it looked good on his end.

At this point, I was sure (and wrong) that it was the ONT (Optical Network Terminal) that is installed in my crawl space. When I mentioned that to the representative, he said that it couldn’t be that since he ran tests on the ONT, and could get all the way into my router (he was correct and even the automated system was able to remotely reboot my router, indicating that of course, the ONT was alive and well).

The rep tried to get me to connect to a video support site (no audio, for privacy reasons) so that he could watch me (or rather the router) while I did things that he talked me through. I couldn’t connect (it was an Android permissions issue, which I had struggled with in the past and resolved, but under pressure, was unable to get going).

Since he couldn’t see what was happening on my end, the rep suggested that I simply factory reset the router. That was probably a very good next step attempt, but I was unwilling to do it. Just that day I had made a few updates to my router config (no, don’t let the alarm bells go off yet, that wasn’t the issue!) and I hadn’t backed up the config (yet, but I did it the next day).

I thanked him for his help and told him that I’d take a fresh look in the morning. It was 10pm at this point, way past my bedtime! He was sorry he couldn’t help me, but marked the ticket with the progress we’d made and assured me that if I contacted Verizon the next day, they could pick up exactly where we left off.

I went to bed…

The next morning…

I didn’t have trouble sleeping, but when I woke up as early as I usually do, I snuck out of the bedroom without waking Lois up (a minor miracle in itself!) and logged on to my main computer (a UM790 Pro mini pc) at 4am. I was fully prepared to have zero Internet, and to have to start the trouble-shooting with renewed vigor.

Much to my surprise, I was connected to the Internet and everything was working perfectly. I assumed that indeed there had been some external issue (that somehow affected my house, but not my neighbor, and somehow still allowed Verizon remote access to my house?!?) and it was now cleared. Wrong again!

My phones were still not getting Internet access. This was particularly baffling (for a second), since my machine also connects via WiFi to the same router, using the same SSID that my phone does, in the same room, so nothing appeared down with FiOS.

I’m happy to report that I didn’t have to scratch my head too long. The issue, entirely caused by me, for somewhat spurious reasons, was a new configuration that I made to my local DNS server, coupled with changes to Wireguard. Hence the introduction which makes the point that while ControlD had been running flawlessly for 7 months, and Wireguard had been working for years, making what appeared to be innocuous changes to each (but in this case, perhaps more specifically to ControlD) caused me to lose Internet connectivity.

Once I understood the issue (which I haven’t explained to you, the reader, yet), I was able to fix it relatively quickly. I’ll get to that soon. First, I have to explain why everything went down while we were simply relaxing in bed watching TV!

I hadn’t made any changes while watching TV, and when we started watching, everything was working, so what caused it all to go wrong in a cascade of failures (the backup phone, then the Android TV stick, then my primary phone) over the course of an hour?

DHCP Leases Expired and were renewed!

It was hard (if not impossible) for me to think clearly as to what was wrong when from my perspective, nothing changed (at that moment) and all of a sudden, nothing worked.

That happened because when my backup phone requested a new DHCP lease, it got one (the correct one), but the DNS servers (yes, I have a primary and a backup one) were both down. Not the machines; they were up (the same machines returned the valid DHCP lease, so they had to be). It was the ControlD service on each of the servers that was down.

Down the Rabbit Hole

This is already a long post, and it’s about to get a lot longer, and a lot more technical. If you’re still here, I imagine you’ll make it all the way through, but I’ll wish you luck in advance!

Up until last Tuesday (the same day that all of this happened), everything was running smoothly for quite a while. That was true while NextDNS was my main DNS service, and was still true after I switched to ControlD sometime in June 2023.

Likewise, I had been using ZeroTier for nearly 7 years (with only the tiniest occasional hiccups that were always fixed with a quick restart of the ZeroTier service, never any real or sustained outages).

When Wireguard was first announced, long before it became an official Linux kernel module, I installed it, just to see what all the hype was about. I got it running pretty easily, but given the way you had to generate peer-to-peer cryptographic keys, it seemed like managing more than a few machines, and wanting them to operate in a mesh (where every machine could communicate with every other one), was going to be somewhere between annoying and really frustrating.

This would be especially true when first starting out, because if you didn’t know exactly how to set things up, and you ended up wanting to make changes and tweak things, you’d be going through the frustrating generation process over and over, without having the proper automation tools (which didn’t exist yet, though they now exist in a number of forms!).

So, at that time, I set up a handful of Wireguard connections (successfully!) and decided that I would use them solely as emergency backups to my ZeroTier network.

Specifically, I have multiple publicly accessible servers on the Internet, and a number of private servers running in what people affectionately call a home lab. My ZeroTier network has 22 devices defined on it. If each of them had a Wireguard connection to every other one, that would be a lot of manually generated keys and config files.
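To put a rough number on "a lot", here is a small back-of-the-envelope sketch in Python. The function name and the dictionary keys are made up for illustration; the only input taken from this post is the count of 22 devices, and it assumes a full mesh where every device lists every other device as a peer.

```python
def mesh_requirements(devices: int) -> dict:
    """Count what a full peer-to-peer mesh needs for a given number of devices."""
    return {
        # One private/public key pair per device.
        "key_pairs": devices,
        # Each device's config file lists every other device as a peer.
        "peer_sections_per_config": devices - 1,
        # Peer sections summed over all config files.
        "total_peer_sections": devices * (devices - 1),
        # Unique device-to-device links in the mesh.
        "tunnels": devices * (devices - 1) // 2,
    }


print(mesh_requirements(22))
```

For 22 devices that works out to 21 peer sections in each of 22 config files, which is why doing it by hand (and redoing it after every tweak) looked so unappealing.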

What I really needed, for emergency backup purposes only, was a way to get from a handful of key machines (mostly my main computer, but perhaps also one or two home lab servers and my phones), to be able to get to the servers on the Internet. That way, if ZeroTier failed (for whatever reason) and I needed to reach a public server, I’d be able to find some route through the existing Wireguard mesh to access the machine. All tests proved that this would work, and once proven, I stopped working on Wireguard or doing any further tests.

A few years later, Wireguard was included in the Linux kernel natively, and lots of tools (and services) appeared to make managing a Wireguard network much easier. Perhaps the most well-known (and touted) one is TailScale. It can be used for free and major parts of it are open source. But, one key component is closed source, and is typically run on their servers. There is at least one open source alternative to that component, so there are choices, and people swear you can trust the company even if you use their proprietary controller, but for many, that is a no-go. There are also paid plans, that companies would definitely opt for (wisely). It makes setting up large Wireguard networks trivial.

I had no reason to play with it (even the fully open source version) due to my undying love of ZeroTier.

At some point a few months ago, probably because some sluggish network copy caught my eye, I decided to benchmark a few network scenarios. I was interested to see if Wireguard was truly faster than ZeroTier (people claim that Wireguard is the leanest, and hence fastest VPN-like solution around). I was also interested in whether going direct to a machine (using plain old ssh, rather than layering a VPN on top of the connection) would be even faster.

I performed many tests, and at the time, convinced myself that a Wireguard connection was much faster than a ZeroTier one, and that a pure ssh connection was faster than Wireguard. I still didn’t do anything about it for a few reasons:

  1. It would still be painful to duplicate my 22 device ZeroTier network in Wireguard, especially if I needed to make any changes occasionally, without using yet another service (TailScale being just one of a few alternatives). If I was going to rely on another service, I could simply continue to rely on the service that hadn’t let me down in 7 years (namely, ZeroTier).
  2. While I move around a fair amount of data on the network (hence the desire for a quicker transfer speed!), for the most part, I’m ssh’ing and doing CLI (Command Line Interface) things on the various machines (in which case speed is practically immaterial).
  3. Muscle memory for the various commands that I type was well ingrained (7 years’ worth) and would be annoying to have to change.
  4. Many scripts that I have written over the years embed the ZeroTier machine names in them, and all would need to be updated, and even enhanced if I wanted to easily be able to use either network.
  5. My original tests of Wireguard config files had me believe that you couldn’t use domain names in them. Back then, the only way I got things to work was by using static IP addresses. This was a maintenance nuisance (but not a show-stopper), because most of my devices are actually behind NAT’ed dynamic IP addresses (they don’t change often, so it’s not a maintenance nightmare, but they do change, so there would be some painful maintenance).

So, I had it in the back of my mind to one day revisit this, if it ever became an issue (speed wise) or if tooling for Wireguard got easier. I also thought it might be fun to write a script that regenerated all of the config files for all of the devices, but never got around to doing it.

A quick aside about ZeroTier vs Wireguard

It would be wrong to conclude that because Wireguard is faster than ZeroTier, you should use Wireguard instead of ZeroTier in all situations. In fact, while ZeroTier can do everything that Wireguard can do, the reverse isn’t even remotely true. ZeroTier can do all sorts of things, including deep packet inspection, using their own built-in rules. You could have read-only nodes, nodes that only process certain types of packets, etc. It’s literally up to your imagination as to how you want to configure your connection. Wireguard is purely a direct encrypted connection between two nodes, that’s it. No wonder it’s faster at moving data through the system. That said, I don’t use any of the advanced features, so for me, Wireguard can replace ZeroTier.

Back to our main story…

A week ago, I stumbled on an app called wg-meshconf that does exactly what I was considering writing (but 100x more professionally than I ever would have implemented it). I decided to try it, just to know, but still thought I wouldn’t switch from ZeroTier being my primary network solution.

wg-meshconf works as well as described (I think I discovered it from the Scaleway site). I was able to easily generate the conf files for my devices (I had no reason to generate all 22 since I still didn’t think I was going to switch over).

But, over the next day or so, I figured why not? After all, my original objection was the (potential) maintenance of the Wireguard config files, and wg-meshconf solved that problem.

I distributed the various Wireguard config files to their respective hosts. All tests worked, so wg-meshconf did its job. Now it was time to overcome the five numbered objections listed above.

#1 was done (thanks to wg-meshconf). #2 was going to be an automatic improvement (or so I thought). #3 couldn’t really start until I made the switch (which I now did, and would find out how ingrained my muscle memory was). #4 would have to be done before those scripts would be usable (this involved not only changing the scripts, but making changes to /etc/hosts files and ControlD DNS entries to name all of the new connections that would only use Wireguard).

The most intriguing one, #5, was the one that ended up causing the headache last Tuesday! It seemed that now Wireguard could correctly handle domain names in the Endpoint config. It’s possible that it always did, and I didn’t know what I was doing. The only thing I can attest to is that in my very limited testing, I was not able to get it to work in the early days of Wireguard (before it was a kernel module).

I was extremely excited by this discovery! It meant that I would never need to mess with a Wireguard config file ever again, unless I needed to add or change devices. If my home IP changed, all machines that were outside my home would automatically pick up the new address the next time Wireguard was restarted. Since I’m alerted whenever any of my dynamic IP addresses change (via push notifications to my phones), it’s trivial for me to restart Wireguard if necessary. That’s certainly easier than editing all of the config files by hand before having to restart Wireguard anyway.

So, I built the Wireguard conf files with domain names instead of IP addresses. Then I took a look at the ControlD config file on my main computer and realized that I could simplify it considerably (famous last words). I should take a brief digression into my ControlD setup to avoid any confusion for people who are familiar with it, and help newcomers not make specific assumptions.

ControlD setup

Just like with NextDNS, you can run ControlD without ever installing a single piece of software on your computers or phones. You can manage everything in the cloud through their websites and simply point your DNS to their servers (this works perfectly fine and is the recommended way to use either service).

Since when did I follow recommended policies? Not that often. 😉

An alternate way to run either service (and that’s the way I ran NextDNS and now run ControlD) is to run an instance on your machine. You still configure most things in the cloud, but now the local instance, running on your machine, reads that data once from the cloud and then serves the DNS queries locally.

This has the advantage of caching DNS results locally, avoiding future requests for the same domain to have to leave your network to get a response. It also means that you can do some fancy things in the local config (like making decisions on which Profile or Device to route the request through). For me, it also means that I can run ControlD on my local servers and pass out their address through DHCP to all devices on the network (including, for example, my Android TV stick!), so that they benefit from any rules that I set up (like Ad Blocking, for example).

I run three instances of ControlD: two on the DHCP servers that serve the entire home network, and one on my main computer, simply because I can (why should I make a call across my network, even though it’s internal, when I can have the answer sitting right there in RAM?).

Now we can return to our regularly scheduled post…

Back to our saga…

Recall that I had only domain names in the Wireguard config files everywhere. Tests showed that working fine!

Then I simplified my ControlD file. That meant that instead of listening on separate interfaces, which I originally did to accommodate the ZeroTier network/addresses, I changed it to listen to all interfaces at the same time (using 0.0.0.0 as the catchall interface).

What I can’t swear to is whether I actually tested that or not. Even if I had, it likely would have worked on my computer. I most certainly did not test it on the DHCP servers! Oops!

So, the original purpose of this post was to point out that even if each of the above services was 100% flawless on their own, once you mix and match services and make them rely on one another, you might have unintended consequences. None of the vendors is responsible for that kind of interaction, that’s all on me (or you in your case).

So, what happened?

The actual problem

When I restarted ControlD on the two servers (without testing, or checking after the restart!), they likely worked. I can’t swear to that, but I believe that otherwise my devices would have failed instantly, rather than failing after a DHCP lease renewal (I can’t prove that after the fact, but I can explain why I think that’s true!). I’ll come back to that explanation after explaining the next step though.

To recap: first, I distributed the Wireguard configs to every device that needed one. Then I restarted the Wireguard service to read the new config. That worked and was done without having to reboot any of the machines. So far, so good.

Next, I updated the ControlD config on my computer and it worked as well. Then I updated the ControlD config on each of the two DHCP servers. That (likely) worked too.

I regularly update my systems when there are upstream updates available. That day (or the next, I really can’t recall), there was an update that required a reboot (or at least makes me want to reboot), namely a kernel update or a systemd upgrade. If either of those is updated, I reboot any machine that got the update.

Now, finally, the explanation of what failed, exactly. When the machines rebooted, they ended up in a weird race condition (entirely caused by my updating both the Wireguard and ControlD configs). I’m not sure which process started before the other (I could control that in a systemd unit file in the future if I wanted to), but it wouldn’t have mattered, given the way I set up the configs (recall, this was entirely a problem of my own making!).

For argument’s sake, let’s say ControlD started first. When it was told to listen on the interface 0.0.0.0 it tried to do just that. But, it failed on wgopt (the Wireguard interface), since it wasn’t up. However I had it configured (again, my fault), that caused ControlD to fail and simply die with an error in the log.

Now along comes Wireguard and tries to resolve all the domain names (recall, there were no IP addresses in the config). The DNS queries all fail, because the machines are explicitly told to resolve through the DHCP servers, and the ControlD process on each of those servers failed and died!

Whenever any other machine (like my phones) goes to renew their DHCP lease, they too try to resolve everything through those servers, and fail on every single call to retrieve a valid IP address for whatever domain they’re trying to reach. Total failure!

The reverse order would have failed too, and makes more sense. If Wireguard started before ControlD, it would have failed due to no DNS server running. Then ControlD would fail because Wireguard wasn’t running.

Ugh.
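The circular dependency is easier to see in a toy model. The following Python sketch is only a simulation under the assumptions described above (ControlD exits if the Wireguard interface is missing, Wireguard cannot start with domain names in its config unless ControlD is answering DNS, and neither service retries). The function and service names are invented for the illustration and do not come from either product.

```python
from itertools import permutations


def boot(order, endpoints_are_names=True):
    """Return the set of services left running after one simulated boot."""
    running = set()
    wg_interface_up = False
    for service in order:
        if service == "controld":
            # Modelled on the post: the resolver has a listener tied to the
            # Wireguard interface and exits if that interface is missing.
            if wg_interface_up:
                running.add("controld")
        elif service == "wireguard":
            # With domain names as endpoints, start-up needs working DNS,
            # which here means a running ControlD. With IPs it does not.
            if not endpoints_are_names or "controld" in running:
                running.add("wireguard")
                wg_interface_up = True
    return running


# Try both start orders with domain names in the Wireguard config.
for order in permutations(["controld", "wireguard"]):
    print(order, "->", sorted(boot(order)) or "nothing running")
```

Both orders leave nothing running. Calling `boot(("wireguard", "controld"), endpoints_are_names=False)` leaves both services up in this model, which hints at two ways out: take DNS out of Wireguard's start-up path, or stop ControlD from depending on the Wireguard interface.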

So, why did this all work on my computer (if you haven’t fallen asleep yet, and you’re still with me, you might recall that I said above that when I booted my computer in the morning, it just worked!).

The answer is that this is a new-ish machine. I’ve only had it since July 2023. By then, I had already decided to stick with ZeroTier (as noted above) so I never bothered to generate any Wireguard config on this box. I did finally put one on, and I did update the ControlD config, but the other machines had an explicit section which listened on the Wireguard interface, and this one never did, and I didn’t add one during the update (since I was going more generic).

That got me to notice that the process (ControlD) failed on both the DHCP servers. That got me to look at their log files and see why. I was able to bring them up reasonably quickly by making some small changes to the config. But, it made me stop and think about how and why I really wanted this all to work (hence the true rabbit hole nature of the problem!).

Further down the Rabbit Hole…

Taxing your memory again, I’ll remind you that I had convinced myself long ago that a direct ssh connection was the fastest, Wireguard was next, and ZeroTier was the slowest (still fast enough to satisfy me for 7 years!). In my (sometimes addled) brain, I convinced myself that this had to do with multiple levels of encryption.

In a direct ssh connection, there is a single encryption of each packet. When adding either Wireguard or ZeroTier on top, you would get their encryption, but since you’re still going through an ssh tunnel (whether you’re using scp or rsync, etc.), then ssh is encrypting the encrypted packets again, so it has to be slower, right? After all, I had tested this (albeit very long ago).

So, I decided to optimize this into oblivion. The simplest solution would be to replace the domain names with IP addresses in the Wireguard config file. That way, it couldn’t fail to start up due to any DNS issues, as it wouldn’t rely on DNS. Worst case, I’d have to update any dynamic IP addresses whenever they change. So be it.
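As a rough illustration of that idea, here is a Python sketch that rewrites `Endpoint = host:port` lines so they carry an IP address instead of a name. This is not how wg-meshconf or Wireguard do anything; the function name, the sample host `home.example.net`, the addresses and the placeholder key are all hypothetical. The resolver is passed in as a parameter so the example runs without touching the network (by default it would use `socket.gethostbyname`).

```python
import re
import socket
from typing import Callable

# Matches lines such as "Endpoint = home.example.net:51820".
# Bracketed IPv6 endpoints do not match and are left untouched.
ENDPOINT_RE = re.compile(r"^(\s*Endpoint\s*=\s*)([^:\s]+):(\d+)\s*$", re.IGNORECASE)


def pin_endpoints(config_text: str,
                  resolve: Callable[[str], str] = socket.gethostbyname) -> str:
    """Replace host names in Endpoint lines with whatever resolve() returns."""
    out = []
    for line in config_text.splitlines():
        match = ENDPOINT_RE.match(line)
        if match:
            prefix, host, port = match.groups()
            line = f"{prefix}{resolve(host)}:{port}"
        out.append(line)
    # Preserve a trailing newline if the input had one.
    return "\n".join(out) + ("\n" if config_text.endswith("\n") else "")


# Hypothetical peer section and a fake resolver, so this runs offline.
sample = """[Peer]
PublicKey = <peer public key>
Endpoint = home.example.net:51820
AllowedIPs = 10.10.0.2/32
"""
fake_dns = {"home.example.net": "203.0.113.7"}
print(pin_endpoints(sample, fake_dns.__getitem__))
```

Run at config-generation time (while DNS is known to be working), something like this keeps the convenience of names in the source file while the deployed file no longer needs DNS to start. The trade-off is the one already mentioned: the files have to be regenerated when a dynamic address changes.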

But, why stop there? If direct ssh is faster, why not choose that whenever possible, and only use Wireguard when that’s not reasonable? So, I switched to public addresses where available (to avoid both Wireguard and ZeroTier) and dynamic addresses for NAT’ed networks. I had to make multiple versions of the Wireguard file on my main computer, which I’d have to manually swap based on what location I was in (I could probably automate that, but hey, one thing at a time…).

Then I updated all of my scripts to automatically choose the best route to whatever server I was trying to reach (ssh, Wireguard or ZeroTier). That all worked too!
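The actual scripts are not shown here, but the general shape of "try the preferred route first, fall back to the next" can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: the function names, the addresses, and the choice of a TCP connection attempt to port 22 as the reachability test.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple


def tcp_reachable(address: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def pick_route(candidates: Sequence[Tuple[str, str]],
               probe: Callable[[str], bool] = tcp_reachable
               ) -> Optional[Tuple[str, str]]:
    """Return the first (label, address) pair whose address answers the probe."""
    for label, address in candidates:
        if probe(address):
            return label, address
    return None


# Hypothetical addresses for one server, listed in order of preference.
candidates = [
    ("direct ssh", "203.0.113.10"),
    ("wireguard", "10.10.0.2"),
    ("zerotier", "10.147.17.2"),
]

# A fake probe (pretend the public address is unreachable) keeps this offline.
print(pick_route(candidates, probe=lambda address: address != "203.0.113.10"))
```

Dropping the `probe` argument makes it use real connection attempts, at the cost of up to one timeout per unreachable candidate.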

We’ve been up and running fine for 5 days on the new config with no issues.

So, why label this section as “Further down the Rabbit Hole…”? Because…

Further tests reveal…

While writing this blog, I wanted to present actual data for the various speed tests. To do that, I copied (using scp) a 4GB file to my fastest cloud server (it has both the fastest raw CPU and the fastest network connection of the bunch).

I sent the file multiple times, in each of the protocols: ssh, Wireguard and ZeroTier. I was sure that they would register in that order. If you can’t tell already, I was wrong (yet again…).

Raw ssh took 3 minutes, at an average of 22MB/s.

ZeroTier took 3 minutes and 31 seconds, at an average speed of 19MB/s (like I said, the penalty wasn’t that hard to live with for 7 years, given the awesome flexibility and ease of administration).

But, Wireguard, which I expected to be between them (so figure 20 or 21MB/s), came in at 2 minutes and 28 seconds, at an average speed of 27MB/s.

All three were retested and came in at roughly their same respective speeds!
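As a sanity check, the reported averages line up with the elapsed times. The short Python sketch below redoes the arithmetic, assuming the "4GB" file is about 4000 MB (the exact size isn't stated, so treat the results as approximate).

```python
FILE_MB = 4000  # assumption: "4GB" taken as roughly 4000 MB


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Average transfer rate in MB/s for a copy of size_mb taking seconds."""
    return size_mb / seconds


# Elapsed times for the three copies to the cloud server, in seconds.
timings = {
    "ssh": 3 * 60,
    "ZeroTier": 3 * 60 + 31,
    "Wireguard": 2 * 60 + 28,
}

for name, seconds in timings.items():
    print(f"{name}: {throughput_mb_s(FILE_MB, seconds):.1f} MB/s")
```

That gives roughly 22, 19 and 27 MB/s respectively, matching what scp reported.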

It’s ironic that at least for this test, Wireguard was dramatically faster than the other two, since I had every script using only Wireguard (for convenience) before I went and configured each of my scripts to be smarter and choose a direct connection whenever that was possible.

I’m still happy I did that, because writing that code was fun and instructive, but now I have to decide if I chalk that up to a learning experience and revert everything back to using Wireguard only (with ZeroTier as the new emergency backup), or just leave well enough alone!

I fully intended to have the above sentence be the last one in this post. But then it occurred to me that I should test the same file copy entirely on the LAN, to remove all Internet noise and routing issues. So, I repeated the test, ssh, Wireguard and ZeroTier, copying from my machine (like before), but to a local server over the LAN.

Raw ssh took 1 minute and 24 seconds, at an average speed of 47.2MB/s.

ZeroTier took 2 minutes and 29 seconds, at an average speed of 26.6MB/s.

Wireguard took 1 minute and 47 seconds, at an average speed of 36.8MB/s.

Whew! This validated my long-ago testing, where direct ssh was fastest, Wireguard was slower, but much faster than ZeroTier.

I guess I have my answer now, which is to just leave everything the heck alone!


Comments

2 responses to “Wireguard, ZeroTier, ControlD and NextDNS”

  1. Jamie Thingelstad

    Hadar — your network debugging skills are impressive! 🙂

  2. hadar

    Thanks Jamie! I admit that I really love this stuff. Being retired gives me the time to create problems that I then enjoy fixing.
