Reaction vs Fail2Ban vs Crowdsec

This is a highly technical post that will only interest people who host a server that’s open to the Internet. That excludes nearly everyone that typically reads my blog, so I’ll start with a TL;DR so you can bug out quickly.

TL;DR

reaction is a relatively new system that is capable of replacing fail2ban. It can also replace a majority of the capabilities of crowdsec, though it will likely never have the community aspect of crowdsec (the next version might at least be able to cluster multiple servers). I have switched to reaction and I love it.

The long, boring details…

If you expose a server to the Internet, then you know that it gets pounded (probed) all day, every day, by bots (and people) looking for vulnerabilities, or opportunities to spam/troll/phish. It’s like living in a proctologist’s office…

If you have the slightest sense, you try to minimize the potential damage in a variety of ways. One of those ways is to have monitoring software running on the server that attempts to detect bad actors and nip their intrusions in the bud. This is typically done by having the monitoring software read various log files and then ban any IP address that is trying to do the bad thing.

The two most prominent systems for accomplishing this (both monitoring and banning) are Fail2Ban and CrowdSec. Both are linked above in the TL;DR paragraph. Fail2Ban is the older (theoretically more mature) of the two, and was the first one that I implemented, years ago (I was unaware of CrowdSec at the time, and it might have been too early to rely on it back then anyway…).

Fail2Ban isn’t all that complicated (given that you’re running an open server on the Internet to begin with!), especially since the default configuration will typically give you reasonable protection out of the gate (relative to a naked server on the net). If you have multiple services running on the server, or want to tighten things up, it’s still not hard, but you’re going to need to read a lot of docs and experiment to get things right.

Part of the problem is that there are multiple files, in multiple directories, that all need to be understood (and potentially modified) to get Fail2Ban to do exactly what you want.

I ran Fail2Ban for many years, and was not unhappy with it, but I wasn’t exactly happy either. I tweaked it on occasion, but it was never pleasant, so I avoided doing it frequently and I was definitely not as aggressive with it as I would have liked to be.

Sometime in 2022 I became aware of CrowdSec. It can do all of the things that Fail2Ban does (or can do), but it has at least two additional benefits over Fail2Ban.

  1. in addition to it being able to spot bad actors on your server, it could aggregate bad actors from all other CrowdSec users on the Internet and share that list with everyone. In other words, you could (theoretically) ban many bad IP addresses long before they even get to your server. That’s extremely cool (IMHO).
  2. if you operate more than one server, CrowdSec makes it very easy to share a ban from one machine to all of the others in your cluster, in real-time. That’s pretty cool too. The point is that a server is often targeted by grabbing all of the data from a DNS server, so the likelihood of multiple servers that are part of the same domain being hit by the same bots is very high. This applies even more to a simple load balanced cluster, where the machines are all answering the same domain name to begin with.

After having CrowdSec on my radar for well over a year, I finally decided to take the plunge and implement it. It was much simpler to get running than I expected (a big plus for them), and wasn’t that hard to add simple rules to. It was also easy to get a cluster going (I run multiple servers in multiple data centers), thanks to their excellent documentation.

Unfortunately, if you want to extend it in more intricate ways, it’s most definitely not intuitive (it may not be hard, but you’d have to read a ton of documentation to really make the correct changes). If a reasonable knock on Fail2Ban is that there are many files spread across multiple sub-directories, then CrowdSec puts Fail2Ban to shame in that regard.

There are many sub-directories (some of the naming struck me as absurd, but there’s likely a reason for it, even if it’s simply historical). Basically, after setting it up and doing some relatively simple tweaks, I was not motivated to dive in more deeply to make the changes I really wanted to make. That said, it was certainly running well, and I was never going to return to Fail2Ban over CrowdSec.

A few weeks ago I stumbled across reaction. I believe it was in some Mastodon post, but if not, it was likely mentioned in a Reddit post. Either way, I was intrigued. You can implement reaction with two files: a single binary (reaction, written in Go, which doesn’t require any other dependencies!) and a single config file, which can be written in either YAML or Jsonnet. I had never heard of Jsonnet before, but the author of reaction recommends it over YAML, so I went with it as well, and I’m very happy with that choice too.

There is a third file that is recommended, but not required, called ip46tables (a tiny helper binary that the author makes available to simplify the banning and unbanning process). I use that too, and it works as intended. So, exactly three files to implement a full monitoring and banning system.

I’ve been running it for two weeks now, on four servers. I already mentioned that I love it, now I’ll tell you why. 🙂

The structure of the single config file makes so much sense. It’s clean, intuitive, and any changes or additions that you might want to make are fairly obvious as to how and where the changes should go. The example config is also heavily documented, so you don’t need to run all over creation and read chapters of documentation, in order to take a stab at making a change.

In the TL;DR above, I linked to a general tutorial article about getting started with reaction. That’s the first exposure I had as well. There are links in the article to the wiki and repo, among others. All are good resources, but the tutorial itself is what sold me. There is an issue tracker that the author actively monitors and encourages people to participate in (I have a few issues in there as well).

I am currently monitoring many log files split across four major service groups (SMTP, SSH, IMAP, HTTP). The main example file provided in the wiki only shows HTTP (nginx) usage. I was able to add the other ones (and extend the HTTP section) extremely easily (and robustly). I add new rules without hesitation and change ban times, etc., on a daily basis (this is still new to me, and a ton of fun, so this will obviously settle down in the near future).

The point is that I can scan a log file for attempted exploits, and in seconds, add it to my reaction.jsonnet config file. Boom, exploit gone. I’m not foolish enough to think that I’m now secure. I realize that there are exploits I’ll never spot (and even some I couldn’t, like firmware bugs), but there’s no doubt that I’m already way better off than I was with Fail2Ban or CrowdSec.

I’m happy to share my full config file (slightly redacted for my whitelisted IPs, etc.), but part of the fun is discovering things for yourself. So, I’ll include a few snippets here, to whet your appetite.

I run a postfix smtp server. My config there stops a lot of nonsense, but it doesn’t have a way of stopping the repeated attempts thereof. In other words, just because someone can’t relay through my server, doesn’t mean they can’t send 1,000,000 relay requests (each of which will be caught). Each of those attempts can be spotted and their IPs banned, with the following stanza:

mail: {
  cmd: ['tail', '-fn0', '/var/log/mail.log'],
  filters: {
    spam: {
      regex: [
        @'NOQUEUE.*\[<ip>\][\:\; ].*',
      ],
      retry: 2,
      retryperiod: '24h',
      actions: banFor('48h'),
    },
  },
},

The above looks for two instances of a log entry with NOQUEUE in it, with the same IP address. If that occurs within 24 hours, then that IP address will be banned for 48 hours. You can see how trivial it is to make changes to any of those parameters (e.g., I have no idea why I even wait for the second attempt, or only ban for 48 hours, etc.).
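If it helps to picture what retry and retryperiod are doing, here is a small Python sketch. It is only a toy model of the behavior described above, not how reaction is actually implemented. The class name RetryWindow, its method record, and the sample IP addresses are all made up for illustration.

```python
from collections import defaultdict, deque


class RetryWindow:
    """Toy model of a retry/retryperiod rule (hypothetical, not reaction's code).

    An IP triggers once it has `retry` matches within `retry_period` seconds.
    """

    def __init__(self, retry, retry_period):
        self.retry = retry
        self.retry_period = retry_period
        self.matches = defaultdict(deque)  # ip -> timestamps of recent matches

    def record(self, ip, now):
        """Record one matching log line; return True if the IP should be banned."""
        times = self.matches[ip]
        times.append(now)
        # Forget matches that have fallen out of the window.
        while times and now - times[0] > self.retry_period:
            times.popleft()
        if len(times) >= self.retry:
            times.clear()  # start counting afresh once the rule has triggered
            return True
        return False


HOUR = 3600
window = RetryWindow(retry=2, retry_period=24 * HOUR)

# Two matches five hours apart: the second one triggers the ban.
first = window.record("203.0.113.7", now=0)
second = window.record("203.0.113.7", now=5 * HOUR)

# Two matches thirty hours apart: the first has expired, so no ban.
late = window.record("198.51.100.9", now=0)
later = window.record("198.51.100.9", now=30 * HOUR)

print(first, second, late, later)
```

Setting retry to 1 in this model bans on the very first match, which is the "why wait for the second attempt?" question from the paragraph above.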

You can whitelist (reaction calls it ignore) any IP address that you trust and control. That way, if you do something silly (like fail your ssh login multiple times because you are hungover from New Year’s Eve), you won’t be locked out!

Here’s a slightly more sophisticated stanza, where multiple regexes are run as a group, with a single controlling action:

dovecot: {
  cmd: ['tail', '-fn0', '/var/log/dovecot.log'],
  filters: {
    nocert: {
      regex: [
        @"client didn't send a cert.*rip=<ip>,",
        @',<ip>\)\: Request timed out',
        @",<ip>,.* Client didn't present valid SSL certificate",
      ],
      retry: 2,
      retryperiod: '24h',
      actions: banFor('48h'),
    },
  },
},

In other words, there are three different ways that I check whether a valid certificate has been presented. If any combination of them matches twice for the same IP address (again, why more than one?), then I ban for 48 hours.
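The "any combination" part can be sketched in a few lines of Python. This is a rough model under two assumptions: that the <ip> placeholder is swapped for an IP-matching regex before the pattern is used, and that every regex in a filter feeds the same per-IP counter. IP_PATTERN below is a simplified IPv4-only stand-in, and the sample lines are synthetic strings shaped only to satisfy the patterns, not real Dovecot output.

```python
import re
from collections import Counter

# Simplified IPv4-only stand-in for the <ip> pattern (an assumption for this sketch).
IP_PATTERN = r"(?P<ip>(?:\d{1,3}\.){3}\d{1,3})"


def compile_filter(regexes):
    """Expand the <ip> placeholder in each regex and compile the result."""
    return [re.compile(r.replace("<ip>", IP_PATTERN)) for r in regexes]


def match_ip(compiled, line):
    """Return the IP captured by the first regex that matches, else None."""
    for rx in compiled:
        m = rx.search(line)
        if m:
            return m.group("ip")
    return None


# The three regexes from the nocert filter above.
NOCERT = compile_filter([
    r"client didn't send a cert.*rip=<ip>,",
    r",<ip>\)\: Request timed out",
    r",<ip>,.* Client didn't present valid SSL certificate",
])

# Synthetic lines, invented for this sketch.
lines = [
    "client didn't send a cert (synthetic) rip=192.0.2.10,",
    "(synthetic,192.0.2.10): Request timed out",
    "an unrelated line that matches nothing",
]

hits = Counter()
for line in lines:
    ip = match_ip(NOCERT, line)
    if ip is not None:
        hits[ip] += 1

print(hits)
```

Two different regexes matched the same address here, so its count reaches 2, which is what retry: 2 is waiting for.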

One last stanza, which shows that you can search for multiple types of bad actors and deal with each differently if you want (different ban times, or even different actions: you don’t need to ban, you can alert, etc.):

nginx: {
  cmd: ['tail', '-fn0', '/var/log/nginx/catchall.access.log', '/var/log/nginx/domain.access.log', '/var/log/nginx/domain-ssl.access.log', '/var/log/nginx/example.access.log'],
  // filters run actions when they match regexes on a stream
  filters: {
    httpprobe: {
      regex: [
        @'<ip> .* "40[035]" [0-9]',
        @'<ip> .*POST .* "40[0-9]" [0-9]',
      ],
      retry: 2,
      retryperiod: '24h',
      actions: banFor('96h'),
    },
    http404: {
      regex: [
        @'<ip> .* "404" [0-9]',
      ],
      retry: 3,
      retryperiod: '1m',
      actions: banFor('96h'),
    },
    dot404: {
      regex: [
        @'<ip> .*GET /\..* "404" [0-9]',
      ],
      retry: 1,
      retryperiod: '1m',
      actions: banFor('96h'),
    },
  },
},

There are three separate filters above: httpprobe, http404 and dot404. Each has its own number of matches required before it triggers (retry), potentially within a different time-frame (retryperiod) as well.

Critically important point about log files

I run four servers. While they all run nginx, the log formats are not identical. The version of nginx in each is the same, but one was installed more than a decade before the others, and the default format has likely changed since then. I’m sure I could standardize the formats, but instead of mucking with perfectly working installations, I decided to make slight tweaks to each separate reaction.jsonnet file, to account for the differences.

My point is that if you run nginx as well, don’t just blindly copy my nginx stanza above. Make sure it matches your log file format. I have the same stanza on other machines, and it is slightly different on each (in this specific case, the HTTP response code is in quotes on one server, and not quoted on the others). The above stanza has the code quoted!
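A quick way to check a regex against your own format before trusting it is to try it on a line or two from your access log. The Python sketch below does that for the quoted and unquoted cases. The unquoted regex is my guess at what the variant would look like, IP is a simplified IPv4-only stand-in for the <ip> pattern, and both sample lines are synthetic: paste in lines from your own log instead.

```python
import re

# Simplified IPv4-only stand-in for the <ip> pattern (an assumption for this sketch).
IP = r"(?P<ip>(?:\d{1,3}\.){3}\d{1,3})"

# The http404 regex from the stanza above (status code in quotes) ...
QUOTED = re.compile(r'<ip> .* "404" [0-9]'.replace("<ip>", IP))
# ... and a hypothetical variant for logs where the status code is not quoted.
UNQUOTED = re.compile(r'<ip> .* 404 [0-9]'.replace("<ip>", IP))


def which_match(line):
    """Report which of the two variants matches a given log line."""
    return {
        "quoted": bool(QUOTED.search(line)),
        "unquoted": bool(UNQUOTED.search(line)),
    }


# Synthetic lines, not taken from any real server.
quoted_line = '192.0.2.10 - - "GET /missing HTTP/1.1" "404" 153'
unquoted_line = '192.0.2.10 - - "GET /missing HTTP/1.1" 404 153'

print(which_match(quoted_line))
print(which_match(unquoted_line))
```

Each variant matches only its own format, so a stanza copied from a server with the other format would sit there silently and never ban anything.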

Anyway, the author of reaction is a hero to me (and I assume many others).

