Postfix

Email Spam Coordination Across Different Services?

Surprise! It’s been 1,743 days since my last post. This isn’t a music post, so even if you are momentarily pleased to see a new one pop up in your feed, you’ll likely be disappointed by this “techie” one…

Disclaimer: this is 100% speculation on my part, I have done zero research to see if my theory is correct. I’m sharing it mostly to remember these thoughts as they occur, and perhaps in the distant hope that someone will either confirm or disprove the hypothesis with real proof.

Hypothesis

There is high-level coordination among disparate email providers to identify bulk senders, in a completely unintuitive but ostensibly extremely clever manner.

I put the above hypothesis first, so that you can stop reading right now if you have no interest in this topic, because as is my usual style, this is likely to get long, quickly…

Background

I have operated my own email server for over 20 years. My CTO friends think I’m insane (perhaps for other reasons as well, but definitely for running my own email server). For clarity, it’s Postfix, not an email server that I wrote, just one that I operate on a dedicated server (where this WordPress blog is hosted as well).

I’ve been through the wars with email, sometimes caused by my misconfiguration or misunderstanding, but sometimes entirely out of my control (e.g., when my dedicated server was transitioned to a new data center and the static IP changed, and it had previously been on many RBL blacklists!).

Over time, I’ve tamed the configuration into a very stable setup. That has included complying with SPF, DKIM, DomainKeys, DMARC, etc. Basically, anything that Google or Microsoft claim will help them validate email from me as a sender, and not mark it as spam or, worse, just bounce it back to me.

The Problem

While my setup works flawlessly most of the time, on occasion, Lois or I will get a bounce back from someone (typically Google/Gmail, but sometimes Microsoft/Outlook/Hotmail). Once that bounce occurs, we’re often shut out from sending email to that service for a full day (rarely, longer!).

As you can imagine, it’s wildly frustrating to not be able to send an individual mail (this is not spam or bulk mail, but rather one to one emails to friends).

This is the error message we get in the bounce:

               The mail system

XXX@gmail.com: host gmail-smtp-in.l.google.com[74.125.20.26] said:
550 Action not taken (in reply to end of DATA command)

Wow, very helpful, “Action not taken”. Nothing indicating what we did wrong and why Google rejected the email. It feels like it should be a transient error, but it not only persists, it typically stops us from sending any further emails to anyone on that service for the rest of the day.
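For what it’s worth, the first digit of an SMTP reply code is what tells the sending server whether a failure is meant to be temporary or permanent: 4xx replies are transient (retry later), 5xx replies are permanent. So the 550 above is formally a permanent rejection of that message, even though the text reads like something transient. A minimal Python sketch of that classification (the function name is made up for illustration):

```python
def classify_smtp_reply(reply_line):
    """Classify an SMTP reply by the first digit of its three-digit code."""
    code = reply_line.strip()[:3]
    if len(code) != 3 or not code.isdigit():
        raise ValueError(f"not an SMTP reply: {reply_line!r}")
    classes = {
        "2": "success",
        "3": "intermediate",
        "4": "transient failure",  # the sender is expected to retry later
        "5": "permanent failure",  # the sender should not retry unchanged
    }
    return classes.get(code[0], "unknown")


result = classify_smtp_reply("550 Action not taken")
print(result)
```

This only reads the code class; it says nothing about why the receiving server chose that code.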

This has been going on for at least a couple of years. Just not often enough for me to pull out my (one remaining) hair trying to track it down.

What is going on?

Until this week, I literally had no idea (perhaps I should have diagnosed it earlier…). While I can’t say with certainty that my new understanding covers all aspects of this error, I can say with certainty one use case that definitely causes the error, and it might explain every single occurrence that we’ve had in this regard.

We (Lois more than I, but I do it too) share identical emails individually with a variety of friends. Specifically, if a group that we love puts out a new music video, Lois will send a link to that video to a group of people, but each will get their own separate email with the same email body and subject, so that they can reply just to us and not be BCC’ed in a large group.

I had never made the connection before that this somehow triggered the bounces, even though they were short emails, sent to people we’ve emailed hundreds of times, who almost always respond to those emails, and whose contact lists we’re likely in. I couldn’t imagine that we were tripping any spam filter.

What do I think is going on?

I now believe (very firmly, with zero proof) that the body of the email (not including the subject, or the receivers) is being hashed. When another email with the identical hash (of the body) comes through (not sure if there is a time-limit or not), the service bounces it immediately with the above error message of “Action not taken”.

How did I come to this conclusion?

A week ago, I sent out invitations to a number of people to a house concert in March. Most of them went out fine, a very few bounced, so I didn’t have any suspicions over those bounces just yet.

This week, we discovered that one of the band members couldn’t make it, and we agreed with the head of the band that we would simply cancel the show. So, I sent an email to everyone that I had previously written to (except those that already said they couldn’t make it) telling them that the show was cancelled.

The cancellation emails were all bouncing, except for the first one!

I complained to Lois that the cancellations were bouncing, and that we would likely have to wait at least a day for the bounces to clear before I could send them again (still having no idea why they were bouncing).

Lois asked me why I thought the vast majority of the invitations didn’t bounce (to the same people, and those had links to the band in them, so if anything, they would have appeared to be more spammy).

It was a good question, to which I had no answer.

But, my brain often needs to sleep on problems before enlightening me, and indeed, the next morning I woke up with a theory to test.

While the bodies of the invitations were nearly identical, they each started with “Dear XXX,”. So, they couldn’t have hashed into the same exact body. On the other hand, each of the cancellations was identical in every way (copy/paste), without the leading “Dear XXX”. So, they indeed would hash into the same body.

To test the theory, I added back the “Dear XXX” to the cancellations, and sure enough, every single one went out without any bounces!
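The hypothesis and the “Dear XXX” test can be modeled with a toy filter. To be loud about it: this is only a sketch of the theory, not how any provider is known to work. The choice of SHA-256, the exact-match rule, the absence of any time window, the class name, and the “250 OK” reply text are all assumptions made for illustration.

```python
import hashlib


class BodyHashFilter:
    """Toy model of the hypothesis: reject any body whose hash was seen before."""

    def __init__(self):
        self.seen = set()

    def check(self, body):
        # Only the body is hashed; subject and recipient are ignored (assumption).
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return "550 Action not taken"
        self.seen.add(digest)
        return "250 OK"


filt = BodyHashFilter()
cancellation = "Sorry, the house concert has been cancelled.\n"

first = filt.check(cancellation)                      # first copy is accepted
second = filt.check(cancellation)                     # identical body is rejected
personal = filt.check("Dear Alice,\n" + cancellation)  # salutation changes the hash

print(first)
print(second)
print(personal)
```

In this model the first cancellation goes through, every identical copy after it bounces, and prefixing a per-recipient salutation makes each body hash differently, which matches the behavior described above.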

How is that Cross Provider Coordination?

Aha! It turns out that once Gmail (for example) bounced an email, so did Microsoft, Verizon/AOL/Yahoo, Apple (via @mac.com) and likely others (like Comcast).

To be clear, once Gmail bounced an email, if the next email went to hotmail.com, but was the very first such email to any Microsoft address, it was bounced immediately.

That implies (to me, proves) that it’s not just that they too hash incoming emails and bounce duplicates, but that they share the hash (somehow) among the competitive service providers, in order to more efficiently identify bulk senders quickly.

Please don’t ask me to conjecture on how they do that, I don’t have a clue.

Summary

We are both unbelievably relieved to understand what has been going on with these bounces (or at least to delude ourselves into thinking we understand it now).

Postscript

I doubt anyone who normally reads my blog will have an interest in this, but I needed to get it off my chest anyway.

I can only hope that an expert will weigh in and either confirm or provably deny my hypothesis.

Also, I have at least one more (completely picayune) email issue (rant?) that I will share if I get any feedback that this kind of stuff is interesting to anyone…

Converting from Procmail to Maildrop

I’ve been using procmail to filter mail on the server forever. I like it, so it’s important to note that even though I switched, I have nothing bad to say about procmail.

So, why did I switch? Procmail can be a little terse to read (obviously, I’m used to it by now). Over the years, I have built a large set of rules. There is a ton of cruft in there. If I wanted to clean it up, I had to rewrite it. Rewriting it in procmail was definitely a possibility.

But, over the years, I was also aware of maildrop as a filtering solution. It has a cleaner (more accurately, a more straightforward) syntax. The documentation is a little sparse (missing a few key examples IMHO). There are also thousands (if not millions!) of example lines of procmail available on the net, and it can be hard to find complex real-world examples of maildrop filters.

But, I knew that if I rebuilt my filters in maildrop, I’d be forced to rethink everything, since I couldn’t get lazy and just grab hunks of procmail from my current system and plop them into the new one. So, maildrop it was going to be!

One last time, just to make sure I don’t offend lovers of procmail (of which I am one!), everything that I did in maildrop could easily have been done in procmail. I just happened to choose maildrop for this rewrite, and for now, will stick with it. Perhaps if I ever revisit this project, I’ll iterate the next time back in procmail.

The goal of my filters is to toss obvious spam (send it to /dev/null). Likely spam gets sorted into one of three IMAP folders. The only reason I split it into multiple folders is so that I can test rules before turning them into full-blown deletes. Finally, mail that falls through those is delivered to my inbox.

Over the years, I added numerous rules to filter classes of spam (stocks, watches, viagra, insurance, etc.). Without a doubt, I introduced tons of redundancy. I didn’t scan all of the previous rules to see where I might be able to add one more line because it would be too tedious vs just adding a new rule.

I was reasonably satisfied with the result, but over time, became less aggressive about deleting mail automatically, preferring to stuff it in a spam folder for visual scanning during the day.

Since I create my own rules (I don’t run a system like SpamAssassin, which I did for a while), I can start to see patterns and simplifications over time, which was the impetus for the rewrite. In other words, there are more commonalities across classes of spam, and I don’t have to spend as much time categorizing things as I was bothering to do.

I’ve now made my first cut of the maildrop-based system. It’s been in production now for seven days, and I’m very happy with it so far. The one major change I made is to default to deleting things (in other words, much more aggressive than the previous system), but, I keep a copy of all mail in an archive IMAP folder that I will prune through a cron job, and never scan visually.

I review my delete logs once a day, so if I spot an email that looks like I shouldn’t have deleted it, or someone contacts me asking why I didn’t respond, I will be able to check the archive and have the full mail there (for some reasonable period of time).
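The pruning job for the archive folder could look something like the following Python sketch. It assumes one message per file (as in a Maildir-style folder) and uses file modification time as the message age; the function name, the 30-day retention, and the demo directory are hypothetical, not the actual setup.

```python
import os
import tempfile
import time


def prune_archive(folder, max_age_days, now=None):
    """Delete regular files in `folder` older than `max_age_days`.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    with os.scandir(folder) as it:
        entries = list(it)  # snapshot first, then delete
    removed = []
    for entry in entries:
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)


# Demo on a throwaway directory (not a real mail folder).
with tempfile.TemporaryDirectory() as folder:
    old_msg = os.path.join(folder, "old-message")
    new_msg = os.path.join(folder, "new-message")
    for path in (old_msg, new_msg):
        with open(path, "w") as fh:
            fh.write("Subject: test\n\nbody\n")
    # Backdate one file by 60 days.
    sixty_days_ago = time.time() - 60 * 86400
    os.utime(old_msg, (sixty_days_ago, sixty_days_ago))

    removed = prune_archive(folder, max_age_days=30)
    remaining = sorted(os.listdir(folder))

print(removed)
print(remaining)
```

Run from cron once a day, something like this keeps the archive bounded while still leaving a window in which a wrongly deleted mail can be recovered.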

Here’s the result of the rewrite:

The original procmail system had roughly 3800 lines in it (including comments and blank lines). The new maildrop system has under 550 lines, including comments and blanks. I delete more mail automatically, and in a week, haven’t deleted a single mail that I didn’t mean to. I am getting a few more emails sneaking into my inbox, but each day, I add a few more lines and the list gets shorter the next day.

Now that I am getting a bit more spam each day into my inbox, Thunderbird junk filters are getting more to train on, and they are getting better too, so even the junk that is getting in, is mostly getting filed in the Junk folder locally, automatically.

Here are two things that took me longer than they should have to figure out with maildrop (they are related, meaning the solution is identical in both cases, but it wasn’t obvious to me):

  • How to negate a test using !
  • How to use weighted scoring correctly (very simple in procmail)

Here’s a line in maildrop format:

if ($TESTVAR =~ /123/)
{
    # do something useful if true…
}

The above will “do something useful” if the variable TESTVAR contains the pattern 123. What if I want to “do something” if TESTVAR does not contain 123? Well, until I figured it out, I was making an empty block for “do something”, and adding an else for the thing I really wanted. Ugly.

My first attempt was to change the “=~” to “!~” (seemed obvious). Nope, syntax error. I then tried “if !($TESTVAR =~ /123/)”. Nope, syntax error. I then tried “if (!$TESTVAR =~ /123/)”. No syntax error, but it doesn’t do what I wanted.

I stumbled on the solution via trial and error:

if (!($TESTVAR =~ /123/))

Ugh. The ! can only be applied to an expression, which is normally (but not always?!?) enclosed in parens. But, the if itself requires an expression, so you need to put parens around the negated expression as well. At least I know now…

The second problem was weighted matches. I was having the same problem. Once I put parens around my expressions, it started working. That’s one of the few places where the procmail syntax feels a drop cleaner:

COUNT=(/123/:b,1)
COUNT=$COUNT+(/456/:b,1)
COUNT=$COUNT+(/789/:b,1)
echo $COUNT

So, the above sets the variable COUNT to the number of times that the string 123 exists in the body of the message. That is then added to the number of times that the string 456 exists in the body, finally adding the number of times that the string 789 exists in the body. The total is then echoed to the console. Without the parens, no workie.

I don’t like the fact that I have to maintain the running count myself. In procmail, you basically set a limit and the tests stop once the limit is reached (which feels way more efficient). There might be a way to accomplish that with maildrop too, but I haven’t found it as yet…
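To make the running count concrete, here is the same idea as a small Python sketch: one point per match, summed across patterns, with an optional limit that stops testing further patterns once it is reached. The early exit only illustrates the “set a limit and stop” behavior described above; it is not a claim about how procmail or maildrop implement scoring internally, and the function name is made up.

```python
import re


def score_body(body, patterns, limit=None):
    """Sum the matches of each pattern in `body`, one point per match.

    If `limit` is given, stop testing further patterns once the
    running total reaches it.
    """
    total = 0
    for pattern in patterns:
        total += len(re.findall(pattern, body))
        if limit is not None and total >= limit:
            break
    return total


body = "123 foo 456 bar 123 baz 789"
full_score = score_body(body, ["123", "456", "789"])
capped_score = score_body(body, ["123", "456", "789"], limit=2)

print(full_score)    # 4: two hits for 123, one each for 456 and 789
print(capped_score)  # 2: the limit is reached after the first pattern
```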

While I fully expect to add more rules, or lines to existing rules, I can’t imagine a scenario where my file will even double from here, so it will end up at less than 1000 lines. That will be easier to maintain for a number of reasons, most notably syntax readability.

New Machine

On April 23rd I announced the christening of my new server. At the time, I put the percentage of services that had been ported over at 95. It’s been at least 5 days since I’ve been at 100%, so the new machine is definitely “official”. Everything has been updated to point to the new machine, and all but one thing are running as expected.

The only problem I have is with one VoIP provider. I can’t get any audio to work between us, and the problem is definitely on my end, which is the main reason for not naming the provider. I can still connect reliably to them from my old server, from a different server that I control, and from a softphone as well, so something is broken on my new server in the config for them. That said, all other providers work, including identically configured ones, so it’s not a firewall problem, nor generically a broken Asterisk install. I’m not happy with this, because I can’t think of anything more to test. I’ve written twice to the Asterisk mailing lists, with no useful suggestions left to try. 🙁

I could probably write for hours on the experience of building the new machine. Very few people would maintain interest in that, I’m sure. I also don’t need it for a cathartic release, because I took very copious notes on the whole thing in a Google Notebook.

So, I’ll try to boil the essence down here, with the hope of not losing your interest too quickly. 🙂

The purpose of the change was to upgrade the OS from Red Hat 9 to CentOS 5.0. That worked well. I actually installed CentOS 5.0 Beta first, and then did an upgrade through yum, which worked fine!

My first real disappointment was attempting to build OpenPKG on the new box. The concept sounded really cool to me. The biggest reason for moving from RH9 to CentOS5 was that newer RPMs were harder and harder to find for RH9. OpenPKG held out the promise that one wouldn’t have to worry about this in the future, with the added benefit that you would never accidentally step on the operating system’s packages.

Unfortunately, I ended up wasting a ton of time on it, and it eventually failed to install itself, claiming that gcc couldn’t create executables on the system. Of course it could, as I built quite a number of packages from source… So, great concept, just not right for me at this time…

Had a minor glitch with SELinux (first time I’ve been on a system that was running it). Had to temporarily disable some of the checks to get a package installed and running, but was able to turn it back on afterwards, and haven’t had a problem since.

I have been a very happy user of Courier-IMAP for years, and felt guilty about even considering an alternative (just a loyalty thing). But, I’d read a number of nice things about Dovecot, and it had just gone official 1.0 a few days before, so I decided to try it. I’m really happy with it. It worked correctly the first time, and configuration was as straightforward as I was led to believe. On the other hand, it wasn’t a quick config, because there are so many things that you can (and sometimes should) set. The single config file (which I like!) is huge, but it’s so well documented that the choices are relatively simple. You just have to read all those darn docs… 😉

Also installed the latest Postfix 2.4.0. I’ve been really happy with Postfix for years, and had little intention of switching that.

One minor nit about Linux in general. It’s a little annoying that dependencies can get out of whack quite easily. Some system thing depends on openssl-0.9.7 (for example), and you know that 0.9.8e fixes some bugs, and perhaps some new software you’re installing wants that. So, now it needs to go in its own directory (’cause you can’t mess with the system one), and then every package has to be told where to find the new one, etc. It all works, but it’s still a PITA.

Installed the latest WordPress (which of course meant MySQL and PHP, etc.). This time, the email config problem that I had on the old machine just disappeared (hooray!). I didn’t config it any differently, so who knows what was wrong before…

Installed the latest Zope (2.10.3, not Zope 3), and had remarkably few problems slurping up my old Data.fs file from a Zope 2.6.x installation. Very cool.

Switched from one webmail client to another, even though I had been happy with the former for years. The latter does more, though I’m sure I won’t use the additional functionality anyway. It works, so that’s all I care about. I rarely use webmail, but when it’s necessary, it’s also ultra convenient (and, as stated, necessary). 😉

One of the bigger odysseys was the installation of a Jabber server. This should probably be its own post, but if it was, I would never condense it, so I’ll do my best not to go on too much here. On the old machine, I was running jabberd-1.4.3 for years. Jabberd2 was just out at the time that I first installed 1.4.3 (they are not the same project). I was able to get jabberd2 to work at the time, but I could not get the AIM and ICQ transports to work, so I reverted to 1.4.3.

The jabberd14 project is still alive and kicking, and I could have saved a lot of headaches if I had stuck with it. But, for a while, I had wanted to try ejabberd. It has been the official server of jabber.org since February 2007, which seemed impressive to me. 😉

Ejabberd is written in Erlang, and is supposed to scale like crazy (not that I have the slightest need for scale). The concept intrigued me. I’ll spare you all of the insane problems I had getting it to work right. Suffice it to say that it was not my fault, which is rare in these situations. 😉

When I finally got it to work stably, I installed the Python-based AIM and ICQ transports (PyICQ-t and PyAIM-t). The AIM transport worked correctly, and the ICQ one was flaky (solution later on).

Then Rob Page asked me to take a look at Openfire (previously called Wildfire). It sounded cool, and since I was having a problem with the ICQ transport, I figured I’d give it a shot. Man, it installed so easily from RPM, didn’t touch a single file on the system, could be uninstalled trivially, etc. In summary, I liked it instantly. I wasn’t crazy about running a JVM on the system full time, but the load would be negligible, so I decided to switch to it. Of course, while it worked well, and the administration was wonderful, the ICQ plugin was experimental (the AIM one is production), and it behaved like an experimental plugin, which put me right back where the ejabberd one had. There were a few other small annoyances in Openfire as well.

That made me decide to go back and beat my head on the ejabberd server and transports. Long story short, after investigating my setup on the old machine (prompted by Z_God in the Python Transports conference room), I noticed that I didn’t understand how transports speak to the main server. I had them both speaking on the same port (which the sample config file showed!), but on the working server, each transport spoke to the server on its own port! I switched ICQ and AIM to speak to ejabberd on separate ports, and voila, it has been rock solid ever since. I have retired Openfire, and am a very happy ejabberd and python-transports customer! 🙂
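The mistake described above (two transports pointed at the same server port) is easy to check for mechanically. A small Python sketch, with made-up transport names and port numbers standing in for whatever the real config files contain:

```python
from collections import defaultdict


def find_shared_ports(transports):
    """Return {port: [transport names]} for every port used by more than one transport."""
    by_port = defaultdict(list)
    for name, port in transports.items():
        by_port[port].append(name)
    return {port: sorted(names) for port, names in by_port.items() if len(names) > 1}


# Hypothetical port numbers, for illustration only.
broken = {"icq": 5347, "aim": 5347}
fixed = {"icq": 5348, "aim": 5349}

broken_report = find_shared_ports(broken)
fixed_report = find_shared_ports(fixed)

print(broken_report)
print(fixed_report)
```

An empty result means every transport has its own port, which is the arrangement that turned out to be stable here.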

That’s pretty much it (at least at a high level). I’m happy with the machine. As usual, more twists and turns than one hopes for, but also more learning experiences than I expected, and interesting ones at that, mostly ending in success. Now if I can only figure out that one SIP provider audio problem, I could get back to some serious poker playing. 😉