Is there a way to get spamassassin to score the top lines of a message body more heavily?
A lot of spam is getting through the filter on the mail server I run, using the relatively simple trick of starting with a few lines of (incredibly obvious) weight-loss or other scam text at the top, followed by a larger body of text from programming documentation or, most evil of all, text scraped from Stack Exchange. At best, SpamAssassin regards this as BAYES_50, and the rest of each message is constructed carefully enough that it doesn't hit other triggers. (For example, the headers are minimal and correct.) Often, the included excerpts align closely enough with my legitimate interests that the message overall is scored as BAYES_00, because the very spammy tokens are simply overwhelmed by juicy nuggets of sysadmin problem-solving.
The top part is so obviously spammy (and in fact tends to be very similar to messages previously received and trained as spam) that I'm kind of amazed that it's getting through, but clearly it is. It seems like a separate pass that scored the top 25 (or so) lines of the message and weighted them heavily would solve the problem. Is there a way to do this?
Several people have suggested writing custom regular expressions. I do not want to get into this, as this is a constant losing battle. It's what people did before Bayesian spam sorting came into widespread use, and it was generally terrible. No human can keep up. It's not much more effective than just hitting the delete key for each spam message, and a lot more work on my part.
Bayesian spam filtering works. It even works on this spam, if I split out the "above the fold" portion and just analyze that part, with the decoy / chaff removed. The question is: how can I get Spamassassin to do that?
Solution 1:
I am a (somewhat) avid anti-spam fighter myself, and because of many problems like the one you describe, I ended up doing the dirty work myself years ago.
Now, this is not an answer to your particular question, but to your particular problem. So please don't downvote because of this.
I solved this by modifying the sa_filter-post.pl script used by the XMail server. The script calls spamc on the email file and does some minor housekeeping; I changed it to process not the entire file but specific parts of it, based on rules that I hardcoded. Yes, they are regexes, but so far they work for me (I do have a bunch of other scripts running before and after this one, so that may play a role).
For example, I have a regex that fishes out phone numbers. The spammer leaves the number in full, so when that rule matches, the script goes straight to processing only the middle 400 characters of the file (I arrived at 400 by trial and error, starting from 200). Note that it's pretty hard to pick out the middle of what you see, as opposed to the middle of what is actually in the file.
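To illustrate the "what you see versus what is in the file" point: the raw file may be base64-encoded or multipart, so the middle of the file is not the middle of the visible text. This is only a rough sketch in Python (my real script is Perl), and the function names `visible_text` and `middle_slice` are made up for the example. It assumes the message parses cleanly with Python's standard `email` package.

```python
from email import message_from_string, policy


def visible_text(raw):
    """Return the decoded body text, preferring text/plain over text/html.

    For an HTML-only message this still returns markup, not rendered text.
    """
    msg = message_from_string(raw, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content() if part is not None else ""


def middle_slice(text, width=400):
    """Return the middle `width` characters of text (all of it if shorter)."""
    if len(text) <= width:
        return text
    start = (len(text) - width) // 2
    return text[start:start + width]
```

Something like `middle_slice(visible_text(raw))` would then give the chunk to hand to spamc, instead of slicing the raw bytes of the file.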
Another kind of spam always has the same structure: an HTML table of "products" with a dummy header and an unusable footer. For that one I strip out the header and footer, strip the comments column out of the "products" table, and then pass the rest on to spamc.
And so on, you get the picture.
But not all rules are perfect, so I do a little magic here: I assign each rule a private score, which I hardcode and tune up or down as needed based on how the rule behaves (and sometimes I end up deleting rules altogether). I then adjust the SA score by the private score. I did this because, for some reason, SA gave scores of only 4.something to messages that were clearly spam and that matched rules I felt strongly should catch them. So I give them just a little boost to push them over 5.0. Coupled with some post-processing scripts that take other variables into consideration (source of the email, target of the email, structure of the headers, etc.), this more or less kills the spam.
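The private-score idea boils down to something like the following. Again a sketch in Python rather than my actual Perl; the rule names and boost values are invented for illustration, and 5.0 is simply the threshold mentioned above.

```python
SPAM_THRESHOLD = 5.0  # the cutoff referred to above

# Hypothetical private rules and their hand-tuned boosts.
PRIVATE_SCORES = {
    "PHONE_NUMBER_IN_BODY": 0.8,
    "PRODUCT_TABLE_LAYOUT": 1.2,
}


def adjusted_score(sa_score, matched_rules, private_scores=PRIVATE_SCORES):
    """Add the private score of every matched rule to the SpamAssassin score."""
    return sa_score + sum(private_scores.get(rule, 0.0) for rule in matched_rules)


def is_spam(sa_score, matched_rules, private_scores=PRIVATE_SCORES):
    """Treat the message as spam once the adjusted score reaches the threshold."""
    return adjusted_score(sa_score, matched_rules, private_scores) >= SPAM_THRESHOLD
```

So a message that SA scored at 4.3 and that also matched the phone number rule would end up just over the line, while the same message with no private rule matched would not.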
Now I realize this isn't what you were hoping for, but in my case it gives me a whole lot of power over what gets scanned. The price is that I need to set things up manually and then, every now and then, do little touch-ups on the values and regexes.
But in your case things are a lot easier, as all you have to do is write a simple bash script that your MX calls instead of spamc. That script uses the head command to take only the first however many bytes you want and passes that temporary file to spamc.
The contents of the script will depend a bit on your mail server, but that shouldn't be hard to figure out.
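A minimal sketch of such a wrapper, written in Python rather than bash so that it can cut by lines and leave the headers alone (a plain head on the whole file would count the header bytes too). The function names and the 25-line default are assumptions taken from the question, not something tested on a real server. It also works on the raw body, so a base64 or multipart message would need decoding first. The only spamc option used is `-c`, which prints score/threshold and sets the exit code to 1 for spam.

```python
import subprocess


def truncate_body(raw, max_lines=25):
    """Keep all headers plus the first max_lines lines of the raw body."""
    lines = raw.splitlines(keepends=True)
    for i, line in enumerate(lines):
        # The first empty line separates the headers from the body.
        if line.strip(b"\r\n") == b"":
            return b"".join(lines[: i + 1 + max_lines])
    return raw  # no blank line found: headers only, nothing to cut


def check_with_spamc(message):
    """Pipe the message to `spamc -c`; return (exit code, printed verdict)."""
    proc = subprocess.run(
        ["spamc", "-c"], input=message, stdout=subprocess.PIPE, check=False
    )
    return proc.returncode, proc.stdout.decode(errors="replace").strip()
```

The wrapper would read the message as bytes from wherever your mail server hands it over, call `check_with_spamc(truncate_body(message))`, and act on the exit code. You probably want to keep the normal full-message scan as well and combine the two results, since the truncated copy alone loses whatever the rest of the body would have triggered.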
(Note that I only described my setup at such length so that you can see what this approach makes possible.)
PS: I personally have never received this kind of spam (with programming-related goodies in it), so I wonder whether you have pissed someone off and are now being targeted. That would explain the specially crafted emails. The reason I think of this possibility is that years ago, when I was very active on various IT forums and groups, I did piss some people off, and every now and then I used to get various types of attacks on my server, including email spamming. But back then the idiots weren't this smart :)