How do you test a new email filtering system?

Solution 1:

Testing an email filtering product seems to be a lot harder than it sounds on the surface. As I thought about it, I outlined my ideas and broke them down into specific problem domains.

Delivery

The first part of testing is associated with mocking up delivery of email to the filtering system. Potentially, a filtering system can take any of the following (and probably more things I'm forgetting) into account (see the sketch after this list):

  • Specific details of the SMTP protocol (did the sender do a HELO or EHLO, and with what name or IP, etc.)
  • Attributes of the sending domain's DNS (does it resolve, SPF records, domain-keys records)
  • Reachability of the sender's MX (for reverse-path verification)
  • Source IP address of the server delivering the message
  • Content of the message
  • An "opinion" that third-party databases or "signatures" give about any of the above (DNSBLs, proprietary "reputation" databases, etc)
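To give a feel for the sender-side checks in that list, here's a minimal Python sketch (assuming the dnspython library is installed) of SPF, MX, and DNSBL lookups. The domain names, IP, and the DNSBL zone are just placeholders, not a recommendation of any particular service.

    # Minimal sketch (assumes dnspython) of the kinds of DNS-based lookups a
    # filter might perform against a sender: SPF policy, MX resolution, and a
    # DNSBL query. Domains, IP, and the DNSBL zone are illustrative only.
    import dns.resolver

    def spf_records(domain):
        """Return any TXT records that look like SPF policy for the domain."""
        try:
            answers = dns.resolver.resolve(domain, "TXT")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return []
        return [r.to_text() for r in answers if "v=spf1" in r.to_text()]

    def mx_hosts(domain):
        """Return the MX hosts for the domain (used for reverse-path checks)."""
        try:
            answers = dns.resolver.resolve(domain, "MX")
        except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
            return []
        return [str(r.exchange).rstrip(".") for r in answers]

    def dnsbl_listed(ip, zone="zen.spamhaus.org"):
        """Return True if the sending IP is listed in the given DNSBL zone."""
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            dns.resolver.resolve(query, "A")
            return True
        except dns.resolver.NXDOMAIN:
            return False

    if __name__ == "__main__":
        print(spf_records("example.com"))
        print(mx_hosts("example.com"))
        print(dnsbl_listed("192.0.2.1"))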

Mocking up delivery of messages to a filtering tool becomes less representative of reality if any of these factors aren't the same as what would occur during "real" delivery. Sending captured incoming email to a filtering rig in a manner that doesn't simulate the source IP address of the sending server, for example, impedes that filtering product's ability to act on that factor (running it by a DNSBL, etc.). Likewise, sending delayed messages to a filtering system (say you have a "canned" corpus of messages to test with) will give a false impression of behavior, because the attributes of the sender's DNS and of any third-party databases or signatures may have changed since the time the message was originally sent (the sender altered their SPF records to prevent false positives, a third-party service "blacklisted" the sender, etc.).
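As an illustration of replaying captured mail into a filter, here's a rough sketch using Python's standard smtplib. The filter hostname, port, EHLO name, and the captured .eml file are all assumptions, and note that this does not simulate the original sending server's source IP, which is exactly the limitation described above.

    # Rough sketch: replay a captured message to a filter under test over SMTP.
    # Hostnames and file names are placeholders; the source IP is NOT simulated.
    import smtplib
    from email import message_from_binary_file
    from email.utils import parseaddr

    FILTER_HOST = "filter-test.example.com"   # hypothetical filter under test
    FILTER_PORT = 25

    with open("captured-message.eml", "rb") as fp:   # hypothetical captured message
        msg = message_from_binary_file(fp)

    envelope_from = parseaddr(msg.get("Return-Path", msg.get("From", "")))[1]
    envelope_to = [parseaddr(msg.get("To", ""))[1]]

    with smtplib.SMTP(FILTER_HOST, FILTER_PORT) as smtp:
        # Re-use the original sending host's EHLO name if you captured it; here
        # we make one up, which is another place the simulation diverges from reality.
        smtp.ehlo("original-sender.example.net")
        smtp.sendmail(envelope_from, envelope_to, msg.as_bytes())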

I'd argue that it's impossible to completely mock up reality as far as delivery is concerned. Getting close is going to be fairly difficult if you intend to simulate the sending server's IP address (I'm not aware of any "off the shelf" solution that does that, and having gotten fairly good results from DNSBLs, I'd be concerned about not simulating the sending server's IP address). Ultimately, you'll just have to get as close as you can afford.

Storage

You've got to store the filtered email somewhere if you intend to analyze the results. Without storage of some kind there's no good way to actually view the results. Sure, the filtering software generates statistics, but it would certainly improve my comfort level if I could see the filtered messages.

Some filtering products have a storage capability built in (MailMarshal, for example). Other products expect to have an email system to deliver into and don't have any storage capability of their own. To test those products, and so as not to disrupt your production email system, you'll have to stand up some kind of secondary email infrastructure to store the test filtering results.
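One cheap way to get that secondary "storage" endpoint is a throwaway SMTP sink. Here's a minimal sketch assuming the aiosmtpd Python package: it accepts whatever the filter delivers and files it into a Maildir for later inspection. The listening port and Maildir path are arbitrary choices.

    # Minimal SMTP sink (assumes aiosmtpd): accept delivered messages and file
    # them into a Maildir so the test results can be inspected later.
    import mailbox
    from aiosmtpd.controller import Controller

    class MaildirSink:
        def __init__(self, path):
            self.maildir = mailbox.Maildir(path, create=True)

        async def handle_DATA(self, server, session, envelope):
            # Store the raw message exactly as delivered by the filter under test.
            self.maildir.add(envelope.content)
            return "250 Message accepted for delivery"

    if __name__ == "__main__":
        controller = Controller(MaildirSink("./filter-test-maildir"),
                                hostname="0.0.0.0", port=2525)
        controller.start()
        try:
            input("SMTP sink listening on port 2525; press Enter to stop.\n")
        finally:
            controller.stop()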

If licensing expense is a concern, free and open-source tools can keep costs down, though they may present a learning curve for the testing personnel.

More convoluted "groupware"-type email systems may present a challenge in mocking up because of their dependencies on other services. Exchange Server, for example, will require you to mock up an Active Directory infrastructure to host the mailboxes. Other "groupware"-type email systems (Notes, GroupWise, etc.) will have their own associated degrees of difficulty in creating parallel infrastructures.

Analysis

On top of mocking up delivery and storing the results, you also need human analysis of the accuracy of filtering (flagging as spam, deleting, filing to a "Junk E-mail" folder, etc.). It probably goes without saying, but a human needs to verify the results of filtering, since the whole point of this exercise is to test the computer's ability to filter correctly. (I know, that went without saying... but I can just hear somebody saying "But wait! We'll write a script to test the accuracy...")

The filtering analysis is problematic foremost, to my mind, as a matter of scale. For any large group of end users, filtering analysis is going to be beyond the abilities of an individual or small group of testers (or, more likely, they'll just "spot check" and hope for the best). Moreover, the tester may have a different idea of what should be considered "spam" than the real receiving user (who really does want to receive those stupid promotional emails from their car rental company, or the "vocabulary word of the day"). Identifying obviously misfiled messages is probably pretty easy (pornographic emails in the "Inbox", messages from known customers in the "Junk E-mail" folder, etc.), but there's certainly the potential for subtlety.
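Once a human has labeled a sample of the filtered results, tallying the filter's accuracy is simple arithmetic. A minimal sketch follows; the CSV layout and column names (human_verdict and filter_verdict with values "spam"/"ham") are an assumption about how the testers might record their spot checks.

    # Tally false positives/negatives from a human-labeled spot-check file.
    # The file name and column names are hypothetical.
    import csv
    from collections import Counter

    counts = Counter()
    with open("spot-check-results.csv", newline="") as fp:
        for row in csv.DictReader(fp):
            counts[(row["human_verdict"], row["filter_verdict"])] += 1

    false_positives = counts[("ham", "spam")]    # legitimate mail flagged as spam
    false_negatives = counts[("spam", "ham")]    # spam that reached the Inbox
    total = sum(counts.values())

    print(f"Sampled messages: {total}")
    print(f"False positives:  {false_positives} ({false_positives / total:.1%})")
    print(f"False negatives:  {false_negatives} ({false_negatives / total:.1%})")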

I don't think there's a 100% substitute for reality-- you're just going to have to get close and hope that your testing setup reflects reality closely enough to give you an accurate assessment of the filter's performance.

Solution 2:

Our company fits your criteria very well. For about a month before testing the new spam filter, I gathered up all the spam our system was receiving by asking the users to forward it to a special account instead of just deleting it. Spam detected by our third-party filter service was also forwarded to the spam account. That spam was then resent through the new filter to a temporary dummy account to see how it performed. As it was all sent in one big batch, this also tested the load-handling capacity of the filter.
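For what it's worth, the batch resend can be scripted in a few lines. This is only a rough sketch, assuming the captured spam has been collected into a Maildir and the new filter accepts SMTP on the named host; the hostnames, paths, and the dummy recipient are all placeholders.

    # Rough sketch of batch-resending a captured spam corpus through the new filter.
    import mailbox
    import smtplib

    FILTER_HOST = "new-filter.example.com"        # hypothetical filter under test
    DUMMY_RECIPIENT = "spamtest@example.com"      # hypothetical dummy account

    spam_corpus = mailbox.Maildir("./captured-spam")
    with smtplib.SMTP(FILTER_HOST, 25) as smtp:
        for key, msg in spam_corpus.iteritems():
            smtp.sendmail("spam-replay@example.com", [DUMMY_RECIPIENT], msg.as_bytes())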

Some adjustments were made as a result of the test, and the same spam was also fed into the learning system. Users still send me any spam that gets through, but over time the quantity has lessened to the point where users complain if they get just one or two spams a week.

Solution 3:

You can test by sending a mail containing the GTUBE (Generic Test for Unsolicited Bulk Email) string.

Simply create a mail containing this:

XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X

This works fine with SpamAssassin, for example. You may find similar test strings for other anti-spam products.
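If you want to script the test, here is a small sketch that sends the GTUBE string through the filter using Python's standard smtplib. The filter hostname and the addresses are placeholders; the body is the standard test string quoted above.

    # Send a GTUBE test message through the filter under test.
    import smtplib
    from email.message import EmailMessage

    GTUBE = "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X"

    msg = EmailMessage()
    msg["Subject"] = "GTUBE test"
    msg["From"] = "tester@example.com"
    msg["To"] = "inbox-under-test@example.com"
    msg.set_content(GTUBE)

    with smtplib.SMTP("filter-under-test.example.com", 25) as smtp:
        smtp.send_message(msg)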