The niflheim.net domain used to belong to somebody else. You can tell, because of the volume of spam mail that is sent to addresses that no longer exist. Spam is also sent to addresses that never existed in the first place, or to no address at all.
All of this is considered “unrouted mail” by my host. I dump it all into a special account that I set up for this purpose, with a 5 meg disk quota, and never check. (I cleaned it out once, a month or so after I set it up, but have ignored it ever since.) The idea was that it would eventually fill up and start bouncing.
It turns out that the mail software ignores the disk quota, so the mailbox just continues to grow. One day last week, I decided to check the account. It had been accumulating mail ever since that one cleanup.
10953 messages.
50 megs of my host's disk space.
All of it, every last piece, spam. I do not like Green Eggs and Spam. I do not like them, Sam-I-am.
Then it hit me. Recent versions of Mozilla contain Bayesian junk mail filters. (These are mail filters that can be trained on the specific mail that you get, so that they learn what most reliably distinguishes good mail from bad.) 10953 spam messages is a lot of training.
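For the curious, here is roughly what "training" a Bayesian filter means. This is a minimal naive Bayes sketch, not Mozilla's actual implementation: the class name `NaiveBayesFilter`, the tokenizer, the add-one smoothing, and the four toy messages are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayesFilter:
    """Toy junk filter (hypothetical; not how Mozilla does it internally)."""

    def __init__(self):
        # Per label: how many training messages contained each token.
        self.counts = {"junk": Counter(), "good": Counter()}
        self.messages = {"junk": 0, "good": 0}

    def train(self, text, label):
        """Record one message as 'junk' or 'good'."""
        self.messages[label] += 1
        self.counts[label].update(set(tokenize(text)))

    def junk_probability(self, text):
        """Estimate P(junk | tokens) using add-one smoothing."""
        n_junk = self.messages["junk"]
        n_good = self.messages["good"]
        # Start from the smoothed prior odds, then add evidence per token.
        log_odds = math.log((n_junk + 1) / (n_good + 1))
        for token in set(tokenize(text)):
            p_junk = (self.counts["junk"][token] + 1) / (n_junk + 2)
            p_good = (self.counts["good"][token] + 1) / (n_good + 2)
            log_odds += math.log(p_junk / p_good)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


# Tiny made-up training set, standing in for the real mail.
spam_filter = NaiveBayesFilter()
spam_filter.train("Buy cheap pills now", "junk")
spam_filter.train("Cheap loans, act now", "junk")
spam_filter.train("Meeting notes for Tuesday", "good")
spam_filter.train("Lunch on Tuesday?", "good")
```

Calling `spam_filter.junk_probability("cheap pills")` gives a value well above 0.5, while `spam_filter.junk_probability("notes for Tuesday meeting")` comes out well below it. The more mail of each kind the filter has seen, the better those per-word estimates get, which is why a big pile of known spam is useful.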
Now, I know that the Bayesian filter in Mozilla isn't the best in the world yet. And I don't actually get a lot of spam at my real email addresses anymore. (I've been canny about posting my addresses, and I can change them on a whim.) But on a lark, I decided to download and clear out all of that mail, and punch it through the filters. Several hours later (even on a cable modem … mail servers are slow) my junk mail training file was 11000 messages richer. (I also marked all of my mailing list mail and ordinary email as Not Junk.) The training file is 2.6 megs. Does it help?
Well, before I trained it, it let through all 11000 messages. Afterwards, it lets through 1 or 2 pieces out of each batch of 50-70 spam mails. That is a false negative rate of roughly 1.5% to 4%. Since I went through and classified all of my known good mail, it hasn't yet incorrectly marked a single one of those as junk. That is a 0% false positive rate (so far).
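As a rough check on the false negative figure, the arithmetic can be sketched like this. It uses only the batch numbers from this post (1 or 2 misses per batch of 50-70); the helper name `rate_bounds` is made up for illustration.

```python
def rate_bounds(missed_range, batch_range):
    """Return the (lowest, highest) miss rate implied by two ranges."""
    fewest_missed, most_missed = missed_range
    smallest_batch, largest_batch = batch_range
    # Best case: fewest misses in the largest batch.
    # Worst case: most misses in the smallest batch.
    return fewest_missed / largest_batch, most_missed / smallest_batch


low, high = rate_bounds((1, 2), (50, 70))
print(f"false negative rate between {low:.1%} and {high:.1%}")
```

So 1 miss in a batch of 50 is exactly 2%, the best case (1 in 70) is about 1.4%, and the worst case (2 in 50) is 4%.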
Since Mozilla's Bayesian filter doesn't yet work as well as most, these numbers are likely to improve as the software improves.
Posted by Dyne on June 23, 2003 04:09 PM