"In het verleden behaalde resultaten bieden geen garanties voor de toekomst"
Powered by Blosxom &Perl onion
(With plugins: config, extensionless, hide, tagging, Markdown, macros, breadcrumbs, calendar, directorybrowse, entries_index, feedback, flavourdir, include, interpolate_fancy, listplugins, menu, moreentries, pagetype, preview, seemore, storynum, storytitle, writeback_recent)
Valid XHTML 1.0 Strict & CSS
Retraining the Spamassassin Bayes filter with recent messages

On my mailserver, I'm using Spamassassin with a Bayes filter to detect spam. Such a filter needs to be trained with samples of spam and ham (non-spam) messages to let it learn what spam and ham looks like, but it also needs to be retrained when the spam or ham changes over time. I have some automatic training set up, but since a while I've seen the bayes filter being completely wrong (showing a confident ham score for something that is very clearly spam), so I decided to retrain the filter from scratch, using the spam and ham messages I collected over the last time (I don't really throw away any e-mail).

Since training with all my e-mail is not productive (more than 5,000 messages aren't really helpful AFAIU, and training with old messages is not representative for current messages), I decided to just take all of my e-mail and take the last 2,000 spam and ham messages and train with that. My spam is neatly collected in 2 mailboxes (Spam for obvious spam and ProbablySpam for messages that need an occasional review to find false positives), but my ham is sorted out in dozens of different mailboxes. Hence, I needed some find magic to get a list of the most recent spam and ham messages. So, I built these commands:

# find Spam ProbablySpam -type f \( -path '*/cur/*' -o -path '*/new/*' \) -printf "%T@ %p\n"
  | sort -n | cut -d' ' -f 2 | tail -n 2000 > spam
# find . -type d \( -path ./Spam -o -path ./ProbablySpam -o -path ./Bulk -o -path ./Sent \) -prune -o \
         -type f \( -path '*/cur/*' -o -path '*/new/*' \) -printf "%T@ %p\n" \
  | sort -n | cut -d' ' -f 2 | tail -n 2000 > ham
# sa-learn --progress --spam -f spam
# sa-learn --progress --ham -f ham

After retraining with recent spam, the results were a lot better, so I'm not longer spending time every day deleting a couple dozens spam e-mails :-D

Copyright by Matthijs Kooijman