"In het verleden behaalde resultaten bieden geen garanties voor de toekomst"
About this blog

These are the ramblings of Matthijs Kooijman, concerning the software he hacks on, hobbies he has and occasionally his personal life.

Most content on this site is licensed under the WTFPL, version 2 (details).

Questions? Praise? Blame? Feel free to contact me.

My old blog (pre-2006) is also still available.

Sun Mon Tue Wed Thu Fri Sat
Powered by Blosxom &Perl onion
(With plugins: config, extensionless, hide, tagging, Markdown, macros, breadcrumbs, calendar, directorybrowse, entries_index, feedback, flavourdir, include, interpolate_fancy, listplugins, menu, pagetype, preview, seemore, storynum, storytitle, writeback_recent, moreentries)
Valid XHTML 1.0 Strict & CSS
Retraining the Spamassassin Bayes filter with recent messages

On my mailserver, I'm using Spamassassin with a Bayes filter to detect spam. Such a filter needs to be trained with samples of spam and ham (non-spam) messages to let it learn what spam and ham looks like, but it also needs to be retrained when the spam or ham changes over time. I have some automatic training set up, but since a while I've seen the bayes filter being completely wrong (showing a confident ham score for something that is very clearly spam), so I decided to retrain the filter from scratch, using the spam and ham messages I collected over the last time (I don't really throw away any e-mail).

Since training with all my e-mail is not productive (more than 5,000 messages aren't really helpful AFAIU, and training with old messages is not representative for current messages), I decided to just take all of my e-mail and take the last 2,000 spam and ham messages and train with that. My spam is neatly collected in 2 mailboxes (Spam for obvious spam and ProbablySpam for messages that need an occasional review to find false positives), but my ham is sorted out in dozens of different mailboxes. Hence, I needed some find magic to get a list of the most recent spam and ham messages. So, I built these commands:

# find Spam ProbablySpam -type f \( -path '*/cur/*' -o -path '*/new/*' \) -printf "%T@ %p\n"
  | sort -n | cut -d' ' -f 2 | tail -n 2000 > spam
# find . -type d \( -path ./Spam -o -path ./ProbablySpam -o -path ./Bulk -o -path ./Sent \) -prune -o \
         -type f \( -path '*/cur/*' -o -path '*/new/*' \) -printf "%T@ %p\n" \
  | sort -n | cut -d' ' -f 2 | tail -n 2000 > ham
# sa-learn --progress --spam -f spam
# sa-learn --progress --ham -f ham

After retraining with recent spam, the results were a lot better, so I'm not longer spending time every day deleting a couple dozens spam e-mails :-D

Related stories
0 comments -:- permalink -:- 21:03
Copyright by Matthijs Kooijman