Friday, November 2, 2007

Filtering Image Spam With FuzzyOCR And SpamAssassin

The struggle against spam, episode three.

This article describes how to scan emails for image spam with FuzzyOCR. FuzzyOCR is a SpamAssassin plugin aimed at unsolicited bulk mail that contains images. Using several different methods, it analyzes the content and properties of images to distinguish between normal and spam mails. The installation is shown on Debian (Etch).

I assume that SpamAssassin (and a mail server) is already installed and working :) and that the symlink /etc/mail/spamassassin exists (if not, create it with ln -s /etc/spamassassin /etc/mail/spamassassin ).
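If you are not sure whether the symlink is already there, a small shell helper can create it only when it is missing. This is just a sketch: ensure_link is a name I made up for illustration, not something that ships with SpamAssassin.

```shell
# ensure_link TARGET LINK: create the symlink LINK -> TARGET unless
# something already exists at LINK.
# (ensure_link is a hypothetical helper, not part of SpamAssassin.)
ensure_link() {
    target=$1
    link=$2
    if [ -e "$link" ] || [ -L "$link" ]; then
        echo "exists: $link"
    else
        mkdir -p "$(dirname "$link")"
        ln -s "$target" "$link"
        echo "created: $link -> $target"
    fi
}

# On the system described in this article, run as root:
#   ensure_link /etc/spamassassin /etc/mail/spamassassin
```

The helper never overwrites an existing file or link, so it is safe to run more than once.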
To begin, install the necessary dependencies:
aptitude install netpbm gifsicle libungif-bin gocr ocrad libstring-approx-perl libmldbm-sync-perl imagemagick tesseract-ocr libdbd-mysql-perl libdbi-perl libtie-cache-perl
Next, download, unpack and install the latest FuzzyOCR:
cd /usr/src/
tar -zxvf fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1/

cp -r FuzzyOcr* /etc/spamassassin/
(Make sure the FuzzyOcr/ directory itself is copied as well!)

The source directory /usr/src/FuzzyOcr-3.5.1/ contains a samples/ directory with sample spam emails, which we will need later for testing.
The installation is finished, so now we can configure it. All config files are in /etc/spamassassin/
In the file /etc/mail/spamassassin/
uncomment the following line:
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words

The file /etc/mail/spamassassin/FuzzyOcr.words
is the predefined word list that ships with FuzzyOCR. You can customize it or add to it as needed.
Replace these two lines
focr_bin_helper pnmnorm, pnminvert, pamthreshold, ppmtopgm, pamtopnm

focr_bin_helper tesseract
with the following line:
focr_bin_helper pnmnorm, pnminvert, convert, ppmtopgm, tesseract

Finally, add or uncomment the following lines:
focr_path_bin /usr/local/netpbm/bin:/usr/local/bin:/usr/bin

focr_preprocessor_file /etc/mail/spamassassin/FuzzyOcr.preps
focr_scanset_file /etc/mail/spamassassin/FuzzyOcr.scansets
focr_enable_image_hashing 2
focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
focr_db_hash /etc/mail/spamassassin/FuzzyOcr.db
focr_db_safe /etc/mail/spamassassin/
The last 4 lines set up image hashing with local database files instead of MySQL.
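Before testing, it can help to confirm that every program named in focr_bin_helper can actually be found in the directories listed in focr_path_bin. The following is a minimal sketch of such a check. find_in_path and check_helpers are names I invented for this example, and it assumes a POSIX-style sh or bash.

```shell
# find_in_path NAME SEARCH_PATH: print the first executable file called
# NAME found in the colon-separated SEARCH_PATH; fail if there is none.
# (Hypothetical helper, not part of FuzzyOCR.)
find_in_path() {
    name=$1
    search_path=$2
    old_ifs=$IFS
    IFS=:
    for dir in $search_path; do
        if [ -x "$dir/$name" ] && [ ! -d "$dir/$name" ]; then
            IFS=$old_ifs
            echo "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# check_helpers SEARCH_PATH NAME...: report each helper as ok or MISSING.
# Returns non-zero if at least one helper is missing.
check_helpers() {
    search_path=$1
    shift
    missing=0
    for helper in "$@"; do
        if location=$(find_in_path "$helper" "$search_path"); then
            echo "ok: $helper ($location)"
        else
            echo "MISSING: $helper"
            missing=1
        fi
    done
    return "$missing"
}

# With the values used in this article:
#   check_helpers /usr/local/netpbm/bin:/usr/local/bin:/usr/bin \
#       pnmnorm pnminvert convert ppmtopgm tesseract
```

Any helper reported as MISSING is either not installed or lives in a directory that still has to be added to focr_path_bin.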
Now we can feed the sample spam mails to SpamAssassin to check that it works together with FuzzyOCR:
/usr/bin/spamassassin --debug FuzzyOcr < /usr/src/FuzzyOcr-3.5.1/samples/ocr-gif.eml > /dev/null

As you can see, FuzzyOCR is working.
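The command above tests a single sample. To run every sample through SpamAssassin in one go, a loop like the one below could be used. It is only a sketch under a few assumptions: that the samples all have the .eml extension, and that the debug lines written to stderr mention FuzzyOcr by name. scan_samples and SA_BIN are names chosen for this example.

```shell
# SA_BIN can be overridden; the default is the path used in this article.
SA_BIN=${SA_BIN:-/usr/bin/spamassassin}

# scan_samples DIR: feed every *.eml file in DIR to SpamAssassin with
# FuzzyOcr debugging enabled and print how many debug lines mention
# FuzzyOcr. (Hypothetical helper; assumes debug output goes to stderr.)
scan_samples() {
    dir=$1
    for mail in "$dir"/*.eml; do
        [ -f "$mail" ] || continue
        # Send stderr into the pipe and discard the rewritten message.
        count=$("$SA_BIN" --debug FuzzyOcr < "$mail" 2>&1 >/dev/null \
            | grep -ci 'fuzzyocr' || true)
        echo "$(basename "$mail"): $count FuzzyOcr debug lines"
    done
}

# With the paths used in this article:
#   scan_samples /usr/src/FuzzyOcr-3.5.1/samples
```

A sample that reports 0 lines is worth rerunning by hand with the full debug output to see what went wrong.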

Now restart SpamAssassin and watch the logs closely (tail -f /var/log/) for errors from SpamAssassin or Perl modules.
Your SpamAssassin is now able to recognize image spam!