apt-get install netpbm bzip2 libmldbm-perl libstring-approx-perl libmldbm-sync-perl
apt-get install liblog-agent-perl libdbi-perl libdbd-mysql-perl libtie-cache-perl
Download, extract, patch, compile and install libungif (even if you have libungif-bin installed):
cd /usr/local/src
wget http://internap.dl.sourceforge.net/sourceforge/giflib/libungif-4.1.4.tar.gz
tar xzvf libungif-4.1.4.tar.gz
cd libungif-4.1.4/util
wget http://users.own-hero.net/~decoder/fuzzyocr/giftext-segfault.patch
patch giftext.c < giftext-segfault.patch
cd ..
./configure --prefix=/usr && make && make install
Download, extract, compile and install gocr:
cd /usr/local/src
wget http://www-e.uni-magdeburg.de/jschulen/ocr/gocr-0.43.tar.gz
tar xzvf gocr-0.43.tar.gz
cd gocr-0.43
./configure --with-netpbm=/usr/lib --prefix=/usr && make && make install
At this point we should have all of these programs installed in /usr/bin:
which gifsicle
which giffix
which giftext
which gifinter
which giftopnm
which jpegtopnm
which pngtopnm
which bmptopnm
which tifftopnm
which ppmhist
which pamfile
which ocrad
which gocr
which pnmnorm
which pnminvert
which ppmtopgm
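The sixteen checks above can be collapsed into a loop. This is only a sketch, not part of the original procedure: report_missing is a made-up helper name, and it uses the shell builtin "command -v" in place of which.

```shell
#!/bin/sh
# Print the name of every program in the argument list that is NOT on PATH.
# report_missing is a hypothetical helper name, not part of FuzzyOcr.
report_missing() {
    for prog in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$prog" >/dev/null 2>&1 || echo "missing: $prog"
    done
}

# Usage, with the program list from this section:
# report_missing gifsicle giffix giftext gifinter giftopnm jpegtopnm pngtopnm \
#     bmptopnm tifftopnm ppmhist pamfile ocrad gocr pnmnorm pnminvert ppmtopgm
```

No output means every program was found. Note that this only checks the PATH; it does not confirm the programs live in /usr/bin.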
Visit http://fuzzyocr.own-hero.net/wiki/
If you are using SpamAssassin 3.1.x, install FuzzyOcr 3.5.1 (this document is based
on that version). If you are using SpamAssassin 3.2.x, skip these commands and
jump ahead to the next set of commands:
cd /usr/local/src
wget http://users.own-hero.net/~decoder/fuzzyocr/fuzzyocr-3.5.1-devel.tar.gz
tar xzvf fuzzyocr-3.5.1-devel.tar.gz
cd FuzzyOcr-3.5.1
If you are using SpamAssassin 3.2.x, then install FuzzyOcr from SVN:
apt-get install subversion
cd /usr/local/src
test -e FuzzyOcr-3.5.1 && mv FuzzyOcr-3.5.1 FuzzyOcr-3.5.1-old
test -e devel && mv devel devel-old
svn -r 131 co svn://svn.own-hero.net/fuzzyocr/trunk/devel
mv devel FuzzyOcr-3.5.1
cd FuzzyOcr-3.5.1
If you are using netpbm < 10.34 (Debian uses 10.0-10.1) you need to apply these
patches. They disable some features only available in newer versions:
wget http://www200.pair.com/mecham/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch
wget http://www200.pair.com/mecham/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch2
wget http://www200.pair.com/mecham/spam/FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch3
patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch
patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch2
patch -p0 < FuzzyOcr-3.5.0-rc1.netpbm_less_than_10.34.patch3
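Whether these patches apply comes down to a version comparison. Here is a minimal sketch of that comparison in plain shell, assuming you already have your Netpbm version as a MAJOR.MINOR string (how you obtain it, for example from your package manager, is left out). netpbm_older_than is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed (exit 0) if version $1 is older than version $2.
# Both arguments must look like MAJOR.MINOR, e.g. 10.0 or 10.34.
# netpbm_older_than is a hypothetical helper, not part of Netpbm or FuzzyOcr.
netpbm_older_than() {
    have_major=${1%%.*}; have_minor=${1#*.}
    want_major=${2%%.*}; want_minor=${2#*.}
    if [ "$have_major" -ne "$want_major" ]; then
        [ "$have_major" -lt "$want_major" ]
    else
        # Compare minors as integers, so 10.4 counts as older than 10.34.
        [ "$have_minor" -lt "$want_minor" ]
    fi
}

# Example: decide whether the netpbm_less_than_10.34 patches are needed.
if netpbm_older_than 10.0 10.34; then
    echo "apply the netpbm < 10.34 patches"
fi
```

The minor numbers are compared as integers rather than as decimal fractions, which is why the function splits the string instead of handing it to a floating point comparison.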
Our Debian version of Netpbm (10.0) may not contain both of these:
which pamtopnm
which pamditherbw
Those programs are used together; therefore we need to disable any scansets that use them
and we will also remove them from the preprocessors file. If you have both of these
programs, you do not need to apply these patches. If you are missing either one of them,
apply these patches:
wget http://www200.pair.com/mecham/spam/gary.3.5.0-rc1.old.netpbm.patch1
wget http://www200.pair.com/mecham/spam/gary.3.5.0-rc1.old.netpbm.patch2
wget http://www200.pair.com/mecham/spam/gary.3.5.0-rc1.old.netpbm.patch3
patch -p0 < gary.3.5.0-rc1.old.netpbm.patch1
patch -p0 < gary.3.5.0-rc1.old.netpbm.patch2
patch -p0 < gary.3.5.0-rc1.old.netpbm.patch3
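The rule above (patch if either program is missing) can be written as a short test. This is a sketch only; have_all is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed only if every named program can be found on PATH.
# have_all is a hypothetical helper name used for this illustration.
have_all() {
    for prog in "$@"; do
        command -v "$prog" >/dev/null 2>&1 || return 1
    done
    return 0
}

if have_all pamtopnm pamditherbw; then
    echo "both programs found: skip the gary.3.5.0-rc1.old.netpbm patches"
else
    echo "at least one program missing: apply the gary.3.5.0-rc1.old.netpbm patches"
fi
```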
Now that we are all patched up, we can place the files:
cp -r FuzzyOcr /etc/mail/spamassassin
cp FuzzyOcr.cf /etc/mail/spamassassin
cp FuzzyOcr.pm /etc/mail/spamassassin
cp FuzzyOcr.preps /etc/mail/spamassassin
cp FuzzyOcr.scansets /etc/mail/spamassassin
cp FuzzyOcr.words /etc/mail/spamassassin
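If you repeat this install on several machines, the copy steps can be wrapped in one function. This is a sketch under the assumption that the file names are exactly those listed above; install_focr_files is a hypothetical helper name.

```shell
#!/bin/sh
# Copy the FuzzyOcr plugin files from a source tree into a SpamAssassin
# configuration directory. install_focr_files is a hypothetical helper name.
install_focr_files() {
    src=$1
    dest=${2:-/etc/mail/spamassassin}
    # FuzzyOcr is a directory, hence -r.
    cp -r "$src/FuzzyOcr" "$dest" || return 1
    for f in FuzzyOcr.cf FuzzyOcr.pm FuzzyOcr.preps FuzzyOcr.scansets FuzzyOcr.words; do
        cp "$src/$f" "$dest" || return 1
    done
}

# Usage from inside the unpacked source directory:
# install_focr_files . /etc/mail/spamassassin
```

The function stops at the first failed copy, so a missing file is noticed right away rather than at lint time.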
Configure FuzzyOcr.cf:
vi /etc/mail/spamassassin/FuzzyOcr.cf
Set log level to 2 (only while we test):
#focr_verbose 3
focr_verbose 2
Enable logging by uncommenting this:
#focr_logfile /tmp/FuzzyOcr.log
Uncomment the timeout setting:
#focr_timeout 15
Some people think the default of 0.25 here is too fuzzy, so uncomment
(and maybe set to 0.21 or 0.22):
#focr_threshold 0.20
Set focr_base_score to 3 (this is my personal choice):
#focr_base_score 5
focr_base_score 3
I change focr_add_score from the default of 1, to 0.5:
#focr_add_score 0.375
focr_add_score 0.5
I lower focr_corrupt_score:
#focr_corrupt_score 2.5
focr_corrupt_score 1.5
I lower focr_corrupt_unfixable_score:
#focr_corrupt_unfixable_score 5
focr_corrupt_unfixable_score 2.5
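The edits above all follow one pattern: leave the commented default in place and put the active setting on the line below it. If you would rather script that than use vi, here is a minimal sketch. set_focr_option is a hypothetical helper name, and it assumes the commented defaults are written as "#name value" with no space after the "#", as in the lines shown above.

```shell
#!/bin/sh
# Set one option in a FuzzyOcr.cf style file.
# usage: set_focr_option FILE KEY VALUE
# set_focr_option is a hypothetical helper, not part of FuzzyOcr.
set_focr_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        # Drop any existing active line for this key.
        $1 == key { next }
        { print }
        # Put the new setting right after the first commented default.
        $1 == ("#" key) && !done { print key " " value; done = 1 }
        # No commented default found: append the setting at the end.
        END { if (!done) print key " " value }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage:
# set_focr_option /etc/mail/spamassassin/FuzzyOcr.cf focr_verbose 2
# set_focr_option /etc/mail/spamassassin/FuzzyOcr.cf focr_base_score 3
```

Writing back with cat rather than mv keeps the ownership and permissions of the original file.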
Save and exit the file, then we test. Start by linting spamassassin:
spamassassin --lint
If you get this error:
Subroutine FuzzyOcr::O_NONBLOCK redefined at /usr/share/perl/5.8/Exporter.pm line 65.
at /usr/lib/perl/5.8/POSIX.pm line 19
it appears to be related somehow to Net::Ident. If you run spamd with the --auth-ident
option then you need this module and will have to deal with the harmless error message
(actually it may not be harmless if you depend on a clean --lint).
If you don't need Net::Ident (you don't have any programs that use it), then I
suggest you remove it:
apt-get remove libnet-ident-perl
Once you have resolved any (serious) lint errors, we do some more testing.
This assumes you are still in the /usr/local/src/FuzzyOcr-3.5.1 directory:
cd samples
spamassassin -tD < ocr-animated.eml
I got:
5.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"price" in 1 lines
"company" in 1 lines
"alert" in 1 lines
"news" in 1 lines
(6 word occurrences found)
If you did not get something similar, the cause may be a timeout: on a low-powered
machine you may have to increase focr_timeout in /etc/mail/spamassassin/FuzzyOcr.cf.
Check the log for the last error message (if any):
cat /tmp/FuzzyOcr.log
Continue on to the next test:
spamassassin -tD < ocr-gif.eml
I got:
1.5 FUZZY_OCR_WRONG_CTYPE BODY: Mail contains an image with wrong
content-type set
Image has format "GIF" but content-type is
"image/jpeg"
2.5 FUZZY_OCR_CORRUPT_IMG BODY: Mail contains a corrupted image
Corrupt image: GIF-LIB error: Image is
defective, decoding aborted.
8.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"target" in 1 lines
"service" in 1 lines
"stock" in 2 lines
"price" in 2 lines
"company" in 1 lines
"recommendation" in 1 lines
(12 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-jpg.eml
I got:
5.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"levitra" in 1 lines
"cialis" in 1 lines
"viagra" in 2 lines
(6 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-obfuscated.eml
I got:
3.0 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"profit" in 1 lines
"profit" in 1 lines
(2 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-png.eml
I got:
15 FUZZY_OCR BODY: Mail contains an image with common spam text inside
Words found:
"buy" in 1 lines
"target" in 2 lines
"service" in 1 lines
"stock" in 1 lines
"investor" in 1 lines
"price" in 3 lines
"company" in 2 lines
"trade" in 1 lines
"software" in 1 lines
"recommendation" in 1 lines
"news" in 3 lines
(25.5 word occurrences found)
Continue on to the next test:
spamassassin -tD < ocr-wrongext.eml
I got:
1.5 FUZZY_OCR_WRONG_CTYPE BODY: Mail contains an image with wrong
content-type set
Image has format "GIF" but content-type is
"image/jpeg"
1.5 FUZZY_OCR_WRONG_EXTENSION BODY: Mail contains an image with wrong
file extension
Image has format "GIF" but file extension is
"jpeg"
2.5 FUZZY_OCR_CORRUPT_IMG BODY: Mail contains a corrupted image
Corrupt image: GIF-LIB error: Image is
defective, decoding aborted.
8.0 FUZZY_OCR_KNOWN_HASH BODY: Mail contains an image with known hash
Words found:
"target" in 1 lines
"service" in 1 lines
"stock" in 2 lines
"price" in 2 lines
"company" in 1 lines
"recommendation" in 1 lines
(12 word occurrences found)
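To repeat all six tests after a configuration change, the samples can be run in a loop that prints only the rule lines. A minimal sketch follows; run_focr_samples and the FOCR_SCAN_CMD variable are hypothetical names invented here, and the default command is the same "spamassassin -t" test mode used above, without -D to keep the output short.

```shell
#!/bin/sh
# Run every ocr-*.eml sample in a directory through a scanner command and
# print only the lines that mention FUZZY_OCR.
# run_focr_samples and FOCR_SCAN_CMD are hypothetical names for this sketch.
run_focr_samples() {
    dir=${1:-.}
    for msg in "$dir"/ocr-*.eml; do
        [ -f "$msg" ] || continue
        echo "== $msg"
        # Intentionally unquoted so that "spamassassin -t" splits into words.
        ${FOCR_SCAN_CMD:-spamassassin -t} < "$msg" 2>/dev/null | grep 'FUZZY_OCR'
    done
    return 0
}

# Usage from /usr/local/src/FuzzyOcr-3.5.1:
# run_focr_samples samples
```

A sample that prints its "==" header and nothing else produced no FuzzyOcr hit, which is the cue to look at the log.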
Reload amavisd-new (or spamd, for those that use spamd) and send a test message through.
You will have to give ownership of the log file to the amavis user (or whatever user
is sending the message through spamd or spamassassin):
chown amavis:amavis /tmp/FuzzyOcr.log
You can grab host.gif from me, attach it to a message and send it through the spam filter.
Tail the log file as you send it through:
tail -f /tmp/FuzzyOcr.log
I got:
2006-12-25 19:50:47 [15341] Processing Message with ID
"<1499292447.20061225195047@example.net>" (Reporter
-> garyv@example.com)
2006-12-25 19:50:47 [15341] GIF: [192x361] host.gif (3696)
2006-12-25 19:50:47 [15341] Found: 1 images
2006-12-25 19:50:47 [15341] Found GIF header name="host.gif"
2006-12-25 19:50:47 [15341] Image is single non-interlaced...
2006-12-25 19:50:47 [15341] Image hashing disabled in configuration, skipping...
2006-12-25 19:50:47 [15341] Scanset Order: ocrad(0) ocrad-invert(0) gocr(0) gocr-180(0)
2006-12-25 19:50:47 [15341] Scanset "ocrad" found word "drugs" with fuzz of 0.0000
line: "ed drugs "
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "cialis" with fuzz of 0.0000
line: "viagrd cialis letra "
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "viagra" with fuzz of 0.1667
line: "viagrd cialis letra "
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "price" with fuzz of 0.0000
line: "iowest onlie price garanteedi"
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "profit" with fuzz of 0.1667
line: "w guarantee oo topqalityofthe prodit we oi"
2006-12-25 19:50:48 [15341] Scanset "ocrad" found word "prescription" with fuzz of 0.0000
line: "uick here wo prescription re uiredi "
2006-12-25 19:50:48 [15341] Scanset "ocrad" generates enough hits (6),
skipping further scansets...
2006-12-25 19:50:48 [15341] Message is spam, score = 6.500
2006-12-25 19:50:48 [15341] Words found:
"drugs" in 1 lines
"cialis" in 1 lines
"viagra" in 1 lines
"price" in 1 lines
"profit" in 1 lines
"prescription" in 1 lines
(9 word occurrences found)
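The fuzz values in this log are worth a second look when you tune focr_threshold: hits with a fuzz of 0.0000 are exact matches, while the nonzero ones ("viagra" found in "viagrd", "profit" found in "prodit") are the ones a lower threshold could drop. Here is a small sketch that pulls those values out of the log, assuming the log lines look like the ones above; focr_log_hits is a hypothetical helper name.

```shell
#!/bin/sh
# List "fuzz word scanset" for every hit in a FuzzyOcr log whose fuzz value
# is at least MIN_FUZZ (default 0).
# usage: focr_log_hits LOGFILE [MIN_FUZZ]
# focr_log_hits is a hypothetical helper, not part of FuzzyOcr.
focr_log_hits() {
    sed -n 's/.*Scanset "\([^"]*\)" found word "\([^"]*\)" with fuzz of \([0-9.]*\).*/\3 \2 \1/p' "$1" |
        awk -v min="${2:-0}" '$1 + 0 >= min + 0'
}

# Usage: show only the inexact matches in the test log.
# focr_log_hits /tmp/FuzzyOcr.log 0.01
```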
Edit FuzzyOcr.cf, turn off verbose logging and set focr_autodisable_score back to
a suitable level:
vi /etc/mail/spamassassin/FuzzyOcr.cf
focr_verbose 0
I set the focr_autodisable_score to the same value as my
$sa_kill_level_deflt in amavisd.conf:
focr_autodisable_score 8
Once again reload amavisd-new (or spamd):
amavisd-new reload
And keep an eye on the mail.log for a while:
tail -f /var/log/mail.log
In FuzzyOcr.cf I suggest you do not configure an image hash database. If you decide to,
remember that the database must be writable by the user running SA. For example, if you are
running amavisd-new, the user (on a Debian system) would be 'amavis'. The easiest way to allow
the amavis user to write to database files would be to place the files in the amavis
home directory or subdirectory (/var/lib/amavis or /var/lib/amavis/db for example).
This also applies to the log file. Do not leave the log file in the /tmp directory
if you plan on doing any logging at all or you risk filling up the /tmp directory.
Move it into /var/log or the home directory of the user running SA and keep an eye on it.
If you want to rotate the log, study 'man logrotate' and any examples you might find in
/etc/logrotate.d/.
focr_verbose 0
focr_logfile /var/log/FuzzyOcr.log
touch /var/log/FuzzyOcr.log
chown amavis:amavis /var/log/FuzzyOcr.log
This seems to work as the contents of /etc/logrotate.d/FuzzyOcr:
/var/log/FuzzyOcr.log {
rotate 7
daily
compress
delaycompress
copytruncate
notifempty
}
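For those who prefer to create that file from a script, here is a sketch that writes the same stanza. write_focr_logrotate is a hypothetical helper name; the default target is the path named above.

```shell
#!/bin/sh
# Write the logrotate stanza shown above to a file.
# usage: write_focr_logrotate [TARGET]
# write_focr_logrotate is a hypothetical helper name.
write_focr_logrotate() {
    target=${1:-/etc/logrotate.d/FuzzyOcr}
    cat > "$target" <<'EOF'
/var/log/FuzzyOcr.log {
    rotate 7
    daily
    compress
    delaycompress
    copytruncate
    notifempty
}
EOF
}

# After writing the file (as root), a dry run shows what logrotate would do
# without touching the log:
# logrotate -d /etc/logrotate.d/FuzzyOcr
```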
If using amavisd-new at log levels higher than 0, you may see something similar
to this in your log file:
Feb 4 11:40:21 sfa amavis[13467]: (13467-01) extra modules loaded:
/etc/spamassassin/FuzzyOcr.pm, /usr/lib/perl/5.8/auto/Storable/autosplit.ix,
/usr/share/perl5/auto/Log/Agent/Priorities/autosplit.ix,
/usr/share/perl5/auto/Log/Agent/autosplit.ix, FuzzyOcr/Config.pm,
FuzzyOcr/Deanimate.pm, FuzzyOcr/Hashing.pm, FuzzyOcr/Logging.pm,
FuzzyOcr/Misc.pm, FuzzyOcr/Preprocessor.pm, FuzzyOcr/Scanset.pm,
FuzzyOcr/Scoring.pm, Log/Agent.pm, Log/Agent/Formatting.pm,
Log/Agent/Message.pm, Log/Agent/Priorities.pm, MLDBM.pm, MLDBM/Sync.pm,
MLDBM/Sync/SDBM_File.pm, SDBM_File.pm, Storable.pm,
String/Approx.pm, Tie/Cache.pm
If you are using amavisd-new 2.4.3 or newer you can do something about this.
FuzzyOcr will work fine even if you don't, but doing so gets rid of the extra
log entry and optimizes the loading of FuzzyOcr and other modules (which would be nice).
From the amavisd-new RELEASE_NOTES:
- added a global configuration variable @additional_perl_modules, which
is a list of additional Perl module names or absolute file names that
should be compiled/executed (by calling 'require') at a program startup
time by a master parent process, before chroot-ing and before changing
UID takes place. Its purpose is to pre-load additional non-standard
SpamAssassin plugins and similar modules that a standard SpamAssassin
initialization would miss, causing them to be loaded later by each
child process, which is inefficient and may not work in a chrooted
process. Example:
@additional_perl_modules = qw(
/usr/local/etc/mail/spamassassin/FuzzyOcr.pm
/usr/local/etc/mail/spamassassin/ImageInfo.pm
/usr/local/etc/mail/spamassassin/WebRedirect.pm
String::Approx Net::HTTP Net::HTTP::Methods
URI URI::http URI::_generic URI::_query URI::_server
HTTP::Date HTTP::Headers HTTP::Message HTML::HeadParser
HTTP::Request HTTP::Response HTTP::Status
LWP LWP::Protocol LWP::Protocol::http
LWP::UserAgent LWP::MemberMixin LWP::Debug
);
Make sure these files are owned by root and not writable by unprivileged
users such as amavis!
So transferring the log entry nearly literally into amavisd.conf (and then
reloading amavisd-new) seems to work:
@additional_perl_modules = qw(
/etc/spamassassin/FuzzyOcr.pm
/usr/lib/perl/5.8/auto/Storable/autosplit.ix
/usr/share/perl5/auto/Log/Agent/Priorities/autosplit.ix
/usr/share/perl5/auto/Log/Agent/autosplit.ix
FuzzyOcr/Config.pm FuzzyOcr/Deanimate.pm
FuzzyOcr/Hashing.pm FuzzyOcr/Logging.pm
FuzzyOcr/Misc.pm FuzzyOcr/Preprocessor.pm
FuzzyOcr/Scanset.pm FuzzyOcr/Scoring.pm
Log/Agent.pm Log/Agent/Formatting.pm
Log/Agent/Message.pm Log/Agent/Priorities.pm
MLDBM.pm MLDBM/Sync.pm MLDBM/Sync/SDBM_File.pm
SDBM_File.pm Storable.pm String/Approx.pm
Tie/Cache.pm
);
I imagine it should (or at least could) be rephrased:
@additional_perl_modules = qw(
/etc/spamassassin/FuzzyOcr.pm
/usr/lib/perl/5.8/auto/Storable/autosplit.ix
/usr/share/perl5/auto/Log/Agent/Priorities/autosplit.ix
/usr/share/perl5/auto/Log/Agent/autosplit.ix
FuzzyOcr::Config FuzzyOcr::Deanimate
FuzzyOcr::Hashing FuzzyOcr::Logging
FuzzyOcr::Misc FuzzyOcr::Preprocessor
FuzzyOcr::Scanset FuzzyOcr::Scoring
Log::Agent Log::Agent::Formatting
Log::Agent::Message Log::Agent::Priorities
MLDBM MLDBM::Sync MLDBM::Sync::SDBM_File
SDBM_File Storable String::Approx
Tie::Cache
);
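The rephrasing from the first list to the second is mechanical: drop the .pm suffix and turn each "/" into "::", leaving the absolute paths alone. A minimal sketch of that conversion follows; path_to_module is a hypothetical helper name.

```shell
#!/bin/sh
# Convert path-style module names (Log/Agent.pm) to Perl package names
# (Log::Agent). Absolute paths are passed through unchanged, matching the
# lists above. path_to_module is a hypothetical helper name.
path_to_module() {
    for name in "$@"; do
        case $name in
            /*) echo "$name" ;;
            *)  echo "$name" | sed -e 's/\.pm$//' -e 's#/#::#g' ;;
        esac
    done
}

# Usage:
# path_to_module FuzzyOcr/Config.pm Log/Agent/Priorities.pm Tie/Cache.pm
```

Feeding it the relative names from your own "extra modules loaded" log entry gives you the middle of the list; the qw( ... ) wrapper still has to be added by hand.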
One last note. If you run amavisd-new in debug mode, when SpamAssassin
initializes you will see entries such as this:
[...] /usr/sbin/amavisd-new[13545]: SpamControl: initializing Mail::SpamAssassin
Subroutine new redefined at /etc/spamassassin/FuzzyOcr.pm line 48.
Subroutine dummy_check redefined at /etc/spamassassin/FuzzyOcr.pm line 59.
Subroutine fuzzyocr_check redefined at /etc/spamassassin/FuzzyOcr.pm line 63.
Subroutine fuzzyocr_do redefined at /etc/spamassassin/FuzzyOcr.pm line 101.
[...] /usr/sbin/amavisd-new[13545]: SpamControl: init_pre_fork done
I don't believe there is much significance to these somewhat unexpected messages.