E-mail should be obsolete by now according to many social media proponents, yet we still rely on it to communicate complex concepts and expand our network of contacts. It remains an indispensable tool at work and at home, allowing us to filter and prioritize communications and manage actions against them accordingly. This also makes e-mail an irresistible medium for scammers and con artists trying to snare new victims, so spam volumes keep rising. We introduced some pretty aggressive filtering via SpamAssassin to address this trend, but we would occasionally lose a legitimate reply and needed a way to let responses through amidst the torrent of unwanted spam.
Our core messaging infrastructure is based on open source software, which is eminently customizable but generally not very well integrated as distributed. We needed a way to automatically whitelist responses to mail sent from our servers, but outgoing mail was sent through sendmail, cataloged in dovecot, and incoming mail was filtered through spamassassin. The simplest way of addressing this was to extract the “To” and “BCC” addresses from the mail in all the “Sent” folders and add them to a whitelist that is referenced by spamassassin.
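For reference, SpamAssassin picks up whitelist entries as `whitelist_from` lines in a configuration file. The following is only a minimal sketch of turning a list of addresses into such lines; the helper name `to_whitelist_lines` and the sample addresses are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read one address per line on stdin and emit
# SpamAssassin "whitelist_from" directives, skipping anything without an "@".
to_whitelist_lines() {
    grep '@' | sed -e 's/^/whitelist_from /'
}

# Example with made-up addresses; the middle entry is dropped.
printf '%s\n' 'alice@example.org' 'not-an-address' 'bob@example.net' | to_whitelist_lines
```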
Initially, we thought of reading the mail directly from the mailbox files for those folders on the filesystem, but this relies on the folders remaining in mbox format. We have been considering a move to the mdbox format for more efficient indexing and higher performance, so we ruled out any reliance on mbox over concerns about future compatibility.
We settled on using the doveadm search and fetch tools available in dovecot to get the envelope information on emails in the IMAP Sent folders maintained in dovecot. We could then extract the relevant header information, apply the data to the whitelist, then restart sendmail. We implemented this as a script (see below) and run it as a daily scheduled job. An optimization that reads only the mail sent since the last whitelist update allows for more frequent execution without introducing extreme overhead on the system.
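The "since the last update" optimization comes down to using the whitelist file's own modification time as the cut-off. A minimal sketch of that idea follows, assuming GNU stat; the helper name `last_update_epoch` is hypothetical and not part of dovecot or SpamAssassin.

```shell
#!/bin/sh
# Hypothetical helper: print the modification time of the given file as
# seconds since the epoch, or 0 if the file does not exist yet.
# Assumes GNU stat (the -c %Y form), as found on most Linux systems.
last_update_epoch() {
    if [ -f "$1" ]; then
        stat -c %Y "$1"
    else
        echo 0
    fi
}

# Usage sketch (not run here), mirroring the search used in the script:
#   since=$(last_update_epoch /etc/mail/spamassassin/sent_whitelist.cf)
#   doveadm search -u "$user" SENTSINCE "$since" mailbox Sent
```

Because the cut-off is read from the file itself, no separate state file is needed between runs; the script only has to make sure the whitelist's timestamp reflects when the run started.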
The initial load took a very long time. In retrospect, running the script's sed filter against an extract of all the sent mail in mbox file format reduces the initial load time by over 95%, because it avoids having dovecot extract the headers of each message individually.
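As a rough illustration of that one-pass approach, the sketch below pulls recipient addresses straight out of an mbox export. It is an assumption-laden stand-in, not the filter we actually ran: it uses awk and tr instead of the sed expression in the script, the helper name `mbox_to_addresses` is hypothetical, and it will also pick up any body line that happens to start with "To:".

```shell
#!/bin/sh
# Hypothetical helper: print every address found in "To:" headers of an
# mbox file, one per line. Continuation lines (starting with a space or
# tab) are treated as part of the "To:" header they follow.
mbox_to_addresses() {
    awk '
        /^To:/            { in_to = 1; sub(/^To:/, ""); print; next }
        in_to && /^[ \t]/ { print; next }
                          { in_to = 0 }
    ' "$1" | tr -s " \t,<>\"'" '[\n*]' | grep '@'
}

# Usage sketch (the file name is a placeholder):
#   mbox_to_addresses Sent.mbox
```

A single pass like this replaces one doveadm fetch per message, which is where the initial load spent most of its time.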
SPAMASS_DIR=/etc/mail/spamassassin
SM_FILE=${SPAMASS_DIR}/sent_whitelist.cf
IMAP_DIR=/var/spool/imap
TMPSENT=`mktemp`
# Get time marker from last file – set marker for next
touch .marker.new
if [ -f "${SM_FILE}" ] ; then
SINCET=`stat -c %Y ${SM_FILE}`
else
SINCET=0
fi
# Dovecot command to get "To" addresses - sed wraps correctly for long target lists
accts=`ls -1 ${IMAP_DIR}/`
for foo in $accts; do
  doveadm search -u $foo SENTSINCE $SINCET mailbox Sent | while read guid uid; do
    addrlst=`doveadm fetch -u $foo hdr mailbox-guid $guid uid $uid | sed -n '/^To/s/To:\(.*\)/\1 /;Tc;:b;N;s/\n / /;tb;s/\n\(.*\)//;s/\([<>,]\)//g;s/\"/ /g;s/'"'"'/ /g;p;:c;d'`
    for a in ${addrlst} ; do
      echo ${a} | grep @ | grep -v ".*@\(psind\.com\)" | sed -e 's/\(.*\)/whitelist_from \1/g'
    done
  done
done 2>/dev/null > ${TMPSENT}
cat ${SM_FILE} >> ${TMPSENT}
rm -f ${SM_FILE}
cat ${TMPSENT} | sort -u -f -t @ -k 2 -k 1 -o ${SM_FILE}
touch -r .marker.new ${SM_FILE}
rm -f .marker.new ${TMPSENT}
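The final merge step relies on sort to keep the whitelist free of duplicates. The sketch below isolates that step so its effect is easier to see; the helper name `dedupe_whitelist` and the sample addresses are hypothetical, and the locale is pinned to C here only to make the ordering predictable.

```shell
#!/bin/sh
# Hypothetical helper: merge whitelist lines from stdin into one sorted,
# de-duplicated list. Fields are split on "@": the domain is the first
# key (-k 2), the whole line the second (-k 1); -f folds case so that
# differently-cased copies of the same address collapse into one entry.
dedupe_whitelist() {
    LC_ALL=C sort -u -f -t @ -k 2 -k 1
}

# Example with made-up addresses: the two "bob" lines collapse into one,
# and entries end up grouped by domain.
printf '%s\n' \
    'whitelist_from bob@example.org' \
    'whitelist_from Bob@Example.org' \
    'whitelist_from amy@alpha.net' | dedupe_whitelist
```

Sorting by domain first keeps all addresses for one organization together, which makes the generated file easier to review by hand.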