Updated: 2003-01-06; 10:10:37 PM
Doug's Inner Net News
    News and views from a software developer's perspective

daily link  Monday, August 05, 2002

More on my ideas to build effective spam filters...

What we want is to identify the text of the convincing section, the action section, and the excuse section.  How can we do this?

Here's my idea.

From a selection of normal English language text -- however that is defined -- create a database that counts how often each one-, two-, and three-word combination occurs.  Create a similar database from spam text.  We use these databases to compute the "surprise" value of the one-, two-, and three-word combinations in the text we are testing.  Common word combinations have a low surprise value.  The word "the", for example, has almost no surprise value.  The word "viagra" has high surprise value in normal text, but low surprise value in spam text.  (This is basic information theory.)  When we test the text of a message, we look for those word combinations that have a high surprise value against normal English text, but that have a low surprise value against spam text.
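Here is a rough Python sketch of that idea, just to make it concrete.  The function names, the add-one smoothing, the shared count table for all three combination lengths, and the `margin` parameter are all my own assumptions for illustration, not a tuned design.

```python
import math
from collections import Counter

def ngrams(text, max_n=3):
    """Yield every one-, two-, and three-word combination in the text."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

def build_database(corpus):
    """Count how often each word combination occurs in a training corpus."""
    return Counter(ngrams(corpus))

def surprise(combo, database):
    """Surprise in bits: -log2 of the smoothed relative frequency."""
    total = sum(database.values())
    # Add-one smoothing (an assumption) so that unseen combinations get a
    # large but finite surprise value.
    probability = (database[combo] + 1) / (total + len(database) + 1)
    return -math.log2(probability)

def spam_indicators(message, normal_db, spam_db, margin=1.0):
    """Combinations that are surprising in normal text but expected in spam."""
    return sorted({
        combo for combo in ngrams(message)
        if surprise(combo, normal_db) - surprise(combo, spam_db) > margin
    })
```

With databases built from real text, `spam_indicators(message, normal_db, spam_db)` would list the combinations that push a message toward the spam side.  How many of them it takes to call a message spam is one of the parameters that would need tuning.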

As protection against false positives, we could also update a third database, getting the word combinations from mail received from known non-spam sources.  There would be many word combinations that have a high surprise value against normal English text -- first names of your friends, relatives, or co-workers, for example -- and a low surprise value against the text in legitimate mail.  If the spam detector finds enough of these word combinations, then it would classify the mail as legitimate.  Of course, this part of the spam filter works best if it is used for individuals, rather than the general public.  However, it may also be effective for certain groups, such as a family group or a work group.  The more that group has in common, the more effective this phase of the detector will be.

We could refine this basic technique.  If we manually "teach" the detector, we could be careful to separate the spam text into convincing text, action text, and excuse text, then create a separate database for each section.  Then, we would have separate computed numbers for matching the convincing text, the action text, and the excuse text.  We could weight these numbers differently when computing the overall score.
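To show what weighting the three numbers might look like, here is a small Python sketch.  The weight values are placeholders I made up, and I am assuming each section score has already been scaled to the range 0 to 1.  Since the excuse section is optional, only the sections actually present are counted.

```python
# Hypothetical weights: placeholders to be tuned, not measured values.
DEFAULT_WEIGHTS = {"convincing": 1.0, "action": 2.0, "excuse": 0.5}

def overall_score(section_scores, weights=DEFAULT_WEIGHTS):
    """Weighted average of per-section match scores, each assumed in [0, 1]."""
    total_weight = sum(weights[name] for name in section_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(weights[name] * score
                   for name, score in section_scores.items())
    return weighted / total_weight
```

For example, `overall_score({"convincing": 0.4, "action": 0.9})` leans toward the action score because of its larger weight.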

This is all theory.  In practice, a lot of work must be done to tune the parameters.  My guess is that this kind of analysis in a spam detector would work pretty well, assuming the learning phase was high quality.

 
2:14:55 PM  permalink 


But, in fact, most spammers are pretty dumb.  Most of them wouldn't have a clue about how to get past spam filters.  I mean, putting 'viagra' in the subject line?!  That's a dead giveaway!  I suspect the reason there are so many ignorant spammers is that they buy the do-it-yourself spam software.  There is no one to help them carefully craft their message so as to get past spam filters.  There are professional spamming services that can help businesses with a spam campaign.  But they probably cost quite a bit of money.  I believe it will eventually get to the point where do-it-yourself spam just doesn't work -- it can't get past the spam filters.  In that case, "effective" spam will require professional help.  And that will cost the businesses that use spam.  As a result, only businesses that can afford to spam will do so.  (I assume the cost would be over $1000 per mailing.)  That would make spam much more predictable.  The products offered would be mortgages, spy equipment, fake diplomas, dirty pictures, etc.  As the selection of products offered becomes smaller, the job of filters becomes much easier, because there is a much smaller "vocabulary".

So, again, I am optimistic about the possibility of getting spam under control.

 
4:57:44 AM  permalink 


Rick's Spam Filters:  Very interesting.  Simple filters that one can use in Eudora to eliminate a lot of spam.

It's a late night, and I just can't stop thinking about possible techniques to eliminate spam.  I'm a believer.  I believe that spam can be filtered effectively.  Maybe it will take some sophisticated artificial intelligence, but I believe it can be done.

Consider the random tags that spammers put at the end of the subject line.  It's a pretty clear giveaway that the email is spam.  Why do they put those tags there?  My guess is to randomize the subject line.  Probably there are server-operated filters that decide a message is spam if they see the same subject line more than N times, where N is some very large number.  Perhaps the filter computes a checksum and the random string alters the checksum.  So, is it possible to detect these random tags?  Absolutely!  Apply basic information theory.  Compute the frequency of all two-letter combinations in English words.  Then use these frequencies to compute the "information" in the last word of the subject line.  Because of the unusually low frequencies of the two-letter combinations of the random string, the "information" will be very high.  In plain English, this just means that there is not sufficient redundancy in the random string to make it look like an English word.
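A minimal Python sketch of the two-letter idea follows.  The names are hypothetical, the add-one smoothing is my own assumption, and I average the information over the letter pairs so that long and short words can be compared.  A real detector would train on a large English word list and would need a tuned cutoff.

```python
import math
from collections import Counter

def letter_pairs(word):
    """All adjacent two-letter combinations in a word, lowercased."""
    word = word.lower()
    return [word[i:i + 2] for i in range(len(word) - 1)]

def build_pair_counts(words):
    """Count two-letter combinations across a list of English words."""
    counts = Counter()
    for word in words:
        counts.update(letter_pairs(word))
    return counts

def pair_information(word, counts):
    """Average information, in bits, of the letter pairs in a word."""
    pairs = letter_pairs(word)
    if not pairs:
        return 0.0
    total = sum(counts.values())
    # Add-one smoothing (an assumption) keeps unseen pairs finite.
    vocabulary = len(counts) + 1
    bits = 0.0
    for pair in pairs:
        probability = (counts[pair] + 1) / (total + vocabulary)
        bits += -math.log2(probability)
    return bits / len(pairs)

def last_word_information(subject, counts):
    """Information of the final word of a subject line."""
    words = subject.split()
    return pair_information(words[-1], counts) if words else 0.0
```

A subject line whose last word scores well above what ordinary English words score would be flagged as carrying a random tag.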

But subject lines won't do.  How about those spammers that just put no subject line at all?  Or else their subject line is "hi"?  To really fight spam, we must look at the text of the message.

Surely, there are lots of tricky things we could do to detect spam.  For example, if the message is malformed, it's spam.  If there are a lot of recipients in the TO or CC line, then it's spam.  If it contains an HTML table, then it's spam.  The problem with all these "tricks" is that once spammers catch on to them, they just change their messages to defeat the spam detectors.  It's spy vs. spy.

So, we have to be smarter.  We want to find spam detectors that are very difficult to defeat, because they do not rely on any "tricks".  To do that, we have to look at the text of the message.

But first, let's consider that every spam filter should have a white list.  Email that comes from your co-workers or relatives should always get through.  Let's consider the possibility that spammers who have amassed huge lists of email addresses could forge the sender, and make the sender an email address from the same domain as the recipient.  So, maybe we need to also consider a super whitelist that contains senders who are authenticated via digital IDs.  The whitelist will always take precedence over other decisions.

Next, let's consider that spammers often use fake sender addresses.  So, a good spam filter should try to verify the sender's email address.  If it's not a valid email address, let's declare it to be spam.  We can verify the email address by connecting to the SMTP server for the sender's domain and proceeding to where we would send the DATA command.  At that point, we send an RSET and QUIT.  If the SMTP server will not accept the RCPT command, then the email address is invalid.  An alternative is to use the VRFY command.  Some ISPs block port 25, and some ISPs redirect it to their own SMTP server, so it may not be possible to verify the email address directly.  Perhaps an email address verifier would make a good web service.
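The SMTP conversation described above could be sketched in Python with the standard `smtplib` module.  The function name and the `probe_sender` parameter are hypothetical, and I am assuming the caller has already looked up the mail server for the sender's domain and opened the connection.

```python
import smtplib

def address_is_accepted(smtp, probe_sender, address):
    """Walk an SMTP session up to the point of DATA, then back out.

    `smtp` is an already connected smtplib.SMTP object (or anything with
    the same helo/mail/rcpt/rset/quit methods).
    """
    try:
        smtp.helo()
        code, _ = smtp.mail(probe_sender)
        if code != 250:
            return False
        code, _ = smtp.rcpt(address)
        # 250 and 251 both mean the server accepted the recipient.
        accepted = code in (250, 251)
        smtp.rset()
        return accepted
    finally:
        try:
            smtp.quit()
        except smtplib.SMTPException:
            pass  # the server may already have dropped the connection
```

A call might look like `address_is_accepted(smtplib.SMTP(mail_host, 25, timeout=10), "probe@example.org", sender)`, where `mail_host` is whatever the MX lookup returned.  Keep in mind that some servers accept every RCPT, so a rejection tells us more than an acceptance does.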

Finally, we get down to looking at the text of the message.  In order for a spammer to succeed, he has to ask you to do something in response to the message.  Before he asks you, he probably tries to convince you.  And, of course, many spammers also excuse themselves.  So, there are two required sections: a convincing section (he tries to convince you of the value of whatever he is offering) and an action section (he asks you to do something in response).  And there is one optional section: an excuse section (he tells you this is not spam because you put your name on an opt-in list, he tells you how to get off the list, he explains he is in compliance with the law, he asks your forgiveness for the intrusion, etc).

The easiest section to deal with must certainly be the action section.  Most commonly, the action is to click on a hyperlink.  I think this could be handled in two different ways.  One way is to get everyone to comply.  The spam filter software should make a request for the URL.  With a little luck, the volume of requests would overwhelm the server and take it down.  It certainly would cost the spammer more in fees to the hosting provider.  A big problem with this approach, however, is that smart people could use it to get information that we may not want them to get. They might send just a few messages knowing that a request will automatically be made on that URL. (What a great way to know that your message was received!)

A different way to handle the action hyperlink would be to check the hostname of the URL against a list of hostnames.  This could be done with a centralized server.  Or it could be done via a P2P dissemination of the list.  There could be a whitelist of hostnames and a blacklist.  The whitelist means that if I mail a URL to a friend for an interesting article on the web, the message won't be classified as spam.  The blacklist means that if a URL contains a blacklisted hostname, it is classified as spam.
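As a sketch of the hostname check, here is some Python that pulls the hostnames out of a message and compares them against the two lists.  The names, the simple URL pattern, and the rule that the blacklist wins when a message contains both kinds of hostname are my assumptions.  How the lists get distributed (central server or P2P) is left out entirely.

```python
import re
from urllib.parse import urlparse

# A deliberately simple pattern; real mail needs more careful URL extraction.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

def hostnames_in(message):
    """Return the set of lowercased hostnames of all URLs in the message."""
    hosts = set()
    for url in URL_PATTERN.findall(message):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # skip URLs that cannot be parsed
        if host:
            hosts.add(host.lower().rstrip("."))
    return hosts

def classify_by_hostname(message, whitelist, blacklist):
    """Classify a message as 'spam', 'legitimate', or 'unknown'."""
    hosts = hostnames_in(message)
    if hosts & blacklist:
        return "spam"          # assumption: any blacklisted host wins
    if hosts and hosts <= whitelist:
        return "legitimate"    # every linked host is whitelisted
    return "unknown"           # leave the decision to the other tests
```

A message with no URLs, or with hostnames on neither list, comes back as "unknown" so that the text analysis can still have its say.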

I know there are efforts underway to create checksums of messages and compare them to a list.  I think that if we compare just the hostnames of any URLs against a list, that should be sufficient to filter spam.

Certainly, there are many techniques we could use to filter spam.  Ultimately, though, the best filters will try to make sense of the actual meaning of the message.  In order to make progress toward this goal, I think breaking the message down into the three sections that I mentioned above is the starting point.  There are only so many words that are used to try to convince someone of a point.  There are only so many words that are used to tell someone how to take an action.  Can we find a way to analyze the words in the message to discern a convincing section and an action section?

 
4:40:10 AM  permalink 


Copyright 2003 © Doug Sauder