How spam may feed the thinking machine

Spam may be the ultimate network pollutant, but cleaning it up may teach us more than patience
Written by Rupert Goodwins, Contributor
It is hard to find a good word to say for spam. Incoherent, unpleasant and unwanted, it slimes through cyberspace on the backs of zombies and oozes into our inboxes with the stench of month-old haddock. Yet far from fatally clogging up our information arteries, spam may provide the impetus for a true revolution in information technology -- one we've been expecting for more than fifty years.

All the problems caused by the stuff can be solved if we can answer one simple question: what is spam? You and I know within a second of opening a piece of email whether it's spam or not -- but computers are terribly bad at replicating the task. All spam filters suffer from two problems, the false negative and the false positive. We can -- we do -- put up with the false negatives, the spam written cleverly enough to bypass whichever tests are flavour of the month.

False positives, where a real email is junked before we read it, are potentially ruinous, so unless filters are absolutely sure, they err on the side of slackness. They are never absolutely sure, so some spam always gets through. And because spam works on the law of averages, as long as some gets through, the spammers will ramp up the volume until enough of it lands to make the sums work. The pressure on our systems is immense.
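
The trade-off is easy to see with a toy example. The sketch below is mine, not any real filter's: the scores, the labels and the `confusion` helper are all made up for illustration, and it assumes a filter that gives each message a spamminess score between 0 and 1 and junks anything at or above a threshold.

```python
# Hypothetical filter output: (spamminess score, whether it really is spam).
SCORED = [
    (0.99, True), (0.95, True), (0.80, True), (0.60, True),
    (0.70, False), (0.30, False), (0.10, False), (0.05, False),
]

def confusion(scored, threshold):
    """Return (false positives, false negatives) for a given junk threshold."""
    # A false positive is a real email that gets junked.
    false_pos = sum(1 for s, is_spam in scored if s >= threshold and not is_spam)
    # A false negative is a spam message that gets through.
    false_neg = sum(1 for s, is_spam in scored if s < threshold and is_spam)
    return false_pos, false_neg

for threshold in (0.5, 0.9):
    fp, fn = confusion(SCORED, threshold)
    print(f"threshold {threshold}: {fp} real email junked, {fn} spam let through")
```

On this invented data, the eager threshold of 0.5 catches every spam but junks one real email, while the slack threshold of 0.9 loses no real mail and lets two spams through. Filters pick the second kind of mistake.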

So what's so hard about spotting spam? By common consent, the first serious spammers were Laurence Canter and Martha Siegel, who started sending out mass postings in 1994 advertising immigration services. At once, the battle was joined: people started writing filters and ditching missives from Canter and Siegel's ISP -- as the only spammers on the planet, they were easy to find. They changed ISP (not entirely voluntarily) and the arms race between spammers and filters had begun.

Since then, spam-filter software has learned -- for example -- that copies of the same spam look very much alike, so the spammers learned to include different random text in each message. Then the filters found that some fairly simple tests for basic English construction spotted the randomness, so the spammers learned to construct fake English sentences or include snippets of surreally inappropriate text. Key words were a giveaway, so the spammers learned to misspell and punctuate violently.

By now, the whole business resembles a planetwide reverse Turing test. Instead of human arbiters deciding whether their interlocutor is man or machine, uncountable thousands of filtering robots anxiously scan gigabytes of chatter to fish out the spawn of their evil cousins. It turns out that the only way to be sure whether something is spam is to look at it like a human, with all our knowledge of context, language, meaning and intent. In short, you must be truly intelligent to do the job. Suddenly, the mildly moribund field of AI has a real job to do: saving the world.

Evidence of this can be found as far afield as the University of Melbourne, where programmers Matthew Sullivan and Guy Di Mattina, together with mathematics lecturer Dr Kevin Gates, have stapled a Support Vector Machine to an email firewall to get a claimed rate of 90 emails a second with one error every 25,000 messages. Support Vector Machines are fearsome mathematical constructs that have only just escaped from the lab. As far as I can make out, they seek non-linear hyperplanes in Hilbert space using Lagrangian transforms -- check http://www.kernel-machines.org/ if you don't believe me.

Whatever the details, an SVM looks at data in lots of ways at once -- it extends the variables in the data into many dimensions -- and then learns which characteristics mark out members of one set from another. The eponymous support vectors are the training examples that sit closest to the dividing line between the two sets, and they alone fix where that line runs: once the machine has established them, filtering is a matter of finding out which side of the line each message falls. Performance is predictable and amenable to optimisation: in short, this is one of the most powerful methods of handling real-world data within a computer that has yet been developed.
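
For the curious, here is roughly what "finding the dividing line" looks like in miniature. This is my own toy sketch, not the Melbourne team's system: it is the simplest linear variant, with no kernel and no bias term, trained by nudging word weights whenever a message lands on the wrong side of the line or too close to it. The six training messages and every function name are invented for illustration.

```python
from collections import Counter

# Invented training set: +1 marks spam, -1 marks real mail.
TRAINING = [
    ("cheap pills buy now", +1),
    ("buy cheap watches now", +1),
    ("win money now", +1),
    ("meeting moved to monday", -1),
    ("lunch on monday", -1),
    ("please review the report", -1),
]

def features(text):
    """Bag-of-words counts: each distinct word is one dimension."""
    return Counter(text.lower().split())

def score(weights, feats):
    """Which side of the line a message falls: positive means the spam side."""
    return sum(weights.get(word, 0.0) * count for word, count in feats.items())

def train_linear_svm(examples, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the regularised hinge loss (bias omitted)."""
    weights = {}
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            # The margin is violated if the message is misclassified
            # or sits too close to the dividing line.
            violated = label * score(weights, feats) < 1.0
            # Regularisation shrinks every weight slightly on each step.
            for word in weights:
                weights[word] *= 1.0 - lr * lam
            # Only margin violators move the line.
            if violated:
                for word, count in feats.items():
                    weights[word] = weights.get(word, 0.0) + lr * label * count
    return weights

def is_spam(weights, text):
    """Junk a message only if it lands strictly on the spam side."""
    return score(weights, features(text)) > 0.0

weights = train_linear_svm(TRAINING)
print(is_spam(weights, "buy cheap pills now"))
print(is_spam(weights, "please review the report on monday"))
```

A message made entirely of words the machine has never seen scores exactly zero and is let through, which is the same slackness real filters show when in doubt. A production SVM would add a kernel to bend the line and far more training data than six messages.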

With spam estimated to be costing tens of billions of dollars worldwide each year, the motivation to develop really effective filtering is intense -- and that's before the fact that whoever defeats the Spam Monster will be crowned God-Emperor for Life and forever be preceded by dusky maidens and/or oiled hunks (delete according to taste) casting rose petals in their path. If the evil of spam leads to a renaissance of well-funded research into fundamental knowledge systems -- nothing else will do -- it could be the final kick we need to create truly intelligent machines. What they say when they find out they're being fed a diet of pure rubbish will be another matter: we'd better get our excuses ready, and fast.
