Building a Better Spam Trap

Content analysis of image spam is particularly difficult.

According to Wikipedia, there are about 3,500 species of cockroaches. Once your house is infested, cockroaches are nearly impossible to get rid of. Spam is to e-communication as cockroaches are to dwellings.

One species of spam is called image spam. The HTML body of an email message can contain a link to the spammy image. When the message is opened in an HTML-capable email reader (e.g., Outlook on your desktop or a webmail client like Gmail), the reader follows the link and renders the image. The image can also be embedded directly in the message as a MIME part. Either way, the pitch arrives as pixels rather than as text, so content analysis of image spam is particularly difficult.
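To make the two delivery routes concrete, here is a minimal sketch in Python using only the standard library. It walks a parsed message, collects remote image links from the HTML part, and counts image parts embedded in the message itself. The names find_images, ImgSrcCollector, and SAMPLE are made up for illustration, and the sample message is invented. Note that this only finds the images; deciding whether they are spam is the hard part.

```python
from email import message_from_string
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_images(msg):
    """Return (remote_urls, embedded_count) for a parsed email message.

    Hypothetical helper: remote_urls are http(s) image links found in HTML
    parts; embedded_count is the number of image/* MIME parts.
    """
    remote, embedded = [], 0
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True) or b""
            charset = part.get_content_charset() or "utf-8"
            collector = ImgSrcCollector()
            collector.feed(payload.decode(charset, errors="replace"))
            remote += [
                src for src in collector.sources
                if src.lower().startswith(("http://", "https://"))
            ]
        elif part.get_content_maintype() == "image":
            embedded += 1
    return remote, embedded


# Invented sample: one remote image link and one embedded image part.
SAMPLE = """\
From: sender@example.com
To: you@example.com
Subject: demo
MIME-Version: 1.0
Content-Type: multipart/related; boundary="XYZ"

--XYZ
Content-Type: text/html; charset="utf-8"

<html><body><img src="http://example.com/offer.gif"><img src="cid:logo"></body></html>
--XYZ
Content-Type: image/gif
Content-Transfer-Encoding: base64
Content-ID: <logo>

R0lGODlhAQABAAAAACw=
--XYZ--
"""

remote_urls, embedded_count = find_images(message_from_string(SAMPLE))
```

The cid: reference in the sample points at the embedded part rather than at a remote server, which is why it is counted as embedded and not as a link.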

I remember early work in image content analysis. A vendor had developed an application to identify and do policy-based filtering of images embedded in the message body. During a demo of the beta, the application filtered out a message containing an image of a pig. The application log stated that the image was pornographic -- too much skin showing.
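I never saw how that vendor's filter worked, so the following is only a guess at the general idea, sketched in Python. It applies a commonly cited fixed-threshold RGB rule for skin-colored pixels and flags an image when more than half of its pixels match. The function names and the 0.5 threshold are assumptions for illustration. It shows why a pink pig trips a filter of this kind.

```python
def looks_like_skin(r, g, b):
    """Fixed-threshold RGB rule often used as a crude skin-color test."""
    return (
        r > 95 and g > 40 and b > 20
        and max(r, g, b) - min(r, g, b) > 15
        and abs(r - g) > 15
        and r > g and r > b
    )


def skin_fraction(pixels):
    """Fraction of (r, g, b) pixels that pass the skin-color test."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(looks_like_skin(*p) for p in pixels) / len(pixels)


def naive_flag(pixels, threshold=0.5):
    """Flag an image when its skin fraction exceeds a hypothetical threshold."""
    return skin_fraction(pixels) > threshold


# Invented colors: a pig-pink pixel and a grass-green pixel.
PIG_PINK = (230, 170, 160)
GRASS = (60, 140, 60)

# A mostly pink "image" of a pig standing on a little grass.
pig_photo = [PIG_PINK] * 8 + [GRASS] * 2
flagged = naive_flag(pig_photo)
```

A rule like this looks only at color, not at what the colored region depicts, so it cannot tell a pig from a person.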

I spend time each morning scanning the Internet for news, which is how I came across "New Algorithms from UCSD Improve Automated Image Labeling" today. As the article notes:

Scientists have previously built image labeling and retrieval systems that can figure out the contents of images that do not have captions, but these systems have a variety of drawbacks. Accuracy has been a problem.

The UCSD system uses a clever image indexing technique that allows it to cover larger collections of images at a lower computational cost than was previously possible. While the current version would still choke on the Internet’s vast numbers of public images, there is room for improvement and many potential applications beyond the Internet, including the labeling of images in various private and commercial databases.

UCSD's image labeling work is at an early stage. Professor Nuno Vasconcelos discusses the background and future of this work in this video.

Image labeling can be used for many purposes -- one of them being identification of image spam. By the time UCSD's work is solidified, I suspect that images will no longer be purely visual. An "image" could be some level of imprint across many types of media. Perhaps... this form of image labeling may even be able to distinguish a pig from pornography.