wesley tanaka

Image Spam CAPTCHAs


There's been a spate of articles on various sites complaining about the recent flood of pump-and-dump spam. Some of the articles poke fun at Bill Gates's prediction that the spam problem would be solved by 2006 ("There's a month left!"). Although web spam and email image spam have some traits in common, the thing that I find interesting is that in at least one way, the current email image-spam problem is the inverse of the web spam problem.

Many websites use a captcha on their comment, signup, or blog posting forms to determine whether the thing using the website is a person or a computer. The most common kind is some variant of "gimpy," which presents you with an image of distorted letters that you need to type in. Gimpy is useful to webmasters because humans can read the text in the image whereas spam-bot software has difficulty.

Recent email spam, on the other hand, has the opposite features. Text ("buy this stock now!") is embedded in an image, and the image is sprinkled with various lines, dots, GIF animation layers, or other noise. Image spam is useful for spammers because humans can read the text in the image whereas spam-scanner software has difficulty.

Luckily, the situation is not quite symmetric.[1]

With a gimpy-style captcha, a spam-bot needs to read all of the text in the image to be able to input it correctly. Although spam scanners currently try to read the text in images,[2] they actually have an easier problem to solve: "is this image a spam image?" Answering that probably does not require reading all of the text in the image.

In my own inbox, for example, there's probably already a nice correlation between an email containing a single embedded image and that email being spam, and that test would barely require opening the file. But one could imagine looking at features like these (at least for the current generation of email spam):

  • Is the color histogram of the image "spiky" or "smooth"? (Spiky suggests text; smooth suggests a photo.)
  • Does OCR extract at least one dictionary word from the file? (If so, there is text somewhere in the image.)
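
As a rough sketch of how those two features might be computed (this is my illustration, not anything a real spam scanner is known to do), assume the image has already been decoded into a flat list of pixel values and that some OCR step has already produced a string. The function names, the `top_k` and `min_len` parameters, and the 0.8 threshold are all made up for the example.

```python
import re
from collections import Counter


def histogram_spikiness(pixels, top_k=4):
    """Fraction of pixels that fall in the top_k most common values.

    Values near 1.0 mean a few flat colours dominate (text-like);
    low values mean the colours are spread out (photo-like).
    """
    if not pixels:
        return 0.0
    counts = Counter(pixels)
    top = sum(n for _, n in counts.most_common(top_k))
    return top / len(pixels)


def has_dictionary_word(ocr_text, dictionary, min_len=3):
    """True if any alphabetic token in the OCR output is a known word."""
    tokens = re.findall(r"[a-z]+", ocr_text.lower())
    return any(len(t) >= min_len and t in dictionary for t in tokens)


def looks_like_image_spam(pixels, ocr_text, dictionary, spiky_threshold=0.8):
    """Combine both features; the threshold is an arbitrary guess."""
    return (histogram_spikiness(pixels) >= spiky_threshold
            and has_dictionary_word(ocr_text, dictionary))


# Toy inputs: a mostly white image with some black text pixels,
# and a "photo" whose grey levels are spread evenly.
text_like = [255] * 900 + [0] * 100
photo_like = list(range(256)) * 4
words = {"buy", "stock"}

spam_guess = looks_like_image_spam(text_like, "bvy th1s stock n0w!", words)
photo_guess = looks_like_image_spam(photo_like, "bvy th1s stock n0w!", words)
```

Note that the OCR output only has to contain one recognizable word ("stock" here) for the second test to pass, which is the asymmetry with gimpy: the scanner never needs to read the whole message.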

In any case, all this means is that once image spam is defeated, the spam text will just be layered on top of other people's photos downloaded off the web. Or spammers will start sending audio or video files.[3] Or something else.

Update (2006 Dec 16): I thought that this was the first news of exciting spam-via-video, but it turns out to be just an account of "private message" spam on YouTube.

----------

[1] For if it were, any advancements in captchas for blocking web spam would benefit email spammers, and any advancements in detecting email image spam would benefit web comment spammers.

[2] To feed it into the Bayesian filter, a method already shown to work well for text, and probably the source of Bill Gates's optimistic predictions.

[3] "Don't stuff beans up your nose."
