Bug, joke, secret code - or just spam clogging up the web?

Google Translate's lorem ipsum poetry shows the problems of big data correlation
Written by Mary Branscombe, Contributor

When researchers at security firm FireEye found some strange results in Google Translate, they wondered if it was a sophisticated system spies or activists were using to communicate secretly, or if someone had managed to game Google. The results of translating certain phrases were so striking and unusual that they looked more like haiku, only they were full of words like China, NATO and The Company. Security writer Brian Krebs documented the translations before Google turned them off.

Here's my take - the mysterious translations aren't a code being used to communicate in secret, or a hacker prank on Google. They're what happens when automated big data meets spammers, and the way they mess up the assumptions we make about the web and other forms of human communication.

The strange results used to crop up when you tried to translate the words lorem ipsum, or other phrases from the standard dummy copy that designers have used since the sixteenth century to show what text will look like on a page without using real words. It's a deliberately mangled passage of Cicero (the first word should actually be dolorem, meaning sorrow, pain or distress), and it goes on for longer than those two words. If it's not long enough, you repeat it, or you copy and paste just enough extra words to fill the space, or you mix in a couple of jokes (phrases from Bacon Ipsum or the Samuel L Ipsum generator or any of the modern variants), or you just delete a letter here and there. So you get subtle variations between different passages of lorem ipsum, but any human would recognise them all as placeholder text.

To Google's machine learning translation algorithms, though, they're all just source data. Google Translate works by learning how one language translates into another from translations that humans have already put on the web. Peter Norvig, Google's director of research, described the fundamentals at the Emerging Technology conference back in 2008, explaining that Google started by collecting "parallel texts" from web sites like hotel information sites and news sites that had similar pages in two or more languages. They assumed that the content on two similar pages in two different languages was the same, just translated - so the description of a hotel in English, French and Italian would say much the same thing, just in different languages.

The system uses sophisticated statistical models to deal with things like the way word order changes between languages, the way symbols in Chinese mean one thing on their own and another when you combine them, and the other complexities of language.

But at the most basic level, Google Translate relies on comparable source documents to learn how words have been translated before. The idea is to get language tools "from data rather than by the sweat and tears of linguists," as Norvig put it. (He also noted that this wasn't the way to get a perfect translation for poetry and fiction where you want the experience of the language, not just to extract the information.)
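The core statistical idea can be sketched in a few lines. This toy example (with an invented three-sentence "parallel corpus" and a naive co-occurrence count, nothing like Google's actual models) shows how aligned pages alone let a program guess which word translates which, with no linguist in the loop:

```python
from collections import Counter, defaultdict

# Tiny invented parallel corpus of (English, French) sentence pairs,
# standing in for the aligned hotel pages the article describes.
# Real systems use millions of pairs and far richer alignment models.
parallel = [
    ("the hotel has a pool", "l'hotel a une piscine"),
    ("the hotel is small", "l'hotel est petit"),
    ("the pool is small", "la piscine est petite"),
]

# Count how often each English word co-occurs with each French word
# across the aligned sentence pairs.
cooccur = defaultdict(Counter)
for en, fr in parallel:
    for e in en.split():
        for f in fr.split():
            cooccur[e][f] += 1

def guess_translation(word):
    """Return the French word that most often appears alongside `word`."""
    return cooccur[word].most_common(1)[0][0]

print(guess_translation("hotel"))  # "l'hotel" wins on co-occurrence count
print(guess_translation("pool"))   # "piscine"
```

The catch, as the article goes on to explain, is that the counting doesn't care whether the "parallel" pages really are translations of each other - garbage pairs are counted just as faithfully as good ones.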

And the assumption is that the documents are created by people who know the language they're writing in and use it correctly, to convey actual information.

But that's not always true.

At one point, the Google Translate API was available free to developers, but it was removed almost overnight when spammers started using it to translate the nonsense sentences they were padding emails with (in an attempt to beat spam-detection rules that blocked short messages without any complex sentences in them).

Those nonsense words were tainting the machine learning algorithm.

The same thing accidentally happened with the phrases in lorem ipsum documents, because there are millions of examples but very few actual translations of them; instead, the placeholder text will get matched up with documents that just look similar to the algorithm but aren't actually connected.

That would explain why you got different translations if you capitalised the words differently or duplicated them, resulting in translations like China, the Internet, NATO, the Company, China's Internet, Business on the Internet, Home Business, Russia might be suffering, he is a smart consumer, the main focus of China, department and exam. Those are all common phrases - and you might recognise some of them from spammy web sites promising thousands of dollars for working from home or offering you answers to exam questions.

There might be a few activists posting the crib sheet for how they use the phrases of lorem ipsum to pass messages about China or other controversial topics. There might be a handful of hackers offering fake crowd sourced translations of lorem ipsum to prank Google. But those would be outweighed by the millions of documents using lorem ipsum placeholder text that Google Translate will have found on the web (and in Google Drive) that it's been busy matching up to other, completely unrelated, documents.

Google has fixed the bug so that lorem ipsum translates as lorem ipsum (the way it already does in Bing Translate), so the accidental poetry and the conspiracy theories are both history. But the underlying problem remains: correlation in a large data set can be meaningless or misleading.

Drownings and ice cream sales both go up at the same time: in summer, when we're more likely to go to the beach. Indeed, in the 1980s in New York, serious crime and ice cream sales went up at similar rates (perhaps because when you're at the beach, it's easier for someone to break into your home).

Global warming is not the reason there are fewer pirates, tempting as the graph makes that conclusion.

Fatalities on US highways between 1996 and 2000 fell at the same rate as lemon imports from Mexico rose (and not because lemons have anything to do with safe driving). The number of Nobel prizes won by a country correlates well with how much chocolate the population of that country eats; so does the number of Nobel laureates with road fatalities, and chocolate consumption plotted against serial killers. That doesn't mean chocolate makes you smart, or that it turns you into a crazed maniac, or that Nobel winners are terrible drivers.

We've known for a while that data mining can be misleading. If your data set only covers 1983 to 1993, you'll find that the annual closing price of the S&P 500 perfectly matches the combination of butter production and sheep population in the US and Bangladesh. That's a deliberately bogus counter-example created back in 1995 to show that statistical regression models should be read with the warning 'future results may not match past performance'. It's enough to make you apply Twyman’s law, which says "Any figure that looks interesting or different is usually wrong".
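It takes very little to manufacture a correlation like these. In this sketch, two entirely invented series (the numbers are made up for illustration, not the real lemon-import or highway figures) share nothing but an upward trend over the same five years, yet a plain Pearson correlation comes out near a perfect 1.0:

```python
# Two invented series with no causal link, both trending upward
# over the same years - a shared trend is all it takes.
years = list(range(1996, 2001))
lemon_imports = [230, 280, 350, 410, 470]  # made-up figures, rising
safety_index = [52, 55, 60, 64, 69]        # made-up figures, also rising

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson(lemon_imports, safety_index)
print(round(r, 3))  # close to 1.0, despite no causal connection
```

Nothing in the arithmetic distinguishes a real relationship from two lines that happen to slope the same way - that judgment has to come from outside the data.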

When we build models with big data, we're going to get oddities like these. That doesn't mean big data and machine learning aren't useful; it means we need to design systems carefully, add in lots of heuristics (those common sense rules of thumb that people know and machines don't), filter out spam and fake content that distort the system and watch out for bizarre results.
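A filter of the kind that paragraph calls for can be very simple. This is a hypothetical example (the marker list, threshold and function name are all my own, not anything Google uses): flag text as placeholder before it ever reaches the training corpus, by checking what fraction of its words come from the lorem ipsum vocabulary:

```python
# Hypothetical pre-filter: drop placeholder text from a training
# corpus before it can pollute a translation model.
LOREM_MARKERS = {"lorem", "ipsum", "dolor", "consectetur", "adipiscing"}

def looks_like_placeholder(text, threshold=0.3):
    """Flag text where a large share of words are lorem ipsum vocabulary."""
    words = text.lower().split()
    if not words:
        return False
    hits = sum(w.strip(".,") in LOREM_MARKERS for w in words)
    return hits / len(words) >= threshold

print(looks_like_placeholder("Lorem ipsum dolor sit amet, consectetur adipiscing elit"))  # True
print(looks_like_placeholder("The hotel has a pool and free breakfast"))  # False
```

It's a crude heuristic - exactly the kind of common-sense rule of thumb the paragraph above describes, cheap to run and obvious to a human, but something the learning algorithm will never discover on its own.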

In fact, if you're worried about the future, it's quite reassuring. Algorithms aren't going to put data scientists out of a job and the more useful AI gets, the more it will need people to interpret and sanity check the results. 
