Bug, joke, secret code - or just spam clogging up the web?

Summary: Google Translate's lorem ipsum poetry shows the problems of big-data correlation

When researchers at security firm FireEye found some strange results in Google Translate, they wondered whether it was a sophisticated system that spies or activists were using to communicate secretly, or whether someone had managed to game Google. The results of translating certain phrases were so striking and unusual that they looked more like haiku, except that they were full of words like China, NATO and The Company. Security writer Brian Krebs documented the translations before Google turned them off.

Here's my take - the mysterious translations aren't a code being used to communicate in secret, or a hacker prank on Google. They're what happens when automated big data meets spammers, and the way they mess up the assumptions we make about the web and other forms of human communication.

The strange results used to crop up when you tried to translate the words lorem ipsum, or other phrases from the standard dummy copy that designers have used since the sixteenth century to show what a layout will look like without having to write the actual words. It's a deliberately mangled passage of Cicero (the first word should actually be dolorem, meaning sorrow, pain or distress) and it runs on for longer than those two words; if it's not long enough, you just repeat it. Or you copy and paste just enough extra words to fill the space you have to fill, or you mix in a couple of jokes (phrases from Bacon Ipsum, the Samuel L Ipsum generator or any of the modern variants), or you just delete a letter here and there. So you get subtle variations between different passages of lorem ipsum, but any human would recognise them as placeholder text.

To Google's machine learning translation algorithms, though, they're all just source data. The way Google Translate works is that it learns how one language translates to another based on how those languages have already been translated by humans who have put those translations on the web. Peter Norvig, Google's director of research, described the fundamentals at the Emerging Technology conference back in 2008, explaining that Google started by collecting "parallel texts" from web sites like hotel information sites and news sites that had similar pages in two or more languages. They assumed that the content on two similar pages in two different languages was the same, just translated - so the description of a hotel in English, French and Italian would say much the same thing, just in different languages.

The system uses sophisticated statistical models to deal with things like the way word order changes between languages, the way symbols in Chinese mean one thing on their own and another when you combine them, and the other complexities of language.

But at the most basic level, Google Translate relies on comparable source documents to learn how words have been translated before. The idea is to get language tools "from data rather than by the sweat and tears of linguists," as Norvig put it. (He also noted that this wasn't the way to get a perfect translation for poetry and fiction where you want the experience of the language, not just to extract the information.)
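The parallel-text idea can be sketched in a few lines of code. This is a toy illustration only, with an invented two-sentence "corpus" - real systems estimate word alignments with iterative statistical models (such as the IBM alignment models), not raw co-occurrence counts - but it shows the basic intuition: a target-language word that keeps turning up in sentences paired with a given source word becomes its likely translation.

```python
from collections import Counter

# Hypothetical "parallel text": English/French sentence pairs of the
# hotel-description kind Norvig described. Invented for illustration.
parallel = [
    ("the hotel has a pool", "l'hotel a une piscine"),
    ("the hotel is quiet", "l'hotel est calme"),
    ("the pool is heated", "la piscine est chauffee"),
]

def translation_candidates(source_word):
    """Count target words that co-occur with source_word across pairs."""
    counts = Counter()
    for en, fr in parallel:
        if source_word in en.split():
            counts.update(fr.split())
    return counts.most_common(3)

print(translation_candidates("pool"))
# "piscine" appears in every pair that contains "pool", so it rises
# to the top of the candidate list.
```

With enough genuinely parallel pages, the signal dominates the noise; with millions of lorem ipsum pages that merely look similar, the same counting machinery happily learns nonsense pairings.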

And the assumption is that the documents are created by people who know the language they're writing in and use it correctly, to convey actual information.

But that's not always true.

At one point, the Google Translate API was available free to developers, but it was pulled almost overnight when spammers started using it to translate the nonsense sentences they were padding emails with (in an attempt to beat spam-detection rules that blocked short messages without any complex sentences in them).

Those nonsense words were tainting the machine learning algorithm.

The same thing accidentally happened with the phrases in lorem ipsum documents, because there are millions of examples but very few actual translations of them; instead, the placeholder text will get matched up with documents that just look similar to the algorithm but aren't actually connected.

That would explain why you got different translations if you capitalised the words differently or duplicated them, resulting in translations like China, the Internet, NATO, the Company, China's Internet, Business on the Internet, Home Business, Russia might be suffering, he is a smart consumer, the main focus of China, department and exam. Those are all common phrases - and you might recognise some of them from spammy web sites promising thousands of dollars for working from home or offering you answers to exam questions.

There might be a few activists posting the crib sheet for how they use the phrases of lorem ipsum to pass messages about China or other controversial topics. There might be a handful of hackers offering fake crowd sourced translations of lorem ipsum to prank Google. But those would be outweighed by the millions of documents using lorem ipsum placeholder text that Google Translate will have found on the web (and in Google Drive) that it's been busy matching up to other, completely unrelated, documents.

Google has fixed the bug so that lorem ipsum translates as lorem ipsum (the way it already does in Bing Translate), so the accidental poetry and the conspiracy theories are both history. But the underlying problem remains: correlation in a large data set can be meaningless or misleading.

Drownings and ice cream sales both go up at the same time: in summer, when we're more likely to go to the beach. Indeed, in the 1980s in New York, serious crime and ice cream sales rose at similar rates (perhaps because when you're at the beach, it's easier for someone to break into your home).

Global warming is not the reason there are fewer pirates, tempting as the graph makes that conclusion.

Fatalities on US highways between 1996 and 2000 fell at the same rate as lemon imports from Mexico rose (and not because lemons have anything to do with safe driving). The number of Nobel prizes won by a country correlates well with how much chocolate the population of that country eats; and so do the number of Nobel laureates and road fatalities, as do chocolate consumption figures plotted against serial killers. That doesn't mean chocolate makes you smart, or that it turns you into a crazed maniac, or that Nobel winners are terrible drivers.

We've known for a while that data mining can be misleading. If your data set only covers 1983 to 1993, you'll find that the annual closing price of the S&P 500 perfectly matches the combination of butter production and sheep population in the US and Bangladesh. That's a deliberately bogus counter-example created back in 1995 to show that statistical regression models should be read with the warning 'future results may not match past performance'. It's enough to make you apply Twyman’s law, which says "Any figure that looks interesting or different is usually wrong".
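The trap behind all these examples is easy to reproduce: any two quantities that both trend over time will correlate strongly, causation or not. Here is a minimal sketch with invented data (the numbers are made up purely for illustration; they are not the real lemon-import or fatality figures):

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(42)
years = range(1996, 2001)
# Two made-up series that both happen to rise year on year:
lemon_imports = [230 + 25 * i + random.uniform(-5, 5) for i, _ in enumerate(years)]
phone_sales = [10 + 8 * i + random.uniform(-2, 2) for i, _ in enumerate(years)]

print(f"r = {pearson(lemon_imports, phone_sales):.3f}")
# r comes out close to 1.0 despite there being no causal link:
# the shared upward trend does all the work.
```

Five data points and a shared trend are all it takes; the more series you mine, the more of these coincidences you are guaranteed to find.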

When we build models with big data, we're going to get oddities like these. That doesn't mean big data and machine learning aren't useful; it means we need to design systems carefully, add in lots of heuristics (those common sense rules of thumb that people know and machines don't), filter out spam and fake content that distort the system and watch out for bizarre results.

In fact, if you're worried about the future, it's quite reassuring. Algorithms aren't going to put data scientists out of a job, and the more useful AI gets, the more it will need people to interpret and sanity-check the results.

About Mary Branscombe

Mary Branscombe is a freelance tech journalist. Mary has been a technology writer for nearly two decades, covering everything from early versions of Windows and Office to the first smartphones, the arrival of the web and most things in between.

Talkback

  • Great article

    And it clearly points out the problem with using statistics and probabilities and data correlation to "prove" or "translate" something. And so much "research" now is simply gathering statistics, applying a "rule" or hypothesis, and pushing the button. The translator was showing the same thing we are seeing way too much of now.

    Health information is rife with it. That is why we get the "coffee is bad" headline one day, followed by "coffee is the fountain of youth" the next day. Change the datasets, change the rule, push the button, spit out the results. Now you'd think in REAL science they would go back and then do an actual study to confirm those numbers. But they get out and the Internet and mass media (which are getting closer and closer to being the same) preaches it like gospel. Bet if you tried you could find statistic sets that would show cigarettes are the path to longevity. Fortunately there's enough real physical results that exist to disprove that one. But for a lot of this, not so. No wonder we live in a world where people would trust a late-night TV pitchman over a real doctor, and nothing can dissuade conspiracy theorists and just plain paranoid - who now have a platform to spread it and "statistics" to back them up.

    When will the "big data" crowd figure out that just sucking in information and interpreting it is NOT research and understanding. The way it is now it is like that translator - it took what it had and did what it knew, and the result was garbage. But people ran with it. Should have known better.
    jwspicer
  • Real Science and Junk Linguistics

    "Now you'd think in REAL science they would go back and then do an actual study.." In real science they DO exactly that, but the mass media jumps on the initial study saying "there seems to be a correlation between X and Y but we need further study" and turns it into "SCIENCE PROVES X CAUSES Y!" Or more to the point, non-scientific members of the public pass it around so that it stays in the mass media. Then when the "remember that correlation between X and Y? Turns out it was bogus just as we suspected" article comes out, the over-simplification of the media jumps on that, making science itself look stupid.

    And if someone has a vested financial and/or political interest (which often overlaps) in one direction or the other, then whichever initial correlation fits their agenda becomes one of their dogmas, which must not be questioned.

    Remember (I'm sure Americans will) the correlation between multiples of '20 and deaths or assassination of US Presidents who were elected in those years? 1840: William Henry Harrison, gave himself pneumonia making his inaugural address and died in a month; 1860: Abe Lincoln, assassinated shortly after his 1865 second inauguration; 1880: James Garfield, assassinated after a month or two in office; 1900: William McKinley, assassinated; 1920: Warren Harding, died in office; 1940: FDR elected for third term, died of complications of polio after fourth inauguration in 1945; 1960: John F. Kennedy, whose 1963 assassination sparked interest in the "curse of the 20s"; BUT 1980: Ronald Reagan, whose assassination attempt FAILED, and he served two full terms and retired. So the "curse" was broken, and George W. Bush, elected (or was he?) in 2000, also served two full terms and retired.

    Then there is the coincidence of Jonathan Swift's fictional astronomers in "Gulliver's Travels" who stated the existence of two moons of Mars, gave their orbital periods and sizes, and turned out to be uncannily, coincidentally accurate a century later, when telescopes improved to the level of being able to discover them in real life.

    And finally, the apocryphal Cold War story of the Pentagon project to translate between English and Russian in the 1950s with simple dictionary lookup. According to the story, they tested the E-R program with the Bible verse, "The spirit is willing, but the flesh is weak." Then they fed the Russian output of that test into the R-E program and they got, "The vodka is good, but the meat is rotten."
    jallan32
    • garbage in garbage out

      I do know Russian translators who claim to have seen the vodka translation in person ;) My favourite false correlation isn't actually backed up by the facts, but it claims that when the Welsh do well at rugby a pope is likely to die...
      mary.branscombe
  • Google Translate's problem...

    ...is that it is not based on a true AI with a robust knowledge base. For a very simple example, select Spanish-English and enter "buenas dias". It comes back "good days" (instead of the correct "good morning"), even though in English-Spanish if you enter "good morning" you get back "buenas dias".

    If it cannot handle one of the simplest, most basic phrases around, how can it be trusted for anything else?
    nfordzdn