Google's public image disconnect: Smart engineers and dumb algorithms

Google's search technologies struggle to identify original news stories.
Written by Tom Foremski, Contributor

Google looks smart and its people behave smart, but that doesn't mean its algorithms are smart. Machine learning works well when it comes to images, not language. Google's dirty little secret is that its algorithms are quite dumb and have trouble understanding what they see and read.

Take this example of Google recently saying that its search algorithm will be trained to highlight original news stories such as scoops and investigative pieces...

Marc Tracy in The New York Times reports:

"After weeks of reporting, a journalist breaks a story. Moments after it goes online, another media organization posts an imitative article recycling the scoop that often grabs as much web traffic as the original. Publishers have complained about this dynamic for years…"

This has been a problem since Google News launched in September 2002. Finally, the head of Google News, Richard Gingras, has responded:

"An important element of the coverage we want to provide is original reporting, an endeavor which requires significant time, effort, and resources by the publisher. Some stories can also be both critically important in the impact they can have on our world and difficult to put together, requiring reporters to engage in deep investigative pursuits to dig up facts and sources."

Foremski's Take: 

Why has it taken Google more than 17 years to deal with this? Why does Google's algorithm need thousands of "raters" to help train it to recognize original news?

Gingras said that Google has updated the manual that tells its more than 10,000 outside contractors, who work as "raters," how to identify original stories and how to classify them. That information will be used by software engineers to make changes to the search algorithm.

Many of those raters are outside of the US. Google wants them to understand how news stories are created and what makes one story more original than another. It also wants them to fill out a large online form, guided by a 168-page document that describes hundreds of content-quality characteristics. And they are given just a few minutes per task.

Gingras claims it is difficult to identify original news stories, which is true if you are trying to teach a machine. But anyone who looks at several news stories can tell very quickly who broke the story and which ones have no new information.

It's an example of the disconcerting fact that Google's algorithms are not that smart and remain inadequate despite decades of machine learning.

Websites have to mark up their pages with special tags that tell Google how to index the content of the site: what's an ad, what is the main content, which links not to follow, what's spam, etc. In addition, the Googlebot has terrible problems understanding the quality of the content.
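To make that concrete, here is a minimal sketch of the kind of hints a page can declare for a crawler. It uses only Python's standard library to read two real conventions: the `<meta name="robots">` tag and the `rel` attribute on links (`nofollow`, `sponsored`). The sample page, the `CrawlerHintParser` class, and the `extract_hints` function are hypothetical names made up for illustration; this is not how the Googlebot itself is implemented.

```python
from html.parser import HTMLParser


class CrawlerHintParser(HTMLParser):
    """Collects the crawler hints that a page declares in its own markup."""

    def __init__(self):
        super().__init__()
        self.robots_directives = []  # values from <meta name="robots" content="...">
        self.link_hints = {}         # href -> list of rel values on that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # Directives are comma-separated, e.g. "index, nofollow".
            self.robots_directives += [
                d.strip().lower() for d in content.split(",") if d.strip()
            ]
        elif tag == "a" and attrs.get("href"):
            # rel holds space-separated tokens, e.g. "sponsored nofollow".
            self.link_hints[attrs["href"]] = (attrs.get("rel") or "").lower().split()


# Hypothetical page: one ordinary link and one link marked as a paid placement.
SAMPLE_PAGE = """
<html><head><meta name="robots" content="index, nofollow"></head>
<body>
<a href="https://example.com/story">story</a>
<a href="https://example.com/ad" rel="sponsored nofollow">ad</a>
</body></html>
"""


def extract_hints(html):
    """Return (robots directives, per-link rel hints) found in the HTML."""
    parser = CrawlerHintParser()
    parser.feed(html)
    return parser.robots_directives, parser.link_hints


directives, links = extract_hints(SAMPLE_PAGE)
print(directives)
print(links)
```

Everything the parser finds was put there by the publisher. None of it says whether the story behind the link is any good, or whether it is original.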

That is why Google search needs help from thousands of raters to perform basic tasks, such as figuring out which news stories are original. Google won't allow the raters to directly edit the search results, because they would be acting as editors, and Google would be legally responsible for its content like a newspaper.

After more than two decades, the search algorithm continues to be a slow learner.

That means we are stuck with Google's poorly performing algorithms. And it means continued problems for democracies, as Google struggles to identify fake news, hate speech, and toxic content.

We never had this problem when people were in charge of publishing news. 
