We like big data because it can give us easily digestible answers to complex questions.
Except when algorithms that synthesize data give us wrong or misleading insights.
In a new study published in the journal Science, researchers analyzing Google Flu Trends (a tool to predict the flu that we've written about before) found that the service overestimated the prevalence of the influenza virus in the U.S. by more than 50 percent during the 2011-2012 and 2012-2013 flu seasons, compared with data from the U.S. Centers for Disease Control and Prevention (CDC). The research also showed that from August 2011 to September 2013, Google Flu Trends said the flu was more prevalent than it actually was in 100 out of the 108 weeks.
How did this happen when Google gets some of its data from the CDC? Because the algorithm also relies on Google search terms related to the flu.
As Google explains: "We have found a close relationship between how many people search for flu-related topics and how many people actually have flu symptoms. Of course, not every person who searches for 'flu' is actually sick, but a pattern emerges when all the flu-related search queries are added together."
The main issue with this, as Smithsonian points out, is that "Google hasn't taken into account the uptick in flu-related queries that occur as a result of the media-driven flu hysteria that occurs every winter."
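To see why that matters, here is a minimal sketch of the general idea, not Google's actual model. It assumes a simple straight-line relationship between search volume and flu prevalence, and every number in it is invented for illustration. The line is fitted on weeks when only sick people search, then applied to weeks when news coverage adds extra searches.

```python
# Toy illustration only: all figures are invented, and the straight-line
# model is an assumption, not Google Flu Trends' real algorithm.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# "Training" weeks: flu-related searches vs. percent of doctor visits for flu.
train_queries = [100, 200, 300, 400]
train_flu_pct = [1.0, 2.0, 3.0, 4.0]
slope, intercept = fit_line(train_queries, train_flu_pct)

# Later weeks: the same kind of searches from sick people, plus extra
# searches from healthy people reacting to news coverage.
actual_flu_pct = [2.0, 3.0, 4.0]
sick_queries = [200, 300, 400]
media_queries = [120, 150, 220]
observed_queries = [s + m for s, m in zip(sick_queries, media_queries)]

# The fitted line cannot tell the two kinds of searches apart.
estimates = [slope * q + intercept for q in observed_queries]

overestimated_weeks = sum(e > a for e, a in zip(estimates, actual_flu_pct))
mean_overestimate_pct = 100 * (sum(estimates) / sum(actual_flu_pct) - 1)

print(f"Weeks overestimated: {overestimated_weeks} of {len(estimates)}")
print(f"Average overestimate: {mean_overestimate_pct:.0f} percent")
```

In this toy setup, every extra search prompted by a headline gets counted as if it came from someone who is sick, so the estimate runs high in every week where coverage spikes.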
But the larger issue the study points to is that we don't always know the details about the data being tracked by companies like Google.
"Many sources of 'big data' come from private companies, who, just like Google, are constantly changing their service in accordance with their business model," said Ryan Kennedy, one of the study's co-authors, from University of Houston Kennedy, in a press release. "We need a better understanding of how this affects the data they produce; otherwise we run the risk of drawing incorrect conclusions and adopting improper policies."
And even when we do know the data being measured, big data can be misleading. That's because it doesn't always take into account the proper context, as we've pointed out before.
Evan Selinger, a technology ethicist at Rochester Institute of Technology, puts the issue with big data this way, in an interview with New Scientist: "Algorithmic accountability is one of the biggest problems of our time. More and more decisions made about us are computed in processes we don't have access to."