Why big data evangelists need to be reprogrammed

Big data is a dangerous, faith-based ideology. It's fuelled by hubris, it's ignorant of history, and it's trashing decades of progress in social justice.
Written by Stilgherrian, Contributor

The last time I wrote about big data, in July, I called it a big, distracting bubble. But it's worse than that. Big data is an ideology. A religion. One of its most important gospels is, of course, at Wired.

In 2008, Chris Anderson talked up a thing called "The Petabyte Age" in "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete".

"The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all," he wrote.

Declaring the scientific method dead after 2,700 years is quite a claim. Hubris, even. But, Anderson wrote, "There's no reason to cling to our old ways." Oh, OK then.

Now, this isn't the first set of claims that correlation would supersede causation, and that the next iteration of computing practices would "make everything different".

"Japan's Fifth Generation Project of the early 1980s generated similar enthusiasm, and many believed it would make Japan dominant in computing within a decade, based on parallel processing and an earlier iteration of 'massive' databases. Now, obviously that didn't happen, and it was an expensive and embarrassing failure," said Graham Greenleaf, professor of Law and Information Systems at the University of New South Wales, on Tuesday night.

Greenleaf was speaking at the launch of the latest UNSW Law Journal, to be posted on its website early next week, which includes a theme section on "Communications Surveillance, Big Data, and the Law". Greenleaf described that section as "pessimistic".

Privacy issues are obviously a concern. As I've said before, privacy fears could burst the second dot-com bubble. But the journal articles also cover issues of discrimination, automated decision making, democracy, and the public's right to access information.

As just one example, Mark Burdon and Paul Harpur discuss what they call the "talent analytics" used in the employment context. There's now an intense scrutiny of the actions and habits of employees and potential employees, in the hope that statistical analysis will reveal those who have desired workplace traits. Factors such as choice of web browser, or when and where they eat lunch, could affect their chances.

This process runs up against anti-discrimination laws in countries like Australia, where employers can't base their decisions on attributes such as race, sex, disability, age, and marital status.

"[Burdon and Harpur] argue that it's almost impossible for these laws to be applied when the decisions are made on the basis of talent analytics, because it's usually almost impossible for either data users (employers), or data subjects, to know even what data is being used to make decisions," Greenleaf said.

"This is very important if we're to preserve the hard-won social policies represented by anti-discrimination laws, and prevent the hidden heuristics and emerging employment practices starting to mean that 'data is destiny'."

Big data's approach of collecting as much data as you can, even if it seems irrelevant, because it may reveal a previously unknown correlation, also collides with the "data minimisation" principles of data privacy laws, which say that you only collect the data you need to do the job.

Writers elsewhere have explored the big data world view and found it lacking.

In their paper "Critical questions for big data", danah boyd and Kate Crawford describe the core mythology of big data as "the widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy".

"Too often, big data enables the practice of apophenia: Seeing patterns where none actually exist, simply because enormous quantities of data can offer connections that radiate in all directions. In one notable example, Leinweber (2007) demonstrated that data mining techniques could show a strong but spurious correlation between the changes in the S&P 500 stock index and butter production in Bangladesh," they wrote.
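The mechanism is easy to reproduce. As a minimal sketch (mine, not Leinweber's method or data), the Python below generates one random "index" series, then trawls through a pile of equally random candidate series looking for the one that correlates best with it. Every name and number here is made up for illustration. Nothing in the data is related to anything else, yet the more candidates you search, the stronger the best "finding" looks.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def best_spurious_match(n_candidates, n_points=12, seed=0):
    """Strongest |correlation| between a random target series and
    n_candidates unrelated random series (all values hypothetical)."""
    rng = random.Random(seed)
    # Stand-in for something like yearly index changes: pure noise.
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = 0.0
    for _ in range(n_candidates):
        # Stand-in for any unrelated statistic: also pure noise.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, abs(pearson(target, candidate)))
    return best


if __name__ == "__main__":
    for n in (10, 100, 10_000):
        print(n, "candidates searched, best |r| =",
              round(best_spurious_match(n), 3))
```

Because the same seed is reused, each larger search includes the smaller one, so the best correlation can only go up as you add candidates. That's the point: more data to trawl means more chances to be fooled.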

In her paper "The Surveillance-Innovation Complex: The Irony of the Participatory Turn", Julie Cohen noted that surveillance has become increasingly privatised, commercialised, and participatory. On that reading, surveillance is no longer treated as something to fear and regulate. Instead, the big data ideology turns it into a source of innovation. Even gamification is deployed as a psychological strategy to induce people to hand over more data.

Greenleaf said that every aspect of the life cycle of personal data now presents more dangers to privacy than it did a few decades ago, but the picture isn't all pessimistic.

Over the last four decades, more countries have adopted data protection laws, and more of those laws are including measures similar to the 1995 European Union Data Protection Directive rather than the 1980 OECD Privacy Guidelines. "We are proceeding toward global ubiquity of data privacy laws," Greenleaf said — although privacy standards in other countries don't matter much if personal data can be liberated to the US safe harbour.

Against the increasingly "Europeanised" data privacy laws, the US is the laggard. Greenleaf compares this with the situation a century ago, when the US was the pirates' harbour of the copyright world, while "international standard" copyright laws were being adopted everywhere else. The US did change, after 102 years, when it finally joined the Berne Convention in 1988. Now it's the copyright giant.

"The prevailing US model of an internet where 'the user is the product' is not necessarily permanent. However, to stop it becoming so, it'll take either a second internet bubble to burst, or a concerted effort by the rest of the world to reject privacy-invasive business practices, or both. Neither's impossible, but neither is likely to occur rapidly," Greenleaf said.

Personally, I think the two processes will reinforce each other. As I've said before, so much of the valuation of internet companies is based on the perceived business value of their holdings of personal data. If the collection and use of that data is restricted, its value plummets, and the bubble bursts sooner.

Has anyone got a pin?
