When Big Data is Bad Data

Big Data promises to revolutionize many fields, but only if the data are correctly analyzed. A recent study suggests that some 40,000 fMRI papers relied on statistical software that can give invalid results. This points to a larger problem.
Written by Robin Harris, Contributor


You've heard of MRI machines for diagnostic imaging: large, noisy, donut-shaped machines into whose center the subject slides while a powerful magnetic field and radio waves create a picture of their insides.

There are many types of MRI scans. Functional MRI (fMRI) looks at how different parts of the brain respond to stimuli. A common use of fMRI is Alzheimer's neuroimaging. Some 40,000 published papers have used fMRI to delve into the human brain over the last 25 years.

However, until this study, the software packages used to analyze the data had never been validated against real data.

The study

In the paper "Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates," researchers Anders Eklund and Hans Knutsson, of Sweden, and Thomas E. Nichols, of the UK, ran almost three million random group analyses using real -- not simulated -- human data to compute actual false positive rates. They concluded:

. . . the parametric statistical methods are shown to be conservative for voxelwise inference and invalid for clusterwise inference.

The invalid techniques produced false-positive rates of up to 70 percent, where 5 percent was expected. That's bad. How bad?

These results question the validity of some 40,000 fMRI studies and may have a large impact on the interpretation of neuroimaging results.
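The study's core idea can be sketched in miniature (a toy illustration, not the authors' actual pipeline, and with made-up parameters): run many random group analyses on null data, where any "significant" result is by definition a false positive, and count how often the test fires. With well-behaved independent Gaussian noise the rate lands near the nominal 5 percent; the study's point is that on real fMRI noise, clusterwise inference did not.

```python
# Toy sketch of measuring an empirical false-positive rate via random
# group analyses. All numbers here are illustrative, not from the paper.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_analyses, alpha = 40, 2000, 0.05

false_positives = 0
for _ in range(n_analyses):
    # Null data: every "subject" is pure noise, so there is no true
    # group difference -- any detection is a false positive.
    data = rng.standard_normal(n_subjects)
    # Random split into two groups of 20, as in a random group analysis.
    perm = rng.permutation(n_subjects)
    g1, g2 = data[perm[:20]], data[perm[20:]]
    _, p = stats.ttest_ind(g1, g2)
    if p < alpha:
        false_positives += 1

rate = false_positives / n_analyses
print(f"Empirical false-positive rate: {rate:.3f}")
```

Here the data satisfy the test's assumptions, so the printed rate hovers near 0.05. The study ran essentially this procedure with real resting-state scans in place of the synthetic noise, and the clusterwise rates came out far above nominal.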

The Storage Bits take

In a world of Big Data, statistical quality needs to be taken seriously. But statistics are complex, so even highly educated professionals rely on packages whose assumptions they don't understand, trusting that the results are sound.

This study shows that for at least one important area of research the trust in the statistical validity is misplaced. The packages were "validated" with synthetic data, but:

. . . it is obviously very hard to simulate the complex spatiotemporal noise that arises from a living human subject in an MR scanner.

This isn't only a problem in brain research. In data storage, for example, long-asserted RAID array data-loss rates assumed that drive failures were independent.

It took more than a decade for research to find this wasn't true. Of course, during that decade, vendors sold billions of dollars worth of underperforming RAID arrays, while collecting the service data that should have shown them the truth.
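Why the independence assumption matters can be shown with a back-of-the-envelope sketch (hypothetical numbers, not vendor data): a RAID-5 array loses data if a second drive fails during the rebuild after a first failure. If failures are correlated -- same batch, same vibration, same heat -- the per-drive hazard during that window is higher than the independent model predicts, and so is the array loss rate.

```python
# Illustrative RAID-5 second-failure model. The failure probabilities
# and the 5x correlated-hazard multiplier are assumptions for the sketch.
def p_second_failure(n_drives: int, p_drive: float) -> float:
    """Probability that at least one surviving drive fails during
    the rebuild window, given per-drive failure probability p_drive."""
    survivors = n_drives - 1  # one drive already failed and is rebuilding
    return 1 - (1 - p_drive) ** survivors

p_window = 0.001  # assumed per-drive failure probability over one rebuild

independent = p_second_failure(8, p_window)
correlated = p_second_failure(8, 5 * p_window)  # assumed 5x hazard bump

print(f"independent: {independent:.4f}")  # ~0.0070
print(f"correlated:  {correlated:.4f}")   # ~0.0345
```

Under these toy numbers, correlated failures make a rebuild-window loss roughly five times as likely as the independent model predicts -- the kind of gap that field service data, had it been analyzed, would have exposed.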

But as with the irresponsible lending practices that led to the Great Recession, the companies profiting from the invalid assumptions didn't want to spoil the party. As Will Rogers said, "It isn't what we don't know that gives us trouble, it's what we know that ain't so."

Even more true with Big Data.
