Lies, damned lies and big data: How firms get analytics wrong – and how to get it right

Summary: No one imagines big-data analytics is plain sailing, but the extent of the problems involved in implementing the technology may be wider than people think.

Fractal Analytics CEO Velamakanni: This is the dirty secret of the analytics world — that there are so many errors. Image: Fractal Analytics

Petty company politics, bad data and inept analysis are standing between many organisations and any hope of a big-data utopia, where analytics routinely improve the business.

In companies where internal politicking is rife, people will deliberately bend analytics so the figures back up the course of action they support, warns Srikanth Velamakanni, founder and CEO of Fractal Analytics.

Even where there's no bias from vested interests, it's common to find errors caused by poor data or flawed analysis, he said.

"If you don't do analytics in the right manner, you can come up with some very wrong conclusions. I've seen so many examples — tons and tons of examples where companies make those mistakes," Velamakanni said.

He cited a case from a few years ago in which his firm tried to build a predictive churn model for a very large telecoms operator using inadequate data.

"We were trying to predict who was likely to cancel their subscriber line and what we could do using profitability, risk and lifetime value to retain them proactively," Velamakanni said.

Some of the models produced initially seemed extremely promising.

"They were highly predictive. They were so predictive that it was suspicious," he said.

At the time telecoms operators charged a small deposit for handsets and other equipment.

"One of the variables that was highly predictive was that customers who did not have a deposit were likely to leave. It was a very strong predictor of attrition," Velamakanni said.

"This seemed too good to be true. So we investigated and realised this company had a single field that said whether or not there's a deposit. What would happen is that if a customer left, the deposit would go out of view. So it was an after-the-fact variable.

"If you left, your deposit would go out of view but looking at the data you're not seeing that. You thought if they don't have a deposit, they are likely to leave. It was just a question of what was cause and what was effect, and in data you can't tell unless you get down to the detail."
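The after-the-fact deposit field is a textbook case of target leakage: a feature that is written *because* of the outcome will look like a perfect predictor of it. A minimal sketch of the pattern, with a simulated subscriber base and invented field names (nothing here comes from the actual engagement):

```python
import random

random.seed(0)

# Simulate 1,000 subscribers. Every customer paid a deposit at sign-up.
customers = []
for _ in range(1000):
    churned = random.random() < 0.2  # ~20% attrition, driven by chance here
    # After-the-fact artifact: when a customer leaves, the deposit is
    # refunded, so the single "has_deposit" field is cleared.
    has_deposit = not churned
    customers.append({"churned": churned, "has_deposit": has_deposit})

# A "model" that predicts churn whenever the deposit field is empty
# looks flawless -- it is reading the outcome, not predicting it.
correct = sum((not c["has_deposit"]) == c["churned"] for c in customers)
accuracy = correct / len(customers)
print(f"accuracy of the leaky rule: {accuracy:.0%}")  # 100%
```

The tell, as Velamakanni notes, is that the result is suspiciously good; any feature whose value can change after the outcome occurs deserves exactly this kind of investigation.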

Along with failures to examine the data closely, problems with the data itself rank high among the factors that can derail any big-data initiative.

"You're really handling large amounts of data with lots of messiness in it. There can be lots of missing values and all kinds of issues with what appears to be conflicting data which, when you get into it, you realise there's a lot of mess," Velamakanni said.

However, the one big danger he feels exists with analytics in general is the old adage about lies, damned lies and statistics.

"You can use analytics to prove a certain point and yet it could be a very faulty way of coming up with the analysis. This happens often and especially in very political organisations," he said.

Some clients are aware of the issue, and one told Velamakanni that he didn't want to 'democratise' analytics inside the company because it would be used by staff to fight political battles and justify conclusions that they thought were right.

Velamakanni believes the hurdles of internal bias and flawed interpretation can be overcome through the use of strong, standardised processes across an organisation and plenty of automation to clean out data errors.

"In some sense an audit trail of sorts is required so that errors can be detected and minimised. This is the dirty secret of the analytics world — that there are so many errors," he said.

"Many companies that create an analytics team and just start doing stuff, they make so many errors they don't even realise it, and that's why it's critical to create a strong process and make it error-free to deliver the right results."
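One hedged sketch of what such a process might look like in practice: automated checks that run over every incoming row and log each failure, so errors are detected rather than silently absorbed into the analysis. The field names and checks below are invented for illustration:

```python
import csv
import io
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")
log = logging.getLogger("audit")

# Hypothetical data-quality rules; a real pipeline would load these
# from a shared, organisation-wide definition.
CHECKS = [
    ("missing_tenure", lambda row: row["tenure_months"] != ""),
    ("non_negative_spend", lambda row: float(row["monthly_spend"] or 0) >= 0),
]

def audit(rows):
    """Run every check on every row, log failures, return the error count."""
    errors = 0
    for i, row in enumerate(rows):
        for name, check in CHECKS:
            if not check(row):
                log.info("row %d failed check %s: %r", i, name, row)
                errors += 1
    return errors

sample = io.StringIO("tenure_months,monthly_spend\n12,49.5\n,30\n8,-5\n")
print(audit(list(csv.DictReader(sample))))  # 2
```

The log output is the audit trail Velamakanni describes: a record of which rows failed which rules, available to anyone reviewing the analysis later.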

Velamakanni rejects the idea that analytics should be kept in small specialised teams.

"The overall adoption of analytics is more critical than this challenge of people interpreting analytics in their own manner. In the beginning it will happen — there will be some instances of this," he said.

"But eventually the only way that adoption of analytics will grow and companies will get smarter through the use of analytics is if it is democratised."

  • Big Problem

    The biggest problem in my opinion is that the data is coming from groups who had specific needs and uses for the original data. The data collected may have been perfect for the original needs but is now being used in a quite different manner, trying to answer a different question than was originally asked. Thus the data may not correctly answer the question for a number of reasons, and it cannot be recreated to provide answers.
    • Absolutely True!

      And the author of the article did mention that, but only rather obliquely. The telecom example was an example of this, though I don't think he explained it very well.

      What is scary though, is that 'analytics' and "big data" people seem to have found some excuse for ignoring the old database adage "garbage in, garbage out": they have methods they CLAIM can compensate for the deficiencies of the data, but they don't really know that those methods work. They are guessing.

      Finally, about the other adage the author quotes, "lies, damned lies and statistics". That was attributed to Disraeli, though it is doubtful he said it. But it was from around his time.

      The point is that back then, statistics really was in its infancy. With the groundbreaking work of Fisher and Neyman-Pearson, we now have a MUCH better idea of how to use and do statistics.
      • Statistics

        I've always taken the quote to reference the easy manipulation of data to back up your own biases/claims. That is as true today as it has ever been. All you need to do is ask the right (wrong?) questions in certain ways to make sure you always get answers you like.
  • Analytics is mostly NOT Big Data

    This article makes it sound like all analytics is big data. In fact, analytics is a PROCESS that is and should be performed on all sorts of data. I realize it sounds much sexier to say "Big Data" ("ohhh, that new stuff, I must have it!!"). But most operational analytics, like the example above, is done on regular old structured data. Big data by nature is asking questions of unstructured data, or really defining the questions based on unstructured data. It's all that data you can't really capture due to time and capacity issues, but that you want to get into and see what's there.

    So it sells to say "Big Data Analytics", but most companies don't really do Big Data on any scale yet, and most are still struggling to figure out analytics across the board!
  • Where things go wrong...

    I completely agree with the essence of this post that analytical results can be manipulated to please the consumer. It is bound to happen as the consumer brings the moolah to the table when it is a question of a third party analytical solutions provider enabling the consumer to justify a decision.

    However, the prime example of how dirty data can actually give spurious results is actually more of a recurring situation than one would believe, especially if it is big data. This can be overcome by performing extensive audits on the data before even thinking of using it; but therein lies the flaw again: how do we know what to look for in such huge datasets? Sampling might not be truly reflective of data characteristics, and before correcting for anomalies in the data it is essential to understand what caused the anomaly in the first place.

    Secondly, most analytics is based on multiple assumptions. Most analytical providers fail to test assumptions explicitly for multiple reasons: crunched timelines, focus on the primary solution rather than secondary validation, and so on. Processes which seek to minimize such instances might also collapse due to incomplete or unimaginative thinking.

    And that is why analytics is an art as well as a science.
  • Big Data, big bad data

    All the hyped big data DBMSs (Hadoop, MongoDB, Cassandra and the like) have absolutely zero facilities for constraint checking.

    Therefore the quality of data you will find in such systems is inevitably extremely poor.

    Businesses would be much better served by an RDBMS implementation that did declarative constraints properly.
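As a rough illustration of the commenter's point, a declarative constraint lets the database reject bad rows at write time instead of letting them pollute downstream analytics. SQLite (via Python's standard library) stands in here for a full RDBMS; the table and columns are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# NOT NULL and CHECK are declarative constraints: the engine enforces
# them on every insert, with no application code involved.
con.execute("""
    CREATE TABLE subscriber (
        id            INTEGER PRIMARY KEY,
        tenure_months INTEGER NOT NULL CHECK (tenure_months >= 0),
        monthly_spend REAL    NOT NULL CHECK (monthly_spend >= 0)
    )
""")

con.execute("INSERT INTO subscriber VALUES (1, 12, 49.5)")    # accepted
try:
    con.execute("INSERT INTO subscriber VALUES (2, -3, 20)")  # rejected
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Schema-less stores leave this kind of validation entirely to the application, which is one reason the data that accumulates in them can be so messy.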