Hadoop's rise: Why you don't need petabytes for a big data opening

People are often hung up on the volume aspect of big data but other factors can be just as telling in the issues they raise for business.
Written by Toby Wolpe, Contributor

Big data isn't necessarily big and can be as much about the complexities of processing information as about volumes or data types.

Personal genetic-profiling services such as 23andMe, which charges $99 to sequence an individual's genome, illustrate the point, according to Forrester principal analyst Mike Gualtieri.

The resulting data from one individual's sequenced DNA is only about 800MB, he told an audience at last week's Hadoop Summit in Amsterdam.

"That's not a lot. Would you call that big data? If I said 800MB is big data, I'd get laughed out of the room," Gualtieri said.

"But within that, there are four billion pieces of information and lots of patterns. So it's a big processing challenge, it's a big compute challenge. You don't have to have petabytes of data to have a big-data opportunity or issue."

In fact, big data is a self-defining concept that Gualtieri described as the frontier of an individual company's ability to store, process and access data to achieve business outcomes — and those outcomes are mostly about understanding and serving customers.

"The term [big data] increasingly just means all your data. All of it. It's not a certain type of data, it's just all the data that you have," Gualtieri said.

"So when we say big data, we're just talking about data. You want to have a big data conversation? Let's have a data conversation."

At the moment, companies are nowhere near reaching their individual data frontiers, according to a recent Forrester survey that asked businesses how much of their existing data they use for analytics.

"It's only 12 percent. So if you do the math, what's your frontier just from the data you have? It's 88 percent. That's a big frontier already, not including the growth in the data, not including all the external sources you can have," Gualtieri said.

"So don't run out and try and get all this new data. Analyse the data that you have now. But why can't you? Well, the reason you can't is you have a portfolio of hundreds of applications."

One representative of a major company whom Gualtieri met recently had eight ERP systems alone.

"That's typical. If you go to a bank, there's a portfolio of 400 or 500 applications. They all have data, so it's really hard even to get this data together just to analyse it — and it's all siloed," he said.

"Now what is the problem of siloed data? It gives you an improper view of what is going on in your business. It gives you an inaccurate view of what's going on with your customers."

Gualtieri likened the situation with siloed data to the old joke about the drunk who has dropped his keys in the street on his way home to bed and is spotted looking for them under a street light.

When asked why he's only looking for them there, he replies, "That's where the light is."

Gualtieri said that issue is the problem with siloed information: "You can't see outside. These silos are the darkness."
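
A minimal Python sketch can illustrate the point about silos. Everything in it is hypothetical: the two applications, the field names and the `combine` helper are assumptions made for illustration, not anything described at the summit. Each silo on its own tells only part of the story about a customer, while the merged record shows both parts at once.

```python
from collections import defaultdict

# Hypothetical records held by two separate applications (silos).
billing = [
    {"customer": "c1", "spend": 120.0},
    {"customer": "c2", "spend": 40.0},
]
support = [
    {"customer": "c1", "complaints": 0},
    {"customer": "c2", "complaints": 5},
]


def combine(*silos):
    """Merge per-customer fields from every silo into one record each."""
    merged = defaultdict(dict)
    for silo in silos:
        for row in silo:
            # Copy every field except the join key into the merged record.
            fields = {k: v for k, v in row.items() if k != "customer"}
            merged[row["customer"]].update(fields)
    return dict(merged)


view = combine(billing, support)
for customer, fields in sorted(view.items()):
    print(customer, fields)
```

Seen only through the billing silo, customer c2 is simply a low spender. The combined view adds that the same customer has filed five complaints, which is the kind of context a single application cannot supply.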

Hadoop can help illuminate corporate data by managing it across clusters of commodity hardware.

"It can bring light to all that darkness by being something that can gather all this data together, so that we can see it and analyse it," Gualtieri said.

"Hadoop is not big data. It's a big-data technology. You can break down the silos, but Hadoop is also a framework for processing the data.

"Hadoop is the first data operating system — that's what makes it so powerful, and 81 percent of large enterprises are interested in it. But maybe they're not all believers yet."
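
The processing side of Hadoop is commonly associated with the MapReduce model, although the talk as reported here does not name it. The sketch below is plain Python, not Hadoop's API, and runs on one machine. The function names and the sample records are assumptions chosen for illustration. It shows only the shape of the model: a map step emits key-value pairs, a shuffle step groups them by key, and a reduce step collapses each group.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence over the input records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input: each string stands in for a record stored on a different node.
records = ["big data", "data science", "big compute"]
counts = run_job(records)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on the machines that hold the data. Here they run in a single process purely to make the flow visible.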

Research shows that 45 percent of big companies say they're doing a Hadoop proof of concept, with 16 percent using it in production.

"So there's not a huge percentage of enterprises in production yet but now the momentum is building, and a huge production wave is coming for Hadoop," Gualtieri said.

"When we look at that big-data trend, Hadoop is perfectly positioned to be a major and central big-data platform."

What is mostly driving the interest in big data is a desire to be able to treat customers as individuals. Hadoop can offer part of the technology solution but there is a third element in the equation.

"We're looking at how to treat customers as individuals, we've got all the data, we've got this great data operating system and that brings us to the next trend and that is data science," Gualtieri said.

Data science comes in a number of guises, including data mining, predictive analytics and machine learning, but its aim is to find new knowledge in data and to build predictive models that reveal probabilities.

"A data scientist essentially uses a combination of statistical and machine-learning algorithms to do that analysis. Data science is very different from traditional analytics and this is what most people don't get," Gualtieri said.

Traditionally, analytics are based on managers' theories about, say, customer churn.

"This is a human-driven approach to traditional analytics. On the data science side, it's very different. We don't need a big meeting. We don't need your hypotheses. We don't need your ideas. What we need is all the data you've got," he said.

"One way to think of it is a traditional BI analyst is harvesting data that has been planted — a statistician slicing and dicing. A data scientist runs an army of algorithms against the data to find the meaning."
