Hadoop's rise: Why you don't need petabytes for a big data opening

Summary: People are often hung up on the volume aspect of big data but other factors can be just as telling in the issues they raise for business.

Big data isn't necessarily big and can be as much about the complexities of processing information as about volumes or data types.

Personal genetic-profiling services such as 23andMe, which charges $99 to sequence an individual's genome, illustrate the point, according to Forrester principal analyst Mike Gualtieri.

The resulting data from one individual's sequenced DNA is only about 800MB, he told an audience at last week's Hadoop Summit in Amsterdam.

"That's not a lot. Would you call that big data? If I said 800MB is big data, I'd get laughed out of the room," Gualtieri said.

"But within that, there are four billion pieces of information and lots of patterns. So it's a big processing challenge, it's a big compute challenge. You don't have to have petabytes of data to have a big-data opportunity or issue."

In fact, big data is a self-defining concept that Gualtieri described as the frontier of an individual company's ability to store, process and access data to achieve business outcomes — and those outcomes are mostly about understanding and serving customers.

"The term [big data] increasingly just means all your data. All of it. It's not a certain type of data, it's just all the data that you have," Gualtieri said.

"So when we say big data, we're just talking about data. You want to have a big data conversation? Let's have a data conversation."

At the moment, companies are nowhere near reaching their individual data frontiers, according to a recent Forrester survey, which asked businesses how much of their existing data they use for analytics.

"It's only 12 percent. So if you do the math, what's your frontier just from the data you have? It's 88 percent. That's a big frontier already, not including the growth in the data, not including all the external sources you can have," Gualtieri said.

"So don't run out and try and get all this new data. Analyse the data that you have now. But why can't you? Well, the reason you can't is you have a portfolio of hundreds of applications."

One representative of a major company whom Gualtieri met recently had eight ERP systems alone.

"That's typical. If you go to a bank, there's a portfolio of 400 or 500 applications. They all have data, so it's really hard even to get this data together just to analyse it — and it's all siloed," he said.

"Now what is the problem of siloed data? It gives you an improper view of what is going on in your business. It gives you an inaccurate view of what's going on with your customers."

Gualtieri likened the situation with siloed data to the old joke about the drunk who has dropped his keys in the street on his way home to bed and is spotted looking for them under a street light.

When asked why he's only looking for them there, he replies, "That's where the light is."

Gualtieri said that issue is the problem with siloed information: "You can't see outside. These silos are the darkness."

Hadoop can help illuminate corporate data by managing it across clusters of commodity hardware.

"It can bring light to all that darkness by being something that can gather all this data together, so that we can see it and analyse it," Gualtieri said.

"Hadoop is not big data. It's a big-data technology. You can break down the silos, but Hadoop is also a framework for processing the data.

"Hadoop is the first data operating system — that's what makes it so powerful, and 81 percent of large enterprises are interested in it. But maybe they're not all believers yet."
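The MapReduce programming model at the heart of this "data operating system" can be sketched in a few lines of plain Python. This is a toy illustration of the map/shuffle/reduce phases only, not Hadoop's actual Java API:

```python
from collections import defaultdict

# Toy sketch of the MapReduce model Hadoop popularised
# (plain Python for illustration, not Hadoop's API).

def map_phase(records):
    """Emit (key, value) pairs -- here, one (word, 1) pair per word."""
    for record in records:
        for word in record.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each key's values -- here, summing the counts."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data is not big", "data is data"]
counts = reduce_phase(shuffle(map_phase(records)))
```

In a real cluster, the map and reduce phases run in parallel across many commodity machines, which is what lets Hadoop pull siloed data together at scale.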

Research shows that 45 percent of big companies say they're doing a Hadoop proof of concept, with 16 percent using it in production.

"So there's not a huge percentage of enterprises in production yet but now the momentum is building, and a huge production wave is coming for Hadoop," Gualtieri said.

"When we look at that big-data trend, Hadoop is perfectly positioned to be a major and central big-data platform."

Much of the interest in big data is driven by a desire to treat customers as individuals. Hadoop can offer part of the technology solution, but there is a third element in the equation.

"We're looking at how to treat customers as individuals, we've got all the data, we've got this great data operating system and that brings us to the next trend and that is data science," Gualtieri said.

Data science comes in a number of guises — data mining, predictive analytics, machine learning — but its aim is to find new knowledge in data and to build predictive models that reveal probabilities.

"A data scientist essentially uses a combination of statistical and machine-learning algorithms to do that analysis. Data science is very different from traditional analytics and this is what most people don't get," Gualtieri said.

Traditionally, analytics are based on managers' theories about, say, customer churn.

"This is a human-driven approach to traditional analytics. On the data science side, it's very different. We don't need a big meeting. We don't need your hypotheses. We don't need your ideas. What we need is all the data you've got," he said.

"One way to think of it is a traditional BI analyst is harvesting data that has been planted — a statistician slicing and dicing. A data scientist runs an army of algorithms against the data to find the meaning."
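The contrast Gualtieri draws can be sketched with a minimal learner: rather than starting from a manager's hypothesis, the algorithm fits a churn model directly to the data. The feature (support calls per month), the data, and the helper names below are all invented for illustration; it is a one-variable logistic regression trained by gradient descent, not any specific data-science product:

```python
import math

# Hypothetical sketch: letting an algorithm find the churn pattern
# in (synthetic, invented) customer data instead of starting from
# a human hypothesis.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.1, epochs=2000):
    """One-feature logistic regression trained by gradient descent."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# Synthetic data: customers with many support calls tended to churn.
calls = [0, 1, 1, 2, 5, 6, 7, 8]
churn = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit(calls, churn)

def churn_probability(n_calls):
    """Predicted probability that a customer with this many calls churns."""
    return sigmoid(w * n_calls + b)
```

The model surfaces the pattern (more support calls, higher churn probability) without anyone proposing it first — which is Gualtieri's point, though, as a commenter below notes, explaining *why* the pattern exists still takes human theory.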

Talkback

8 comments
  • Do you really need to process all your data, though?

    "'The term [big data] increasingly just means all your data. All of it. It's not a certain type of data, it's just all the data that you have,' Gualtieri said."

    Do you really need to process all your data, though? Seems like a waste to me. It's like looking for a needle in a haystack when nobody put a needle in it.
    CobraA1
    • Depends on the value of the needle.

      If that needle is 15 million, then yes, it is worth it to find it, and that means that you potentially have to look at EVERY straw.
      jessepollard
      • As I said though . . .

        As I said though . . .

        . . . you don't even know if there's a needle in the haystack.

        . . . and I'd like to know how many $15 million needles have been found by "big data."
        CobraA1
  • Also sounds like a way to kill privacy and sharing requirements . . .

    Also sounds like a way to kill privacy and sharing requirements . . . there are sometimes limitations on what you can do with data (eg, for HIPAA requirements). I'd say avoid processing data that has privacy requirements.
    CobraA1
    • "I'd say avoid processing data that has privacy requirements."

      If the data isn't going to be processed, then it shouldn't be collected...

      Sometimes it is necessary... especially when you are actually searching for correlations to identify those individuals that need to be notified...
      jessepollard
      • thoughts

        "If the data isn't going to be processed, then it shouldn't be collected..."

        Agreed - although try telling the NSA that . . .

        "Sometimes it is necessary... especially when you are actually searching for correlations to identify those individuals that need to be notified..."

        If you know what you are looking for ahead of time, you can limit the data to what you need to identify the individuals. It's probably not considered "big data." At least not according to the definitions that bloggers seem to use.
        CobraA1
  • Actually, you need them even more . . .

    "We don't need a big meeting. We don't need your hypotheses. We don't need your ideas. What we need is all the data you've got,"

    Actually, you need them even more in order to explain the results of "big data." There's a difference between noticing a pattern or trend in your data and explaining why it exists. Without a good theory as to why you have a pattern, you're on dangerous ground if you assume the pattern will persist.

    There's the danger of noticing patterns that don't really exist as well: what appears to be a pattern will sometimes pop up by chance, only to break down and stop being a pattern. And of course, some patterns change over time as well.
    CobraA1
  • Replace Hadoop by DataWarehouses and you get the same article

    The main message of this article is the same one proponents of DataMarts and DataWarehouses used a few years ago to sell their software: you need to get a unified view of data coming from different silos. So is Hadoop just a different kind of DataWarehouse?

    Sometimes it looks like "Big Data" is just another hype word to replace "Business Intelligence".
    fernando8