Why is Big Data Revolutionary?

Summary: Big Data is revolutionary, and not merely the evolution of BI and data warehousing technology. Here's why.

TOPICS: Data Centers

Last week, Dan Kusnetzky and I participated in a ZDNet Great Debate titled “Big Data: Revolution or evolution?”  As you might expect, I advocated for the “revolution” position.  The fact is I probably could have argued either side, as sometimes I view Big Data products and technologies as BI (business intelligence) in we-can-connect-to-Hadoop-too clothing.

But in the end, I really do see Big Data as different and significantly so.  And the debate really helped me articulate my position, even to myself.  So I present here an abridged version of my debate assertions and rebuttals.

Big Data’s manifesto: don’t be afraid

Big Data is unmistakably revolutionary. For the first time in the technology world, we’re thinking about how to collect more data and analyze it, instead of how to reduce data and archive what’s left. We’re no longer intimidated by data volumes; now we seek out extra data to help us gain even further insight into our businesses, our governments, and our society.

The advent of distributed processing over clusters of commodity servers and disks is a big part of what’s driving this, but so too is the low and falling price of storage. While the technology, and indeed the need, to collect, process and analyze Big Data, has been with us for quite some time, doing so hasn’t been efficient or economical until recently. And therein lies the revolution: everything we always wanted to know about our data but were afraid to ask. Now we don’t have to be afraid.

A Big Data definition

My primary definition of Big Data is the area of tech concerned with procurement and analysis of very granular, event-driven data. That involves Internet-derived data that scales well beyond Web site analytics, as well as sensor data, much of which we’ve thrown away until recently. Data that used to be cast off as exhaust is now the fuel for deeper understanding about operations, customer interactions and natural phenomena. To me, that’s the Big Data standard.

Event-driven data sets are too big for transactional database systems to handle efficiently. Big Data technologies like Hadoop, complex event processing (CEP) and massively parallel processing (MPP) systems are built for these workloads. Transactional systems will improve, but there will always be a threshold beyond which they simply weren't designed to operate.
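To make that concrete, the kind of work Hadoop distributes across a cluster can be sketched in miniature. The toy Python example below (the clickstream events are invented for illustration) mimics the map and reduce phases of a MapReduce job, counting occurrences of each (user, action) pair — the sort of granular, event-driven aggregation that overwhelms a transactional system at real-world volumes:

```python
from collections import defaultdict

# Hypothetical clickstream events -- stand-ins for the granular,
# event-driven records discussed above.
events = [
    {"user": "alice", "action": "click"},
    {"user": "bob",   "action": "view"},
    {"user": "alice", "action": "view"},
    {"user": "alice", "action": "click"},
]

# Map phase: emit a ((user, action), 1) pair for every event.
mapped = [((e["user"], e["action"]), 1) for e in events]

# Shuffle/reduce phase: group pairs by key and sum their counts.
counts = defaultdict(int)
for key, n in mapped:
    counts[key] += n

print(dict(counts))
# {('alice', 'click'): 2, ('bob', 'view'): 1, ('alice', 'view'): 1}
```

On a real cluster, the map and reduce phases run in parallel across many machines and the shuffle moves data between them; the point of the sketch is only the shape of the computation.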

2012: Year of Big Data?

Big Data is becoming mainstream…it’s moving from specialized use in science and tech companies to Enterprise IT applications. That has major implications, as mainstream IT standards for tooling, usability and ease of setup are higher than in scientific and tech company circles. That’s why we’re seeing companies like Microsoft get into the game with cloud-based implementations of Big Data technology that can be requested and configured from a Web browser.

The quest to make Big Data more Enterprise-friendly should result in refinement of the technology and lower costs of operating it. Right now, Big Data tools have a lot of rough edges and require expensive, highly specialized technologists to implement and operate them. That is changing, though, which is further proof of Big Data's revolutionary quality.

Spreadmarts aren't Big Data, but they have a role

Is Big Data any different from the spreadsheet models and number crunching we’ve grown accustomed to? What the spreadsheet jocks have been doing can legitimately be called analytics, but certainly not Big Data, as Excel just can't accommodate Big Data sets as defined earlier. It wasn't until 2007 that Excel could even handle more than 65,536 rows per worksheet. It can't handle larger operational data loads, much less Big Data loads.

But the results of Big Data analyses can be further crunched and explored in Excel. In fact, Microsoft has developed an add-in that connects Excel to Hive, the relational/data warehouse interface to Hadoop, the emblematic Big Data technology. Think of Big Data work as coarse editing and Excel-based analysis as post-production.

The fact that BI and DW are complementary to Big Data is a good thing. Big Data lets older, conventional technologies provide insights on data sets that cover a much wider scope of operations and interactions than they could before. The fact that we can continue to use familiar tools in completely new contexts makes something seemingly impossible suddenly accessible, even casual. That is revolutionary.

Natural language processing and Big Data

There are solutions for carrying out Natural Language Processing (NLP) with Hadoop (and thus Big Data). One involves pairing the Python programming language with a set of libraries called NLTK (the Natural Language Toolkit). Another example is Apple’s Siri technology on the iPhone. Users simply talk to Siri to get answers drawn from a huge array of domain expertise.
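As a rough illustration of the first stage such a pipeline performs — and with the caveat that NLTK's own tokenizers are far more sophisticated than this, and that the sample text is invented — here is a minimal pure-Python sketch of tokenizing a corpus and counting term frequencies, the kind of per-document work a Hadoop job would fan out across a cluster:

```python
import re
from collections import Counter

def tokenize(text):
    # Crude word tokenizer: lowercase, then pull out runs of letters
    # and apostrophes. NLTK's word_tokenize handles punctuation,
    # contractions and sentence boundaries far more carefully.
    return re.findall(r"[a-z']+", text.lower())

corpus = "Big Data helps NLP, and NLP helps Big Data."
freq = Counter(tokenize(corpus))
print(freq.most_common(3))
```

At Big Data scale, each mapper would tokenize its own slice of a corpus of millions of documents, and the reducers would merge the partial counts — the same shuffle-and-sum pattern as any word count job.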

Sometimes Siri works remarkably well; at other times it’s a bit clunky. Interestingly, Big Data technology itself will help to improve natural language technology, as it will allow greater volumes of written works to be processed and algorithmically understood. So Big Data will help itself become easier to use.

Big Data specialists and developers: can they all get along?

We don't need to make this an either/or question. Just as there have long been developers and database specialists, there will continue to be a call for those who build software and those who specialize in the procurement and analysis of the data that software produces and consumes. The two are complementary.

But in my mind, people who develop strong competency in both will have very high value indeed. This will be especially true as most tech professionals seem to self-select as one or the other. I've never thought there was a strong justification for this, but I’ve long observed it as a trend in the industry. People who buck that trend will be rare, and thus in demand and very well-compensated.

The feds and Big Data?

The recent $200 million investment in Big Data announced by the U.S. Federal government received lots of coverage, but how important is it, really?  It has symbolic significance, but I also think it has flaws. $200 million is a relatively small amount of money, especially when split over numerous Federal agencies.

But when the administration speaks to the importance of harnessing Big Data in the work of the government and its importance to society, that tells you the technology has power and impact. The US Federal Government collects reams of data; the Obama administration makes it clear the data has huge latent value.

Big Data and BI are separate, but connected

Getting back to my introductory point, is Big Data just the next generation of BI?  Big Data is its own subcategory and will likely remain so. But it's part of the same food chain as BI and data warehousing, and these categories will exist along a continuum rather than as discrete, perfectly distinct fields.

That's exactly where things have stood for more than a decade with database administrators and modelers versus BI and data mining specialists. Some people do both, others specialize in one or the other. They're not mutually exclusive, nor is one merely a newer manifestation of the other.

And so it will be with Big Data: an area of data expertise with its own technologies, products and constructs, but with an affinity to other data-focused tech specializations. Connections exist throughout the tech industry and computer science, and yet distinctions are still legitimate, helpful and real.

Where does this leave us?

In the debate, we discussed a number of scenarios where Big Data ties into more established database, Data Warehouse, BI and analysis technologies. The tie-ins are numerous indeed, which may make Big Data’s advances seem merely incremental.  After all, if we can continue to use established tools, how can the change be "Big?"

But the revolution isn’t televised through these tools.  It’s happening away from them.

We're taking huge amounts of data, much of it unstructured, using cheap servers and disks.  And then we're on-boarding that sifted data into our traditional systems. We're answering new, bigger questions, and a lot of them.  We're using data we once threw away, because storage was too expensive, processing too slow and, going further back, broadband was too scarce. Now we're working with that data, in familiar ways -- with little re-tooling or disruption.  This is empowering and unprecedented, but at the same time, it feels intuitive.

That's revolutionary.

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.



  • Nonsense.

    It's the same thing on a larger scale and when "Big ____" enters the room, they wipe out "Little ____" and claim it's all "capitalism" when it's not...
  • It's really just a reduction in the barriers to entry

    Lower dollars to get the same results. Not really revolutionary. Just cheaper so more people can do it.
  • But if it's not transactional

    the data will be big but inconsistent.

    So all these startlingly new insights we are gathering will be based on false premises. In other words, you will get wrong answers.

    Why not just take advantage of the improved hardware capabilities to make transactional (and therefore reliable) systems faster and more scalable?

    I still don't see the difference between "event driven datasets" and any other kind of data. Surely a cashier passing my bacon sandwich over the scanner in the supermarket is an event, making a phone call is an event. All these things are handled perfectly adequately by existing SQL-DBMSs.

    The Big Data technologies look incredibly clumsy and old fashioned compared to relational.
    • Transactional?

      jorwell...I understand your wanting to take advantage of improved hardware capabilities to solve hardware problems. However, what "problem" is solved by buying more hardware?

      I am assuming that you are referring to Flash PCIe slots memory cards like Fusion-IO and whatnot...especially considering their relationship with IBM.

      Great technology, but that doesn't solve the ELASTICITY imperative that is inherent in cloud/software solutions.
      • What problem do you solve

        if you get the wrong answer because the data is inconsistent because you have no transaction control?

        With the so-called elasticity of big data approaches you also sacrifice all forms of integrity checking, therefore further compromising the quality of your data.

        As Sting might have put it (to the tune of Roxanne) "Big data, you don't care if it's wrong or if it's right".
    • Garbage in, Garbage Out....

      is that what you're saying jorwell? if so, I agree.
  • Is this really revolutionary?

    Check this out, from 1998, http://robotics.stanford.edu/~ronnyk/kurt.pdf
    SGI's MineSet did foreshadow some of this work!
  • The billion row question

    Someone commented on this blog that for a billion row dataset you needed big data technology.

    I have just created a billion row table in a SQL-DBMS and get response times of under 50 milliseconds on indexed columns - with plan generation included in the timing.

    Of course if I were to query on a non-indexed column then response times are slower, but a key-value structure would also be very slow if you queried on value rather than key. At least in a SQL-DBMS I could put an index on the value - which I cannot do in a key-value structure.

    I don't see the performance argument being a telling one when it goes together with losing all the advanced functionality of a SQL-DBMS and the sound mathematical basis of the relational model.
    • Milliseconds?

      Dude...you need to get that down to nanoseconds.
      • Pointless

        If it means giving up the sophistication, reliability and sound mathematical basis of the relational model.

        Nobody sensible wants the wrong answer fast.

        Everybody wants to be able to get at their data in a flexible way, using a query language.

        Performance is a very low criterion compared to consistency and correctness - and if the data isn't consistent it cannot possibly be correct.

        I ran the test on a desktop PC, by the way, I'm looking forward to the results of your test for nanosecond performance with similar hardware from a big data technology.
  • Big Data Revolution

    Andrew, great insight on the Big Data revolution! I think it is worth mentioning HPCC Systems, which provides a single platform that is easy to install, manage and code for. Their built-in analytics libraries for Machine Learning and integration tools with Pentaho for great BI capabilities make it easy for users who do not hold a PhD or carry a title like "Data Scientist" to easily analyze Big Data. For more info visit: hpccsystems.com
  • That's a very good write-up

    Thanks for the detailed analysis and report.
  • Big Data - Truly revolutionary

    The advances in technology (e.g. DW and processing power) will truly be revolutionary. Will professionals be able to keep up with the skills necessary to make sense of all this data and, more importantly, be able to ask the right questions? See discussion www.intothecore.com The Big Bang of Marketing: Big Data. A dream come true and a nightmare.
  • NOT revolutionary, but the results...

    Utterly disagree. It took a long time to generate the amounts of data big data needs. It took a long time for cheap servers and storage options to emerge. It took at least one if not two decades to begin to get our arms around unstructured data. It took many years for the noSQL approach to get refined and yield something useable.

    Big data is a natural capstone on top of all the big trends of the past 10-15 years. The cherry on top of the other trends. Of all of the trends, mobile, cloud, virtualization... it is the most evolutionary. Okay, maybe virtualization is equally as evolutionary.

    But the INSIGHTS big data can yield (but does not always) can be revolutionary.
  • Big Data is indeed a revolution

    Good article, though I think the key aspect is not about storage (Cloud, etc.) or analysis (Mahout, Hadoop ecosystem, mining tools) but more about the enterprise readiness of the technology, which, in my opinion, is not there yet. Cloudera, IBM, Karmasphere are doing their bits, but still can't be said to be mainstream.
  • In-RAM transactional Database Model

    is something that Oracle and MS have mistakenly ignored.
    There is huge potential for startups, because several technical aspects need to be solved to maximize I/O to feed such a database when working on Big Data, complex computational models, etc.
    Big Data or whatever you want to call it will still benefit hugely from transactional databases.
    In some cases the current Big Data technologies seem primitive compared to relational databases, it is just that you can throw so much hardware at it.



  • Where does this leave us?

    Where does this leave us?

    Smack in the middle of Orwell's 1984.