Bursting the Big Data Bubble

Summary: A Big Data analytics company CEO exposes some common, and potentially harmful, misunderstandings about Big Data.

TOPICS: Big Data

This guest post comes from Stefan Groschupf, CEO of Datameer. The opinions expressed here are his, not mine. That said, I think Stefan makes some excellent points and provides a valuable, sober critique of today’s Big Data “gold rush.”

We’re in the middle of a Big Data and Hadoop hype cycle, and it's time for the Big Data bubble to burst.

Yes, moving through a hype cycle enables a technology to cross the chasm from the early adopters to a broader audience. And, at the very least, it indicates a technology’s advancement beyond academic conversations and pilot projects. But the broader audience adopting the technology may just be following the herd, and missing some important cautionary points along the way. I’d like to point out a few of those here.

Riding the Bandwagon
Hype cycles often come with a "me too" crowd of vendors who rush to implement a hyped technology in an effort to stay relevant and not get lost in the shuffle. But offerings from such companies may confuse the market, as they sometimes apply the technology to use cases it does not fit.

Projects using these products run the risk of failure, yielding virtually no ROI, even when customers pony up significant resources and effort. Customers may then begin to question the hyped technology. The Hadoop stack is beginning to draw exactly this kind of criticism right now.

Bursting the Big Data bubble starts with appreciating certain nuances about its products and patterns. Following are some important factors, broken into three focus areas, that you should understand before considering a Hadoop-related technology.

Hadoop is not an RDBMS killer
Hadoop runs on commodity hardware and storage, making it much cheaper than traditional Relational Database Management Systems (RDBMSes), but it is not a database replacement. Hadoop was built to take advantage of sequential data access, where data is written once then read many times, in large chunks, rather than single records. Because of this, Hadoop is optimized for analytical workloads, not the transaction processing work at which RDBMSes excel.

Low-latency reads and writes won’t, quite frankly, work on Hadoop’s Distributed File System (HDFS). Merely coordinating the write or read of a single byte of data requires multiple TCP/IP connections to HDFS, and this creates very high latency for transactional operations.

However, the throughput for reading and writing larger chunks of data in a well-optimized Hadoop cluster is very fast. It's good technology, when well-understood and appropriately applied.
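The trade-off between per-request latency and bulk throughput can be sketched with a back-of-envelope model. This is only an illustration: the overhead and throughput constants and the `transfer_time` helper are assumptions chosen for the example, not measurements of HDFS or of any real cluster.

```python
# Back-of-envelope model of record-at-a-time access versus large sequential reads.
# All constants are illustrative assumptions, not measured HDFS figures.

PER_REQUEST_OVERHEAD_S = 0.005               # assumed fixed coordination cost per request
THROUGHPUT_BYTES_PER_S = 100 * 1024 * 1024   # assumed sustained sequential throughput


def transfer_time(num_requests: int, bytes_per_request: int) -> float:
    """Estimated seconds to move the data: fixed overhead plus streaming time."""
    overhead = num_requests * PER_REQUEST_OVERHEAD_S
    streaming = (num_requests * bytes_per_request) / THROUGHPUT_BYTES_PER_S
    return overhead + streaming


TOTAL_BYTES = 1024 * 1024 * 1024   # 1 GiB of data in both scenarios
RECORD_BYTES = 1024                # transactional style: one small record per request
CHUNK_BYTES = 64 * 1024 * 1024     # analytical style: one large chunk per request

record_at_a_time = transfer_time(TOTAL_BYTES // RECORD_BYTES, RECORD_BYTES)
large_chunks = transfer_time(TOTAL_BYTES // CHUNK_BYTES, CHUNK_BYTES)

print(f"record at a time: {record_at_a_time:.2f} s")
print(f"large chunks:     {large_chunks:.2f} s")
```

Under these assumed numbers the same gigabyte takes roughly ten seconds to read in large chunks and well over an hour to read one record at a time, because the fixed per-request cost dominates when each request carries so little data.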

Hives and Hive-nots
Hive allows developers to query data within Hadoop using a familiar, Structured Query Language (SQL)-like language. A lot more people know SQL than can write Hadoop’s native MapReduce code, which makes Hive an attractive, cheaper alternative to hiring new talent or making developers learn Java and MapReduce programming patterns.

There are, however, some very important tradeoffs to note before making any decision on Hive as your big data solution:

  • HiveQL (Hive’s dialect of SQL) allows you to query structured data only. If you need to work with both structured and unstructured data, Hive simply won’t work without certain preprocessing of the unstructured data.
  • Hive doesn’t have an Extract/Transform/Load (ETL) tool, per se. So while you may save money using Hadoop and Hive as your data warehouse, along with in-house developers sporting SQL skill sets, you might quickly burn through those savings maintaining custom ETL scripts and prepping data as requirements change.
  • Hive uses HDFS and Hadoop’s MapReduce computational approach under the covers. This means, for reasons already discussed, that end users accustomed to normal SQL response times from traditional RDBMSes are likely to be disappointed with Hive’s somewhat clunky batch approach to “querying”.
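The preprocessing mentioned in the first point above might look something like the following sketch, which turns free-form log lines into tab-delimited rows that a structured table could hold. The log format, the `LOG_PATTERN` expression and the `to_row` helper are hypothetical, invented for this example; they are not part of Hive or Hadoop.

```python
import re
from typing import Optional

# Hypothetical log format, assumed for illustration only:
#   <timestamp> <LEVEL> <free-form message>
LOG_PATTERN = re.compile(r"^(?P<timestamp>\S+) (?P<level>[A-Z]+) (?P<message>.*)$")


def to_row(line: str) -> Optional[str]:
    """Return a tab-delimited row, or None if the line does not fit the pattern."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = (match["timestamp"], match["level"], match["message"])
    # A tab inside a field would break the column layout, so replace it.
    return "\t".join(field.replace("\t", " ") for field in fields)


raw_lines = [
    "2012-07-01T10:00:00 ERROR disk full on node7",
    "free-form text that does not fit the pattern",
    "2012-07-01T10:00:05 INFO job finished",
]

# Lines that cannot be parsed are dropped here; a real pipeline would have to
# decide what to do with them, and keep this script in step with the data.
rows = [row for row in map(to_row, raw_lines) if row is not None]
for row in rows:
    print(row)
```

Even this small script has to be maintained whenever the log format changes, which is the kind of ongoing cost the second point above warns about.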

Real-time Hadoop? Not really.
At Datameer, we’ve written a bit about this in our blog, but let’s explore some of the technical factors that make Hadoop ill-suited to real-time applications.

Hadoop’s MapReduce computational approach employs a Map pre-processing step and a Reduce data aggregation/distillation step. While it is possible to apply the Map step to real-time streaming data, you can’t do so with the Reduce step. That’s because the Reduce step requires all input data for each unique data key to be mapped and collated first. While there is a hack for this process involving buffers, even the hack doesn’t operate in real time, and buffers can only hold small amounts of data.
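A toy, single-process sketch can make this constraint concrete. It is plain Python rather than Hadoop code, and the function names, the sample stream and the two-line buffer size are assumptions made for illustration. The Map step runs on each record as it arrives, but a buffered Reduce only ever sees the keys collected in its own window.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_record(line: str) -> List[Tuple[str, int]]:
    """Map step: needs only the current record, so it can run as data arrives."""
    return [(word, 1) for word in line.split()]


def reduce_pairs(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sums values per key, correct only for the pairs it is given."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def windowed_reduce(stream: Iterable[str], window_size: int) -> Iterator[Dict[str, int]]:
    """The buffer workaround: reduce every window_size lines, yielding partial results."""
    buffer: List[Tuple[str, int]] = []
    lines_in_window = 0
    for line in stream:
        buffer.extend(map_record(line))
        lines_in_window += 1
        if lines_in_window == window_size:
            yield reduce_pairs(buffer)
            buffer, lines_in_window = [], 0
    if buffer:
        yield reduce_pairs(buffer)


stream = ["error disk", "error net", "ok", "error disk"]

# Batch style: every record is mapped and collated before Reduce runs.
full = reduce_pairs(pair for line in stream for pair in map_record(line))

# Buffered style: each window is reduced on its own.
partials = list(windowed_reduce(stream, window_size=2))

print(full)
print(partials)
```

No single window reports the true total for `error`; that figure exists only once all input has been seen, and each partial result is delayed by however long its buffer takes to fill.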

NoSQL products like Cassandra and HBase also use MapReduce for analytics workloads. So while those data stores can perform near real-time data look-ups, they are not tools for real-time analytics.

Three blind mice
While there are certainly other Big Data myths out there that need busting, Hadoop’s inability to act as an RDBMS replacement, Hive’s various shortcomings and MapReduce’s poor fit for real-time streaming data applications present the biggest stumbling blocks, in our observation.

In the end, realizing the promise of Big Data will require getting past the hype and understanding appropriate application of the technology. IT organizations must burst the Big Data bubble and focus their Hadoop efforts in areas where it provides true, differentiated value.

Andrew Brust

About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.

Talkback

11 comments
  • Good Article

    We need more investigations like this to cut through the hype.
    happyharry_z
  • ETL tools

    If Hadoop is to be used, data has to get into it somehow. Most ETL tool vendors have gotten the memo about Hadoop and typically have interfaces for it. Certainly Ab Initio, Informatica, and DataStage do.

    I appreciate an article that does not take the position that their product will instantly make everything perfect.
    pwatson
    • "No ETL!"

      Remember, Stefan's company pitches "No ETL!" which has forever been the dream of analytics vendors. To be fair, Hadoop interfaces provided by vendors like Informatica will not scale complex transformations on really big data in anything but nightly batch scenarios, hence the shift to ELT... Data preparation is one of the primary uses of Hadoop today, so ETL vendors are rightfully worried.
      mwaustin
  • Hive doesn’t have an Extract/Transform/Load (ETL)

    Careful now.

    This sounds like a DBA defending his turf. ETL generally causes more problems than it solves. Architecturally, it has been overused and abused several orders of magnitude more often than Hadoop has even been implemented.
    Tea.Rollins
  • Real-time hadoop

    I read what you said about Hadoop, but have you checked this out? What are your thoughts? http://www.actian.com/products/vectorwise/vectorwise-hadoop From what I have read, they have already taken into account what you say is bad about real-time analysis.
    justink4
    • Re: Real-time hadoop

      Justink4,
      First, let's define realtime. A pacemaker is realtime: it guarantees to send an impulse at a defined time, with only microsecond tolerance. No database or other data system can be realtime; it can only respond within a few hundred milliseconds. MapReduce by definition cannot do anything like that, since the algorithm guarantees that the reduce stage gets all values for a given key, which requires that the map stage is done, the data is sorted and partitioned, etc. There are a few MapReduce streaming technologies available, but their use cases are very limited, since they use buffers in between. Your buffer can be 10 seconds, 30 seconds or a few minutes, but that is not realtime. So there is no realtime analytics, only perhaps stream processing, which is very limited in its use cases.

      Does this make sense? #DontBelievetheHype :)
      StefanGroschupf
      • What is realtime? According to who?

        Your example of the pacemaker is good, Stefan. Both "realtime" and "near-realtime" have been hype market leaders. Realtime does not have any number associated with it. Near-realtime is near to what? Near to something else unmeasured.

        It is the customer's job to want it faster. The question remains as to what "it" is. The vendor interprets "it" for their focus. The time for a machine to complete a specific query is easy to focus on. The time to develop the right query and manage the operation is not quite the same thing.
        pwatson
  • At last some common sense on the subject

    Hadoop could never replace an RDBMS.

    It would be far more sensible to focus attention on better RDBMS implementations than pursuing Hadoop.

    More efficient RDBMS implementations than today's products are definitely possible - as are clearer, more logical (and probably faster) relational languages than SQL.
    jorwell
    • Re: better RDBMS implementations

      Jorwell,
      I tried to make a different point. Hadoop is very powerful and a great technology where RDBMSes are misused. RDBMSes are great for transactions and random reads and writes, but using an RDBMS for analytics is a stretch from my perspective. Sure, there are analytics-optimized RDBMSes, but from a technical perspective Hadoop is the better choice for straight analytics. I think what people need to understand is that for full table scans, joins, aggregations, etc., a sequentially optimized system (Hadoop) is a much better choice than a b-tree-based system (RDBMS).
      One more thing, in almost all companies I ever worked for and with, the ETL job is batch anyhow, so no near realtime query was ever done, since the ETL job had to run first to load the data for a few hours.
      Makes sense?
      StefanGroschupf
      • I am not convinced you understand what a relational DBMS is

        One of the central points of an RDBMS is that the logical and physical layers are completely separate from each other.

        Current RDBMS implementations are physically based on b-tree indexes but there is no reason why this has to be so and there are potentially some far more effective physical models.

        The strength of the relational model is that it is based on a sound, proven mathematical basis. I see no corresponding model for Hadoop.

        I see no sense in throwing away the flexibility and logical consistency of the relational model. What is needed is a simpler, more logical relational language than SQL and more efficient physical implementations.

        I don't see Hadoop as having any long term future whatsoever.
        jorwell
  • Big Data Solution

    I agree we are seeing a lot of hype around trying to make Hadoop a mature and complete solution. As an alternative to Hadoop, LexisNexis has open sourced the HPCC Systems platform, which is a complete enterprise-ready solution. Designed by data scientists, it provides a single architecture, a consistent data-centric programming language (ECL), and two data processing clusters. Its built-in analytics libraries for Machine Learning and BI integration provide a complete integrated solution from data ingestion and data processing to data delivery. This all-in-one platform means only one thing to support, with a significantly lower number of resources. In contrast, the complexity of the Hadoop ecosystem requires a huge investment in technology and resources up front and throughout. More at http://hpccsystems.com
    H-M