Big Data: a matter of opinion

Summary: At a panel discussion in Manhattan, we learn that BI, Hadoop, NoSQL and Data Analytics companies look at the same issues and technologies, sometimes through very different lenses.

TOPICS: Big Data, TechLines
Panel members and moderator
Photo credit: Yuriy Michael Goldman

Last night, I moderated a panel discussion, "Enterprise Insights: Go Big (Data) Or Go Home."  About 80 people paid $5 each to attend the event at Microsoft’s offices in Manhattan, exceeding my expectations and those of my co-organizers, Yuriy Michael Goldman and Conrad Wadowski, who lead the New York Business Intelligence and the Enterprise Tech Innovation Meetups, respectively.  The audience consisted of IT and marketing workers as well as a large group of data analysts.  It was engaged and interested…I might even say concerned; getting a better understanding of Big Data is seemingly an urgent priority for a great many professionals.

Who’s Who
The panel included representatives from the biggest companies in Hadoop, Open Source BI, NoSQL and Data Analytics.  Specifically, we had Richard Daley, co-founder and CSO of Pentaho; Patrick Angeles, Director of Field Architecture at Cloudera; Edouard Servan-Schreiber, Director for Solution Architecture at 10gen (the company behind MongoDB); and Kathleen Rohrecker, Director of Marketing at Revolution Analytics (the principal company behind the R Project).

As a primary objective, I wanted to investigate the commonalities and conflicts between Big Data and Business Intelligence (BI).  Are the two fields essentially the same, or is Big Data replacing/disrupting BI?  Why are the tools so different, and why are the practitioners different too?  Will Big Data technologies ever have the enterprise readiness that BI products do right now?  Secondarily, I was especially interested to hear each panelist’s definition of Big Data and where to apply it.  I learned a lot from the panel, both about the issues at hand and the differing industry attitudes around them.

Can we all get along?
Panel members seemed to agree that BI and Big Data were complementary and would coexist.  I was surprised that there was such consensus on that key point, so I pushed a bit further. I asked if the current popular model of using Hadoop to process gobs of unstructured data, and then push the results into conventional data warehouse and BI systems for analysis, was a temporary stop-gap or a permanent necessity.

Pentaho’s Daley saw the approach as natural and sensible; Cloudera’s Angeles was far less convinced.  Servan-Schreiber of 10gen saw the whole process of moving data to specialized analytical databases as inefficient at the very least and, for a growing number of customers, simply unacceptable.  Certainly, MongoDB’s new Aggregation Framework, which allows for in-situ analysis of data in an operational NoSQL database, is consistent with this point of view.
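To make that concrete, here is a minimal sketch of the kind of pipeline the Aggregation Framework evaluates in place. The field names and stages below are invented for illustration; against a live server, the same list of pipeline dicts would be handed to PyMongo's collection.aggregate(), with no export to a separate warehouse. The tiny evaluator simulates just the $match and $group stages in memory:

```python
# Hypothetical documents; with a real MongoDB deployment these would live
# in an operational collection, and the pipeline below would be passed to
# collection.aggregate(pipeline) for server-side, in-place evaluation.
events = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]

pipeline = [
    {"$match": {"region": "east"}},  # filter stage
    {"$group": {"_id": "$region", "total": {"$sum": "$amount"}}},  # aggregate stage
]

def run(docs, pipeline):
    """Toy in-memory evaluator, hardcoded to this pipeline's shape."""
    for stage in pipeline:
        if "$match" in stage:
            crit = stage["$match"]
            docs = [d for d in docs if all(d.get(k) == v for k, v in crit.items())]
        elif "$group" in stage:
            spec = stage["$group"]
            key_field = spec["_id"].lstrip("$")
            sum_field = spec["total"]["$sum"].lstrip("$")
            groups = {}
            for d in docs:
                acc = groups.setdefault(d[key_field], {"_id": d[key_field], "total": 0})
                acc["total"] += d[sum_field]
            docs = list(groups.values())
    return docs

print(run(events, pipeline))  # [{'_id': 'east', 'total': 17}]
```

The point of the design is that the query and the analysis happen in the same store, which is exactly the inefficiency-of-movement argument Servan-Schreiber was making.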

Photo credit: Sophia Dominguez

Defining Big Data
Each panel member had a different definition for Big Data. Cloudera’s Angeles defines it (rightly so, in my opinion) as work with data of a scale where traditional technologies break down, or cease to be effective.  Servan-Schreiber explained that 10gen measures the Bigness of Data by its velocity, the scale and performance it demands, and the need for real-time analytics. 

Also Read: Big Data: Defining its definition

Revolution Analytics felt more comfortable talking about getting the best analytical value from data, regardless of data set size (thereby sidestepping the question somewhat).  Pentaho puts a sale in the Big Data bucket if it's for a deployment that runs on top of Hadoop, NoSQL or a Data Warehouse appliance like Vertica or Greenplum.

To Hive or not to Hive
Another difference of opinion centered on the way BI tools and Big Data integrate.  Daley of Pentaho felt strongly that jamming Hive in the middle to make the two talk to each other is not sufficient.  Cloudera's Angeles felt that Hive works very well, going so far as to deny my accusation of it being "Rube Goldberg."  10gen finds it inappropriate to impose the SQL query paradigm on unstructured data, but also felt that writing MapReduce code in Java was impractical.  In support of that point, Servan-Schreiber asked audience members if they had MapReduce skills or experience, and only two hands went up.  Revolution’s Rohrecker, meanwhile, didn’t have much to say on the Hive issue.  And given the option Revolution provides to write MapReduce jobs in the R programming language, that makes perfect sense to me.
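Servan-Schreiber's informal poll is easier to understand once you see what even the canonical MapReduce example involves. Below is a purely local Python sketch of the word-count pattern; a Hadoop job expresses the same map, shuffle and reduce logic as Java classes, and the sample lines here are invented for illustration:

```python
# Local sketch of the MapReduce word-count pattern. A real Hadoop job
# distributes these phases across a cluster; the logic is the same.
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # Emit a (word, 1) pair for every word, as a Mapper would.
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Sum the counts for one key, as a Reducer would.
    return (word, sum(counts))

lines = ["big data big hype", "data data everywhere"]

# Shuffle/sort step: collect all emitted pairs and group them by key.
pairs = sorted(kv for line in lines for kv in map_phase(line))
result = dict(reduce_phase(word, (c for _, c in group))
              for word, group in groupby(pairs, key=itemgetter(0)))
print(result)  # {'big': 2, 'data': 3, 'everywhere': 1, 'hype': 1}
```

Even this toy version forces you to think in terms of emitted key/value pairs and a shuffle step rather than in queries, which is the learning curve the show of hands reflected.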

Guild membership required?
One of my last questions, addressed primarily to Rohrecker, was whether analytics work would remain in the hands of data scientists and other specialists, or if it would become more accessible to downstream business users.  Rohrecker felt that such work is genuinely difficult and not readily delegated to non-specialists.  Others on the panel seemed to agree.  It was striking to me that we ended with strong consensus, just as we had started with one, especially because I happen to think that analytics capabilities will be brought downstream to business users, and probably in the short to medium term.

In general, it’s clear that different data economy “subcultures” have different definitions of Big Data and different opinions on how best to work with it.  The different opinions and definitions are in large part self-serving.  None of that is surprising, but it is a hallmark of fragmentation for an early-stage technology cycle.  I would expect to see companies, opinions, approaches and ecosystems coalesce and, in part, commoditize.  But from yesterday’s panel, it’s clear that cohesion is still a ways off.


About Andrew Brust

Andrew J. Brust has worked in the software industry for 25 years as a developer, consultant, entrepreneur and CTO, specializing in application development, databases and business intelligence technology.



Comments
  • What is Microsoft's take on the Big Data question?

    The event was hosted at Microsoft’s offices in Manhattan, so I would expect at least to hear Microsoft's opinion on the matter. In particular, what direction will MS take SQL Server and SQL Azure in with respect to Big Data? Can you please share some? Thanks.
  • The lack of cohesion shows how regressive "Big Data" is

    It is the same situation as before the invention of the relational model by Edgar Codd.

    All of the big data technologies represent a massive step backwards from relational and re-introduce problems that the relational model solved forty years ago.

    As the relational model is a mathematical model of data saying it isn't scalable doesn't make sense. Scalability is a function of the implementation, not of the model.
    • traditional RDBs just don't scale

      At DataWeek in SF the other day, the Google BigTable guys did 100+TB full table scans in about a minute, with no indexing. Yes, you can get an RDB to scale like that, but not at a cost anyone is willing to pay.

      This is why polyglot data strategies are all the rage. RDBs, NoSQL, graph DBs, and MapReduce all have their spots. Choose wisely.
      • RDBMS is the wise choice

        The relational model is a mathematical model of how to represent data not how to store it.

        So to say an RDBMS doesn't scale doesn't really make sense. It is a little like saying long division doesn't scale because your only implementation is pencil and paper.

        Also RDBMSs do far more than the big data technologies, like constraint checking for example. It doesn't matter how fast your DBMS is if you get inconsistent answers out of it.

        The main problem facing data management is that people don't understand the most advanced technology (RDBMS) very well. It is well worth learning more - you could avoid some expensive mistakes that way.
  • RDBMS the right choice ?

    How can the RDBMS be the right choice?
    If we have to sum it up, the 60 years of R&D in this area are 90% related to solving the following problem: "How can two simultaneous transactions apply changes to the data without compromising database consistency?"
    It is all about changing data values...
    In the Big Data area, and specifically in event log management, there is no change of any value at all.
    Moreover, if you have a database, you have a database schema, which means that the only data you are able to process is structured data.
    We need to process structured data but also unstructured data.
    I am not saying that the database is not OK. I am just saying that in some fields, the RDBMS is not the best solution. Some other solutions fit much better with the customer's objectives.
    • There is no such thing as "unstructured data"

      Something that has no structure isn't data; it's noise.

      The advantage of a schema is that it means your data is logically consistent - whether you have one user or 10,000 users. This isn't a question of concurrency but of logic.

      RDBMS will still be around when the currently hyped big data trends are long dead.

      Any time invested in understanding the relational model, logic and set theory is time well spent. Big Data isn't really worth the trouble, as it is mainly a revival of methods that have already been shown to be flawed in theory and unmanageable in practice.
    • I don't want to process data

      Processing is something that happened on mainframes in the 1970s.

      I want to be able to perform logical inference on data. The DBMS will be doing some processing at the physical level, but no user or programmer wants to know about this low-level stuff.

      Big data just ain't modern.
  • On Big Data - and Guild Membership.

    Andrew, another great blog. A couple of thoughts: From a customer's perspective, Big Data is data that they don't have the expertise or compute resources to use effectively. A lot of "Big Data" complexity isn't because of Volume, Variety, or Velocity; it is because the early big data tools are cumbersome to use unless you have a "guild membership". For Big Data to match up to its marketing hype, we have to create tools that your average MBA can use. Before Excel was on every desktop, even basic statistics was the job of specialists. Now office workers routinely do moderately complex statistical analysis as part of their daily job - no data scientist required. Except for the most advanced problems, the issue for big data is less a lack of personnel than a lack of good tools. The good news is a lot of big data 2.0 vendors are working on fixing this, including ours - PatternBuilders.
    Andrew, are you going to be at Strata East? If so, let me buy you a beer.
    Andrew are you going to be at Strata East? If so let me buy you a beer.
  • Big Data is about the Data not about the Tools

    It is strange that a lot of discussions on Big Data end up being about the tools. But the issue is not whether Hadoop/MapReduce is better or worse than the RDBMS. The customer wanting to profit from Big Data knows he will need to pay a data expert, but he will not wait through months of development, and pay the costs incurred, before he can process a single byte of data. Neither of those tools is ready to market in that sense; however, there are flat-file-based ones out there that are, like Secnology or Splunk.