It's happening: Hadoop and SQL worlds are converging

Summary: Guest blogger Tony Baer looks at the slew of Hadoop-related news coming out of multiple conferences, centering on its convergence with SQL in handling big data.

This guest post comes courtesy of Tony Baer's OnStrategies blog. Tony is senior analyst at Ovum.

By Tony Baer


With Strata, IBM IOD, and Teradata Partners conferences all occurring this week, it’s not surprising that this is a big week for Hadoop-related announcements. The common thread of announcements is essentially, “We know that Hadoop is not known for performance, but we’re getting better at it, and we’re going to make it look more like SQL.” In essence, Hadoop and SQL worlds are converging, and you’re going to be able to perform interactive BI analytics on it.

The opportunity and challenge of Big Data from new platforms such as Hadoop is that it opens up a new range of analytics. On one hand, Big Data analytics have updated and revived programmatic access to data, which happened to be the norm prior to the advent of SQL. There are plenty of scenarios where a programmatic approach is far more efficient, such as dealing with time series data or graph analysis to map many-to-many relationships.
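As an illustration of that efficiency argument, consider a moving average over sensor readings: a programmatic pass touches each value once, where classic SQL without window functions would need an awkward self-join. A minimal sketch, with hypothetical data:

```python
from collections import deque

def moving_average(readings, window=3):
    """Single-pass moving average over a time series.

    A programmatic approach streams through the data once; the SQL
    equivalent typically needs a self-join or window functions, which
    many engines of the era lacked.
    """
    buf = deque(maxlen=window)  # sliding window of the most recent values
    out = []
    for value in readings:
        buf.append(value)
        out.append(sum(buf) / len(buf))
    return out

print(moving_average([10, 12, 11], window=2))  # → [10.0, 11.0, 11.5]
```

The same single-pass style generalizes to trend detection or sessionization, which is exactly where programmatic frameworks shine.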

Big Data analytics also leverage in-memory data grids such as Oracle Coherence, IBM WebSphere eXtreme Scale, GigaSpaces and others, where programmatic development (usually in Java) has proved more efficient for accessing highly changeable data in web applications where traditional paths to the database would be I/O-constrained. Conversely, Advanced SQL platforms such as Greenplum and Teradata Aster have provided support for MapReduce-like programming because, even with structured data, a Java programmatic framework is sometimes a more efficient way to rapidly slice through large volumes of data.

Until now, Hadoop has not been for the SQL-minded. The initial path was to find someone to do data exploration inside Hadoop, then, once ready for repeatable analysis, ETL (or ELT) the data into a SQL data warehouse. That has been the pattern with Oracle Big Data Appliance (using Oracle loader and data integration tools) and with most Advanced SQL platforms; most data integration tools provide Hadoop connectors that spawn their own MapReduce programs to ferry data out of Hadoop. Some integration tool providers, such as Informatica, offer tools to automate parsing of Hadoop data. Teradata Aster and Hortonworks have been talking up the potential of HCatalog, in actuality an enhanced version of Hive with RESTful interfaces, cost optimizers, and so on, to provide a more SQL-friendly view of data residing inside Hadoop.

But when you talk analytics, you can’t simply write off the legions of SQL developers that populate enterprise IT shops. And beneath the veneer of chaos, there is an implicit order to most so-called “unstructured” data that is within the reach of programmatic transformation approaches, which in the long run could likely be automated or packaged inside a tool.

At Ovum, we have long believed that for Big Data to cross over to the mainstream enterprise, it must become a first-class citizen in IT and the data center. The early pattern of skunk-works projects, led by elite, highly specialized teams of software engineers at Internet firms solving Internet-style problems (e.g., ad placement, search optimization, customer online experience), does not match the problems of mainstream enterprises. Nor is the model of recruiting high-priced talent to work exclusively on Hadoop sustainable for most organizations. It means that Big Data must be consumable by the mainstream of SQL developers.

Making Hadoop more SQL-like is hardly new

Hive and Pig became Apache Hadoop projects because of the need for SQL-like metadata management and data transformation languages, respectively; HBase emerged because of the need for a table store to provide a more interactive face – although, as a very sparse, rudimentary column store, it does not provide the efficiency of an optimized SQL database (or the extreme performance of some columnar variants). Sqoop in turn provides a way to pipeline SQL data into Hadoop, a use case that will grow more common as organizations look to Hadoop to provide scalable and cheaper storage than commercial SQL. While these Hadoop subprojects did not exactly make Hadoop look like SQL, they provided the building blocks that many of this week’s announcements leverage.
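To see why Hive counts as a building block, it helps to remember what a SQL-like query compiles down to. Here is a toy sketch in plain Python of the map/shuffle/reduce phases behind a GROUP BY-style count (the data and function names are illustrative, not Hive internals):

```python
from collections import defaultdict

def map_phase(records):
    # Emit a (key, 1) pair per record -- the "map" side of a GROUP BY count.
    for rec in records:
        yield rec["page"], 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, val in pairs:
        groups[key].append(val)
    return groups

def reduce_phase(groups):
    # Sum the counts for each key -- the "reduce" side.
    return {key: sum(vals) for key, vals in groups.items()}

# Toy records standing in for rows a Hive query would read from HDFS.
logs = [{"page": "/home"}, {"page": "/buy"}, {"page": "/home"}]
print(reduce_phase(shuffle(map_phase(logs))))  # → {'/home': 2, '/buy': 1}
```

Layers like Hive exist so that a SQL developer writes the declarative query and never sees these phases; the announcements below are about tightening that disguise.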

Progress marches on

One train of thought is that if Hadoop can look more like a SQL database, more operations could be performed inside Hadoop. That’s the theme behind Informatica’s long-awaited enhancement of its PowerCenter transformation tool to work natively inside Hadoop. Until now, PowerCenter could extract data from Hadoop, but the extracts would have to be moved to a staging server where the transformation would be performed for loading to the familiar SQL data warehouse target. The new offering, PowerCenter Big Data Edition, now supports an ELT pattern that uses the power of MapReduce processes inside Hadoop to perform transformations. The significance is that PowerCenter users now have a choice: load the transformed data to HBase, or continue loading to SQL.

There is growing support for packaging Hadoop inside a common hardware appliance with Advanced SQL. EMC Greenplum was the first out of the gate with DCA (Data Computing Appliance), which bundles its own distribution of Apache Hadoop (not to be confused with Greenplum MR, a software-only product that is accompanied by a MapR Hadoop distro).

Teradata Aster has just joined the fray with Big Analytics Appliance, bundling the Hortonworks Data Platform Hadoop; this move was hardly surprising given their growing partnership around HCatalog, an enhancement of the SQL-like Hive metadata layer of Hadoop that adds features such as a cost optimizer and RESTful interfaces that make the metadata accessible without the need to learn MapReduce or Java. With HCatalog, data inside Hadoop looks like another Aster data table.

Not coincidentally, there is a growing array of analytic tools that are designed to execute natively inside Hadoop. For now they come from emerging players like Datameer (providing a spreadsheet-like metaphor, and which just announced an app store-like marketplace for developers), Karmasphere (providing an application development tool for Hadoop analytic apps), or a more recent entry, Platfora (which caches subsets of Hadoop data in memory with an optimized, high-performance fractal index).

Yet, even with Hadoop analytic tooling, there will still be a desire to disguise Hadoop as a SQL data store, and not just for data mapping purposes. Hadapt has been promoting a variant where it squeezes SQL tables inside HDFS file structures – not exactly a no-brainer as it must shoehorn tables into a file system with arbitrary data block sizes. Hadapt’s approach sounds like the converse of object-relational stores, but in this case, it is dealing with a physical rather than a logical impedance mismatch.

Hadapt promotes the ability to query Hadoop directly using SQL. Now, so does Cloudera. It has just announced Impala, a SQL-based alternative to MapReduce for querying the SQL-like Hive metadata store, supporting most but not all forms of SQL processing (based on SQL-92; Impala lacks triggers, which Cloudera deems low priority). Both Impala and MapReduce rely on parallel processing, but that’s where the similarity ends. MapReduce is a blunt instrument, requiring Java or other programming languages; it splits a job into multiple concurrent, pipelined tasks, each of which reads data, processes it, writes it back to disk, and then passes it to the next task.
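The disk-bound, stage-by-stage pattern described above can be sketched as follows. This is an illustrative toy, not the Hadoop API, with local JSON files standing in for HDFS:

```python
import json
import os
import tempfile

def run_stage(func, in_path, out_path):
    """One 'job' in a chained pipeline: read everything from durable
    storage, process it, and materialize the full output before the
    next stage may start -- the I/O pattern that makes chained
    MapReduce jobs slow relative to an MPP engine that streams rows
    between operators in memory."""
    with open(in_path) as f:
        data = json.load(f)          # read from "HDFS"
    result = func(data)              # process
    with open(out_path, "w") as f:
        json.dump(result, f)         # write back to "HDFS"

with tempfile.TemporaryDirectory() as d:
    src, mid, dst = (os.path.join(d, n) for n in ("in", "mid", "out"))
    with open(src, "w") as f:
        json.dump([3, 1, 2], f)
    run_stage(sorted, src, mid)                          # job 1: sort
    run_stage(lambda xs: [x * x for x in xs], mid, dst)  # job 2: square
    with open(dst) as f:
        print(json.load(f))  # → [1, 4, 9]
```

Each intermediate file here stands for a full round trip to disk between jobs, which is exactly the overhead Impala's in-memory, operator-to-operator approach avoids.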

Conversely, Impala takes a shared-nothing, MPP approach to processing SQL jobs against Hive; using HDFS, Cloudera claims roughly 4x performance over MapReduce; if the data is in HBase, Cloudera claims performance multiples of up to a factor of 30. For now, Impala only supports row-based views, but with columnar storage (on Cloudera’s roadmap), performance could double. Cloudera plans to release a real-time query (RTQ) offering that, in effect, is a commercially supported version of Impala.

By contrast, Teradata Aster and Hortonworks promote a SQL MapReduce approach that leverages HCatalog, an incubating Apache project that is a superset of Hive and that Cloudera does not currently include in its roadmap. For now, Cloudera claims bragging rights for performance with Impala; over time, Teradata Aster will promote the manageability of its single appliance, and with that appliance it has the opportunity to counter with hardware optimization.

SQL/programmatic convergence

Either way – and this is of interest only to purists – any SQL extension to Hadoop will be outside the Hadoop project. But again, that’s an argument for purists. What’s more important to enterprises is getting the right tool for the job – whether it is the flexibility of SQL or raw power of programmatic approaches.

SQL convergence is the next major battleground for Hadoop. Cloudera is for now shunning HCatalog, an approach backed by Hortonworks and partner Teradata Aster. The open question is whether Hortonworks can instigate a stampede of third parties to overcome Cloudera’s resistance. It appears that beyond Hive, the SQL face of Hadoop will become a vendor-differentiated layer.

Part of the convergence will involve a mix of cross-training and tooling automation. Savvy SQL developers will cross-train to pick up some of the Java or Java-like programmatic frameworks that will be emerging. Tooling will help lower the bar, reducing the degree of specialized skills necessary.

And for programming frameworks, in the long run, MapReduce won’t be the only game in town. It will always be useful for large-scale jobs requiring brute force, parallel, sequential processing. But the emerging YARN framework, which deconstructs MapReduce to generalize the resource management function, will provide the management umbrella for ensuring that different frameworks don’t crash into one another by trying to grab the same resources. But YARN is not yet ready for primetime – for now it only supports the batch job pattern of MapReduce. And that means that YARN is not yet ready for Impala or vice versa.

Of course, mainstreaming Hadoop – and Big Data platforms in general – is more than just a matter of making it all look like SQL. Big Data platforms must be manageable and operable by the people who are already in IT; they will need some new skills and will grow accustomed to some new practices (like exploratory analytics), but the new platforms must also look and act familiar enough. Not all announcements this week were about SQL; for instance, MapR is throwing down a gauntlet to the Apache usual suspects by extending its management umbrella beyond the proprietary NFS-compatible file system that is its core IP to the MapReduce framework and HBase, making a similar promise of high performance.

On the horizon, EMC Isilon and NetApp are proposing alternatives promising a more efficient file system but at the “cost” of separating the storage from the analytic processing. And at some point, the Hadoop vendor community will have to come to grips with capacity utilization issues, because in the mainstream enterprise world, no CFO will approve the purchase of large clusters or grids that get only 10 – 15 percent utilization. Keep an eye on VMware’s Project Serengeti.

They must be good citizens in data centers that need to maximize resources (e.g., through virtualization and optimized storage); must comply with existing data stewardship policies and practices; and must fully support existing enterprise data and platform security practices. These are all topics for another day.



  • Big Data is trying to solve all the wrong problems

    The future is clearly declarative, so accessing data directly from programs makes no sense whatsoever in the modern world.

    SQL is a declarative language, but not a very good one and is a very poor implementation of the relational model.

    It would make much more sense to forget about Hadoop and concentrate on building a better relational language and DBMS.
    • Please explain how you scale up your relational DB?

      Take a volume of data as big as Google's, and please show how a relational DB can find the result set in a split second. Take your own relational DB and multiply the number of your DB servers by 10. Can you guarantee that the responsiveness will be 10x without doing a lot of work?

      Let's take a massive task which is not necessarily related to data retrieval / indexing: weather forecasting, genomics calculation, maybe even password cracking. Can your "better relational language and DBMS" do that?

      Unless you give some convincing arguments, I have to conclude that you have no idea about Big Data.
      • You don't understand what a relational DBMS is

        As the relational model is a purely mathematical model of how to represent data, saying it isn't scalable is about as sensible as saying long division isn't scalable because your only implementation is pencil and paper.

        In the relational model the logical and physical layers are strictly separated so that the physical implementation can be changed without affecting the logical layer. This is not the case for NoSQL/Big Data techniques, which have no logical model comparable to relational.

        NoSQL/Big Data are in no respect new. They are a step backwards to techniques that have already been shown to be flawed in theory and only halfway workable in practice with enormous effort and resources.

        In the old days there were rooms full of programmers developing reports. With NoSQL/Big Data presumably there will be rooms full of Java programmers developing reports - what a huge waste of company resources and human ingenuity.

        Big Data techniques are a dead man walking.
      • A further point about Google

        Google's search engine only needs to produce half way plausible results.

        I presume even Google doesn't use big data techniques for its bookkeeping: "Your request for last year's revenue produced the first 20 of 5,000,000 possible results, now pick the one that you like best". Or maybe they do and Google doesn't actually have any money?

        For any applications where precision is required maybe Google can get away with big data approaches basically because they have virtually endless resources to throw at the problem. Most companies don't have that kind of money and so consequently are not in the position to re-invent the data management wheel for every single application.
        • Big Data != Relational

          Hum ... I wonder if you are implying that Hadoop is trying to compete with Relational. They are totally different tools with no overlap.

          A relational DB is based on the premises of set theory. SQL is the high-level query language that allows applying the maths underneath. The domain of validity of the relational algebra requires a certain format of the data: structured in rows and columns, and normalized. Outside of that, SQL is either clumsy or powerless. I suppose you have already had your fair share of trouble writing queries on badly normalized tables or inefficient free-text search.

          Data cannot always be formatted in a nice table structure: web logs, user blog contents, etc. Parsing this massive data in real time to perform predictive analytics, trend detection, pattern recognition, etc. is completely beyond the capability of relational & SQL, even if you have the most powerful server and the best query language. Simply because these scenarios are outside of the applicable domain of relational algebra. Something like trying to use a bicycle to do the job of a submarine.

          Relational, SQL, and OLAP are so extremely good in their domains that Hadoop or Big Data solutions will never be able to compete with or replace them. However, as soon as you step into the Big Data area, SQL is a mosquito toy.
          • No, all data can be expressed as relations

            That's the whole strength of the relational model.

            If you can't express the data as relations (propositions from a logical perspective) then you cannot perform logical reasoning on them.

            Analytics to be meaningful must be based on logic, if not then you are going to get the wrong answers (though if your data is big enough and you have no query language then maybe no one will ever rumble you).

            What exactly is it about a web log that cannot be expressed as a relation? Surely it is just some relation between various attributes of HTTP calls - IP address, URL and so on. There is nothing that cannot be expressed as a relation there.

            The performance and scalability issue (as I have already explained) is a complete red herring. There is nothing inherently unscalable about the relational model.
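            For illustration, a minimal sketch of that claim (the log format, regex, and field names are illustrative assumptions, not from the thread): an Apache-style common-log line is just a tuple of attributes, i.e., a row in a relation.

```python
import re

# Hypothetical parser: a common-log line becomes a row in a relation
# with attributes (ip, timestamp, request, status, bytes).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<req>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def to_row(line):
    m = LOG_RE.match(line)
    return (m["ip"], m["ts"], m["req"], m["status"], m["bytes"])

line = '127.0.0.1 - - [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
print(to_row(line))
# → ('127.0.0.1', '10/Oct/2012:13:55:36 -0700', 'GET /index.html HTTP/1.0', '200', '2326')
```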
    • Re: accessing data directly from programs makes no sense whatsoever in the

      Let me guess: does your job ti‌tle have the word "excellence" in it?

      (And yes folks, you can't even say "ti‌tle" without triggering the fuc‌king profanity filter now!)
      • Well

        Using systems programming languages like Java, C#, Python or PHP for application programming doesn't make much sense.

        None of these languages has a long term future in business situations.
      • Nice one, ldo17

        I've been nailed 3 times this morning with nothing even close to profanity in my posts. You however, managed to get actual profanity through the filter.
        Rabid Howler Monkey
      • Perhaps your use of profanity

        is a clear signal that you don't have any well reasoned arguments on the subject?
  • No mention of Microsoft

    The partnership with Hortonworks and release of Hadoop on Azure and Windows with Hive support into Excel is going to be one of the biggest enablers for Data Science / Big Data to go mainstream.
  • Marrying SQL to NoSQL database management systems makes sense

    Most data scientists (in the broad sense) and most database engineers are fluent in SQL.

    P.S. There is plenty of room for a variety of database management systems in the marketplace. Relational and NoSQL aren't even close to the only ones (e.g., object, object-relational, XML). Pick the right tool for the job as no single model is always the best choice.
    Rabid Howler Monkey
    • Plenty of room in the marketplace

      but not very much room in logic and usability.

      Relational will be with us long after NoSQL, object-relational and XML are dead, precisely because it is based on a sound mathematical model and the others are not.

      NoSQL, object-relational and XML are revivals of long dead hierarchical and graph based methods of the 1960s and 1970s (under new marketing) that have already been shown to be flawed in theory and only half-way workable in practice with enormous resources and effort.
  • Sound mathematical models...

    ...are all well-and-good but they're not appropriate for every situation. Ever tried solving time-series problems with a relational database?