Why Hadoop is hard, and how to make it easier

Hadoop is here to stay. But it's mature analytics tools for Hadoop, DBMS abstraction layers over it and Hadoop-as-a-Service cloud offerings that will make the open source Big Data platform actionable.
Written by Andrew Brust, Contributor

As my colleague Toby Wolpe wrote earlier today, Gartner has released a survey of its Research Circle members showing that corporate adoption of Hadoop hasn't kept up with the hype.

First of all, to use a technical term: "no duh." For almost any new technology, there's typically a big differential between what the tech journalists and analysts are implying everybody's doing with that technology and what...everybody's doing with that technology.

Second, while the context is that Gartner's survey found that -- to use Toby's wording -- "Just 26 percent are already deploying, piloting, or experimenting with Hadoop" (emphasis mine), I happen to think that's a very promising number. In fact, I would have guessed something a bit lower. Why? Because Hadoop's legacy is that of a specialist's tool, not an Enterprise tool. That's changing, but the process isn't done yet. With that in mind, 26% penetration is pretty good, and it's going to get better.

Hadoop and the mainstream database
Last week, at Microsoft's Ignite conference, the Redmond-based software giant announced the upcoming release of SQL Server 2016 (see Mary Jo Foley's day-and-date coverage here), the next version of its flagship relational database management system (RDBMS). A big part of that announcement was that PolyBase, which serves as a bridge from SQL Server to Hadoop, will be available in the mainstream release of SQL Server, rather than only in the Analytics Platform System release and the cloud-based Azure SQL Data Warehouse (which itself was only announced the week prior).

In other words, Microsoft is taking the ability to map data stored in the Hadoop Distributed File System (HDFS) as external tables in SQL Server and making it available as a feature to enterprise RDBMS customers. Bear in mind, SQL Server is one of the top RDBMSes on the market in terms of units installed and revenue. Giving everyone in that very large ecosystem access to data in Hadoop via their existing skill sets (i.e. the Transact-SQL query and programming language) is a pretty big deal.

Opposing view
It's also a counterpoint to the interpretation of Gartner's survey that says Hadoop is somehow languishing. What's languishing is the Enterprise's willingness to invest in a new, premium skillset and to put up with the low productivity involved in working with Hadoop through its motley crew of command-line shells and scripting languages. A good data engine should work behind the scenes and under the covers, not in the spotlight. Microsoft's SQL Server PolyBase technology is but one architectural approach to making Hadoop a workhorse instead of something with which customers need to get up close and personal.

There are other approaches to this, both to implementing a Hadoop cluster itself and to working with it. Companies like Qubole and Altiscale address the former and, to a less abstracted extent, so do Amazon Web Services, Microsoft and Google. Other products and tools address the Hadoop front-end, sometimes with a SQL interface and sometimes without.

Hadoop is here, it's real, get used to it
Storing data in HDFS can be very compelling economically. In many ways, HDFS is Hadoop's killer app. If for no other reason than that, Hadoop is here to stay. But it's mature analytics tools for Hadoop, DBMS abstraction layers to Hadoop, and Hadoop-as-a-Service cloud offerings that will make Hadoop actionable for the majority of technology users. Making them go to a character-mode terminal window isn't going to cut it anymore.
