Hadoop and Big Data, "Stratafied"

As Hadoop moves beyond MapReduce, an Enterprise focus, in-memory technology and accessible machine learning are the next frontiers.
Written by Andrew Brust, Contributor

Today was the last day of Strata/Hadoop World in NYC, a show that just keeps growing.  If I gathered up all of the emails, press releases and briefing notes related to the event this year, I'd probably need a Hadoop cluster to chug through it.  Writing a post per news item would be impossible.  In fact, even a news roundup would likely end up being a laundry list of announcements, and I'm betting it would be pretty tedious to read.  

Far more valuable, and hopefully not too pretentious, would be to synthesize what I've heard, read and seen into a short list of trends that came out of the show and, in some ways, sum up where the analytics industry is right now.  So here goes...Strata/Hadoop World NYC 2013...in four simple themes.

Hadoop beyond MapReduce
The big news in the Hadoop world right before Strata was the general availability (GA) release of Hadoop 2.0.  This new version of Hadoop retains the capabilities of previous versions but removes one important requirement: using the two-pass, batch-driven MapReduce algorithm to process data.

MapReduce is good for some problem domains but lousy for many others...in fact, I have always thought it a bad fit for the majority of business analytics use cases.  But since MapReduce was the way to do things in the land of Hadoop, people and vendors made do, and learned how to fit various analytics square pegs into the MapReduce round hole.
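To make the square-peg point concrete, here's a toy, single-process sketch in Python (not Hadoop's actual Java API) of the map → shuffle → reduce shape that every job had to be contorted into, using the canonical word-count example:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all values by key (the framework does this in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["to be or not to be"])))
# counts -> {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Word counting fits this two-pass shape naturally; the trouble was that joins, iterative algorithms, and interactive queries do not, which is exactly the contortion the text describes.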

I don't think it's much of a stretch to say the MapReduce dependency has held Hadoop back.  And now that Hadoop 2.0 is out, we're going to see it go much more mainstream.  It will take a while, because the ecosystem around the YARN component of Hadoop (which makes non-MapReduce processing possible) has to develop, but once more products interface with YARN and a few open source killer projects around it emerge, Hadoop adoption will likely accelerate.

And we're already off to the races with the GA releases of Hortonworks Data Platform (HDP) 2.0 announced last week and that of Cloudera's Distribution of Apache Hadoop (CDH) 5.0 announced yesterday at Strata. Both distributions are based on the Hadoop 2.0 code base.

Over a year ago, I met an engineer at Microsoft who told me MapReduce would recede from dominance in the Hadoop world.  At the time, I thought he was overstating matters.  Now I am certain his assessment was actually quite understated.

Speaking of Microsoft, it used Strata as the forum to announce the GA release of its cloud-based Hadoop offering, HDInsight.  Microsoft's Hadoop distro is based on Hortonworks' HDP for Windows, the Hadoop 2.0-based version of which is not yet out.  It is expected to ship next month, though, and ostensibly it should find its way into HDInsight shortly thereafter.

Two words: In-memory
In-memory is actually an abused term, so I hesitate to use it to define a single category.  But I am going to do so anyway, because companies and products that self-identify under the label do in fact fall into a category, even if just attitudinally.

Let's start with SAP, a company that was well in evidence at Strata, and which continues to bang on the HANA drum.  I'm still rather skeptical of a model that would have me use RAM as the storage medium for my database...modern servers top out at about 256GB of RAM right now and even if that quadruples, it will still take 1,024 boxes to get to a petabyte, which seems unwieldy.  But SAP has a lot of ERP customers, and is migrating them to the HANA platform, giving HANA critical mass and ownership of valuable transactional data, the analysis of which is crucial to the business.  In other words, SAP is putting HANA in the middle of the action, which makes it a strategically important platform...regardless of its technical merits (or lack thereof).

So when SAP announces it's pushing ahead with a HANA-first strategy, that's news, and it blazes a trend's trail.  Add to that the new, HANA-based Customer Engagement Intelligence application suite announced by SAP, and its "hot, warm, cold" strategy of storing data in HANA, Sybase IQ and Hadoop, and you can discern the company's message: HANA's the crown jewel, the data warehouse is still important, and the best way to acknowledge Hadoop is to incorporate it in your stack...at the bottom of the hierarchy.

Other in-memory companies and products in evidence at Strata included GridGain and ScaleOut Software, whose products, among other tricks, can act as in-memory workspaces for Hadoop processing (which both companies say they accelerate immensely); Kognitio's Analytical Platform (the 8.1 release of which was announced at Strata today); and even a new capability in Cloudera's CDH 5.0: the ability to "pin" data in memory (something relational databases have offered for years now).  Then there's the upcoming version of Microsoft's flagship database, SQL Server 2014, which will include a new in-memory OLTP engine.

Now that I have conflated all these products though, let me tease them apart.  Kognitio is a mature product that uses memory not for data storage, but for processing.  It also compiles SQL queries into machine code, and the combination of machine-level code running against data in-memory can make things very fast indeed.  In fact, SQL Server's in-memory OLTP uses a similar strategy.

GridGain and ScaleOut Software combine in-memory processing with grid/cluster computing.  And, in a way, Hadoop processing is merely a bonus feature for both companies' products.  Each company's technology can work independently of Hadoop, and provide a lot of value on its own.

Cloudera's ability to pin data into memory is really just a spin on caching.  Normally, cached data is subject to being "flushed" from memory, and at somewhat indeterminate times.  Pinning allows a developer or database administrator to specify that certain data should be persisted in the cache and not flushed.  If you have a big enough cache, and you pin big chunks or all of your database, then technically you do have your data in-memory.  But that's quite different from working with products whose very architectures are built around the assumption of exclusive in-memory operation.
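The pin-versus-flush distinction can be illustrated with a toy cache.  The class below is an illustrative sketch of the semantics only, not Cloudera's or HDFS's actual implementation: pinned entries survive eviction, while unpinned entries are flushed in least-recently-used order when the cache is full.

```python
from collections import OrderedDict

class PinnableCache:
    """Toy LRU cache with pin support: pinned keys are never evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # insertion/recency order
        self.pinned = set()

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)  # mark as most recently used
        self._evict()

    def pin(self, key):
        self.pinned.add(key)

    def _evict(self):
        # Flush least-recently-used *unpinned* entries while over capacity
        while len(self.data) > self.capacity:
            for key in self.data:
                if key not in self.pinned:
                    del self.data[key]
                    break
            else:
                break  # everything is pinned; nothing can be evicted

cache = PinnableCache(capacity=2)
cache.put("a", 1)
cache.pin("a")      # "a" will survive any flush
cache.put("b", 2)
cache.put("c", 3)   # over capacity: unpinned "b" is flushed, pinned "a" stays
```

The point of the sketch: pinning doesn't change what a cache is, it only exempts chosen data from the flush, which is why pinning your whole database into a big enough cache is still architecturally different from a true in-memory-first design.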

Enterprise or bust
I used that phrase in the above subhead in my article on Cloudera's CDH 5.  It's clear that Cloudera has the Enterprise customer in its sights.  In general, this is the year when start-ups have to start making money if they are to survive, and Enterprise customers are the way to get there.

This means adding boring but necessary features to Hadoop stacks.  That's why Cloudera added memory pinning.  It's also why MapR announced a security beta on Monday at Strata, featuring HTTPS/certificate-based and Kerberos authentication, integrated with Active Directory and LDAP at the cluster level, in its own Hadoop distribution.  And it's why the SQL-on-Hadoop craze that started with the introduction of Cloudera's Impala at last year's Strata has resulted in most data industry players now offering comparable solutions.

The Enterprise drive also explains why MetaScale, a wholly owned subsidiary of Sears Holdings, offers strategy, advisory and implementation expertise around Hadoop for enterprises...and why, in many cases, it's helping companies move COBOL code to Apache Pig and old-school EBCDIC files to ASCII files in HDFS.  Perhaps that's not sexy, but it is a huge help to customers, addressing their pain points, reducing costs, speeding up jobs, and bringing legacy code — whose developers may well be retired — into a more modern language that works with file-based data.
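The EBCDIC-to-ASCII piece of that migration is mundane but concrete.  A minimal sketch, using Python's built-in codec for code page 37 (one common EBCDIC variant on IBM mainframes; the record content here is made up for illustration):

```python
# A mainframe record arrives as EBCDIC bytes; simulate one by encoding
ebcdic_bytes = "HELLO HADOOP".encode("cp037")

# The raw bytes are unreadable as ASCII: EBCDIC 'H' is 0xC8, not 0x48
assert ebcdic_bytes != b"HELLO HADOOP"

# Decode from EBCDIC to a normal string, ready to write out as ASCII/UTF-8
ascii_text = ebcdic_bytes.decode("cp037")
# ascii_text -> "HELLO HADOOP"
```

Real migrations also have to unpack fixed-width COBOL copybook layouts and packed-decimal fields, which is where the actual effort (and MetaScale's value) lies; the character-set conversion itself is the easy part.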

Self-service analytics
The next and last area to cover is that of data mining, machine learning and predictive analytics.  Yes, Revolution Analytics announced the release of version 7 of the Revolution R Enterprise product at Strata on Monday, but it goes beyond that.  I've been saying for a while that data scientists don't scale, and that we'll need to make analytics accessible by business users if the true benefit of modeling and predictive analytics is to pervade the business world.  Well, we now have a few start-ups in that game specifically.

SkyTree and Alpine Data Labs each offer products that provide graphical user interface front-ends for such analytics work.  The term "data scientist in a box" is sometimes applied to products like these, but I'd be more comfortable with the term "self-service" here.  Both products layer on top of Hadoop to get their processing done, but to a large extent that ends up being an implementation detail, as it should be.  Actian's ParAccel data platform, through integration of the DataRush product it acquired along with Pervasive Software earlier this year, now has its own DataFlow engine, which offers both machine learning/analytics and ETL (extract-transform-load) over Hadoop, and can even combine them in the same orchestration.

And though they were not at the event, I must point out Predixion Software if I am to cover the self-service analytics space comprehensively.  Predixion also provides a GUI over analytics, but does so with a twist: its native environment is Microsoft Excel, and it can work over many data sources, including Hadoop, data warehouse appliances, standard relational databases, and more.  Maybe that's why Accenture now uses Predixion as a standard tool in the Accenture Analytics Platform, and has invested in Predixion, too.

My Hadoop's all growed up
Predictive analytics in Excel?  Running Pig code that was ported from mainframe COBOL?  Getting away from the relatively obscure MapReduce programming skill set to do Hadoop work?  This stuff would have been hard to imagine at Strata two years ago.  Why is it happening?  Because Hadoop is maturing, and so are the companies behind it.
