The fall Strata conference is when Big Data makes it to Broadway. And the week was very much a blur. We used to come away from Strata with the memory of one or two overriding themes; last year it was machine learning and the new infatuation with Spark, before that it was about Hadoop opening up the opportunity for exploratory analytics and for Hadoop to disappear behind a veneer of familiar SQL.
But life in the big city has gotten too complex. And so, while this year the parlor discussion seems to be about whether AI is taking over the world, the rest of us are still wondering about the basic blocking and tackling of making big data work: how to install and operate Hadoop, and how to deploy use cases on it.
The week after Strata, the issues that have hit us are, in no particular order: how Hadoop providers will accommodate the cloud; how to make streaming data and classic batch and interactive workloads respectful neighbors; how to govern the data; and how to make real people productive. None of these is a small order. We'll tackle the first couple of issues in this post.
The inevitability of cloud
Let's start with the cloud. Ovum clients have been telling us that Hadoop remains too difficult to implement. The next wave of Hadoop adopters is not likely to have the HPC/grid computing backgrounds of the pioneers, and neither will they have the expertise for handling scale-out cluster deployments.
Aside from Amazon Elastic MapReduce (EMR), on-premises deployment still predominates, as most Hadoop providers report only about 15 to 20 percent of their paid base deploying in the cloud. While we don't expect that new Hadoop deployments will hit the 50 percent tipping point for cloud next year, the writing's still on the wall.
MapR and Hortonworks have long had strategic OEM deals with AWS and Microsoft Azure, respectively. While Hadoop platform providers have long coexisted with Amazon, its dominance of the cloud business has made its own EMR Hadoop implementation the elephant in the room.
So the challenge to Cloudera and Hortonworks is making their Hadoops better on Amazon than EMR. They must make deployment as straightforward as EMR, support the same pricing options, and then add the premium stuff: better performance and more granular security.
At Strata, Cloudera announced streamlined deployment on AWS and, significantly, integration of Impala with AWS S3 storage, which it claims is 2x cheaper and 10x faster than Redshift on EBS block storage (Redshift doesn't run on S3). Meanwhile, Hortonworks just updated its technical preview for a data cloud implementation on AWS supporting on-demand deployment that we expect will go GA soon.
No silver bullet for deployment
Another takeaway from Strata was that Hadoop customers now face stark deployment choices that in some cases may drive their platform decisions. The versatility of Hadoop in the YARN era and emergence of new engines like Spark means that there is no default option for how you're going to handle the variety of workloads. It's a far cry from the days when you only ran MapReduce and the Hadoop distros varied primarily by what version of Hive was supported.
Those choices were reflected in debates we heard on the expo floor from vendors on whether to centralize or distribute your deployment.
For starters, there is the choice of whether to run batch and real-time workloads together or as separate tiers.
Conventional wisdom has favored the Lambda Architecture, which splits incoming data streams to populate separate batch and real-time ("speed") targets that generate materialized views. Lambda, first developed by Nathan Marz while he created Apache Storm at Twitter, was premised on the idea that real-time and batch workloads would prove noisy neighbors.
But today, the founders of Confluent, who created the distributed PubSub messaging engine Apache Kafka, are now promoting the Kappa architecture. It's premised on the assumption that processing speeds, bandwidth, and the availability of high-performance in-memory or Flash storage have changed the equation to the point where you can generate either view from the same target. They claim that flattening the architecture to eliminate the duplicate tiers simplifies deployment, reducing the likelihood of errors from data views and code bases that may have fallen out of sync.
Kappa leverages compute engines like Spark, which supports batch and real-time processing in the same code base, and in-memory databases like MemSQL (which announced exactly-once processing of streaming data at Strata). The choice of Kappa vs. Lambda may hinge on whether the code used for streaming and offline (batch) model development is the same or different.
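To make the distinction concrete, here is a minimal Python sketch of the Kappa idea, under purely illustrative assumptions (the function and event names are hypothetical, not drawn from any of the products above): one aggregation function serves both the batch replay and the live stream, whereas Lambda would maintain two separate code paths that must be kept in sync.

```python
# Hypothetical sketch of the Kappa principle: identical logic for
# bounded (batch) and unbounded (streaming) inputs.

def count_by_key(events):
    """Aggregate event counts by key -- one code path for both views."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
    return counts

# Batch view: a full replay of historical events from storage.
batch_view = count_by_key(["click", "view", "click"])

# Speed view: the same function consuming a live iterator/stream.
speed_view = count_by_key(iter(["click", "view", "click"]))

assert batch_view == speed_view == {"click": 2, "view": 1}
```

The point of the sketch is the single `count_by_key`: under Lambda, the batch layer and the speed layer would each implement this logic separately, and the two implementations can drift apart.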
But the question of Kappa vs. Lambda may be child's play compared to the competing positions of Hortonworks and MapR: Hortonworks promotes a Connected Data Architecture while MapR is focusing on a Converged Data Platform. The differences are more than just plays on words starting with the letter 'C.' It's all about whether the PubSub (publish/subscribe) pipeline is subsumed into the data platform or operated at arm's length. There's something ironic in the fact that PubSub is hardly a new technology, yet it's the one driving real architectural differentiation in Hadoop.
Hortonworks contends that diverse workloads and distributed data sources, such as from the Internet of Things, will demand loosely-coupled platforms separating management of data in motion from data at rest. Its Hortonworks DataFlow product steps into the space of directing and managing how and where data streams land, which could include the Hortonworks Data Platform or any other target.
By contrast, MapR integrates Hadoop, Spark, and its own dataflow engine (MapR Streams) on the same cluster, promising a more compact footprint and superior performance -- it claims that its own PubSub streaming engine is far more scalable than Kafka. Meanwhile, Cloudera takes the more distributed route, embracing Kafka as the data flow pipeline.
And by the way, the Hadoop folks are not the only ones getting in on the PubSub fun; Teradata is promoting Listener for managing the distribution and ingestion of data in motion to Hadoop or its own data warehouse platform targets.
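The architectural property all of these products trade on is the same one: a PubSub broker decouples the producer from its consumers. A toy Python sketch of the pattern, with entirely hypothetical names (this is not Kafka, MapR Streams, or Listener code), shows why the publisher never needs to know whether a Hadoop sink, a warehouse sink, or both are listening:

```python
from collections import defaultdict

class Broker:
    """Toy in-process publish/subscribe broker (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # Any number of downstream targets can register for a topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # The producer fans out to whoever subscribed; it has no
        # knowledge of, or coupling to, the targets themselves.
        for handler in self._subscribers[topic]:
            handler(message)

broker = Broker()
hadoop_sink, warehouse_sink = [], []
broker.subscribe("clickstream", hadoop_sink.append)
broker.subscribe("clickstream", warehouse_sink.append)

broker.publish("clickstream", {"user": 42, "page": "/home"})
assert hadoop_sink == warehouse_sink == [{"user": 42, "page": "/home"}]
```

Whether that broker lives inside the data platform (MapR's converged approach) or sits between platforms (Hortonworks' connected approach, or Teradata Listener) is exactly the architectural fork described above.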
What's significant is not that the three principal Hadoop platforms differ -- as of Hadoop 2.x, they have had variants of the same platform with different mixes of projects or modules (open source and/or proprietary) that largely performed the same mix of security, platform management, and housekeeping tasks.
Instead, what's significant is that streaming has given each player a stake in the ground for products that are now architecturally distinct. This is no longer your parents' Hadoop.
This is the first of two posts reviewing our take on Strata Fall 2016. Our next post will explore our impressions of how AI is driving the big data agenda, and the hurdles that data scientists continue to face in working with Hadoop.