It's a bit of an adage in the software world that when a product gets to its third version, it really hits its stride. First versions are usually what we now call minimum viable product (MVP) releases; 2.0 releases typically add enough functionality to address some of the more egregious v1 pain points. But 3.0 releases often attend to fit and finish, and tend to bring one or two important new feature sets.
Such is the case with version 3.0 of Hortonworks Data Platform (HDP), being announced this morning at Hortonworks' DataWorks Summit in San Jose, CA. HDP 3.0 is itself based on version 3.1 of Apache Hadoop, which does indeed include important new areas of functionality.
The elephant in the container
The bit that may grab the most headlines is that jobs dispatched to Hadoop's YARN resource manager can now consist of entire Docker container images. While YARN has had its own container format for some time, that's been more about a code and dependencies packaging format than a full machine environment format like Docker.
Among other things, dispatch of Docker images means that code that relies on particular versions of certain software (for example, a specific version of Python) can be assured of running well, even if the developer has no control over, or insight into, what's installed on the Hadoop cluster's worker nodes.
Bear in mind, Hadoop's (and HDP's) support for Docker isn't designed to turn Hadoop into a generic high-performance environment for executing arbitrary code. Nor does Docker support imply Kubernetes container orchestration support, at least not yet. Instead, Docker support assures dynamic control over runtime environments for the kind of jobs Hadoop has always run.
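To make the idea concrete, here is a minimal sketch of what launching a Dockerized job on YARN can look like, using Hadoop's bundled DistributedShell application. This is a hypothetical illustration, not from the announcement: the jar paths and the image name are placeholders, and the cluster's yarn-site.xml and container-executor.cfg must already have the Docker runtime enabled.

```shell
# Hypothetical sketch: run one Docker container on YARN via DistributedShell.
# Jar paths and image name are placeholders; Docker support must be
# enabled in the cluster's YARN configuration first.
yarn jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar \
  -jar /usr/hdp/current/hadoop-yarn-client/hadoop-yarn-applications-distributedshell.jar \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=library/python:3.6 \
  -shell_command "python3 --version" \
  -num_containers 1
```

Because the Python version is baked into the image, the job runs the same way regardless of what is (or isn't) installed on the worker nodes.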
HDP 3.0 also includes support for GPUs (graphics processing units) in execution of Hadoop jobs involving Deep Learning and other AI workloads, as well as enhanced security and governance capabilities, based on the Apache Ranger and Atlas projects.
Hive 3.0: The bee goes columnar
As cool as container technology is today, Hadoop has always been about getting work done, and much of that work has been around aggregation/summarization of massive data sets. A lot of that work has been delegated to Apache Hive, the original SQL-on-Hadoop component included in most Hadoop distributions, including HDP.
But Hive -- whether relying on MapReduce initially or, more recently, on its Apache Tez integration and the LLAP ("Live Long and Process" or, sometimes, "Low Latency Analytical Processing") implementation -- has been, in a word, slow. Compared to most data warehouse and OLAP (OnLine Analytical Processing) technologies, Hive just hasn't felt fast enough to support truly interactive data exploration. And that has engendered competitors, like Spark SQL and Apache Impala. It's often felt like magic would be required to make Hive fast enough for Business Intelligence (BI) workloads.
But HDP 3.0 includes Hive 3.0 and the latter now features integration with Apache Druid, a column store data access and storage system geared towards BI/OLAP querying of time series data. Now Hive users can believe in magic, as this integration looks to be a real win-win: Hive gains an interactive column store BI engine and Druid gains a SQL query abstraction over its heretofore exclusively JSON + REST API interface. Druid also gains the ability to use Hive to generate indexes instead of having to use MapReduce jobs for that task.
Druid tables in Hive 3.0 are external tables, so the integration avoids an architecture reliant on the inefficiencies of data movement. Hive will also push down as much of the query as it can to Druid itself. And while we didn't necessarily need more complexity in the SQL-on-Hadoop world, anything that makes Hive live up to its self-proclaimed role as a Hadoop-based data warehouse platform could ultimately bring some simplicity to the Hadoop world.
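As a rough illustration of how the integration surfaces to users, Hive's Druid storage handler lets an existing Druid datasource be mapped in as an external table and then queried in ordinary SQL. The datasource name below is a placeholder, and the storage handler must be configured on the cluster; this is a sketch of the pattern, not a recipe from the release notes.

```sql
-- Hypothetical sketch: expose an existing Druid datasource (the name
-- "wiki_edits" is a placeholder) as an external Hive table.
CREATE EXTERNAL TABLE wiki_edits
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "wiki_edits");

-- Time-bucketed aggregations like this are candidates for pushdown
-- to Druid itself, rather than being computed in Hive.
SELECT `__time`, SUM(added)
FROM wiki_edits
GROUP BY `__time`;
```

Because the table is external, the data stays in Druid; Hive supplies the SQL front end and pushes down what it can.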
Do-si-do your partner
Beyond the new HDP release, Hortonworks has another 3.0 under its hat, in the form of three partnership announcements -- with Microsoft, Google and IBM -- all of them cloud-focused.
Let's start with Microsoft, the company most often cited in reference to the version 3 effect. The two companies are promoting the availability of Hortonworks' three distributions -- HDP, HDF (Hortonworks DataFlow) and DPS (Hortonworks DataPlane Service) -- on Microsoft's Azure IaaS (Infrastructure as a Service) offering.
This is somewhat counter-intuitive, given that HDInsight, Microsoft's PaaS (Platform as a Service) Hadoop offering, is actually an HDP derivative. Ultimately, it means that Hortonworks' cloud go-to-market initiatives will be based around its own first-party distributions, and Microsoft gets to tout customer choice.
Speaking of choice, while the above announcement means that HDP, HDF and DPS are now available on Azure as well as Amazon Web Services (AWS), onboarding to the Google Cloud Platform (GCP) is in Hortonworks' best interest, especially given the "three" theme. And that very onboarding is being announced by Hortonworks today, with the availability of HDP and HDF on GCP. The integration will include more than just the availability of Hortonworks' technology, though: it also includes native access to Google Cloud Storage from Hadoop jobs on HDP, which joins similar support for Amazon Simple Storage Service (S3) and Azure Blob Storage.
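In practice, "native access" of this sort means Hadoop tooling can address cloud object storage through a filesystem URI scheme, alongside HDFS. A hypothetical sketch, assuming the Google Cloud Storage connector is configured and using placeholder bucket and path names:

```shell
# Hypothetical sketch: with the GCS connector configured, Hadoop tools
# address Google Cloud Storage via gs:// URIs, just as they use s3a://
# for Amazon S3 or wasb:// for Azure Blob Storage. Names are placeholders.
hadoop fs -ls gs://example-bucket/input/
hadoop distcp hdfs:///data/events gs://example-bucket/backup/events
```

The same jobs that read and write HDFS paths can then target cloud storage with no code changes beyond the URI.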
The third and final announcement involves one more three-letter acronym: IBM. Big Blue is announcing, in its own blog post, a brand new service called IBM Hosted Analytics with Hortonworks (IHAH). The service -- with a fittingly four-letter acronym, for an offering on Hortonworks' fourth public cloud -- will combine HDP, IBM's Db2 Big SQL and the IBM Data Science Experience, an AI-oriented offering.
Hadoop is in the house
Hadoop has been a dirty word, of sorts, over the last year, but it ought not be so. While the industry focuses its hype machine on AI, core analytics tasks are still the bread and butter of the Enterprise. Bringing Hive up to snuff as an interactive engine on which BI tools can carry out those workloads is an important development -- one that would be ignored at the observer's peril. And modernizing the underlying platform to accommodate containerization and GPU execution shows Hadoop is keeping up with the Big Data (and AI) Joneses.
A lot of companies have made big investments in Hadoop. Now Hortonworks -- the company formed from the spin-off of the original Hadoop development team at Yahoo -- is optimizing Hadoop to help customers realize better returns on those investments. That's a significantly positive development, for the Hadoop ecosystem, and for the data world overall.