This being Hadoop's tenth birthday (and Hortonworks' fifth), it's not surprising that both platform and company have grown a fair bit -- and of course still have some growing up to do.
The platform itself has come a long way. In its earliest days, Hadoop was defined simply as storage (HDFS) and compute (MapReduce). Today's platform has dozens of core and competing open source components addressing many of the housekeeping features associated with databases, from operations management to security, data protection, and data governance.
And Hortonworks has come a long way from its days as a single-product, pure open source company.
Open source has become the default delivery model for emerging data platforms, as we'll discuss in an upcoming post. But the pure open source model, as espoused by Hortonworks, has been rare because it raises the question of where the vendor's unique IP lies. For Hortonworks, the answer has been that it has, depending on who's counting, the largest block of committers to the Apache Hadoop community of projects.
They won't admit it, but of late Hortonworks is looking a lot more like its rivals, Cloudera and MapR, in offering content that is vendor-specific. That's actually a good thing, especially if you're a customer who's looking to implement a data lake, and who wants assurance that your technology provider will have the unique IP (and business sense) to be a long-term player.
The first cracks in the wall come via OEM arrangements that Hortonworks now has with AtScale, Syncsort, and Pivotal for data warehouse optimization use cases. Hortonworks is reselling AtScale to provide an OLAP face to Hadoop for improving the performance of BI query and reporting; Syncsort DMX-h for ETL processing; and Pivotal's HAWQ interactive SQL technology (this one actually just became open source). The resale strategy makes sense given that data warehouse optimization is a mature market with an identifiable and sufficiently sizable target base. More debatable are features such as SmartSense, which surfaces cluster health statistics in Ambari and is available only through a Hortonworks Data Platform subscription.
But as an enterprise customer, you won't care which open source model your technology provider has; you care whether their business model is viable.
And reflecting Hadoop's growing maturity as an enterprise platform, the key themes for enhancements unveiled at Hadoop Summit were connecting the dots on data governance, performance improvement, and ease of use. Among the announcements, Hortonworks expanded the capability of Atlas, the data lineage tool, from support of Hive (where the data sits) to upstream ingest processes including Kafka (for message queuing) and Storm (for streaming). This means that data can be tagged in Atlas not just when it arrives in Hive, but at the point of ingest, if you use one of Hortonworks' supported streaming engines.
With Atlas providing the metadata for data lineage, Ranger can implement data security; just added are capabilities for dynamically masking columns and filtering rows in Hive to determine how people of different roles can and will see the data. In turn, Zeppelin, Hortonworks' entry into the crowded data scientist notebook space, now integrates credentials with Ranger to enforce access control for practitioners using Spark.
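To make the idea of role-dependent views concrete, here is a minimal sketch in plain Python of how column masking and row filtering combine. It does not use Ranger's API or policy format; the `POLICIES` table, the role names, the `mask_all_but_last4` helper, and the sample columns are all hypothetical and exist only for illustration.

```python
# Toy illustration only: this is NOT Ranger's API or policy model.
# Role names, columns, and helpers below are hypothetical.

def mask_all_but_last4(value):
    """Replace every character except the last four with '*'."""
    text = str(value)
    return "*" * max(len(text) - 4, 0) + text[-4:]


# Each role gets a row filter (which rows are visible) and a set of
# column masks (how sensitive columns are rewritten before display).
POLICIES = {
    "eu_analyst": {
        "row_filter": lambda row: row["region"] == "EU",
        "masks": {"ssn": mask_all_but_last4},
    },
    "auditor": {
        "row_filter": lambda row: True,  # sees every row
        "masks": {},                     # sees every column unmasked
    },
}


def apply_policy(rows, role):
    """Return the view of `rows` that the given role is allowed to see."""
    policy = POLICIES[role]
    visible = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row filtering: drop rows the role may not see
        masked = {}
        for column, value in row.items():
            mask = policy["masks"].get(column)
            # column masking: rewrite the value if a mask is defined
            masked[column] = mask(value) if mask else value
        visible.append(masked)
    return visible


if __name__ == "__main__":
    table = [
        {"name": "Ann", "region": "EU", "ssn": "123-45-6789"},
        {"name": "Bob", "region": "US", "ssn": "987-65-4321"},
    ]
    print(apply_policy(table, "eu_analyst"))
    print(apply_policy(table, "auditor"))
```

The point of the sketch is that the same query returns different results depending on who asks: the hypothetical analyst role sees only EU rows with the sensitive column obscured, while the auditor role sees everything.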
Hortonworks has also been working to drop ACID on Hive - although we're not talking about the strict ACID associated with transaction systems. In this case, we're talking about the ability to update and delete data in Hive. That's something that, until now, was only possible with MapR's underlying proprietary file system. The significance is not simply bragging rights, but reducing the overhead of updating Hive, especially when data is streaming in at a high rate. A technology preview was announced at the conference.
A related project to improve interactive query performance on Hive leverages an emerging in-memory caching technology branded LLAP (a term that will be familiar to Star Trek fans). It also includes fine-grained preemption capabilities to ensure that long-running batch jobs won't bottleneck higher-priority interactive query requests. Another related project is the new query server for Phoenix, the project to put a SQL face on HBase. The irony of the query server is that, while Phoenix was designed to make HBase friendlier to SQL, the new query server focuses on APIs for programming language alternatives such as C++, .NET, and Python.
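The preemption idea described for LLAP can be sketched with a toy priority scheduler. This is not how LLAP is implemented; it is a single-threaded simulation, assuming work can be split into equal units and that a running job can be paused between units. The function name `simulate`, the priority constants, and the job names are hypothetical.

```python
import heapq

# Toy simulation only: NOT LLAP's implementation.
# Lower number means higher priority.
INTERACTIVE, BATCH = 0, 1


def simulate(arrivals):
    """Run jobs one work unit per tick, always picking the highest-priority
    ready job, so a newly arrived interactive query preempts batch work.

    arrivals: list of (arrival_tick, priority, name, work_units)
    Returns the list of job names in the order their units were executed.
    """
    pending = sorted(arrivals)  # process arrivals in time order
    ready = []                  # heap of (priority, arrival_order, name, work_left)
    timeline = []
    tick = 0
    next_arrival = 0
    arrival_order = 0

    while next_arrival < len(pending) or ready:
        # Admit every job that has arrived by the current tick.
        while next_arrival < len(pending) and pending[next_arrival][0] <= tick:
            _, priority, name, work = pending[next_arrival]
            heapq.heappush(ready, (priority, arrival_order, name, work))
            arrival_order += 1
            next_arrival += 1

        if not ready:
            # Nothing to run yet: jump ahead to the next arrival.
            tick = pending[next_arrival][0]
            continue

        # Run one unit of the highest-priority job, then put it back if
        # unfinished. Re-queuing after each unit is what allows preemption.
        priority, order, name, work_left = heapq.heappop(ready)
        timeline.append(name)
        if work_left > 1:
            heapq.heappush(ready, (priority, order, name, work_left - 1))
        tick += 1

    return timeline


if __name__ == "__main__":
    jobs = [
        (0, BATCH, "nightly_etl", 4),    # long-running batch job starts first
        (1, INTERACTIVE, "bi_query", 2), # interactive query arrives mid-run
    ]
    print(simulate(jobs))
```

In this sketch the batch job gets one unit of work done, is paused while the interactive query runs to completion, and then resumes, which is the behavior the preemption feature is meant to guarantee.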
Maybe the impression is subjective, but making Hadoop a better governed place is a direct response to enterprises that are planning data lakes. By definition, data lakes are enterprise resources, much like their predecessors, enterprise data warehouses, and therefore need more capabilities that help you understand exactly what data is in there. In another post, we'll discuss data lake governance. Suffice it to say that, judging from the latest announcements from Hortonworks, Hadoop vendors are listening.