A set of announcements from the Apache Software Foundation (ASF) this week provides an interesting view into the world of Big Data and how it is changing. There's news out of the ASF every week, but this week's combination of announcements caught my eye. Two projects -- related to data governance and optimized columnar storage -- pushed out new releases, while an early project related to Hadoop in the cloud was retired.
Governance, by the people
First, the details. Atlas, an Apache Incubator project at the heart of Hortonworks' Data Governance Initiative, announced its 0.5 release. The details of the deliverable make it clear that 0.5 is a minor, mostly administrative release. But the halfway version number and the move over to the ASF source code repository ("repo") are hallmarks of something important. The collection of Hadoop-related ASF projects now includes one that, according to its overview statement, focuses on "enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem."
Things like metadata and master data management, cross-stack security and data lineage were practically scoffed at a couple of years ago. They were considered dull enterprise niceties of little import relative to the power of Hadoop and Big Data overall. But now, as Hadoop has become an early-stage mainstream technology and is fighting hard to get to the next level, where it is used in mission-critical projects, attributes like manageability and auditability have come front and center.
Columns as pillars
So too has performance, not just at the macro level of scale-out, but also for query performance on individual nodes in the cluster. The data warehousing world has valued and invested in columnar storage and processing -- where values for a given data column are stored and processed together -- for more than a decade. Since most analytics work involves aggregation of a few selected columns, this works much better than storing all row values together, and having to skip over most of them as data is read in. Columnar technology is now a big deal for the Hadoop Distributed File System (HDFS) and for just about every processing engine that is compatible with it.
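To make the row-versus-column tradeoff concrete, here is a minimal sketch in plain Python. The table, column names and values are invented for illustration; this is the conceptual layout behind columnar formats, not an actual HDFS or Parquet API.

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# All data below is made up for demonstration purposes.

# Row store: each record keeps all of its field values together.
rows = [
    {"user_id": 1, "country": "US", "revenue": 120.0},
    {"user_id": 2, "country": "DE", "revenue": 80.0},
    {"user_id": 3, "country": "US", "revenue": 45.5},
]

# Summing one column means touching every field of every row.
total_row_store = sum(r["revenue"] for r in rows)

# Column store: each column's values live together, so an aggregate
# reads only the one column it needs and skips the rest entirely --
# the core idea behind columnar formats like Parquet.
columns = {
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "revenue": [120.0, 80.0, 45.5],
}
total_col_store = sum(columns["revenue"])

assert total_row_store == total_col_store == 245.5
```

On disk the difference is even starker: a real columnar file can compress and scan each column independently, so a query over two columns of a hundred-column table reads roughly two percent of the data.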
And so the Apache Parquet project, which brings a columnar file format to HDFS, and which thus helps columnar engines like Cloudera's Impala and even Apache Hive work more efficiently, has been one to watch. With support from Cloudera, Twitter and even Hadoop-independent projects like Apache Drill and Spark, Parquet has become a very important standard, industry-wide.
The new 1.8.0 release of Parquet MR (a set of Java libraries for working with Parquet files) is really just a maintenance release addressing two bugs. But those bugs were causing corruption in Parquet files, and unless bugs like that are addressed quickly, confidence in -- and adoption of -- the format could suffer non-trivially.
Cloud clusters, via script
Back in 2012, when I was new to covering Big Data, I interviewed folks at Cloudera for the first time. Somewhat naively, I asked Todd Lipcon, the Cloudera engineer I chatted with, why a company whose name started with the word "cloud" offered a product that was essentially designed for use on-premises. The answer Lipcon shared with me, some three and a quarter years ago, was to use an open source tool called Whirr.
An ASF project, Whirr was introduced to automate the deployment of Hadoop nodes on cloud-based infrastructure, through an API that was intended to be cloud platform-independent. That hardly seemed elegant even then. And now that we have Hadoop-as-a-service companies like Qubole and Altiscale, and simple cloud marketplace-based deployment of the major Hadoop distributions on Amazon EC2 and Microsoft Azure, Whirr has become obsolete. As such, the project has officially been retired to the ASF "Attic." That means it can still be used, but it won't be developed any further.
It's not you, it's me
We'll miss Whirr, and a simpler time (all the way back in 2012) when data governance and node-level query performance weren't so important. But if we want Hadoop, and schema-variable Big Data analytics, to become commonplace, we need to move on. This week, with a few ASF announcements, it seems the Big Data world is doing just that.