It's been no secret lately that Apache Hadoop, once the poster child of big data, is past its prime. But since April 1st, the Apache Software Foundation (ASF) has announced the retirement to its "Attic" of at least 19 open source projects, 13 of which are big data-related and 10 of which are part of the Hadoop ecosystem.
While the individual project retirement announcements may seem insignificant, taken as a whole, they constitute a watershed event. To help practitioners and industry watchers appreciate the full impact of this big data open source reorg, an inventory seems in order.
With that in mind, the list of big data-relevant retired Apache projects is as follows:
- Apex: a unified platform for big data stream and batch processing, based on Hadoop YARN
- Chukwa: a data collection system for monitoring large distributed systems, built on the Hadoop Distributed File System (HDFS)
- Crunch: a framework for writing, testing, and running MapReduce (including Hadoop MapReduce) pipelines
- Eagle: an analytics solution for identifying security and performance issues instantly on big data platforms, including Hadoop
- Falcon: a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery
- Hama: a framework for big data analytics that runs on Hadoop and is based on the Bulk Synchronous Parallel paradigm
- Lens: a unified analytics interface that integrates Hadoop with traditional data warehouses so that they appear as one
- Marmotta: an open platform for linked data
- Metron: focused on real-time big data security
- PredictionIO: a machine learning server for managing and deploying production-ready predictive services
- Sentry: a system for enforcing fine-grained authorization to data and metadata in Apache Hadoop
- Tajo: a big data warehouse system on Hadoop
- Twill: an abstraction over Hadoop YARN's distributed capabilities, with a programming model that is similar to running threads
The elephant in the room
The above list is a long one, and is part of a bigger list that includes non-big data projects as well. Clearly ASF is doing some housekeeping. Furthermore, Sentry and Metron have essentially been deprecated in favor of the comparable Ranger and Spot projects, respectively, due to the Cloudera-Hortonworks merger. Together, the two companies were backing all four projects and a single pair needed to emerge victorious.
That merger was itself rooted in the consolidation of the big data market. And, arguably, that same consolidation also explains the entire list of retired projects above. To have the retirement of all of these projects announced in a period of less than two weeks is noteworthy, to say the least.
I inquired with ASF about the clearing of the big data project deck. ASF's Vice President for Marketing & Publicity, Sally Khudairi, who responded by email, said "Apache Project activity ebbs and flows throughout its lifetime, depending on community participation." Khudairi added: "We've...had an uptick in reviewing and assessing the activity of several Apache Projects, from within the Project Management Committees (PMCs) to the Board, who vote on retiring the Project to the Attic." Khudairi also said that Hervé Boutemy, ASF's Vice President of the Apache Attic, "has been super-efficient lately with 'spring cleaning' some of the loose ends with the dozen-plus Projects that have been preparing to retire over the past several months."
Despite ASF's assertion that this big data clearance sale is simply a spike in otherwise routine project retirements, it's clear that things in big data land have changed. Hadoop has given way to Spark in open source analytics technology dominance, the senseless duplication of projects between Hortonworks and the old Cloudera has been halted, and the Darwinian natural selection process among those projects has run its course.
Let's be careful out there
It's also clear that the significant number of vendors and customers in the big data world who invested in Apache Sentry will now need to account for their losses and move on. And with that harsh reality comes the lesson that applies to almost every tech category hype cycle: communities get excited, open source technology proliferates and ecosystems establish themselves. But those ecosystems are not immortal and there's inherent risk in almost any new platform, be it commercial or open source.
In the words of ASF's Khudairi: "it's the community behind each Project that keeps its code alive ('code doesn't write itself'), so it's not uncommon for communities to change pace on a project." In other words, bleeding-edge technology is exciting, but early adopters beware: it's also volatile. Watch your back, and manage your risks.