It's hard to believe, but it's true. The Apache Hadoop project, the open source implementation of Google's File System (GFS) and MapReduce execution engine, turned 10 this week.
The technology, originally part of Apache Nutch, an even older open source project for Web crawling, was separated out into its own project in 2006, when a team at Yahoo was dispatched to accelerate its development.
Proud dad weighs in
Doug Cutting, founder of both projects (as well as Apache Lucene), formerly of Yahoo, and presently Chief Architect at Cloudera, wrote a blog post commemorating the birthday of the project, named after his son's stuffed elephant toy.
In his post, Cutting correctly points out that "Traditional enterprise RDBMS software now has competition: open source, big data software." The database industry had been in real stasis for well over a decade. Hadoop and NoSQL changed that, and got the incumbent vendors off their duffs and back in the business of refreshing their products with major new features.
Sleeping giants awaken
Microsoft SQL Server now supports columnstore indexes to handle analytic queries on large volumes of data, and its upcoming 2016 version adds PolyBase functionality for integrated query of data in Hadoop. Meanwhile, Oracle and IBM have added their own Hadoop bridges, along with better handling of semi-structured data.
Teradata has pivoted rather sharply towards Hadoop and Big Data, starting with its acquisition of Aster Data and continuing through its multifaceted partnerships with Cloudera and Hortonworks. Meanwhile, in the Hadoop Era, perhaps in deference to Teradata, virtually every megavendor acquired one of the data warehousing pure plays.
Cutting points out, also accurately, that the original core components of Hadoop have been challenged and/or replaced: "New execution engines like Apache Spark and new storage systems like Apache Kudu (incubating) demonstrate that this software ecosystem evolves rapidly, with no central point of control." Granted, both of these projects are heavily championed by Cloudera, so take the commentary with a grain of salt.
Salt or no salt, though, Cutting's comment that the Hadoop ecosystem has "no central point of control" is worth considering carefully because, while it is correct, it's not necessarily good. The term "creative destruction" sometimes truly is an oxymoron. The Big Data scene's rapid technology replacement cycles leave the space stability-challenged.
Give peace a chance
Perhaps all this churn eventually gives customers better software, but the moving technology target may also mean they get no software at all, because the current environment is risky enough to hinder the growth of enterprise projects. We need some equilibrium if we want growth to be proportionate to the level of technological innovation.
Cutting concludes his post by declaring: "I look forward to following Hadoop's continued impact as the data century unfolds." While I'm not sure data and analytics will define the whole century, they probably have a good decade or two. Hopefully the industry can get a little better at developing standards that are cooperative and compatible, rather than overlapping and competitive. We don't want to go back to stasis, but more navigable terrain would suit the industry and its customers.
Meanwhile, back in the competitive market
Speaking of the industry, there was a slew of announcements this week, besides (and even despite) Hadoop's birthday:
- Pentaho introduced Python language integration into its Data Integration Suite
- Paxata launched its new Winter '15 release (albeit in 2016), which includes new auto number and fill down transformations, new algorithms to aid its data prep recommendations, and integration with LDAP and SAML, for enterprise security, single sign-on and identity management
- SkyTree, a predictive analytics vendor, said it will soon launch a free single-user version of its product, with a more formal announcement to follow (and RapidMiner, also in the predictive space, released its new version 7 last week, with a revamped UI)
- NoSQL vendor Aerospike launched a new release of its eponymous database, which now features geospatial data support, added resiliency in cloud-hosted environments and server-side support for list and map data structures
That's a pretty busy week. And I dare say, without Hadoop as a catalyst, it would have been much less so. As climate change, financial markets, geopolitics and the price of oil reach frightening new levels of volatility, the data sector of the technology industry is thriving. We might hope that the technology around Big Data could be deployed to help solve, or at least better understand, some of our world's truly big problems.
This won't be the century of data unless that in fact happens.