Cloudera and Hortonworks: Prodigal sons reunite

The two foundational Hadoop companies announce their intention to merge. Is it a whole greater than the sum of parts, or a shotgun wedding? And what's the combined company going to be called?
Written by Tony Baer (dbInsight), Contributor and Andrew Brust, Contributor

With the advantage of two sets of eyes and ears, one of us got the news live while the other just saw it in a series of cryptic text messages upon landing at Newark Airport a couple hours later: Cloudera and Hortonworks are entering a merger of equals that sees Cloudera stockholders owning roughly 60% of the combined company.

Larry Dignan delivered the news flash yesterday: It puts together a company with roughly a $5 billion valuation and $750 million in revenues, with players that have been slowly advancing toward cash flow positive balance sheets. Until now, we thought that IBM would have been the more likely suitor for Hortonworks, given an OEM relationship that was finding increasing commercial traction. But as IBM of late has been busily pivoting the future of its business from Watson cognitive computing toward a broader implementation of AI, not to mention the urgency of building the IBM Cloud business, there have been bigger fish to fry.

The deal brings together two formerly fierce rivals. The cofounders of both companies shared a common background at Yahoo inventing Hadoop, but then went on to forge separate paths that at some points became personal and contentious. But even during the height of rivalry, both worked together in the Apache community, sometimes on the same Hadoop projects, but many times, on competing ones. Through the years, Gartner analyst Merv Adrian has faithfully tracked the canonical history of Hadoop projects, which has tended to resemble more of a scorecard.

Arun Murthy and Doug Cutting being interviewed by Jeff Kelly at Hadoop Summit 2014

The passing of time (not to mention, to some degree, turnover of staff) may have smoothed off some of the rougher edges of rivalry over the years, but there is the matter of overlapping projects to contend with. As with many mergers, there will be a need for product rationalization. But there is so much history that the sorting out of Apache Sentry and Ranger; Spot and Metron; Atlas and Cloudera Navigator; Hive LLAP and Impala; and Ambari and Cloudera Manager will hardly be cut and dried. We'll likely see Doug Cutting and Arun Murthy reprising their joint stage appearance at Hadoop Summit 2014, but in the confines of conference rooms.

Not surprisingly, given the strong cultural identities of both players, the merger press release omitted any mention of what the combined company will be called. Yeah, there are some egos that likely need to be soothed. There are also business models to be harmonized, but the good news is that, subtly, Hortonworks' 100% pure open source model has been gradually converging toward something more like Cloudera's open core. While the core Hortonworks platform has remained open source, the company has entered reseller deals involving proprietary software.

But the world has moved on from when it was simply the battle of Hadoop platforms, open source or otherwise. At one time there were as many as a half dozen offerings to choose from, but that was when Hadoop was the only game in town for analyzing petabytes of data. Today the market landscape offers many more paths, many of them far less complex than marshaling all the dozen or so components of Hadoop clusters.

There's Spark, which can work on its own or as part of an analytic data warehousing platform, or as a supported project of Hadoop. There is a growing variety of choices for discrete machine learning and deep learning services for those wanting to develop and operationalize AI models. There are streaming and data flow systems that analyze data on the fly and bypass Hadoop by funneling straight into cloud storage.

And then there's SQL. Remember SQL? There are cloud-based services that let you run ad hoc SQL queries against Hadoop. In the latest twist, SQL Server 2019, just announced as preview by Microsoft, will introduce a big data edition that swaps out Hadoop on the compute node in favor of Microsoft's SQL database engine and Spark that runs directly against HDFS in data nodes (there's actually striking similarity to the way Cloudera deploys Impala daemons). Amazon Redshift Spectrum, Google Cloud BigQuery, and Azure SQL Data Warehouse can all run against cloud storage. So can Snowflake. There are also cloud-based services that let you directly run SQL queries against cloud object storage.

In this increasingly fragmented landscape, Hadoop's calling card is versatility and emerging governance that, with any of the aforementioned services, must be implemented a la carte.

Did we say cloud-based services? The presence of Amazon EMR, Azure HDInsight, and Google Cloud Dataproc also skews the equation. Cloudera and Hortonworks must contend with the "nobody got fired for buying" appeal of the cloud provider's managed Hadoop service among customers that are already using the cloud. Aside from HDInsight (where Microsoft OEMs the Hortonworks platform), both have been pivoting toward more specialized services for data engineering, data warehousing, and data science to differentiate from the EMRs of the world -- that concedes that EMR et al. are the default Hadoop platforms in the cloud and that it does not make sense for Cloudera or Hortonworks to go straight up against them with another general-purpose offering.

By the way, the cloud has also changed the definition of what is Hadoop, as upcoming versions of the Apache platform will make it easier to swap in cloud storage in place of HDFS. With cloud storage becoming the de facto data lake, refactoring of the Hadoop storage tier is coming not a moment too soon. And as for MapReduce, the other original pillar of Hadoop, it is becoming an endangered species.

On the commercial side, the landscape has both opportunities and challenges. On the plus side, the two companies combined have roughly 2,500 customers. As with any enterprise platform, the sales cycles are long and costly, meaning that the path to profits is the land and expand model. Given our estimates that 80% of this base is on premises, the cloud is not a direct competitor, and that provides the air space for the combined company to finally take advantage of the "expand" part of the land and expand cycle as the road to higher margins. And with a base of 2,500, there's enough scale to pave a path to profitability. And as both Cloudera and Hortonworks have targeted the top 3,000 to 5,000 companies based on complexity of analytic problems, their business resembles that of Teradata, which is now on its own path back to profitability.

With Ovum predicting that half of new big data workloads will be in the cloud next year, that's where the friction heats up. Hadoop's installed base consists of companies that have the IT skills to set up big data clusters. But most companies lack those skills, and for them, the simplicity of the cloud beckons. Can Hadoop be made simpler through a managed service that takes away all the messy provisioning and configuration, or will point services or familiar SQL relational databases be the most expedient ways to get results?

That's where Cloudera/Hortonworks faces its fiercest frenemies. For Cloudera and Hortonworks, the Amazon relationship has been more arm's length and the Google relationship is at a much earlier stage. That leaves Azure, where Microsoft has put skin in the game with Hortonworks on HDInsight, and where Cloudera and Microsoft have been collaborating on more specialized Altus offerings. But then again, Microsoft is hedging its bets and has put significant marketing muscle behind Azure Databricks as the prime platform for Spark and AI. In the future, we expect that Microsoft will also run Azure SQL Database against Azure Blob storage or ADLS, tossing Spark into the deal, just as it is doing with SQL Server 2019 on HDFS.

And so, while Cloudera/Hortonworks' near-term path to profitability lies in expanding the footprint in its existing on-premises installed base, the hot growth part of the market will be in the cloud, where it's much more of a jungle.
