For data platform providers, Amazon is the ultimate frenemy. If you're trying to have a major cloud market presence, the Amazon cloud is almost impossible to avoid. So it's not surprising that Hadoop providers are increasingly making friends with Amazon Web Services (AWS) and with Microsoft Azure.
For Hortonworks, roughly a quarter of its customers are deploying in the cloud for some or all of their workloads. Until now, its primary cloud presence has been as the Hadoop engine of Azure's HDInsight big data service.
Hortonworks is the latest to join the fray with Amazon, announcing a new service that will be offered through the AWS Marketplace while running natively with S3 storage and EC2 compute. The service, Hortonworks Data Cloud (HDCloud) for AWS, is a specialized offering designed to handle the most popular Hadoop workloads: Spark and Hive.
The challenge for Hadoop providers is that, in the AWS cloud, Amazon's EMR service provides the most native, seamless experience. It is a managed service, meaning that after you select the type and quantity of EC2 nodes, EMR provisions itself. By contrast, running Hortonworks (or Cloudera) in the Amazon cloud as raw infrastructure-as-a-service (IaaS) requires customers to assume the burden of provisioning cloud infrastructure and managing workloads. Even with Hortonworks Cloudbreak or Cloudera Director, which helped automate provisioning, the playing field with EMR was not level when it came to ease of use, and those deployments used HDFS instead of AWS's standard S3 storage.
That's where the HDCloud offering comes in. Because it is offered through AWS Marketplace, you get more of an EMR-like managed cloud experience, and as with EMR, you pay Amazon, not Hortonworks (Hortonworks obviously gets a royalty from Amazon). It uses S3, so it also looks like a standard AWS service.
The new Hortonworks AWS offering is not a full implementation of the Hortonworks Data Platform (HDP), as the service caters only to the most popular workloads: Spark for analytics and machine learning, and Hive (with the new LLAP acceleration) for interactive SQL.
As a result, HDCloud is not a knockoff of HDInsight for AWS. By comparison, HDInsight is a broader service, offering a more complete edition of the Hortonworks Data Platform. And besides Spark and Hive, HDInsight also runs Storm and HBase. Furthermore, HDInsight is more fully managed than the new Hortonworks AWS offering; for instance, Azure handles all the upgrades, while on AWS, more manual intervention would be required.
The back story to all of this is that, increasingly, cloud customers are demanding fit-for-purpose alternatives rather than access to a full platform. And so, today you see specialized machine learning services that provide access to a handful of modeling algorithms from all major cloud providers, and you see Spark-only services from providers like Databricks, or from Qubole, which offers a choice of Spark-only or complete Hadoop. This has also been the issue that has fueled the Spark vs. Hadoop debate. Although HDP and HDInsight already have full Spark support, such demand for tailored cloud services for ephemeral workloads has drawn Hortonworks to narrow the focus of its new Amazon offering.
Back to AWS, the obvious question is: why use HDCloud instead of defaulting to EMR? Hortonworks is differentiating by tuning for Hive and Spark workloads, using a feature borrowed from Ambari that optimizes the configuration of compute nodes. Hortonworks is also promoting its ability to provide more granular security for Hive at the row and column level.
EMR has long had the edge with its own proprietary data access optimizations. HDCloud is leveraging recent enhancements that came with Apache Hadoop 2.7 to get in the same ballpark as EMR for performance against S3.
Coming out of the gate, HDCloud will be priced through annual contracts or hourly rates. Since Hortonworks' existing Cloudbreak technology (some of which is used with the new AWS offering) already enables spot instances, we expect that, eventually, HDCloud will add spot pricing as well. And note the "for AWS" branding. We wouldn't be surprised if HDCloud eventually becomes available through other public clouds.
Note: An earlier version of this post implied that Qubole only offers dedicated Spark services. In fact, Spark is part of a broader portfolio of cloud-based big data analytics that includes a full implementation of Hadoop-related workloads.