The combination of cloud, "serverless" computing, micro-services and container technology is making on-demand, snack-sized computing services more ubiquitous. The big data world has approached this model too, but only gradually. Amazon Web Services (AWS), Microsoft and Google have services for the provision of whole Hadoop/Spark clusters. Qubole and Altiscale (acquired by SAP last year) have offered services more oriented to executing individual jobs easily.
What's been lacking is a service that offers a generic Hadoop and Spark environment on the one hand, and a job-oriented pricing model and user interface on the other. But this morning, at its Strata Data Conference event in London, Cloudera has announced that it will now offer just such a service, which it calls Altus.
What it is
Altus isn't terribly fancy...but this offering is only an initial one, geared to data engineers. It allows for the submission and execution of Spark, Hive and MapReduce jobs (which, collectively, Cloudera likes to call data pipelines). Data engineers write the code and package it up or, in the case of Hive, write an HQL query and specify whether it should be executed using Hive on MapReduce or Hive on Spark. The "pipeline" then runs on a cluster that is created on-demand.
While the job runs, it can be monitored, and troubleshooting can be pursued if necessary. When it's done, output is accessed via AWS storage (S3 or Elastic Block Store -- EBS), and the customer gets billed for the job. The customer's data engineers don't have to worry about cluster management or configuration, and yet the environment is still the same as other Cloudera Distribution including Apache Hadoop (CDH) or Enterprise Data Hub clusters, including those deployed on-premises.
How you pay; what you get
While this is a "Pipeline as a Service" offering (similar to Microsoft's U-SQL- and C#-based Azure Data Lake Analytics), the pricing models offer some flexibility. Yes, by default, you pay by the hour and the number of nodes, as well as the AWS virtual machine instance type they are based on. But you can also pay with credits that are purchased in advance, at a discount. Or you can forget the usage-based pricing and just pay a flat annual subscription price, based on the number of nodes in the cluster.
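To make the three payment options concrete, here is a rough sketch of how the arithmetic might work. Every rate and discount in it is a made-up placeholder, not a Cloudera price, and the class and function names are hypothetical; the only thing taken from the announcement is the shape of each model (hourly by node and instance type, prepaid credits at a discount, or a flat annual fee per node).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSpec:
    """Hypothetical description of a cluster for pricing purposes."""
    nodes: int
    hourly_rate_per_node: float  # placeholder; would vary by AWS instance type


def pay_as_you_go(spec: ClusterSpec, hours: float) -> float:
    """Usage-based cost: nodes x hours x per-node hourly rate."""
    return spec.nodes * hours * spec.hourly_rate_per_node


def prepaid_credits(spec: ClusterSpec, hours: float, discount: float) -> float:
    """Same usage, paid with credits bought in advance at a discount (0..1)."""
    return pay_as_you_go(spec, hours) * (1.0 - discount)


def annual_subscription(spec: ClusterSpec, annual_rate_per_node: float) -> float:
    """Flat yearly price that depends only on the number of nodes."""
    return spec.nodes * annual_rate_per_node


if __name__ == "__main__":
    # Illustrative numbers only.
    spec = ClusterSpec(nodes=10, hourly_rate_per_node=0.5)
    print("Pay as you go:", pay_as_you_go(spec, hours=100))
    print("Prepaid credits:", prepaid_credits(spec, hours=100, discount=0.25))
    print("Annual subscription:", annual_subscription(spec, annual_rate_per_node=3000.0))
```

Note that the AWS VM charges are separate from all of this, since the cluster runs in the customer's own AWS account.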
No matter how you pay though, the environment is the same: you get a user interface designed for job submission and monitoring, and you don't really have to think much about the deployment or management of a cluster. You do need to specify the number and instance type of the nodes, but even there you or your IT staff can set up default cluster types for different workloads and then never have to worry about it again.
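The idea of default cluster types per workload can be pictured as a simple lookup with optional overrides. The sketch below is not Altus's actual configuration format or API; the profile names, node counts and the choice of AWS instance types are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical cluster shape: an AWS instance type and a node count."""
    instance_type: str
    node_count: int


# Assumed defaults that an IT team might define once per workload type.
DEFAULT_PROFILES = {
    "spark": ClusterProfile("m4.xlarge", 10),
    "hive": ClusterProfile("r4.2xlarge", 5),
    "mapreduce": ClusterProfile("c4.2xlarge", 8),
}


def resolve_profile(
    workload: str,
    instance_type: Optional[str] = None,
    node_count: Optional[int] = None,
) -> ClusterProfile:
    """Return the default profile for a workload, applying any overrides."""
    base = DEFAULT_PROFILES[workload]
    return ClusterProfile(
        instance_type=instance_type or base.instance_type,
        node_count=node_count if node_count is not None else base.node_count,
    )


if __name__ == "__main__":
    # Most jobs just take the default for their workload type...
    print(resolve_profile("spark"))
    # ...while an unusually large job overrides only the node count.
    print(resolve_profile("hive", node_count=20))
```

With defaults like these in place, a data engineer submitting a routine job only has to name the workload type.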
While multiple employees of the same customer may run jobs on the same cluster, no two customers would ever share that infrastructure. In fact, said infrastructure runs within the customer's own AWS account, where the customer is also billed for the virtual machine (VM) resources. Yes, the nodes run directly on AWS VMs -- Altus does not use container technology, at least not yet.
Compatibility with apps, public clouds
Cloudera's Data Science Workbench is compatible with Altus, and Big Data ETL provider Talend has partnered with Cloudera to make its products work too. Of course, each of these runs on edge nodes...if you want to use other tools that run on the cluster itself, Cloudera says the installation of these can be implemented as scripted steps that are run when clusters are provisioned.
Cloudera intends to port Altus to other cloud platforms and, from the sound of it, Microsoft Azure would seem to be, at the very least, the first among equals (that's not official -- it's just my hunch).
Altus vs. EMR
But what's interesting about Altus is that in many ways it sounds like AWS' own Hadoop service, Elastic MapReduce (EMR). While, yes, EMR is a cluster-based service, it was originally designed to run job "flows" (not unlike Cloudera's pipelines) that would read data from and write data to S3 and execute code submitted with the job flow request, on clusters that were by default de-provisioned after the job ran and on which installation of tools could be scripted on start-up. Used in this way, EMR would also bill by the job.
So what's the difference between Altus and EMR, really? To begin with, EMR runs on Amazon's own Hadoop distro (it can also be run on MapR's distro, though usually at additional cost). Altus, on the other hand, runs on Cloudera's distro, which is in many ways a corporate standard. That's not a trivial detail, as the uniformity of environment will help enterprise customers move more workloads to the cloud.
The other difference has us ending where we began. The data pipeline service is only the first offering on Altus, which is meant to be an umbrella brand that will offer services that go beyond both raw Hadoop/Spark jobs and the data engineering audience. We can imagine more self-service-oriented offerings -- be they for analytics, IoT, machine learning or something else -- being added to the Altus platform.
In the end, Altus will increasingly use Hadoop and Spark like operating systems, on which various data services can be offered. And that, of course, is how Hadoop and Spark should be used, as a substrate on which value-added products and services can operate.