Kubernetes (K8s), the open source container orchestration platform, is a big deal all around the industry. And beyond container technology per se, K8s is really a cluster computing platform, which has made it increasingly important in the big data space. Meanwhile, the major cloud big data services -- including Amazon Web Services' (AWS') Elastic MapReduce (EMR), Microsoft's Azure HDInsight (HDI) and Google Cloud Dataproc -- have heretofore each run Apache Spark on virtual machine-based Hadoop clusters. In this day and age, wouldn't running Spark directly on K8s clusters make more sense?
Not surprisingly, Google, the company that created K8s, thinks the answer to that question is yes. And so, today, the company is announcing the Alpha release of Cloud Dataproc for Kubernetes (K8s Dataproc), allowing Spark to run directly on Google Kubernetes Engine (GKE)-based K8s clusters. The service promises to reduce the complexity that comes from inter-dependencies among open source data components, and to improve the portability of Spark applications. That should allow data engineers, analytics experts and data scientists to run their Spark workloads in a streamlined way, with fewer integration and versioning hassles.
In a briefing with ZDNet, James Malone, Product Manager at Google Cloud, explained how Dataproc users will be able to advance past using static Hadoop/Spark distributions that run everything on Hadoop's YARN ("Yet Another Resource Negotiator") and instead run pure Spark jobs directly on K8s. This offering builds on the Kubernetes Operator for Apache Spark ("Spark Operator") that Google introduced back in January and makes Google the first major cloud provider to offer a Kubernetes-based big data PaaS (Platform as a Service) product.
The Spark Operator made running Spark on K8s possible already, but Malone explained to me that there are good and better ways of doing this. While the DIY approach of deploying Spark to your own K8s cluster is good, it's essentially an IaaS (Infrastructure as a Service) approach. As such, it requires a K8s skill set and puts the customer in charge of everything, including software deployment and cluster maintenance. K8s Dataproc is better because it offers Dataproc's service level agreement (SLA), Google Cloud Platform-optimized open source components and -- via the Dataproc API -- abstraction of the K8s details and skill set requirements, supplying integrated management and security.
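To give a feel for what the DIY route asks of the customer, here is a minimal sketch, in Python, that assembles the kind of spark-submit command open source Spark accepts for a K8s cluster. It only builds and prints the command; it does not run it. The API server address, container image and jar path are placeholders I made up for illustration, and none of this reflects how K8s Dataproc itself works under the hood.

```python
import shlex


def build_spark_submit(api_server, image, main_class, app_jar,
                       executors=2, name="spark-on-k8s-demo"):
    """Assemble a spark-submit argument list targeting a K8s cluster.

    All argument values are supplied by the caller; the helper itself is
    hypothetical and exists only to illustrate the DIY approach.
    """
    return [
        "spark-submit",
        # Spark addresses a K8s cluster through a k8s:// master URL
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # The customer must build, publish and maintain this image
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# Placeholder values (assumptions, not real endpoints or images)
cmd = build_spark_submit(
    api_server="https://k8s.example.invalid:6443",
    image="registry.example.invalid/spark:latest",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
)
print(shlex.join(cmd))
```

Even in this small sketch, the cluster endpoint, the container image and the executor sizing are all the customer's responsibility, which is the burden Malone says a managed service is meant to lift.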
Malone said there's even a multi-cloud support play here. While I didn't fully understand when or how that would kick in, it seems to mean that, in addition to GKE, Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS) K8s clusters could be supported too, with Google's recently announced Anthos technology ostensibly playing a part. And there's an ecosystem play here as well, allowing third-party vendors to integrate their own components into K8s Dataproc container images and clusters.
Support for other open source analytics components, including Apache Flink, Presto and Apache Druid, is planned. Malone told me that support for Apache Hive is also possible, but that accommodating the full Hadoop stack will be tricky. That said, if I understood Malone's broader point correctly, K8s Dataproc is meant to be a "post-Hadoop" offering in any case.
Sharing is caring
K8s Dataproc is made possible by changes to the Dataproc service and Google-led changes to the open source analytics engines themselves. The latter are being checked in and committed to the mainstream open source branches of the engines, making it possible, in fact, for AWS and Microsoft to implement similar re-platforming of EMR and HDI, respectively. Malone said that Google would not be distraught should that happen, since it sees such evolution of cloud big data platforms as a boon to the industry overall. That, in turn, is consistent with the company's attitude toward adoption of K8s itself, which brings the whole thing full circle.