Spark gets automation: Analyzing code and tuning clusters in production

Spark is the hottest big data tool around, and most Hadoop users are moving towards using it in production. Problem is, programming and tuning Spark is hard. But Pepperdata and Alpine Data bring solutions to lighten the load.
Written by George Anadiotis, Contributor

Hadoop and MapReduce, the parallel programming paradigm and API originally behind Hadoop, used to be synonymous. Nowadays when we talk about Hadoop, we mostly talk about an ecosystem of tools built around the common file system layer of HDFS, and programmed via Spark.

Spark is the new Hadoop. One of the defining trends of this time, confirmed by both practitioners in the field and surveys, is the en masse move to Spark for Hadoop users. Spark is itself an ecosystem of sorts, offering options for SQL-based access to data, streaming, and machine learning.

People are migrating to Spark for a number of reasons, including an easier programming paradigm. Easier than MapReduce does not necessarily mean easy though, and there are a number of gotchas when programming and deploying Spark applications.

The problem with Spark and what to do about it

So why are people migrating to Spark? The top reason seems to be performance: 91 percent of the 1,615 people from over 900 organizations participating in the Databricks Apache Spark Survey 2016 cited this as their reason for using Spark. But there's more. Advanced analytics and ease of programming are almost equally important, cited by 82 percent and 76 percent of respondents respectively.

All industry sources we have spoken to over the last few months point in the same direction: programming against Spark's API is easier than using MapReduce, so MapReduce is seen as a legacy API at this point. Vendors will continue to offer support for it as long as there are clients using it, but practically all new development is Spark-based.


Not everyone using Spark has the same responsibilities or skills. Image: Databricks

As Ash Munshi, Pepperdata CEO, puts it: "Spark offers a unified framework and SQL access, which means you can do advanced analytics, and that's where the big bucks are. Plus it's easier to program: gives you a nice abstraction layer, so you don't need to worry about all the details you have to manage when working with MapReduce. Programming at a higher level means it's easier for people to understand the down and dirty details and to deploy their apps."

Great. What's the problem then? Munshi points out that the flip side of Spark's abstraction is that a lot of the execution details are hidden, especially when running in Hadoop's YARN environment, which does not make it easy to extract metadata. This means it's hard to pinpoint which lines of code cause something to happen in this complex distributed system, and it's also hard to tune performance.

Running programs in a complex distributed system also means you have to be aware of not just your own application's execution and performance, but also of the broader execution environment. Pepperdata calls this the cluster weather problem: the need to know the context in which an application is running. A common issue in cluster deployments, for example, is inconsistency in run times because of transient workloads.
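As a rough illustration of the cluster weather idea, the sketch below checks whether an application's run times track the load on the cluster when each run happened. Everything in it is an assumption made for illustration: the record fields (`duration_s`, `cluster_load`), the 0.7 threshold, and the use of a plain Pearson correlation are not taken from Pepperdata's products.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def explain_run_time_variation(runs, threshold=0.7):
    """Guess whether run-time variation follows cluster load.

    runs: list of dicts with hypothetical keys
      duration_s   - wall-clock time of one run of the same job
      cluster_load - fraction of the cluster busy during that run (0..1)
    Returns (verdict, correlation).
    """
    loads = [r["cluster_load"] for r in runs]
    durations = [r["duration_s"] for r in runs]
    r = pearson(loads, durations)
    verdict = "cluster weather" if r >= threshold else "application"
    return verdict, r
```

Fed a handful of runs of the same job, a strong positive correlation suggests that the slow runs coincided with a busy cluster, while a weak or negative one suggests looking at the application itself. A real diagnosis would need far more signals than this.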

Data Scientists get automation: tuning Spark clusters

Pepperdata is not the only one that has taken note. A few months back Alpine Data also pinpointed the same issue, albeit with a slightly different framing. Alpine Data pointed to the fact that Spark is extremely sensitive to how jobs are configured and resourced, requiring data scientists to have a deep understanding of both Spark and the configuration and utilization of the Hadoop cluster being used.

Failure to resource Spark jobs correctly frequently leads to out-of-memory failures, which in turn force inefficient, time-consuming, trial-and-error resourcing experiments. According to Alpine Data, this requirement significantly limits the utility of Spark and hampers its adoption beyond deeply skilled data scientists.

This is based on hard-earned experience, as Alpine Data co-founder and CPO Steven Hillion explained. At some point one of Alpine Data's clients was using Chorus, Alpine Data's data science platform, to do some very large scale processing on consumer data: billions of rows and thousands of variables. Chorus uses Spark under the hood for data crunching jobs, but the problem was that these jobs would either take forever or break.

The reason was that the tuning of Spark parameters in the cluster was not right. People using Chorus in that case were data scientists, not data engineers. They were proficient in finding the right models to process data and extracting insights out of them, but not necessarily in deploying them at scale.

The result was that data scientists would get on the phone with Chorus engineers, who would help them diagnose the issues and propose configurations. As this would obviously not scale, Alpine Data came up with the idea of building the logic their engineers applied in this process into Chorus. Alpine Data says it worked, enabling clients to build workflows within days and deploy them within hours without any manual intervention.

Alpine Data Spark Auto Tuning

So the next step was to bundle this as part of Chorus and start shipping it, which Alpine Data did in Fall 2016. This was presented at Spark Summit East 2017, and Hillion says the response has been "almost overwhelming. In Boston we had a long line of people coming to ask about this".

Hillion emphasized that their approach is procedural, not based on ML. This may sound strange, considering their ML expertise. Alpine Data says, however, that this is not a static configuration: it works by determining the correct resourcing and configuration for the Spark job at run time, based on the size and dimensionality of the input data, the complexity of the Spark job, and the availability of resources on the Hadoop cluster.
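To make the idea of run-time sizing more concrete, here is a minimal sketch of a procedural rule that turns input size, dimensionality, job complexity, and free cluster resources into Spark settings. It is not Alpine's algorithm, which is proprietary. The `ClusterState` class, the `complexity` multiplier, and every constant are hypothetical; only the configuration keys in the returned dictionary are standard Spark properties.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of what the cluster can offer right now."""
    free_memory_gb: float
    free_cores: int
    max_container_memory_gb: float


def suggest_spark_conf(rows, columns, complexity, cluster,
                       bytes_per_cell=8, cores_per_executor=4,
                       target_partition_mb=128):
    """Return illustrative Spark settings for one job.

    complexity is a made-up multiplier (1.0 for a simple scan, higher for
    iterative or shuffle-heavy jobs) standing in for "job complexity".
    """
    # Rough in-memory size of the input, inflated by job complexity.
    input_gb = rows * columns * bytes_per_cell / 1024 ** 3
    working_set_gb = input_gb * complexity

    # Never ask for more executors than the free cores allow.
    max_executors_by_cores = max(1, cluster.free_cores // cores_per_executor)
    wanted_executors = max(
        1, math.ceil(working_set_gb / cluster.max_container_memory_gb))
    executors = min(wanted_executors, max_executors_by_cores)

    # Spread the working set over the executors, capped by the container
    # limit and by each executor's share of the free memory.
    per_executor_gb = math.ceil(working_set_gb / executors)
    memory_cap_gb = min(cluster.max_container_memory_gb,
                        cluster.free_memory_gb / executors)
    executor_memory_gb = max(1, min(per_executor_gb, int(memory_cap_gb)))

    # Aim for partitions of roughly target_partition_mb each, and at
    # least one partition per available core.
    partitions = max(executors * cores_per_executor,
                     math.ceil(input_gb * 1024 / target_partition_mb))

    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores_per_executor),
        "spark.executor.memory": f"{executor_memory_gb}g",
        "spark.sql.shuffle.partitions": str(partitions),
    }
```

A caller could pass the resulting dictionary to `spark-submit` as `--conf key=value` pairs. The point of the sketch is only that the answer changes with the data and with the state of the cluster, which is what separates this approach from a static configuration.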

"You can think of it as a sort of equation if you will, in a simplistic way, one that expresses how we tune parameters," says Hillion. "Tuning these parameters comes through experience, so in a way we are training the model using our own data. I would not call it machine learning, but then again we are learning something from machines."

Data Engineers get automation: analyzing Spark applications

Pepperdata now also offers a solution for Spark automation with last week's release of Pepperdata Code Analyzer for Apache Spark (PCAAS), though it addresses a different audience with a different strategy. Data scientists account for 23 percent of all Spark users, but data engineers and architects combined account for 63 percent. This is the audience Pepperdata aims at with PCAAS.

Architects are the people who design (big data) systems, and data engineers are the ones who work with data scientists to take their analyses to production. Munshi says PCAAS aims to give them the ability to take running Spark applications, analyze them to see what is going on and then tie that back to specific lines of code.

The thinking there is that by being able to understand more about CPU utilization, garbage collection or I/O related to their applications, engineers and architects should be able to optimize applications. PCAAS boasts the ability to do part of the debugging, by isolating suspicious blocks of code and prompting engineers to look into them.
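The sketch below shows one simple way such an analysis could be approximated: group per-stage metrics by the line of code that triggered the stage and flag the lines where garbage collection eats a large share of run time. It is not how PCAAS works internally. The record layout (`call_site`, `run_ms`, `gc_ms`) and the 20 percent threshold are assumptions for illustration; the call-site strings mimic the "operation at file:line" labels Spark attaches to stages.

```python
from collections import defaultdict


def rank_suspicious_call_sites(stage_records, gc_threshold=0.2):
    """Return call sites whose GC share of run time meets the threshold.

    stage_records: list of dicts with hypothetical keys
      call_site - label such as "groupByKey at etl.py:42"
      run_ms    - total executor run time for the stage, in milliseconds
      gc_ms     - time spent in JVM garbage collection, in milliseconds
    The result is sorted by total run time, most expensive first.
    """
    totals = defaultdict(lambda: {"run_ms": 0, "gc_ms": 0})
    for rec in stage_records:
        site = totals[rec["call_site"]]
        site["run_ms"] += rec["run_ms"]
        site["gc_ms"] += rec["gc_ms"]

    suspects = []
    for call_site, t in totals.items():
        gc_share = t["gc_ms"] / t["run_ms"] if t["run_ms"] else 0.0
        if gc_share >= gc_threshold:
            suspects.append((call_site, t["run_ms"], round(gc_share, 2)))

    return sorted(suspects, key=lambda s: s[1], reverse=True)
```

An engineer would then start with the first entry in the list, since it combines a high garbage collection share with the most run time.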

PCAAS aims to help decipher cluster weather as well, making it possible to understand whether run time inconsistencies should be attributed to a specific application or to the workload at the time of execution. Munshi also points out the fact that YARN heavily uses static scheduling, while using more dynamic approaches could result in better hardware utilization.

Pepperdata Code Analyzer for Apache Spark

Better hardware utilization is clearly a top concern in terms of ROI, but in order to understand how this relates to PCAAS, and why Pepperdata claims to be able to overcome YARN's limitations, we need to see where PCAAS sits in Pepperdata's product suite. PCAAS is Pepperdata's latest addition to a line of products including the Application Profiler, the Cluster Analyzer, the Capacity Optimizer, and the Policy Enforcer.

The first two are about collecting telemetry data, while the last two are about intervening in real time, says Munshi. Pepperdata's overarching ambition is to bridge the gap between Dev and Ops, and Munshi believes that PCAAS is a step in that direction: a tool Ops can give to Devs to self-diagnose issues, resulting in better interaction and more rapid iteration cycles.

Interestingly, Hillion also agrees that there is a clear division between proprietary algorithms for tuning ML jobs and the information that a Spark cluster can provide to inform these algorithms. There are differences as well as similarities between the Alpine Data and Pepperdata offerings though.

Where is this going?

To begin with, neither offering is stand-alone. Spark auto-tuning is part of Chorus, while PCAAS relies on telemetry data provided by other Pepperdata solutions. So if you are only interested in automating parts of your Spark cluster tuning or application profiling, tough luck.

In our discussion with Hillion, we pointed out that not everyone interested in Spark auto-tuning will necessarily want to subscribe to Chorus in its entirety, so perhaps making this capability available as a stand-alone product would make sense. Hillion hinted that the part of their solution that is about getting Spark cluster metadata from YARN may be open sourced, while the auto-tuning capabilities may be sold separately at some point.

Alpine Data is worried about giving away too much of its IP, but this concern may be holding it back from commercial success. When facing a similar situation, not every organization reacts in the same way. Case in point: Metamarkets built Druid and then open sourced it. Why? "We built it because we needed it, and we open sourced it because if we had not, something else would have replaced it."


The AI lock-in loop: great investment begets greater results begetting greater investment. Image: Azeem Azhar / Schibsted

In all fairness though, for Metamarkets Druid is just infrastructure, not core business, while for Alpine Data Chorus is its bread and butter. As for Pepperdata, they are toying with the idea of giving free access to PCAAS for non-production clusters to get a foothold in organizations. The reasoning is tried and true: get engineers to know and love a tool, and the tool will eventually spread and find its way into IT budgets.

Either way, if you are among those who would benefit from having such automation capabilities for your Spark deployment, for the time being you don't have much of a choice. You will have to either pay a premium and commit to a platform, or wait until such capabilities eventually trickle down.

The bigger picture however is clear: automation is finding an increasingly central role in big data. Big data platforms can be the substrate on which automation applications are developed, but it can also work the other way round: automation can help alleviate big data pain points.

Remember the AI lock-in loop? First-mover advantage may prove significant here, as sitting on top of millions of telemetry data points can do wonders for your product. This is exactly the position Pepperdata is in, and it intends to leverage it by applying deep learning to add predictive maintenance capabilities, as well as to monetize it in other ways.

Whether Pepperdata manages to execute on that strategy, and how others will respond, is another issue. At this point, though, it looks like the strategy with a better chance of addressing the need for big data automation services.
