How Apache Spark firm Databricks is firing up cloud automation

The firm founded by the creators of the Spark in-memory data-processing framework has improved its hosted platform to cut time spent developing and managing complex workloads.
Written by Toby Wolpe, Contributor

Apache Spark company Databricks has updated its cloud platform with a feature designed to let firms manage production pipelines to run Spark workloads without human intervention.

The company, started in 2013 by the creators of Spark's various components, says the new Jobs feature supports the creation of production pipelines using Databricks Cloud notebooks as well as standalone applications that use the Spark in-memory data-processing framework.

Because of that ability to move from exploration to production workloads, Databricks reckons the Jobs feature will cut the time spent developing, scheduling, and managing complex Spark workloads.

Databricks head of engineering Ali Ghodsi said the company had been working on the Jobs feature for some time because of the difficulty of making interactive exploration, collaboration and production work well together.

"You can take your notebook and say, 'OK, I want this notebook that I've just developed interactively now to run over any new data that comes in every two hours. I want you to launch a cluster for me of this particular size, get enough machines for this cluster, configure it for me, run this job or notebook every two hours and dump the results somewhere else'," he said.
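The kind of request Ghodsi describes can be pictured as a small job description: a notebook, a cluster size, a schedule and an output location. The Python sketch below is purely illustrative and is not the Databricks Cloud API; the `JobSpec` class, the `next_runs` helper, the paths, the worker count and the start date are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class JobSpec:
    """Hypothetical description of a scheduled notebook job (not a real Databricks API)."""
    notebook: str          # notebook that was developed interactively
    cluster_workers: int   # size of the cluster to launch for each run
    every: timedelta       # interval between runs
    output_path: str       # where each run should write its results


def next_runs(spec: JobSpec, start: datetime, count: int) -> List[datetime]:
    """Return the first `count` run times, beginning at `start` and spaced by `spec.every`."""
    return [start + i * spec.every for i in range(count)]


# Assumed values for illustration only.
job = JobSpec(
    notebook="/notebooks/new-data-report",
    cluster_workers=8,
    every=timedelta(hours=2),
    output_path="/results/new-data-report",
)

# Arbitrary start time; a real scheduler would use the current time.
runs = next_runs(job, datetime(2015, 1, 1, 0, 0), 3)
for run in runs:
    print(run.isoformat())
```

In this sketch the schedule is a plain fixed interval. A production scheduler would also have to provision the cluster, handle failed runs and send the notifications described in the article, none of which is modelled here.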

Once the workload is running in production, users may receive email notifications flagging up issues.

"If you get an email, you can go back into the UI again at any given time and see the output of each of these runs of the job. You can click on it to see its output and the nice thing is that again you get this notebook back," Ghodsi said.

"If you're confused about the output of some job or something looks strange or you just want to dig deeper, you can use that notebook just as you could to do interactive exploration to debug: 'Why is this output here looking like this or what if I change the query a little bit here?'. It provides you with a very nice way of mixing this interactive mode with the production mode."

Spark began in 2009 as a UC Berkeley research project to create a cluster computing framework addressing target workloads poorly served by Hadoop. It went open source in 2010 and its September 1.1 release counted more than 170 contributors.

"Spark is an engine that is much faster than Hadoop. It has a very simple API that lets programmers use it, writing very few lines of code compared with Hadoop and finally - this is one of its main strengths - it unifies many different models, which you would otherwise have to use many different systems for," Ghodsi said.

"So if you want to do real-time streaming or SQL queries or machine learning or just basic raw data-crunching you would, before Spark, use different systems. But Spark lets you do this very naturally in one framework."

Ghodsi said the creators of Spark built Databricks Cloud, which was unveiled last June, because getting started with any of these frameworks, even Spark, requires users to jump through a lot of hoops.

"You have to set up clusters - that could take you six months. You have to configure them. You have to work with operations to get that up. Once you've installed Spark, Spark is just the engine. You still need a way to explore the data interactively. You need some kind of interactive operation tool where you can just sit there and write these things," he said.

Ghodsi said the fear of lock-in lies behind the relative failure of platform as a service compared with infrastructure as a service, which has been immensely successful.

"If you give them some API and say, 'Use this API' and it's proprietary and not open source, they're going to say, 'This is not an option. Why would I put all my eggs in this basket?'. That's one of the key things in Databricks Cloud. Spark is open source. This is why we heavily invest in open-source Spark. On Databricks Cloud, there's no lock-in. It's not our private API or computation engine. You can take it and make it run on open-source Spark. You can take it and run it on prem," Ghodsi said.

The Databricks Cloud Jobs feature was launched this week at the inaugural Spark Summit East in New York City.
