Databricks' Apache Spark cloud platform goes public

Unveiled last June, the Apache Spark cloud-hosted platform from Databricks has now opened its doors for business.
Written by Toby Wolpe, Contributor
After running its Apache Spark-based cloud platform in a closed user program for the past year, Databricks says the service is now publicly available for the first time.

The cloud-hosted environment, which Databricks says more than 150 firms have deployed, aims to simplify the use of the open-source cluster compute engine and cut the time spent developing, scheduling, and managing complex Spark workloads.

Databricks head of engineering Ali Ghodsi said the cloud service, formerly called Databricks Cloud, is designed to automate interactive exploration, collaboration and production.

"If you want to use Spark to solve a big-data problem, today it still remains very elusive - it's almost rocket science. You have to get a bunch of machines, install a cluster manager on them and then you have to tune Spark on that," Ghodsi said.

"Once you have that up and running, you still only have the basic execution engine. You might want to do some plotting; you might want to do interactive exploration. You can type in SQL queries and it will now crunch lots of your data on lots of your machines. But then how do I actually plot those results so that I can visualise?"

Even once those tuning and exploration issues are resolved, the resulting processes need to be put into production.

"You want to take the human out of the loop and say, 'OK, now I just want this thing to just run itself, to crunch over the latest data that came in last night, over and over," Ghodsi said.

"That's also not part of Spark or any of these engines - having a production job with a scheduler that automatically just runs through this stuff. Databricks gives you all these things."

Spark started in 2009 as a UC Berkeley AMPLab research project to create a cluster computing framework for workloads poorly served by Hadoop. It went open source in 2010 and last year had more than 450 contributors. Its creators went on to found Databricks in 2013.

Earlier today, IBM announced that it is making Spark a key part of its cloud and commerce services and will be offering Spark as a service on its Bluemix cloud development platform.

Following Spark version 1.4, which became generally available late last week, Databricks also plans to offer access to the popular R statistical programming language, enabling R users to work directly on large datasets through the SparkR API.

"From now on you'll be able to code and do your explorative analysis and write your jobs in Databricks in R - that's in addition to the languages that we already support, which are Python, SQL and Scala," Ghodsi said.

R owes its popularity in part to the statistical libraries that come with the language, according to Ghodsi, with the data-science community divided between Python and R aficionados.

"People who come from a slightly more statistical mathematical background prefer R and maybe people with a little more computing background might prefer Python. But we don't want to force people to use either, so now they get access to both - and actually you can go between them," he said.

"Certainly if you want to do cloud things, R has a lot of built-in nice support for piping a display and immediately it plots for you many of the interesting statistical properties of the models that you're using."

In March, Databricks introduced the new Jobs feature, which supports the creation of production pipelines using Databricks cloud notebooks as well as standalone applications that use Spark.

The company is also working on security and governance features planned for the second half of the year. These features include access control and private notebooks, as well as version control to allow users to track changes to source code.

"A lot of the data scientists that use Databricks want to collaborate live on a notebook. They want to write comments to each other; they want to use each others' notebooks. So one of the things that then follows immediately is versioning," Ghodsi said.

"How can I make sure that if someone came and changed parts of my notebook, how can I see what were his changes, how can I audit that, how can I go to an earlier version, how can I see the results before and after?"

Databricks is also planning to offer full support for Spark Streaming, with fault-tolerant real-time processing.
