After running its Apache Spark-based cloud platform in a closed user program for the past year, Databricks has now made the service publicly available.
The cloud-hosted environment, which Databricks says is already deployed by more than 150 firms, aims to simplify the use of the open-source cluster compute engine and cut the time spent developing, scheduling, and managing complex Spark workloads.
Databricks head of engineering Ali Ghodsi said the cloud service, formerly called Databricks Cloud, is designed to support interactive exploration and collaboration, and to automate production workloads.
"If you want to use Spark to solve a big-data problem, today it still remains very elusive - it's almost rocket science. You have to get a bunch of machines, install a cluster manager on them and then you have to tune Spark on that," Ghodsi said.
"Once you have that up and running, you still only have the basic execution engine. You might want to do some plotting; you might want to do interactive exploration. You can type in SQL queries and it will now crunch lots of your data on lots of your machines. But then how do I actually plot those results so that I can visualise?"
Even once those tuning and exploration issues are resolved, the resulting processes need to be put into production.
"You want to take the human out of the loop and say, 'OK, now I just want this thing to just run itself, to crunch over the latest data that came in last night, over and over," Ghodsi said.
"That's also not part of Spark or any of these engines - having a production job with a scheduler that automatically just runs through this stuff. Databricks gives you all these things."
Spark started in 2009 as a UC Berkeley AMPLab research project to create a cluster-computing framework for workloads poorly served by Hadoop. It went open source in 2010 and last year had more than 450 contributors. Its creators went on to found Databricks in 2013.
"From now on you'll be able to code and do your explorative analysis and write your jobs in Databricks in R - that's in addition to the languages that we already support, which are Python, SQL and Scala," Ghodsi said.
R owes its popularity in part to the statistical libraries that ship with the language, according to Ghodsi, with the data-science community split between Python and R aficionados.
"People who come from a slightly more statistical mathematical background prefer R and maybe people with a little more computing background might prefer Python. But we don't want to force people to use either, so now they get access to both - and actually you can go between them," he said.
"Certainly if you want to do cloud things, R has a lot of built-in nice support for piping a display and immediately it plots for you many of the interesting statistical properties of the models that you're using."
The company is also working on security and governance features planned for the second half of the year. These features include access control and private notebooks, as well as version control to allow users to track changes to source code.
"A lot of the data scientists that use Databricks want to collaborate live on a notebook. They want to write comments to each other; they want to use each others' notebooks. So one of the things that then follows immediately is versioning," Ghodsi said.
"How can I make sure that if someone came and changed parts of my notebook, how can I see what were his changes, how can I audit that, how can I go to an earlier version, how can I see the results before and after?"
Databricks is also planning to offer support for full Spark streaming with fault-tolerant real-time processing.
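Spark Streaming's core model is to treat a live feed as a sequence of small batches and run the same computation over each one. A toy generator-based sketch of that micro-batch idea (this omits the lineage tracking and checkpointing that give Spark Streaming its fault tolerance):

```python
from itertools import islice

def micro_batches(stream, size):
    """Group an unbounded iterator into fixed-size micro-batches --
    the core idea behind Spark Streaming's discretized streams."""
    it = iter(stream)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# A running computation (here, a count) applied to each micro-batch
# of a finite toy event stream of seven events.
counts = [len(b) for b in micro_batches(range(7), size=3)]
```

Each batch is an ordinary dataset, so the same batch-oriented code — including the SQL queries described earlier — can be reused on streaming data.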