How Databricks is beefing up its Apache Spark cloud platform

With access to datasets widening inside businesses, Databricks' Apache Spark-based big-data cloud offering is adding new features to cope.
Written by Toby Wolpe, Contributor
Databricks head of engineering Ali Ghodsi: initially, only a small team of data scientists ran queries against their data. Image: Databricks
Almost two months after its Apache Spark-based cloud platform became publicly available, Databricks is today unveiling a set of features it says will help firms with large teams control access to data and make Spark app development easier.

As well as access control, Databricks 2.0 adds support for the popular R statistical programming language, support for multiple versions of Spark, and notebook versioning.

Spark started in 2009 as a UC Berkeley AMPLab research project to create a cluster-computing framework for workloads poorly served by Hadoop. It went open source in 2010 and last year drew more than 450 contributors. Its creators went on to found Databricks in 2013.

Databricks is a cloud-based big-data processing platform built on Spark, with standard libraries such as Spark SQL and MLlib, and a multi-user graphical interface.

The platform also offers interactive notebooks, which are designed to make developing and managing Spark apps simpler. The notebooks have interfaces that enable developers to write Spark jobs in Python, Scala or SQL and then schedule them. Databricks says the notebooks can be run repeatedly as automatically executing production jobs.

"Initially, it was only a small team of data scientists who were running queries against their data. But soon it expanded and we had maybe 100 people using it in the same organisation. The requirements changed quite dramatically all of a sudden," Databricks head of engineering Ali Ghodsi said.

"They had marketing people, product managers and others start accessing their data. You get these different personas in an organisation who all now can ask questions from the dataset. This is really the background of almost all these new features."

According to Databricks, access control lists for notebooks allow detailed rights and privileges to be set on an individual basis for large teams with diverse roles and varied needs for access to code and data.

"When you have marketing coming in, you want to be careful. Maybe there are some secret access keys to Amazon or other things that you have in your notebooks - because your notebooks now are your source code, your notes, everything," Ghodsi said.

"You want to make sure you don't share them with everyone willy-nilly in the organisation. Not only that, for some of these organisations, it's a violation of compliance."

But as well as setting various layers of access control, Databricks has introduced a notebook-versioning feature, so developers can manage and track the codebase by integrating with popular version-control tools, such as Git.

"More and more people are collaborating on the same notebooks. You can go in there and maybe change a query yourself. As the company usage grows, and you have maybe 100 people with access to this file, one of the obvious things that becomes an issue is you might not want someone squatting over your notebooks and messing things up," Ghodsi said.

"You'd like to see what I've changed and maybe you want to retrieve the old version of it."

Access to versions is not confined to notebooks but also extends to Spark itself, with a new feature that lets developers experiment with the latest Spark advances yet maintain compatibility by deploying multiple versions in the Databricks platform.

"As a company grows, some of the more savvy data engineers might want to have access to much newer Spark features. Now they want to version-control their Spark clusters and that's a much harder problem," he said.

"The nice thing in the SaaS environment is we can actually do that. We can control the different clusters and pick which versions each cluster has. Then, when you want to migrate to new versions, we can automatically resize your previous clusters to become smaller and smaller.

"You can gradually migrate over and try new Spark versions. The crucial thing here is we should be able to dynamically adjust the size of these clusters."

Spark version 1.4, generally available since June, added support for the R language, and Databricks has followed suit on its cloud platform: R users can now work directly on large datasets through the SparkR API.

The ability for people who are not data scientists to conduct exploratory analysis and write Databricks jobs in R is an important part of spreading access to data more widely in businesses, according to Ghodsi.
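As a rough sketch of what this looks like for an R user, the Spark 1.4-era SparkR API exposes distributed DataFrames through familiar R-style verbs (the dataset and column names below are illustrative, not from Databricks; in a Databricks notebook the Spark contexts are typically pre-created rather than initialised by hand):

```
library(SparkR)

# Initialise Spark from R (Spark 1.4-era API).
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Turn a local R data.frame into a distributed Spark DataFrame.
# 'sales' and its columns are hypothetical example data.
sales <- data.frame(region  = c("EMEA", "AMER", "EMEA"),
                    revenue = c(100, 250, 175))
df <- createDataFrame(sqlContext, sales)

# Familiar grouping and aggregation, executed as a Spark job.
head(summarize(groupBy(df, df$region), total = sum(df$revenue)))
```

The point of the API is that an analyst who knows R's data.frame idioms can express a query this way and have it run across a cluster, without touching lower-level Spark constructs.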

"The other people in the company, not the hard-core PhDs in big data, not the original guys because those original guys actually were fine with probing such low-level Spark stuff directly. They're savvy, they love this stuff, they've been using it since the early days of Spark, some of them from even before it was a huge success," he said.

"But now you have people in the organisation who want to ask queries. Some of them know SQL. But what we saw last year is that more and more people were asking about R."

Databricks said that since the platform's general availability about six weeks ago, it has attracted more than 1,700 signups, with a number of enterprise deployments at firms such as online vehicle sales and information site Edmunds.com and dieting business MyFitnessPal.
