Will Databricks' support for RStudio open the door?

While Databricks built its success on open source, the primary path to its analytics platform has until now been its own executable notebook. Will the new integration with RStudio start cracking the door open to other tools favored by data scientists?


Like other companies whose founders launched an open source goliath, Databricks risks becoming a victim of its own success. In this case, the founders are the ones who created the Spark project; Databricks' service is built on it, and so are the offerings of many frenemies.

Databricks positions itself as the cloud-based analytics platform that "unifies data science and engineering." It boasts a growing partner ecosystem encompassing almost all the usual suspects among cloud platforms; roughly a dozen software partners spanning data preparation, databases, data science, and visualization tools; plus a range of consulting and training providers.

While Spark is written in Scala, Databricks has reached out to the R and Python communities, which otherwise perceive an impedance mismatch in getting their programs to execute efficiently in Spark. Databricks accommodated R developers with SparkR, and later with SparklyR support, but you still had to go through the Databricks notebook to execute. With SparkR, some impedance mismatch remains because Spark and SparkR have overlapping functions. Likewise, with Python, Databricks supports Python 2 and Python 3 clusters, but runs Python packages in a virtual environment.

But until now, there has been one primary path into the execution side of its platform: Databricks' own notebook. It differs from open source notebooks like Jupyter and Zeppelin in having a native optimization that makes the Databricks notebook generate executables. For an open source company, Databricks' exclusive support of its own proprietary notebook at first blush looks like an outlier when you consider the demographics of its target market: data scientists, who are a crowd of -- choose your term -- individualists or iconoclasts.

Data scientists like their own tools, and getting them to adopt a single end-all tool en masse is akin to herding cats.

Data scientists have voted with their feet for open source over proprietary. Open source makes data scientists' skills more portable. That's been key to Spark's remarkable emergence. Even SAS, the incumbent that has never been known for open source, has cleared paths to its Viya platform for open source tools. The message that SAS is sending: develop with your weapon of choice, but leave the "stuff" under the hood -- execution and governance -- to us.

If SAS can open itself to non-SAS (and open source) tools, why not Databricks?

With its new RStudio partnership, Databricks is acknowledging that data scientists want to work in their own environments. The partnership follows earlier work to support SparklyR in Databricks notebooks. It allows RStudio users to work inside their tool and have it point to the Databricks runtime without having to feed their code to a Databricks notebook.

The capability comes via an integration between RStudio Server and the Databricks runtime. Databricks built some new optimizations to support RStudio (in place of the notebook) as the gateway to the runtime. Setup involves firing up the Databricks cloud, setting up clusters with your existing RStudio license (this is not an OEM deal), and then pointing RStudio Server to the Databricks runtime.

We wouldn't be surprised if this is Databricks' first step toward opening more paths. For a company monetizing its open source technology as an analytics platform, why quibble over the development tool?