Imitation being the sincerest form of flattery pretty well summarizes the challenges of running an open source software business. Over the past four to five years, Apache Spark has taken the big data analytics world by storm (for fans of streaming, no pun intended). As the company whose founders created and continue to lead the Apache Spark project, Databricks has differentiated itself as the company that can give you the most performant, up-to-date, Spark-based cloud platform service.
In the interim, Spark has continued to be the most active Apache open source project based on the size of the community (over a thousand contributors from 250 organizations) and the volume of contributions. Its claim to fame has been a simplified compute model (compared to MapReduce or other parallel computing frameworks), heavy leverage of in-memory computing, and availability of hundreds of third-party packages and libraries.
Spark has become the de facto standard embedded compute engine for tools performing anything related to data transformation. IBM has given the project a bear hug as it rebooted its analytic suite with Spark.
But as a measure of its maturity, there is now real competition. Most of the competition was with libraries and packages, where R and Python programmers had their own preferences. There has also been competition for streaming where a mix of open source and proprietary alternatives supported true streaming, while Spark Streaming itself was based on microbatch (that's now changing). More recently, Spark is seeing renewed competition on the compute front, as emerging alternatives like Apache Beam (which powers Google Cloud Dataflow) are positioning themselves as the onramp to streaming and high-performance compute.
Ironically, while a large proportion of Spark workloads were run for data transformation, its original claim to fame centered on machine learning. The operative notion for Databricks was that you could get quick access to Spark and readily take advantage of MLlib libraries without having to set up a Hadoop cluster.
Since then, Amazon, Microsoft Azure, Google and others have begun offering cloud compute services specialized for machine learning -- with Amazon's SageMaker firing a shot across the bow by making machine learning accessible without requiring an advanced degree. At the other end of the spectrum, Spark's deep learning libraries are still works in progress; for deep learning, TensorFlow and MXNet are currently stealing Spark's thunder -- although they can certainly be deployed to execute on Spark.
Databricks' strategy has morphed from "democratizing analytics" to delivering "the unified analytics platform." It offers a cloud Platform-as-a-Service (PaaS) targeted at data scientists that is informally positioned as the go-to source for getting Spark jobs running quickly with the most current version of the technology.
But then again, you don't need Databricks to run Spark. You can run it on any Hadoop platform, and thanks to connectors, on virtually any analytic or operational data platform. And in the cloud, you can readily run it on Amazon EMR or any other cloud Hadoop service. And if you are heavily wedded to Python libraries, there's always the Anaconda Cloud.
Databricks promises simplicity. You can run Spark without the overhead of running a Hadoop cluster or worrying about configuring the right mix of Hadoop-related projects. You get a native Spark runtime and need not worry about deploying your models: by working in a Databricks proprietary notebook, you can make your output executable without finding your models lost in translation once they are handed over to your data engineers. Well, you did have to worry about sizing your compute by specifying the number of "workers." With each of the major cloud providers offering serverless compute services (where you write code without worrying about compute), Databricks launched its own serverless option last summer.
The company got a huge shot in the arm last summer with a fresh $140 million venture round that threatens to make the company another unicorn (its cumulative funding now exceeds $250 million). And it is now spreading its wings with several key product initiatives.
Databricks Delta adds the missing link of data persistence. Until now, the Databricks service drew data, primarily from cloud storage, and delivered results that could be visualized or post-processed through BI self-service tools. Ironically, as one of the most frequent Spark workloads is data transformation, Databricks did not directly provide a way to persist the data for future use, except through third-party data platforms downstream. Delta fills in the gap by adding the ability to persist the data as columnar Parquet files.
At first blush, Databricks Delta looks like its answer to cloud-based data warehousing services that persist data, use Spark, and directly query data from S3, like Amazon Redshift Spectrum. In actuality, Parquet is simply a file format that stores data in columnar form; it is not a database. So it is aimed at data scientists who tend to work in schema-on-read mode and want an option for persisting data. This way, they can work within the Databricks service without having to rely on Redshift or other data warehouses, in the cloud or on premises, for reusing the data they have just wrangled.
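For readers unfamiliar with the distinction, the columnar idea can be sketched in a few lines of plain Python. This is only a toy illustration: the sample records, the field names, and the `to_columns` helper are all hypothetical, and nothing here reflects Parquet's actual on-disk encoding or any Databricks API. It simply contrasts keeping data as a list of rows with keeping one list of values per field.

```python
# Toy illustration only: NOT Parquet's real on-disk format.
# The records and field names below are made up for the example.
rows = [
    {"user": "a", "country": "US", "spend": 10.0},
    {"user": "b", "country": "DE", "spend": 7.5},
    {"user": "c", "country": "US", "spend": 2.5},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per field.

    Assumes every record carries the same fields, so positions line up.
    """
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one field reads only that field's values,
# instead of walking every field of every row.
total_spend = sum(columns["spend"])
print(total_spend)
```

The appeal for analytic workloads is that a query touching one or two fields only has to read those columns. What a columnar file does not give you on its own is what a database adds on top: indexing, transactions, and a query engine.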
Dwarfing that announcement was the recent unveiling of Azure Databricks. Until now, Databricks ran as a managed service on AWS, but as a service provider with an arm's-length relationship. For Azure, Databricks has gone fully native. Available through the Azure portal, Azure Databricks runs on Azure containers, has high-speed access to Azure Blob Storage and Azure Data Lake, can be run through the Azure console, and is integrated with Power BI for query along with a variety of Azure databases (Azure SQL Database, Azure SQL Data Warehouse, and Cosmos DB) for downstream reuse of results.
As an Azure native service, Databricks could potentially be interwoven with other services, such as Azure Machine Learning, Azure IoT, Data Factory and others. That could significantly expand Databricks' addressable market. More to the point, with Microsoft Azure as an OEM, Databricks gains a strategic partner, so it is no longer a David to everyone's Goliath.