As Databricks' annual conference in North America, Data + AI Summit, continues, so do the announcements from the company about new capabilities on its platform. Yesterday was focused on conventional analytics. Today's all about AI, and for multiple audiences. For the developer crowd as well as sophisticated business users, Databricks is introducing an AutoML (automated machine learning) engine; for data scientists, the company is adding a feature store.
In general, AutoML platforms let users bring their own data set and build a model from it by indicating which column contains the target variable and what broad problem to solve (for example, classification or regression). From there, the AutoML platform can sweep through a range of algorithms, and hyperparameter values for each, looking for the best model based on selected metrics of accuracy and efficiency.
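To make that sweep concrete, here is a minimal sketch in plain Python. It is not Databricks' implementation: the toy data, the two model families (`fit_knn`, `fit_stump`), the `SEARCH_SPACE` grid and the `automl_classify` function are all hypothetical names invented for illustration, and accuracy on a holdout set is assumed as the only selection metric.

```python
import random

def make_rows(n=200, seed=0):
    """Toy table (an assumption): two numeric columns plus a binary target column."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        rows.append({"x1": x1, "x2": x2, "label": int(x1 + x2 > 0)})
    return rows

def fit_knn(train, features, target, k):
    """k-nearest-neighbours classifier; k is the hyperparameter being swept."""
    def predict(row):
        nearest = sorted(
            train, key=lambda t: sum((t[f] - row[f]) ** 2 for f in features)
        )[:k]
        return int(2 * sum(t[target] for t in nearest) > k)
    return predict

def fit_stump(train, features, target, feature):
    """One-column threshold rule; the column used is the hyperparameter."""
    threshold = sum(r[feature] for r in train) / len(train)
    above = [r[target] for r in train if r[feature] > threshold]
    # Predict 1 above the threshold if most training rows there are 1, else flip.
    positive_above = 2 * sum(above) >= len(above)
    def predict(row):
        return int((row[feature] > threshold) == positive_above)
    return predict

# Hypothetical search space: each algorithm with the hyperparameter values to try.
SEARCH_SPACE = [
    ("knn", fit_knn, [{"k": k} for k in (1, 5, 15)]),
    ("stump", fit_stump, [{"feature": f} for f in ("x1", "x2")]),
]

def accuracy(predict, rows, target):
    return sum(predict(r) == r[target] for r in rows) / len(rows)

def automl_classify(rows, target, holdout=0.25):
    """Try every algorithm/hyperparameter pair and rank them by holdout accuracy."""
    features = [c for c in rows[0] if c != target]
    split = int(len(rows) * (1 - holdout))
    train, valid = rows[:split], rows[split:]
    leaderboard = []
    for name, fit, grid in SEARCH_SPACE:
        for params in grid:
            model = fit(train, features, target, **params)
            leaderboard.append(
                {"model": name, "params": params,
                 "score": accuracy(model, valid, target)}
            )
    leaderboard.sort(key=lambda e: e["score"], reverse=True)
    return leaderboard[0], leaderboard

best, leaderboard = automl_classify(make_rows(), target="label")
for entry in leaderboard:
    print(entry)
```

The caller supplies only the table and the name of the target column; everything else is the sweep. A real AutoML engine would add cross-validation, time budgets and a much wider set of algorithms, but the shape of the loop is the same.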
Unlike AutoML, which is often used by non-specialists, a feature store is designed for data scientists. The premise of a feature store is based on two important facts: (1) that a single ML model may derive its training data from multiple sources, each of which may be updated on a different cadence, and (2) that some such source data may be used by more than one model. Given this many-to-many relationship -- often thought to be one-to-one -- it turns out that treating models as the smallest unit of granularity in ML operations is often incorrect. Instead, it's the source data and the group of ML model features (input variables) which that data feeds that should be managed together, in terms of ingest, feature engineering and then, perhaps, propagated retraining of impacted models.
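The many-to-many relationship is easier to see with a small lineage sketch. Everything below is hypothetical (the source names, feature names, model names and the `impacted_by` helper are made up for illustration and are not part of any Databricks API); it simply shows how an update to one source fans out to features and then to the models that may need retraining.

```python
# Hypothetical lineage: which raw sources each feature is computed from.
FEATURE_SOURCES = {
    "avg_basket_value": ["orders"],
    "days_since_last_login": ["web_logs"],
    "support_tickets_30d": ["crm"],
    "orders_per_visit": ["orders", "web_logs"],
}

# Hypothetical lineage: which features each model consumes.
MODEL_FEATURES = {
    "churn_model": ["days_since_last_login", "support_tickets_30d", "orders_per_visit"],
    "upsell_model": ["avg_basket_value", "orders_per_visit"],
}

def impacted_by(source):
    """Return the features to recompute and the models to consider retraining
    when the given source receives new data."""
    features = sorted(f for f, srcs in FEATURE_SOURCES.items() if source in srcs)
    models = sorted(
        m for m, feats in MODEL_FEATURES.items() if set(feats) & set(features)
    )
    return features, models

print(impacted_by("orders"))
print(impacted_by("crm"))
```

In this toy setup an update to `orders` touches both models, while an update to `crm` touches only one, which is why tracking retraining per model alone misses half the picture.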
Databricks' AutoML platform, which is both UI- and API-driven, goes a step further than many on the market, in that it avoids the "black box" scenario of simply taking data in, and pushing a model out. While you can use it that way, Databricks employs what it calls a "glass box" approach, where you can see the actual code used to produce the various models, and decide on the "winning" output model, just as if the work were hand-coded by a data scientist.
Databricks AutoML will put that code in a standard, editable notebook and the code will leverage the ML experimentation capabilities of MLflow, already part of the Databricks platform. This is an excellent approach that supports regulatory compliance and transparency. It also provides a good "grow up" story, where data scientists can take the AutoML code, use it as a baseline, and then develop it further. Essentially then, Databricks AutoML isn't just a tool for non-specialists, but also a utility that can support data scientists by eliminating a lot of their time-consuming grunt work.
Databricks' feature store is materialized in Delta Lake files and accessible via Delta Lake APIs. And, like the AutoML engine, the feature store is integrated with MLflow. It also integrates Shapley values for model and inference (prediction) explainability.
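For readers unfamiliar with Shapley values, the sketch below computes them exactly for a toy model by averaging each feature's marginal contribution over every possible ordering of the features. It is illustrative only and does not reflect how Databricks computes them: the `toy_model` function, its inputs and the all-zero baseline are assumptions, and production tools rely on approximations because exact enumeration grows factorially with the number of features.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all orderings in which features are switched from baseline to
    their actual values."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            updated = predict(current)
            totals[name] += updated - previous
            previous = updated
    return {name: total / len(orderings) for name, total in totals.items()}

def toy_model(x):
    """Hypothetical scoring function with an interaction between a and c."""
    return 2 * x["a"] + 3 * x["b"] + x["a"] * x["c"]

instance = {"a": 1.0, "b": 1.0, "c": 2.0}
baseline = {"a": 0.0, "b": 0.0, "c": 0.0}

contributions = shapley_values(toy_model, instance, baseline)
print(contributions)
```

The contributions always add up to the gap between the prediction for the instance and the prediction for the baseline, which is what makes them useful for explaining an individual inference.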
Both Databricks AutoML and Databricks Feature Store are part of Databricks' strategy to build out a completely self-contained data platform with a full range of lake/lakehouse, data prep, data management, data governance, BI and AI capabilities. As many in the industry presume the company is headed for an initial public offering, it certainly makes sense that it would be looking to get all its ML and data ducks in a row.