
Databricks' $1.3 billion buy of AI startup MosaicML is a battle for the database's future

MosaicML has been working on reinventing what a database is via artificial intelligence.
Written by Tiernan Ray, Senior Contributing Writer

Naveen Rao (left), MosaicML co-founder and CEO, and Hanlin Tang, co-founder and CTO. The company's training technologies are being applied to "building experts," using large language models more efficiently to handle corporate data.


On Monday, Databricks, a ten-year-old software maker based in San Francisco, announced it would acquire MosaicML, a three-year-old San Francisco-based startup focused on taking AI beyond the lab, for $1.3 billion.

The deal is a sign not only of the fervor for assets in the white-hot generative artificial intelligence market, but also of the changing nature of the modern cloud database market.

Also: What is ChatGPT and why does it matter? Here's what you need to know

MosaicML, staffed with semiconductor veterans, has built a program called Composer that makes it easy and affordable to take a standard AI program, such as OpenAI's GPT, and dramatically speed up its initial development phase, known as training the neural network.
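As a rough idea of how such a speed-up tool is used, here is a minimal sketch based on MosaicML's open-source Composer library. The class and algorithm names follow Composer's public documentation but may differ by version, so treat the details as assumptions rather than a definitive recipe.

```python
# Sketch: switching on Composer's training speed-up "algorithms" for a
# plain PyTorch model. Names follow the open-source docs; exact signatures
# may vary by Composer version.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

from composer import Trainer                       # Composer's drop-in trainer
from composer.models import ComposerClassifier     # wraps a plain PyTorch module
from composer.algorithms import BlurPool, LabelSmoothing  # two of Composer's speed-up methods

# Standard PyTorch data and model; nothing Composer-specific yet.
train_ds = datasets.CIFAR10("data", train=True, download=True,
                            transform=transforms.ToTensor())
train_dl = DataLoader(train_ds, batch_size=256, shuffle=True)
model = ComposerClassifier(models.resnet18(num_classes=10), num_classes=10)

# The algorithms list is where the training speed-ups get switched on.
trainer = Trainer(
    model=model,
    train_dataloader=train_dl,
    max_duration="2ep",    # train for two epochs
    algorithms=[
        BlurPool(replace_convs=True, replace_maxpools=True),
        LabelSmoothing(smoothing=0.1),
    ],
    device="gpu" if torch.cuda.is_available() else "cpu",
)
trainer.fit()
```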

The company this year introduced cloud-based commercial services in which businesses can, for a fee, both train a neural network and perform inference, the rendering of predictions in response to user queries.

The more profound implication of MosaicML's approach, however, is that whole areas of working with data -- such as the traditional relational database -- could be completely reinvented.

"Neural network models can actually be thought of almost as a database of sorts, especially when we're talking about generative models," Naveen Rao, co-founder and CEO of MosaicML, told ZDNET in an interview prior to the deal. 

"At a very high level, what a database is, is a set of endpoints that are typically very structured, so typically rows and columns of some sort of data, and then, based upon that data, there is a schema on which you organize it," explained Rao.

Unlike a traditional relational database, such as Oracle, or a document database, such as MongoDB, said Rao, where the schema is preordained, with a large language model, "the schema is discovered from [the data], it produces a latent representation based upon the data, it's flexible." And the query is also flexible, unlike the fixed lookups of a query language such as SQL, which dominates traditional databases.

Also: Serving Generative AI just got a lot easier with OctoML's OctoAI

"So, basically," added Rao, "You took the database, loosened up the constraints on its inputs, schema, and its outputs, but it is a database." In the form of a large language model, such a database, moreover, can handle large blobs of data that have eluded traditional structured data stores.

"I can ingest a whole bunch of books by an author, and I can query ideas and relationships within those books, which is something you can't do with just text," said Rao. 

With clever prompting, the context supplied to an LLM becomes a flexible way to query that database. "When you prompt it the right way, you'll get it to produce something because of the context created by the prompt," explained Rao. "And, so, you can query aspects of the original data from that, which is a pretty big concept that can apply to many things, and I think that's actually why these technologies are very important."
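To make the analogy concrete, here is a minimal, hypothetical sketch of "querying" a pile of unstructured text by retrieving relevant passages and folding them into a prompt's context. The retrieve() and complete() functions are placeholders invented for illustration; they are not MosaicML APIs.

```python
# Hypothetical sketch of "prompt as query": pull relevant passages, then
# let the prompt context shape what the model returns. complete() is a
# stub for whatever LLM endpoint is actually used.
from typing import List

def retrieve(passages: List[str], query: str, k: int = 3) -> List[str]:
    """Crude relevance ranking by shared words -- a stand-in for a real
    vector-similarity search over the ingested books."""
    q = set(query.lower().split())
    ranked = sorted(passages,
                    key=lambda p: len(q & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]

def complete(prompt: str) -> str:
    """Stub for an LLM completion call (e.g., a locally hosted model)."""
    raise NotImplementedError("wire this to your model endpoint")

def query_books(passages: List[str], question: str) -> str:
    # The "query" is just a prompt whose context constrains the answer.
    context = "\n\n".join(retrieve(passages, question))
    prompt = ("Using only the passages below, answer the question.\n\n"
              f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:")
    return complete(prompt)
```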

The MosaicML work is part of a broad movement to make so-called generative AI programs like ChatGPT more relevant for practical business purposes. 

Also: Why open source is essential to allaying AI fears, according to Stability.ai founder

For example, Snorkel, a three-year-old AI startup based in San Francisco, offers tools that let companies write functions which automatically create labeled training data for so-called foundation models -- the largest neural nets that exist, such as OpenAI's GPT-4.
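As a rough illustration of that workflow, here is a small sketch using Snorkel's open-source labeling-function API; the toy spam-versus-ham task and data are invented for the example and are not drawn from Snorkel's products.

```python
# Sketch of programmatic labeling with Snorkel's open-source library:
# small heuristic functions vote on labels, and a label model combines
# their votes into training labels. The toy data here is invented.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "check out http://win-a-prize.example now",
    "thanks, see you tomorrow",
    "free offer at http://spam.example",
    "lunch at noon?",
]})

applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_message])
L_train = applier.apply(df=df_train)          # label matrix: rows x labeling functions

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train, n_epochs=100)
print(label_model.predict(L_train))           # weak labels for downstream training
```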

And another startup, OctoML, last week unveiled a service to smooth the work of serving up inference.

The acquisition by Databricks brings MosaicML into a vibrant non-relational database market that has for several years been shifting the paradigm of a data store beyond row and column. 

That includes the data lake of Hadoop, techniques to operate on it, and the map and reduce paradigm of Apache Spark, of which Databricks is the leading proponent. The market also includes streaming data technologies, where the store of data can in some sense be in the flow of data itself, known as "data in motion," such as the Apache Kafka software promoted by Confluent.

Also: The best AI chatbots: ChatGPT and other noteworthy alternatives

MosaicML, which raised $64 million prior to the deal, appealed to businesses with language models that would be not so much the generalists of the ChatGPT form but more focused on domain-specific business use cases, what Rao called "building experts." 

The prevailing trend in artificial intelligence, including generative AI, has been to build programs that are more and more general, capable of handling tasks in all sorts of domains, from playing video games to engaging in chat to writing poems, captioning pictures, writing code, and even controlling a robotic arm stacking blocks.

The fervor over ChatGPT demonstrates how compelling such a broad program can be when it can be wielded to handle any number of requests. 

Also: AI startup Snorkel preps a new kind of expert for enterprise AI

And yet, the use of AI in the wild, by individuals and institutions, is likely to be dominated by approaches far more focused because they can be far more efficient. 

"I can build a smaller model for a particular domain that greatly outperforms a larger model," Rao told ZDNET.

MosaicML made a name for itself by demonstrating its prowess in the MLPerf benchmark tests, which measure how fast a neural network can be trained. Among the secrets to speeding up AI is the observation that smaller neural networks, built with greater focus, can be more efficient.

That idea was explored extensively in a 2019 paper by MIT scientists Jonathan Frankle and Michael Carbin that won a best paper award that year at the International Conference on Learning Representations. The paper introduced the "lottery ticket hypothesis," the notion that every big neural net contains "sub-networks" that can be just as accurate as the total network, but with less compute effort. 
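The lottery ticket result is typically demonstrated with magnitude pruning: zero out the smallest weights and keep the sparse sub-network that remains. The sketch below uses PyTorch's built-in pruning utilities to show that basic move; the full recipe in the paper also rewinds the surviving weights to their initial values and retrains, which is omitted here.

```python
# Rough sketch of magnitude pruning, the mechanism behind the lottery
# ticket hypothesis: strip the smallest weights and keep the sparse
# sub-network defined by the remaining ones.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))

# Prune 80% of the weights in each Linear layer by L1 magnitude.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.8)

# The surviving "sub-network" is described by the binary masks PyTorch attaches.
kept = sum(int(m.weight_mask.sum()) for m in model.modules()
           if isinstance(m, nn.Linear))
total = sum(m.weight_mask.numel() for m in model.modules()
            if isinstance(m, nn.Linear))
print(f"weights kept: {kept}/{total}")
```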

Also: Six skills you need to become an AI prompt engineer

Frankle and Carbin have been advisors to MosaicML. 

MosaicML also draws explicitly on techniques explored by Google's DeepMind unit showing that there is an optimal balance between the amount of training data and the size of a neural network. By increasing the amount of training data, by as much as double, it's possible to make a smaller network much more accurate than a bigger one of the same type.
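DeepMind's finding is often summarized as a rough rule of thumb of about 20 training tokens per model parameter. The back-of-the-envelope sketch below applies that heuristic; the ratio is an approximation used for illustration, not an exact figure from the paper or from MosaicML.

```python
# Back-of-the-envelope sketch of the compute-optimal balance popularized
# by DeepMind's scaling work: at a fixed compute budget, parameters and
# training tokens should grow together, roughly 20 tokens per parameter.
# The constant is an approximation, not an exact value from the paper.
TOKENS_PER_PARAM = 20

def optimal_tokens(n_params: float) -> float:
    return TOKENS_PER_PARAM * n_params

for n_params in (1e9, 7e9, 30e9, 70e9):
    print(f"{n_params/1e9:>5.0f}B params -> ~{optimal_tokens(n_params)/1e9:,.0f}B tokens")

# The point Rao is making: a smaller model trained on the "right" amount
# of data can match or beat a larger model trained on too few tokens.
```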

Rao encapsulates all of those efficiencies in what he calls a kind of Moore's Law for the speed-up of networks. Moore's Law is the semiconductor rule of thumb which posited, roughly, that the number of transistors on a chip would double every 18 months for the same price. That is the economic miracle that made possible the PC revolution, followed by the smartphone revolution.

Also: Google, Nvidia split top marks in MLPerf AI training benchmark

In Rao's version, neural nets can become four times faster with every generation, just by applying the tricks of smart compute with the MosaicML Composer tool.

Several surprising insights come from such an approach. One, contrary to the oft-repeated phrase that machine learning forms of AI require massive amounts of data, it may be that smaller data sets can work well if applied in the optimal balance of data and model à la DeepMind's work. In other words, really big data may not be better data.

Unlike gigantic general-purpose neural nets such as GPT-3, which is trained on vast swaths of the internet, smaller networks can be the repository of a company's unique knowledge of its domain.

"Our infrastructure almost becomes the back-end for building these types of networks on people's data," explained Rao. "And there's a whole reason why people need to build their own models."

Also: Who owns the code? If ChatGPT's AI helps write your app, does it still belong to you?

"If you're Bank of America, or if you're the intelligence community, you can't use GPT-3 because it's trained on Reddit, it's trained a bunch of stuff that might even have personally identifiable information, and it might have stuff that hasn't been explicitly permitted to be used," said Rao. 

For that reason, MosaicML has been part of the push to make open-source models of large language models available, so that customers know what kind of program is acting on their data. It's a view shared by other leaders in generative AI, such as Stability.ai founder and CEO Emad Mostaque, who in May told ZDNET, "There is no way you can use black-box models" for the world's most valuable data, including corporate data.

MosaicML last Thursday released its latest language model as open source -- one containing 30 billion parameters, or neural weights -- called MPT-30B. The company claims MPT-30B surpasses the quality of OpenAI's GPT-3. Since it began releasing open-source language models in early May, the company says the models have been downloaded more than two million times.
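For readers who want to try the release, the open-weights models are distributed through Hugging Face. The sketch below assumes the mosaicml/mpt-30b model ID and the standard transformers loading path; a 30-billion-parameter model needs tens of gigabytes of GPU memory, so this is illustrative rather than something to run on a laptop.

```python
# Sketch of loading an open-weights MPT model via Hugging Face transformers.
# The model ID and trust_remote_code flag reflect how the MPT releases were
# published; swap in a smaller checkpoint to actually experiment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-30b"   # assumed Hugging Face model ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,      # MPT ships custom modeling code
    device_map="auto",
)

prompt = "Summarize the relationship between databases and language models:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```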

Although automatically discovering schema may prove fruitful for database innovation, it's important to bear in mind that large language models still have issues such as hallucinations, where the program will produce incorrect answers while insisting they are real. 

Also: ChatGPT vs. Bing Chat: Which AI chatbot is better for you?

"People don't actually understand, when you ask something of ChatGPT, it's not correct many times, and sometimes it sounds so correct, like a really good bullsh*t artist," said Rao.

"Databases have an expectation of absolute correctness, of predictability," based on "a lot of things that have been engineered over the last 30, 40 years in the database space that need to be true, or at least mostly true, for some kind of new way of doing it," observed Rao.

"People look at it [large language models] like it can solve all the problems they've had," said Rao of enterprise interest. "Let's figure out the nuts and bolts of actually getting there."
