Pinecone, a serverless vector database for machine learning, leaves stealth with $10M funding
Machine learning applications understand the world through vectors. Pinecone, a specialized cloud database for vectors, has secured significant investment from the people who brought Snowflake to the world. Could this be the next big thing?
Built by the team behind Amazon SageMaker. Backed by Wing Venture Capital, whose founding partner and early Snowflake investor, Peter Wagner, is joining Pinecone's board and comparing Pinecone's potential impact to Snowflake's. Already supporting critical workloads at one of the world's largest retailers.
That's quite a pedigree for a previously unknown company. Pinecone, a machine learning cloud infrastructure company, left stealth today with $10M in seed funding led by Wing Venture Capital. So what makes Pinecone special, rather than just another database? ZDNet caught up with Pinecone CEO and founder, scientist and former AWS Director Edo Liberty to find out.
Machine learning infrastructure
Liberty was trained as a scientist and spent most of his life as an academic, publishing research on machine learning and systems. He spent seven years at Yahoo's machine learning group and about three years at AWS, building SageMaker, AWS's machine learning platform.
Liberty founded Pinecone in May 2019, to address what he believes is one of the most crucial components in being able to deploy large-scale machine learning solutions: vectors. For machine learning practitioners, that already says a lot. For the rest of the world, Liberty elaborated on what vectors are and why they are important:
"We are used to data being recorded in a database -- like keys and values, or images, audio, text documents. But when you use machine learning models, they don't look at the world this way. The input that they expect is a very long list of numbers. And that is called a vector. It's just a list of numbers. For a human, that's completely opaque and meaningless.
But for a machine learning model, that's exactly the inputs and outputs to be expected; that's what they consume and create. If you are building and deploying machine learning at scale, you will have millions, tens of millions and hundreds of millions of these high-dimensional vectors, those very long lists of numbers, which you have to manipulate in real time."
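To make the idea concrete, here is a minimal sketch of what "a document becomes a list of numbers" looks like in code. The `toy_embed` function below is a deliberately crude stand-in for a real embedding model (real systems use trained networks producing vectors with hundreds or thousands of dimensions); nothing here is Pinecone's actual code.

```python
import hashlib
import numpy as np

def toy_embed(text: str, dim: int = 8) -> np.ndarray:
    """Map text to a fixed-length vector, deterministically.

    A stand-in for a trained embedding model: equal inputs always
    produce equal vectors, which is the property that matters here.
    """
    # Seed a random generator from the text so the mapping is stable.
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    vec = rng.normal(size=dim)
    # Normalize to unit length, as many embedding models do.
    return vec / np.linalg.norm(vec)

doc_vector = toy_embed("machine learning infrastructure")
print(doc_vector)  # an 8-number vector: opaque to humans, meaningful to a model
```

A real corpus would hold millions of such vectors, one per document, image, or audio clip, and that is the scale at which Liberty says the manipulation problem begins.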
This is the problem Pinecone is addressing: storing and manipulating vectors at scale, in the cloud. As Liberty noted, organizations using machine learning are already grappling with that problem. So how do people deal with this currently, and what does Pinecone bring to the table?
One way people do it is by trying to "bend the pipes," as Liberty put it: take existing infrastructure, such as open-source frameworks, and make it do something it was not designed to do -- store and retrieve vectors. That ends up being both laborious and inefficient, Liberty claims, which is why organizations eventually conclude it's too much work to do in-house.
What happens then, Liberty went on to add, is they just buy a black box solution for the application that they want, such as a recommendation engine on a shopping website. But at the same time, there is an imperative for organizations to move towards being data-driven, doing more data science and machine learning, and owning their data.
Under the hood
Pinecone wants to help resolve this conundrum, by making it easier for organizations to own their machine learning without having to build all the infrastructure. To that end, Pinecone built three different components that interact in a Customize - Load - Query - Observe lifecycle.
At the core is the vector index, a highly specialized piece of software that indexes high dimensional vectors efficiently and can interact with them fast and accurately. Then there is a container distribution platform that allows Pinecone to scale horizontally and withstand any workload, and a cloud management system that allows it to offer a simple API without having to worry about resources.
It sounds simple enough, but certain details are worth highlighting. To begin with, not all vectors are the same. There are many ways to represent real-world entities such as documents in vectors, and many machine learning frameworks out there, each one with its own way of doing this transformation.
Pinecone deals with this by enabling users to plug in their transformation model, be it something they trained or something generic. Pinecone orchestrates that in real time and makes sure that when, for example, a document is sent, it is converted to a vector and indexed or retrieved consistently.
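The pattern described above -- a pluggable transformation model applied consistently on both writes and queries -- can be sketched as follows. All class and method names here are hypothetical illustrations, not Pinecone's actual API, and the brute-force cosine scoring stands in for a real vector index.

```python
from typing import Callable, Dict, List
import numpy as np

def char_embed(text: str) -> np.ndarray:
    """Crude stand-in transformation model: letter-frequency vector over a-z."""
    v = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            v[ord(ch) - ord("a")] += 1
    return v

class VectorIndex:
    """Hypothetical sketch: an index that accepts a user-supplied model."""

    def __init__(self, embed: Callable[[str], np.ndarray]):
        self.embed = embed                      # pluggable transformation model
        self.vectors: Dict[str, np.ndarray] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # The same model runs on every write, keeping storage consistent.
        self.vectors[doc_id] = self.embed(text)

    def query(self, text: str, top_k: int = 3) -> List[str]:
        # The query text goes through the identical model before matching.
        q = self.embed(text)
        def cosine(v: np.ndarray) -> float:
            return float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        ranked = sorted(self.vectors, key=lambda d: -cosine(self.vectors[d]))
        return ranked[:top_k]

index = VectorIndex(char_embed)
index.upsert("d1", "vector databases")
index.upsert("d2", "shopping recommendations")
print(index.query("database vectors", top_k=1))  # -> ['d1']
```

Because embedding happens inside the index pipeline, swapping in a different model changes representation without changing application code -- which is the consistency guarantee Liberty describes.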
Speaking of retrieval, there's a point to be made here. Pinecone does have its own query language, and it supports the type of CRUD operations people have come to expect from databases. But it does so without the SQL clone you might expect from other types of databases. So how do you express the notion of a query, as in, say, getting documents created after a certain date that contain a certain keyword?
As Liberty noted, when you deal with high-dimensional vectors, you don't have documents, or timestamps, or terms, or SQL:
"You don't have the regular constructs of a database, so you have to communicate your needs in a different way. When you look at two numbers, you can think about them as X and Y coordinates on a sheet of paper, or a dot corresponding to some location.
If you look at a thousand dimensional vector -- that's a list of a thousand numbers -- you can think about it as a dot in a thousand dimensional space. It might be hard to imagine, but mathematically, it's exactly the same thing. So you want to somehow try to retrieve that data point."
The way this works is, for example, by retrieving all the data points around a point of interest: a sphere centered on that point, with some specific radius. Generalizing, Pinecone supports working with geometric constructs, such as getting everything inside a cone, or behind some hyperplane, or using cosines, and so on.
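Two of those geometric queries -- a radius query around a point of interest, and ranking by cosine similarity -- can be shown in a few lines of NumPy. This is a brute-force illustration of the geometry only; a real vector index uses approximate-nearest-neighbor structures to avoid scanning every vector.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(10_000, 128))   # toy corpus of 128-dimensional vectors
query = rng.normal(size=128)              # the "point of interest"

# Radius query: every point whose Euclidean distance to the query falls
# inside a sphere of a given radius centered on it.
radius = 15.5
dists = np.linalg.norm(points - query, axis=1)
in_ball = np.where(dists < radius)[0]

# Cosine query: rank by angle to the query instead of distance,
# another of the geometric constructs mentioned above.
cos = (points @ query) / (np.linalg.norm(points, axis=1) * np.linalg.norm(query))
top5 = np.argsort(-cos)[:5]

print(f"{in_ball.size} points inside the sphere; top-5 by cosine: {top5}")
```

Mathematically these are the same kinds of operations whether the vectors have two dimensions or a thousand -- exactly Liberty's point about dots in high-dimensional space.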
Scaling-up in the cloud
As Liberty pointed out, those operations may sound mathematical and abstract, but they are the bread and butter of machine learning practitioners. Another point to note is what happens when data and models change, which is also a fact of life for machine learning.
The evolution of data is something Pinecone supports, as data is constantly being incrementally updated and deleted. This was one of the hardest things to achieve, as Liberty noted. The claim is that the index can absorb hundreds of thousands of vector updates per second, with updated vectors becoming searchable within microseconds.
When models are retrained, the approach is a bit different. If a model that converts documents to vectors is retrained, the corpus or documents may not have changed, but the vector representation has. So there is a new index of vectors to work with. What Pinecone does is allow users to have both the old and the new index running in parallel to run A/B tests.
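The parallel-index A/B pattern described above can be sketched as a simple traffic split. The function and index names below are illustrative only -- the article does not describe Pinecone's actual routing mechanism -- but the deterministic hashing is a common way to keep each user pinned to one model version.

```python
import hashlib

def pick_index(user_id: str, new_index_share: float = 0.1) -> str:
    """Assign each user to the old or the retrained index, deterministically,
    so a given user always sees results from the same model version."""
    # Hash the user ID into one of 100 buckets; the first N buckets go
    # to the new index, giving it roughly new_index_share of traffic.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "index_v2" if bucket < new_index_share * 100 else "index_v1"

for user in ("alice", "bob", "carol"):
    print(user, "->", pick_index(user))
```

Once the new index wins the comparison, all traffic moves to it and the old one can be torn down -- without the corpus itself ever being re-ingested by the application.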
There is also the scenario, which Liberty referred to as a rare setting, in which the model is actually live: Incrementally training on live data and constantly deployed, where the freshest data and the freshest model are always used. This is something that poses an interesting research challenge, which Liberty said they will be tackling in the future.
What is definitely an everyday challenge, however, is dealing with customer requests for deployment options. Pinecone only runs in the cloud, and Liberty cited being fully elastic, auto-scaling, and fully managed as primary drivers for Pinecone's cloud-only approach. He went on to add that cost-cutting on users' behalf is only possible in the cloud:
"When we control everything, we can actually spin down resources, we can improve our operations and we can monitor and fix stuff. If we run on premise, we just can't offer that kind of service. Businesses want to build a better recommendation engine for their shopping site or a text search engine for their documents. They're not in the business of maintaining distributed systems and infrastructure in the cloud, they just want that service."
Pinecone will be using the funding to grow its team in all three locations it's based in -- Israel, New York, and San Francisco. Liberty mentioned that Pinecone is very lean on its go-to market strategy, as the platform enables users to self-onboard, so Pinecone will be doubling down on its research and engineering efforts.
"Pinecones contain the seeds to grow entire evergreen forests, protected by a beautiful geometric object that anyone can hold and appreciate. We thought it was the perfect name for a company that opens up a world of uses of AI/ML for businesses, whose products pack all the complex parts inside accessible and beautiful packages", said Liberty.