Bob van Luijt's career in technology started at age 15, building websites to help people sell toothbrushes online. Not many 15-year-olds do that. Apparently, this gave van Luijt enough of a head start to arrive at the confluence of technology trends today.
Van Luijt went on to study arts but ended up working full time in technology anyway. In 2015, when Google introduced its RankBrain algorithm, the quality of search results jumped. It was a watershed moment, as it introduced machine learning into search. A few people noticed, including van Luijt, who saw a business opportunity and decided to bring this to the masses.
ZDNet connected with van Luijt to find out more.
Weaviate, a B2B search engine modeled after Google
For van Luijt, this was an "Aha" moment. Like everyone else working in technology, he had to deal with lots of unstructured data. In his words, relating data is a problem. Data integration is hard to do, even for structured data. When you have unstructured data from different sources, it becomes extremely challenging.
Van Luijt read up on RankBrain and figured it uses word vectorization to infer relations in queries and then tries to present results. Vectors are how machine learning models understand the world. Where people see images, for example, machine learning models see image representations, in the form of vectors.
A vector is a list of numbers, often a very long one, which can be thought of as coordinates in a geometrical space. Three-dimensional vectors -- i.e. vectors of the form (X, Y, Z) -- correspond to a space humans are familiar with. But vectors with many more dimensions also exist, and this complicates things:
"There are many dimensions, but to paint a mental picture, you can say there's just three dimensions. The problem now is, it's great that you can use a vector to recognize a pattern in a photo and then say, yes, it's a cat, or no, it's not a cat. But then, what if you want to do that for one hundred thousand photos or for a million photos? Then you need a different solution, you need to have a way to look into the space and find similar things."
This is what Google did with RankBrain for text. Van Luijt was intrigued. He started experimenting with Natural Language Processing (NLP) models. He even got to ask Google's people directly: Were they going to build a B2B search engine solution? Since their reply was "no," he set out to do that with Weaviate.
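The "look into the space and find similar things" step that van Luijt describes can be sketched in a few lines of Python. This is a minimal illustration, not how RankBrain or Weaviate actually work: the photo names and three-dimensional vectors are made up, and cosine similarity with a brute-force scan is simply one common way to measure closeness.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, vectors, k=2):
    """Brute-force search: rank every stored vector by similarity to the query."""
    ranked = sorted(
        vectors,
        key=lambda name: cosine_similarity(query, vectors[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical photo embeddings; real models typically emit far more dimensions.
photo_vectors = {
    "cat_1": (0.9, 0.1, 0.0),
    "cat_2": (0.8, 0.2, 0.1),
    "dog_1": (0.1, 0.9, 0.1),
    "car_1": (0.0, 0.1, 0.9),
}

# A query vector pointing in the "cat" direction of this toy space.
query = (1.0, 0.0, 0.0)
closest = nearest(query, photo_vectors, k=2)
print(closest)
```

Because this scan compares the query against every stored vector, its cost grows with the size of the collection. That is the scaling problem in the quote above, and the reason a dedicated search solution becomes attractive at a million photos.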
Searching the document space with vectors
NLP machine learning models output vectors: They place individual words in a vector space. The idea behind Weaviate was: What if we take a document -- an email, a product, a post, whatever -- look at all the individual words that describe it, and calculate a vector for those words?
This will be where the document sits in the vector space. Then, if you ask, for example, "What publications are most related to fashion?", the search engine should look into the vector space and find publications like Vogue as being close to "fashion" in this space.
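As a rough sketch of this idea, the Python below places each document at the average of its word vectors and then looks for the document closest to a query concept. Everything here is an assumption for illustration: the two-dimensional word vectors are invented, the second publication is hypothetical, and averaging is only one simple way to combine word vectors. The article does not say which method Weaviate uses.

```python
import math

# Hypothetical 2-D word vectors; a real NLP model would supply these.
WORD_VECTORS = {
    "fashion": (0.9, 0.1),
    "style": (0.8, 0.2),
    "clothing": (0.85, 0.1),
    "stocks": (0.1, 0.9),
    "markets": (0.2, 0.8),
    "economy": (0.15, 0.85),
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def document_vector(text):
    """Place a document at the average of the vectors of its known words."""
    vectors = [WORD_VECTORS[w] for w in text.lower().split() if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("no known words in document")
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))


def most_related(concept, index):
    """Return the document whose vector is closest to the concept's vector."""
    query = WORD_VECTORS[concept]
    return max(index, key=lambda name: cosine_similarity(index[name], query))


# Toy descriptions; note that neither one contains the word "fashion".
documents = {
    "Vogue": "style clothing",
    "Hypothetical Finance Weekly": "stocks markets economy",
}
index = {name: document_vector(text) for name, text in documents.items()}

print(most_related("fashion", index))
```

The description of Vogue never mentions "fashion", yet it still ranks first because its words sit near "fashion" in the toy space. That is the difference between this kind of search and plain keyword matching.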
It's not that it isn't possible to store vectors in traditional databases. It is, and people do that. But after a certain point, it becomes impractical. Besides performance, complexity is also a barrier. For example, van Luijt mentioned, in most cases, people are not privy to the details of how vectorization happens.
Weaviate comes with a number of built-in vectorizers. Some are general-purpose, some are tailored to specific domains such as cybersecurity or healthcare. A modular structure enables people to plug in their own vectorizers, too.
Weaviate also works with popular machine learning frameworks such as PyTorch or TensorFlow. However, there is a catch: At this time, if you train your model, or use one provided by Weaviate, you're stuck with it.
If a model changes in a way that influences the way it generates vectors, Weaviate would have to re-index its data to work. This is not currently supported. Van Luijt mentioned it was not required in their current use cases, but they are looking into ways of supporting that.
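Why a changed model forces re-indexing can be shown with a toy example in Python. The two "models" below are hypothetical lookup tables: version 2 encodes the same meanings as version 1 but with the axes in a different order, a stand-in for any change that alters how vectors are generated. The publication names and numbers are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(index, query_vector):
    """Return the indexed item closest to the query vector."""
    return max(index, key=lambda name: cosine_similarity(index[name], query_vector))


# Hypothetical model, version 1.
MODEL_V1 = {"fashion": (0.9, 0.1, 0.0), "finance": (0.0, 0.1, 0.9)}
# Hypothetical version 2: same concepts, but the axes are reversed.
MODEL_V2 = {word: (v[2], v[1], v[0]) for word, v in MODEL_V1.items()}

# Each document is described by a single topic word to keep the example small.
document_topics = {"Vogue": "fashion", "Hypothetical Finance Weekly": "finance"}

# Index built while version 1 was in use.
index_v1 = {name: MODEL_V1[topic] for name, topic in document_topics.items()}

same_model = search(index_v1, MODEL_V1["fashion"])   # query and index agree
mixed_models = search(index_v1, MODEL_V2["fashion"])  # new query, stale index

# Re-indexing: recompute every stored vector with version 2.
index_v2 = {name: MODEL_V2[topic] for name, topic in document_topics.items()}
after_reindex = search(index_v2, MODEL_V2["fashion"])

print(same_model, "|", mixed_models, "|", after_reindex)
```

Querying the old index with vectors from the new model returns the wrong publication, because the two sets of vectors no longer share a coordinate system. Recomputing the stored vectors with the new model restores the expected result.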
As a startup, SeMI Technologies, the company van Luijt founded around Weaviate, is navigating the market for traction. Currently, the retail and fast-moving consumer goods (FMCG) industries are working well for them, with Metro AG being a prominent use case.
The challenge that Metro had was how to find new opportunities in the market. Weaviate helped them do that by combining data from their CRM and OpenStreetMap. If a location where a business exists could not be associated with a customer in the CRM, that indicated an opportunity.
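The logic of the Metro use case boils down to finding businesses on the map that have no counterpart in the CRM. The Python sketch below shows that idea with made-up records and exact matching on a normalized address. Both are simplifying assumptions: the article does not describe Metro's data, and relating real records would need the kind of similarity matching discussed earlier rather than string equality.

```python
def normalize(address):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(address.lower().split())


def find_opportunities(map_businesses, crm_customers):
    """Return businesses from the map data that match no CRM customer."""
    known_addresses = {normalize(c["address"]) for c in crm_customers}
    return [
        b for b in map_businesses
        if normalize(b["address"]) not in known_addresses
    ]


# Hypothetical CRM export.
crm_customers = [
    {"name": "Cafe Aurora", "address": "1 Main Street"},
    {"name": "Bistro Nord", "address": "22 Harbor Road"},
]

# Hypothetical businesses taken from map data.
map_businesses = [
    {"name": "Cafe Aurora", "address": "1  main street"},
    {"name": "Trattoria Luna", "address": "8 Mill Lane"},
    {"name": "Bistro Nord", "address": "22 Harbor Road"},
]

opportunities = find_opportunities(map_businesses, crm_customers)
print([b["name"] for b in opportunities])
```

Here the one business without a CRM match is the lead worth following up, which is the sense in which a missing connection carries value.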
GraphQL makes for good API UX
Across industries, van Luijt noted, the problem is always the same at the root level: unstructured data needs to be related to something internally structured. Graphs are well-known for helping leverage connections. But it turns out that even the inability to find connections can generate business value, as the Metro use case exemplifies.
Van Luijt is a firm believer in the value of graphs for leveraging connections -- or lack thereof. Stacking up data in data warehouses and data lakes and lakehouses and whatnot does have value. But, to get value from connections in the data, it's the graph model that makes the most sense, he noted.
Then, the question becomes: How are we going to get people access to this? To give people a lot of capabilities so they can do "a tremendous amount of stuff," a graph query language like SPARQL may make sense, van Luijt said.
But if you want to make it simple for people to access graphs so they have a very short learning curve, GraphQL becomes interesting, he went on to add: "Most developers who are unfamiliar with graph technology, if they see SPARQL, they start sweating and they get nervous. If they see GraphQL, they go like, 'Hey, I understand this. This makes sense.'"
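To give a feel for why GraphQL reads as familiar to developers, the Python sketch below assembles a small query string for the fashion example used earlier. Treat the details as assumptions: the `Publication` class, the `name` field, the `Get` wrapper, and the `nearText` argument illustrate how such a query might look and are not confirmed by the article, so the exact shape of Weaviate's API may differ.

```python
import json


def build_concept_query(class_name, fields, concept):
    """Assemble a GraphQL query string asking for items near a concept.

    The Get / nearText structure is an illustrative assumption.
    """
    field_list = " ".join(fields)
    # json.dumps yields a correctly quoted list literal, e.g. ["fashion"].
    concepts = json.dumps([concept])
    return (
        "{ Get { "
        f"{class_name}(nearText: {{concepts: {concepts}}}) "
        f"{{ {field_list} }} "
        "} }"
    )


query = build_concept_query("Publication", ["name"], "fashion")
print(query)
```

The query mirrors the shape of the data the caller wants back: a class, a filter, and a list of fields. That resemblance to the result is a large part of the short learning curve van Luijt refers to.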
Weaviate also supports the notion of schemas. When an instance starts running, the API endpoint becomes available, and the first thing users need to do is to create a class property schema. It can be as simple or as complex as it needs to be, and existing schemas can also be imported.
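A class property schema can be pictured as a class name plus a list of typed properties. The Python sketch below expresses one as a plain dictionary and checks it for obvious mistakes. The key names (`class`, `properties`, `name`, `dataType`) and the `Publication` example are assumptions for illustration, and the validation helper is hypothetical rather than part of any Weaviate client.

```python
def validate_schema(schema):
    """Return a list of problems found in a class property schema (empty if none)."""
    problems = []
    if not schema.get("class"):
        problems.append("missing class name")
    seen = set()
    for prop in schema.get("properties", []):
        name = prop.get("name")
        if not name:
            problems.append("property without a name")
        elif name in seen:
            problems.append(f"duplicate property: {name}")
        else:
            seen.add(name)
        if not prop.get("dataType"):
            problems.append(f"property {name!r} has no dataType")
    return problems


# Hypothetical schema for the publications used in the search example.
publication_schema = {
    "class": "Publication",
    "properties": [
        {"name": "name", "dataType": ["string"]},
        {"name": "description", "dataType": ["text"]},
    ],
}

print(validate_schema(publication_schema))
```

A schema this small is enough to start with, and more classes and properties can be added as the data model grows.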
A pragmatic approach
Van Luijt has very pragmatic views when it comes to the limitations of vectors, as well as to the use of open source. To quote Gary Marcus, and Ray Mooney before him, "You can't cram the meaning of a whole $&!#* sentence into a single $!#&* vector."
That much is true, but does it matter if you can get practical results out of using vectors? Not much, argues van Luijt. The problem Weaviate is trying to solve is finding things. So, if the similarity search does a good job in finding things using vectors, that's good enough. The idea, he went on to add, is to turn vectorization-based search from a data science problem into an engineering problem.
The same pragmatic approach is taken when it comes to open source. There are many reasons why people choose to go with open source. For Weaviate, open source, or rather open core, was chosen as a mechanism for transparency towards customers and users.
Perhaps surprisingly, van Luijt noted Weaviate is not necessarily looking for contributors. That would be nice to have, but the main purpose open source serves is enabling audits: when clients ask their own experts to audit Weaviate, the open code base makes that possible.
Weaviate is available both as Software-as-a-Service and on-premises. Counter to conventional wisdom, it seems most Weaviate users are interested in on-premises deployments.
In practice, however, this oftentimes means their own project in one of the major cloud providers, with services from the Weaviate team. As the team and the product scale up, a shift toward the self-service model may be called for.
Disclosure: SeMI Technologies has worked with the author as a client.