If open source is the new normal in enterprise software, then that certainly holds for databases, too. In that line of thinking, GitHub is where it all happens, so being starred 10,000 times on GitHub must say something about a project. ArangoDB, an open source database that also comes in an Enterprise version, hit that milestone recently.
On Aug. 27, ArangoDB announced its new release 3.7, which comes with interesting new graph features. We took the opportunity to discuss the database market, graph, and beyond with CEO and co-founder Claudius Weinberger and Head of Engineering and Machine Learning Jörg Schad.
Cloud and machine learning ready
ArangoDB was founded in Cologne in 2014 by OnVista veterans Claudius Weinberger and Frank Celler. The team made headlines in 2019 with its $10 million Series A funding round led by Bow Capital. As Weinberger noted, he and his co-founder have been working together for 20 years, and the decision to pursue their vision was not a spur-of-the-moment idea:
"The main idea for ArangoDB, what is still valid today, is what we call the native multi-model approach. That means that we found a way that we can combine the JSON document data model, the graph model, and the key-value model in one database core with one query language."
Today ArangoDB is a US company with a German subsidiary. It has a new chief revenue officer, Matt Ekstrom, and a new head of engineering, Schad. Schad joined ArangoDB last year but has been working with ArangoDB for the past four years. With a PhD in database systems, distributed data analytics, and large scale infrastructure container systems, Schad has been switching between databases.
Two key factors made him join the ArangoDB team: distribution in a cloud setting and machine learning (ML). ArangoDB has been an early adopter of both Apache Mesos / DC/OS and Kubernetes. Eventually, Kubernetes prevailed, and ArangoDB 3.7 comes with the general availability of its Kubernetes operator, which has been developed over the last three years.
ArangoDB's Kubernetes operator is also the foundation for its managed service Oasis, available on AWS, Azure, and GCP. The new release includes a number of improvements: faster replacement and movement of servers, improved monitoring and cluster health analysis, advanced inspection of pod failure causes, and overall reduced resource usage. The cluster scalability improvements apply to on-premises deployments, too.
ArangoDB has been promoting ArangoML: using ArangoDB as the infrastructure for teams working with ML. The idea is that beyond training data, which is a prerequisite for training ML models, metadata is also important, and ArangoDB is a good match for managing it. We have long argued for the importance of metadata. But why ArangoDB, and not any other data management system?
Schad referred to his experience building machine learning pipelines for finance and healthcare use cases. One of the biggest challenges he saw there was audit trails for CCPA or GDPR, which make it necessary to have a full view of the entire pipeline. They had to figure out what happens if patients withdraw consent to use their data, for example.
Just being able to identify the different ML models deployed in production was very challenging because they had to go through a number of different metadata stores -- for the ML part, the data feature transformation part, and so on. So they wanted to have a common layer with all the metadata where this would end up being one query.
Relational systems are not a good match, Schad said. Machine learning features may be derived from other features, which means ending up with a lot of joins, and especially a lot of self joins. Apart from being ugly to write and maintain, those queries don't perform well either. So this started to look like a case for a graph database -- these are the types of queries graph databases excel at.
From graph to multi-model and back again
But still: why ArangoDB? ArangoDB is not a traditional graph database -- it is a multi-model database which also supports graph. The advantage, according to Schad, is that this enables users to combine the flexibility of having no schema, leveraging the JSON document view of multi-model, with the structure of how things are connected as a graph:
"In the end, looking at which models have been impacted by which is being derived from just one data set, it's just a graph traversal. So it turned out to be a really easy model, to be both flexible and very efficient in terms of formulating this query and many others as well."
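To make the point concrete, here is a minimal sketch of the lineage question Schad describes: which models are affected by a given data set. It is plain Python rather than AQL, and every node name and edge in it is hypothetical; it only illustrates that the question reduces to a single traversal instead of a chain of self joins.

```python
from collections import deque

# Hypothetical lineage edges: each key points to the things derived from it.
DERIVED = {
    "dataset/patients": ["feature/age_bucket", "feature/bmi"],
    "feature/bmi": ["feature/bmi_risk"],
    "feature/age_bucket": ["model/readmission"],
    "feature/bmi_risk": ["model/readmission", "model/diabetes"],
    "dataset/claims": ["model/fraud"],
}


def impacted(start, edges):
    """Return every node reachable from `start`, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def impacted_models(start, edges):
    """Keep only the models among the nodes derived from `start`."""
    return sorted(n for n in impacted(start, edges) if n.startswith("model/"))


if __name__ == "__main__":
    # If consent for the patients data set is withdrawn, these models are affected.
    print(impacted_models("dataset/patients", DERIVED))
```

Features derived from other features simply add depth to the traversal; in a relational layout each extra level would typically mean another self join.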
Not having a schema, however, is not always a plus. ArangoDB 3.7 introduces JSON schema support, giving users the option to validate all new data written to the database, as well as analyze existing data validity. To us, this looks overdue. JSON schema may not be the most powerful schema mechanism around, but for a database emphasizing JSON, it's a natural choice.
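As a rough illustration of the two uses mentioned above, validating new writes and analyzing existing data, the sketch below checks documents against a tiny subset of JSON Schema (`required` and per-property `type`). It is not ArangoDB's implementation or API; the schema, the documents, and the helper names are all assumptions made for the example.

```python
# Maps a few JSON Schema type names to Python types (illustrative subset only).
TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}

# Hypothetical schema for ML model metadata documents.
MODEL_SCHEMA = {
    "required": ["name"],
    "properties": {
        "name": {"type": "string"},
        "version": {"type": "number"},
    },
}


def check(doc, schema):
    """Return a list of violations; an empty list means the document passes."""
    errors = []
    for key in schema.get("required", []):
        if key not in doc:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in doc and "type" in rule:
            value = doc[key]
            ok = isinstance(value, TYPES[rule["type"]])
            # bool is a subclass of int in Python, so reject it for "number".
            if rule["type"] == "number" and isinstance(value, bool):
                ok = False
            if not ok:
                errors.append(f"{key}: expected {rule['type']}")
    return errors


def invalid_docs(docs, schema):
    """Analyze existing data: return the documents that violate the schema."""
    return [d for d in docs if check(d, schema)]


if __name__ == "__main__":
    existing = [{"name": "readmission", "version": 2}, {"version": "2"}]
    print(invalid_docs(existing, MODEL_SCHEMA))
```

A real deployment would rely on the database's own validation rather than application code like this; the point is only that the same schema can gate new writes and audit old ones.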
Although ArangoDB has its own sui generis approach, we noticed that in the last year or so its messaging has shifted a bit from the multi-model aspect to emphasize graph. Its people confirmed that, mentioning they're seeing a lot of demand for graph. Many users come with a graph use case and expand to multi-model use cases later on.
The ArangoDB team believes, however, that more data models are needed to support efficient and successful graph use cases: graph and beyond, with graph as the central use case. Up until recently, the hype was all around graph, too. But those who have been into graph before it was cool knew that hypes come and go, and were expecting the hype to subside at some point.
The first sign came last week, with Gartner's hype cycle for emerging technology in 2020 moving "graphs and ontologies" to the trough of disillusionment. Apart from the fact that conflating graphs and ontologies does not make much sense to us, we see this as a normal phase in the evolution of new, or in this case, not so new but still hyped, technology.
Schad noted that while graph use cases are on the rise, there's still a lot of trial and error. Although use cases become more mature, some disillusionment in terms of scalability limits does exist. For Weinberger, it's a good sign that the overall graph story is moving on, but expecting to do everything faster than other databases should not be the main reason people look at graphs.
Graph and beyond
ArangoDB 3.7 comes with a number of improvements around graph capabilities. Disjoint SmartGraphs distribute large, hierarchical graphs across a cluster, sharding each branch of the graph precisely so that queries on it can execute locally. They build on SmartGraphs, which apply a smart sharding mechanism: depending on how the data is set up, ArangoDB tries to shard it so that the number of hops between nodes is minimal.
With Disjoint SmartGraphs, if the resulting sub-graphs are partitioned so that they are disjoint, a number of optimizations in the query optimizer can push a lot more computation down to the servers. SatelliteGraphs go in a similar direction: replicating graphs to each cluster node for local execution of multi-model queries, using an automatic approach to replicate metadata across the different nodes.
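The sharding idea can be sketched in a few lines of Python. This is a simplified illustration, not ArangoDB's actual placement logic: the vertices, the `region`-style attribute values, and the use of CRC32 as the hash are all assumptions. Vertices are placed by hashing a shared attribute, so vertices with the same value land on the same shard, and when no edge crosses attribute values, a traversal never has to leave the shard it started on.

```python
import zlib

# Hypothetical vertices, each tagged with the attribute value used for sharding.
SMART_ATTR = {"alice": "eu", "bob": "eu", "carol": "us", "dave": "us"}
EDGES = [("alice", "bob"), ("carol", "dave")]


def shard_of(vertex, num_shards):
    """Place a vertex by hashing its sharding attribute, not its own key."""
    value = SMART_ATTR[vertex]
    return zlib.crc32(value.encode("utf-8")) % num_shards


def is_disjoint(edges):
    """True if no edge connects vertices with different attribute values."""
    return all(SMART_ATTR[a] == SMART_ATTR[b] for a, b in edges)


def cross_shard_edges(edges, num_shards):
    """Edges whose endpoints sit on different shards, i.e. need a hop."""
    return [
        (a, b)
        for a, b in edges
        if shard_of(a, num_shards) != shard_of(b, num_shards)
    ]


if __name__ == "__main__":
    print(is_disjoint(EDGES), cross_shard_edges(EDGES, 3))
```

In the disjoint case the list of cross-shard edges is empty by construction, which is the property that lets an optimizer run a whole traversal on one server.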
Parallel traversals are slightly different. This feature enables starting a number of graph traversals in parallel, for cases where identifying certain patterns across a large graph is needed. Schad said this currently requires user direction, while automatic parallelization will be introduced in the future.
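The shape of the idea, several independent traversals started side by side from user-chosen start vertices, can be sketched with Python's standard thread pool. The graph and function names are hypothetical, and this says nothing about how ArangoDB schedules the work internally; Python threads also will not speed up CPU-bound code, so treat it purely as a structural illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical adjacency list with two unconnected regions.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": [], "x": ["y"], "y": []}


def reachable(start, graph=GRAPH):
    """Depth-first traversal returning the set of vertices reachable from start."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def parallel_reachable(starts, workers=4):
    """Run one traversal per start vertex concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as `starts`.
        return dict(zip(starts, pool.map(reachable, starts)))


if __name__ == "__main__":
    print(parallel_reachable(["a", "x"]))
```

Here the caller picks the start vertices, which mirrors the "user direction" the feature currently needs.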
It's clear that the focus of these features, as well as the overall approach for ArangoDB, is on graph queries and analytics. This is even more evident considering that some form of schema has only just been introduced. In a recent article, ArangoDB expressed the position that a multi-model approach may be beneficial for knowledge graphs.
While the main argument, i.e. that having multi-model capabilities helps with data transformation, is true, it's hard for us to conceive how it is possible to talk about knowledge graphs without a schema. Furthermore, the article introduces a layered cake implying that ArangoDB can be a substrate for knowledge graphs, but at this point we don't see that supported by even some interoperability layer with graph standards.
When discussing this with ArangoDB's team, they mentioned that AQL, ArangoDB's query language, is an integral part of its multi-model capabilities. While SPARQL does not work for ArangoDB, which makes sense considering ArangoDB's model supports property graphs, ArangoDB participates in the GQL query language standardization effort for property graphs.
Understandably, this may take a while. Equally understandably, ArangoDB's team expressed the conviction that AQL will still be the preferred way to access data in ArangoDB. They also said that being clear about not being SQL-compatible comes with the territory. What is not understandable to us, however, is the lack of support for interoperability on the graph data import/export level.
Support for RDF import/export, for example, which other graph databases offer, would be an obvious benefit. ArangoDB's team noted there is community work going on in that area, but it's not yet open-sourced or included in ArangoDB's distribution. In terms of graph capabilities, we see ArangoDB as a typical product in the property graph category: more suitable for analytics, less so for data integration/knowledge management.
Overall, ArangoDB's multi-model capabilities and distributed-first approach make it an interesting offering for a number of use cases. If you are willing to dive into its sui generis approach and have use cases that match it, it's certainly worth considering.