A lot has happened in graph land in the last six months. Quick recap: a new player (TigerGraph), Microsoft ramping up its graph play with graph support in SQL Server and CosmosDB, and the number two graph database, OrientDB, getting acquired.
The number one graph database, Neo4j, is kicking off its GraphConnect event today and announcing a new version, 3.3. This version brings support for querying in Spark, additions in ETL and analytics, and improved performance. We discuss the developments and what they mean for this space with Neo4j's CEO, Emil Eifrem.
Spark, meet graph, again
One of the pain points in the graph space at the moment is querying. Not that you can't query graphs -- if anything, there is too much choice. It's almost as if every platform has its own query language.
Since there is no such thing as a SQL for graphs yet, this leaves an empty space that invites competition for the one graph query language that will become dominant. Some of the most widely used languages/APIs for graph querying are SPARQL, Gremlin, and Cypher.
Cypher is Neo4j's query language, which Neo4j has opened up as openCypher. Cypher is seeing adoption beyond Neo4j in solutions such as SAP and Redis. Today Neo4j is announcing support for using Cypher on Spark. Does that mean that you can now use Spark to do everything you can do with a graph database?
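For readers unfamiliar with it, Cypher describes graph patterns with an ASCII-art-like syntax. A minimal sketch, assuming a hypothetical social graph with Person nodes and KNOWS relationships:

```cypher
// Find friends-of-friends of Alice whom she does not already know
MATCH (me:Person {name: 'Alice'})-[:KNOWS]->(friend)-[:KNOWS]->(fof)
WHERE NOT (me)-[:KNOWS]->(fof) AND me <> fof
RETURN DISTINCT fof.name
```

The pattern in the MATCH clause is what would take several self-joins to express over a relational schema.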
Not exactly. First of all, Neo4j integration with Spark was already there: Neo4j has been able to import data from and export data to Spark for a while now via a Spark connector. Of course, that does not really help much if your data is updated regularly, as you would need to re-sync with Spark on a regular basis to be able to query based on the latest data.
And then, if you really want to work with graph data on Spark, there is always GraphX and GraphFrames. The same GraphX that is not getting an awful lot of traction, and the same GraphFrames that Cypher on Spark delegates to under the hood. Surprised? Don't be, because it really could not be much different at this point.
Neo4j makes a point of using a native graph stack, including the storage layer. Spark's storage layer on the other hand is anything but native graph -- HDFS, HBase, Cassandra, none of these options is graph-based. So what is the point of bringing Cypher to Spark?
Cypher on Spark is in practice a wrapper around GraphFrames, and works similarly to how Spark SQL works: data is processed in-memory, and Cypher serves as the interface to the outside world. Eifrem admits that this incurs a performance penalty, but says this is a first approach that will evolve, and argues that Cypher offers a more convenient way to work with graph-shaped data on Spark:
"The GraphFrames API looks a bit like what Neo4j's API looked like in 2010. While I can't bash that API too much since I co-designed it, we've moved on and learned a few things about how developers like to use graph APIs since then. By bringing Cypher to Spark, we help Spark fast-forward about a decade."
We did mention Cypher, performance and memory, and all of these come together in explaining how Neo4j has improved its performance. Neo4j 3.3 is claiming gains of up to 55 percent over the previous version in writes and updates. This was achieved by focusing on two main areas.
The first improvement was introducing a new, faster Cypher engine. Anything that goes through this engine will run faster, with a focus, it seems, on writes and updates. This does not affect things such as non-Cypher API calls or Neo4j server-side procedures, but that's where the second improvement comes into play.
The second improvement was using off-heap memory management. Neo4j is JVM-based, which means among other things that its memory management is also done by the JVM. Typically the JVM does a better job than developers could do at this, but not when it comes to database-specific, fine-grained, and low-level memory management.
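To make the heap/off-heap split concrete: Neo4j already keeps its page cache, which backs the native store files, outside the JVM heap and sizes the two separately. A sketch of that split in neo4j.conf (setting names as in the 3.x series; the sizes are purely illustrative, and the new off-heap transaction-state handling is internal, not a setting shown here):

```
# JVM heap: managed by the garbage collector
dbms.memory.heap.initial_size=4g
dbms.memory.heap.max_size=4g

# Page cache: off-heap native memory backing the graph store files
dbms.memory.pagecache.size=8g
```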
As exemplified by ScyllaDB, this can lead to substantial performance gains. So, wasn't Neo4j tempted to apply this to other areas as well? Certainly. Eifrem says they have tried doing this, but the performance gains they saw (in the range of 20 percent) would not offset the complexity introduced at the programming and organizational level.
Among other new features in Neo4j, the most noteworthy ones are ETL and analytics. When it comes to ETL, what Neo4j currently offers are methods to populate graph databases from CSV files. Typically these CSV files are exported from some relational database schema and used as an intermediate step to populate Neo4j.
What is new is a tool that can introspect relational schemas and automate the extraction of CSVs, based on a few schema patterns it can identify and map to a graph. Eifrem says that this can cover up to 80 percent of the use cases they see in practice, and that it will eventually grow into a fully automated, two-way pipeline between Neo4j and relational databases.
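The existing CSV route relies on Cypher's LOAD CSV clause. A minimal sketch, assuming hypothetical `customers.csv` and `orders.csv` exports, which also shows the kind of mapping the new tool automates -- tables become labeled nodes, foreign keys become relationships:

```cypher
// Each customer row becomes a Customer node
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
CREATE (:Customer {id: row.id, name: row.name});

// Each order row becomes an Order node, with the foreign-key
// column turned into a PLACED relationship
LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
MATCH (c:Customer {id: row.customer_id})
CREATE (c)-[:PLACED]->(:Order {id: row.id, total: toFloat(row.total)});
```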
Analytics is a less obvious one. To begin with, what does analytics even mean for Neo4j? As per Eifrem, Neo4j is an OLTP database. However due to its graph nature, people have also been using it for graph analytics, meaning basically anything with a prohibitive number of joins in a relational database. Recommendations would be a typical example.
What is new now is a library of graph algorithms, including PageRank, clustering, and path finding. Implementing these on Neo4j was always possible, but now they come in a pre-packaged, optimized, and ready-to-use form. Eifrem says that they run faster on Neo4j than on frameworks such as GraphX or Giraph, but substantiating that would mean entering benchmark territory.
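The library exposes these algorithms as Cypher procedures. A hedged sketch of what a call looks like, assuming hypothetical Page nodes connected by LINKS relationships (procedure names follow the neo4j-graph-algorithms library and may differ by version):

```cypher
// Stream PageRank scores over the Page/LINKS subgraph
CALL algo.pageRank.stream('Page', 'LINKS', {iterations: 20, dampingFactor: 0.85})
YIELD nodeId, score
MATCH (p) WHERE id(p) = nodeId
RETURN p.title AS page, score
ORDER BY score DESC LIMIT 10
```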
Lies, damn lies, statistics, and benchmarks
Eifrem is not very fond of benchmarks. As everyone who has been around for a while knows, benchmarks coming from vendors tend to show the vendor in question performing better than the competition. This is why, he says, Neo4j never produces benchmarks of its own and does not bother with other vendors' benchmarks either:
"There's lies, damn lies, statistics, and vendor benchmarks -- that's the hierarchy of lies," he says. Broadly speaking, there is at least some truth to that. But that's also a good opportunity to refer to the graph database space, which has seen a lot of activity the last six months.
To begin with, a new player has entered the space with a bang and some impressive benchmarks -- not least because they are out there in the open to be reproduced: TigerGraph. Neo4j is still number one in adoption, and, it seems, working hard to stay there.
TigerGraph's CEO has compared TigerGraph to Neo4j by stating that "Neo4j is not a distributed system. It has a 'clustering' offering where each machine still needs to have a complete copy of the whole graph. Its 'clustering' is only meant to provide higher queries per second and higher availability. Its clustering cannot handle a graph if the graph is bigger than a single machine can handle."
Eifrem did not comment on that directly. But he did comment on whether TigerGraph would be an option to consider for anyone looking for a premium graph database solution:
"Maybe so. At a mature state, the market should have space for this, and we'd rather have 60 percent of a big, mature market than 100 percent of a nascent one. TigerGraph may claim some success, but they have a lot to prove still. They raised some 30 million, but that's about it. They could hire a good CEO, or build their commercial team, or iron out their product.
Then they could end up occupying a piece of this market. That's good news, we want that. So far we have not seen them in deals. When that starts happening I may be a bit more grumpy, but not at this point. As long as people are intellectually honest and stick to the facts, more players is good news."
Eifrem says the activity this space has seen in the last six months is a sign of growing demand and maturity. The problem with that, he continues, is that this means the space is going through a somewhat annoying phase of FUD, marketing, hype, and ticking boxes:
"Gartner promoted the notion of multi-model database, so you had for example MongoDB adding a graph operator and saying they are a native graph database. This makes no sense, but enabled them to tick that box. By the way, Gartner also stated recently that they are seeing no growth in multi-model."
As for the rest of the competition, Eifrem seems to lump most of it together in what he calls non-native solutions: "By adding a layer on top of non-native stores, you have something that looks and feels like a graph. You have a proper graph API, but in the end you serialize to tables or documents or whatever, and that's not going to get you performance."
If there is one thing for certain, it's that this space is growing. We will be revisiting it to cover more approaches and opinions in the near future.