Fifty years ago, relational databases were neither ubiquitous nor standardized. The story of how we went from the relational model introduced by Edgar F. Codd to the SQL databases and language we all know today may be a bit different depending on who you ask.
One thing is certain, however -- it was a bumpy ride. Ideas, people, and vendors vied for domination before settling and coalescing around SQL, by now the de facto query language. This short trip down database memory lane may have some lessons in store for today.
The advent of NoSQL databases in the past few years has challenged the relational stronghold. Among them, it seems graph databases are the ones enjoying the widest popularity right now. Proponents claim graph is the most natural way to model the world, and every major database vendor today has graph in its arsenal.
Give me control of a database query language, and I care not who makes its engine
The noticeable spike in interest in graph in the last year can be largely attributed to heavyweights such as Microsoft and AWS entering this space with Cosmos DB and Neptune, respectively. But this space had been around long before the heavyweights made their move, and over the past couple of weeks we've seen some announcements that are worth analyzing.
The gist of what's going on is that we have a war for graph database domination, and query language is a key battle to be won. SQL did not arrive at universal adoption overnight. If there is a graph query language that turns out to be the winner, it won't be overnight either.
The world is a different place now, so the war is arguably more subtle, and there may not even be a clear winner. But it's a war nonetheless, and the query language battle is raging. Just think what kind of power any vendor could have over the industry if they got to control SQL, and you will begin to see why the battle for graph query language is important.
RDF and SPARQL are W3C specifications, and TinkerPop is an Apache open-source project guided by DataStax's Marko Rodriguez. So, if you are the No. 1 vendor in graph databases, but do not control (or like) those, what do you do? You come up with your own.
That's what Neo4j has done with Cypher. Cypher started out as Neo4j's custom query language. At some point, Cypher was open sourced and the openCypher project was created. The openCypher project has some industry support, most prominently from SAP.
SAP says Cypher is the de-facto open industry standard for graph processing, and SAP HANA Graph supports a subset of Cypher -- focusing on the querying aspects -- making it possible to migrate existing apps to HANA more easily.
This last announcement, Cypher on TinkerPop, was made on stage in Berlin, and it's important because it expands Cypher's reach: You can now query your graphs in Cypher on all databases supported by TinkerPop, which is practically all graph databases.
Neo4j piggybacks on the work done by TinkerPop to expand its own reach, and this was done the Neo4j way: The work started as a community contribution, and was then adopted and sponsored by Neo4j until it reached production-release status.
Inside baseball, or history in the making?
The way Cypher on TinkerPop works is by mapping Cypher queries to Gremlin. As became clear in discussion with Neo4j CEO Emil Eifrem and the team that built it, this is probably not the most efficient approach: Cypher queries are first translated to Gremlin, and only then executed on the TinkerPop engine.
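To get a feel for what such a translation involves, consider a query for the names of people someone knows. Below is a hypothetical Cypher pattern and one plausible Gremlin traversal for it -- an illustration of the two styles, not output of the actual translator:

```
// Cypher: declarative pattern matching
MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
RETURN friend.name

// Gremlin: one plausible equivalent, spelled out as traversal steps
g.V().hasLabel('Person').has('name', 'Alice').
  out('KNOWS').values('name')
```

Cypher describes the pattern and leaves execution to the engine, while Gremlin enumerates the traversal steps, which hints at why a query-by-query translation layer may leave optimization opportunities on the table.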
Eifrem mentioned his ambition to offer native implementations of Cypher on TinkerPop in the future. He also gave credit to Rodriguez, adding that part of the reason for TinkerPop's success is how easy it is for vendors to support it by implementing a simple SPI (service provider interface).
Even if non-native, the Cypher on TinkerPop bridge means Cypher can now be used where it previously could not, such as on AWS Neptune and Microsoft Cosmos DB.
Both support TinkerPop, which means that, by extension, they also support SPARQL, as there is a SPARQL-Gremlin bridge. Now, Cypher can tick that box, too. Neo4j does not support SPARQL, and its RDF support is not exactly prominent. There is a long story behind this, but it's mostly "inside baseball" to quote Eifrem.
Why bother then? Well, some of that "inside baseball" may be relevant not only for the database historians of the future, but also for people choosing a graph database today. As we noted, AWS Neptune, for example, already supported both SPARQL and TinkerPop/Gremlin, and now, it also indirectly supports Cypher. Which one is the best option?
Well-informed industry sources note that Neptune is a native graph database rather than an abstraction layer on a relational database, and add that Neptune and Aurora both leverage common underlying AWS technologies. Neptune is said to use a storage design reminiscent of how Aurora works, but it is not built on top of Aurora.
Blazegraph is, or was, an RDF graph database, so if Neptune was indeed built on top of Blazegraph, that would neatly explain its SPARQL support, among other things. Other sources note AWS had been looking to land a graph database vendor for a while, and that native Cypher support was considered and will likely be added in the future.
SQL on SPARQL, performance, and shared graphs
Inside baseball or not, Neptune has renewed interest in this space, and everyone is taking note. Take Cambridge Semantics (CS), for example, which recently announced support for Neptune analytics. CS mainly markets its Anzo Smart Data Lake product, but as its executives noted in a recent call, they would not want to be left out of the graph buzz.
That may sound like graph-washing, but here at least there is some grounding. At the heart of Anzo lies AnzoGraph, a bona fide RDF graph database descended from SPARQLCity. CS had been working with SPARQLCity for a while, and given the complementarity between them, the two gradually merged into one company and one solution in 2016.
As expected, CS is shown to outperform the competition in its benchmark. CS's Barry Zane, who founded SPARQLCity, says they chose this particular benchmark (LUBM) because there was another vendor (Oracle) they could compare against for speed and scale, noting they were about 100 times faster. Zane also adds they threw more hardware at the problem, "because we could."
But what is perhaps more interesting about this is the scale. When RDF graph databases were taking their first steps, pioneers were told by vendors to come back when they could do a million triples. A few years back, doing something useful with a billion triples, or even just loading them, was an open challenge. CS's benchmark was run with a trillion triples.
AnzoGraph is built as a massively parallel solution, which Zane says most closely resembles TigerGraph in terms of architecture. TigerGraph also happens to boast top performance in its own benchmark. This is always a point of contention, and Zane concedes benchmarks are only indications -- clients should evaluate their specific use case.
The feature in question here is TigerGraph's MultiGraph, meant to give organizations control over which parts of a graph, or graphs, users can access while maintaining security and data integrity. It seems to resemble Named Graphs, the RDF way of working with multiple graphs, but the exact similarities and differences are not clear at this point.
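For reference, Named Graphs let a SPARQL query scope its patterns to a specific graph within a dataset. A minimal illustration, with made-up graph and vocabulary IRIs:

```
# Restrict the pattern to one named graph in the dataset
SELECT ?person ?name
WHERE {
  GRAPH <http://example.org/graphs/hr> {
    ?person a <http://xmlns.com/foaf/0.1/Person> ;
            <http://xmlns.com/foaf/0.1/name> ?name .
  }
}
```

Whether MultiGraph offers comparable query-time scoping on top of its access controls remains to be seen.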
What is clear is that TigerGraph does not offer many options when it comes to language support: Users are limited to its custom GSQL query language. TigerGraph says it developed GSQL because customers found other query languages incomplete, whereas GSQL combines familiar SQL-style syntax with procedural constructs.
Still, while data import and export to and from TigerGraph can be achieved via RDF or CSV, for queries, things are not that simple. TigerGraph says it provides a migration path from Gremlin or Cypher to GSQL, but this has to be done by a dedicated solution team. We won't be surprised to see TigerGraph adding support for TinkerPop soon, though.
And to wrap up on the language support front, today CS also announced a partnership with CData Software to provide SQL query access. The idea is to offer SQL access to the underlying AnzoGraph, leveraging existing SQL skills.
As SQL was built for the relational model, mapping graph constructs to it may prove difficult. It would not be a first, though: SPARQL-to-SQL translation has been around for a while, and as for SQL to SPARQL, data.world, for example, already offers this. It will be interesting to see how this evolves, and whether SQL on SPARQL becomes a thing.
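To illustrate the kind of mapping involved, here is a simple relational query and one hypothetical SPARQL rendering of the same question, treating rows as subjects and columns as predicates (table, column, and IRI names are made up for the example):

```
-- SQL over a relational table
SELECT name FROM employees WHERE dept = 'Sales';

# One possible SPARQL equivalent over a graph version of the data
SELECT ?name
WHERE {
  ?emp <http://example.org/schema/dept> "Sales" ;
       <http://example.org/schema/name> ?name .
}
```

Flat selections like this map over cleanly; the harder part is expressing graph-native constructs such as variable-length paths in a language with no notion of them.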