Back to the future: Does graph database success hang on query language?

If the history of relational databases is any indication, what is going on in graph databases right now may be history in the making.

Fifty years ago, relational databases were neither ubiquitous nor standardized. The story of how we went from the relational model introduced by Edgar F. Codd to the SQL databases and language we all know today may be a bit different depending on who you ask.

One thing is certain, however -- it was a bumpy ride. Ideas, people, and vendors vied for domination before settling on and coalescing around SQL, by now the de facto query language. This short trip down database memory lane may have some lessons in store for today.

The advent of NoSQL databases in the past few years has challenged the relational stronghold. Among them, it seems graph databases are the ones enjoying the widest popularity right now. Proponents claim graph is the most natural way to model the world, and every major database vendor today has graph in its arsenal.

Give me control of a database query language, and I care not who makes its engine

The noticeable spike in interest in graph in the last year can be largely attributed to heavyweights such as Microsoft and AWS entering this space with Cosmos DB and Neptune, respectively. But this space existed long before the heavyweights decided to make a move, and over the past couple of weeks we've seen some announcements that are worth analyzing.

Graph databases are trending, and there is an ongoing if subtle war for domination going on. (Image: DB engines)

The gist of what's going on is that we have a war for graph database domination, and query language is a key battle to be won. SQL did not arrive at universal adoption overnight. If there is a graph query language that turns out to be the winner, it won't be overnight either.

The world is a different place now, so the war is arguably more subtle, and there may not even be a clear winner. But it's a war nonetheless, and the query language battle is raging. Just think what kind of power any vendor could have over the industry if they got to control SQL, and you will begin to see why the battle for graph query language is important.

As fellow analyst Tony Baer noted, for people not in the esoteric group of graphistas, graph has been "a strange new database without standards de facto or otherwise." The truth, however, is there are some standards in graph: RDF as a model, SPARQL as a query language, and TinkerPop / Gremlin as a virtual engine / query language.

RDF and SPARQL are W3C specifications, and TinkerPop is an Apache open-source project guided by DataStax's Marko Rodriguez. So, if you are the No. 1 vendor in graph databases, but do not control (or like) those, what do you do? You come up with your own.

That's what Neo4j has done with Cypher. Cypher started out as Neo4j's custom query language. At some point, Cypher was open sourced, and the openCypher project was created. OpenCypher has some industry support, most prominently by SAP.

SAP says Cypher is the de-facto open industry standard for graph processing, and SAP HANA Graph supports a subset of Cypher -- focusing on the querying aspects -- making it possible to migrate existing apps to HANA more easily.

As Neo4j CEO Emil Eifrem admitted in a discussion we had last week in the backstage of Neo4j's Graph Tour event in Berlin, perhaps they should have opened up Cypher earlier. But Neo4j is trying to catch up now: First by offering Cypher on Apache Spark, and then by offering Cypher on Gremlin.

This last announcement was made on stage in Berlin, and it's important because it expands Cypher's reach. It means you can now query your graphs in Cypher on all databases supported by TinkerPop, which is practically all graph databases.

Neo4j piggybacks on the work done by TinkerPop to expand its reach, and this was done the Neo4j way: the work started as a community contribution and was then adopted and sponsored until it reached production release status.

Inside baseball, or history in the making?

The way Cypher on TinkerPop works is by mapping Cypher queries to Gremlin. As became clear through discussion with the team that built this and Eifrem, this is probably not the most efficient way to do this: Cypher queries are translated to Gremlin, and then they're executed on the TinkerPop engine.
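
To make the idea of query translation concrete, here is a minimal sketch in Python. It is not the actual Cypher on TinkerPop implementation, and the function name `cypher_to_gremlin` is hypothetical. It assumes a single, very narrow Cypher shape (one labelled node, an optional property filter, one returned property) and rewrites it as a string of standard Gremlin steps.

```python
import re

# Toy grammar (an assumption for illustration only):
#   MATCH (var:Label {key: 'value'}) RETURN var.prop
# The property filter in braces is optional.
_PATTERN = re.compile(
    r"MATCH \((\w+):(\w+)(?: \{(\w+): '([^']*)'\})?\) RETURN (\w+)\.(\w+)$"
)


def cypher_to_gremlin(query: str) -> str:
    """Translate one narrow Cypher query shape into a Gremlin traversal string."""
    match = _PATTERN.match(query.strip())
    if match is None:
        raise ValueError("unsupported query shape for this toy translator")

    var, label, key, value, returned_var, prop = match.groups()
    if returned_var != var:
        raise ValueError("RETURN must refer to the matched variable")

    # Each Cypher clause becomes one or more Gremlin steps.
    steps = [f"g.V().hasLabel('{label}')"]       # MATCH (var:Label)
    if key is not None:
        steps.append(f"has('{key}', '{value}')")  # {key: 'value'}
    steps.append(f"values('{prop}')")             # RETURN var.prop
    return ".".join(steps)


if __name__ == "__main__":
    print(cypher_to_gremlin("MATCH (p:Person {name: 'Ann'}) RETURN p.age"))
```

Even in this toy form, the extra hop is visible: the query is parsed, rewritten, and only then handed to a Gremlin engine for execution, which is why a translated query may not run as efficiently as a native one.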

Eifrem mentioned his ambition to offer native implementations for Cypher on TinkerPop in the future. He also gave credit to Rodriguez, adding that part of the reason for TinkerPop's success is how easy it is for vendors to support TinkerPop by implementing a simple SPI.

Even if it is not native, the Cypher on TinkerPop bridge means Cypher can now be used where it previously could not, such as on AWS Neptune and Microsoft Cosmos DB.

Both support TinkerPop, which means that, by extension, they also support SPARQL, as there is a SPARQL-Gremlin bridge. Now, Cypher can tick that box, too. Neo4j does not support SPARQL, and its RDF support is not exactly prominent. There is a long story behind this, but it's mostly "inside baseball" to quote Eifrem.

TinkerPop and its query language, Gremlin, are the lingua franca of graph databases. Now, Neo4j wants to use this as a substrate to expand the reach of its own query language, Cypher. (Image: Apache TinkerPop)

Why bother then? Well, some of that "inside baseball" may be relevant not only for the database historians of the future, but also for people choosing a graph database today. As we noted, AWS Neptune, for example, already supported both SPARQL and TinkerPop/Gremlin, and now, it also indirectly supports Cypher. Which one is the best option?

It might help to step back and figure out what's under Neptune's hood. We previously speculated that Neptune might be built on DynamoDB. It turns out this is not the case.

Well-informed industry sources note that Neptune is a native graph database rather than an abstraction layer on a relational database, and add that Neptune and Aurora both leverage common underlying AWS technologies. Neptune is said to use a storage design that is reminiscent of how Aurora works, but it's not built on top of Aurora.

But there is more. It's something of an open secret in the graph database world that Neptune is based on the acqui-hiring of Blazegraph: Amazon acquired Blazegraph's domains; many former Blazegraph engineers are now Amazon Neptune engineers, according to LinkedIn; and Amazon now owns the Blazegraph trademark.

Blazegraph is/was an RDF graph database, so if Neptune was indeed built on top of Blazegraph, this would perfectly explain support for SPARQL, among other things. Other sources note AWS had been looking to land a graph database vendor for a while, and that it considered native Cypher support, which will likely be added in the future.

SQL on SPARQL, performance, and shared graphs

Inside baseball or not, Neptune has renewed interest in this space, and everyone is taking note. Take Cambridge Semantics (CS), for example, which recently announced support for Neptune analytics. CS mainly markets its Anzo Smart Data Lake product, but as its executives noted in a recent call, they would not want to be left out of the graph buzz.

That may sound like graph-washing, but here at least there is some grounding. At the heart of Anzo lies AnzoGraph, a bona fide RDF graph database and the descendant of SPARQLCity. As CS had been working with SPARQLCity for a while, and the two complemented each other, they gradually merged into one company and one solution in 2016.

CS recently published a benchmark and followed this up by announcing AnzoGraph complements AWS Neptune by offering scale for both complex graph traversal queries and data warehouse style aggregation analytics. Reading this, you might think there is some special integration or business agreement between AWS and CS, but this is not the case.

CS hinted that may be so in the future, but for the time being, CS-Neptune integration is done via RDF export/import, as would be done between any other two solutions that support RDF. Barry Zane, formerly SPARQLCity's CEO and now CS VP of engineering, is into running benchmarks.

As expected, CS is shown to outperform the competition in its own benchmark. Zane says they chose this particular benchmark (LUBM) because there was another vendor (Oracle) they could compare against for speed and scale, noting they were about 100 times faster. Zane also adds they threw more hardware at the problem, "because we could."

But what is perhaps more interesting about this is the scale. When RDF graph databases were taking their first steps, pioneers were told by vendors to come back when they could do a million triples. A few years back, doing something useful with a billion triples, or just loading them, was an open challenge. CS's benchmark was run with a trillion triples.

AnzoGraph is built as a massively parallel solution, which Zane says most closely resembles TigerGraph in terms of architecture. TigerGraph also happens to boast top performance in its own benchmark. This is always a point of contention, and Zane concedes benchmarks are only indications -- clients should evaluate their specific use case.

TigerGraph for its part recently announced version 2.0, which it says comes with performance improvements. But what is maybe most noteworthy is what TigerGraph says is a unique feature that enables multiple users to work on the same graph simultaneously.

MultiGraph is a new feature announced in TigerGraph 2.0, giving the option of working with different views on the same graph. (Image: TigerGraph)

This is called MultiGraph and is meant to give organizations control over which parts of a graph or graphs users can access while maintaining security and data integrity. It seems to resemble Named Graphs, which is the RDF way to work with multiple graphs, but the exact similarities and differences are not clear at this point.

What is clear is that TigerGraph does not seem to offer many options when it comes to language support, as users are limited to its custom GSQL query language. TigerGraph says they developed GSQL as customers found other query languages incomplete, while GSQL is complete with familiar SQL + procedural syntax.

Still, while data import/export to/from TigerGraph can be achieved via RDF or CSV, for queries things are not that simple. TigerGraph says it provides a migration path to GSQL for customers using Gremlin or Cypher, but this has to be done by a dedicated solution team. We would not be surprised to see TigerGraph adding support for TinkerPop soon, though.

And to wrap up on the language support front, today CS also announced a partnership with CData Software to provide SQL query access. The idea is to offer SQL access to the underlying AnzoGraph, leveraging existing SQL skills.

As SQL is built for the relational model, it may be difficult to map graph constructs. It would not be a first, though. SPARQL to SQL translation has been around for a while. As for SQL to SPARQL, data.world, for example, has this already. It will be interesting to see how this evolves and whether SQL on SPARQL becomes a thing.
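
To illustrate why the mapping takes some care, here is a rough Python sketch of how a simple SQL `SELECT` could be rewritten as SPARQL. This is not how CData or data.world do it. The function name `select_to_sparql`, the `ex:` prefix and the `http://example.org/` namespace are all assumptions made for illustration: the table is treated as an RDF class and each column as a predicate.

```python
def select_to_sparql(table: str, columns: list[str],
                     base: str = "http://example.org/") -> str:
    """Rewrite `SELECT columns FROM table` as a SPARQL query string.

    Assumed mapping (hypothetical): table -> RDF class, column -> predicate.
    """
    lines = [f"PREFIX ex: <{base}>"]
    # Each selected column becomes a SPARQL variable.
    lines.append("SELECT " + " ".join(f"?{c}" for c in columns) + " WHERE {")
    # Each table row corresponds to a resource typed with the class.
    lines.append(f"  ?row a ex:{table} .")
    # Each column becomes one triple pattern on that resource.
    for column in columns:
        lines.append(f"  ?row ex:{column} ?{column} .")
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Roughly: SELECT name, age FROM Person
    print(select_to_sparql("Person", ["name", "age"]))
```

One mismatch shows up even here: a SQL row with a NULL column is still returned, whereas the graph pattern above only matches resources that have every listed property. A faithful translation would need `OPTIONAL` patterns, among other things.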
