From graph to the world: pioneering a database virtual machine

What are the options for querying graphs, and how do we go from that to the equivalent of a virtual machine for databases?

Written by George Anadiotis, Contributor Jan. 22, 2018 at 10:02 a.m. PT

If you're not a graph aficionado, the name Marko A. Rodriguez probably does not mean much to you. Rodriguez, however, has been working at the intersection of research, engineering and entrepreneurship in the graph space his entire career. The fact that graph is now in the limelight is at least to some extent related to his efforts.

Rodriguez's primary contributions to the graph computing landscape, in his own view, have been the design and development of Apache TinkerPop's Gremlin as well as the founding of Aurelius, the graph consulting firm behind Titan.

His story sheds some light on the foundations of the emerging graph landscape, and hints at what could be coming next in the database world.

From Titan to DSE Graph and JanusGraph

Titan is the graph database around which a significant part of today's graph landscape gravitates, as it is the foundation for both DSE Graph and JanusGraph. Rodriguez says that circa 2010 there were only a few graph database vendors around and it was difficult for companies to find theoretical and applied support for their graph-related problems.

Thus, Rodriguez and Stephen Mallette founded Aurelius to handle the growing need for graph expertise in industry. The company grew fast, and they noticed a trend: the contracts they were getting required scalable graph technologies and at the time, according to Rodriguez, nothing of the sort existed in the market.

Rodriguez and Mallette met Matthias Bröcheler and Dan LaRocque, and thus Titan was born. Bröcheler and LaRocque were interested in developing a commercial version of a distributed graph database based on big table technologies such as Cassandra and HBase, and Rodriguez says the timing could not have been better.

The team pushed hard to create and promote Titan, which was a mix of the ideas of Bröcheler / LaRocque (graph structure) and Mallette / Rodriguez (graph process). Rodriguez describes it as a match made in heaven, and says Titan proved very successful not only for their clients, but also for the general public, which was in need of such technology.

But as is often the case, the growth of Aurelius proved daunting. As the team and client-base grew, it became harder for Rodriguez to focus on graph theory and technology. When Bröcheler presented Titan at Cassandra Summit 2014, DataStax's interest was piqued. This eventually led to the acquisition of Aurelius by DataStax in 2015, and Rodriguez was once more free to focus on what mattered most for him.

Titan was a strong contender in the graph space. Aurelius, the vendor behind Titan, has been acquired by DataStax and Titan forms the basis for DSE Graph. Titan has also been forked by IBM and others, forming the basis for JanusGraph. Image: Aurelius

Rodriguez says that DataStax acquired Titan because they were interested in adding graph capabilities to their pre-existing DataStax Enterprise (DSE), and this meant taking the ideas of Titan and TinkerPop and developing them within their commercial solution under the moniker of DSE Graph. When Aurelius was acquired, IBM, along with other companies, decided to move the Titan codebase forward as JanusGraph.

According to Rodriguez, while DSE Graph leveraged the concepts behind Titan, it was a complete re-write. This is a point that deserves attention, as it may help clear some confusion. In our recent Big on Data analyst roundtable on graph, one of the discussion points was the definition of a native graph engine.

With regard to DSE Graph, two opinions were voiced: one that saw it as a native graph, with Titan fully integrated under the same hood, and one that saw DSE and Titan as two otherwise unrelated components of an overarching platform, which would among other things mean data would have to be moved around. The latter view was based on earlier discussions with DataStax people, and is apparently not the case - at least not anymore.

On querying graphs

For the last three years, while with DataStax, Rodriguez has been developing Apache TinkerPop3, along with colleagues Stephen Mallette and Daniel Kuppitz. TinkerPop is an open source, vendor-agnostic graph computing framework that comes with its own graph querying language, Gremlin. TinkerPop's documentation states that:

"When a data system is TinkerPop-enabled, its users are able to model their domain as a graph and analyze that graph using the Gremlin graph traversal language. Furthermore, all TinkerPop-enabled systems integrate with one another allowing them to easily expand their offerings as well as allowing users to choose the appropriate graph technology for their application".

This means that TinkerPop can act as a layer that bridges different graph systems, with queries written in Gremlin reusable across implementations. In a nascent market such as graph databases, with over 30 different options and without the equivalent of SQL - a standard, universally accepted query language in the relational database world, this is an important point.
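
To make the portability idea concrete, here is a minimal sketch in Python. It does not use TinkerPop or Gremlin; the `GraphBackend` interface, the two backend classes and the `friends_of_friends` query are all hypothetical names invented for illustration. The point is only that a query written once against a shared interface can run unchanged over systems that store the graph differently.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class GraphBackend(Protocol):
    """Hypothetical minimal interface each graph system would implement."""

    def out(self, vertex: str, label: str) -> Iterable[str]: ...


class AdjacencyBackend:
    """Toy backend storing edges as vertex -> label -> list of neighbours."""

    def __init__(self, adjacency: Dict[str, Dict[str, List[str]]]) -> None:
        self._adjacency = adjacency

    def out(self, vertex: str, label: str) -> Iterable[str]:
        return self._adjacency.get(vertex, {}).get(label, [])


class EdgeListBackend:
    """Toy backend storing edges as flat (source, label, target) triples."""

    def __init__(self, edges: List[Tuple[str, str, str]]) -> None:
        self._edges = edges

    def out(self, vertex: str, label: str) -> Iterable[str]:
        return [dst for src, lbl, dst in self._edges
                if src == vertex and lbl == label]


def friends_of_friends(graph: GraphBackend, start: str) -> List[str]:
    """Backend-agnostic query: two 'knows' hops, deduplicated and sorted."""
    first_hop = graph.out(start, "knows")
    second_hop = {v for friend in first_hop for v in graph.out(friend, "knows")}
    return sorted(second_hop - {start})


# The same small graph, held in two different storage layouts.
adjacency_graph = AdjacencyBackend({
    "alice": {"knows": ["bob"]},
    "bob": {"knows": ["carol", "alice"]},
})
edge_list_graph = EdgeListBackend([
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "knows", "alice"),
])

print(friends_of_friends(adjacency_graph, "alice"))  # ['carol']
print(friends_of_friends(edge_list_graph, "alice"))  # ['carol']
```

The query function never learns which storage layout it is talking to, which is roughly the role the article attributes to TinkerPop as a bridging layer.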

There are a number of graph query languages out there at the moment besides Gremlin, with SPARQL and Cypher being the two most prominent ones. SPARQL is a W3C standard that works with RDF native graphs as well as other sources including relational databases using a mapping bridge. Cypher started out as Neo4j's query language and spawned the openCypher project, with support from SAP Hana Graph and Redis Graph among others.

Graph query languages

Then there are a number of vendor-specific query languages, and GraphQL. GraphQL has a few things going for it: it has been created and supported by Facebook, and it has an open specification and an intuitive syntax. But it can be argued that GraphQL is not exactly a graph query language.

GraphQL mostly resembles a different approach to REST APIs, and is limited in what it can express in terms of graph constructs, as it only supports trees. It has however formed the basis for an alternative approach to querying RDF graphs, called HyperGraphQL.

Gremlin offers portability across the spectrum, including SQL. But what about performance, and ease of use? A third-party evaluation shows Gremlin to be comparable to Cypher, for example, in terms of performance, at least for simple queries. As for syntax, while this is somewhat subjective, Gremlin may not be everyone's cup of tea when it comes to writing graph queries.

But Gremlin does have one thing that no other option seems to have: it is more than a query language. This seems to be the point Rodriguez is trying to make: pick the graph back-end of your choice. Write your queries in your language of choice. And if you need to port them, there's Gremlin to help do that.

From graph to the world

TinkerPop indeed has the widest range of industry support at this point: from AWS Neptune to Microsoft CosmosDB, and from Hadoop and Spark to Neo4j and RDF vendors via a Gremlin-SPARQL bridge. Rodriguez says that as the number of TinkerPop-enabled graph vendors grew, he, Mallette and Kelvin Lawrence of IBM worked to get TinkerPop into Apache.

With their combined efforts, Apache TinkerPop became a reality. Rodriguez says "this move really helped everyone's cause, as Apache TinkerPop has become better positioned to enable a new generation of graph system vendors. TinkerPop3 was the first Apache release and it has proved extremely successful.

With TinkerPop3, Gremlin is not only a language for processing graphs, but also a virtual machine, similar in many respects to the relationship between Java and the Java virtual machine. Microsoft's CosmosDB and Amazon's Neptune realized the benefit of this model and have recently become adopters of Apache TinkerPop".

Gremlin aims to operate as a database virtual machine, offering a portability layer for many different database engines. Image: DataStax

Rodriguez does not believe that the graph space will battle it out in the language arena: "there will never be a 'standard graph language'. In analogy, there is no 'standard programming language'. Java, Scala, Clojure, Groovy, etc. all compile to the Java virtual machine. Likewise, Apache TinkerPop enables Gremlin, SPARQL, Cypher, SQL, etc. to compile to the Gremlin traversal machine.

Instead, Apache TinkerPop's future will be focused on the universal adoption of the Gremlin traversal machine. If this distributed virtual machine can be advanced sufficiently enough, it may rise above its graph pigeonholing to become a general-purpose database virtual machine.

There is graph, but more generally, there is interconnected data regardless of the terminology that is espoused at the endpoint (documents, key/values, rows/columns, vertices/edges, etc.). With careful research and development, I think it is possible to take Gremlin to the larger database space, where database vendors will have much less to do in terms of query languages and data processing and can primarily focus their efforts on data storage and retrieval.

Virtual machine computing in the database space should open up a host of new approaches to database-driven application development and interoperability between data(base) systems. Under Gremlin's Turing Complete virtual machine architecture, data processing should go much more smoothly for both vendors and users".
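
The compile-to-a-common-machine idea Rodriguez describes can be sketched in a few lines of Python. This is a toy under stated assumptions, not TinkerPop's actual bytecode or API: the `(operator, argument)` instruction format, the two front ends `compile_fluent` and `compile_path`, and the `execute` function are all hypothetical. Two different query syntaxes compile to the same instruction list, and a single machine executes it.

```python
from typing import Dict, List, Tuple

Instruction = Tuple[str, str]             # (operator, argument), e.g. ("out", "knows")
Graph = Dict[str, Dict[str, List[str]]]   # vertex -> edge label -> neighbours


def compile_fluent(start: str, *labels: str) -> List[Instruction]:
    """Hypothetical front end 1: a query expressed as Python call arguments."""
    return [("V", start)] + [("out", label) for label in labels]


def compile_path(text: str) -> List[Instruction]:
    """Hypothetical front end 2: a path string such as 'alice/knows/knows'."""
    start, *labels = text.split("/")
    return [("V", start)] + [("out", label) for label in labels]


def execute(graph: Graph, program: List[Instruction]) -> List[str]:
    """The shared 'machine': applies each instruction to the current traversers."""
    traversers: List[str] = []
    for operator, argument in program:
        if operator == "V":
            # Start the traversal at a named vertex, if it exists.
            traversers = [argument] if argument in graph else []
        elif operator == "out":
            # Follow outgoing edges with the given label.
            traversers = [neighbour
                          for vertex in traversers
                          for neighbour in graph.get(vertex, {}).get(argument, [])]
        else:
            raise ValueError(f"unknown instruction: {operator}")
    return traversers


graph: Graph = {
    "alice": {"knows": ["bob"]},
    "bob": {"knows": ["carol"]},
}

program_a = compile_fluent("alice", "knows", "knows")
program_b = compile_path("alice/knows/knows")

print(program_a == program_b)     # True: both syntaxes yield the same instructions
print(execute(graph, program_a))  # ['carol']
```

In this arrangement, adding a new query language means writing another front end, while the storage side only has to support the machine's instruction set.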

A vision as grand as this needs a solid foundation. Rodriguez's way of dealing with that is the Red Herring: this is the name of the boat he is currently sailing on as part of his sabbatical. Rodriguez will be working on a book entitled "Graph Computing Theory", which will also help him formalize thoughts for TinkerPop4. Whether this will be a step closer to the vision, we'll have to wait and see.