Graph data standardization: It's just a graph, making gravitational waves in the real world
AWS, Google, Neo4j, Oracle. These were just some of the vendors represented in the W3C workshop on web standardization for graph data, and what transpired is bound to boost adoption of the hottest segment in data management: Graph.
Getting a number of vendors to talk to each other, let alone align, is no easy feat. Adding academics and researchers does not necessarily make things easier. Now try adding to the mix a fragmented community and long-standing unresolved issues, and you get the picture of why graph data standardization has not been achieved so far.
This, however, seems about to change, and that's good news for everyone. We have been closely following the rise of graph databases for the past couple of years. The stars seem to be finally aligning for graph, and the Gartners and Forresters of the world are picking up on this, too.
"I always felt graph was better suited being embedded under the hood because it was a strange new database without standards de facto or otherwise. But I'm starting to change my tune -- every major data platform provider now has either a graph database or API/engine."
Bingo. Standards -- de facto or otherwise. The technology has been making progress, to the point where using graph at scale is now feasible. But going for a piece of the incumbents' pie without a way to interoperate can be challenging. Just ask the NoSQL crowd, which ended up largely adopting SQL. So this is where W3C comes in.
Neo4j is the market leader in graph databases, as per the DB-Engines index. We've had a number of conversations with Emil Eifrem, Neo4j's CEO, including one last week, just before the W3C workshop, in which Eifrem was adamant: Standardization is a top priority for Neo4j.
The graph database landscape has been fragmented, with property graphs and RDF representing different ways to model, store, and query data, and no standard way to interoperate. While RDF is standardized, property graphs are not.
This has been detrimental to graph database adoption, and experts, standards bodies, and vendors have all realized this. The W3C Workshop on Web Standardization for Graph Data brought a who's who of graph databases to Berlin to address this issue.
A graph is a graph is a graph? RDF vs LPG
RDF has been around for about 20 years, initially driven by research and academia. Initiated by WWW inventor Sir Tim Berners-Lee's vision for a Semantic Web, RDF has a substantial stack. This stack includes things such as reasoning and rules, and there have been stable standards there for a while now, including ones for serialization, schema, and querying.
The problem, however, is that pragmatism has not always been a core concern there. Plus, tooling for RDF has been sparse and not always easy to use. Take JSON-LD, for example. Coming up with a standard way to serialize RDF based on JSON, the most popular format for web developers, seems like a no-brainer.
The combination of JSON-LD and schema.org has probably done more to spread the use of RDF than anything else. Just getting Google and other search engines to adopt it has led to an array of use cases. And yet, JSON-LD was hugely controversial in its time in the RDF community. This was not the last controversy the RDF community faced, but it seems like JSON-LD's success may have had something to teach. But we'll get back to that shortly.
Property graphs have been around for about 10 years, and have been driven by the industry. As such, you could say they are a mirror image of RDF: Pragmatism rules, tooling is abundant and easy to use, outreach and community building are a top priority, but standardization has so far been an afterthought.
Most property graph solutions do not have a schema, or have a very basic schema. Just getting data in and out of property graph solutions is an exercise in patience and improvisation -- good luck representing a graph structure in CSV, and mapping that from solution to solution. There is no standard query language for property graphs. And there's no such thing as an abstract model, or semantics, for property graphs at this point either.
So what's at stake for the RDF world then, in which all of that already exists? An apt metaphor used in the W3C workshop to describe the status is that of a bridge. Building bridges was the main theme of the event after all. Building bridges among property graphs is one thing, but what about bridges between property graphs and RDF?
While property graphs have work to do in building pillars for this bridge to the RDF world, in RDF the pillars are mostly there, except for one thing: reification. If you're not into RDF, reification is something you've probably never heard of, and don't really care about either. But it's the key for building the bridge to the property graph world, and it seems like RDF is finally getting close to settling this.
Reification is a mechanism for adding properties to RDF graph edges, thus making them directly translatable to property graphs. Although this is possible, up to now there has not been one standard, agreed-upon way to do this. RDF* is a proposal on how to do this, introduced in 2014, which is getting traction in the RDF world.
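To make the idea concrete, here is a minimal sketch in Python of how one property graph edge with a property could be expressed both ways. It is illustrative only: the `Edge` class, the function names, and the `:alice`/`:knows`/`:bob`/`:since` terms are hypothetical, and values are assumed to be already serialized RDF terms. The first function uses the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`); the second emits the `<< ... >>` notation proposed for RDF*.

```python
from dataclasses import dataclass, field


@dataclass
class Edge:
    """A property graph edge (hypothetical model, for illustration only)."""
    source: str
    label: str
    target: str
    properties: dict = field(default_factory=dict)


def to_standard_reification(edge, statement_id):
    """Describe the edge as an rdf:Statement resource and attach properties to it."""
    triples = [
        # The edge itself, asserted as a plain triple.
        (edge.source, edge.label, edge.target),
        # Four extra triples that describe the statement as a resource.
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", edge.source),
        (statement_id, "rdf:predicate", edge.label),
        (statement_id, "rdf:object", edge.target),
    ]
    # Edge properties become triples about the statement resource.
    for key, value in sorted(edge.properties.items()):
        triples.append((statement_id, key, value))
    return triples


def to_rdf_star(edge):
    """Emit Turtle*-style lines: the triple itself is the subject of each property."""
    spo = f"{edge.source} {edge.label} {edge.target}"
    lines = [f"{spo} ."]
    for key, value in sorted(edge.properties.items()):
        lines.append(f"<< {spo} >> {key} {value} .")
    return lines


if __name__ == "__main__":
    edge = Edge(":alice", ":knows", ":bob", {":since": "2010"})
    for triple in to_standard_reification(edge, "_:s1"):
        print(" ".join(triple), ".")
    print()
    for line in to_rdf_star(edge):
        print(line)
```

The contrast is the point of the sketch: standard reification needs four bookkeeping triples per edge before any property can be attached, while the RDF* form states each edge property in a single line.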
One of the outcomes of the W3C workshop was a practically unanimous agreement to make this a W3C specification. This technicality, or red herring as some people called it, has been stalling the RDF community for a long time. Watching this being finally sidelined, hopefully for good, was reminiscent of the account renowned sociologist Harry Collins gives of the gravitational physics community.
In his book, Artifictional Intelligence, Collins embarks on a description of the way people construct meaning socially. As a case study, he uses the gravitational physics community, in which he has been embedded, and their convergence around gravitational wave experimental evidence in 2015. Watching the RDF community converge around RDF* has been similar in many ways.
It's just a graph, making waves in the real world
It remains to be seen whether RDF* can be as pivotal for RDF, and graph at large, as gravitational waves have been for physics. The potential and the dynamics are certainly there, and people in the W3C workshop seem to have left with the commitment to keep working on those pillars and bridges.
In the meantime, however, graph is making waves in the real world. In the end, as Brad Bebee from AWS Neptune put it in his keynote, it's just a graph. Users don't really care about the underlying technicalities; they are getting up to speed with the fact that "graphs let us integrate data like crazy."
Neptune is a cloud-based graph database from AWS, which lets users use both RDF and property graphs, and would benefit immensely from having those bridges in place. As Bebee pointed out, Neptune was among the most popular new AWS products in 2018 according to a social media poll at the recent AWS re:Invent conference. This speaks volumes, but it's not all that's new in the graph database world.
But what about cloud and scaling up? Neo4j does not offer a managed cloud version at this point. As this is getting to be table stakes for any database solution, Neo4j is working on this. Eifrem said a managed cloud version of Neo4j based on Kubernetes is currently in private beta, feedback is good, and general availability is coming soon.
Plus, Neo4j will be making a substantial number of hires in the coming period. The quite unglamorous, but sorely needed, effort to do this and scale the company up is what's keeping Eifrem busy. So there are not that many shiny new toys to show for now, but Eifrem alluded to more of this coming soon. In the meantime, however, other vendors are stepping up their game too.
Case in point: RDF vendors are adding support for property graphs. AWS already has this, and Cambridge Semantics and Stardog are adding it as well. Plus, multi-model support, and JSON as part of this, is becoming a key feature for many vendors. JSON-LD has opened the door, and in the past couple of months vendors such as AllegroGraph and Ontotext have added support for JSON, too. We'll be back with more in-depth analysis of this space soon.
NOTE: This article was updated on 3/11/2019 to clarify the source for the reference to AWS Neptune as a popular AWS product.