RDF is a graph data model that has been around since 1997. It's a W3C standard, and it's used to power schema.org and Open Graph, among other things. Plus, there's a bunch of RDF-based graph databases out there, some of which have been around for a while and can do things other databases can't do.
RDF is also a building block of the Semantic Web, and the Semantic Web has a bad reputation. It's impossible, it's academic, it's inherently flawed, people working on it don't have a clue about real-world problems and software engineering, and it will never scale, according to its critics.
Could it be however that RDF is not entirely useless, and RDF databases are worth a look? How do they compare with Neo4j, the most popular graph database out there? Here's a crash test of sorts.
You give graph a bad name
Neo4j CEO Emil Eifrem says he has spent a lot of time in the Semantic Web world, and at a high level the idea of seeing the world as a graph and modeling it as such is one that philosophically he could not agree more with.
"Semantic Web and Neo4j are spiritual brothers and sisters. Where we do differ is the implementation details, and I'm someone who thinks details matter," says Eifrem.
"RDF is basically a publication model for the web. Some people have tried to turn that into a software data model, and that is something that I don't think makes sense. At the lower level, RDF is very expressive, but it's also way too granular.
Consider this: you have a node representing a person, and you want to give that person a name. In RDF, you would have one Person node, and you'd have to add another node for the person's name, and an edge 'hasName.'
In Neo4j, we have a more compact model: you just add a property for the person's name, and that's it. That may not sound like a big deal, but it may actually make the difference as to whether a product takes off or not.
Another difference has to do with the software engineering approach. There is a pool of super smart people in the Semantic Web community, but their approach is typically extremely academic. These are people whose main deliverable is an academic article, and that impacts the quality of the software.
Sometimes it's ok, but frequently it's very hard to use. Take APIs: when we write our APIs, we obsess over them -- every call, every parameter name. Broadly speaking, people who write RDF software obsess over wording, not APIs."
Eifrem is seconded in that notion by O'Reilly Media Learning group director and AI expert Paco Nathan. In his words, AI transformations need better tooling, and "the SPARQL and triple store crowd haven't gotten the memo yet about containers, orchestration, microservices, etc."
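To make Eifrem's person-name example concrete, here is a minimal sketch in plain Python, with no RDF or Neo4j library involved. The identifiers (`ex:alice`, `ex:hasName`, the `Node` class) are hypothetical and chosen only for illustration; real stores represent both models very differently under the hood.

```python
from dataclasses import dataclass, field

# RDF style: every fact is a (subject, predicate, object) triple.
# The name is a separate value attached through a 'hasName' predicate.
rdf_triples = {
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:hasName", "Alice"),
}

# Property graph style: the name is a key/value property on the node itself.
@dataclass
class Node:
    labels: set
    properties: dict = field(default_factory=dict)

lpg_node = Node(labels={"Person"}, properties={"name": "Alice"})

def rdf_name(triples, subject):
    """Find the name by looking for the matching 'hasName' triple."""
    for s, p, o in triples:
        if s == subject and p == "ex:hasName":
            return o
    return None

def lpg_name(node):
    """Read the name directly from the node's properties."""
    return node.properties.get("name")

print(rdf_name(rdf_triples, "ex:alice"))  # Alice
print(lpg_name(lpg_node))                 # Alice
```

Both shapes carry the same information. The difference Eifrem points to is how many graph elements a developer has to think about to express it.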
RDF -- what is it good for?
So, should we just dismiss RDF as impractical and RDF stores as inferior software delivered by academics and move on? Maybe not so fast. There are RDF vendors who are entirely professional about what they do, and RDF does have certain things to offer that are not there in other graph data models.
We had a discussion with Vassil Momtchev, GraphDB product owner at Ontotext, about the benefits and use cases of RDF. Ontotext's legacy is in text mining algorithms; these days, however, it's mostly known for GraphDB, its RDF graph database engine.
"Our text mining is backed by a complex semantic network to represent background and extracted knowledge. Back in 2006, we found that none of the existing RDF databases were able to match our requirements for a highly scalable database. This was how GraphDB started," says Momtchev.
"RDF databases are very good at representing complex metadata, reference, and master data. Nearly all of our clients use this technology for the representation of concepts with high complexity, where semantic context and data quality are critical.
RDF technology is very powerful in setting standards on how to publish, classify or report information. Its strongest feature is the ability to share and publish data in an open way.
If you want to expose data so it can be easily consumed by other users, federated across different information systems, or linked by a third party system, it's a technology without many viable alternatives actually. The main area of use cases for GraphDB is where organizations manage highly valuable data.
Content providers that store highly valuable information that does not expire, and that need a standard way to publish information assets, are typical users. Benefits include increased discoverability, better semantic context, easier knowledge exploration, and navigation."
GraphDB has clients like AstraZeneca and the BBC to show for it, and publishing data and data integration scenarios are where RDF shines. But what about real-time, transactional applications at scale?
"Scalability is one of the first questions when people ask about the potential of RDF databases. Most of our clients use graphs in the range of 500 million to 1 billion RDF facts, while the biggest cluster installations go up to 15 billion.
In the real world, datasets are typically smaller. For example, all structured knowledge in Wikipedia is less than 800 million facts, and that is really a lot of data.
GraphDB is most commonly used in scenarios with high model complexity, where semantic context and data quality are important. GraphDB data are used as a standard for master or reference data in the organization.
Although we support ACID properties, the engine is designed to work well with big batch updates rather than thousands of small transactional updates."
Graphs with schemas
Interestingly, part of what made RDF notorious may also be one of its biggest strengths: rich semantics and inference. This is a point on which views diverge.
"GraphDB can be used with and without schema. The definition of schema allows users to control what the database will derive as implicit information or validate in the sense of data constraints," explains Momtchev.
RDF has different types of schemas one can use, ranging from RDFS to OWL 2 variants. Each of these offers a different tradeoff between expressiveness and complexity, ranging from simple constraints and inheritance to description logic.
Some RDF stores, including GraphDB, can infer new knowledge from existing facts. This is essentially rule-based reasoning, and can range from simple inheritance to anything an OWL-based reasoning engine can generate.
For example, if an RDFS schema contains the knowledge that Persons are Entities, then an RDF(S) compliant store containing the fact that A isA Person will infer that A isAn Entity. When querying for Entities, A will also be fetched, even if the A isAn Entity fact is not explicitly contained in the database.
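The Person/Entity example above can be sketched as a tiny forward-chaining loop in plain Python. This is only an illustration of the idea, not how GraphDB or any other store implements reasoning; the `ex:` names and the `infer_types` function are made up for the example, and only two RDFS-style rules (subclass transitivity and type inheritance) are covered.

```python
SUBCLASS_OF = "rdfs:subClassOf"
TYPE = "rdf:type"

def infer_types(triples):
    """Apply two RDFS-style rules until no new triples appear:
    - subClassOf is transitive
    - an instance of a class is also an instance of its superclasses
    """
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            if p not in (TYPE, SUBCLASS_OF):
                continue
            # Follow every subClassOf edge leaving the class 'o'.
            for s2, p2, o2 in inferred:
                if p2 == SUBCLASS_OF and s2 == o:
                    new.add((s, p, o2))
        if not new <= inferred:
            inferred |= new
            changed = True
    return inferred

# Explicit facts: A is a Person, and Persons are Entities.
explicit = {
    ("ex:A", TYPE, "ex:Person"),
    ("ex:Person", SUBCLASS_OF, "ex:Entity"),
}

closure = infer_types(explicit)

# Querying for Entities now returns A, although that fact was never stored.
entities = sorted(s for s, p, o in closure if p == TYPE and o == "ex:Entity")
print(entities)  # ['ex:A']
```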
For Eifrem, "Inheritance is an example of how things went wrong in the RDF community. People thought, hey, I want to be able to model everything, and they added all these features, resulting in OWL being so complicated that nobody can use it in practice. That's why we are so hesitant about adding schema."
Still, if RDF and Neo4j are such close relatives, would it not make sense for the two worlds to maintain a relationship, and perhaps even adopt some features from each other? Family relationships can be complicated, and this seems to be the case here as well, despite some efforts to bring the two communities together.
Features such as named graphs (giving the ability to provide context for graphs), an expressive query language, provenance and vocabulary management are either still being discussed or were only recently added in Neo4j. In the RDF world, they have been around for years.
"We still monitor the RDF community closely for inspiration, and we want to be able to import and export RDF easily" says Eifrem. "There's a key advantage there -- RDF uses URIs, and having a globally unique identifier is central for a publication format. But exposing RDF to developers? Absolutely not."
In fact, even though it's not a very prominent feature, Neo4j can import and export RDF data. It looks like that's as far as Neo4j is willing to go though, by treating RDF as another data exchange format like CSV, albeit one that is clearly more powerful.
But even if getting RDF out of Neo4j is an option, that can only be done using Cypher, Neo4j's query language, not SPARQL, the query language that comes as part of the RDF stack. A Cypher versus SPARQL comparison is a rather nuanced topic. One fact about this approach however is that Cypher at this point can only be used with Neo4j.
By contrast, SPARQL adapters exist for anything from relational databases (which by extension cover any ANSI-SQL compliant system such as Hadoop data lakes) to CSV. What this means is that SPARQL can be used to integrate your 99 data stores, while Cypher cannot.
"Cypher is a much lower friction language, and there is the technology value and the go-to-market value. We'd much rather invest in integration with relational databases. A few years back it may have made sense for us, but now there's more people into Cypher than into SPARQL" says Eifrem.
At GraphConnect, there was a session dedicated to the relationship between RDF and LPG -- Neo4j's data model. It was organized by an ex-RDFer turned LPG, Jesus Barrasa. Again, this is a nuanced topic that comes down to the fact that these are different graph models, so you should investigate before making a choice.
When asked to comment on the points made by Barrasa, Momtchev generally conceded, pointing out the differences in the two models and ways of alternative modeling. There was however one statement made by Barrasa that sparked a snowball of a reaction:
"RDF stores are very strongly index based. Neo4j is navigational (implements index free adjacency). Index based storage is ok for not very deep queries, forget path analysis. Neo4j's native graph storage is best for deep or variable length traversals & path queries."
And Momtchev remarks:
"From an implementation point of view the RDF specification does not require building many indexes. Still, if you want basic inference like A is the same as B, you most likely need it.
But this is an extremely controversial, disputable and highly biased statement towards LPG, ignoring tons of theory books. The general trade-off is always read vs write performance -- more indexes means slower writes but faster reads.
If Neo4j claims that they have optimized better their storage for general purpose use cases I think this is a very long shot. At the same time the claim for the extremely fast path queries in real life scenarios has been proven incorrect by many developers.
To be fair I would turn the question in a different way: how much control do you have over the indexing so you can cover some extreme use case? To my best knowledge none -- you use the database exactly in a single way, as it was designed to be used."
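The trade-off both sides refer to can be illustrated with a toy sketch in Python. It does not settle the dispute and does not reflect how GraphDB or Neo4j are actually implemented; the class names, the three-index layout and the `index_writes` counter are assumptions made purely for illustration. The first class pays several index writes per fact in exchange for direct lookups, while the second stores neighbour lists and answers path questions by following them.

```python
from collections import defaultdict, deque

class IndexedTripleStore:
    """Toy triple store that keeps three indexes (hypothetical layout)."""
    def __init__(self):
        self.spo = defaultdict(set)  # subject -> {(predicate, object)}
        self.pos = defaultdict(set)  # predicate -> {(object, subject)}
        self.osp = defaultdict(set)  # object -> {(subject, predicate)}
        self.index_writes = 0

    def add(self, s, p, o):
        # One logical fact costs one write per index.
        self.spo[s].add((p, o))
        self.pos[p].add((o, s))
        self.osp[o].add((s, p))
        self.index_writes += 3

    def subjects_of(self, p, o):
        # Answered from the POS index without scanning every triple.
        return {s for (obj, s) in self.pos[p] if obj == o}

class AdjacencyGraph:
    """Toy graph in which each node holds a list of its direct neighbours."""
    def __init__(self):
        self.out = defaultdict(list)

    def add_edge(self, src, dst):
        self.out[src].append(dst)

    def reachable(self, start):
        # Breadth-first traversal that only follows neighbour lists.
        seen, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.out[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

store = IndexedTripleStore()
graph = AdjacencyGraph()
for a, b in [("a", "b"), ("b", "c"), ("c", "d")]:
    store.add(a, "knows", b)
    graph.add_edge(a, b)

print(store.index_writes)               # 9
print(store.subjects_of("knows", "c"))  # {'b'}
print(sorted(graph.reachable("a")))     # ['a', 'b', 'c', 'd']
```

Adding or dropping an index in the first class changes the write cost and the set of queries that avoid a scan, which is the kind of control Momtchev's closing question is about.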
"We have 100 times more adoption than all RDF stores combined, not because we are so much smarter, but because our software is so much easier to use". Although Neo4j does have its haters as well, adoption data seem to -- partially -- confirm Eifrem on this one. For most, RDF databases are considered a niche market.
"It depends on how you define the term niche market. Compared with other specialized databases like Object-oriented, XML or time series databases, I believe RDF databases are a very mature market with many competitors, where every vendor has its strengths and weakness.
For sure, RDF/graph databases are not ubiquitous like relational systems, which still dominate the market. Probably the main reason is better predictability in working with lower data abstraction levels.
In relational, you operate with the physical data model instead of the logical model like in RDF. Still, this key advantage of the relational model is also a big disadvantage when designing very complex information systems with hundreds of concepts."
So, what is the verdict here? If you are sold on graph, what kind of graph database should you use? As usual, the answer is "it depends." It depends on what your use case is. As Momtchev puts it:
"If you care about the standards and how you publish data, so it can be reused in a open format/protocol, SPARQL federation, linked open data and the like -- you go with a triple store. If you care about graph path analysis and fast transactional support -- you should consider Neo4j or another property graph implementation."