Graph technology is well on its way from fringe domain to mainstream. We take a look at the state of the union in graph, featuring Neo4j's latest release and insights as well as data and opinions from Cloudera, DataStax, and IBM.
Polyglot persistence is becoming the norm in big data. Gone are the days when relational databases were the one store to rule them all; now the notion of using stores with data models that best align with the problem at hand is becoming increasingly popular and understood.
Graph is a data model that has long lingered on the fringe of mainstream adoption. It was one of those things that few had heard of, fewer really knew, and fewer still actually used. But that is rapidly changing, as the benefits of graph are becoming pronounced and graph use cases, awareness and adoption are on the rise.
The state of graph databases worldwide
A few days ago, IBM released "The State of Graph Databases Worldwide," a report on the adoption and use case characteristics of graph databases. The report features a survey with responses from 1,365 entrepreneurs and developers in diverse industries across 74 countries about the potential they see for graph databases as well as their current and planned use for this technology.
Nearly half of the respondents (43 percent) reported that they are using or planning to use graph technology for transactional applications, while almost one third (29 percent) use or plan to use it for batch analytics.
This may be somewhat counterintuitive, as the most widely known applications of graph technology to date have been exploratory analytics (such as the Panama Papers investigation) or fraud detection use cases.
It turns out though that while fraud detection is indeed a prime application domain for graph technology, and many organizations are using it for this purpose, it's neither the only one nor the most popular one.
Network and IT operations, MDM (master data management), personalization and recommendations, and resource optimization were all in the same league of popular applications. Note that many of those require real-time, transactional features.
As for the reasons people give for using graph technology? It mostly comes down to speed and ease of use. Graph databases think like you do, according to their proponents: as per one Application Architect respondent, "for domains that have a good fit, graph provides zero friction between model and implementation, which results in exponential savings in time and effort."
What all of that means is that there are enough good reasons for people to be considering graph technology and for vendors like IBM to take interest and add it to their portfolio. But when it comes to graph technology, most people would probably first think of one name: Neo4j.
Neo enters the matrix
Neo4j has been around since 2007, working exclusively on graph technology. As Emil Eifrem, Neo4j's CEO and co-founder, noted in a conversation we had at last week's GraphConnect on the occasion of the release of v3.2, their ambitions are nothing short of becoming a household name for the database world.
"It's absolutely possible -- what we've stumbled upon is fundamental. We think in terms of connections and associations, our brain is organized that way, and then in math, graphs are central. It would be very unusual if in data that was not the case.
There's value in connections, and you can look at examples in the consumer web to verify this: Google and LinkedIn have overshadowed their predecessors like AltaVista and Monster by leveraging the power of connections.
We're at 0.1 percent of where we want to be - this is the first point in the first set of the first match, to use a tennis game metaphor. Ten years from now, all the software you use will directly or indirectly depend on a graph database."
Eifrem arrived at graph technology not by getting exposure to graph theory, but by being confronted with the realities of software development.
"A few years back, I was working as a software architect on an enterprise content management solution. That's essentially like a big web file system, with files and folders and users and groups and permissions etc. And all the relations between them -- what belongs where and who accesses what and so on.
We were using a standard software development stack, including a relational database, and that was a problem. Trying to model and apply connections between items, actors and groups using a relational database ended up taking 50 percent of our time.
I was then, and still am, a huge fan of relational databases. They were my friend. But then I realized, relational databases are not always your friend.
We as an industry did not have the terms for it back then, which is partly why it took us so long to realize what was the problem: the mismatch between the shape of our data and our data model. When we did realize, we thought hey, what we need is something like the relational database -- robust, transactional and so on, but for graphs, not for tabular data."
Native graphs as a system of record
Which brings us to a key topic. Relational databases have been around for four decades now. Could graph technology, and Neo4j in particular, possibly have caught up to the degree of aiming to beat them at their own game?
Most people would be willing to concede graph as a contender, or even a more suited solution, for analytics. But a graph database serving as a system of record? You need transaction support and scale for that, at a minimum. Eifrem is quick to point to a number of clients using Neo4j as a system of record.
"We have always had ACID support. In version 1.0 back in 2009, we supported XA and could participate in two-phase commits with Oracle. That was not a very popular view in the NoSQL world back then, but if you don't support transactions natively, that means developers will have to do it on the application layer and that sucks.
I think by now that is well understood, and you see more and more transactional features being added every year even where previously eventual consistency was touted as good enough. For us, the way to go is to start with full ACID compliance and then loosen it up. Causal consistency means you can work in your 100-node Neo4j cluster environment and still guarantee that you read your own writes in the right sequence."
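To make the read-your-own-writes guarantee concrete, here is a minimal Python sketch of one common way to get it: the writer receives a bookmark and presents it on later reads, so a lagging replica cannot serve a stale answer. The `Cluster` class, its method names, and the single-replica setup are hypothetical and only illustrate the idea; this is not Neo4j's API or its actual clustering protocol.

```python
class Cluster:
    """Toy cluster: one leader plus one lagging replica (hypothetical, not Neo4j's API)."""

    def __init__(self):
        self.log = []             # ordered (key, value) writes accepted by the leader
        self.replica_applied = 0  # number of log entries the replica has applied

    def write(self, key, value):
        """Append a write on the leader and return a bookmark (the log position)."""
        self.log.append((key, value))
        return len(self.log)

    def replicate(self, entries=1):
        """Simulate asynchronous replication catching up by a few entries."""
        self.replica_applied = min(len(self.log), self.replica_applied + entries)

    def read_from_replica(self, key, bookmark=0):
        """Read from the replica, but only if it has caught up to the bookmark."""
        if self.replica_applied < bookmark:
            raise RuntimeError("replica is behind the bookmark; retry or route elsewhere")
        # Rebuild the replica's view from the entries it has applied so far.
        state = dict(self.log[:self.replica_applied])
        return state.get(key)
```

In this sketch, a read without a bookmark can silently return stale data, while a read that carries the bookmark either sees the write or fails loudly until the replica has caught up.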
What about scale then? Jim Webber, Neo4j chief scientist, referenced benchmarks comparing Neo4j to other graph databases at GraphConnect. According to those, Neo4j can be anything from 2 to 100K times faster than its graph database counterparts. That is obviously hard to digest, but does it even matter? If Neo4j sees itself as a system of record, should it not be picking a fight with the relational database guys?
According to Eifrem:
"It does not make sense to compare databases based on different data models. It was valuable in the first days, when the discourse was still that you can use a relational database for everything. Today the fact that your data model should match the shape of your data is broadly accepted. We all like the relational database, but if your data is connected, we are a million times faster.
We have a native architecture, which means we own the entire stack from the OS and up. We have designed and optimized every bit of that stack specifically to work with graphs. What others are trying to do is add a layer on top of systems designed to work with other data models to translate to graph. It makes sense from a time to market perspective, and it creates less friction with the Ops team.
If you have invested a gazillion in your system and have operational familiarity you don't want to start building from scratch. But if you're building on top of another system, graph operations are just another application. We have been building all the way up the stack forever. I don't think we as an industry realize the results of the native approach are spectacular. This is why we are 100, 100K times faster."
As Eifrem says, building all the way up the stack takes a while. Which may help explain why there are still a few things missing from Neo4j, such as schema.
"Schema is good. Unlike transactions though, schema has always been an ambition for us.
I don't believe in a schema-free only approach, and we want to add schema over time. With every release, we're adding more schema. We have just added Node Keys, a way of declaring external primary keys.
A schema-free approach can be valuable in the initial stages of a project, when developers are trying different things and businesses are not sure about what they want. So schema-free enables you to iterate quickly.
But as things stabilize, it's the same as transactions -- if we don't support it, developers will have to enforce it on the application layer. Our strategy is to be schema-optional, but you can expect to see more semantics going forward."
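As an illustration of what enforcing such a constraint involves, here is a minimal Python sketch of a schema-optional store. It assumes a node key means that the listed properties must all be present and that their combination must be unique among nodes with that label. The `NodeStore` class and its methods are hypothetical and are not Neo4j's API.

```python
class NodeKeyViolation(ValueError):
    """Raised when a node would break a declared node key."""


class NodeStore:
    """Toy in-memory node store (hypothetical; not Neo4j's API)."""

    def __init__(self):
        self.nodes = []         # every stored node as (label, properties)
        self.node_keys = {}     # label -> tuple of key property names
        self.seen_keys = set()  # (label, key values) combinations already stored

    def declare_node_key(self, label, *props):
        # Assumption: keys are declared before any node with this label is added.
        self.node_keys[label] = props

    def add_node(self, label, **props):
        key_props = self.node_keys.get(label)
        if key_props is not None:
            # Every key property must be present on the node...
            missing = [p for p in key_props if p not in props]
            if missing:
                raise NodeKeyViolation(f"{label} is missing key properties: {missing}")
            # ...and the combination of their values must be unique for the label.
            key = (label, tuple(props[p] for p in key_props))
            if key in self.seen_keys:
                raise NodeKeyViolation(f"duplicate node key for {label}: {key[1]}")
            self.seen_keys.add(key)
        # Labels without a declared key stay schema-free.
        self.nodes.append((label, props))
        return props
```

Labels with no declared key accept any properties, which mirrors the schema-optional idea: constraints apply only where someone has asked for them.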
What else can we expect to see? For example, at GraphConnect, eBay presented how they are using Neo4j to support things like inheritance, versioning, and vocabulary management. Would Neo4j be willing to pick these up?
"Our job is to build the best implementation of the graph model. And we are always in touch with our community to see what patterns they are implementing and adopt them in our model.
Take labels for example. In the beginning we didn't have labels for nodes. Then we saw everyone implementing these lightweight schemas, for example declaring that this node is a Car. So we pulled this in and made it part of Neo4j.
Some of the things eBay is working on we see elsewhere too. Not all of those will make it -- for example, I have complicated views on inheritance -- but others will. This is how we work with our community -- what makes sense gets adopted."
The graph technology solution landscape
Is Graph the One?
Clearly, Neo4j is an opinionated vendor with big plans. But even though they seem to be leading the graph technology race at this time, they are definitely not the only graph in town. There are graphs everywhere, from Hadoop to DataStax and beyond, and people have different views on them.
"Graph is really funny, because it's a new technology that's hot right now so people get exuberant and want to use it for everything. Aurelius people too -- for them everything is a graph," says Patrick McFadin, VP of developer relations at DataStax. DataStax recently acquired Aurelius, one of the leaders in graph technology, and added its Titan graph database to its stack.
According to McFadin:
"What's really interesting about graph and makes folks take notice is what the leaders in graph technology are doing with it: Google, Facebook, LinkedIn. It's more than an idea, it works in practice. Graph is a hard problem to solve, like AI, but these organizations are cracking it. And yes, these two are probably linked -- graph is a natural data model to use for AI.
But while the social graph has great value, other domains are a different story. Why go for it? The thing is, while relational databases are great at expressing relations, graph databases are great at finding relations. That's where the value is.
The classic Facebook example is something like, you're in that area where restaurant X is, and two of your friends have been there, so aha. You can't write SQL for that efficiently, it's a hard problem. But when you have a graph model and you traverse it, that's where the magic comes in."
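The kind of traversal McFadin describes can be sketched in a few lines of Python over a tiny in-memory graph. All names and data below are made up for illustration, and the adjacency dictionaries stand in for what a graph database would store as nodes and relationships; this is a toy, not how any of the products mentioned here implement traversal.

```python
# Hypothetical sample data: who is friends with whom, who visited what, and where places are.
friends = {
    "you": {"ana", "ben", "cleo"},
    "ana": {"you"},
    "ben": {"you"},
    "cleo": {"you"},
}
visits = {
    "ana": {"Trattoria X", "Noodle Bar"},
    "ben": {"Trattoria X"},
    "cleo": {"Cafe Y"},
}
located_in = {
    "Trattoria X": "Old Town",
    "Noodle Bar": "Old Town",
    "Cafe Y": "Harbor",
}


def recommend(person, area, min_friends=2):
    """Places in `area` that at least `min_friends` friends of `person` have visited."""
    counts = {}
    # Hop 1: person -> friends. Hop 2: friend -> visited places.
    for friend in friends.get(person, ()):
        for place in visits.get(friend, ()):
            # Keep only places located in the area of interest.
            if located_in.get(place) == area:
                counts[place] = counts.get(place, 0) + 1
    return sorted(place for place, n in counts.items() if n >= min_friends)


print(recommend("you", "Old Town"))  # prints ['Trattoria X']
```

Each hop only touches the neighbors of the node it starts from, which is the intuition behind "finding relations" by traversal rather than by joining whole tables.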
So where does graph fit in for DataStax? "The biggest player there is Neo4j, but it's really a single system graph database," says McFadin.
"When people come to DataStax, it's because they have a big workload problem. When they think about graph, they want a graph of all their customers, they want to upgrade their customer experience.
Our graph manages its own data, it's a native graph store and it's separate from Cassandra -- it's not like it works by projecting an index over it. So if you're managing data in a Cassandra table, it's not going to be available as a graph unless you move it there.
There's different types of integrations we are considering, such as time series. Graph works to discover and visualize connections in time series data, but when you're done with it, do you want to store that again as a graph? Probably not, you will want to use Cassandra. That's the kind of thing we can do easily in our ecosystem."
Speaking of ecosystems, what about Hadoop? There are also a couple of graph processing frameworks there, most notably GraphX, but Sean Owen, Cloudera's director of data science, does not seem that enthusiastic about it.
"I tend to disbelieve in graph use cases. I often see graph frameworks being applied to use cases that don't lend themselves to this paradigm. They do exist, but they're relatively rare. Many people would like to apply it to just about anything, but I don't see it used that much.
GraphX, I won't say it's deprecated, but it's not that active really. It's there, it's included in our release, but it's not used that much. I don't really know why -- maybe people don't have all that many graph problems to solve, or maybe they're using other solutions.
A lot of the graph use cases are really graph database use cases, so it could be that is what people are using. There's Neo4j, Titan -- now acquired by DataStax -- and Gaffer. There's also Giraph and SAP Hana, which also supports graph.
Still, I don't know what to make of it. Every time I say this, people go -- oh, you're wrong, there is this wonderful graph use case and so on. It must be true, but for some reason it still feels like the exception. I guess to some degree it's a chicken and egg problem, as there is a paradigm shift there."
We will follow up on the graph theme with what is one of the first graph models around that you've probably either never heard of or already dismissed.