Catching up with Amazon Neptune

In the run-up to re:Invent next week, we review recent announcements for Amazon's Neptune graph database.

With re:Invent on tap next week, the Internet will be bursting with AWS announcements. Having delivered a raft of announcements last month, Neptune, Amazon's graph database, won't hog the spotlight next week. But that's hardly because the platform has stood still.

With Neptune in production for nearly a year and a half, we were interested to see what kind of take-up it has had. It is the rare example of a graph data platform that supports both the older Resource Description Framework (RDF) model that came out of the semantic web and the newer property graph model invented by Neo4j, which has become the default standard for most new graph database entries.

There's a good reason that AWS entered this market.

Big on Data bro George Anadiotis has lived the graph dream, and documented his life of graph in a deep dive report that he published just over a year ago. While we won't go as far as George in saying that we're in "the year of the graph," we will say that where there's smoke, there's fire, and there's been plenty of that reported in these pages recently. In the past few months alone, graph vendors have announced new venture funding, released new cloud services, and begun powering mainstream enterprise cloud business applications. Not wishing to be left out of the action, existing database vendors are adding graph support as part of strategies to make their platforms more extensible with multi-model capability. Count Oracle and Teradata among them: both are extending their relational platforms with graph overlays and APIs.

Neptune has a briefer history than Neo4j, which had roughly a 10-year head start, and it has yet to amass the same number of reference logos. Still, there are now almost a couple dozen prominent references, including Samsung, NBC Universal, Intuit, FINRA, and AstraZeneca. Among the use cases, Thomson Reuters delivers a financial service analytic application on Neptune that helps its customers analyze their global tax obligations and optimize their corporate financial structures accordingly. Nike Run Club uses Neptune as the hub of its member application for connecting people and activities based on shared interests, while Siemens developed a knowledge graph for organizing domain knowledge about industrial equipment.

What's new with Neptune?

Many of the features taken for granted with established enterprise SQL databases, such as query planning tools, ACID transaction support, and cloning, are becoming checklist items for graph databases.

Last month, AWS announced query, transaction, and query planning enhancements to Neptune. Specifically, it has added a change-data-capture (CDC) streaming capability that allows Neptune change logs to generate their own streams; these could conceivably feed Kinesis or Kafka for analytics and monitoring of real-time changes to Neptune graphs. AWS has also strengthened transaction semantics, a feature that has become increasingly important as graph databases are used for business-critical use cases. Significantly, given the differences in the way that RDF graphs (queried with SPARQL) and property graphs (queried with the Apache TinkerPop Gremlin language) handle queries, transactions are exposed to developers differently in each of the engines.

Other enhancements include new "Explain" features that provide detail on the actual query execution plans for SPARQL and Gremlin. A federated query capability for SPARQL across different Neptune clusters has also been added (but not yet for Gremlin). On the other hand, the Gremlin engine now has a session transaction capability, where queries are committed only after the session is closed; that allows multiple statements to occur within a single transaction, which is suitable for coding complex Gremlin transactions. The recently added support for database cloning is an example of a feature that simplifies cloud-based DevOps and deployments.
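To make the federated query idea concrete, here is a minimal sketch that assumes Neptune follows the standard SPARQL 1.1 Federated Query syntax, in which a SERVICE clause delegates part of a query to a second endpoint. The endpoint URL, the ex: prefix, and the Device/status vocabulary are all hypothetical, and the helper only builds the query text; it does not contact any cluster.

```python
def federated_query(remote_endpoint: str) -> str:
    """Build a SPARQL query that joins local triples with a remote endpoint.

    The vocabulary (ex:Device, ex:status) is made up for illustration.
    """
    return f"""
PREFIX ex: <http://example.org/>
SELECT ?device ?status
WHERE {{
  ?device a ex:Device .
  SERVICE <{remote_endpoint}> {{
    ?device ex:status ?status .
  }}
}}
""".strip()


# Hypothetical SPARQL endpoint of a second cluster.
REMOTE = "https://other-cluster.example.com:8182/sparql"

query = federated_query(REMOTE)
print(query)
```

The part of the pattern outside the SERVICE block is matched against the cluster that receives the query, while the part inside it is sent to the remote endpoint.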

While Amazon Neptune is a relatively new graph database entrant, AWS's move has prompted Neo4j and TigerGraph to usher in their own managed cloud services. Our take is that novel data platforms like graph are begging for the simplification that cloud managed services can provide: customers should focus on modeling the data rather than provisioning servers. And in fact, we'd love to see some cloud-based tools emerge that help developers model their graphs, which these services could deliver. Much of the data being captured by graphs, such as IoT, location, and social network data, lives outside the data center; that's yet one more reason that the natural home for graph databases should be in the cloud.

Property or RDF graph model? Which models have Neptune users adopted?

There's a good reason for the popularity of property graphs. Compared to RDF models, they are more flexible and, for many developers, more intuitive. While RDF models have a prescribed structure based on "triples" (subject/predicate/object), property graph models lack such constraints. There's an analogy in comparing JSON-based document data models with SQL relational models: JSON will be more intuitive to JavaScript programmers, and easier for developers in general outside the SQL community, because it is far more forgiving than relational when it comes to schema flexibility.
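A small sketch may help show the difference between the two models. It uses plain Python structures rather than any Neptune API, and every name and value in it is hypothetical: the RDF side stores each fact as one subject/predicate/object triple, while the property graph side lets vertices and edges carry their own free-form properties.

```python
# Illustrative data only; the identifiers and values are made up.

# RDF: every fact is a single (subject, predicate, object) triple.
triples = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", "Bob"),
]

# Property graph: vertices and edges each hold a property map.
vertices = {
    "alice": {"label": "person", "name": "Alice"},
    "bob": {"label": "person", "name": "Bob"},
}
edges = [
    # The edge itself carries a property ("since").
    {"from": "alice", "to": "bob", "label": "knows", "since": 2019},
]


def rdf_objects(subject, predicate):
    """Return the objects of all triples matching subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def pg_neighbors(vertex_id, label):
    """Return the targets of outgoing edges with the given label."""
    return [e["to"] for e in edges if e["from"] == vertex_id and e["label"] == label]


print(rdf_objects("ex:alice", "ex:knows"))
print(pg_neighbors("alice", "knows"))
```

In the property graph, the "since" value sits directly on the edge. Expressing the same detail in plain triples would take additional statements about the relationship, which is one way the prescribed triple structure shows up in practice.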

But here's the rub. While property graph models have become the de facto standard for most of the current generation of graph databases, AWS has found that RDF models continue to have an addressable market. They are well-suited for well-defined corpuses of knowledge and, given RDF's origin in describing features on the web, for relating information from different sources.

Just under 18 months in, which models have Neptune users gravitated to? Not surprisingly, most of the implementations have used property graphs. But at the high end, involving larger, more complex data sets, RDF remains the choice. And increasingly, AWS is seeing customers that want interoperability between property graph and RDF. Maybe what's old is new after all.