AWS Neptune going GA: The good, the bad, and the ugly for graph database users and vendors
It's official: AWS has a production-ready graph database. Which features are included today, and which are coming in the near future? What use cases are targeted? And what does AWS Neptune's release mean for users and graph database vendors?
The good: Neptune tries to get the best of two worlds, looks production ready
Even before the announcement, AWS emphasized two key points: Neptune would enable users to go seamlessly from proof of concept to production, and there was great interest from many major clients. Its people were confident about the GA timeline and that AWS clients using Neptune pre-GA would be able to go to production upon GA.
The names included in yesterday's press release do not disappoint: Samsung Electronics, Pearson, Intuit, Siemens, AstraZeneca, FINRA, LifeOmic, Blackfynn, and Amazon Alexa. Their use cases range from fraud detection to medical research, and AWS says that was precisely what drove Neptune's development and is reflected in Neptune's profile.
AWS notes that, in its experience, people coming from a relational background seemed to find the property graph (PG) model easier to work with. AWS has therefore geared PG, exposed through the Gremlin traversal API, toward interactive applications. For example, each user action when clicking through a UI can translate to another step in a graph traversal via Gremlin.
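The click-per-step idea can be illustrated with a minimal sketch in plain Python. This is not Neptune or Gremlin client code: the toy graph, the edge labels, and the Traversal class are all hypothetical, and the class only mimics how chained Gremlin steps such as out() each move the traversal one hop further.

```python
from collections import defaultdict

# Hypothetical toy data: (source vertex, edge label, target vertex).
EDGES = [
    ("alice", "bought", "laptop"),
    ("alice", "bought", "phone"),
    ("laptop", "made_by", "acme"),
    ("phone", "made_by", "globex"),
]


def build_index(edges):
    """Index outgoing edges by (source, label) for quick lookups."""
    index = defaultdict(list)
    for src, label, dst in edges:
        index[(src, label)].append(dst)
    return index


class Traversal:
    """Holds the current set of vertices; each out() call is one step."""

    def __init__(self, index, start):
        self.index = index
        self.frontier = list(start)

    def out(self, label):
        """Follow outgoing edges with the given label from every current vertex."""
        reached = []
        for vertex in self.frontier:
            reached.extend(self.index.get((vertex, label), []))
        return Traversal(self.index, reached)


index = build_index(EDGES)
step0 = Traversal(index, ["alice"])  # user opens Alice's profile
step1 = step0.out("bought")          # user clicks "purchases"
step2 = step1.out("made_by")         # user clicks "manufacturers"
print(step1.frontier)
print(step2.frontier)
```

Each click only extends the traversal by one step from where the user already is, which is why this style of access suits interactive applications.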
RDF and its query language SPARQL on the other hand offer other benefits, according to AWS. Most prominently they lend themselves well to data exchange and integration scenarios. By enabling users to integrate and ingest datasets such as Wikidata or life sciences data, RDF can help them bootstrap their applications and exchange data.
Neptune also has a dual focus when it comes to transactional (OLTP) versus analytical (OLAP) applications. Although it is primarily geared toward OLTP at this point, AWS says it can also support some OLAP.
Neptune's primary goal at this point, however, is high availability and durability: it stores up to 100 billion nodes, edges, or triples, automatically replicates six copies of data across three Availability Zones (AZs), and continuously backs up data to S3. AWS says Neptune is ACID-compliant in both SPARQL and Gremlin, offering repeatable reads that are up to date across AZs within 10 milliseconds.
AWS also says Neptune is designed to offer greater than 99.99 percent availability and automatically detects and recovers from most database failures in less than 30 seconds. Neptune also provides advanced security capabilities, including network security through Amazon Virtual Private Cloud (VPC) and encryption at rest using AWS Key Management Service (KMS).
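To put those figures in perspective, here is a small back-of-the-envelope calculation in Python. It only converts the quoted 99.99 percent design target into a downtime budget; the 365-day year and 30-day month are simplifying assumptions, and the function name is ours, not an AWS definition.

```python
def max_downtime_minutes(availability_percent, period_days):
    """Minutes of downtime allowed in a period at a given availability level."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_days * 24 * 60


per_year = max_downtime_minutes(99.99, 365)
per_month = max_downtime_minutes(99.99, 30)

# How many recoveries of 30 seconds each would fit in the yearly budget.
recoveries_per_year = per_year * 60 / 30

print(f"{per_year:.1f} minutes per year")
print(f"{per_month:.1f} minutes per 30-day month")
print(f"about {recoveries_per_year:.0f} thirty-second recoveries per year")
```

At 99.99 percent, the budget works out to roughly 52.6 minutes a year, or about 105 recoveries of 30 seconds each. Since AWS quotes "greater than" 99.99 percent and "less than" 30 seconds, these are bounds rather than expected values.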
That array of features puts Neptune in the same league as Microsoft CosmosDB in terms of high availability in the cloud. There are differences too, though -- most notably the fact that CosmosDB is multi-model, while Neptune is exclusively a graph database, albeit a dual one. CosmosDB has more APIs besides graph, while Neptune has two different graph APIs.
That should not come as a surprise, as bridging the two models is anything but trivial. AWS wants to pursue a unified view over RDF and PG, but that's quite hard, and we do not expect to see it anytime soon. So, while having two graph databases for the price of one looks attractive, it becomes less attractive if you have to ETL data from one to the other to use them.
And that's not the only occasion on which you'll be awkwardly moving data around at this point. If you want to import or export data, or do RDF inference and advanced graph processing, prepare to do a lot of heavy lifting.
While Neptune has tools for ingesting data in CSV, RDF, and GraphML, these work only for static files. AWS says you can also use DynamoDB streams for dynamic data import, but you will have to write the ingestion code for this yourself. The same goes for exporting data: it is possible via SPARQL and Gremlin, but not very convenient for lack of a dedicated tool.
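As a rough idea of what such hand-written ingestion code might involve, here is a sketch in Python that turns records shaped like DynamoDB stream events into Gremlin query text. Everything about the mapping is an assumption made for illustration: the vertex label "item", the key attribute "pk", the choice to store every attribute as a string property, and the decision to skip MODIFY events. A real pipeline would also submit queries through a Gremlin client rather than build strings.

```python
def escape(value):
    """Escape backslashes and single quotes for a Gremlin string literal."""
    return str(value).replace("\\", "\\\\").replace("'", "\\'")


def plain(attr):
    """Unwrap a DynamoDB typed attribute such as {"S": "x"} or {"N": "3"}."""
    return next(iter(attr.values()))  # kept as text for simplicity


def to_gremlin(record, label="item", key_name="pk"):
    """Map one stream record to a Gremlin query string (hypothetical mapping)."""
    key = escape(plain(record["dynamodb"]["Keys"][key_name]))
    if record["eventName"] == "REMOVE":
        return f"g.V().has('{label}', '{key_name}', '{key}').drop()"
    if record["eventName"] == "INSERT":
        query = f"g.addV('{label}')"
        for name, attr in sorted(record["dynamodb"]["NewImage"].items()):
            query += f".property('{escape(name)}', '{escape(plain(attr))}')"
        return query
    return None  # MODIFY events are left out of this sketch


insert_record = {
    "eventName": "INSERT",
    "dynamodb": {
        "Keys": {"pk": {"S": "u1"}},
        "NewImage": {"pk": {"S": "u1"}, "name": {"S": "Alice"}},
    },
}
remove_record = {"eventName": "REMOVE", "dynamodb": {"Keys": {"pk": {"S": "u1"}}}}

print(to_gremlin(insert_record))
print(to_gremlin(remove_record))
```

Even this toy version has to make modeling decisions (what becomes a vertex, how keys map to properties, how updates are handled), which is the heavy lifting the lack of a ready-made tool leaves to users.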
RDF inference is also missing. Inference is the ability to process rules, typically expressed in RDFS or OWL variants for RDF. These rules can be used to declare schema, including classes, inheritance, types, and restrictions for nodes, edges, and properties, and applying them effectively adds data to the database.
AWS has chosen not to include RDF inference in Neptune, citing its impact on scalability. AWS notes, however, that it is looking into adding RDFS support in the future. Doing so would enable data structure type validation and type subsumption via query rewriting. For the time being, if you want support for those, you will have to use a reasoner engine in addition to Neptune.
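A minimal sketch in plain Python may help show what is at stake. It covers only the rdfs:subClassOf rule over a handful of hypothetical triples, and contrasts the two approaches mentioned above: rewriting a query so it also asks for subclasses, versus materializing the inferred triples as extra data. It is not how Neptune or any particular reasoner works internally.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical example data: a tiny class hierarchy and two typed resources.
TRIPLES = {
    ("ex:Manager", SUBCLASS_OF, "ex:Employee"),
    ("ex:Employee", SUBCLASS_OF, "ex:Person"),
    ("ex:alice", RDF_TYPE, "ex:Manager"),
    ("ex:bob", RDF_TYPE, "ex:Employee"),
}


def subclasses_of(cls, triples):
    """Return cls plus every class that is transitively a subclass of it."""
    found = {cls}
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == SUBCLASS_OF and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def instances_of(cls, triples):
    """Query rewriting: match instances of cls or of any of its subclasses."""
    wanted = subclasses_of(cls, triples)
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o in wanted)


def materialize_types(triples):
    """Forward chaining: add inferred rdf:type triples until nothing changes."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for c, q, d in list(inferred):
                new_triple = (s, RDF_TYPE, d)
                if q == SUBCLASS_OF and c == o and new_triple not in inferred:
                    inferred.add(new_triple)
                    changed = True
    return inferred


print(instances_of("ex:Person", TRIPLES))
print(len(materialize_types(TRIPLES)) - len(TRIPLES), "triples added by inference")
```

Query rewriting leaves the stored data untouched but makes each query do more work, while materialization grows the dataset itself. Either way there is a cost that increases with the size of the graph, which is consistent with the scalability concern AWS cites.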
And if you want to apply advanced analytics to your graph, utilizing solutions such as Spark or GraphX, you will have to find a way to integrate and move that data around yourself, too. Again, AWS says it is looking into ways of adding this, considering client needs.
Finally, Neptune is also lacking when it comes to visualization, which is an important feature for querying and exploring graphs. While Neptune does offer visualization via partnerships, these do not come out of the box and incur additional cost. So if you want to formulate queries or navigate results visually, you will have to turn to one of AWS's partners for this.
The ugly: Standing up to AWS
AWS pledges to continue on a path to making Neptune a force to be reckoned with, and catch up on all those missing features. And it makes sense for it to focus on its core value proposition at this point.
But how do other graph database vendors measure up against Neptune? And what can they do to avoid being 'Amazoned'? That must have been on a lot of people's minds for a while now. A while back, we had a discussion with the CEO and founder of Neo4j, Emil Eifrem. One of the things we talked about was exactly this.*
As Neo4j is the No. 1 graph database in terms of mindshare and adoption, Eifrem confessed to having done a lot of soul searching on this. His conclusions may be of interest not just for the graph database community, but beyond that as well. So, how does one stand up to a mega-cloud vendor entering their domain?
Eifrem has identified five points of differentiation against AWS, and says that vendors that manage to nail all of those have a chance of surviving against the AWSes of the world.
Pervasiveness. Eifrem acknowledges that it's hard to make the transition from an on-premises company to a cloud company. But the flip side of that, he says, is that there is value in telling CIOs "we run on all clouds, and on your laptop, and on your data center."
Ecosystem. There is not a single framework or programming language that you won't find Neo4j on, while the same can't be said for AWS, according to Eifrem.
Data. Eifrem says he believes Neo4j can get and curate datasets in a way that people who dabble with graph will be able to find and use them, and claims this is a big advantage.
Vertical integration. Here, Eifrem refers to Neo4j's graph platform strategy, claiming that Neo4j is moving up the stack and is becoming more than a database. Neo4j is the Oracle and the Tableau of graph databases, and a much richer offering, he says.
Does that strategy make sense, and does Neo4j check those boxes? The answers will be different depending on who you ask.
One thing is certain: The graph database landscape just got a lot more interesting. Going from niche to mainstream means vendors will be battling it out, and users will be benefiting by having more choice and more features.
*NOTE: The conversation with Eifrem was not published at that time. You can, however, find the reference for this five-point strategy here, as well.