It's often easier to understand the use cases for graph databases than to understand how graph databases work. For instance, asking who the most powerful thought leaders across multiple social networks are, with the greatest variety of connections, is a question better suited to a graph database, because running the same query in a relational database would require a ridiculous number of table joins.
And so, with TigerGraph ramping up product R&D out of a new San Diego base and appointing Dr. Jay Yu to head the operation, we had an excuse to look at the company's trajectory, not to mention our wish list for graph databases, which is all about simplification. We had the chance to speak with him and CEO Dr. Yu Xu, and below are our thoughts.
Setting the context
Graph databases are all around us, but more often than not are hiding in plain sight. A good example is the Microsoft Graph, which Microsoft characterizes as "the gateway to data and intelligence in Microsoft 365." More to the point, it can be used to orchestrate the flow of documents, tasks, messages, or other processes throughout Microsoft 365, which of course encompasses Microsoft Office. But to developers, the Microsoft Graph is exposed as an API to write apps against, not as a database. In this case, Microsoft models all the data; developers can just go and play with it.
But increasingly, graph databases are shedding their disguises because the use cases are staring us in the face. They range from tracing cybersecurity threats, managing risk in financial services, combatting money laundering, powering recommendation engines, supporting investigative journalism, and delivering recommendations for healthcare treatments in real time, to building knowledge graphs for space exploration. The common thread is that extracting wisdom involves combing through multiple webs of relationships.
Where to go from here? A few weeks back, George Anadiotis argued in these pages that graph databases make a logical launchpad for AI. We're going to look at it from the bottom up: graph databases need to build a critical mass skills base and become more accessible, both to developers and business analysts.
But the development that kept graph databases from becoming a footnote in db-Engines was the invention of the property graph, which was popularized by the founders of Neo4j. Before that, graphs were an outgrowth of the W3C Semantic Web initiative, requiring fairly rigid RDF triples. With triples, each statement had to carry a subject, predicate, and object. While such models are well suited for well-bounded and well-defined corpuses of information, such as clinical pharmaceutical research, property graphs (or more specifically, "labelled property graphs") that defined the world as nodes (entities) and links (relationships) proved far more flexible and easier to model. In his piece, Anadiotis spoke of RDF*, which might provide that elusive logical link between RDF and property graphs.
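To make the modeling difference concrete, here is a minimal sketch in Python that stores the same small fact both ways. The data, names, and helper functions are all hypothetical and not drawn from any product. It shows why attaching a detail to a relationship (here, when one person started following another) takes extra bookkeeping with plain triples, where one common workaround is to add a separate statement node, while a property graph simply puts the detail on the edge.

```python
# Hypothetical toy data: the same facts modeled two ways.

# RDF style: every fact is a (subject, predicate, object) triple.
triples = [
    ("alice", "type", "Person"),
    ("bob", "type", "Person"),
    ("alice", "follows", "bob"),
    # To say *when* alice started following bob, plain triples need an
    # extra node ("stmt1") that describes the statement itself.
    ("stmt1", "subject", "alice"),
    ("stmt1", "predicate", "follows"),
    ("stmt1", "object", "bob"),
    ("stmt1", "since", "2019"),
]

# Property graph style: nodes and edges carry a label plus key/value properties.
nodes = {
    "alice": {"label": "Person", "name": "Alice"},
    "bob": {"label": "Person", "name": "Bob"},
}
edges = [
    {"src": "alice", "dst": "bob", "label": "FOLLOWS", "since": 2019},
]


def since_from_triples(triples, who, whom):
    """Find the 'since' value by locating the statement node for the fact."""
    facts = {}
    for subject, predicate, obj in triples:
        facts.setdefault(subject, {})[predicate] = obj
    for props in facts.values():
        key = (props.get("subject"), props.get("predicate"), props.get("object"))
        if key == (who, "follows", whom):
            return props.get("since")
    return None


def since_from_edges(edges, who, whom):
    """Find the 'since' value by reading the property straight off the edge."""
    for edge in edges:
        if (edge["src"], edge["dst"], edge["label"]) == (who, whom, "FOLLOWS"):
            return edge["since"]
    return None


print(since_from_triples(triples, "alice", "bob"))
print(since_from_edges(edges, "alice", "bob"))
```

Both lookups find the same answer, but the triple version has to reassemble the relationship from four separate statements. Proposals such as RDF* aim to narrow that gap by allowing statements to be made about a triple directly.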
The next hurdle is query language. Until recently, each graph database provider carried its own unique language, meaning there was no common target for building a critical mass skills base. A few popular dialects emerged, starting with Gremlin, a procedural language that came out of the Apache TinkerPop project and provided a syntax for navigating around a graph database. Some providers are forming their own alliances, such as Neo4j and AWS around openCypher, the open source implementation of Neo4j's Cypher query language.
But there are some signs of emerging sanity, as players including Neo4j, TigerGraph, Oracle and others are collaborating on GQL. Now an official ISO project, GQL is being designed as a declarative language that would fuse elements of Neo4j's Cypher, Oracle's PGQL, and GCore, a reference implementation. At the end of the day, we don't expect Neo4j to drop Cypher, nor would TigerGraph drop its more SQL-like GSQL. We anticipate that GQL will be a reference implementation against which the vendor languages would add cross-compatibility, and therefore nudge closer to the goal of having a common skills target.
Go where developers and business analysts live
Cut to the chase: making graph more accessible to developers and business analysts was the topic we discussed with Drs. Yu and Xu of TigerGraph. TigerGraph, which has drawn $171 million in venture financing, is not as well-known as its primary rival Neo4j. But TigerGraph has differentiated itself with support for a distributed database architecture that employs a number of tricks, such as data compression, automatic partitioning, pre-compilation of queries to streamline traversals, and aggregation of interim results. To perform a similar task, other graph databases would require parceling out data and processing to multiple separate physical instances, with the results then merged. In a recent press release, the company cited an Uber-like customer in Southeast Asia that swapped in TigerGraph after Neo4j failed to scale.
The company has made some moves to make data more accessible. For instance, it offers a drag-and-drop tool for developing queries and has connectors to Power BI and Tableau, which are ubiquitous in the BI visualization world. That's pretty much table stakes, as most of the popular graph platforms have BI connectors. Of course, if your team has skills with relational or NoSQL databases, many of those platforms offer graph materialized views that allow you to run fairly simple graph queries that could handle, at most, 2-3 traversals (sets of relationships). But for queries that are more complex, such as discovering patterns of fraud or identity theft, a database that represents data natively in a graph schema will be required.
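Why traversal depth matters so much can be sketched in a few lines of Python. This is a deliberately simplified illustration with hypothetical data and function names, not a description of how TigerGraph or any relational engine works internally, and real databases use indexes and query optimizers that this toy ignores. The first function treats relationships as rows in a table, so every additional hop means another pass over the whole table, much like another self-join. The second keeps each node's neighbors alongside the node, so every hop only touches the nodes actually reached.

```python
# Hypothetical edge table: (from_account, to_account) transfers.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "f")]


def k_hop_scan(edges, start, k):
    """Table style: each hop re-scans every row, like one more self-join."""
    frontier = {start}
    for _ in range(k):
        frontier = {dst for (src, dst) in edges if src in frontier}
    return frontier


def build_adjacency(edges):
    """Graph style: store each node's outgoing neighbors with the node."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)
    return adjacency


def k_hop_adjacency(adjacency, start, k):
    """Each hop follows only the neighbor lists of the nodes reached so far."""
    frontier = {start}
    for _ in range(k):
        frontier = {n for node in frontier for n in adjacency.get(node, [])}
    return frontier


adjacency = build_adjacency(edges)
print(k_hop_scan(edges, "a", 3))
print(k_hop_adjacency(adjacency, "a", 3))
```

Both functions return the same set of accounts three hops out. The difference is in the work done per hop, which is modest on five rows but grows quickly when the edge table holds billions of them and the question spans many hops.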
With the company's San Diego R&D center and lead executive in place, we asked what's next on their agenda. At the top of the list is adding support for new languages and APIs, going where developers (and not necessarily graph database developers) live. They are adding support for GraphQL, a combined API and query language that is far more efficient than REST when it comes to simple data retrievals. Although it has "graph" in its name, GraphQL has not been associated with graph databases until now. Instead, the graph in GraphQL refers to the underlying knowledge graph that maps the data source, and therefore provides shortcuts to get to the right data without all the chatter of a RESTful call.
Not surprisingly, GraphQL has proven popular with mobile apps, and has carved footholds with NoSQL databases like MongoDB or Apache Cassandra. GraphQL's popularity has gone viral among developers to the point where a new company, Hasura, has built a cloud service around it for querying PostgreSQL, as we reviewed about a year ago. And it is that growing degree of familiarity among mobile developers that TigerGraph is seeking to tap into to spread its footprint.
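The "chatter" point can be illustrated with a toy Python simulation. Nothing here talks to a real server or parses real GraphQL; the data, classes, and the nested-dictionary "selection" are hypothetical stand-ins chosen only to count round trips. In the REST-style client, fetching a user's friends' names costs one call for the user plus one call per friend. In the GraphQL-style client, the caller describes the nested shape it wants once and the server resolves all of it in a single request.

```python
# Hypothetical in-memory data standing in for a server-side data source.
USERS = {
    1: {"id": 1, "name": "Ada", "friend_ids": [2, 3]},
    2: {"id": 2, "name": "Grace", "friend_ids": [1]},
    3: {"id": 3, "name": "Linus", "friend_ids": [1]},
}


class RestStyleClient:
    """One resource per call: every lookup is a separate round trip."""

    def __init__(self):
        self.round_trips = 0

    def get_user(self, user_id):
        self.round_trips += 1
        return USERS[user_id]


def friend_names_rest(client, user_id):
    user = client.get_user(user_id)
    return [client.get_user(fid)["name"] for fid in user["friend_ids"]]


def resolve(user, selection):
    """Server side: walk the requested shape and fill in only those fields."""
    result = {}
    for field, sub_selection in selection.items():
        if field == "friends":
            result["friends"] = [
                resolve(USERS[fid], sub_selection) for fid in user["friend_ids"]
            ]
        else:
            result[field] = user[field]
    return result


class GraphStyleClient:
    """One request carries the whole nested selection."""

    def __init__(self):
        self.round_trips = 0

    def query(self, user_id, selection):
        self.round_trips += 1
        return resolve(USERS[user_id], selection)


rest = RestStyleClient()
rest_names = friend_names_rest(rest, 1)

graph = GraphStyleClient()
graph_result = graph.query(1, {"name": True, "friends": {"name": True}})

print(rest_names, rest.round_trips)
print(graph_result, graph.round_trips)
```

In this toy, the REST-style path makes three round trips and the GraphQL-style path makes one, while returning only the fields that were asked for. On a mobile connection, that difference in round trips is a large part of the appeal.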
Another piece of the puzzle for meeting developers on their own turf is rounding out connections to all the key compute engines and data stores, such as Apache Spark and Cassandra. Our take is that this could lead to adding data virtualization, where TigerGraph could access data in sources like Cassandra, MySQL, PostgreSQL, or others in the db-Engines top ten listing, treating them as extended nodes and edges. When it comes to projecting graph views on non-graph data, why should Cassandra or MongoDB hog all the fun?
We have a few more items on our wish list. For starters, we'd like to see modeling tools that help developers who are new to graph understand how to structure the webs of relationships that are key to a graph database schema.
But let's not forget business users. Providing tie-ins to BI tools is an obvious first step, but churning out visualizations based on relational views won't do justice to explaining the nuances of different levels of relationships that provide the real answers to their questions. Yet, we shouldn't require business end users to form queries based on the web of connections between different vertices.
The going notion is that, while business users might not have the faintest idea of what a graph database is or how to query it, the questions they ask are quite straightforward: Who are the most important influencers? Who are the common patients across multiple referral networks? How to rapidly distinguish safe from malicious websites to guard cybersecurity? How to prevent viewer churn by analyzing their patterns of consuming streaming entertainment? Or, how to identify patterns of money laundering by analyzing connections across different investigations? This is a bit of an ambitious step, but when we are already seeing natural language query popping up in analytic tools, it's not a conceptual leap to apply this to the connected data inside graph databases.