Knowledge graphs are hyped. We can officially say this now, since Gartner included knowledge graphs in the 2018 hype cycle for emerging technologies. Though we did not have to wait for Gartner -- declaring this as the "Year of the Graph" was our opener for 2018. Like anyone active in the field, we see the opportunity, as well as the threat in this: With hype comes confusion.
Knowledge graphs are real. They have been for the last 20 years at least. Knowledge graphs, in their original definition and incarnation, have been about knowledge representation and reasoning. Things such as controlled vocabularies, taxonomies, schemas, and ontologies have all been part of this, built on a Semantic Web foundation of standards and practices.
So, what's changed? How come the likes of Airbnb, Amazon, Google, LinkedIn, Uber, and Zalando sport knowledge graphs in their core business? How come Amazon and Microsoft joined the crowd of graph database vendors with their latest products? And how can you make this work?
Knowledge graphs before they were cool
Knowledge graphs sound cool and all. But what are they, exactly? It may sound like a naive question, but actually getting definitions right is how you build a knowledge graph. From taxonomies to ontologies -- essentially, schemas and rules of varying complexity -- that's how people have been doing it for years.
RDF, the standard used to encode these schemas, has a graph structure. So, calling knowledge encoded on top of a graph structure a "knowledge graph" sounds natural. And the people doing this, the data modelers, have been called knowledge engineers, or ontologists.
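To make the "knowledge encoded on top of a graph structure" idea concrete, here is a minimal sketch of the RDF data model in plain Python: a graph is just a set of (subject, predicate, object) triples, and querying it is pattern matching over those triples. The namespace and the facts below are made up for illustration; they are not from any real vocabulary.

```python
# A minimal sketch of the RDF data model: a graph is a set of
# (subject, predicate, object) triples. All names here are made up.
EX = "http://example.org/"

triples = {
    (EX + "Amsterdam", EX + "type", EX + "City"),
    (EX + "Amsterdam", EX + "locatedIn", EX + "Netherlands"),
    (EX + "Netherlands", EX + "type", EX + "Country"),
}

def match(s=None, p=None, o=None):
    """Return all triples matching a pattern; None acts as a wildcard.
    This is, in miniature, what SPARQL queries do over RDF graphs."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# "What type is Amsterdam?"
print(match(s=EX + "Amsterdam", p=EX + "type"))
```

Real RDF stores add schemas, reasoning, and a standard query language (SPARQL) on top, but the underlying structure is exactly this kind of triple set.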
There can be many applications for these knowledge graphs -- from cataloguing items, to data integration and publishing on the web, to complex reasoning. For some of the most prominent ones, you can look at schema.org, Airbnb, Amazon, Diffbot, Google, LinkedIn, Uber, and Zalando. This is why people seasoned in knowledge graphs sneer at the hype.
Like any data modeling, this is hard and complicated work. It must take into account many stakeholders and views of the world, manage provenance and schema drift, and so on. Add to the mix reasoning, and web scale, and things easily get out of hand, which may explain why up until recently, this approach was not the most popular in the real world.
Going schema-less, on the other hand, has been and still is popular. Going schema-less can get you started quickly; it's simpler and more flexible, at least up to a certain point. The simplicity of not using a schema can be deceiving though. Because, in the end, whatever your domain, a schema will exist. Schema-on-read? Fine. But no schema at all?
You may not know your schema well enough a priori. It may be complex, and it may evolve. But it will exist. So, ignoring or downplaying schema does not solve any problem, it only makes things worse. Issues will lurk, and cost you time and money, as they will hamper developers and analysts trying to build applications and derive insights from a fuzzy blob of data.
The point then is not to throw schema away, but to make it functional, flexible, and interchangeable. RDF is pretty good at this, as it also underlies standardized formats for data exchange, such as JSON-LD. RDF can also be used for lightweight schema and schema-less approaches, and data integration, by the way.
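As a sketch of what an RDF-backed exchange format looks like in practice, here is a small hand-written JSON-LD document. The schema.org terms are real; the person and organization described are invented for the example.

```python
import json

# A minimal JSON-LD document: the @context maps plain JSON keys to
# RDF terms (here, real schema.org properties), so the same document
# is both ordinary JSON and a serialized RDF graph.
# The entities described are made up.
doc = {
    "@context": {
        "name": "http://schema.org/name",
        "worksFor": {"@id": "http://schema.org/worksFor", "@type": "@id"},
    },
    "@id": "http://example.org/alice",
    "@type": "http://schema.org/Person",
    "name": "Alice",
    "worksFor": "http://example.org/acme",
}

print(json.dumps(doc, indent=2))
```

A consumer that knows nothing about RDF can read this as plain JSON, while an RDF-aware consumer can expand it into triples via the @context -- which is what makes JSON-LD a useful bridge between schema-less and schema-full worlds.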
Getting knowledge into or out of graphs
So, what's with the hype? How can a 20-year-old technology be on the emerging slope of the infamous hype cycle? Hype is real, too, as is the reason for this. It's the same story as the meteoric rise of the AI hype: It's not so much that things have changed in the approach, it's more that the data and compute power are there now to make it work at scale.
Plus, the AI itself helps. Or, to be more precise, the kind of bottom-up, machine learning-based AI that gets the hype these days. Knowledge graphs essentially are AI, too. Just another kind. Not the kind that has been hyped up to now, but the symbolic, top-down, rule-based kind -- the hitherto unpopular kind.
It's not that this approach does not have its limitations. It's hard to encode knowledge about complex domains in a functional way, and to reason about it at scale. So, the machine learning way of doing things, just like the schema-less way, got popular. And for good reasons, too.
With the big data explosion, and the rise of NoSQL, something else started happening, too. Tools and databases for non-RDF graphs appeared in the market, and started finding success. These graphs, of the labeled property graph (LPG) kind, are simpler and less verbose. They either lack schema, or have basic schema capabilities compared to RDF.
And they typically perform better for operational applications, graph algorithms, or graph analytics. Lately, graphs are starting to be used for machine learning, too. These are all very useful things.
Algorithms, analytics and machine learning can provide insights about graphs, with some common use cases being fraud detection or recommendations. You could therefore say that such techniques and applications get knowledge out of graphs, bottom-up. RDF graphs on the other hand get knowledge into graphs, top-down.
So, are bottom-up graphs knowledge graphs, too?
As a knowledge engineer would say, it's a matter of semantics. It's tempting to ride the knowledge graph hype. But in the end, lack of clarity might prove of little service. Graph algorithms, graph analytics, and graph-based machine learning and insights are all good, accurate terms. And they are not mutually exclusive with "traditional" knowledge graphs either.
All the prominent use cases we mentioned earlier are based on a combination of approaches. Having a knowledge graph and populating it using machine learning for example has helped build the biggest knowledge graph ever -- at least in terms of instances, if not entities. And it's what AI pioneers like DeepMind are researching, as well.
Some things old, some things new, and some things borrowed for graph databases
As usual, the choice of approach and tool to use for your graph depends on your use case. This also applies to graph databases, which we have been closely monitoring as they evolve, with new vendors and capabilities being added rapidly.
Last week at Strata, both the winner and the runner-up for the Most Disruptive Startup award were graph databases: TigerGraph and Memgraph. In case you needed more proof of how rapidly progress is being made in the field, there you have it. Both startups are no more than a couple of years old, by the way.
For TigerGraph, which came out of stealth in September 2017, this has been a very active year. Today, TigerGraph is announcing a new release. And it's got some things old, some things new, and some things borrowed -- though we could not really spot anything blue.
The new things are a handful, and they all address existing pain points for TigerGraph. TigerGraph has added integration with popular databases and data storage systems, including RDBMS, Kafka, Amazon S3, HDFS, and Spark (coming soon). TigerGraph said a GitHub repository will host open source connectors to TigerGraph as they roll out.
Of course, a GitHub repository is not worth much without a community. TigerGraph is working on that, and has announced a new developer portal and eBook. The release also brings more deployment options, adding support for Microsoft Azure to the existing Amazon AWS support. Keeping up with the containerization trend, support for Docker and Kubernetes has been added, too.
We mentioned graph algorithms previously, and this is perhaps the most interesting aspect of the release, combined with query language. TigerGraph has added support for graph algorithms such as PageRank, Shortest Path, Connected Components, and Community Detection. The interesting part is that these are supported via GSQL, TigerGraph's own query language.
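To illustrate the kind of algorithm being discussed, here is a toy PageRank over a small directed graph, in pure Python. This is a sketch of the general technique, not TigerGraph's GSQL implementation; the graph data is made up.

```python
# A toy PageRank: iteratively redistribute each node's rank along its
# outgoing edges, with a damping factor for random teleportation.
# Illustrative only -- real graph databases run this at far larger scale.
def pagerank(edges, damping=0.85, iterations=50):
    nodes = {n for e in edges for n in e}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out = {n: [d for s, d in edges if s == n] for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for d in out[n]:
                    new[d] += share
            else:  # dangling node: spread its rank evenly
                for d in nodes:
                    new[d] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Made-up link graph: "c" is pointed at by both "b" and "d",
# so it should end up with the highest rank.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
print(ranks)
```

The same pattern -- iterate, propagate values along edges, converge -- underlies many of the graph algorithms mentioned above, which is why pushing them into the database and its query language, rather than exporting data to run them elsewhere, matters for performance.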
We have referred to the importance of query languages for graph databases. Recently, Neo4j, the leading graph database vendor in terms of mindshare according to DB-Engines, has put forward a proposal to create a standard query language for LPG graph databases. No such standard exists in the LPG world, as opposed to the RDF world, which comes with SPARQL.
TigerGraph initially responded to Neo4j's call. Now, however, things are changing. TigerGraph just announced a Neo4j Migration Toolkit, which is largely based on translating Cypher, Neo4j's query language, to GSQL. This is a point we discussed at length with TigerGraph.
It makes sense for TigerGraph to do this, as having to migrate an existing body of Cypher queries would be a roadblock. The interesting part is how TigerGraph has chosen to implement this: as a one-off, batch translation process, rather than an interactive one.
This is a strategic choice. TigerGraph wants people to switch to GSQL, rather than work with Cypher on top of TigerGraph. Developers have traditionally been averse to learning new query languages. TigerGraph had some stories to share on how well this is working for them, but how this will play out is anyone's guess.
The old part in TigerGraph's announcement is benchmarks. These benchmarks are actually new, but TigerGraph has been into benchmarks since it came out of stealth. For a solution that claims to be faster than anything else due to its MPP architecture, this also makes sense. The benchmark compares TigerGraph to Neo4j, Amazon Neptune, JanusGraph and ArangoDB, and unsurprisingly finds it to be faster than all of those.
The borrowed part? Why, knowledge graphs of course. TigerGraph's people also confirmed the great interest clients are showing in this, citing for example knowledge graph events in China attracting more than 1,000 people. What knowledge graphs? Well, now you know.
Disclosure: I have business relationships with a number of organizations active in the field. TigerGraph and Memgraph are sponsoring the Connected Data London event that I am co-organizing.
Previous and related coverage:
Imagine you could get the entire web in a database, and structure it. Then you would be able to get answers to complex questions in seconds by querying, rather than searching. This is what Diffbot promises.
Three-point shooting, Steph Curry, and coming up with stories. If you feel like doing your own analysis to investigate hypotheses or discover insights at any level, RDF graph's got your back. Case in point: The NBA.