Knowledge graphs beyond the hype: Getting knowledge in and out of graphs and databases
What exactly are knowledge graphs, and what's with all the hype about them? Telling hype from reality, defining the different types of graphs, and picking the right tools and database for your use case are essential if you want to be like the Airbnbs, Amazons, Googles, and LinkedIns of the world.
Knowledge graphs are real. They have been for the last 20 years at least. Knowledge graphs, in their original definition and incarnation, have been about knowledge representation and reasoning. Things such as controlled vocabularies, taxonomies, schemas, and ontologies have all been part of this, built on a Semantic Web foundation of standards and practices.
So, what's changed? How come the likes of Airbnb, Amazon, Google, LinkedIn, Uber, and Zalando sport knowledge graphs in their core business? How come Amazon and Microsoft joined the crowd of graph database vendors with their latest products? And how can you make this work?
Knowledge graphs before they were cool
Knowledge graphs sound cool and all. But what are they, exactly? It may sound like a naive question, but actually getting definitions right is how you build a knowledge graph. From taxonomies to ontologies -- essentially, schemas and rules of varying complexity -- that's how people have been doing it for years.
RDF, the standard used to encode these schemas, has a graph structure. So, calling knowledge encoded on top of a graph structure a "knowledge graph" sounds natural. And the people doing this, the data modelers, have been called knowledge engineers, or ontologists.
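To make the idea concrete, here is a minimal, pure-Python sketch: knowledge encoded as subject-predicate-object triples, plus one RDFS-style inference rule that derives new facts from a subclass hierarchy. The `ex:` names are hypothetical examples; a real system would use an RDF library and a full rule engine rather than this toy fixpoint loop.

```python
# Hypothetical mini knowledge graph: a set of (subject, predicate, object)
# triples, the shape RDF data takes.
triples = {
    ("ex:Dog", "rdfs:subClassOf", "ex:Animal"),
    ("ex:rex", "rdf:type", "ex:Dog"),
}

def infer_types(triples):
    """Apply one RDFS-style entailment rule until fixpoint:
    if X has type C and C is a subclass of D, then X also has type D."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in inferred:
            if p == "rdf:type":
                for s2, p2, o2 in inferred:
                    if p2 == "rdfs:subClassOf" and s2 == o:
                        t = (s, "rdf:type", o2)
                        if t not in inferred:
                            new.add(t)
        if new:
            inferred |= new
            changed = True
    return inferred

result = infer_types(triples)
# The graph now also entails that ex:rex is an ex:Animal.
```

This is the top-down flavor of knowledge graphs in miniature: the schema (the subclass axiom) lets the machine derive facts nobody stated explicitly.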
There can be many applications for these knowledge graphs -- from cataloguing items, to data integration and publishing on the web, to complex reasoning. For some of the most prominent ones, you can look at schema.org, Airbnb, Amazon, Diffbot, Google, LinkedIn, Uber, and Zalando. This is why people seasoned in knowledge graphs sneer at the hype.
Like any data modeling, this is hard and complicated work. It must take into account many stakeholders and views of the world, manage provenance and schema drift, and so on. Add to the mix reasoning, and web scale, and things easily get out of hand, which may explain why up until recently, this approach was not the most popular in the real world.
Going schema-less, on the other hand, has been and still is popular. Going schema-less can get you started quickly; it's simpler and more flexible, at least up to a certain point. The simplicity of not using a schema can be deceiving though. Because, in the end, whatever your domain, a schema will exist. Schema-on-read? Fine. But no schema at all?
So, what's with the hype? How can a 20-year-old technology be on the emerging slope of the infamous hype cycle? The hype is real, too, and so is the reason for it. It's the same story as the meteoric rise of the AI hype: It's not so much that the approach has changed; it's more that the data and compute power are now there to make it work at scale.
Plus, the AI itself helps. Or, to be more precise, the kind of bottom-up, machine learning-based AI that gets the hype these days. Knowledge graphs essentially are AI, too -- just another kind. Not the currently hyped kind, but the symbolic, top-down, rule-based kind. The hitherto unpopular kind.
This is not to say the approach is without limitations. It's hard to encode knowledge about complex domains in a functional way, and to reason over it at scale. So the machine learning way of doing things, just like the schema-less way, got popular. And for good reasons, too.
With the big data explosion, and the rise of NoSQL, something else started happening, too. Tools and databases for non-RDF graphs appeared in the market, and started finding success. These graphs, of the labeled property graph (LPG) kind, are simpler and less verbose. They either lack a schema, or have basic schema capabilities compared to RDF.
Algorithms, analytics and machine learning can provide insights about graphs, with some common use cases being fraud detection or recommendations. You could therefore say that such techniques and applications get knowledge out of graphs, bottom-up. RDF graphs on the other hand get knowledge into graphs, top-down.
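As a rough illustration of the bottom-up direction, here is a pure-Python sketch of one such insight: finding connected components in a hypothetical graph of accounts linked by shared attributes (say, the same device or card), the kind of grouping a fraud-detection pipeline might start from. The account names and edges are made up; in practice this would run inside the graph database over real data.

```python
from collections import deque

# Hypothetical undirected graph of accounts linked by shared attributes.
edges = [("acct1", "acct2"), ("acct2", "acct3"), ("acct4", "acct5")]

def connected_components(edges):
    """Group nodes into connected components using breadth-first search."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, components = set(), []
    for node in adj:
        if node in seen:
            continue
        comp, queue = set(), deque([node])
        while queue:
            n = queue.popleft()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            queue.extend(adj[n] - seen)
        components.append(comp)
    return components

components = connected_components(edges)
# Two clusters emerge: {acct1, acct2, acct3} and {acct4, acct5}.
```

A cluster of seemingly unrelated accounts all sharing attributes is exactly the kind of knowledge that comes *out* of the graph, rather than being modeled into it.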
As a knowledge engineer would say, it's a matter of semantics. It's tempting to ride the knowledge graph hype. But in the end, lack of clarity might prove of little service. Graph algorithms, graph analytics, and graph-based machine learning and insights are all good, accurate terms. And they are not mutually exclusive with "traditional" knowledge graphs either.
Some things old, some things new, and some things borrowed for graph databases
As usual, the choice of approach and tool to use for your graph depends on your use case. This also applies to graph databases, which we have been closely monitoring as they evolve, with new vendors and capabilities being added rapidly.
The new things are a handful, and they all address existing pain points for TigerGraph users. TigerGraph has added integration with popular databases and data storage systems, including RDBMS, Kafka, Amazon S3, HDFS, and Spark (coming soon). TigerGraph said a GitHub repository will host open-source connectors to TigerGraph as they roll out.
Of course, a GitHub repository is not worth much without a community. TigerGraph is working on that, and has announced a new developer portal and eBook. The release also brings more deployment options, adding support for Microsoft Azure to the existing Amazon AWS one. Keeping up with the containerization trend, support for Docker and Kubernetes has been added, too.
We mentioned graph algorithms previously, and this is perhaps the most interesting aspect of the release, combined with query language. TigerGraph has added support for graph algorithms such as PageRank, Shortest Path, Connected Components, and Community Detection. The interesting part is that these are supported via GSQL, TigerGraph's own query language.
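For a flavor of what one of these algorithms computes (independent of GSQL, whose syntax is not shown here), below is a plain-Python sketch of iterative PageRank over a small, made-up link graph. Production systems run this natively in the database over far larger graphs.

```python
def pagerank(adj, damping=0.85, iterations=50):
    """Iterative PageRank over an adjacency dict {node: [out-neighbors]}.
    Each node's score is a teleport share plus rank flowing in from its
    in-neighbors; dangling nodes spread their rank evenly."""
    nodes = set(adj) | {n for outs in adj.values() for n in outs}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new = {v: (1 - damping) / n for v in nodes}
        for v in nodes:
            outs = adj.get(v, [])
            if not outs:
                for u in nodes:  # dangling node
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Hypothetical link graph: "c" is linked to by both "a" and "b",
# so it ends up with the highest rank.
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

The point of baking such algorithms into the query language is that scoring and traversal can be combined in a single query, instead of exporting the graph to an external analytics tool.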
TigerGraph initially responded to Neo4j's call. Now, however, things are changing. TigerGraph just announced a Neo4j Migration Toolkit, which is largely based on translating Cypher, Neo4j's query language, to GSQL. This is a point we discussed at length with TigerGraph.
It makes sense for TigerGraph to do this, as having to migrate an existing body of Cypher queries by hand would be a roadblock. The interesting part is how TigerGraph has chosen to implement this: as a one-off, batch translation process, rather than an interactive one.
The old part in TigerGraph's announcement is benchmarks. These benchmarks are actually new, but TigerGraph has been into benchmarks since it came out of stealth. For a solution that claims to be faster than anything else due to its MPP architecture, this also makes sense. The benchmark compares TigerGraph to Neo4j, Amazon Neptune, JanusGraph and ArangoDB, and unsurprisingly finds it to be faster than all of those.
The borrowed part? Why, knowledge graphs, of course. TigerGraph's people also confirmed the great interest clients are showing in this, citing, for example, knowledge graph events in China attracting more than 1,000 people. What knowledge graphs? Well, now you know.
Disclosure: I have business relationships with a number of organizations active in the field. TigerGraph and Memgraph are sponsoring the Connected Data London event that I am co-organizing.