The web as a database: The biggest knowledge graph ever

Imagine you could get the entire web in a database, and structure it. Then you would be able to get answers to complex questions in seconds by querying, rather than searching. This is what Diffbot promises.

The web is among humankind's greatest achievements and resources. Ever-expanding and nearly all-encompassing, we've all come to depend on it. There's just one problem: It takes work to get information out of it.

That's because the information lives in documents scattered all over the web, and someone needs to locate and read them to extract it. Search engines have come a long way, and they greatly assist in the locating part, but not so much in the extracting part. At least, not until today.

Google and its ilk may sometimes give the impression they can understand and answer questions. Part of the reason is the addition of human knowledge to the mix. Google famously went from using purely text-based and statistical methods to adding a form of curation when it bought Metaweb. Metaweb developed Freebase, a crowd-sourced knowledge graph similar in approach to Wikipedia, which Google integrated into its search engine.

Eat your heart out, Google

That enables Google to do some of its magic. If you Google "Google," for example, you don't just get a bunch of links. You also get an info-box that lists facts such as Google's CEO, founders, and address. That's because there is an entry in Google's knowledge graph that lists Google as a company, and these are some of the properties companies have, so Google fetches and displays that information from Wikipedia.

But if you try Googling "how many employees does Google have," or "what is Google's address", what you will get is a bunch of links. You are on your own -- you have to read the documents and figure out the answer. If that information were in a database, you would type something like "SELECT Address FROM Organizations WHERE Name = 'Google'" and you'd have your answer in seconds. That is the difference between structured and unstructured information.
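
To make the contrast concrete, here is a minimal sketch of that kind of query using Python's built-in sqlite3 module. The Organizations table, its columns, and the single sample row are assumptions made up for illustration; they say nothing about how Google or Diffbot actually store their data.

```python
import sqlite3

# Hypothetical in-memory table with one illustrative sample row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Organizations (Name TEXT, Address TEXT)")
conn.execute(
    "INSERT INTO Organizations VALUES (?, ?)",
    ("Google", "1600 Amphitheatre Parkway, Mountain View, CA"),
)


def address_of(name):
    """Return the stored address for an organization, or None if unknown."""
    row = conn.execute(
        "SELECT Address FROM Organizations WHERE Name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


print(address_of("Google"))
```

The answer comes back as a single value rather than a list of documents to read, which is the whole point of structuring the information first.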

That is also what Diffbot is unveiling today: The ability to query the web as a database. This impressive feat is also based on a knowledge graph. The difference is that, in Diffbot's case, the knowledge graph is only partially curated by humans, and is automatically populated by crawling the web. ZDNet talked to Mike Tung, Diffbot's CEO and Founder, to find out how Diffbot does this.


Diffbot ingests and parses the entire web into a knowledge graph - a database you can query. Image: Diffbot

First off, you have to crawl the web. This is where Gigablast and Matt Wells come in. Gigablast is a search engine created by Matt Wells, Diffbot's VP of Search, in 2000. Tung says this is what Diffbot uses to crawl, and store, every single document on the web. Hard as this may be, however, it's not even half the job.

The really hard part is getting the information out of documents, and this is where the magic is. Tung explains this is done using computer vision, machine learning (ML), and natural language processing (NLP).

Computer vision helps Diffbot understand the structure of documents. It mimics the way humans break down documents, figuring out what the structural elements of each document are -- things such as headers, blocks, etc. In a perfect world, this should be possible by inspecting the HTML structure of web documents. But not everything on the web is HTML, and HTML documents are not perfect either.
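
A small sketch can show why markup alone falls short. The snippet below uses Python's standard html.parser module to collect heading tags, which is the "perfect world" approach; the sample page and the HeadingCollector class are hypothetical, and this is not how Diffbot's computer vision works.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of h1-h3 elements, relying on markup only."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # heading tag currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.headings.append((tag, "".join(self._buffer).strip()))
            self._current = None


def extract_headings(html):
    parser = HeadingCollector()
    parser.feed(html)
    parser.close()
    return parser.headings


# Made-up page: the styled <div> looks like a heading to a human reader,
# but a markup-only parser has no way to know that.
page = "<h1>Diffbot</h1><p>Some text.</p><div class='title'>Looks like a heading</div>"
print(extract_headings(page))
```

Only the real h1 is found. The div that a reader would take for a heading is missed, which is the gap a visual approach is meant to close.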

After structure comes content. Content is parsed using a combination of NLP and ML, and the resulting structured knowledge is added to Diffbot's knowledge graph (DKG). Tung showcased an example based on Marissa Mayer, ex-CEO of Yahoo.

Taking a brief text about Mayer as input, Diffbot's system processed it and was able to extract all kinds of facts described in the text: Mayer's gender, employment history, education, etc. By doing this, Diffbot adds an entry for Mayer in its knowledge graph, and populates it with properties such as gender, age, and the like.
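
One way to picture the outcome is as a set of subject, predicate, object triples grouped under an entity. The sketch below is only an illustration of that data shape: the triples are hand-written rather than extracted by any NLP system, and the predicate names and the build_graph helper are assumptions, not Diffbot's schema.

```python
from collections import defaultdict

# Hand-written illustrative triples (subject, predicate, object).
triples = [
    ("Marissa Mayer", "gender", "female"),
    ("Marissa Mayer", "employer", "Yahoo"),
    ("Marissa Mayer", "employer", "Google"),
    ("Marissa Mayer", "educated_at", "Stanford University"),
]


def build_graph(triples):
    """Group triples into {subject: {predicate: [objects]}}."""
    graph = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in triples:
        graph[subject][predicate].append(obj)
    return graph


graph = build_graph(triples)
print(graph["Marissa Mayer"]["employer"])
```

Each new fact about Mayer becomes one more triple, so the entry grows as more text is processed.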

"Contrary to popular perception, Google's knowledge graph is not derived primarily from automation," says Tung. "Unlike Google, the goal of our processing is not to rank pages for humans to read (and inject some advertising along the way), but rather to avoid human reading altogether.

"DKG is the first web-scale knowledge graph that is entirely synthesized by an automated AI system, without a human-in-the-loop. That is why the main constraint to growth is the number of machines that we dedicate to it acquiring knowledge," he adds, concluding that DKG currently contains something in the area of a trillion facts.

From a web of documents to a web of data

This is not entirely new. The first to put forward the vision of going from a web of documents to a web of data was none other than the web's inventor, Tim Berners-Lee, who published his Semantic Web manifesto in 2001.

As Tung notes, however, "a long line of history (ranging from RDF/microformats/RSS/semantic markup) has shown that requiring human annotation is never going to scale in terms of economic incentive and accuracy to all of knowledge."

Even though annotation does not necessarily have to be human (it can come from automation as well), Tung does have a point: Most content on the web is very poorly, if at all, annotated. Tung thinks that building this global knowledge graph using the current state of AI is the right approach -- and it seems to be working.

The applications are wide and far-reaching. Tung notes that "enterprise functions such as sales, recruiting, supply chain, accounting, business intelligence and market intelligence all work off of databases that can be kept updated and accurate by integrating directly with the knowledge graph."


Diffbot natural language processing in action. Note how facts extracted from text are represented as subject -- predicate -- object triples. (Image: Diffbot)

Tung demonstrated such a scenario, using DKG to query for people who work for Uber. Initially the query returned nearly 40,000 results, which Tung was able to filter using standard filtering as one would expect from a database: Get only current employees, filter by region, etc.
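
The filtering step can be sketched in plain Python, assuming the facts are already available as structured records. The records, field names, and the find_employees function below are made up for illustration; this is not DQL and not Diffbot's API.

```python
from dataclasses import dataclass


@dataclass
class Employment:
    person: str
    employer: str
    region: str
    current: bool


# Made-up records standing in for facts held in a knowledge graph.
records = [
    Employment("Person A", "Uber", "California", True),
    Employment("Person B", "Uber", "New York", True),
    Employment("Person C", "Uber", "California", False),
    Employment("Person D", "Lyft", "California", True),
]


def find_employees(records, employer, current_only=False, region=None):
    """Start from everyone linked to an employer, then narrow down."""
    result = [r for r in records if r.employer == employer]
    if current_only:
        result = [r for r in result if r.current]
    if region is not None:
        result = [r for r in result if r.region == region]
    return result


everyone = find_employees(records, "Uber")
narrowed = find_employees(records, "Uber", current_only=True, region="California")
print(len(everyone), [r.person for r in narrowed])
```

The broad query comes first and each filter shrinks the result, much as one would expect when working against a database.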

And that reference to integrating with databases has far-reaching implications too. The above scenario was based only on information found on the web. But enterprises don't just work with what they find on the web -- they also have their own internal systems and databases, and Tung says DKG can support those as well, offering one access point to rule them all.

DKG may well count as Diffbot's greatest achievement to date, but it did not come out of nowhere. Tung has strong credentials, having designed web-scale information extraction architectures and worked for Microsoft, eBay, and Yahoo. Diffbot has been around since 2008; it counts names such as eBay, Microsoft Bing, and Salesforce among its clients, and Tencent and Bloomberg among its investors.

Impressive as all of that may sound, however, there are a few gotchas.

Language, son

To begin with, not all of DKG is auto-magically created. That's not necessarily a bad thing, but it goes to show the limits of even what "the current state of AI" can do. DKG is seeded by Diffbot's knowledge engineers, who have decided that the entities it will handle are people, companies, locations, articles, products, discussions, and images.

This means that everything Diffbot crawls from the web will be classified as one of those things. Clearly, this decision was driven by what Diffbot's clients are mostly interested in, but not every page on the web fits neatly into one of the entity types DKG currently knows. Tung says they plan to expand this to include categories such as events or medical information.

In other words, Diffbot has consciously chosen to limit the scope of what it handles, to make a well-known problem manageable. To anyone familiar with knowledge graphs (also going by the name of ontologies for the connoisseurs), what Diffbot does is define an upper ontology and populate it from the web. The concept and related challenges are well-known, but the way Diffbot handles this is state of the art.

Which brings us to another key topic: Question answering. If you have the whole web at your fingertips, how are you going to query it? It depends. If you are a business person, ideally, you would like to use natural language. At present, DKG does not support this. It does, however, have its own Diffbot Query Language (DQL).

DQL looks pretty simple, if you are familiar with query languages. But, then again, if you are familiar with query languages, why would you want to have to learn yet another one? There is already a bunch of graph query languages out there, such as SPARQL, Gremlin, and OpenCypher, and with the rise of graph databases, we expect them to become more and more widespread.

This touches upon another issue: Even though Diffbot's approach shares many similarities with semantic web concepts and standards (Tung even specifically mentioned RDF-like subject-predicate-object triples in his breakdown of text processing), its approach is proprietary.

Regardless of whether you know or like those standards, would it not have made Diffbot's life easier to use them? For example, by building DKG on top of an off-the-shelf graph database. Tung acknowledges it would, but he says they tested over a dozen graph databases, and they all broke down at around 10-100M entities, so they had to build something proprietary.

As for the language issue, Tung says their approach is to meet users where they are, eliminating the need for directly using a query language (or an API, which DKG also supports) as much as possible. The way to do this, Tung says, is by integrating DKG with popular systems such as Salesforce, SAP, or Tableau, so users can transparently get data from DKG in their applications.

That may be good for users, but it also places quite a burden on Diffbot to develop and maintain all those integrations. Tung says they intend to develop bridges for popular query languages, however, so integrations will not have to be hand-crafted.

Last but not least, does being able to query the web also mean you should automatically trust the results? Not necessarily. This is why Google and its ilk have developed sophisticated algorithms to rank results, trying to determine the most relevant ones. DKG only partially does this.

You can filter Uber employees by age, for example, but what is the definitive source for that? If source X says a person was born in 1974, and source Y says they were born in 1947, which one should you trust? How do you know they are talking about the same person to begin with?
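
To see why this is hard, consider one naive heuristic: trust the value that most sources agree on. The sketch below is an assumption-laden illustration, not a description of what Diffbot does. The third source Z is invented here to break the tie, and the heuristic takes for granted that all claims refer to the same person, which is exactly the part that cannot be assumed.

```python
from collections import Counter


def most_supported(claims):
    """Return (value, support), where support is the share of sources agreeing.

    `claims` is a list of (source, value) pairs about one attribute of one entity.
    """
    if not claims:
        return None, 0.0
    counts = Counter(value for _source, value in claims)
    value, votes = counts.most_common(1)[0]
    return value, votes / len(claims)


# Sources X and Y disagree; hypothetical source Z is added to break the tie.
claims = [("source X", 1974), ("source Y", 1947), ("source Z", 1974)]
value, support = most_supported(claims)
print(value, round(support, 2))
```

With only X and Y available, the two values would be tied and the vote would settle nothing, which is why real systems also need to weigh how reliable each source is.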

These are well-known, hard-to-tackle issues, and Diffbot has to tackle them like anyone who has come before it. Even as it is, however, DKG is an impressive achievement with many potential applications.
