If you don't know what OpenCorporates does, this is a good opportunity to learn. Whether directly or indirectly, chances are OpenCorporates has already been valuable to you in some way. OpenCorporates crawls the web and aggregates and structures information about corporate ownership and structure all over the world.
Its database contains information on nearly 165 million companies and counting. Although this information is almost entirely collected from the public domain, from places such as national business registries, having it all in one place is very useful. You can think of OpenCorporates as the Google of corporate information.
Whether you are doing market research or investigative journalism, having this information, and being able to query it, is invaluable. As untangling the web of interconnections on corporate ownership and relationships is something that relational databases struggle with at this size and complexity, OpenCorporates is now using a graph database to power its back-end.
Exploring corporate data and relationships
OpenCorporates has been around since 2010. It was launched by Chris Taggart and Rob McKinnon, both veterans of the UK open data scene, as more-or-less a proof-of-concept, with three million companies in three jurisdictions (UK, Jersey, and Bermuda). It has now grown to 165 million companies from 130 jurisdictions.
OpenCorporates acts as a social enterprise, putting the public benefit before profit. The data it collects are available either as a dump or via an API, under a dual licensing scheme. A fee is charged for commercial use, while access is free for non-commercial use, and licensing is tailored for each use.
OpenCorporates has worked with the likes of CNN, The Economist, and The Sunday Times, powering things such as a database that lets the public query names in the Panama Papers. In its own words, "OpenCorporates exists to allow journalists, NGOs, and others to use our data for good", which has earned it the Open Data Business Award, presented by Sir Tim Berners-Lee, the inventor of the World Wide Web, and the Open Data Institute.
OpenCorporates just unveiled its latest offering, a comprehensive dataset on German companies. Keep in mind, this is all public domain data. But the way it is scattered and the lack of structure make it hard to use. OpenCorporates adds value by structuring it, and this is where things get interesting from a technical perspective.
As Taggart explained, OpenCorporates has been using a relational database to store all that data. While this works, and is used as the back-end for the OpenCorporates API, there are limitations to this approach. Whatever the reason you are interested in corporate data, things usually get interesting when you follow connections, for example to answer questions such as: "What is the entire chain of ownership of company X?"
The API available via OpenCorporates does offer some search capabilities, but these are not always enough to address complex information needs. Following complex relation paths between nodes, whether they are companies or people involved in them, is where graph databases shine.
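To make the "entire chain of ownership" question concrete, here is a minimal Python sketch of a breadth-first walk up an ownership graph. The company names and the `OWNERS` mapping are invented for illustration; this is not OpenCorporates data, its API, or its actual implementation.

```python
from collections import deque

# Hypothetical ownership edges: company -> list of its direct owners.
# All names are made up for illustration.
OWNERS = {
    "Company X": ["Holding A"],
    "Holding A": ["Parent B", "Investor C"],
    "Parent B": ["Ultimate D"],
}


def ownership_chain(company, owners):
    """Return every direct and indirect owner of `company`, nearest first."""
    seen = {company}  # guards against circular ownership structures
    chain = []
    queue = deque([company])
    while queue:
        current = queue.popleft()
        for owner in owners.get(current, []):
            if owner not in seen:
                seen.add(owner)
                chain.append(owner)
                queue.append(owner)
    return chain


print(ownership_chain("Company X", OWNERS))
```

In a relational database, the same question typically requires a recursive query or one self-join per level of ownership, which is part of why a traversal-oriented model is attractive here.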
OpenCorporates goes Graph, TigerGraph
The way to do this for OpenCorporates data is either to get the entire dataset and use your platform of choice to explore it on your own, or to use OpenCorporates' value-added services. And this is where graph database competition and evolution come in.
As Taggart said, OpenCorporates has been running two systems in parallel: a relational database, used to power the API and run simple queries, and a graph database for complex queries and analytics. The problem, Taggart said, was that the graph database they had been using so far was having trouble coping with their growing dataset and client base, as well as the increasing complexity of the information needs they have to serve.
Some of the queries OpenCorporates needs to handle include degrees of separation, siblings, up the chain only, temporal graph search, and active vs. dead relationships. Interestingly, this is not just a matter of performance, but also a matter of expressivity.
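As an illustration of two of these query types, the following Python sketch computes degrees of separation between two entities and can optionally ignore "dead" relationships. The entities, the `RELATIONSHIPS` list, and the `active` flag are assumptions made up for this example; they do not reflect the OpenCorporates schema or how such queries are written in GSQL.

```python
from collections import deque

# Hypothetical relationships as (entity_a, entity_b, active) tuples.
RELATIONSHIPS = [
    ("Alice", "Acme Ltd", True),
    ("Acme Ltd", "Bob", False),      # an ended ("dead") relationship
    ("Acme Ltd", "Gamma SA", True),
    ("Gamma SA", "Bob", True),
]


def degrees_of_separation(start, goal, relationships, active_only=True):
    """Return the number of hops between two entities, or None if unconnected."""
    # Build an undirected adjacency map, skipping dead links if requested.
    neighbours = {}
    for a, b, active in relationships:
        if active_only and not active:
            continue
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Breadth-first search finds the shortest path in an unweighted graph.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return distance[current]
        for nxt in neighbours.get(current, ()):
            if nxt not in distance:
                distance[nxt] = distance[current] + 1
                queue.append(nxt)
    return None


print(degrees_of_separation("Alice", "Bob", RELATIONSHIPS))
print(degrees_of_separation("Alice", "Bob", RELATIONSHIPS, active_only=False))
```

The same pair of entities can be closer or further apart depending on whether ended relationships count, which hints at why expressing such conditions cleanly in the query language matters as much as raw speed.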
OpenCorporates was not just struggling to run such queries in reasonable execution times. Sometimes the struggle was to formulate them in the first place. As we have noted before, there are a number of competing query languages for graph databases.
Each comes with its own idiosyncrasies, and many of them are tied to a specific vendor. The vendor OpenCorporates has chosen to switch to is TigerGraph, an up-and-coming startup for which OpenCorporates is a showcase of what it can do. Taggart explained that this forms the basis of a mutually beneficial relationship.
TigerGraph acknowledged the nature of OpenCorporates' work, as well as the high profile that comes with it, and has provided its platform to OpenCorporates under special terms. OpenCorporates wins by migrating to a platform that works for them, while TigerGraph wins by getting exposure and promoting GSQL, its query language.
2019, another year of the graph
This is a salient point. To quote fellow ZDNet contributor Tony Baer: "I always felt graph was better suited being embedded under the hood because it was a strange new database without standards de facto or otherwise. But I'm starting to change my tune - every major data platform provider now has either a graph database or API/engine".
Standards are sorely needed for interoperability and accessibility reasons, and would greatly promote the growth of graph databases. For RDF-based graph databases, standards do exist. For property graph ones, such as TigerGraph, this is not the case. In order to address this, a workshop under the auspices of W3C has been called for early March 2019.
The workshop will try to address both query language and data format standardization. Both of these are very important, and we will be there to contribute and report on progress made. Even though, as noted, standards also have a dark side, we hope this will contribute to the continuous evolution of graph databases.
In our discussion with Taggart, we noted for example how a JSON-LD based standard for graph data could help OpenCorporates move to a single back-end powered by a graph database. This would make exporting data and powering the OpenCorporates API directly with a graph database possible with minimal changes.
The W3C workshop has been initiated and is sponsored by Neo4j, and additional sponsors include Ontotext and Oracle. While some of the top minds in this space will convene in Berlin to discuss and hopefully find common ground, progress is not limited to this.
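To give a rough idea of what graph data expressed as JSON-LD might look like, here is a small Python sketch that serializes a company and its parent as a JSON-LD document. It borrows the `Organization` type and the `name` and `parentOrganization` properties from schema.org purely for illustration; the helper function and the company names are hypothetical, and this is not the format OpenCorporates uses or one proposed by the workshop.

```python
import json


def company_to_jsonld(name, parent_name=None):
    """Build a JSON-LD document for a company (illustrative vocabulary only)."""
    doc = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
    }
    if parent_name is not None:
        # The ownership edge is expressed as a nested node in the document.
        doc["parentOrganization"] = {
            "@type": "Organization",
            "name": parent_name,
        }
    return doc


doc = company_to_jsonld("Company X", parent_name="Holding A")
print(json.dumps(doc, indent=2))
```

Because the document is plain JSON, it could be served by an API as-is, while the `@context` lets graph-aware consumers interpret the nested object as an edge between two nodes.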
As we called 2018 the "Year of the Graph," one question we have been asked since the beginning of 2019 is: "Now what?" We think OpenCorporates is just evidence No. 1 for what will be another great year for graph databases.
Most graph database vendors have been very active, adding features such as cloud deployment, managed versions, and support for machine learning out of the box. We have also seen Redis releasing an initial version of what it presumably wants to grow into a fully fledged graph database, and a number of other vendors making moves as well.
The landscape is constantly shifting, and deserves analysis and highlighting beyond the scope of this article. Stay tuned as we navigate another exciting year of the graph ahead.