Anyone who's ever tried to build decentralized applications (dApps) on the (Ethereum) blockchain would concur: Although blockchains are conceptually quite close to databases, querying databases feels like a different world entirely compared to querying blockchains.
First off, there are notable performance issues with storing data on blockchains. These have a lot to do with the distributed nature of blockchains, and the penalty imposed by the combination of consensus protocols and cryptography.
Databases would be slow, too, if they were composed of a network of nodes in which every node kept a full copy of the entire database, and every transaction had to be verified by every node. This is why people have been experimenting with various approaches to using blockchains as databases, including altering blockchain structure.
The Graph does something different: it lets blockchains be, but offers a way to index and query data stored on them efficiently using GraphQL.
Actually, performance is only part of the issue with retrieving data from blockchains. It gets worse: Blockchains have no query language to speak of. Just let that sink in for a moment: No query language. Imagine a database with no query language! How would you ever get what you need out of it? How do people build dApps, really? With a lot of effort, and brittle, ad-hoc code.
As pointed out in this post by Jesus Rodriguez, blockchain data access is challenging mainly due to three fundamental reasons: Decentralization, Opacity, and Sequential Data Storage. So people are left with a few choices:
They can write custom code to locate the data they need on blockchains and repeat those (expensive) calls every time they need the data. Or they can retrieve the data once, store it in an off-chain database, and build an index that points back to the original blockchain data.
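To make the trade-off between those two choices concrete, here is a minimal sketch in Python. It is a toy model, not how any real client talks to Ethereum: the `CHAIN` list, the contract and event names, and all three functions are hypothetical stand-ins. The first function rescans every block on every request; the other two read the chain once and keep only pointers back to the original data.

```python
from collections import defaultdict

# Toy stand-in for a chain: each block holds a list of (contract, event, payload) logs.
# All names and data here are hypothetical.
CHAIN = [
    {"number": 1, "logs": [("token", "Transfer", {"to": "alice", "value": 5})]},
    {"number": 2, "logs": [("market", "Sale", {"buyer": "bob", "price": 9})]},
    {"number": 3, "logs": [("token", "Transfer", {"to": "bob", "value": 2})]},
]


def scan_chain(chain, contract, event):
    """Choice 1: walk every block sequentially on every request."""
    return [
        (block["number"], payload)
        for block in chain
        for (c, e, payload) in block["logs"]
        if c == contract and e == event
    ]


def build_index(chain):
    """Choice 2: read the chain once and keep an off-chain index."""
    index = defaultdict(list)
    for block in chain:
        for position, (c, e, _payload) in enumerate(block["logs"]):
            # Store only a pointer (block number, log position) back to the chain.
            index[(c, e)].append((block["number"], position))
    return index


def lookup(chain, index, contract, event):
    """Resolve index pointers back to the original on-chain payloads."""
    by_number = {block["number"]: block for block in chain}
    return [
        (number, by_number[number]["logs"][position][2])
        for number, position in index.get((contract, event), [])
    ]
```

Both paths return the same answer. The difference is that the scan touches every block each time, while the index has to be built, stored, and kept in sync by whoever runs it, which is exactly the custom, unverifiable work the article goes on to describe.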
This is where The Graph comes in. The Graph is a decentralized protocol for indexing and querying blockchain data. But it's more than just a protocol: The Graph also has an implementation, which is open source and uses GraphQL.
GraphQL is a query language for APIs, developed and open sourced by Facebook. GraphQL has taken on a life of its own: it is gaining in popularity and is being used to access databases, too -- see Prisma or FaunaDB, for example.
ZDNet had a Q&A with The Graph's co-founders, project lead Yaniv Tal and research lead Brandon Ramirez.
In Tal's words, right now, teams working on dApps have to write a ton of custom code and deploy proprietary indexing servers in order to efficiently serve applications. Because all of this code is custom there's no way to verify that indexing was done correctly or outsource this computation to public infrastructure.
By defining a standardized way of doing this indexing and serving queries deterministically, Tal went on to add, developers will be able to run their indexing logic on public open infrastructure where security can be enforced.
The Graph have open sourced all their main components, including Graph Node (an implementation of an indexing node built in Rust), Graph TS (AssemblyScript helpers for building mappings), and Graph CLI (command line tools for speeding up development).
The Graph, an open source protocol and implementation
As per Tal, the core of what The Graph have done is to define a deterministic way of doing indexing. Graph Node defines a store abstraction that they implement using Postgres:
"Everything you need to run a subgraph is open source. Right now, we use Postgres under the hood as the storage engine. Graph Node defines a store abstraction that we implement using Postgres and we reserve the right to change the underlying DB in the future. We've written a lot of code but it's all open source so none of this is proprietary."
The subgraph that Tal refers to here is simply a part of the blockchain used to store data for specific dApps. Defining a subgraph is the first step in using The Graph. Subgraphs for popular protocols and dApps are in use already, and can be browsed using the Graph Explorer, which provides a user interface to execute GraphQL queries against specific smart contracts or dApps.
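For readers who have not seen one, a GraphQL query against a subgraph might look roughly like the sketch below. The `transfers` entity, its fields, and the `first` argument are assumptions made for illustration; the actual names depend entirely on the schema a given subgraph defines. The Python wrapper only shows the JSON body that GraphQL clients conventionally send over HTTP POST, and deliberately makes no network call.

```python
import json

# Hypothetical query: "transfers" and its fields are illustrative and depend
# entirely on the schema a given subgraph defines.
QUERY = """
query RecentTransfers($first: Int!) {
  transfers(first: $first) {
    id
    to
    value
  }
}
"""


def build_request(query, variables):
    """Serialize a GraphQL request body in the usual {"query", "variables"} shape."""
    return json.dumps({"query": query, "variables": variables})


# The resulting string is what a client would POST to a subgraph's endpoint.
body = build_request(QUERY, {"first": 5})
```

Compared with hand-written code that walks blocks and decodes logs, the query states only which entities and fields are wanted, and leaves locating them to the indexer.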
When The Graph was introduced in July 2018, Tal mentioned they would launch a local node, a hosted service, and then a fully decentralized network. The hybrid network is a version of the protocol design that bridges the gap between the hosted service, which is mostly centralized, and the fully decentralized protocol.
Users can run their own instance of The Graph, or they can use the hosted service. This inevitably leads to the question about the business model employed by The Graph, as running a hosted service costs money.
According to Ramirez, The Graph's business (token) model is the work token model, which will kick off when they launch the hybrid network. Indexing Nodes, which have staked to index a particular dataset, will be discoverable in the data retrieval market for that dataset. Payment in tokens will be required to use various functions of the service.
The hosted service, Ramirez went on to add, ingests blocks from Ethereum, watches for "triggers," and runs WASM mappings, which update the Postgres store. There are currently no correctness guarantees in the hosted service, as users must rely on The Graph as a trusted party.
In the hybrid network there will be economic security guarantees that data is correct, and in the fully decentralized network there will be cryptographic guarantees as well. The goal would be to transition everyone on the hosted service to the hybrid network once it launches, although Ramirez said they wouldn't do this in a way that would disrupt existing users.
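The ingest, trigger, and mapping pipeline Ramirez describes for the hosted service can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Graph Node's code: a plain dict stands in for the Postgres store, an ordinary function stands in for a WASM mapping, and the block layout, `TRIGGERS` table, and `handle_transfer` handler are all hypothetical.

```python
# Hypothetical in-memory stand-ins: the real service runs WASM mappings
# and writes to Postgres; here a dict plays the role of the store.
store = {}


def handle_transfer(args, store):
    """Mapping: turn a raw event into an entity update in the store."""
    account = store.setdefault(args["to"], {"id": args["to"], "balance": 0})
    account["balance"] += args["value"]


# Triggers: which (contract, event) pairs the indexer watches for.
TRIGGERS = {("token", "Transfer"): handle_transfer}


def ingest(blocks, triggers, store):
    """Process blocks in block-number order so the resulting store is deterministic."""
    for block in sorted(blocks, key=lambda b: b["number"]):
        for log in block["logs"]:
            handler = triggers.get((log["contract"], log["event"]))
            if handler is not None:  # logs without a matching trigger are ignored
                handler(log["args"], store)
    return store


# Blocks deliberately listed out of order to show that ordering is enforced.
BLOCKS = [
    {"number": 2, "logs": [
        {"contract": "token", "event": "Transfer", "args": {"to": "bob", "value": 2}},
    ]},
    {"number": 1, "logs": [
        {"contract": "token", "event": "Transfer", "args": {"to": "alice", "value": 5}},
        {"contract": "market", "event": "Sale", "args": {"buyer": "bob", "price": 9}},
    ]},
    {"number": 3, "logs": [
        {"contract": "token", "event": "Transfer", "args": {"to": "bob", "value": 5}},
    ]},
]

ingest(BLOCKS, TRIGGERS, store)
```

Because the same blocks and the same mappings always produce the same store, a second party could in principle rerun the indexing and compare results, which is the property that the economic and cryptographic guarantees mentioned above would build on.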
Using GraphQL with dApps
Now, GraphQL is popular, and it certainly beats having no query language at all. But there are also some popular misconceptions around it, and it's good to be aware of them when considering The Graph, too. A significant part of GraphQL, added relatively recently, is its SDL (Schema Definition Language). This may enable tools to center the development process around a GraphQL schema.
Developers may create their domain model in SDL, and then use it not just to validate the JSON returned by GraphQL, but also to generate code, in MDD (Model Driven Development) fashion. In any case, using GraphQL does not "magically" remove the complexity of mapping across many APIs. It simply abstracts and transposes it to the GraphQL resolver.
So unless there is some kind of mapping automation/maintenance mechanism there, the team that uses the APIs abstracted via GraphQL may have a better experience, but this is at the expense of the team that maintains the API mappings. There's no such thing as a free lunch, and the same applies to blockchains.
Even more so, in fact, as smart contracts cannot at this point be driven by GraphQL Schema. You first need to create a smart contract, then the GraphQL Schema and resolver for it. This makes for a brittle and tiresome round-trip to update schema and resolver each time the smart contract changes. Ramirez acknowledged this, and elaborated on the process of accessing smart contract data via GraphQL:
"The GraphQL schema is used to express a data model for the entities, which will be indexed as part of a subgraph. This is a read-schema, and is only exposed at layer two, not in the smart contracts themselves. Ethereum doesn't have the semantics to express rich data models with entities and relationships, which is one reason that projects find querying Ethereum via The Graph particularly useful.
If a smart contract ABI changed in breaking ways, then this could require mappings to be updated if they were relying on the parts of the interface, but this isn't a Graph specific problem, as any application or service fetching data directly from that smart contract would have similar problems.
Generally making breaking changes to an API with real usage is a bad idea, and is very unlikely to happen in the smart contract world once shipped to production and widely used (defeats the purpose).
Part of the "magic" of The Graph is that we auto-generate a "read schema" and resolvers based on your data model. No need to maintain anything but the data model schema and the mappings, which shouldn't need to change often. We're also adding support for custom resolvers, however, for more advanced users."
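To illustrate what auto-generating resolvers from a data model can mean, the following Python sketch derives two read resolvers from nothing but an entity name. It is a hypothetical illustration of the idea, not The Graph's generator: the `make_resolvers` function, the naive pluralization, the `first` and `skip` parameters, and the `STORE` layout are all assumptions.

```python
def make_resolvers(entity_name, store):
    """Derive read resolvers for one entity type from its name alone."""
    plural = entity_name.lower() + "s"  # naive pluralization, for illustration only

    def get_one(id):
        # Single-entity lookup by id; returns None when the entity is unknown.
        return store.get(entity_name, {}).get(id)

    def get_many(first=100, skip=0):
        # Paginated listing, ordered by id so results are stable across calls.
        rows = sorted(store.get(entity_name, {}).values(), key=lambda r: r["id"])
        return rows[skip:skip + first]

    return {entity_name.lower(): get_one, plural: get_many}


# Hypothetical store contents, keyed by entity type and then by entity id.
STORE = {"Account": {"a": {"id": "a", "balance": 5}, "b": {"id": "b", "balance": 7}}}
resolvers = make_resolvers("Account", STORE)
```

With something along these lines, adding a new entity to the data model yields its query fields without any hand-written resolver, leaving the schema and the mappings as the only artifacts to maintain.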
While The Graph has still not reached complete maturity, it has come a long way in a short time, considering it was announced in June 2018. It is already seen as a significant part of the Web3 stack, and for good reason. Besides transitioning to a fully decentralized implementation, we wondered what else is on the roadmap.
What about mutations, i.e. offering dApps the ability to also write data to the blockchain via GraphQL? Ramirez said this is on the roadmap, but will ultimately be a client-side concern:
"The protocol cannot sign a transaction on the user's behalf based on a mutation. Instead, dApp developers will write a mutation against a local GraphQL API which would then generate a transaction to be signed with the user's private key."
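The division of labor Ramirez outlines, where a local mutation produces a transaction and only the user signs it, might be sketched as follows. Everything here is hypothetical, including the `transfer_mutation` and `sign_locally` names and the transaction fields. In particular, the signer uses HMAC-SHA256 from Python's standard library purely as a placeholder; it is not the ECDSA scheme Ethereum actually uses.

```python
import hashlib
import hmac
import json


def transfer_mutation(to, value):
    """Local mutation resolver: builds an unsigned transaction, signs nothing."""
    return {"to": to, "value": value, "data": "transfer"}


def sign_locally(tx, private_key: bytes):
    """Stand-in signer: HMAC-SHA256, NOT the ECDSA signing Ethereum uses."""
    payload = json.dumps(tx, sort_keys=True).encode()
    signature = hmac.new(private_key, payload, hashlib.sha256).hexdigest()
    # Return a new dict so the unsigned transaction is left untouched.
    return {**tx, "signature": signature}


# The key never leaves the client; the mutation layer only sees the unsigned dict.
unsigned = transfer_mutation("bob", 3)
signed = sign_locally(unsigned, b"user-held-key")
```

The point of the split is that the GraphQL layer never needs access to the key: it only describes the intended change, and the user's own client decides whether to sign and submit it.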
Ramirez went on to add that The Graph can index any data that is on-chain, for example data that is supplied by Oracles via Chainlink. Furthermore, he said, Chainlink Oracles may also decide to provide a data feed that is pulled from a GraphQL subscription via The Graph, so the integration works in both directions.
And what about off-chain databases? A dApp may want to store some of its data there for caching, privacy, performance, or other reasons. We wondered whether this is a scenario The Graph has come across, and what they would recommend in that case. Ramirez's reply was that if an application is using centralized databases off-chain, then they are no longer technically a "dApp":
"However, they could still leverage The Graph via schema stitching to compose their decentralized and centralized data sources. This is becoming a common pattern in the GraphQL ecosystem. If the other data stores are also decentralized, then they would also be able to use The Graph for this. For example, today we support indexing off-chain data stored in IPFS, and plan on adding support for other blockchains in the future."
Concluding the discussion, Ramirez emphasized that their vision is to fully decentralize the internet application stack, primarily focusing on operational dApps. For a number of reasons, he added, GraphQL makes more sense strategically to achieve that goal. SQL is on the agenda, too, and it could still be very useful for other, secondary, use cases, but right now they are staying focused on the main mission.
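Schema stitching, as mentioned in the quote above, amounts to presenting several data sources behind one combined schema. The Python sketch below reduces that idea to merging resolver maps; it is an assumption-laden illustration rather than any GraphQL library's stitching API, and the `stitch` function, both sources, and their fields are hypothetical.

```python
def stitch(*sources):
    """Merge several resolver maps into one; reject conflicting field names."""
    root = {}
    for source in sources:
        for field, resolver in source.items():
            if field in root:
                raise ValueError(f"field {field!r} defined by more than one source")
            root[field] = resolver
    return root


# Hypothetical sources: one backed by indexed on-chain data, one by a central database.
on_chain = {"account": lambda id: {"id": id, "balance": 7}}
off_chain = {"profile": lambda id: {"id": id, "nickname": "bobby"}}
root = stitch(on_chain, off_chain)


def account_with_profile(id):
    """Compose one response from both sources, as a stitched query would."""
    return {**root["account"](id), "profile": root["profile"](id)}
```

Rejecting duplicate field names is a design choice made here for simplicity; real stitching tools typically offer ways to rename or merge conflicting types instead.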
As to whether GraphQL can be used for analytics too, Ramirez said that while expressing an analytics query using GraphQL semantics is possible, he has not seen too many examples of this yet:
"I could imagine something akin to Elastic's Query DSL expressed in GraphQL rather than JSON, which I think could be made to be quite ergonomic. There's also a question of whether analysts and data engineers want to write analytics style queries this way vs. just using SQL."