A new open-source library by Nvidia could be the secret ingredient to advancing analytics and making graph databases faster. The key: parallel processing on Nvidia GPUs.
Nvidia long ago stopped being "just" a hardware company. As its hardware is what much of the compute supporting the explosion in AI runs on, Nvidia has taken upon itself the task of paving the last mile to the software. Nvidia does this by developing and releasing libraries that software developers and data scientists can use to integrate GPU power in their work.
The premise is simple: Not everyone is a specialist in parallelism or wants to be one. Parallel programming is hard. Yet, this is what is required to take advantage of GPU capabilities and boost performance in software and analytics. So, Nvidia provides libraries people can use to build their software, without knowing all the implementation and hardware details.
Nvidia has been doing this with CUDA since 2007. Since then, Nvidia has released more than 40 Nvidia CUDA-X libraries. The most recent is Rapids, an open-source data science platform that serves as the umbrella under which several data science initiatives, such as Dask and XGBoost, have evolved.
Nvidia is now releasing Rapids cuGraph 0.9, a library whose goal is to make graph analysis ubiquitous. This could be the foundation for major developments in graph analytics and graph databases. Graph is a field we have been closely monitoring, but we're no longer the only ones, and that's not the only reason why we think this is big.
Graph analytics on steroids
This brief excerpt is taken from Gartner's analysis of why graph will rule the world in the 2020s:
"The application of graph processing and graph databases will grow at 100 percent annually through 2022 to continuously accelerate data preparation and enable more complex and adaptive data science."
Brad Rees, however, has been working with graphs since long before it was cool.
Rees started working with graph programming and analytics, data science, and AI in the 1980s. After several years and projects, Rees found his way to Nvidia in 2017, joining what was then a nascent graph effort within the company. Today, Brad Rees is the AI infrastructure manager at Nvidia, tasked with bringing graph analytics and algorithms to the world.
Rees got interested in GPU programming around the time CUDA 2.0 came out. As others, too, have pointed out, the meshes used in graphics processing are kind of a natural match for graph processing: Each node represents a concept, each edge represents a relationship.
So, the fact that GPUs can speed up graph processing did not go unnoticed. When Rees joined Nvidia, there were already implementations for a few graph algorithms on GPUs. As Rees explained, however, these were not very systematic, or very well integrated within the Nvidia ecosystem. As the algorithm collection grew, and graph was gaining steam, cuGraph was born, and Rees became the project leader.
CuGraph is a collection of graph algorithms implemented over Nvidia GPUs. That may not sound like much if you're not into graph algorithms, so to put that into context, let's say that PageRank, the famous algorithm that Google built its empire on, is a graph algorithm, too.
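To give a feel for what such an algorithm does, here is a minimal sketch of PageRank in plain Python, using textbook power iteration. This is our own CPU-only illustration, not cuGraph's implementation or API; the function name, the damping factor of 0.85, and the toy edge list are all assumptions made for the example.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Textbook power-iteration PageRank over directed (src, dst) edges.

    Illustrative only: cuGraph runs this kind of computation in parallel
    on the GPU; this sketch is sequential and unoptimized.
    """
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        # Every node passes an equal share of its rank to each neighbor.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy graph: a three-node cycle plus one extra node linking in.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(edges)
top_node = max(ranks, key=ranks.get)
```

In this toy graph, node "c" ends up with the highest score because it is the only node with two incoming links. The per-node loop is the part that lends itself to parallel execution, which is where GPUs come in.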
There are many graph algorithms around, and each algorithm can provide insights for different data analysis scenarios. When cuGraph's first official release, 0.6, came out in late March, it already contained many algorithms, including PageRank. That initial release focused on providing a foundation and included several algorithms optimized for single-GPU analytics.
With the release of version 0.9, Nvidia cuGraph is coming one step closer to 1.0. As Rees explained, the goal is not just to keep adding algorithms to cuGraph, but to make them work over multiple GPUs, too. This has now been achieved for PageRank. Even in version 0.6, however, cuGraph was already up to 2,000 times faster than NetworkX.
NetworkX is a graph analytics framework for Python, and the one cuGraph was modeled on: the aim is to do everything NetworkX does, but on GPUs. NetworkX was chosen because it's the most popular graph framework used by data scientists. NetworkX on steroids would already be quite a feat, but the vision goes way beyond that, and the implications are quite interesting.
The vision for Nvidia cuGraph
Rees noted that cuGraph development would slowly shift toward improving ease-of-use, interoperability, and integration with the rest of Nvidia's Rapids library.
In a blog post, Rees went on to explain how cuGraph utilizes the property graph paradigm, and how Data Frames are the key to interoperability with Rapids. Rees said that Data Frames could be used to build graphs, run algorithms on those graphs, and then take the data those algorithms produce and add it to the original Data Frames as needed.
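The round trip Rees describes (table to graph, graph to result, result back to table) can be sketched in a few lines. The example below is a deliberately simplified stand-in: plain Python dictionaries of columns play the role of Data Frames, node degree plays the role of the algorithm, and every function and column name is hypothetical rather than part of cuGraph or Rapids.

```python
# Column-oriented tables standing in for Data Frames (illustrative only).
edge_table = {
    "src": ["alice", "alice", "bob", "carol"],
    "dst": ["bob", "carol", "carol", "dave"],
}
vertex_table = {
    "name": ["alice", "bob", "carol", "dave"],
    "team": ["red", "red", "blue", "blue"],
}


def build_graph(table, source, destination):
    """Step 1: build an undirected adjacency structure from two columns."""
    adjacency = {}
    for s, d in zip(table[source], table[destination]):
        adjacency.setdefault(s, set()).add(d)
        adjacency.setdefault(d, set()).add(s)
    return adjacency


def degree(adjacency):
    """Step 2: run an algorithm on the graph (here, simply node degree)."""
    return {vertex: len(neighbors) for vertex, neighbors in adjacency.items()}


def add_column(table, key, column_name, values, default=0):
    """Step 3: join the per-vertex result back onto the original table."""
    table[column_name] = [values.get(k, default) for k in table[key]]
    return table


graph = build_graph(edge_table, "src", "dst")
add_column(vertex_table, "name", "degree", degree(graph))
```

After the last call, the vertex table carries a new "degree" column next to its original ones, so the graph result can be filtered, grouped, or fed into a model like any other feature.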
CuGraph's roadmap also includes adding dynamic data structures. These can come in handy when analyzing graph changes over time. As data is streamed in, how the structure of a network changes can be monitored and reported on.
Of equal importance is the use of a dynamic structure within analytics. In many cases, the size of the result set is unknown a priori. Being able to collapse, expand, add to, and reduce either the graph or the results on-the-fly is a powerful technique.
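As a rough illustration of monitoring structure while data streams in, the sketch below tracks how many connected components a graph has after each new edge arrives, using a union-find structure. It is our own minimal example, not something taken from cuGraph's roadmap: the class name and the sample stream are hypothetical, and it handles edge insertions only, since supporting deletions requires more elaborate data structures.

```python
class StreamingComponents:
    """Tracks the number of connected components as edges arrive one by one."""

    def __init__(self):
        self.parent = {}
        self.components = 0

    def _find(self, vertex):
        # A vertex seen for the first time starts as its own component.
        if vertex not in self.parent:
            self.parent[vertex] = vertex
            self.components += 1
        root = vertex
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every visited vertex straight at the root.
        while self.parent[vertex] != root:
            next_vertex = self.parent[vertex]
            self.parent[vertex] = root
            vertex = next_vertex
        return root

    def add_edge(self, u, v):
        """Insert an edge and return the updated component count."""
        root_u, root_v = self._find(u), self._find(v)
        if root_u != root_v:
            self.parent[root_u] = root_v
            self.components -= 1
        return self.components


# Hypothetical edge stream; the history records the count after each edge.
stream = [(1, 2), (3, 4), (2, 3), (5, 6), (1, 4)]
tracker = StreamingComponents()
history = [tracker.add_edge(u, v) for u, v in stream]
```

The history shows separate groups appearing and then merging as edges arrive, which is the kind of change over time that a dynamic structure makes cheap to report without recomputing from scratch.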
It does not stop there, though. Rees noted that cuGraph would be adding state-of-the-art graph analytics frameworks such as GraphBLAS and Hornet while keeping an eye on all new developments and seamlessly integrating them under cuGraph for developers to use.
The reference to property graphs, however, was a trigger for a more speculative discussion, which touches upon not just algorithms, but databases, too. Property graphs are one of the two most widespread ways to model graphs. Several graph databases have adopted the model, and it's the focal point for W3C's ongoing effort to standardize graph databases.
So, we wondered what the interplay between cuGraph and graph databases might be. To begin with, we should emphasize the difference: cuGraph is an analytics framework, optimized to load data and run algorithms. Databases, on the other hand, are also supposed to store data. Although cuGraph is not geared toward this, there are a couple of ways cuGraph can influence, and be influenced by, graph databases.
Graph queries and graph databases
We've been seeing databases offering graph analytics frameworks, regardless of whether they are graph databases. You may have a relational database modeling and storing data in tables, for example, which comes bundled with a framework that allows querying data in a graph query language such as Cypher. So, if they can do this, could, and should, cuGraph do this too?
Rees noted that adding support for Cypher, for example, is feasible. Whether, or when, this may be done, however, is a different story. The utility of doing this would be significant: it's easier to express processing as part of a query, potentially even in an interactive environment, than it is using an API. Using an API requires programming skills, whereas writing a query is something an analyst can do, too.
Equally important, if not more so, is the reverse direction. Many graph databases have started offering graph algorithm implementations out of the box. As running graph algorithms is a common use case, would it not make sense for them to integrate with cuGraph to boost their performance? It totally would, and they totally could.
Although this is somewhat speculative, let us note a couple of points. First, there are graph databases rumored either to already have, or to be working on, support for GPUs. Second, there are many GPU databases around. As there is a growing demand for graph processing, we may soon see one or more of those adding it to their capabilities.
This is why cuGraph may prove a decisive factor that will influence the graph database landscape: adding graph query capabilities would reinforce both cuGraph and whatever query language gets to be supported, while adding cuGraph to a database offering would work similarly, too.
In the greater scheme of things, cuGraph's bet is to make graph analysis ubiquitous. Achieving this would not only mean faster analytics, but could also be a stepping stone toward the future of AI, which, to a large extent, goes through graph. CuGraph is something to keep an eye on.