Graph analytics databases have been described as embodying the next generation of data storage connected with AI, and that's what innovation is all about: a new-and-improved version of how something works.
All the data nodes in a graph database are connected. These are databases that use graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph itself, which connects information with a subject or topic far in advance so that the time spent by the querier is substantially lessened.
Graph DBs are now being used for a rapidly increasing number of use cases. AI is considered by many as the "new electricity," something we use and rely on every day; graphs enable that. We all use the PageRank algorithm for every web search, and we depend on community detection algorithms to uncover fraud and money-laundering rings; graphs enable those. Meanwhile, similarity matching algorithms identify healthcare patients who need urgent help or customers in financial services, retail, and e-commerce who are ready to buy. Graphs are ideal for handling these tasks.
Graph algorithms are the driving force behind the next generation of AI and machine learning that will power even more industries and use cases. To this end, Redwood City, Calif.-based graph analytics provider TigerGraph has enlisted as chief scientist a thought leader in this sector, Dr. Alin Deutsch, a data science professor at UC San Diego. He recognized graph analytics as valuable early on and has become one of the world's top experts in the field.
Here's a Q&A with Dr. Deutsch based on a recent interview.
Q: What can graph analytics do that no other technology can do?
A: Well, graphs are particularly good at allowing the user to think about, discover and learn about connections between data items. That's the whole point; these connections are first-class citizens. A graph is a mathematical object known for centuries, consisting of nodes and edges. The nodes represent the real-life objects we wish to model. The edge is the relationship between them; for example, when I follow you on Twitter, or we are friends on Facebook, etc. The relational standard tabular data model does not treat these connections between items as first-class citizens. They have to be inferred through very expensive computation as you run your analytics. In a graph, this connection is materialized, and you just use it to hop from one person to their friend (on a social network). And this makes it possible to perform traversals across multiple such hops, or connections, over huge scales of data at a performance that a tabular implementation cannot match.
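The hop-by-hop traversal described in this answer can be sketched in a few lines of Python. This is only a minimal illustration, not how TigerGraph or any other graph database stores data; the `follows` dictionary, the account names, and the `within_hops` helper are all made up for the example.

```python
from collections import deque

# Hypothetical "follows" graph: each node maps directly to its neighbors,
# so the connection is stored rather than recomputed at query time.
follows = {
    "ana": ["ben", "cho"],
    "ben": ["dev"],
    "cho": ["dev", "eli"],
    "dev": ["fay"],
    "eli": [],
    "fay": [],
}

def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for every node reachable in at most max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]  # report only the other nodes
    return seen

print(within_hops(follows, "ana", 2))
```

Each hop is a direct lookup of a node's stored neighbors, which is the sense in which the connection is "materialized" in this sketch.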
Q: Why is graph analytics considered the wave of the future in documenting data?
A: We have always thought of data as being interlinked. We are always interested in the way data points relate to each other. It's just that we were previously limited by the tabular thinking, and we could only look so far ahead, hopping from connection to connection, but we always wanted to look at these connections. And now, finally, there is a technology that is developed precisely for finding these connections at distances between objects. And this is how your most interesting insights about the data come about: when you start seeing connections that are not obvious--which means they have to be in a graph setting. (These are) nodes of the graph that are reachable by a long chain of hops between each other. And that's exactly where new machine-learning algorithms and graph analytic tasks are shining because they can exploit this complex and deep connectivity information that the graph offers.
Getting more traction in a SQL-led world
Q: How do you see graph analytics getting more traction in the database world as conventional DBs begin to max out with the surging amount of data that we're now creating and storing?
A: First of all, graph (analytics) is small but growing exponentially in terms of adoption within the industrial sectors. As a matter of fact, interesting articles are showing that this has been one of the fastest-growing database niches out there. So in terms of growth rates, it has long overtaken the standard classical relational databases. If it stays on this trajectory, it will catch up very quickly. Actually, I would not call it small at this point anymore; it used to be but not anymore.
How does it get more traction? Well, as you point out in this question, the flow of information is just exploding on us, and we have to try to gain insight from it quickly. And if first I have to take the data that is spread across various tables and put these tables together in expensive computational steps in order to extract my insight, I will not be able to do this in real-time and over the large scales that we are seeing. And that's the reason why graph technology, which treats the connection as a native first-class citizen, is perfectly positioned here to help us learn more about data within the same time unit. So it's really the throughput of how much we can learn and analyze, taking these connections into account.
The classical technology basically computes these connections again and again, and that computation is known as the join operation. You can have only so many joins until you bring the (database) engine to its knees. And the more data you throw into this process, the quicker you will bring the engine to its knees.
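To make the cost of repeated joins concrete, here is a rough sketch that keeps connections in a flat edge table and answers a multi-hop question by matching the current set of nodes against the whole table once per hop. The `edges` rows and the `join_hops` function are illustrative assumptions, and the naive nested-loop matching is chosen for clarity; real relational engines use indexes and smarter join strategies, so the counts below only indicate the trend.

```python
# Hypothetical edge table: one (source, target) row per connection.
edges = [
    ("ana", "ben"), ("ana", "cho"), ("ben", "dev"),
    ("cho", "dev"), ("cho", "eli"), ("dev", "fay"),
]

def join_hops(edge_table, start, hops):
    """Find nodes exactly `hops` steps from start via one join per hop.

    Returns (nodes, comparisons), where comparisons counts the row pairs
    examined by the naive nested-loop join.
    """
    current = {start}
    comparisons = 0
    for _ in range(hops):
        next_nodes = set()
        for node in current:
            for source, target in edge_table:  # scan the whole table per join
                comparisons += 1
                if source == node:
                    next_nodes.add(target)
        current = next_nodes
    return current, comparisons

nodes, cost = join_hops(edges, "ana", 3)
print(sorted(nodes), cost)
```

In this sketch every additional hop rescans the full table for each node reached so far, so the work grows with both the number of hops and the size of the table.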
Q: What role will graph play in the growing trend toward more and more streaming data?
A: It will play the same role; streaming data means an even higher volume at which data hits you, so you have even less time to touch every data item in order to compute your analytics to learn your insights. So again, the scale will be extremely important. If you want your insights to be interesting enough and to understand that certain objects are connected via a non-trivial path in this graph, then you will need to exploit the advantages of graph database technology.
Q: How does graph work with analytics and machine learning assets to help solve fraud and cybercrime cases? These are key use cases for graph analytics.
A: It is one of the killer apps one typically touts everywhere. Not that this is the only one -- there are many. But what is specific to these applications is the fact that you want again to identify connections. For fraud and money-laundering schemes, for example: There is an account from which there was a transfer to another account, from which there was yet another transfer to another, and so on and so forth. At some point, you reach the destination account, so this means that you have to be able to follow this graph of accounts and transfers between them along a complicated path. Paths are deliberately made complicated and long because that's how the fraud perpetrators are trying to obfuscate how money moves.
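Following a chain of transfers from one account to a destination account can be sketched as a breadth-first search that remembers how each account was reached. The account names, the `transfers` mapping, and the `trace_transfers` helper are hypothetical, and the sketch ignores amounts and timestamps that a real investigation would need.

```python
from collections import deque

# Hypothetical transfers: account -> accounts it sent money to.
transfers = {
    "acct_A": ["acct_B", "acct_C"],
    "acct_B": ["acct_D"],
    "acct_C": ["acct_D"],
    "acct_D": ["acct_E"],
    "acct_E": ["acct_F"],
    "acct_F": [],
}

def trace_transfers(graph, source, destination):
    """Return one shortest chain of accounts from source to destination, or None."""
    parent = {source: None}  # how each account was first reached
    queue = deque([source])
    while queue:
        account = queue.popleft()
        if account == destination:
            path = []
            while account is not None:  # walk back to the source
                path.append(account)
                account = parent[account]
            return path[::-1]
        for receiver in graph.get(account, []):
            if receiver not in parent:
                parent[receiver] = account
                queue.append(receiver)
    return None

print(trace_transfers(transfers, "acct_A", "acct_F"))
```

The search does not need to know in advance how many transfers separate the two accounts; it keeps expanding until it reaches the destination or runs out of accounts.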
So this is a typical example of following connections between nodes in the graph along the edges. Cybercrime detection/prevention also applies; there is a path of little steps that are done within the criminal process in order to never draw attention to any one step, but they start accumulating. You want to find this sequence, this path. That's a perfect example of finding reachable nodes in a graph; moreover, it's identifying the path through which this reachability happens. For example, find which are the third-party enablers, which are servers that have been compromised and that are somewhere on this path, and so on. One example we encountered was this: We wanted to identify in real-time whether a credit-card transaction could be fraudulent based on the knowledge that a certain terminal, maybe at the gas station, had been compromised and whether a user may have lost their data when their card data had been read. It was a question of "Is this card connected to this compromised terminal?" And "Is this card now connected to a new transaction?" and therefore, can I infer that this transaction is potentially fraudulent? Again connection, connection, connection--it comes down to that, and that's exactly what a graph model and background databases can very quickly traverse.
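The credit-card example comes down to two connection checks: is the transaction's card linked to a terminal, and is that terminal known to be compromised? A minimal sketch, assuming made-up identifiers and in-memory sets in place of a real graph store:

```python
# Hypothetical edges: which terminals each card was read at.
card_used_at = {
    "card_1": {"terminal_gas_7", "terminal_shop_2"},
    "card_2": {"terminal_shop_2"},
}
compromised_terminals = {"terminal_gas_7"}

# Hypothetical incoming transactions: transaction id -> card used.
transaction_card = {"txn_100": "card_1", "txn_101": "card_2"}

def is_suspicious(txn_id):
    """Flag a transaction if its card is linked to any compromised terminal."""
    card = transaction_card[txn_id]
    # Set intersection: does the card share any terminal with the compromised set?
    return bool(card_used_at.get(card, set()) & compromised_terminals)

print(is_suspicious("txn_100"), is_suspicious("txn_101"))
```

A production system would weigh more signals than this single link, but the core question is the same two-hop lookup from transaction to card to terminal.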
Key use cases for graph DBs
Q: What are some use cases involving the graph that might not be readily apparent to some potential customers?
A: Of course, there are many. The interesting thing is that unless your application is really focused on looking at the data as a spreadsheet, every other use of data is compatible and perfectly suited for graph technology. As soon as I have two spreadsheets, I can connect them to get complete use of the graph. In a conventional database, finding useful information in those two spreadsheets meant hopping between them and trying to compute and find which are the links; (in graph DBs) all these links we already have, and they're efficiently stored without the need for computation. So that means that any interesting data that has any linkage in it, that we have been traditionally for 40 years doing relational style in enterprise applications, is already perfectly suited for graph technology.
Everything that people have been doing in large-scale data manipulation is actually a perfect example of something that will benefit from graph technology. In the healthcare domain, for example, there is the 360-degree view of a patient with all the visits, all the missed appointments, all the insurance claims. These are all connected. One of the use cases we identified, for example, was finding those customers or patients who had missed an appointment that was important and had ramifications; they needed to be sent a postcard or a reminder of their appointment. We have seen efforts in this particular domain to identify the providers who were most efficient, the ones with the most positive outcomes at the fewest claim values for insurance. These are again interesting connections between all of these transactions that we have about who treated whom, when, how much they cost, etc.
Supply chain management applications are another example. Again, it's all about connection to one thing at the end--the raw materials need to make it all the way into a finished product, and before that, they go through many stages, and each stage is linked to the next. You have to find ways to navigate this. What happens when a particular warehouse is compromised, for example? There are floods in Europe that have impacted the supply chain; how can we find another path? And so on. Pathfinding through processes is often what graphs facilitate.
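Finding another path when a stage of the supply chain is knocked out can be sketched with Dijkstra's shortest-path algorithm over a small weighted graph. The stage names, the lead times in days, and the `fastest_route` helper are all assumptions made up for this illustration.

```python
import heapq

# Hypothetical supply chain: stage -> {next stage: lead time in days}.
supply_chain = {
    "raw_materials": {"warehouse_eu": 2, "warehouse_us": 5},
    "warehouse_eu": {"factory": 3},
    "warehouse_us": {"factory": 4},
    "factory": {"finished_product": 1},
    "finished_product": {},
}

def fastest_route(graph, start, goal, unavailable=frozenset()):
    """Dijkstra's algorithm that skips stages listed in `unavailable`.

    Returns (total_days, route), or None if no route exists.
    """
    heap = [(0, start, [start])]
    settled = set()
    while heap:
        days, stage, route = heapq.heappop(heap)
        if stage == goal:
            return days, route
        if stage in settled:
            continue
        settled.add(stage)
        for nxt, lead_time in graph.get(stage, {}).items():
            if nxt not in settled and nxt not in unavailable:
                heapq.heappush(heap, (days + lead_time, nxt, route + [nxt]))
    return None

print(fastest_route(supply_chain, "raw_materials", "finished_product"))
print(fastest_route(supply_chain, "raw_materials", "finished_product",
                    unavailable={"warehouse_eu"}))
```

Passing the flooded warehouse in `unavailable` makes the search route around it, here through the slower second warehouse.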
Q: How will the emergence of graph and GQL, the graph query language, help put enterprises on a path to more standardized systems? Do you foresee this coming in the future?
A: Yes, GQL has been in the works for about five years now. There are quite a few industrial players on the standards committee. This is a big, big event because it is the first new language in about 40 years that is being standardized by ISO (the International Organization for Standardization), together with its American counterpart, ANSI (the American National Standards Institute). Since SQL, there have been all these other data models: XML, JSON, and so on. They all have some query language attached to them, but none of them was standardized by this body, which is the supreme standards body in the industry. So the very fact that they took up the standardization process shows how important graph querying has become industrially.
Q: So, do you see this as being a few years away?
A: Actually, TigerGraph is one of the lead contributors there, so we're very much involved. And by the end of this year, we expect the so-called guidance standard, which would be a sneak preview, so the industry at large can form an idea of what this will look like. It will not be completed, but it will be crystallized enough that there wouldn't be many surprises afterward. So we're talking another couple of years until it's official.
How will GQL compete with SQL?
Q: Why do you believe GQL is a better long-term technology than SQL? SQL is well-embedded everywhere, and often it takes enterprises and developers a long time to switch.
A: This is a very, very insightful question. Let me first distinguish between the query language itself, which in this case is GQL, which will be the standard, and the underlying technology, which is the engine that runs the specific analytical tasks that are expressed in the query language. As I was saying, the data is connected; every interesting data point is connected. Every interesting application tries to specify the navigation along these connections; GQL is designed to succinctly, and in a very user-friendly way, specify this traversal of the connection chain. Moreover, it facilitates specifying traversals where you don't know ahead of time how many steps you have to go until you find the desired data.
In a fraud use case, for example, the fraudulent destination is where the money is going, but you don't know where that is. Will it be in one transfer, two transfers, or many transfers? Graph query languages are particularly good at saying, "I don't care; go as far as you need to until you find the destination." SQL is a query language that can only tell you: "Okay, join two tables with three tables, I need four tables," but it has real difficulty explaining that you would like to perform a cascade of joins of undetermined length. And that means that it automatically puts a limit on how far you can explore the ramifications, starting from a particular point in your graph.
So that's the language part. Now, the other part is the technology underneath. One can, in principle, take a graph query language like GQL and transform it using some programming into SQL queries that can be run on a standard SQL engine. But that would mean not exploiting the structure of this problem, the knowledge that is the graph, the fact that the graphs can be stored in particularly beneficial ways for high-performance evaluation. And that's why applications will, for sure, call for applying native graph databases to certain problems.
We are not talking about replacing relational technology here. We have been talking about coexistence between relational and graph database technology for a long time, and during this coexistence, there will be a battle of where to draw the boundary. And the jury's still out. I personally view it as a situation where graph databases have started with their niche, initially looking at social network-style data. Now that we've realized that everything is a network, we are starting to apply it more and more broadly, and we will see years of battle between the two technologies. I do not expect one to replace the other, and neither will be able to kill the other.
Q: We have touched on a lot of key points. Do you think there's anything we left out of this conversation that we need to talk about?
A: You put together a very, very informative and well-thought-out list of questions. So, first of all, as I mentioned, TigerGraph is one of the lead contributors to the standard development. That's in recognition of its existing query language and ideas and technology in the arena. Let me point out that it stands out through various technological advantages that we have actually published at the flagship database conference, ACM SIGMOD, which is run by the Association for Computing Machinery's Special Interest Group on Management of Data. It is THE database conference to follow, and in the past two years, we've published two papers there; it's a sign of how seriously the community takes our work because publication is highly competitive in this forum.
In the 2020 paper, we bring a novel view of computing aggregations over all the data we find in the graph, and we do this in a revolutionary way that turns out to be significantly better than the old SQL style. That's one paper. A second paper talks about how we can use our graph's parallel computation engine to even speed up classical SQL queries by translating them to queries over graphs; this is part of that ongoing battle between the two technologies that I was mentioning.
In addition, I would like to point out that we (TigerGraph) also have our homegrown query language, which is strictly more expressive than the upcoming GQL standard. As usually happens when several companies get together, the standard that comes out is a compromise. We have certainly influenced the process of the standard, but with some of the ideas and primitives that are specific to just our company, of course, there was resistance by other companies to having to implement them. So for that reason, we have additional computing power and expressive power. On the one hand, this allows us to easily become conformant to the standard simply by translating, with very little effort, GQL standard queries into our own homegrown language. On the other hand, because TigerGraph's language is highly expressive, it is used to implement the library of algorithms that we put at the developer's disposal. Whenever the user wishes to extend this library with new algorithms or customize existing ones, all they need to do is add or modify just a few lines, as opposed to writing hundreds of lines in a very low-level programming language, where tweaking something according to your needs becomes very time-consuming to accomplish and to maintain.
And finally, the philosophy behind this query language that we designed is to provide a very smooth onramp for what we deem to be the highest-potential adopter community out there, which is composed of SQL developers. There are also data scientists who don't know SQL and who jumped onto the graph bandwagon, but their number is dwarfed by the number of experienced SQL programmers who want to go beyond. And here, we stand out by having designed the language around the notion of a minimum extension of SQL so that you can now express graph analytics and keep this graph philosophy. This SQL flavor has now been recognized as more important in the standard due to our efforts. So, there will be two ways to write queries according to the standard, both of them conformant; one will be SQL-inspired. And the other one will be graph-inspired. It's our contribution to the standard that this SQL-inspired flavor will be available.
We identify those elements that are common between SQL querying and graph query, and we build out from there so that it will be easy for somebody who has never queried graph data to sit down and from the very beginning issue a few simple queries to start getting an idea about the graph. The entry barrier to graph (analytics) will be much lower.
Editor's note: TigerGraph is hosting the East Coast edition of the Graph + AI Summit industry conference in New York City on Oct. 19.