Graph is a data model that has long lingered on the fringe of mainstream adoption. But that is changing, as graph lends itself well to representing many real world problems, and the technology is evolving.
It's not the IBMs and Oracles of the world that are leading in this space, however. The leaders in terms of market share are Neo4j and Titan, the latter recently acquired by DataStax and now the basis of DataStax Enterprise Graph.
Either way, none of those were able to deal with graph data at Twitter scale when Yu Xu needed that. Social graphs are a prime example of the graph model in use. Xu was working at Twitter until 2011, and the graph databases that were around at the time could not cope.
Xu has a Ph.D. in Computer Science from UCSD, holds 26 patents in distributed systems and databases, led Teradata's big data initiatives, and worked on Twitter's distributed data infrastructure. So when faced with that problem, Xu saw an opportunity and went off to create a solution. Xu founded GraphSQL in 2012 and has been working with a team of 30 engineers since.
Today GraphSQL is officially entering a new stage in its development, including a new name: TigerGraph. The product is now generally available, a series A funding round of US$33 million has been announced, and a hosted version of TigerGraph based on Amazon EC2 is launching.
Good for TigerGraph, but why should you care? So far that sounds like a typical announcement from a startup coming out of stealth. The thing, however, is that graph analytics and the platforms that support them will be increasingly important going forward, and TigerGraph is an important new entry in this space.
Upsetting the status quo
The 5 years that have elapsed since Xu started out with GraphSQL are a long time, but it seems a lot has been accomplished during this period. TigerGraph boasts a new parallel architecture for native graph storage and processing that it says puts it ahead of the competition, and has the benchmarks and the use cases to back this up.
TigerGraph positions itself as a solution for real-time graph analytics on extremely big data: fraud prevention, recommendations and network management for clients like Alipay, Visa, Wish and the State Grid Corporation of China.
These are massive clients with massive data, so how come they chose to go with a vendor you have probably never heard of?
One thing that may have helped is TigerGraph's backing. It may, for example, have played a part in landing what TigerGraph says is the largest transaction graph in production in the world at Alipay, with more than 100 billion vertices, 600 billion edges and 2 billion daily real-time updates.
TigerGraph is backed by a mix of Chinese and US investors (Qiming VC, Baidu, Ant Financial, AME Cloud, Morado Ventures, Zod Nazem, Danhua Capital, DCVC) and has its HQ in Silicon Valley and a branch office in Shanghai. And the US$33 million TigerGraph scored in its series A puts it at #2 in funding among graph startups, after Neo4j.
According to TigerGraph's benchmark, TigerGraph runs queries from 4 to almost 500 times faster than the competition, loads data from 2 to 25 times faster, and uses about 80 percent less space to store that data.
Benchmarks are like opinions - everybody has one - but that does sound too good to be true. What's the catch? Why compare against Neo4j and Titan, for example, and not IBM or Oracle, whose proprietary solutions are maybe more appropriate to compare to TigerGraph's own proprietary solution? And how does one replicate these results?
TigerGraph chose Neo4j because it's a native graph database and a market leader, and Titan because it's the most well-known distributed graph storage system, and the basis for both DataStax Enterprise Graph and the IBM-backed JanusGraph. Oracle's graph project is single-server only, and clients did not ask for a benchmark. And TigerGraph is making both the benchmark report and an AWS image available so everyone can run it.
Fast Tiger, Smart Tiger
The catch, says TigerGraph, is that their engine is built in a way that combines elements from Hadoop's MapReduce parallelism for processing with in-memory, disk and compression techniques for storage. The benefits are real-time data loading, deep link querying and massive scale.
TigerGraph says they can load 50 to 150 GB of data per hour per machine, traverse hundreds of millions of vertices/edges per second per machine, and do real-time updates and inserts, thus unifying real-time analytics with large-scale offline data processing. But how?
They do it by having a native C++ graph storage engine (GSE) work side-by-side with a graph processing engine (GPE) to handle data and algorithms, and by using parallelism and a distributed architecture. TigerGraph treats the graph as both a storage and a computational model.
A vertex or an edge in the graph can store information and be associated with a compute function. Each one therefore acts as a unit of storage and of parallel computation at the same time. Vertices are active computing elements that send messages to each other via edges and respond to the messages they receive.
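To make the idea of vertices as active computing elements more concrete, here is a minimal Python sketch of a vertex-centric computation. It is not TigerGraph's engine or API; the function name and the choice of algorithm (propagating the smallest vertex id through each connected component) are assumptions made purely for illustration, and everything runs in a single process rather than in parallel.

```python
from collections import defaultdict

def propagate_min_label(edges):
    """Toy vertex-centric computation (illustrative only).

    Every vertex stores a label and repeatedly adopts the smallest label it
    receives, so each connected component converges to its minimum vertex id.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    # Each vertex holds its own state: initially its own id.
    label = {v: v for v in neighbors}
    # All vertices start out active and will message their neighbors.
    active = set(neighbors)

    while active:
        # Message phase: active vertices send their label along their edges.
        inbox = defaultdict(list)
        for v in active:
            for n in neighbors[v]:
                inbox[n].append(label[v])

        # Compute phase: each vertex reacts only to its own inbox.
        active = set()
        for v, messages in inbox.items():
            smallest = min(messages)
            if smallest < label[v]:
                label[v] = smallest
                active.add(v)  # state changed, so notify neighbors next round
    return label

if __name__ == "__main__":
    print(propagate_min_label([(1, 2), (2, 3), (4, 5)]))
```

In a real distributed engine the compute phase would run concurrently across vertices and machines; the sketch only shows the shape of the model, where state lives on vertices and work is driven by messages travelling along edges.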
There is a SQL-like graph query language (GSQL) that can be used for exploration and analysis. Xu says GSQL can handle both fast queries, such as a few traversals starting from only a few vertices, and large-scale analytical queries, such as traversing the whole graph to find patterns. Xu adds that users can expect 3 to 10+ traversal hops where other solutions time out.
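For readers unfamiliar with the term, a "traversal hop" is one step along an edge. The sketch below is plain Python, not GSQL, and the function and variable names are hypothetical. It shows what a k-hop query computes: all vertices reachable from a starting vertex within a given number of hops. The frontier can grow quickly with each hop, which is why deep traversals are expensive.

```python
def k_hop_neighborhood(adjacency, start, max_hops):
    """Return all vertices reachable from `start` in at most `max_hops` hops.

    `adjacency` maps a vertex to an iterable of its outgoing neighbors.
    """
    visited = {start}
    frontier = {start}
    for _ in range(max_hops):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in adjacency.get(vertex, ()):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        if not next_frontier:
            break  # nothing new to explore
        frontier = next_frontier
    return visited - {start}

if __name__ == "__main__":
    follows = {"a": ["b"], "b": ["c"], "c": ["d"]}
    print(k_hop_neighborhood(follows, "a", 2))
```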
In terms of loading, import and export, TigerGraph can load data in CSV, JSON or RDF via an API, a high level declarative mapping or connectors to sources such as Kafka, AWS S3, Hadoop and Spark. It also compresses data up to 10 times.
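As a rough illustration of what a declarative mapping from tabular data to a graph can look like, here is a small Python sketch. The mapping format, the column names and the `load_edges` function are all hypothetical and are not TigerGraph's loading syntax. The point is only that the user describes which columns become vertices and edge attributes, and a generic loader does the rest.

```python
import csv
import io

# Hypothetical mapping: which CSV columns become vertex ids and edge attributes.
MAPPING = {
    "source_column": "payer",
    "target_column": "payee",
    "edge_type": "PAID",
    "edge_attributes": {"amount": float},  # column name -> type conversion
}

def load_edges(csv_text, mapping):
    """Turn CSV rows into a set of vertices and a list of typed edges."""
    vertices, edges = set(), []
    for row in csv.DictReader(io.StringIO(csv_text)):
        src = row[mapping["source_column"]]
        dst = row[mapping["target_column"]]
        attrs = {
            name: cast(row[name])
            for name, cast in mapping["edge_attributes"].items()
        }
        vertices.update((src, dst))
        edges.append((src, mapping["edge_type"], dst, attrs))
    return vertices, edges

if __name__ == "__main__":
    sample = "payer,payee,amount\nalice,bob,12.5\nbob,carol,3\n"
    print(load_edges(sample, MAPPING))
```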
TigerGraph also supports different graph partitioning algorithms enabling it to split very large graphs over a distributed architecture. This can be done either automatically, or as specified by users using application-specific partitioning strategies.
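The article does not say which partitioning algorithms TigerGraph uses, so the following Python sketch relies on assumptions: hash partitioning stands in for an automatic strategy, and a region-based function stands in for an application-specific one. All names are hypothetical. Counting "cut" edges, meaning edges whose endpoints land on different partitions, is a common way to compare strategies, since those edges require cross-machine communication during traversal.

```python
import zlib

def hash_partition(vertex_id, num_partitions):
    """Automatic strategy (assumed): stable hash of the vertex id."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(str(vertex_id).encode("utf-8")) % num_partitions

def region_partition(vertex_id, num_partitions):
    """Application-specific strategy (assumed): ids look like 'us-1', 'eu-7'."""
    region = vertex_id.split("-")[0]
    return (0 if region == "us" else 1) % num_partitions

def partition_edges(edges, num_partitions, assign=hash_partition):
    """Place each edge on the partition that owns its source vertex."""
    parts = [[] for _ in range(num_partitions)]
    for src, dst in edges:
        parts[assign(src, num_partitions)].append((src, dst))
    return parts

def count_cut_edges(edges, num_partitions, assign=hash_partition):
    """Count edges whose endpoints live on different partitions."""
    return sum(
        1
        for src, dst in edges
        if assign(src, num_partitions) != assign(dst, num_partitions)
    )

if __name__ == "__main__":
    edges = [("us-1", "us-2"), ("eu-1", "eu-2"), ("us-2", "us-3")]
    print("hash cut edges:  ", count_cut_edges(edges, 2))
    print("region cut edges:", count_cut_edges(edges, 2, region_partition))
```

When the application knows that most edges stay within a region, a strategy like the region-based one keeps related vertices together and avoids cut edges, whereas a hash spreads load evenly but ignores locality.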
Real-time graph analytics
With these features, TigerGraph can be classified as an HTAP (hybrid transactional/analytical processing) solution. But since it positions itself primarily as a real-time analytics offering, what about support for visualization and analytics tools?
TigerGraph offers a browser-based SDK called GraphStudio to enable users to create graph models, map and load data sources and build graph queries. This works interactively and visually through clicks and drag-and-drop, and results are in JSON/CSV format.
GraphStudio can be invoked via RESTful APIs and Xu says it's simple to integrate TigerGraph with other BI tools like Tableau. At this point there is no pre-integration with BI vendors, and although Xu adds they plan to provide integrations with such vendors soon, this may be one of the missing pieces of the puzzle for TigerGraph.
TigerGraph's approach took some deciphering to comprehend, and its sales and marketing may be lagging a bit. But with what looks like an innovative architecture, superb performance, and production deployments at clients of global magnitude, TigerGraph makes an impressive entry.
If you are in the graph market, as a customer you are probably intrigued and as a competitor you are probably concerned. TigerGraph works both on premises and in the cloud, and its pricing model is an annual subscription based on graph data size.
There probably is a hefty price tag that goes with TigerGraph, but for those who can afford it, it looks like it can deliver some substantial benefits. This should raise the bar for the competition in this space, and it will be interesting to see how things develop.