In love with the graph: Neo4j spreads the obsession of a new database one app at a time
Lyft's "Amundsen" metadata system is an example of how knowledge graphs are spreading throughout companies with grass-roots projects. It's all part of winning hearts and minds, in the view of Neo4j, the San Francisco startup spreading the religion of graphs.
Most enterprise software has a contingent of zealots, people so steeped in the technology that they are convinced it is the be-all and end-all, or those who have taken so many certification exams that it's all they know. The lovers of the knowledge graph seem to be of a somewhat deeper persuasion.
"I stumbled on the idea of looking at complete networks of relationships, as opposed to individual elements, and I fell in love with the idea," says Amy Hodler, who is the analytics and AI program manager for Neo4j, a 12-year-old San Francisco startup that sells a database program of the same name, in which objects to be accounted for are represented as "nodes" in a network graph, joined by "edges" representing their acquaintance.
Hodler is not merely a fan of her company's work; she's an aficionado of all things graphical, like the writing of graph scholar Albert-László Barabási -- "I have all his books" -- and more popular names, such as James Fowler, who penned The New York Times bestseller Connected ("that's a great book").
To love the graph is, she argues, to see something others don't. "You could know all about a crow flying but you wouldn't know a flock," says Hodler.
There's a point to such passion in a world still being evangelized. Graph databases haven't yet taken over. The relational database still rules the roost by a wide margin. And there are all kinds of other data stores, increasingly for various kinds of unstructured data, including Hadoop and the "NoSQL" crowd.
But the crowd that built Neo4j seems to have progressed by enthusiasm, starting from insight, as well as perhaps a bit of naïveté.
"We were young and stupid enough to say let's build a database, how hard can it be," says Emil Eifrem, founder and CEO of Neo4j. He and colleagues stumbled onto the idea when he was serving as CTO, fresh out of college, for a Swedish tech startup, Windh Technologies. Something just wasn't clicking with the use of the relational database for a content management system.
"I had been programming for half my life at that point," he reflects, "and in every project, the database had been a help, an accelerator, something that took care of stuff for me, but for some reason, it was slowing us down that time around."
It became clear, he says, that there was a "mismatch" between the data and the relational data structure of Oracle and Informix. An enterprise content management system, explains Eifrem, is like a big file system on the World Wide Web, with folders within folders, and symbolic links between them, "a lot of connected data," as he puts it. The row and column structure of a relational database, with its "join" operations and the like, didn't cut it.
What he and colleagues started to build on their own, what would become the basis of a company, was a database that can "model everything," Eifrem insists, with "three simple building blocks": nodes, a representation of an object or entity; edges, the lines connecting nodes to one another; and "key/value pairs," named properties attached to nodes and edges that store their data.
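The three building blocks can be sketched in a few lines of Python. This is only an illustrative in-memory model, not Neo4j's implementation or API; the class names, the `CONTAINS` and `LINKS_TO` edge labels, and the sample folders are all assumptions made up for the example, loosely following the content-management case Eifrem describes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """An object or entity, with key/value properties attached."""
    node_id: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A directed, labeled connection between two nodes."""
    source: str
    target: str
    kind: str
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Hypothetical toy graph store: nodes, edges, key/value pairs."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_edge(self, source, target, kind, **properties):
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(Edge(source, target, kind, properties))

    def neighbors(self, node_id, kind=None):
        """Targets of edges leaving node_id, optionally filtered by label."""
        return [
            e.target
            for e in self.edges
            if e.source == node_id and (kind is None or e.kind == kind)
        ]


# Folders within folders, plus a symbolic link (all names are invented).
graph = PropertyGraph()
graph.add_node("root", type="folder", name="/")
graph.add_node("docs", type="folder", name="docs")
graph.add_node("report", type="file", name="report.txt")
graph.add_node("shortcut", type="link", name="latest")
graph.add_edge("root", "docs", "CONTAINS")
graph.add_edge("docs", "report", "CONTAINS")
graph.add_edge("root", "shortcut", "CONTAINS")
graph.add_edge("shortcut", "report", "LINKS_TO", created_by="editor")

contained_in_root = graph.neighbors("root", kind="CONTAINS")
link_targets = graph.neighbors("shortcut", kind="LINKS_TO")
```

Following a link here is a lookup along an edge rather than a join across tables, which is the kind of "connected data" access the relational model handled awkwardly in Eifrem's telling.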
They didn't know it then, but a little company called Google was already making hay with this very approach, the "PageRank" algorithm that would become the basis of the world's biggest search engine. Eifrem argues that the central insight behind PageRank, what's called the "eigenvector centrality," is a sort of kinship between Google and all the others pursuing knowledge graphs, including Neo4j.
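For readers who want to see the idea concretely, here is a minimal sketch of PageRank computed by repeated redistribution of scores (power iteration). PageRank is a variant of eigenvector centrality: a page's score depends on the scores of the pages linking to it. The four-page link graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration; this is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score nodes of a directed graph; links maps node -> list of targets."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)

    # Start with every node equally important.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node gets a small baseline share regardless of links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A node passes its score evenly to the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node with no outgoing links spreads its score to all.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: three pages point at "c", and "c" points at "a".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
top = max(scores, key=scores.get)
```

In this toy graph "c" ends up on top because three pages link to it, and "a" outranks "b" and "d" because its single inbound link comes from the highest-scoring page.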
"The fact that they use connected data, that's what we do, we take that power that created nearly a trillion dollars in market cap, and we apply that to classic enterprise cases, things such as fraud detection and recommendation engines." Eifrem argues the "big Web companies" such as Google were a kind of first wave of knowledge graph use, followed by enterprise application use with Neo4j, and a third wave that is just emerging, using the graph to assist machine learning and other artificial intelligence approaches.
Although it's still a small market, the simple, elegant paradigm of a graph that shows relationships creates new fans every time it shows up in an application. There are some high-profile applications already. For example, Daniel Himmelstein, then working as a graduate student at UC San Francisco, created a database of genetic and molecular interactions, called "Hetionet," a biological information network that can be used to study possible drug combinations. Its knowledge of nodes and edges produces spectacular graphs of data such as the one below.
Among the converts are some of the most high-profile young companies, including gig economy outfit Lyft. Over three months, product manager Mark Grover and a team of four engineers and one designer were able to bring together an initial version of a metadata repository, called "Amundsen," using Neo4j.
Lyft has petabytes of data and uses numerous production data stores, such as Hive, Presto, Redshift, and PostgreSQL. The problem, as Grover describes it, is that with the rapid growth of the company, people inside couldn't always be sure which repository was the best source of a given piece of information. That includes both data scientists and analysts who have to make overarching decisions about where Lyft should spend money. It also includes regional operations managers, say, for the New York City region, who have to make sure the right number of Lyft drivers is in the right place at the right time.
"One key problem we discovered early on was that people didn't know where the source of truth was, something as simple as an ETA for a car -- they wouldn't know which table to use," explains Grover.
Grover and team thought about the problem. It became apparent the crux of the matter was the network of usage of the data, meaning, which users might be linked together via their use of the data. "I create a table, and then you create a table derived from it, and we have a lineage which can be used to derive trustworthiness," explains Grover.
Amundsen became a place to graph those usage stats. A "Data Builder" program crawls those production data stores every twelve hours to gather the metadata that is placed in the Neo4j database. "We are able to rank tables and data assets based on how frequently they are used and by whom, sort of like a PageRank for structured data," he says. "Google takes you to the Web site, we take you to more information about a table based on the metadata."
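The ranking idea can be illustrated with a small Python sketch that scores tables by how often they are queried. This is an assumption-laden toy, not Amundsen's actual code: the query log, the user and table names, and the `weight_for` hook (a hypothetical place to weight queries by who issued them) are all invented for the example.

```python
from collections import Counter


def rank_tables(query_log, weight_for=None):
    """Rank tables by weighted usage; query_log holds (user, table) pairs."""
    # By default every query counts the same, whoever ran it.
    weight_for = weight_for or (lambda user: 1.0)
    scores = Counter()
    for user, table in query_log:
        scores[table] += weight_for(user)
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical usage records gathered from the production data stores.
query_log = [
    ("ana", "rides.eta_v2"),
    ("ben", "rides.eta_v2"),
    ("ana", "rides.eta_v2"),
    ("cho", "rides.eta_legacy"),
    ("ben", "rides.trips"),
    ("cho", "rides.trips"),
]
ranking = rank_tables(query_log)
best_source = ranking[0][0]
```

Someone searching for an ETA table would see `rides.eta_v2` ahead of `rides.eta_legacy` simply because more queries hit it, which is the "proxy for trust" notion in miniature.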
The software can help data scientists understand who is using a given table, when it was last populated, and "the shape of the data," meaning the min, max, distribution, and so on. "You can start to use that information as a proxy for trust," says Grover.
There are several places to take it from there, says Grover. For example, the weights currently assigned to queries of the database are static, but the intention down the line is to add dynamic weighting, such as assigning more weight to queries from a given team member or job title. Groups within Lyft are finding new uses for Amundsen, such as data scientists looking for data that can be incorporated as features in machine learning models, including the home-grown ML system, "LyftLearn."
Amundsen can also be used now for "downstream" applications when a data engineer wants to notify all downstream consumers that she or he is going to make a change in the type of a column in a table. They can use Amundsen to find out who uses that table and notify them accordingly. A future application could be data quality monitoring, such as comparing the distribution of data in a 30-day window to catch things like data corruption.
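Finding everyone downstream of a table is a graph traversal, and a breadth-first walk over lineage edges is one simple way to sketch it. The code below is a hypothetical illustration rather than Amundsen's implementation; the `derivations` and `owners` mappings and every table and user name in them are assumptions.

```python
from collections import deque


def downstream_owners(derivations, owners, changed_table):
    """Collect owners of every table derived, directly or not, from one table.

    derivations maps a table to the tables built from it;
    owners maps a table to the people who use or maintain it.
    """
    seen = {changed_table}
    queue = deque([changed_table])
    to_notify = set()
    while queue:
        table = queue.popleft()
        for child in derivations.get(table, []):
            if child not in seen:
                seen.add(child)
                to_notify.update(owners.get(child, []))
                queue.append(child)
    return sorted(to_notify)


# Hypothetical lineage: each key feeds the tables listed as its value.
derivations = {
    "raw.rides": ["core.trips"],
    "core.trips": ["mart.eta", "mart.revenue"],
    "mart.eta": ["dash.city_eta"],
}
owners = {
    "core.trips": ["dana"],
    "mart.eta": ["eli"],
    "mart.revenue": ["fay"],
    "dash.city_eta": ["eli", "gus"],
}

# Dana plans to change a column type in core.trips.
notify = downstream_owners(derivations, owners, "core.trips")
```

The walk reaches `dash.city_eta` even though it is two steps removed from the changed table, which is the case that is easy to miss when lineage is tracked by hand.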
From a Neo4j perspective, a novel application like Amundsen becomes the tip of the spear, to show people that working with the graph has unique applications that can be pulled together quickly in a way that couldn't be done with a relational system. That can spread from shop to shop, making converts. Amundsen is open-source, and the code is now being used by companies such as financial giant ING and enterprise cloud software provider Workday. (ZDNet has written about how Lyft competitor Uber is deep into knowledge graphs.)
That doesn't necessarily produce license sales in every case, but it contributes to winning hearts and minds. Understanding and adoption of the graph are emerging at multiple points. Google's DeepMind, for example, is exploring ways in which the graph can serve as a means of inserting "structured representations" into deep learning neural networks. That may make AI better able to construct inferences from a set of "building blocks."
To the Neo4j folks, this is all the steady progression of the relentless logic of the graph.
"I think it's a change of thinking," in moving to graph databases, says analytics veep Hodler. "You experience this as you start to look at graphs." She professes to having "an easier time explaining graphs to non-technologists" than would an engineer explaining, say, "third-normal form" of an RDBMS to the average person.
CEO Eifrem is even more emphatic in likening the graph to something that sounds like destiny.
"AltaVista saw in black and white, and Google saw in color," he says of the search engine battles of yore. Likewise, "there are a lot of things connected in my world that I was not able to operate on because my tools were holding me back; now I just put them in Neo4j, and I can do all that good stuff."