Uber’s graph expert bears the scars of billions of trips
One of Uber's experts in building knowledge graphs, Joshua Shinavier advises fellow data scientists that "real data is messy, but the fact is, if you want to build an enterprise knowledge graph, you have to deal with it."
"I really tried to imagine, if I attended this conference two years ago, what kind of twenty-minute talk would have been most valuable," said Joshua Shinavier, a research scientist at ride-sharing giant Uber.
"I chose a bit of a different format, less of a technical talk," he concluded. And with that came lots of practical lessons from managing tons of data at Uber.
Speaking Wednesday morning during day two of a two-day conference on "knowledge graphs," hosted by Columbia University's School of Professional Studies, Shinavier shared insights about how to use graphing tools to manage entities and relationships for the huge data management tasks at Uber.
He kept the talk less technical, he said, because although many people know about "graph query languages," there are "a lot of organizational challenges" in building a graph at a company, and those were his focus.
The scale of the data in this case is huge. Shinavier described how Uber has 200,000 individual "managed data sets," and said that after passing the "ten-billion-trip" mark in rides served last year, the company is amassing "low-thousands of entities" on a daily basis that have to be included in its knowledge graph.
Shinavier put up one slide showing a glass of water, which, of course, appeared either half full or half empty. His point in doing so was to encourage his fellow data scientists to grapple with reality. "Real data is messy," he said, "but the fact is, if you want to build an enterprise knowledge graph, you have to deal with it."
Or, put another way, "life gives you lemons, and thousands of schema, and you have to deal."
Data is messy because of things such as Uber drivers manually entering data into their phones, he noted.
Among words of wisdom to the audience, Shinavier noted that "no one really likes RDF," the Resource Description Framework, a W3C standard data model for describing information as a graph. "It's a hard sell," he said. His advice if you want to use RDF: "Either marshal all the arguments you can in favor of it, or else do it discreetly, which is what I did," he confessed, eliciting much laughter from the audience.
Another lesson was to "beware the hype cycle," because "knowledge graphs are lots of other things by another name," he said, usually put in place because, "Someone in management got the bug [for graphs], and hires a bunch of people" to go and do them.
First steps in developing a knowledge graph involve establishing "some kind of system for a shared vocabulary," he said, adding, "this is a very important one to me."
Uber made less use of off-the-shelf tools for graphs because the company has a lot of dedicated infrastructure and dedicated teams, both of which should be taken advantage of, he said.
Another gem of wisdom was to "fit the data model to the data," because the data can be particular to a given business. For example, "Most of our data is not in the shape of a property graph -- it's in relational schemas -- we needed something that fit that," he said. "You have to deal with alerts and notifications and migrations and other stuff…."
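To make the relational-versus-graph point concrete, here is a minimal sketch, not Uber's method, of one common way to read relational rows as graph elements: each row becomes a node, and each foreign-key column becomes an edge. The `Table` class, the `rows_to_graph` function, and the `trip` and `driver` tables are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Hypothetical description of one relational table."""
    name: str
    primary_key: str
    # Maps a foreign-key column to the name of the table it references.
    foreign_keys: dict = field(default_factory=dict)


def rows_to_graph(table, rows):
    """Turn rows into property-graph style nodes and edges.

    Each row becomes a node identified by (table name, primary key).
    Each non-null foreign-key value becomes an edge to the referenced row.
    All remaining columns become node properties.
    """
    nodes, edges = [], []
    for row in rows:
        node_id = (table.name, row[table.primary_key])
        properties = {
            column: value
            for column, value in row.items()
            if column != table.primary_key and column not in table.foreign_keys
        }
        nodes.append({"id": node_id, "label": table.name, "properties": properties})
        for column, target in table.foreign_keys.items():
            if row.get(column) is not None:
                edges.append({"out": node_id, "label": column, "in": (target, row[column])})
    return nodes, edges


# Made-up example data: two trips, one of which has no driver recorded.
trips = Table("trip", "trip_id", {"driver_id": "driver"})
rows = [
    {"trip_id": 1, "driver_id": 7, "fare": 12.5},
    {"trip_id": 2, "driver_id": None, "fare": 8.0},
]
nodes, edges = rows_to_graph(trips, rows)
```

Even this toy version has to decide what to do with a missing foreign key, which hints at why messy real-world schemas rarely drop cleanly into a property graph.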
Shinavier rattled off some technical details, such as the three-layer cake of the knowledge graph at Uber. One level is an "OLTP graph" that makes use of the open-source Cassandra data store. Then, there is a second level, an "analytics-based graph" that uses the Hadoop file system, with the Cypher graph query language and Apache Spark. And third, there are "graph embeddings," though he quickly added, "don't ask me too much about graph embeddings, it's not my area."
When he came to the slide labeled "Risk and Safety Knowledge Graph," it was intentionally left blank "to save entropy," given that, as Shinavier said, "there is such a thing as bad actors who are not stupid," meaning, people who could get ideas for mischief.
Among the ongoing challenges at Uber is the need to have solid policies to protect the privacy of user data, especially in light of the European Union's General Data Protection Regulation (GDPR). However, things are tricky because "it's fairly hard to define" what constitutes data that needs to be kept private, he said. "Inference is required to know if it's user data that needs to be protected," he added.
Rounding out his talk, Shinavier touched briefly on the "most fun thing" going on at Uber: "algebraic property graphs," which draw on set theory and category theory. The effort is to form a "common data model for RPC, storage and knowledge representation" at Uber. It's aligned with a W3C effort to define "property graph schema" and is also being developed with an eye to something called the "Universal Structure" of the Apache TinkerPop4 project. TinkerPop is a computing framework for graph databases.
That work is due for publication in a forthcoming paper, he said.
In the Q&A that followed Shinavier's talk, he was asked whether it's better to build infrastructure before collecting any data, or to collect the data and then build. His response suggested both approaches have merit. His initial reply was that it's best to collect the data first and then tune the infrastructure to suit it. But then he added that it was not a bad idea to set up a solid infrastructure beforehand.
Uber is set to go public on the New York Stock Exchange this Friday.
Are you working with knowledge graphs in your business? Let me know what you think of them in the comments section.