Anchorage may not be the most well-connected location in the world. But as it turns out, when people and data are well-connected, location may follow. In 2019, Anchorage hosted SIGKDD's Conference on Knowledge Discovery and Data Mining, or KDD as it is commonly known. The conference is organized by the Association for Computing Machinery (ACM)'s Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD).
KDD is one of the best-known and most popular events for data science and AI, attracting around 3,500 researchers in London in 2018. Although the decision to host KDD 2019 in Anchorage raised some concerns, attendance did not really drop.
The 25th incarnation of KDD was a "who's who" of data science and AI. KDD was set up by people who were into data science and AI before these fields were given their current monikers and attracted widespread attention.
KDD is a meeting point for research and industry. People who show their work at KDD often go through the revolving door between the two, with some of them wearing both hats at the same time. A case in point is the KDD applied data science invited speakers track, which featured data scientists from the likes of Airbnb, Alibaba, Amazon, Apple, Facebook, Google, NASA, LinkedIn, and Microsoft.
The goal was to invite highly influential speakers who have directly contributed to successful data mining applications in their respective fields. Looking at the topics picked by those speakers, as well as KDD proceedings, a theme started to emerge.
One of the things that seems to be top of mind for these people is pushing the limits of deep learning. This form of machine learning has achieved a great deal in the last few years. Still, many AI researchers believe deep learning on its own will never be much more than sophisticated pattern recognition: Great for facial recognition or language translation, but short of true intelligence.
Ruslan Salakhutdinov, director of AI research at Apple and professor of Computer Science in the department of machine learning at Carnegie Mellon University (CMU), focused on this very topic in his presentation: Integrating Domain Knowledge into Deep Learning.
The presentation, based on Salakhutdinov's notes from CMU, explored ways of incorporating domain knowledge in machine learning model architectures and algorithms. Three classes of domain knowledge were considered: relational, logical, and scientific knowledge.
Logical knowledge refers to what is formally called propositional and first-order logic, or in simpler terms, rule-based reasoning. For example, if an object has a wing and a beak, it is a bird. Scientific knowledge, such as Newton's Laws of Motion, is encoded in more complex ways, such as partial and stochastic differential equations.
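To make the rule-based flavor of logical knowledge concrete, here is a minimal sketch in Python. The rule table and the infer function are hypothetical names made up for illustration, not something taken from Salakhutdinov's presentation; the only rule encoded is the wing-and-beak example above.

```python
# Hypothetical rule base: each rule pairs the features it requires with a conclusion.
RULES = [
    ({"wing", "beak"}, "bird"),
]


def infer(features):
    """Return every conclusion whose required features are all observed."""
    observed = set(features)
    # `required <= observed` is Python's subset test for sets.
    return [conclusion for required, conclusion in RULES if required <= observed]


print(infer(["wing", "beak", "feathers"]))
print(infer(["wing"]))
```

The first call satisfies the rule and the second does not. The harder question raised in the talk, how to feed such rules into a neural network, is outside the scope of this sketch.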
Relational knowledge refers to simple relations among entities, such as (father, Bob, Alice). This type of knowledge is available via relational databases or knowledge graphs. It may be the simplest one, compared to logical and scientific knowledge, but that does not make it simple to incorporate in machine learning.
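As a rough illustration of what such relations look like as data, the sketch below stores triples in (relation, subject, object) order, mirroring the (father, Bob, Alice) example, and indexes them for lookup. Every fact other than that one, along with the function name build_index, is a made-up assumption.

```python
from collections import defaultdict

# Triples in (relation, subject, object) order.
TRIPLES = [
    ("father", "Bob", "Alice"),
    ("father", "Bob", "Carol"),   # hypothetical extra fact
    ("employer", "Acme", "Bob"),  # hypothetical extra fact
]


def build_index(triples):
    """Group objects by (relation, subject) so lookups avoid scanning every triple."""
    index = defaultdict(list)
    for relation, subject, obj in triples:
        index[(relation, subject)].append(obj)
    return index


index = build_index(TRIPLES)
print(index[("father", "Bob")])
# .get avoids creating an empty entry for a pair that was never stored.
print(index.get(("father", "Alice"), []))
```

A knowledge graph is essentially a large collection of such triples, with entities as nodes and relations as labeled edges.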
Part of Salakhutdinov's presentation focused on reading comprehension and natural language processing (NLP). The current state of the art in NLP combines techniques acting on unstructured data (text) with techniques transforming it to structured data (knowledge graphs).
Embeddings are one of those techniques, initially used for text and now also extended and adapted to graphs. The idea in embeddings is to map a higher-order structure, which machine learning algorithms cannot process directly, to a lower-order vector structure that machine learning can use.
There are many ways of doing this, but ultimately in text, as in graphs, the goal is to map similar inputs to similar vector values. Work presented at KDD by IBM Research and Huawei was meant to advance the state of the art in graph embeddings.
Another invited speaker for KDD was Hongxia Yang, Senior Staff Data Scientist and Director at Alibaba Group. Yang's presentation focused on AliGraph, a Comprehensive Graph Neural Network Platform.
As noted in Alibaba's work, an increasing number of machine learning tasks require dealing with large graph datasets, which capture rich and complex relationships among potentially billions of elements. Graph Neural Networks (GNNs) have become an effective way to address the graph learning problem.
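The "similar inputs, similar vectors" goal can be illustrated with a toy sketch. The vectors below are hand-picked for the example rather than learned by any embedding method, and the names EMBEDDINGS and most_similar are assumptions; cosine similarity is one common way to compare embedding vectors.

```python
import math

# Hand-picked toy vectors, NOT learned embeddings.
EMBEDDINGS = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other word whose vector is closest to the given word's vector."""
    others = (w for w in EMBEDDINGS if w != word)
    return max(others, key=lambda w: cosine_similarity(EMBEDDINGS[word], EMBEDDINGS[w]))


print(most_similar("cat"))
```

With these vectors, "cat" lands closer to "dog" than to "car". Graph embeddings pursue the same idea, with nodes taking the place of words.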
GNNs are neural networks that operate directly on graphs. A typical application of GNNs is node classification: Every node in a graph is associated with a label, and the goal is to predict the labels of the nodes that lack ground truth. To work with GNNs, data scientists first need to convert graphs to adjacency matrices, keeping both structural and property information as intact as possible.
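The conversion step can be sketched in a few lines of plain Python. The toy graph, the two-dimensional node features, and the function names are all assumptions made for illustration. The propagate function shows only the neighbor-averaging idea behind a GNN layer; a real layer would also apply learned weights and a nonlinearity.

```python
# Hypothetical toy graph: undirected edges between four nodes.
EDGES = [(0, 1), (1, 2), (2, 3)]
NUM_NODES = 4

# One small feature vector per node (the "property information"); values are made up.
FEATURES = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]


def adjacency_matrix(edges, num_nodes):
    """Build a symmetric 0/1 matrix where entry [u][v] marks an edge between u and v."""
    matrix = [[0] * num_nodes for _ in range(num_nodes)]
    for u, v in edges:
        matrix[u][v] = 1
        matrix[v][u] = 1
    return matrix


def propagate(matrix, features):
    """Replace each node's features with the mean over itself and its neighbors."""
    updated = []
    for i, row in enumerate(matrix):
        group = [i] + [j for j, connected in enumerate(row) if connected]
        dims = len(features[i])
        updated.append(
            [sum(features[j][d] for j in group) / len(group) for d in range(dims)]
        )
    return updated


A = adjacency_matrix(EDGES, NUM_NODES)
print(A)
print(propagate(A, FEATURES))
```

After one round, each node's vector already reflects its immediate neighborhood, which is what lets a classifier on top predict labels for unlabeled nodes.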
However, providing efficient graph storage and computation capabilities to facilitate GNN training and enable the development of new GNN algorithms is challenging. Yang presented AliGraph, a comprehensive graph neural network system, which consists of distributed graph storage, optimized sampling operators, and runtime.
The system is currently deployed at Alibaba to support a variety of business scenarios, including product recommendation and personalized search on Alibaba's e-commerce platform. It can efficiently support not only existing popular GNNs, but also a series of in-house developed ones for different scenarios.
Experiments on a real-world dataset with 492.90 million vertices, 6.82 billion edges, and rich attributes show AliGraph to perform an order of magnitude faster than existing work in terms of graph building: Five minutes versus the hours reported for the state-of-the-art PowerGraph platform. In training, AliGraph runs 40% to 50% faster and demonstrates around a 12-fold speedup with the improved runtime.
Alibaba uses graph partitioning, separate storage of attributes, and caching of the neighbors of important vertices to overcome the challenges of efficient graph access, especially in a distributed cluster environment. This very dense work outlines future directions in pursuing GNNs with more granularity, speed and accuracy, and adding Auto-ML functionality.
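The partitioning and caching ideas can be sketched on a toy scale. This is not AliGraph's implementation: the modulo partitioning, the use of degree as a stand-in for vertex importance, the threshold value, and all names below are assumptions chosen to keep the example short.

```python
from collections import defaultdict

EDGES = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]  # hypothetical toy graph
NUM_PARTITIONS = 2
CACHE_DEGREE_THRESHOLD = 3  # assumed cutoff for treating a vertex as "important"


def build_neighbors(edges):
    """Collect the neighbor set of every vertex in an undirected graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def partition_of(vertex):
    """Assign a vertex to a partition; modulo hashing is an assumption."""
    return vertex % NUM_PARTITIONS


neighbors = build_neighbors(EDGES)

# Each partition stores the adjacency lists of the vertices it owns.
partitions = defaultdict(dict)
for vertex, adjacent in neighbors.items():
    partitions[partition_of(vertex)][vertex] = sorted(adjacent)

# Neighbor lists of high-degree vertices are copied into a cache available everywhere.
neighbor_cache = {
    v: sorted(adj) for v, adj in neighbors.items() if len(adj) >= CACHE_DEGREE_THRESHOLD
}


def get_neighbors(vertex):
    """Serve from the cache when possible, otherwise go to the owning partition."""
    if vertex in neighbor_cache:
        return neighbor_cache[vertex]
    return partitions[partition_of(vertex)][vertex]


print(sorted(neighbor_cache))
print(get_neighbors(3))
```

In a real cluster, the fallback branch would be a network call to another machine, which is why caching the neighbors of frequently visited vertices pays off.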
Last but not least, a group of researchers from Amazon presented joint work with CMU on estimating the importance of nodes in a knowledge graph. As they note, knowledge graphs have proven valuable for many tasks, including question answering and semantic search. Estimating node importance in knowledge graphs enables several downstream applications such as item recommendation and resource allocation.
While several approaches have been developed to address this problem for general graphs, they do not fully utilize the information available in knowledge graphs, or lack the flexibility needed to model complex relationships between entities and their importance. To address these limitations, Amazon researchers explored supervised machine learning algorithms.
Building upon recent advancements in GNNs, they developed GENI, a GNN-based method designed to deal with the distinctive challenges involved in predicting node importance in knowledge graphs. GENI performs an aggregation of importance scores instead of aggregating node embeddings, and in their evaluation, it performs 5% to 17% better in terms of quality of results than the state of the art.
Much of the above may sound rather exotic. Exotic or not, however, its implications, when used in the real world, are rather significant. AliGraph means that Alibaba seems to currently have the most advanced infrastructure for running GNN applications. GENI means that Amazon can identify important nodes in its knowledge graph better than anyone.
Apple's ambition in integrating diverse types of knowledge into deep learning may mean it advances the unification of deep learning and symbolic AI further than anyone else. And the list does not end here: It ranges from visionary frameworks like Apple's to more use-case oriented applications.
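The difference between aggregating scores and aggregating embeddings can be seen in a deliberately simplified sketch: each node carries a single number, and one round of aggregation blends it with its neighbors' numbers. This is not the GENI implementation. The toy graph, the initial scores, and the uniform averaging are assumptions; a trained model would learn how much weight each neighbor gets.

```python
# Hypothetical toy knowledge graph: node -> neighboring nodes.
NEIGHBORS = {
    "film_a": ["director_x", "actor_y"],
    "film_b": ["director_x"],
    "director_x": ["film_a", "film_b"],
    "actor_y": ["film_a"],
}

# Assumed initial importance estimates, one scalar per node.
INITIAL_SCORES = {"film_a": 0.9, "film_b": 0.3, "director_x": 0.5, "actor_y": 0.1}


def aggregate_scores(scores, neighbors):
    """One round of score aggregation: average a node's score with its neighbors' scores.

    Uniform averaging is a stand-in assumption for learned neighbor weights.
    """
    updated = {}
    for node, adjacent in neighbors.items():
        pool = [scores[node]] + [scores[n] for n in adjacent]
        updated[node] = sum(pool) / len(pool)
    return updated


updated = aggregate_scores(INITIAL_SCORES, NEIGHBORS)
for node in sorted(updated):
    print(node, round(updated[node], 3))
```

Note how the little-known actor's score rises after one round simply because of the connection to a highly rated film, which is the intuition behind propagating importance through the graph.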
Snapchat is using an Action Graph to characterize and forecast user engagement. Baidu is using a knowledge graph of job skills, Skill-Graph, built for comprehensively modeling the relevant competencies that should be assessed in job interviews. Alibaba, again, generates personalized product descriptions combining neural networks and the Chinese DBpedia knowledge base.
In a nutshell, graph-related and knowledge-based research and development are booming. A quick count in KDD's proceedings is telling. More than 300 papers is a lot, and we only had a superficial look at a few that caught our eye. But about 20% of the 300+ publications seem to involve graphs and knowledge-based systems.
There was something else that also piqued our interest: the wealth of contributions from China. Not just Chinese organizations, some of which we mentioned above, but also Chinese researchers in non-Chinese organizations. This seems to validate expert opinions: China is growing rapidly in AI, too, and is set to become No. 1 if it's not already.
One more thing we noticed, though hardly original: The interplay between research and industry. As we noted earlier, much of the work published at KDD was a joint effort involving research and industry. And more often than not, researchers either jump ship to industry or work in both research and industry. On the one hand, this strips research of its talent; on the other, it brings ethos and rigor to industry.
These trends share a common characteristic: To most people, they did not seem very likely to occur just a few years back. Who would have thought: Knowledge-based R&D made in China looks set to rule the world.