Ten percent is a considerable margin to improve on the state of the art on anything. This is what Salesforce research has just achieved for common sense reasoning for deep learning language models.
In its paper, Explain Yourself! Leveraging Language Models for Commonsense Reasoning, to be presented tomorrow at the Association for Computational Linguistics (ACL) 2019 annual meeting, Salesforce researchers unveil two important contributions: CoS-E, a dataset of Commonsense Explanations; and CAGE, a model for Commonsense Auto-Generated Explanations. ZDNet took the opportunity for a Q&A with two of the Salesforce research scientists who worked on this, Nazneen Rajani and Bryan McCann.
As a reminder, Salesforce Research is focused on question answering as a way to facilitate access to data via Einstein. We have previously seen how other Salesforce researchers investigated the use of knowledge graphs toward the same end.
Rajani and McCann's work takes a different approach, but it also builds on a number of previous contributions. Common sense reasoning is an open problem for some of the world's leading researchers. For example, one of the key ingredients in building CAGE was OpenAI GPT. Dubbing this language model, recently open sourced by Elon Musk's OpenAI, "too dangerous" to be released into the wild may have been overly precautionary.
Nevertheless, it is the state of the art in language models. As Rajani and McCann point out, these natural language processing networks are limited to text alone, as a poor substitute for living in the real world. So, researchers train the models by having them read a human-mind-boggling amount of text, including all of Wikipedia, thousands of books, and in other approaches, results from querying Google, too.
These models are tested using a multiple-choice test called Commonsense Question Answering (CQA), which contains questions that require common sense reasoning to answer. In typical deep learning fashion, models are trained on a set of examples from CQA, then tested on a different set of questions. Compared to humans, these well-read neural networks have been known to perform quite poorly on this task.
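To make the setup concrete, a CQA item pairs a question with several answer choices, exactly one of which is correct. A rough sketch of how such an item could be represented (the question below is invented for illustration and is not an item from the actual dataset):

```python
# A hypothetical CQA-style item: one question, several choices, one
# correct answer. The question text is made up for illustration;
# real CQA items are crowdsourced.
cqa_item = {
    "question": "Where would you keep a book you are currently reading?",
    "choices": ["library", "bedside table", "bookstore"],
    "answer": "bedside table",  # the choice requiring everyday common sense
}

def is_correct(item, predicted_choice):
    """Score a model's predicted choice against the gold answer."""
    return predicted_choice == item["answer"]
```

Answering correctly requires background knowledge about the world rather than anything stated in the question itself, which is exactly what makes the benchmark hard for text-only models.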
Rajani and McCann created a dataset modeled after CQA, but they also included explanations in addition to answers to the questions. This is how they created CoS-E, a dataset of Commonsense Explanations. As Rajani said, CoS-E v1.0 has 8,500 examples and v1.11 has 10,962 examples, including training and validation sets. By deep learning standards, this is not a lot of data.
Rajani and McCann acknowledge this, and growing the dataset is one of their goals for future work. McCann said they would like to extend this dataset collection process to other benchmarks in the field that contain free-form text, structured information, and visual signals from images or video, so that they can train models that produce explanations across many different domains.
Explanations were generated using crowdsourcing on Mechanical Turk. Turkers were asked to provide an answer to each question, explain the answer, and highlight the part of the question that led them to the explanation. Let us note that, as recent research in knowledge graph quality processing using Mechanical Turk has shown, crowdsourcing is a feasible solution for such tasks.
Rajani mentioned that some examples fell through the cracks and needed to be re-annotated, even though there were initial constraints on the quality of the explanations. It took about three weeks to design the task and collect the data. CoS-E can be used, and further enhanced, by other researchers, and it is available on GitHub.
CoS-E is an important building block, but it's only half the story. The other half is training models to take the CQA test and to generate their own line of reasoning. The CoS-E dataset is used to train a deep learning model alongside the original input question and answer choices.
Surprisingly, Rajani and McCann note, even though the model does not have access to the CoS-E explanations during the real test, it performs much better after having seen examples of human reasoning during training.
They speculate that the explanations in CoS-E capture valuable information about the way the world works, and that the network learns to reason based on that information at test time. This well-read model is a pre-trained transformer neural network called BERT.
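One way to picture this training setup (a simplified sketch; the paper's exact input formatting may differ) is to append the human explanation to the question and answer choices when building the classifier's input string, and to simply drop it at test time:

```python
def build_classifier_input(question, choices, explanation=None):
    """Serialize a CQA example into a single input string for a
    BERT-style classifier. The explanation is only available at
    training time, so it is optional here."""
    text = question + " " + " or ".join(choices) + "?"
    if explanation is not None:
        text += " " + explanation  # human rationale, seen only during training
    return text

# Training time: the human explanation is part of the input.
train_input = build_classifier_input(
    "Where would you keep a book you are reading?",
    ["library", "bedside table", "bookstore"],
    explanation="People keep books they are reading within easy reach.",
)

# Test time: no explanation is available.
test_input = build_classifier_input(
    "Where would you keep a book you are reading?",
    ["library", "bedside table", "bookstore"],
)
```

The surprising result is that exposure to explanations like the one above during training improves accuracy even on inputs like `test_input`, where no explanation is present.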
Then, Rajani and McCann trained a second neural network exclusively to generate common sense reasoning. They assume that this model starts out having read a lot of text, just as the test-taking model. They then show it the common sense questions and the different answer options. They do not show the correct answer, but train the model to generate and mimic explanations provided by humans in CoSE.
This way, the model is trained to take as input a question with different answer choices and to generate an explanation. Because this process does not depend on knowing the correct answer, the network can create these commonsense auto-generated explanations (CAGE) on the real test, too. This is the part OpenAI GPT was used for.
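The key point is that the generator is conditioned on the question and the answer choices, never on the correct answer. A minimal sketch of such a conditioning prompt (the exact template below is an assumption for illustration, not necessarily the one used in the paper):

```python
def cage_prompt(question, choices):
    """Build a prompt that asks a language model to continue with a
    commonsense explanation, without revealing the correct answer."""
    choice_str = ", ".join(choices[:-1]) + ", or " + choices[-1]
    return f"{question} {choice_str}? My commonsense tells me that"

prompt = cage_prompt(
    "Where would you keep a book you are reading?",
    ["library", "bedside table", "bookstore"],
)
# A GPT-style model would then continue this prompt with a free-form
# explanation, which can be fed to the classifier at test time.
```

Because nothing in the prompt depends on the gold answer, the same generation step works on held-out test questions.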
Explanation generation was tested in two variations: explain-and-then-predict (reasoning) and predict-and-then-explain (rationalization). Reasoning showed better results in the evaluation, but overall CAGE beat all other models on this task, including the variant that uses CoS-E only during training. Still, state-of-the-art deep neural networks lag far behind human performance on this task.
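The two variations differ only in the order of the generate and predict steps. Schematically, with the generator and classifier as stand-in callables (a control-flow sketch, not the paper's implementation):

```python
def explain_then_predict(generate, classify, question, choices):
    """'Reasoning': the explanation is produced first, so it can
    inform the prediction."""
    explanation = generate(question, choices)
    answer = classify(question, choices, explanation=explanation)
    return answer, explanation

def predict_then_explain(generate_post_hoc, classify, question, choices):
    """'Rationalization': the answer comes first, and the explanation
    justifies it after the fact."""
    answer = classify(question, choices, explanation=None)
    explanation = generate_post_hoc(question, choices, answer)
    return answer, explanation

# Dummy stand-ins to illustrate the control flow:
gen = lambda q, c: "a generated explanation"
gen_post = lambda q, c, a: f"a rationalization of {a}"
clf = lambda q, c, explanation: c[0]

reasoned = explain_then_predict(gen, clf, "q?", ["a", "b"])
rationalized = predict_then_explain(gen_post, clf, "q?", ["a", "b"])
```

In the first ordering the classifier can exploit the explanation as extra evidence, which is consistent with reasoning scoring better in the evaluation.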
McCann said that CAGE can be trained on a single GPU in just a couple of hours, if you already have GPT, BERT, the CQA dataset, and CoS-E. The commonsense reasoning model was trained on two to eight GPUs, depending on model size, for up to four hours.
He noted that, with CoS-E and the original dataset both public, it is fairly easy for someone with access to a GPU to reproduce the CAGE model using tools that are already available in the open-source community. He went on to add, however, that they will release their CAGE model in the future if there are requests for it.
The future: adding knowledge graphs to the mix, eradicating bias
Seeing this, we wondered whether Rajani and McCann think it would make sense to combine their results with knowledge graph work, also carried out in Salesforce, and how this relates to previous and future Salesforce work.
Rajani pointed out that the CQA dataset tested on in this work was actually built out of a knowledge graph called ConceptNet. She went on to add that they will be looking into how to extend this to knowledge graph population:
"We can use common sense to infer from statements like "Alex drove to work" the triplets "owns(Alex, car)" and "is_employed(Alex)". In this way, we can extract more information from the text available. As research into common sense reasoning progresses, we hope this leads to improved communication between AI and humans."
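Represented as data, the triples in Rajani's example might look like this (a toy sketch of the idea, not an actual extraction system; the relation names follow the quote above):

```python
# Commonsense inferences drawn from the statement "Alex drove to work",
# encoded as (relation, subject, object) triples for knowledge graph
# population. A unary fact like is_employed(Alex) has no object.
statement = "Alex drove to work"
inferred_triples = [
    ("owns", "Alex", "car"),        # driving to work implies having a car
    ("is_employed", "Alex", None),  # driving to work implies having a job
]

def relations_for(subject, triples):
    """List the relation names inferred about a given subject."""
    return [rel for rel, subj, _ in triples if subj == subject]
```

Populating a knowledge graph with such inferred triples would let a system surface facts that are never stated explicitly in the text.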
Let us note that, besides Salesforce, other researchers are already looking into using knowledge graphs for this. McCann, for his part, when asked about applicability in production environments, noted that this research is at a nascent stage:
"Human explanations have rarely been incorporated into training, and when they have in the past, they often required a sacrifice in performance. We hope that because training with explanations improves performance in this work, people creating datasets in the future will be encouraged to collect human explanations as well.
We will need such datasets to train on a broader set of domains before this is applicable to production environments. We will also need to learn how to better control text generation -- in our case in particular, we have to increase the faithfulness of generated explanations to the final decisions of the model."
Concluding their work, Rajani and McCann noted the existence of (gender) bias in the data, which in turn leads to biased models. Rajani said it is currently quite hard to eradicate undesirable bias in models once they are trained:
"Mitigating bias by pre-processing the data is an option but as the size of datasets increases, this becomes not just difficult but also very inefficient. So we are currently exploring new techniques to help mitigate unwanted bias in trained models.
We are growing our datasets to other problems in NLP and vision. Once we have explanations for a variety of datasets and domains, we plan to explore the possibility of training explainable multi-task models."