Can these new tricks fix the disaster of chatbots?
The bar is set pretty low for chatbots, which exhibit fairly tedious and even idiotic streams of thought when engaging in chit-chat with people. A raft of new papers from Facebook and academic partners provide promising new directions, though the goal of human dialogue still seems fairly far away.
The bar has been set pretty low for chatbots, those computer programs that seek to engage a person in back-and-forth dialogue. So low it shouldn't be too hard to add a bit more intelligence to this dismal technology.
Last week, Facebook's artificial intelligence researchers, with the help of academics, unveiled multiple projects to give chatbots a variety of new qualities, ranging from being less repetitive to being more on-topic, to even displaying some semblance of emotion.
The papers were prepared for the annual conference of the Association for Computational Linguistics, or ACL, which started Sunday in Florence, Italy.
Some of the papers introduce new data sets that may help future chatbot work; they all claim some improvement over state-of-the-art benchmarks. They may all help make conversations less annoying, but they also fall short of the grandest promises, which are couched in the anthropomorphic language of human "reason" and "emotion."
Results from a massive competition among chatbots in December, at the NeurIPS conference on machine learning, revealed numerous problems with chatbots even when the best engineers are writing the code. Issues include inconsistencies in the facts and logic of sentences and mindless repetition of phrases.
The most interesting aspect of this week's work is that it marks a departure in the foci of research. Whereas the December competition showed a focus on the general consistency of topics in chat, these new works dig deeper into aspects of tone, mood, and texture of speech in a chatbot.
The most immediately practical of the new papers asks a great question, up-front, "What makes a good conversation?" Authors Abigail See of Stanford University, and Stephen Roller, Douwe Kiela, and Jason Weston of Facebook dig into some of those annoying aspects such as repetitiveness and lack of relevance.
The authors contend too little has been done in this regard. "The factors that determine human judgments of overall conversation quality are almost entirely unexplored," they write. Well, amen to that!
"Existing work has ignored the importance of conversational flow," they write, "as standard models (i) repeat or contradict previous statements, (ii) fail to balance specificity with genericness, and (iii) fail to balance asking questions with other dialogue acts."
To tackle the matter, the authors return to the data set used in the December NeurIPS competition, PersonaChat, developed by Saizheng Zhang, who holds dual roles at Facebook and Montreal's MILA Institute for AI, and colleagues. PersonaChat is a corpus of human conversations in which interlocutors are given a "persona," a collection of attributes, such as being a pet-lover. The conversation is then supposed to stay within the guidelines of that predilection.
The authors take a long short-term memory, or "LSTM," neural net, designed for so-called "sequence-to-sequence," or "seq2seq," text representation, developed at Google, and pre-train it by inputting 2.5 million Twitter conversations, and then fine-tune it with PersonaChat. That's all fairly standard; where they bring particular innovation is in controlling four aspects of conversation: repetition, specificity, what they call "response-relatedness," and question-asking. To do so, the authors annotate sentences in the training data with a "z" value that controls for those four elements and append that z value to each input of the training data. They then fine-tune with the PersonaChat database until those z values are optimized.
Using the z value, See and colleagues can tune up or down the repetition, for example, across whole conversations, such as the repetition of certain content.
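For a rough feel of what annotating training data with a control value can look like, here is a minimal sketch. It is an illustration under assumptions, not the authors' code: the function names, the three-bucket repetition measure, and the `<question_…>` and `<repeat_…>` tokens are all hypothetical, and plain text tokens stand in for whatever representation the actual model uses.

```python
from typing import List


def question_flag(response: str) -> int:
    """Return 1 if the response contains a question mark, else 0."""
    return 1 if "?" in response else 0


def repetition_level(response: str, history: List[str], buckets: int = 3) -> int:
    """Bucket the share of response words that already appeared in the history."""
    seen = {word for turn in history for word in turn.lower().split()}
    words = response.lower().split()
    if not words:
        return 0
    overlap = sum(word in seen for word in words) / len(words)
    # Clamp so that a fully repeated response lands in the top bucket.
    return min(int(overlap * buckets), buckets - 1)


def annotate(history: List[str], response: str) -> str:
    """Append hypothetical control tokens (the "z" values) to the model input."""
    z_question = question_flag(response)
    z_repeat = repetition_level(response, history)
    return " ".join(history) + f" <question_{z_question}> <repeat_{z_repeat}>"
```

In a setup like this, the labels are computed from the human replies during training; at chat time, someone picks the tokens by hand, for instance asking for the lowest repetition bucket, to steer what the model produces.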
To test the results, the authors had humans converse with the machine and then answer multiple-choice questions, such as how interesting was the conversation on a scale of 1 to 4. The evaluation was the same kind used in the NeurIPS December competition.
The result is that their approach did, indeed, elicit better scores from human reviewers than baseline models on aspects of "interestingness" and "making sense." On some metrics, they matched the result of the winning entry in the December competition, the "Lost in Translation" chatbot, but they were able to do it with a fraction of the training data, they report.
Their main technical observation is that "repetition is by far the biggest limiting quality factor for naive sequence-to-sequence dialogue agents."
But when it comes to what makes "a good conversation," there are some surprises. In general, good conversation "is about balance – controlling for the right level of repetition, specificity and question-asking is important for overall quality," they write. However, "While our models achieved close-to-human scores on engagingness, they failed to get close on humanness – showing that a chatbot need not be human-like to be enjoyable."
Perhaps most tantalizing of the new works is "Towards Empathetic Open-domain Conversation Models," authored by Hannah Rashkin of the Paul G. Allen School of Computer Science and Engineering at the University of Washington and Eric Michael Smith, Margaret Li, and Y-Lan Boureau of Facebook. Wouldn't it be great if, when talking about topics that are emotionally charged, such as "My cat just died," a chatbot did not respond with something tone-deaf like, "Great, I love cats"?
That's the challenge taken on by Rashkin and the team, trying to bring empathy, as they characterize it, to the utterances that a chatbot makes in a dialogue situation. They compiled a data set of 25,000 conversations between one speaker and another, called "EmpatheticDialogues," gathered in a crowdsourced fashion from 810 people on Amazon's Mechanical Turk service.
The first speaker was given a prompt of a specific emotion, such as "joyful" or "apprehensive." The authors chose to create this new data set because, they note, scrapes of Web conversations such as those from Twitter tend not to be like real human conversation: They have too much "curated self-presentation," borrowing a phrase from sociologist Erving Goffman.
The human dialogue was then used to fine-tune a Transformer network, the super-popular natural language processing model from Google, encoding the utterances and their context. The researchers used one version of the experiment in which the Transformer responds to a sentence as a prompt by "retrieving" an utterance from one of three data sets, the EmpatheticDialogues data set itself, a prior data set called DailyDialog, by Yanran Li and colleagues, or a dump of 1.7 billion Reddit conversations. Another experiment had the Transformer generate original sentences as replies, and yet another experiment used an enhanced version of the Transformer named BERT that has been state of the art on a lot of language processing tests.
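The retrieval idea is simple enough to sketch in a few lines: score every candidate reply against the context and return the best match. The toy version below is an assumption-laden stand-in, with a bag-of-words count vector and cosine similarity taking the place of the Transformer encoder; the function names and the sample sentences fed to it are hypothetical.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in encoder: a bag-of-words count vector, not a Transformer."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(context: str, candidates: List[str]) -> str:
    """Return the candidate utterance most similar to the context."""
    ctx = embed(context)
    return max(candidates, key=lambda cand: cosine(ctx, embed(cand)))
```

Word overlap is a crude proxy; the point of a trained encoder is that a reply can score well without sharing any vocabulary with the prompt.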
The authors write that when Transformer or BERT was used to retrieve utterances from their EmpatheticDialogues data set, the computer's responses got higher BLEU scores, a widely used automatic metric that measures word overlap with human-written reference text. The sentences were also rated more highly by humans for empathy, based on scores gathered on Mechanical Turk when people were shown the output of the computer in the context of the dialogue.
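For readers unfamiliar with BLEU, its core ingredient is a clipped n-gram precision: the share of word sequences in the machine's reply that also appear in a human reference, with repeats capped. The sketch below shows only that ingredient; full BLEU also combines several n-gram sizes and penalizes replies that are too short, both of which are left out here. The function names are hypothetical.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # An n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total
```

The clipping is what stops a reply such as "the the the" from earning full marks against a reference that contains "the" only once.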
The authors give some examples of how the various models did. Given an utterance such as "My son failed his exams! Can you believe it! I don't know what to do," the best model, a fine-tuned version of BERT that uses retrieval, responds, "Oh no! Did he not study?" The worst-scoring model replied with the bizarrely phrased, "Hopefully his mistakes can help you know what not to do on the tests."
The authors write that they plan to "investigate how to integrate empathetic responding into more general dialogue when, for example, the needs for empathy have to be balanced with staying on topic or providing information."
Another of the papers turns to knowledge graphs, databases of entities and the facts that connect them. Moon and colleagues write that theirs is the first parallel dialog-to-knowledge-graph data set, one "where each mention of a KG entity and its factual connection in an open-ended dialog is fully annotated, allowing for an in-depth study of symbolic reasoning and natural language conversations." The authors crowdsourced 15,000 dialogues containing 91,000 utterances, which are then annotated with all the things that are discussed, be they movies, books, musicians, etc. The knowledge graph has 100,000 entities and 1.1 million facts, they write.
They then created a "novel attention-based graph decoder" that does a walk over the graph of all those objects and their relationships, to select responses. To do so, they came up with a novel "loss function" to minimize, the supervised loss for "generating the correct entity at the next turn," and the loss as defined by taking the "optimal path within a knowledge graph."
The result should be that when someone says something like "Do you like Lauren Oliver. I think her books are great!" in which the famous fiction writer is mentioned, the computer replies with an appropriate-sounding response, such as "I do, Vanishing Girls is one of my favorite books." It does so by traversing the path on the graph "written by -> wrote" to an entity, "Annabel," Oliver's 2012 novel, and then can pick out another Oliver title, "Vanishing Girls" from 2015.
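To make the path idea concrete, here is a toy walk over a three-entity graph built only from the facts in the example above. It is a sketch under heavy assumptions: the path of relations is fixed by hand, whereas the paper's decoder is said to learn which path to take, and the `GRAPH` layout and `walk` function are hypothetical.

```python
from typing import Dict, List, Tuple

# Toy knowledge graph: entity -> list of (relation, neighboring entity).
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "Annabel": [("written by", "Lauren Oliver")],
    "Vanishing Girls": [("written by", "Lauren Oliver")],
    "Lauren Oliver": [("wrote", "Annabel"), ("wrote", "Vanishing Girls")],
}


def walk(start: str, relations: List[str]) -> List[str]:
    """Follow a fixed sequence of relations and return the entities reached."""
    frontier = [start]
    for relation in relations:
        frontier = [
            neighbor
            for entity in frontier
            for rel, neighbor in GRAPH.get(entity, [])
            if rel == relation
        ]
    return frontier


def suggest(start: str, relations: List[str]) -> List[str]:
    """Entities reached by the walk, minus the one the speaker already named."""
    return [entity for entity in walk(start, relations) if entity != start]
```

Starting from "Annabel" and following "written by" and then "wrote" lands on both Oliver titles; dropping the starting point leaves "Vanishing Girls" as the thing to mention next.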
The authors compared how their novel decoder did against various baselines, such as seq2seq and a "tri-LSTM" network. They write that their model got better "top-k" scores, a common measure of relevance. Also, they had three humans evaluate each of 250 exchanges and rate which model's output was "the most natural for given dialog context." Their novel decoder handily beat the baselines on that as well.
The big picture, for Moon & co., is that theirs is the first use of knowledge graphs for "open-ended" kinds of discussion, rather than a structured conversation around a topic or a domain. Also, unlike previous uses of knowledge graphs that predict the next "edge" in a network, their decoder can "learn an optimal path within existing paths that resemble human reasoning in conversations."
Will these works make a difference? At least chatbots will be a little less dull. Data sets such as "EmpatheticDialogues" and new techniques such as the knowledge graph decoder based on attention offer some tantalizing directions forward.
Still, there are some pitfalls here, such as the contention that chatbots are gaining "empathy," a concept still very little understood by philosophers, sociologists, and the average individual. The same goes for the contention that knowledge graphs follow "paths that resemble human reasoning in conversation," a claim that really is not proven in any fashion, but merely asserted.
Such terms carry a heavy weight of anthropomorphizing, and they muddy the waters by implying sentience on the part of what is, after all, just brilliant code.