In modern artificial intelligence systems for understanding natural language, there's something of a tension between understanding the nature of language, on the one hand, and performing computer tricks that improve test-taking by manipulating data, on the other.
Or, at least, that's the sense one gets from Google's latest contribution to natural language understanding, a new neural network unveiled on Wednesday called "XLNet." The authors, whose affiliations include Google Brain and Carnegie Mellon University, showed that XLNet meaningfully improves on several previous approaches on standardized tests that include question answering.
The secret is a new way to set objectives for the computer program, so it understands not merely frequency of words but the likelihood that words appear in a given order in sentences.
XLNet is the latest in a long line of software descending from the seminal invention in 2017 of what's known as the "Transformer," developed by Google researcher Ashish Vaswani and colleagues. The Transformer went on to inspire OpenAI's GPT-2, and Google's "BERT," and many other language-processing models.
With XLNet, the authors modified the Transformer, and the result, as they say, "further justifies language modeling research." It does a better job than BERT, they contend, in realistically training a computer on how language actually shows up in real-world documents.
In a sense, too, they've opened up a new front in the use of the Transformer. It now has not just one but two objectives that it carries out in tandem: assessing language probabilities, and modeling the assembly of sentences as permutations of possible combinations of words.
The paper, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," is posted on the arXiv pre-print server, and code is posted on GitHub. The paper was authored by Zhilin Yang, along with a team of colleagues who earlier this year introduced Google's "Transformer-XL," a more formidable version of the Transformer. They include Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.
Zhilin Yang is officially associated with Carnegie Mellon, as are Dai, Yiming Yang, Carbonell, and Salakhutdinov, but Le is officially associated with Google Brain and Dai has a joint affiliation between the two.
Yang and the team are correcting a shortcoming in how language has been modeled by computer programs. Programs such as GPT-2 look only at the words, or tokens, that lead up to a particular word in a phrase or sentence, but not at what comes after it. That's not good, some argue, for real-world tasks such as entailment.
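The left-to-right limitation described above can be pictured with a small sketch. It is only an illustration, not code from GPT-2 or XLNet, and the function name `left_contexts` is made up for this example: each token is paired with the tokens before it and nothing after it.

```python
def left_contexts(tokens):
    """Pair each token with only the tokens that precede it.

    Illustrative helper (hypothetical name): a left-to-right language
    model predicts each token from this kind of one-sided context.
    """
    return [(tokens[:i], tokens[i]) for i in range(len(tokens))]


if __name__ == "__main__":
    for context, target in left_contexts(["New", "York", "is", "a", "city"]):
        print(f"predict {target!r} from {context}")
```

In this picture, the prediction for "New" sees no context at all, even though "is a city" later in the sentence is informative.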
To fix that, models such as BERT have come up with tricks that have their own pitfalls. BERT took the Transformer architecture and added a twist: it trained the Transformer with some of the words in a sentence "masked," either randomly substituted with other words or replaced with a string that literally says "MASK." That's based on what's called a "cloze" test, where you give people a sentence with blanks and force them to guess the words. BERT was trained to fill in the blanks, as a way to force it to compute many different probabilities of word combinations.
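As a rough sketch of the fill-in-the-blank setup, the snippet below replaces chosen positions with a mask string and records the words the model would have to guess. It is a simplification under stated assumptions: the positions are passed in by hand rather than sampled at random, the occasional substitution with other words is left out, and the name `mask_tokens` and the literal "[MASK]" string are illustrative choices, not BERT's actual code.

```python
def mask_tokens(tokens, positions, mask_token="[MASK]"):
    """Return a masked copy of `tokens` plus the hidden targets.

    `positions` lists the indices to blank out; a real training setup
    would choose them at random (assumption: fixed here for clarity).
    """
    masked = list(tokens)
    targets = {}
    for i in positions:
        targets[i] = tokens[i]   # the word the model must recover
        masked[i] = mask_token   # what the model actually sees
    return masked, targets


if __name__ == "__main__":
    masked, targets = mask_tokens(["New", "York", "is", "a", "city"], [0, 1])
    print(masked)
    print(targets)
```

The model never sees the hidden words in its input, which is the property the XLNet authors take issue with.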
That's all well and good, write Yang & Co., but such an approach is unnatural: you don't find such masked words in real-world test data. More important, they write, "since the predicted tokens are masked in the input, BERT is not able to model the multiple ways in which words may depend on one another."
For example, in the sentence "New York is a city," BERT can figure out that the word "New" probably would be implied by a sentence fragment such as "is a city," as would the word "York." But BERT couldn't divine whether the word "York" might be made more likely by the presence of both "New" and "is a city." BERT, in other words, doesn't know that the two target words, as they're called, "New" and "York," are linked in a co-dependent fashion.
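The "New York" example can be written out as two training objectives. This is a sketch of the comparison, and the XLNet line assumes one particular ordering in which "New" is predicted before "York":

```latex
\begin{align*}
\mathcal{J}_{\text{BERT}}  &= \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) \\
\mathcal{J}_{\text{XLNet}} &= \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New}, \text{is a city})
\end{align*}
```

The only difference is the conditioning in the second term: under that ordering, the prediction of "York" is allowed to depend on "New."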
To solve that problem, the authors strip away the masking idea, and they instead go back to a prior idea developed in 2016, based on the notion of "permutations," by Benigno Uria of Google's DeepMind. One can index the words in a phrase and then, without referring to the word itself, compute the various permutations of those words in sequence.
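A small sketch may help make the permutation idea concrete. Assume the words keep their original positions in the sentence and only the order in which they are predicted changes; the function name `contexts_for_order` and the example ordering are hypothetical, chosen for illustration rather than taken from the XLNet code.

```python
from itertools import permutations


def contexts_for_order(tokens, order):
    """For a given prediction order, list what each target may see.

    `order` is a permutation of positions. A target at step k sees
    only the tokens at positions predicted in steps 0..k-1, shown
    here in their original sentence order.
    """
    steps = []
    for step, pos in enumerate(order):
        visible = sorted(order[:step])
        steps.append((tokens[pos], [tokens[j] for j in visible]))
    return steps


if __name__ == "__main__":
    tokens = ["New", "York", "is", "a", "city"]
    # One of the 5! = 120 possible prediction orders.
    for target, context in contexts_for_order(tokens, [2, 3, 4, 0, 1]):
        print(f"predict {target!r} from {context}")
    print(len(list(permutations(range(len(tokens))))), "possible orders")
```

With the order used here, "York" is predicted last, so it sees "New" as well as "is a city." That is the dependency the masking approach leaves out.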
The authors call this a "two stream" approach: one function looks at the probabilities of words in context in the sentence, while a second function looks at the ordering of the words around the target word, the context, which is known in Transformer-speak as the "query." This second function has no knowledge of the target word itself. Think of it as combining two sets of information, one about the word itself and another about the context in which it is most likely to show up. This becomes, then, a new, combined objective to be optimized by XLNet.
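One way to picture the two streams is as two visibility tables built from the same prediction order. The sketch below is an assumption-laden illustration, not the authors' implementation, and the name `two_stream_masks` is invented here. In the "content" table a position may see itself and everything predicted earlier; in the "query" table it may see only what was predicted earlier, never itself.

```python
def two_stream_masks(order):
    """Build boolean visibility tables for a prediction order.

    content[i][j] is True if position i may look at position j,
    including itself; query[i][j] is the same except that a position
    can never look at its own word.
    """
    n = len(order)
    # rank[pos] = the step at which position `pos` is predicted
    rank = {pos: step for step, pos in enumerate(order)}
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query


if __name__ == "__main__":
    content, query = two_stream_masks([2, 0, 1])
    for name, table in (("content", content), ("query", query)):
        print(name)
        for row in table:
            print("  ", row)
```

The diagonal is the only place the two tables differ, which matches the description of a second function that knows the context but not the target word.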
The result is that XLNet, the authors write, has more "signals" than does BERT. But it also improves on prior language modeling efforts such as GPT-2, they argue. It essentially bridges two worlds.
The authors plan to extend the XLNet approach to problems in computer vision and reinforcement learning, they note.
All that's left now is for the team at the Allen Institute for Artificial Intelligence to start picking apart XLNet.