Google’s latest language machine puts emphasis back on language

Carnegie Mellon and Google's Brain outfit have tried to undo some of the techniques of Google's BERT machine learning model for natural language processing. They propose a new approach called "XLNet." Built on top of the popular "Transformer" A.I. for language, it may be a more straightforward way to examine how language works.

In modern artificial intelligence systems for understanding natural language, there's something of a tension -- between understanding the nature of language, on the one hand, and performing computer tricks that improve test-taking by manipulating data, on the other. 

Or, at least, that's the sense one gets from Google's latest contribution to natural language understanding, a new neural network unveiled on Wednesday called "XLNet." The authors, whose affiliations include Google Brain and Carnegie Mellon University, showed that XLNet meaningfully outperforms several previous approaches on standardized tests that include question answering.

The secret is a new way to set objectives for the computer program, so it understands not merely frequency of words but the likelihood that words appear in a given order in sentences.

XLNet is the latest in a long line of software descending from the seminal invention in 2017 of what's known as the "Transformer," developed by Google researcher Ashish Vaswani and colleagues. The Transformer went on to inspire OpenAI's GPT-2, and Google's "BERT," and many other language-processing models. 

With XLNet, the authors took the Transformer and modified it, and the result, as they say, "further justifies language modeling research." It does a better job than BERT, they contend, of realistically training a computer on how language actually shows up in real-world documents.

In a sense, too, they've opened up a new front in the use of the Transformer: It now has not just one but two objectives that it carries out in tandem, assessing language probabilities but also the assembly of sentences as permutations of possible combinations of words.

The paper, "XLNet: Generalized Autoregressive Pretraining for Language Understanding," is posted on the arXiv pre-print server, and code is posted on GitHub. The paper was authored by Zhilin Yang, along with a team of colleagues who earlier this year introduced Google's "Transformer-XL," a more formidable version of the Transformer. They include Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le.

Zhilin Yang is officially associated with Carnegie Mellon, as are Dai, Yiming Yang, Carbonell, and Salakhutdinov. Le is officially associated with Google Brain, and Dai has a joint affiliation between the two.

Yang and the team are correcting a shortcoming in how language has been modeled by computer programs. Programs such as GPT-2 look only at the words, or tokens, in a phrase or sentence that lead up to a particular word, but not at what comes after it. That's not good, some argue, for real-world tasks such as entailment.


Google's XLNet uses two "streams" of investigation, one that looks at the probability of a given word in the distribution of words in the text, another that inspects the context of words around the original word, but is blind to the word in question seen by the first investigation.

To get around that one-way view, models such as BERT have come up with tricks that have their own pitfalls. BERT took the Transformer architecture and added a twist: it trained the Transformer with some of the words in a sentence "masked," either randomly substituted with other words or replaced with a string that literally says "MASK." That's based on what's called a "cloze" test, where you give people a sentence with blanks and force them to guess the words. BERT was trained to fill in the blanks, as a way to force it to compute many different probabilities of word combinations.
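The fill-in-the-blank setup described above can be sketched in a few lines of Python. This is a minimal illustration, not BERT's actual preprocessing: the function name `mask_tokens`, the `[MASK]` placeholder text, and the default masking rate are assumptions chosen for the example, and the variant that substitutes a random word is left out.

```python
import random

MASK = "[MASK]"  # placeholder text is an assumption for this sketch


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets) for one cloze-style example.

    targets maps each blanked position to the original word the model
    would be asked to guess. mask_prob is an illustrative default.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK)   # hide the word from the model
            targets[i] = tok         # remember what it has to recover
        else:
            corrupted.append(tok)
    return corrupted, targets


sentence = ["New", "York", "is", "a", "city"]
corrupted, targets = mask_tokens(sentence, mask_prob=0.4, seed=1)
```

Each entry in `targets` is guessed from the words left visible in `corrupted`, which is the property the XLNet authors take issue with.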

That's all well and good, write Yang & Co., but such an approach is unnatural: you don't find such masked words in real-world test data. More important, they write, "since the predicted tokens are masked in the input," BERT is not able to model the multiple ways in which words may depend on one another.

For example, in the sentence "New York is a city," BERT can figure out that the word "New" probably would be implied by a sentence fragment such as "is a city," as would the word "York." But BERT couldn't divine whether the word "York" might be made more likely by the presence of both "New" and "is a city." BERT, in other words, doesn't know that the two target words, as they're called, "New" and "York," are linked in a co-dependent fashion.
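The difference can be written out for this example. As a sketch, assume "New" and "York" are the two target words, "is a city" is the visible context, and the autoregressive model happens to predict "New" before "York." The symbols $\mathcal{J}_{\text{masked}}$ and $\mathcal{J}_{\text{autoregressive}}$ are labels chosen here for the two training objectives:

```latex
\begin{align*}
\mathcal{J}_{\text{masked}} &= \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{is a city}) \\
\mathcal{J}_{\text{autoregressive}} &= \log p(\text{New} \mid \text{is a city}) + \log p(\text{York} \mid \text{New}, \text{is a city})
\end{align*}
```

The only change is in the second term: the masked objective scores "York" without knowing "New," while the autoregressive one conditions on it.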

To solve that problem, the authors strip away the masking idea, and they instead go back to a prior idea developed in 2016, based on the notion of "permutations," by Benigno Uria of Google's DeepMind. One can index the words in a phrase and then, without referring to the word itself, compute the various permutations of those words in sequence. 
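A small Python sketch can make the permutation idea concrete. It assumes the sentence itself is never rearranged; only the order in which words get predicted changes. The helper name `contexts_for_order` and the two example orders are hypothetical, chosen for illustration rather than taken from the authors' code.

```python
def contexts_for_order(tokens, order):
    """For one prediction order, list what each target word may see.

    order is a permutation of token positions. A word is predicted from
    the words that come earlier in that order, wherever they sit in the
    sentence. Assumes the tokens are distinct, as in the example below.
    """
    seen = []
    result = {}
    for pos in order:
        # Show the visible context in sentence order for readability.
        result[tokens[pos]] = [tokens[p] for p in sorted(seen)]
        seen.append(pos)
    return result


tokens = ["New", "York", "is", "a", "city"]
left_to_right = contexts_for_order(tokens, [0, 1, 2, 3, 4])
shuffled = contexts_for_order(tokens, [2, 3, 4, 0, 1])
```

Under the plain left-to-right order, "York" is predicted from "New" alone. Under the second order, "York" is predicted from "New" together with "is a city," which is the dependency the masked approach misses.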

The authors call this a "two stream" approach. One function in the Transformer looks at the probabilities of words in context in the sentence. A second function looks at the ordering of the words around the target word, the context, what is known in Transformer-speak as the "query." This second function has no knowledge of the target word itself. Think of it as combining two sets of information, one about the word itself and another about the context in which it is most likely to show up. This becomes, then, a new, combined objective to be optimized by XLNet.
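One way to picture the two streams is as two visibility tables built from the same prediction order. The sketch below is a simplified reading of the idea, not the authors' implementation: the function name `two_stream_masks` and the example order are assumptions, and real attention layers would apply such tables to learned vectors rather than to words.

```python
def two_stream_masks(order):
    """Build two visibility tables for one prediction order.

    content[i][j] is True if position i may use the word at position j
    while encoding itself; a position can see its own word here.
    query[i][j] is True if position i may use the word at position j
    while predicting its own word; a position cannot see itself here.
    """
    n = len(order)
    # rank[pos] is the step at which position pos gets predicted.
    rank = {pos: step for step, pos in enumerate(order)}
    content = [[rank[j] <= rank[i] for j in range(n)] for i in range(n)]
    query = [[rank[j] < rank[i] for j in range(n)] for i in range(n)]
    return content, query


# Positions: 0="New", 1="York", 2="is", 3="a", 4="city"
content, query = two_stream_masks([2, 3, 4, 0, 1])
```

With this order, the query row for "York" (position 1) can see every other word but not "York" itself, while its content row can see all five. The two tables differ only on the diagonal, which is what keeps the second stream blind to the target word.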

The result is that XLNet, the authors write, has more "signals" than does BERT. But it also improves on prior language modeling efforts such as GPT-2, they argue. It essentially bridges two worlds.   

The authors plan to extend the XLNet approach to problems in computer vision and reinforcement learning, they note.

All that's left now is for the team at Allen Institute for Artificial Intelligence to start picking apart XLNet.