No, this AI can’t finish your sentence

The New York Times wrote in November that Google's Bert natural language model can finish your sentences, but this week the Allen Institute for Artificial Intelligence argues otherwise: machines still cannot really reason.
Written by Tiernan Ray, Senior Contributing Writer

The hype around artificial intelligence has gotten so misleading that deflating it has become a subtext of some research in the field.

Take, for example, a subtly scathing report put out this week by the Allen Institute for Artificial Intelligence and the Paul Allen School of Computer Science at the University of Washington. Researcher Rowan Zellers and colleagues follow up on work last fall that showed they could stump some of the best natural language processing neural networks with a group of silly English-language phrases.

In the new work, they turn up the pressure to show state-of-the-art language models still can't reason correctly about what sentence should follow another.

They also take a swipe at the poor journalistic coverage of the discipline. Zellers and colleagues note that a New York Times article from November trumpeted that Google's "Bert" natural language neural network was able to beat their original challenge. "Finally, a Machine That Can Finish Your Sentence," ran the headline of that Times piece.

Well, apparently not. In this new report, the Zellers team shows that, by coming up with sentence completions that are increasingly preposterous, they can trick even poor Bert into a wrong answer.


A natural language inference task that Google's Bert and other language models consistently fail at: picking the second sentence that logically follows a first, or answering correctly a question from Wikihow.

Allen Institute for Artificial Intelligence

"Human performance is over 95%" in tests on completing sentences, they report, "while overall model performance is below 50% for every model," including Google's Bert.

"The underlying task remains unsolved," they write, meaning, the task of understanding natural language inference, the ability to infer one thing from another the way humans do.

The authors write that their work shows Bert and things like it are not learning any "robust commonsense reasoning." What they're actually doing is mastering a particular data set, something they call "rapid surface learners" -- picking up on cues such as stylistic traits.

In fact, Zellers and company go further, theorizing a very tough road ahead for learning such reasoning. The report proposes that if one keeps ratcheting up the difficulty of such datasets to stump each new generation of language model that Google or anyone else may propose, a kind of arms race could ensue. The potential result is that it could conceivably take 100,000 years of graphics processing unit, or GPU, time to reach "human" accuracy on the tests.


"Extrapolating from an exponential fit suggests that reaching human- level performance on our dataset would require 109 GPU hours, or 100k years -- unless algorithmic improvements are made," they write.

Even the title of the new work, posted on the arXiv pre-print server, implies some impatience with the hype: "HellaSwag: Can a Machine Really Finish Your Sentence?" -- note the italics!

HellaSwag is the new version of their "Swag" test from August. Swag stands for "Situations With Adversarial Generations." It is a set of sentence completion tasks that is designed to be hard for the best natural language processing technology, such as Bert.

In that original paper, the authors took videos from the Web and got human "crowd workers" to write two captions, one for a first and one for a second frame of video, frames that followed one after another.

The challenge for language models such as Bert was to pick which of several alternate proposals for the second caption was most logical as a follow-on to the first, in the form of a multiple-choice question.

To make it difficult, Zellers & Co. stuffed the human caption amongst three alternates that were generated by a neural network.


An example of answering a question that the computer reliably fumbles. The authors postulate Bert is picking up on words about technology when it chooses the wrong answer, answer d, in pink, versus the right answer, answer c. 

Allen Institute for Artificial Intelligence.

For example, if the first caption reads, "The lady demonstrates wrapping gifts using her feet," and is followed by a noun, "The lady," a correct second caption, written by humans, would be "cuts the paper with scissors." A misleading caption, generated by the computer, would be, "takes the desserts from the box and continues talking to the camera."

Zellers and company select the best misleading answers by finding the ones that are most real-seeming, a process they call "adversarial filtering." Using a group of neural networks, they keep generating captions until those neural networks can no longer tell the difference between what's a human-written caption and what's computer-generated.
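The adversarial-filtering loop described above can be sketched as follows. This is a deliberately toy version: a random picker stands in for the real GPT generator, and a deterministic scoring stub stands in for the trained BERT discriminator; only the loop structure reflects the paper's idea.

```python
import random

random.seed(0)

# Toy pool of machine-style endings; the real pipeline samples from GPT.
CANDIDATE_ENDINGS = [
    "takes the desserts from the box and continues talking to the camera.",
    "waves at the crowd and keeps smiling.",
    "opens the door and walks outside.",
    "puts the box on the table and laughs.",
]

def generate_ending(context):
    # Stand-in for a neural text generator.
    return random.choice(CANDIDATE_ENDINGS)

def fake_probability(context, ending):
    # Stand-in discriminator: how machine-like the ending looks.
    # Deterministic within a run; a real one would be a trained model.
    return (hash((context, ending)) % 1000) / 1000.0

def adversarial_filter(context, n_distractors=3, rounds=100):
    """Repeatedly replace the most obviously machine-like distractor
    with a fresh sample, so the surviving set is hard for the
    discriminator to tell apart from human writing."""
    distractors = [generate_ending(context) for _ in range(n_distractors)]
    for _ in range(rounds):
        worst = max(range(n_distractors),
                    key=lambda i: fake_probability(context, distractors[i]))
        candidate = generate_ending(context)
        if fake_probability(context, candidate) < \
           fake_probability(context, distractors[worst]):
            distractors[worst] = candidate
    return distractors

context = "The lady demonstrates wrapping gifts using her feet. The lady"
print(adversarial_filter(context))
```

The design choice to mention: the filter optimizes the distractors against the discriminator, not against any single question, which is why the resulting dataset is hard for every model the authors tried rather than just one.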

With a set of sentences in hand, they challenged Bert and the other models to pick the sentence that is the most logical second caption, the human-generated one.


They essentially generated text with a neural network to fool a neural network.

As they put it, "throwing in the best known generator (GPT) and the best known discriminator (BERT-Large), we made a dataset that is adversarial -- not just to BERT, but to all models we have access to."

There's a kind of poetic beauty in the approach, if you've ever seen the inane nonsense generated by a natural language model such as OpenAI's "GPT." (They in fact used GPT in HellaSwag to generate the misleading sentences.)

In the new paper, HellaSwag -- the prefix stands for "Harder Endings, Longer Contexts, and Low-Shot Activities" -- Zellers and colleagues added to the original test by picking out sentence-answer examples from Wikihow, the how-to website.


They find that Bert is much worse at picking out which sentences are an answer to Wikihow questions. Given a Wikihow task, such as what to do if you are driving and come to a red light, Bert and other models pick wrong answers, like "stop for no more than two seconds." In fact, Bert picks out the right answer only 45% of the time on such a test.

What's going on in all of this? Zellers and colleagues think the frustration of Bert on this new test shows just how superficial a lot of language learning is.

Bert and models such as "ELMo," developed by the Allen Institute, are "picking up on dataset-specific distributional biases."

The authors test how these language systems do when they strip away the "context," meaning the first caption, or, for Wikihow, the question. It doesn't affect Bert's performance much, they find. Bert and ELMo and the rest, in other words, are not really using the first part; they're just keying in on stylistic aspects of the second part.
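That ablation is easy to picture with a toy "model" that scores endings only on a surface cue (here, sheer length). Because the same context is prepended to every candidate, stripping it away changes nothing about which ending wins -- the insensitivity the authors observed in Bert. Everything below is illustrative, not the paper's setup.

```python
# A toy "model" that uses only a stylistic surface cue of the ending.
def surface_score(text):
    return len(text)  # e.g. prefer longer, wordier endings

def pick(context, endings, use_context):
    # Score each candidate ending, optionally prepending the context.
    def score(e):
        return surface_score((context + " " + e) if use_context else e)
    return max(range(len(endings)), key=lambda i: score(endings[i]))

dataset = [
    ("The lady demonstrates wrapping gifts using her feet. The lady",
     ["cuts the paper with scissors.",
      "takes the desserts from the box and continues talking to the camera."],
     0),  # index of the human-written ending
]

def accuracy(use_context):
    correct = sum(pick(ctx, ends, use_context) == gold
                  for ctx, ends, gold in dataset)
    return correct / len(dataset)

# A surface-only model is unmoved by stripping the context.
print(accuracy(True), accuracy(False))
```

If a model's accuracy barely moves when the context is deleted, it cannot have been doing the inference the task is supposed to measure -- which is exactly the authors' diagnostic.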

"Existing deep methods often get fooled by lexical false friends," they write.

The paper concludes with a kind of call to arms for an arms race, a system of "evolving benchmarks," that will keep throwing more sophisticated wrong answers at language models to keep tripping up their ability to game the task by simply finding superficial patterns.

What's missing, though, is a human ability to "abstract away from language" and instead "model world states," the authors write.

For now, then, even in a controlled setting, no, a machine cannot really finish your sentence.

