No, this AI hasn’t mastered eighth-grade science

Researchers at the Allen Institute for AI have engineered a brilliant mash-up of natural language processing techniques that gets high scores on Regents science exam questions, but the software is not really learning science in the sense most people would think; it's just counting words.
Written by Tiernan Ray, Senior Contributing Writer

One of the most mindless features of modern education is standardized tests, which require pupils to regurgitate information usually committed to memory in rote fashion. Fortunately, a machine has now been made that can complete questions on a test about as well as the average student, perhaps freeing humans for more worthwhile types of learning. 

Just don't assume it has anything to do with learning as you typically think of it. 

Researchers at the Allen Institute for Artificial Intelligence in Seattle on Wednesday announced a new deep learning neural network program, called "Aristo" (a play on Aristotle), giving it a trumpet call with a story in The New York Times suggesting the thing can actually reason about science. What they did was build a program that can select correct answers on multiple-choice questions from the New York State Regents science exams, with an accuracy of 80% to 90%. 

Alas, there isn't much reasoning going on here, and it's not as if this thing actually knows science. What has happened is that the deep learning network has calculated a good enough probability distribution of language to predict the words used in the appropriate answer when confronted with four possible answers. They did it using a modified version of Google's popular "Bert" natural language model.
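The selection principle the article describes can be sketched in a few lines. This is a toy stand-in, not the Allen Institute's code: the "language model" here is just smoothed word counts over a tiny corpus, where Aristo fine-tunes a large neural network over vast amounts of text. The corpus, question, and options are all invented for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the billions of words a model like Bert sees.
CORPUS = (
    "plants use sunlight to make food through photosynthesis "
    "photosynthesis converts sunlight water and carbon dioxide into sugar "
    "animals eat food to obtain energy"
).split()

# Unigram counts give a crude probability distribution over words.
counts = Counter(CORPUS)
total = sum(counts.values())

def score(question, answer):
    """Sum of smoothed log-probabilities of the question+answer words --
    a crude stand-in for the probability a language model assigns to the text."""
    words = (question + " " + answer).lower().split()
    return sum(math.log((counts[w] + 1) / (total + len(counts))) for w in words)

question = "How do plants make food"
options = ["through photosynthesis", "by eating animals",
           "by absorbing soil", "through respiration"]

# The question words contribute equally to every option, so only the
# answer words differentiate the scores.
best = max(options, key=lambda a: score(question, a))
print(best)  # → through photosynthesis
```

The point of the sketch is that nothing here "knows" botany; the winning option simply contains words that are frequent in the corpus.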


Example of an "inference solver" at work, one of several techniques used in the Allen Institute's Aristo program for answering the Regents exam.

Allen Institute for AI

As explained in the formal paper, "From 'F' to 'A' on the NY Regents Science Exams: An Overview of the Aristo Project," posted on the arXiv pre-print server, deep learning is not yet at the point of reasoning about science. Or much of anything else. (There's also a blog post with extensive additional materials.)

In the conclusion to the paper, written by Peter Clark, Oren Etzioni, Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld, and Michal Guerquin, the authors basically punt on the question of whether there's reasoning happening here. 

"Many of the benchmark questions intuitively appear to require reasoning to answer," they write, and then pose the question, "To what extent is Aristo reasoning to answer questions?"

Their answer sidesteps the question. "Today, we do not have a sufficiently fine-grained notion of reasoning to answer this question precisely," they write, adding, "but we can observe surprising performance on answering science questions." From that, the team concludes, "This suggests that the machine has indeed learned something about language and the world, and how to manipulate that knowledge, albeit neither symbolically nor discretely."

The word "learned" is colloquial, but it also has a technical sense in machine learning, and it is in that technical sense that the work is a stunning achievement, however you characterize it. The authors took Bert, which derives from a 2017 neural net that has taken the world by storm, called the "Transformer." Bert computes a probability distribution of the co-occurrence of words in phrases. Their "AristoBERT" modifies Bert in some interesting ways. They add a second piece of code, called an "information retrieval solver," which searches relevant documents to see if the actual words of the question and answer appear in those documents (using the "Elasticsearch" search toolkit).
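The retrieval idea can be illustrated with a minimal sketch. Instead of a real Elasticsearch index, a small in-memory list of sentences is searched, and relevance is reduced to simple word overlap; the documents, question, and scoring are all invented for illustration and are much cruder than the actual solver.

```python
import re

# Tiny in-memory "document store" standing in for an Elasticsearch index.
DOCUMENTS = [
    "The water cycle moves water between the oceans and the atmosphere.",
    "Evaporation occurs when liquid water changes into water vapor.",
    "Condensation forms clouds when water vapor cools in the atmosphere.",
]

def ir_score(question, answer):
    """Best overlap between the question+answer words and any one document:
    an answer scores well if some document contains both the question's
    words and the answer's words."""
    query = set(re.findall(r"[a-z]+", (question + " " + answer).lower()))
    return max(
        len(query & set(re.findall(r"[a-z]+", doc.lower())))
        for doc in DOCUMENTS
    )

question = "What happens when liquid water changes into vapor"
options = ["evaporation", "condensation", "precipitation", "erosion"]
scores = {a: ir_score(question, a) for a in options}
print(max(scores, key=scores.get))  # → evaporation
```

Again, no understanding is involved: the correct option wins because one stored sentence happens to contain both its word and most of the question's words.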

To Bert they add other things, such as a modified version of Bert that the University of Washington's Paul G. Allen School of Computer Science & Engineering and Facebook jointly developed and revealed earlier this year, called "RoBerta." RoBerta enhanced Bert by, among other things, dramatically expanding the training data set to 160 gigabytes of text from Wikipedia, news articles, and other sources, ten times the size of the original Bert's, and by running the training for longer. 


The rapid progress on question answering took a big leap with the addition of the latest natural language processing approaches such as Google's Bert. 

Allen Institute for AI.

The AristoBERT and AristoRoBERTa networks are combined with the IR solver and various other solver programs into an ensemble, and voilà, the result is pretty thrilling. The collection of programs got the right answer on eighth-grade science questions 91.6% of the time, and on 12th-grade subjects 83% of the time. "The momentum on this task has been remarkable," the authors observe, "with accuracy moving from roughly 60% to over 90% in just three years" over their prior efforts. 
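The ensemble idea itself is simple to sketch: each solver hands back a score per answer option, and the combination weights those scores and picks the top option. The weights and solver scores below are made up for illustration; the actual system tunes its combination on held-out questions rather than using fixed hand-picked weights.

```python
# Sketch of combining several solvers into an ensemble: a weighted sum of
# each solver's per-option scores, then argmax over the options.
def ensemble(per_solver_scores, weights):
    options = per_solver_scores[0].keys()
    combined = {
        opt: sum(w * scores[opt] for scores, w in zip(per_solver_scores, weights))
        for opt in options
    }
    return max(combined, key=combined.get)

# Hypothetical, already-normalized scores for options A-D from an IR solver
# and two Bert-style solvers (all numbers invented).
ir      = {"A": 0.10, "B": 0.60, "C": 0.20, "D": 0.10}
bert    = {"A": 0.05, "B": 0.70, "C": 0.15, "D": 0.10}
roberta = {"A": 0.05, "B": 0.80, "C": 0.10, "D": 0.05}

print(ensemble([ir, bert, roberta], weights=[0.2, 0.4, 0.4]))  # → B
```

Weighting the neural solvers more heavily reflects the article's observation that Bert and RoBerta contributed most of the accuracy gain.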

As elegant as this collage of programs is, it is not reasoning, per se. If it were compared to a human activity, it would be more like memorizing billions of lines of writing, keeping associations of words and phrases in one's head, and then recalling all of that with a high degree of reliability. It's rather like a super-cramming session. The addition of Bert and RoBerta, in particular, boosted the system's question-answering accuracy well above what the scientists were able to achieve with the simpler solver programs that just search for answers. That means having a more extensive probability model of the co-occurrence of words was the big element at work here. 
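What a "probability model of the co-occurrence of words" means can be made concrete with one classic statistic, pointwise mutual information (PMI): how much more often two words appear together than chance would predict. This is an illustrative toy, not what Bert computes internally; the sentences are invented, and PMI is just one simple member of the co-occurrence family.

```python
import math
from collections import Counter
from itertools import combinations

# Toy corpus; co-occurrence is counted at the sentence level.
SENTENCES = [
    "energy is conserved in a closed system",
    "mass and energy are related by e equals m c squared",
    "kinetic energy depends on mass and velocity",
    "the cat sat on the mat",
]

word_count = Counter()
pair_count = Counter()
for s in SENTENCES:
    words = set(s.split())
    word_count.update(words)
    pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))

n = len(SENTENCES)

def pmi(a, b):
    """log of how much more often a and b co-occur than independence predicts;
    positive means the words are associated in the corpus."""
    p_ab = pair_count[frozenset((a, b))] / n
    return math.log(p_ab / ((word_count[a] / n) * (word_count[b] / n)))

# "energy" and "mass" share sentences, so their PMI is positive.
print(round(pmi("energy", "mass"), 2))  # → 0.29
```

A model built from statistics like this can tell you that "energy" and "mass" travel together in text, but, as the next paragraph notes, it cannot say why.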

What Aristo can't do is explain why some students can do very little studying, with far less text, and yet still figure out the answers on the day of the test. Unlike humans, there's no indication Aristo could, say, write an essay exploring what was learned about science. Aristo couldn't even say why the word "energy" and the equation "E = mc²" show up in conjunction in many texts, though it could give you a measure of the "density" with which those expressions co-occur in millions of documents. 

There's no science knowledge here, then; there's a domain-specific map of language that's cutting edge. And that's certainly an accomplishment of which the authors can be proud. It may have important utility in business uses of A.I. Think of all the interactions requiring completion of standard processes in the front office and back office of a corporation that could be automated if the right vocabulary can be accurately and swiftly presented.  

More profoundly, Aristo may someday take over the entire test-taking activity, freeing the poor student from one of the most boring activities in all of education. 
