AI ain't no A student: DeepMind nearly flunks high school math

Researchers at Google's DeepMind built two different kinds of state-of-the-art neural nets to see if they could be trained to answer high school math problems. The result was an E grade, and a failure to correctly add up a string of more than six 1s.
Written by Tiernan Ray, Senior Contributing Writer

Do you know the answer to the following problem in arithmetic? 

What is the sum of 1+1+1+1+1+1+1?

If you said "seven," you're right. And you're also better at math than state-of-the-art deep learning neural networks. 

AI researchers from Google's DeepMind this week published research in which they attempted to train neural networks to solve basic problems in arithmetic, algebra and calculus, the kinds of problems on which a high school student would typically be tested.

But the neural networks didn't fare too well. In addition to incorrectly guessing six as the answer to the above question, the neural networks got just 14 of 40 questions correct on a standard test. 

That's the equivalent of an E grade for a sixteen-year-old in the British school system, the researchers note.

Basically, at this point, AI is having a hard time really learning any basic math. 

The paper, "Analysing Mathematical Reasoning Abilities of Neural Models," was created as a benchmark test set upon which others can build in order to develop neural networks for math learning, similar to how ImageNet was created as an image recognition benchmark test. 

The paper, authored by David Saxton, Edward Grefenstette, Felix Hill and Pushmeet Kohli of DeepMind, is posted on the arXiv preprint server. (See also comments by reviewers on OpenReview.)

Citing noted neural net critic Gary Marcus of NYU, the authors refer to the famous "brittleness" of neural networks, and argue for investigation into why humans are better able to perform "discrete compositional reasoning about objects and entities, that 'algebraically generalize'." 

They propose that a diverse set of math problems should push neural networks into acquiring such reasoning, which includes things like "Planning (for example, identifying the functions in the correct order to compose)" when a math problem has parts that may or may not be associative or distributive or commutative. 


Grading on a curve: chart of performance of different neural nets on various kinds of questions, with the best accuracy at the top, for questions about "place value," such as which number is in the "tens" place in a long number; and worst accuracy at the bottom, for "base conversion," meaning, convert a given number from base 2, say, to base 16.


"It should be harder for a model to do well across a range of problem types (including generalization, which we detail below)," they write, "without possessing at least some part of these abilities that allow for algebraic generalization." Hence, the data set.

They came up with a slew of questions — none of them involving geometry, and none of them being verbal questions — of the following sort: 

Solve -42*r + 27*c = -1167 and 130*r + 4*c = 372 for r. 

Answer: 4
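As a quick check of the sample question, the two equations can be solved exactly with Cramer's rule. This is only an illustrative sketch: the helper name `solve_2x2` is hypothetical, and it is not how the paper's data set was generated or scored.

```python
from fractions import Fraction

def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*r + b1*c = c1 and a2*r + b2*c = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("system has no unique solution")
    # Exact rational arithmetic avoids floating-point rounding.
    r = Fraction(c1 * b2 - c2 * b1, det)
    c = Fraction(a1 * c2 - a2 * c1, det)
    return r, c

# -42*r + 27*c = -1167 and 130*r + 4*c = 372
r, c = solve_2x2(-42, 27, -1167, 130, 4, 372)
print(r, c)  # 4 -37
```

Substituting back confirms it: -42*4 + 27*(-37) = -1167 and 130*4 + 4*(-37) = 372.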

The authors synthesized the data set, rather than crowd-sourcing it, because it's easy that way to get a large number of examples. They submitted the questions to the machine as sentences in "freeform," so that they were not given to the computer in any way that would make the parsing of the question easier, such as a "tree" or "graph" data form. 

The basis for the questions was "a national school mathematics curriculum (up to age 16), restricted to textual questions (thus excluding geometry questions), which gave a comprehensive range of mathematics topics that worked together as part of a learning curriculum." They enhanced that basic curriculum, they write, with questions that "offer good tests for algebraic reasoning."

To train a model, they could have given some neural net math abilities, they note, but the whole point was to have it start from nothing and build up a math ability. Hence, they went with more or less standard neural networks. 

"We are interested here in evaluating general purpose models, rather than ones with their mathematics knowledge already inbuilt," they write. 

"What makes such models (which are invariably neural architectures) so ubiquitous from translation to parsing via image captioning is the lack of bias these function approximators present due to having relatively little (or no) domain-specific knowledge encoded in their design."

The authors constructed two different kinds of "state of the art" neural networks to parse, embed, and then answer these questions. One was a "long short-term memory," or LSTM, neural network, which excels in handling sequential types of data, developed by Sepp Hochreiter and Jürgen Schmidhuber in the 1990s. 

They also trained the so-called "Transformer," a more recent style of neural network developed at Google that relies on attention rather than recurrence and has become increasingly popular for a variety of tasks such as embedding sequences of text for processing natural language.

And they gave the neural nets some time to think, as "it may be necessary for the models to expend several computation steps integrating information from the question.

"To allow for this, we add additional steps (with zero input) before outputting the answer."

The results were so-so. For example, back to the question at the start of this article, basic addition failed once more than six 1s had to be added together. The authors write that they "tested the models on adding 1 + 1 + · · · + 1, where 1 occurs n times.

"Both the LSTM and Transformer models gave the correct answer for n ≤ 6, but the incorrect answer of 6 for n = 7 (seemingly missing one of the 1s), and other incorrect values for n > 7."
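To make the probe concrete, here is a minimal sketch of how such repeated-ones questions could be generated as freeform text alongside their true answers. The function name and the "Calculate" wording are assumptions for illustration; the article does not give the exact phrasing used in the paper.

```python
def repeated_ones_question(n):
    """Build a freeform text question for adding n ones, plus its true answer."""
    expression = " + ".join(["1"] * n)
    # The "Calculate ..." phrasing is hypothetical, not taken from the paper.
    return f"Calculate {expression}.", str(n)

# Questions around the reported failure point (correct up to n = 6, wrong at n = 7).
for n in range(5, 9):
    question, answer = repeated_ones_question(n)
    print(question, "->", answer)
```

A trained model's text output for each question would then be compared, character for character, against the answer string.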

Why is that? As is often the case with neural networks, something else seems to be going on behind the scenes, because the networks were able to do fine when adding far larger numbers together in longer sequences, such as negative 34 plus 53 plus negative 936, etc., the authors observed. 

"We do not have a good explanation for this behaviour," they write. They hypothesize the neural nets are creating "sub-sums" as they parse the questions and operate on them, and that when they fail, it's because "the input is 'camouflaged' by consisting of the same number repeated multiple times."

In general, the neural nets did best at things such as finding the "place value" in a long number, like, say, picking out the "tens" place in a number such as 9343012. They were also fine at rounding decimal numbers and sorting sequences of numbers into size order.
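For reference, the tasks the networks handled best are short, direct computations when written conventionally. The sketch below uses a hypothetical helper, `digit_at_place`, and made-up inputs for the rounding and sorting examples; only the number 9343012 comes from the article.

```python
def digit_at_place(number, place):
    """Return the digit at 10**place (place=0 is units, place=1 is tens)."""
    return (abs(number) // 10 ** place) % 10

tens_digit = digit_at_place(9343012, 1)   # the "tens" place of 9343012
rounded = round(3.14159, 2)               # rounding a decimal number
ordered = sorted([7, -3, 12, 0])          # sorting numbers into size order

print(tens_digit)  # 1
print(rounded)     # 3.14
print(ordered)     # [-3, 0, 7, 12]
```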

The hardest problems for the system were "number-theoretic questions," such as factorization, breaking down numbers or other mathematical objects into constituent parts, and telling if a number is prime or not. But humans have trouble with those as well, they note, so it's no surprise. The other problems that tended to stump the machine were combinations of "mixed arithmetic," where all four operations are taking place. There, the machine's performance "drops to around 50%" accuracy. 

Why does the computer do fine when just adding or subtracting but get flummoxed when asked to combine all four operations? 

"We speculate that the difference between these modules in that the former can be computed in a relatively linear/shallow/parallel manner (so that the solution method is relatively easier to discover via gradient descent)," the authors muse, "whereas there are no shortcuts to evaluating mixed arithmetic expressions with parentheses."
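One way to picture the contrast the authors describe: a chain of additions and subtractions is a flat sum of signed terms that can be accumulated in any order, while a mixed expression with parentheses has to be evaluated from the innermost sub-expression outward. The sketch below is only an illustration of that structural difference, with a hypothetical nested expression and a hypothetical `evaluate` helper; it says nothing about what the networks do internally.

```python
from fractions import Fraction
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(node):
    """Recursively evaluate a nested (op, left, right) expression tree."""
    if isinstance(node, tuple):
        op, left, right = node
        return OPS[op](evaluate(left), evaluate(right))
    return Fraction(node)

# Flat case: signed terms can be summed in any order, or in parallel.
signed_terms = [-34, 53, -936]
flat_total = sum(signed_terms)

# Nested case: ((3 + 4) * (2 - 5)) / 7 needs each inner result first.
nested = ("/", ("*", ("+", 3, 4), ("-", 2, 5)), 7)
nested_value = evaluate(nested)

print(flat_total)    # -917
print(nested_value)  # -3
```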

All in all, on the high school exam questions, a collection of real-world problems, the authors call the nets' E-grade showing a "disappointing" and "moderate" performance.

They conclude that while the Transformer neural net they built performs better than the LSTM variant, neither of the networks is doing much "algorithmic reasoning," and "the models do not learn to do any algebraic/algorithmic manipulation of values, and are instead learning relatively shallow tricks to obtain good answers on many of the modules."

Still, there's now a data set, and that's a baseline, they hope, upon which others will join them in training more kinds of networks. The data set is easily extended, they note, which should let researchers go right up to university-level math. 

Hopefully, by then, neural nets will have learned to add past six.
