The data that trains AI increasingly calls into question AI

After 10 years of ImageNet, AI researchers are digging into the details of test sets and some are asking just how much knowledge has really been created with machine learning.
Written by Tiernan Ray, Senior Contributing Writer

It's been 10 years since two landmark data sets appeared in the world of machine learning: ImageNet and CIFAR-10, collections of pictures that have been used to train untold numbers of deep learning neural networks for computer vision. The longevity of the data has prompted some AI researchers to ask what goes on within those data sets, and what their long life says about machine learning in the bigger picture.

As a result, 2019 may mark the year the data indicted some of the fundamental beliefs about AI.

Researchers in machine learning are getting much more specific and rigorous about understanding how the choice of data affects the success of neural networks.

And the results are somewhat unsettling. Recent work suggests at least some of the success of neural networks, including state-of-the-art deep learning models, is tied to small, idiosyncratic elements of the data used to train those networks.

Exhibit A is a study put out in February and revised in June by Benjamin Recht and colleagues at UC Berkeley, with the amusing title "Do ImageNet Classifiers Generalize to ImageNet?"

They tried to reconstruct ImageNet, in a sense, by duplicating the process of gathering images from Flickr and curating them, having people on Amazon's Mechanical Turk service look at the images and assign labels.


The original screen from back in 2009 instructing Amazon Mechanical Turk workers to pick images that fit with labels. It kicked off a decade of development of more and more advanced computer vision neural networks.

Stanford University.

The goal was to create a new "test" set of images, a set that's like the original group of pictures, but never seen before, to see how well all the models that have been developed on ImageNet in the past decade generalize to new data.

The results were mixed. The various deep learning image recognition models that followed one another in time, such as the classic "AlexNet" and, later, more-sophisticated networks such as "VGG" and "Inception," still showed improvement from generation to generation. In fact, on this new test set, levels of improvement were actually amplified.


However, the best performing networks showed sharp declines in accuracy from the level they had achieved on the original ImageNet.

"All models see a large drop in accuracy from the original test sets to our new test sets," wrote Recht and colleagues. A drop of six percentage points, in one case, was equivalent to losing "five years of progress in a very active period of machine learning research," they wrote, meaning the progress that happened from 2013 to 2018.


Results from Recht et al. 2019 showing that numerous deep learning models developed over the years fail to maintain their levels of accuracy when tested on new data that's supposed to be similar to the original ImageNet and CIFAR test sets.

UC Berkeley

What's going on? When Recht and colleagues varied the sample of images in the test set, they found that accuracy improved when the models were tested against images that were more common, in the sense that they were picked more often by the human annotators whose job on Mechanical Turk was to do the labeling. From that, the authors conclude that "current ImageNet models still have difficulty generalizing from 'easy' to 'hard' images." The neural networks, in other words, are achieving progress, but it falls short of the kind of robust ability one would hope would result from all that training.

Recht and colleagues write that "This brittleness puts claims about human-level performance into context," adding, "it also shows that current classifiers still do not generalize reliably even in the benign environment of a carefully controlled reproducibility experiment."


Recht's paper was followed a couple of months later by a similar kind of archeological study by NYU researcher Chhavi Yadav and Facebook AI researcher Léon Bottou. Called "Cold Case: The Lost MNIST Digits," the paper aims to reconstruct missing examples from an even older data set, MNIST, a collection of hand-written digits compiled back in the early 1990s. MNIST has been a mainstay of computer vision tests for decades, going back to when Facebook's Yann LeCun first developed the "convolutional neural network," the bedrock of today's machine learning models of image recognition.


The first sixteen digits of the original MNIST test of digit recognition, circa 1994.


The findings are similar in Yadav and Bottou's study, though the conclusions they draw seem more upbeat. Models fared worse on new data than they did on the traditional test, but there is still progress in machine learning.

Yadav and Bottou's study is actually a fascinating detective story. The original data set was split between 60,000 examples for training and 60,000 for testing the trained network, all of them hand-written digits produced by US Census Bureau workers and high school students. The goal was to train the convolutional neural network to correctly classify which digits are which. But the vast majority of the 60,000 test samples were discarded at the time, leaving only 10,000. So Yadav and Bottou had to reconstruct those missing 50,000 sample images. By recreating the "processing algorithms" used to make MNIST, and with a number of approaches such as tracking the "anti-aliasing" of pixels, they were able to trace each character back to its original writer. Thus was born a new test set of 50,000 characters that had never before been used for testing.


Unlike Recht's work, Yadav and Bottou's is closer to the original test, they write. Still, a whole bunch of neural networks stumble in the same way as in Recht's study. "The QMNIST50 error rates are consistently higher," they write, referring to the new test set. And, as in the Berkeley study, successive neural network models still showed improvement from one to the next. "Hence classifier ordering remains preserved."
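To make the two findings concrete, here is a minimal Python sketch of what "accuracy drops, but ordering is preserved" means. The model names and accuracy figures are hypothetical placeholders chosen for illustration; they are not numbers from either paper.

```python
# Hypothetical accuracies (percent), for illustration only.
# These are NOT figures reported by Recht et al. or Yadav and Bottou.
original = {"model_a": 57.0, "model_b": 72.0, "model_c": 78.0, "model_d": 83.0}
new_test = {"model_a": 44.0, "model_b": 60.0, "model_c": 67.0, "model_d": 74.0}


def ranking(scores):
    """Return model names sorted from most to least accurate."""
    return sorted(scores, key=scores.get, reverse=True)


# Drop in percentage points for each model when moving to the new test set.
drops = {name: original[name] - new_test[name] for name in original}

# The ordering is "preserved" if both test sets rank the models identically.
ordering_preserved = ranking(original) == ranking(new_test)

for name in ranking(original):
    print(f"{name}: {original[name]:.1f} -> {new_test[name]:.1f} "
          f"(drop of {drops[name]:.1f} points)")
print("ordering preserved:", ordering_preserved)
```

In this made-up example every model loses accuracy on the new test set, yet the model that was best on the old test is still best on the new one.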

While the loss of accuracy might be a concern, Yadav and Bottou conclude on a hopeful note that "Although the practice of repeatedly using the same testing samples impacts the absolute performance numbers, it also delivers pairing advantages that help model selection in the long run."
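One way to read the "pairing advantages" the authors mention is that when two models are scored on exactly the same samples, the comparison comes down to the samples where they disagree. The sketch below illustrates that reading with hypothetical per-sample results; the data and the function name `paired_counts` are assumptions made for this example, not anything taken from the paper.

```python
# Hypothetical per-sample outcomes on ONE shared test set
# (1 = classified correctly, 0 = misclassified). Illustrative data only.
model_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
model_b = [1, 1, 1, 1, 0, 1, 1, 1, 1, 0]


def paired_counts(a, b):
    """Count samples that only model A, or only model B, got right."""
    only_a = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    only_b = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return only_a, only_b


only_a, only_b = paired_counts(model_a, model_b)
n = len(model_a)
acc_a = sum(model_a) / n
acc_b = sum(model_b) / n

# Samples both models got right, or both got wrong, cancel out of the
# comparison: the accuracy gap equals the gap in disagreements.
print(f"accuracy A = {acc_a:.2f}, accuracy B = {acc_b:.2f}")
print(f"only A correct: {only_a}, only B correct: {only_b}")
print(f"accuracy gap from disagreements alone: {(only_b - only_a) / n:.2f}")
```

Because samples that are equally hard for both models drop out, a shared test set can rank models reliably even if the absolute scores it produces are optimistic.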


Similar to Recht et al., Yadav and Bottou's study of performance by vision systems on a new version of MNIST data shows diminished accuracy, but the progress from one system to another is maintained.

NYU, Facebook AI

If a neural net isn't fully generalizing the way it should, what is it doing? Another recent study points out how deep learning can pick and choose elements of the data that are very idiosyncratic. Researchers at the Allen Institute and the Paul Allen School of Computer Science and Engineering at the University of Washington in May unveiled "Grover," a natural language processing neural network that can detect the fake writing created by another neural net. One of the key findings of that report was that neural nets leave a trace or signature in the way they predict word combinations -- what the authors call "artifacts."

Basically, the part of an NLP system that chooses words to assemble into sentences leaves out a vast terrain of less common words, known as the "tail" of the vocabulary. By understanding the distribution of language, the patterns by which the neural net operates start to come into sharp relief -- at least, to the computer. State-of-the-art natural language systems can fool human readers, the Allen team found. But underneath the covers, what these programs are doing is something less than human writing. AI is not really finishing your sentence, as some would like to believe, so much as creating impressive mash-ups of words.
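A rough sketch of how the tail gets cut off, under stated assumptions: the words and probabilities below are invented, and top-k truncation is used here as one simple, widely used way of dropping unlikely words. The article does not say which truncation method the generators in the Grover study used, so treat this as an illustration of the general idea only.

```python
# Hypothetical next-word distribution; words and probabilities are made up.
next_word_probs = {
    "the": 0.40,
    "a": 0.25,
    "this": 0.15,
    "yonder": 0.12,
    "aforesaid": 0.08,
}


def truncate_top_k(probs, k):
    """Keep the k most likely words, renormalize, and zero out the tail."""
    kept = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[w] for w in kept)
    return {w: (probs[w] / total if w in kept else 0.0) for w in probs}


truncated = truncate_top_k(next_word_probs, k=3)

# Words in the tail can never be generated after truncation, which is the
# kind of statistical fingerprint a detector could learn to spot.
tail_words = [w for w, p in truncated.items() if p == 0.0]
print("never generated:", tail_words)
```

A human writer will occasionally reach for a rare word; a generator that has had its tail removed never will, and over a long enough text that absence becomes measurable.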

At the end of the day, praise or blame for what neural nets do may come back to the objectives set for neural nets by researchers. That's the sense one gets from a paper published in Nature Communications in March, written by Zhenglong Zhou and Chaz Firestone of Johns Hopkins University's Department of Psychological & Brain Sciences.


Zhou and Firestone examined what happens when convolutional neural networks are subjected to "adversarial images," such as an object being inserted into a picture and throwing off the classifier's guess about what the picture is. One thing they found is that when classifiers miss, and pick the wrong label for an image, it is in some sense a result of the fact that the computer is not being allowed to fully express what it sees when an image is perturbed by adversarial changes.

"Machine-vision systems also don't engage in 'free classification,'" they write. "They simply pick the best label in their provided vocabulary." It's the task, in other words, that shapes the neural network, perhaps constraining the intelligence of what it could otherwise express.

As the authors write in their conclusion, "Whereas humans have separate concepts for appearing like something vs. appearing to be that thing -- as when a cloud looks like a dog without looking like it is a dog, or a snakeskin shoe resembles a snake's features without appearing to be a snake, or even a rubber duck shares appearances with the real thing without being confusable for a duck -- CNNs are not permitted to make this distinction, instead being forced to play the game of picking whichever label in their repertoire best matches an image (as were the humans in our experiments)."
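The "forced choice" the authors describe can be sketched in a few lines of Python. The label list and the scores below are hypothetical; the point is only that a classifier built this way must return one of its labels, even when none of them fits well.

```python
import math

# Hypothetical fixed vocabulary of labels; a real classifier has many more.
LABELS = ["dog", "cloud", "snake", "duck"]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(scores):
    """Return the single best label and its probability.

    There is no option to answer "none of these" or "it merely
    looks like a snake": the best-scoring label always wins.
    """
    probs = softmax(scores)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs[best]


# Nearly flat scores: the model has no strong opinion about this image.
label, confidence = classify([0.2, 0.1, 0.3, 0.15])
print(label, round(confidence, 3))
```

Even with nearly flat scores, the function still answers "snake." The output format leaves no room for the distinction between looking like something and being it.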

None of this work refutes progress in machine learning, of which the benefits are demonstrable. But it does call into question a lot of assumptions about what's going on under the covers, and what is intelligence versus what is merely superb engineering and science.
