IBM’s AI flies back and forth through time in Flappy Bird
IBM proposes a mash-up of deep learning approaches to make continuous learning possible in the game Flappy Bird. A key insight is optimizing the functions of neural networks both forward and backward in time.
The smartphone video game Flappy Bird was pulled from app stores in 2014 by its creator, Dong Nguyen, who said it had become too addictive. But the program lives on as an inspiration to deep learning researchers.
Specifically, International Business Machines scientists this week unveiled research into how machines can continually learn tasks, including playing Flappy Bird, improving over time rather than learning one level of play and stopping at that.
Known as lifelong learning, or continuous learning, the area has been studied for decades but remains a formidable research challenge.
Aside from offering an important new tool for AI, the work is something of a meditation on what it means for learning to take place both forward and backward in time.
Flappy Bird was one of their chief tests. In that game, you have to fly the little animated bird safely through a collection of pillars. The IBM researchers defined each change in the aspect of the game, such as the height of the pillars, as a novel task. Neural networks then have to extrapolate from one task to the next by maximizing what has already been learned in prior tasks.
Called Meta-Experience Replay, or MER, the approach is a mash-up of a couple of prior approaches in the neural network literature.
The work, Learning To Learn Without Forgetting By Maximizing Transfer And Minimizing Interference, was written by a group from IBM, MIT, and Stanford University, consisting of Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro, and is posted on the arXiv pre-print server. The paper is being presented at the International Conference on Learning Representations, happening in May.
The problem that occurs in continuous learning has been studied for decades. It was formulated by researchers Gail Carpenter and Stephen Grossberg in 1987. It's called the stability-plasticity dilemma. An artificial intelligence system, they wrote, needs to be "capable of plasticity in order to learn about significant new events, yet it must also remain stable in response to irrelevant or often repeated events."
In other words, according to Riemer and his team, the weights of a deep learning network must be developed in a way that preserves and extends what's optimized at each point in time. The goal is to minimize interference, the disruption of what's been learned, and at the same time maximize future learning by allowing weights to change based on new information.
To do it, the authors mixed together two strains of weight optimization: One called experience replay, and one called Reptile.
In the first case, they build on code developed by Facebook researchers David Lopez-Paz and Marc'Aurelio Ranzato in 2017, called Gradient Episodic Memory for Continual Learning, or GEM. GEM uses various techniques to prevent the erasure of past weights and ensure stability.
Reptile, on the other hand, developed last year by Alex Nichol, Joshua Achiam and John Schulman of OpenAI, focuses on how to carry forward learning on past tasks to help the learning of new tasks as they are encountered, a form of transfer learning.
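The core of Reptile can be sketched in a few lines: take several ordinary gradient steps on a task, then move the starting weights only part of the way toward where those steps ended up. The Python below is a minimal sketch under stated assumptions, not OpenAI's code. The toy linear model, the `linear_grad` helper, and the parameter names `inner_lr` and `meta_lr` are all illustrative.

```python
def linear_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 for a toy linear model (illustrative)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def sgd_steps(weights, examples, lr, grad_fn):
    """Take one plain SGD step per example and return the adapted copy."""
    w = list(weights)
    for x, y in examples:
        g = grad_fn(w, x, y)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def reptile_update(weights, examples, inner_lr, meta_lr, grad_fn):
    """Move the weights a fraction meta_lr toward the SGD-adapted weights."""
    adapted = sgd_steps(weights, examples, inner_lr, grad_fn)
    return [w + meta_lr * (a - w) for w, a in zip(weights, adapted)]


# Hypothetical two-example "task" for the toy model.
task = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)]
new_weights = reptile_update([0.0, 0.0], task, inner_lr=0.5, meta_lr=0.5,
                             grad_fn=linear_grad)
print(new_weights)
```

Setting `meta_lr` to 1.0 in this sketch reduces the update to ordinary SGD. Smaller values keep the weights closer to where they started.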
The challenge of the stability-plasticity dilemma is to reconcile past and present weight selections. The key is that the gradients, the weight-update directions computed for each sample of data, should be additive. Each update should lead to better weight selections at any point in time, neither detracting from what's been developed nor holding back weight improvement down the line.
The authors decided that GEM and Reptile are limited in the sense that they are only concerned with one direction of time.
GEM wants to preserve the past by protecting past weights, and Reptile wants to change weights only at the moment new examples are learned. What's needed instead, argue Riemer and colleagues, is a notion of symmetry, where the value of weights is improved to an extent in both directions of time.
"In our work we try to learn a generalizable theory about weight sharing that can learn to influence the distribution of gradients not just in the past and present, but in the future as well."
It's a matter of "aligning" the gradients "and thus weight sharing," they write, "across examples arises [sic] both backward and forward in time."
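One way to picture what "aligning" gradients means is to compare the gradients that two examples produce at the same weights. The Python sketch below assumes a toy linear model with squared-error loss. The `linear_grad`, `gradient_cosine`, and `relation` names are illustrative and do not come from the paper. A positive cosine means a step on one example also lowers the loss on the other (transfer), and a negative cosine means the step raises it (interference).

```python
import math


def linear_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 for a toy linear model (illustrative)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def gradient_cosine(g1, g2):
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(g1, g2))
    n1 = math.sqrt(sum(a * a for a in g1))
    n2 = math.sqrt(sum(b * b for b in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


def relation(cosine):
    """Label the sign of the alignment between two gradients."""
    if cosine > 0:
        return "transfer"
    if cosine < 0:
        return "interference"
    return "neutral"


w = [0.0, 0.0]
g_old = linear_grad(w, [1.0, 1.0], 1.0)    # a past example
g_agree = linear_grad(w, [1.0, 1.0], 2.0)  # pulls the weights the same way
g_clash = linear_grad(w, [1.0, 1.0], -1.0) # pulls the weights the opposite way
print(relation(gradient_cosine(g_old, g_agree)))
print(relation(gradient_cosine(g_old, g_clash)))
```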
"We would like to influence gradient angles from all tasks at all points in time," rather than for a single point in time, they write.
To find a kind of ideal gradient descent, they "interleave" examples from the past with each new example of data, taken one at a time, and use an objective function that optimizes the gradient over current and past examples.
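The interleaving idea can be sketched as a small memory of past examples plus a Reptile-style update applied once per incoming example. The Python below is a simplified sketch, not the authors' full algorithm. It assumes a toy linear model, a memory filled by reservoir sampling, and a single interpolation step. All names (`ReservoirBuffer`, `interleaved_step`, `inner_lr`, `meta_lr`) are illustrative.

```python
import random


class ReservoirBuffer:
    """Fixed-size memory; reservoir sampling keeps a uniform sample of the stream."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.items = []
        self.seen = 0

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        return self.rng.sample(self.items, min(k, len(self.items)))


def linear_grad(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 for a toy linear model (illustrative)."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]


def interleaved_step(weights, new_example, buffer, k, inner_lr, meta_lr):
    """Replay up to k past examples alongside the new one, then interpolate."""
    # Placing the new example last is a choice made for this sketch.
    sequence = buffer.sample(k) + [new_example]
    adapted = list(weights)
    for x, y in sequence:  # one example at a time, no mini-batches
        g = linear_grad(adapted, x, y)
        adapted = [wi - inner_lr * gi for wi, gi in zip(adapted, g)]
    buffer.add(new_example)
    # Reptile-style move from the starting weights toward the adapted weights.
    return [w + meta_lr * (a - w) for w, a in zip(weights, adapted)]


# Hypothetical stream: the "task" switches halfway with no signal to the learner.
rng = random.Random(0)
buffer = ReservoirBuffer(capacity=20, rng=rng)
weights = [0.0, 0.0]
stream = [([1.0, 0.0], 2.0)] * 100 + [([0.0, 1.0], -1.0)] * 100
for example in stream:
    weights = interleaved_step(weights, example, buffer, k=3,
                               inner_lr=0.1, meta_lr=0.5)
print(weights)
```

In this toy stream the two tasks touch different weights, so replay cannot hurt. The interesting cases are the ones where old and new examples pull on the same weights.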
The authors tested their approach on two different neural network benchmark tests. One is a version of the traditional "MNIST" data set of handwritten digits, derived from data collected by the National Institute of Standards and Technology. The goal is to identify labeled examples of digits written in a variety of forms and through permutations such as rotation.
The second test is the Flappy Bird test, using a reinforcement learning approach, based on an existing kind of neural network known as a Deep Q Network, or DQN.
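For readers unfamiliar with Q-learning, the update rule that a DQN approximates with a neural network can be sketched in tabular form. The Python below is an illustration only: a DQN replaces the table with a network, and the state and action names ("gap_ahead", "flap", and so on) are hypothetical rather than taken from the paper.

```python
def q_learning_update(q, state, action, reward, next_state, done, alpha, gamma):
    """One-step Q-learning: nudge Q(state, action) toward the bootstrapped target."""
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


# Hypothetical two-state table for a Flappy Bird-like game.
q_table = {
    "gap_ahead": {"flap": 0.0, "glide": 0.0},
    "in_gap": {"flap": 1.0, "glide": 0.5},
}
updated = q_learning_update(q_table, "gap_ahead", "flap", reward=1.0,
                            next_state="in_gap", done=False,
                            alpha=0.5, gamma=0.9)
print(updated)
```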
In both cases, the authors cite superior accuracy scores in relation to benchmarks, especially compared to Lopez-Paz and Ranzato's GEM.
The DQN equipped with MER, they write, "becomes a Platinum player on the first task when it is learning the third task" in Flappy Bird.
"DQN-MER exhibits the kind of learning patterns expected from humans for these games, while a standard DQN struggles to generalize as the game changes and to retain knowledge over time," they write.
On top of moving backward and forward across gradients, from past to future, there are a couple of noteworthy items in this work.
For one thing, the neural nets deal with the fact that successive tasks come from different distributions of data, a property known as "non-stationarity." That makes it harder for the networks to generalize. Unlike in some other settings, the neural networks constructed in this case have no explicit signal that each new task is, in fact, new. The rules of the game change and the network simply adapts.
What's more, rather than being processed in batches, as is common in most neural networks, each new example of data is processed on its own, one at a time. That has important implications for being able to learn from sparse signals in the data.
Two important questions remain for the work. One is whether the diversity of tasks in something like Flappy Bird is challenging enough. IBM's Riemer responded in an email to ZDNet that the work will take on more diverse sets of tasks over time.
"We are excited to try it out on more expansive and diverse collections of tasks in the future," says Riemer.
At the same time, he argues the subtlety of tasks here is valuable. "Considering subtle non-stationarities in environment conditions can be interesting and revealing as well," he says. "When non-stationarities in the environment are very severe, it can make it easy for models to detect them. As a result, noticing more subtle changes can sometimes reflect a more refined ability to adapt to changing environment conditions."
Second, the task of Flappy Bird is a "toy" problem, rather than a real-world challenge. Riemer says the team aims to broaden its work to encompass deeper challenges in the future. They have "recently been exploring environments that are even more non-stationary both in terms of containing a large amount of more diverse 'tasks' and in terms of having fewer examples per 'task'."
There's a lot to be learned from simple problems, says Riemer. At the same time, "the interest of our team at IBM is certainly to test the limits of these capabilities and build AI solutions that can eventually be used to solve real business problems for our customers."