An AI to make practical decisions and to play Flappy Bird
Babak Hodjat and his lean team of machine learning scientists at Cognizant believe they’ve shown a form of evolutionary computation that can transform corporate decision making. Winning at Flappy Bird is just the start.
The science of applied artificial intelligence doesn't get the same kinds of headlines as the pure research efforts of Google or Facebook or others. Mostly that's because what gets built by companies is obfuscated by those same companies, either for proprietary reasons or because the companies actually have nothing much to speak of.
Now and then, something breaks through. Last week, Babak Hodjat, who runs the machine learning operations of software giant Cognizant Technology Solutions, had something to show, so ZDNet traveled to the loft office near San Francisco's Embarcadero where Hodjat and a team of 18 staffers develop algorithms.
The ostensible event was the publication, on the arXiv pre-print server, of a paper showing how Hodjat's style of machine learning could compete with the kind made famous by DeepMind's AlphaZero.
Before digging into the paper, ZDNet accepted a challenge against the machine, a game of Flappy Bird.
The side-scroller, once a wildly addictive smartphone game, lives on in a PC version that is part of the open-source PyGame Learning Environment created by Norman Tasfi. The toolkit is a laboratory that machine learning scientists use to train their programs on "reinforcement learning" problems. The game of Go was famously mastered by DeepMind's AlphaZero through a cutting-edge reinforcement learning approach. While Flappy Bird is nowhere near as complex as Go, it's a good starting place for any AI project in reinforcement learning. (Other companies, including IBM, have explored Flappy Bird as a testbed for AI.)
The bot-controlled bird sailed through the game. It had been trained by Hodjat's evolutionary algorithm.
The journalist didn't fare as well. On the screen, ZDNet got through only three pipes before the game was over. Another apparent failure of person versus machine.
"Not bad," Hodjat graciously commended the journalist.
Hodjat and his team came to Cognizant just over a year ago, when Cognizant bought some of the intellectual property of his 11-year-old startup, Sentient Technologies, as described in a ZDNet article last year. In what's known as "evolutionary computation," Sentient's approach runs many candidate algorithms, including conventional artificial neural networks, in parallel and scores each one for "fitness," in order to select the best network to perform a task.
The insight in the paper published last week, titled Effective Reinforcement Learning through Evolutionary Surrogate-Assisted Prescription, is that many business problems can be articulated in a way that makes evolutionary computation applicable for the first time. The paper is something of a coming-out party for the evolutionary approach. Making it work has required Hodjat and his colleagues to refashion what reinforcement learning is.
In standard reinforcement learning, a neural net develops what's called a value function, an estimate of future reward that the net tries to maximize. The value function, in turn, drives the formulation of a policy function that determines what moves the program should make to achieve the value function's goal. The metaphor is that of having a destination (value) and a myriad of ways to get there (policy).
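The relationship between value and policy can be illustrated with a toy problem. What follows is a minimal sketch, not anything from the paper: it assumes a hypothetical five-cell corridor in which reaching the last cell pays a reward of 1, and the function names, the discount of 0.9, and the table-based value function (in place of a neural net) are all choices made for illustration.

```python
def backup(values, state, step, discount):
    """Reward plus discounted value of the cell reached by moving `step`."""
    last = len(values) - 1
    nxt = min(last, max(0, state + step))
    reward = 1.0 if nxt == last else 0.0
    return reward + discount * values[nxt]

def value_iteration(n_states=5, discount=0.9, sweeps=50):
    """Learn a value for every non-terminal cell by repeated backups."""
    values = [0.0] * n_states  # the terminal cell keeps a value of 0
    for _ in range(sweeps):
        for s in range(n_states - 1):
            values[s] = max(backup(values, s, step, discount) for step in (-1, 1))
    return values

def greedy_policy(values, discount=0.9):
    """Derive the policy: in each cell, pick the move the values rate highest."""
    return [
        max((-1, 1), key=lambda step, s=s: backup(values, s, step, discount))
        for s in range(len(values) - 1)
    ]

values = value_iteration()
policy = greedy_policy(values)
print(values)
print(policy)
```

The values rise toward the goal cell, and the policy that falls out of them is simply "move right" everywhere, which is the destination-and-route metaphor in miniature.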
In Hodjat and the team's formulation, called "Evolutionary Surrogate-Assisted Prescription," or ESP, the value function and the policy function are re-crafted as "predictor" and "prescriptor," respectively. The predictor is trained on historical data about solutions to a problem, based on a company's experience. The prescriptor is "evolved" through a process of exploration of different designs, which are tested against the predictions of the predictor to select one of perhaps a hundred possible prescriptors that is the most fit of all.
The crucial distinction with ESP, versus the kind of reinforcement learning that DeepMind does, is that the predictor is a "surrogate" that guides the prescriptor without constant data from the game. It's a closed loop during the training phase. It's as if DeepMind's AlphaZero processed game data but didn't play any games of Go. Then, once every round, as a new best prescriptor is created, the program breaks out of the closed loop and tests that chosen prescriptor by subjecting it to some "real world" data from the task environment.
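The loop described above can be sketched in a few dozen lines. This is a toy under loud assumptions, not Cognizant's implementation: the "real world" is a made-up function that rewards matching the action to the context, the predictor is a simple nearest-neighbor average, the prescriptor is a two-parameter linear rule, and the population of 20 is smaller than the hundred or so candidates mentioned above. Every name here is hypothetical.

```python
import random

def environment(context, action):
    """Hypothetical real world: best outcome when action equals context."""
    return -(action - context) ** 2

class Predictor:
    """Surrogate: averages the outcomes of the k most similar past decisions."""
    def __init__(self, k=3):
        self.k = k
        self.rows = []

    def fit(self, rows):
        self.rows = list(rows)

    def predict(self, context, action):
        nearest = sorted(
            self.rows,
            key=lambda r: (r[0] - context) ** 2 + (r[1] - action) ** 2,
        )[: self.k]
        return sum(r[2] for r in nearest) / len(nearest)

def prescribe(weights, context):
    """Prescriptor: linear rule from context to action, clipped to [0, 1]."""
    w, b = weights
    return min(1.0, max(0.0, w * context + b))

def surrogate_fitness(weights, predictor, contexts):
    return sum(predictor.predict(c, prescribe(weights, c)) for c in contexts) / len(contexts)

def real_fitness(weights, contexts):
    return sum(environment(c, prescribe(weights, c)) for c in contexts) / len(contexts)

def evolve(population, predictor, contexts, rng, generations=10):
    """Closed loop: candidates are scored against the predictor only."""
    def fitness(weights):
        return surrogate_fitness(weights, predictor, contexts)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: len(ranked) // 4]
        children = [
            tuple(gene + rng.gauss(0, 0.1) for gene in rng.choice(parents))
            for _ in range(len(ranked) - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), population

def run_esp(rounds=3, seed=0):
    rng = random.Random(seed)
    history = []  # "historical data": past decisions made at random
    for _ in range(50):
        c, a = rng.random(), rng.random()
        history.append((c, a, environment(c, a)))
    contexts = [i / 10 for i in range(11)]
    population = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(20)]
    predictor = Predictor()
    best = population[0]
    for _ in range(rounds):
        predictor.fit(history)
        best, population = evolve(population, predictor, contexts, rng)
        # Break out of the closed loop: try the best prescriptor for real
        # and add what happened to the predictor's training data.
        for c in contexts:
            a = prescribe(best, c)
            history.append((c, a, environment(c, a)))
    return best, real_fitness(best, contexts)

best, score = run_esp()
print(best, score)
```

In this sketch the environment is consulted only 11 times per round, while the predictor is queried thousands of times, which is the sense in which a surrogate can make training sample efficient.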
The surrogate approach provides some novel benefits to ESP. It allows the training of the program to be "sample efficient," in the sense that the program doesn't need as much input from the real world. The signals become "sparse," so big data is not as essential. That could conceivably be an advantage in real problems where trial and error is costly (such as perhaps self-driving vehicles).
It also allows for what Hodjat said is a "regularization" of the prescriptor, whereby the prescriptor is a more general kind of solution to a problem, therefore able to be applied, perhaps, to a wider selection of problems.
Moreover, by occasionally dipping into the real world, to test the prescriptor, the program can also change the objective for which it's striving. The real world can be a shifting landscape of different objectives that periodically re-train the ESP system. That may allow for a more nuanced form of reinforcement learning, with multiple goals versus the single objective of "winning," as in the game of Go. Such a nuanced perspective could be more realistic for business problems.
It was ESP that "learned," if you will, how to play Flappy Bird, with an input vector describing the game state serving as the training input for the prescriptors. In a surprising turn, a good solution to the game was found without using neural networks. Instead, ESP was able to use an older approach called "random forest," which combines many decision trees. ESP did better than the kind of neural net approach perfected by DeepMind, known as "deep reinforcement learning," in terms of time required to find a solution.
"We don't know exactly why we are able to do better than the typical approach," conceded Hodjat. "We hypothesize that it has to do with the surrogate and how it regularizes the approach."
Flappy Bird is a nice demo and a nice controlled example for research. But the set of real business problems that can be addressed by ESP is large, insists Hodjat.
"Decision tasks are everywhere -- in the millions, probably," said Hodjat. The formal definition is anything that has data to provide contexts and actions within those contexts, and then some outcome that is the goal.
Not all of those business problems are reinforcement learning problems, says Hodjat. In fact, "Most business decision systems are not RL because we can typically associate the outcome to the decision, even if it is lagged," in contrast to reinforcement learning, where the actions that lead to a successful outcome have to be figured out.
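One way to picture such a non-RL decision problem is as a table of records in which each decision carries its own outcome, filled in once it can be measured. The layout below is a hypothetical sketch: the field names follow the context, action and outcome vocabulary Hodjat uses, while the class name, the helper, and the sample values are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecisionRecord:
    context: dict                    # what was known when the decision was made
    action: dict                     # what was decided
    outcome: Optional[float] = None  # measured later; None until then

def training_rows(records):
    """Keep only the decisions whose lagged outcome has been observed."""
    return [r for r in records if r.outcome is not None]

# Made-up example rows, not real data.
records = [
    DecisionRecord({"segment": "A"}, {"offer": "standard"}),
    DecisionRecord({"segment": "B"}, {"offer": "premium"}),
]
records[0].outcome = 0.2  # outcome for the first decision arrives months later

print(len(training_rows(records)))  # 1
```

Because each outcome attaches to exactly one decision, no search over sequences of moves is needed to work out which action deserves the credit.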
"For instance, cost and risk of insuring a property can be directly associated to the decision (e.g., whether or not to underwrite a property, and the choice of policy and premium dollars), even if the decision took place six months ago and it is only now that we can measure the outcome (e.g., risk)," explained Hodjat.
Reinforcement learning is harder because the computer has to compute "credit assignment," finding the critical series of moves. By computing the predictor, ESP solves that by "spreading the reward/penalty peanut butter on to the time-series of decision frames, from the final outcome back through time," as Hodjat describes it.
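Hodjat's "peanut butter" image can be sketched as a simple rule that hands each decision frame a share of the final outcome. The paper's exact scheme is not described here, so this example assumes one common choice, exponential discounting backward from the last frame; the function name and the discount value are illustrative assumptions.

```python
def spread_outcome(num_frames, final_outcome, discount=0.9):
    """Give each decision frame a share of the final outcome.

    Frames nearer the end get more of the credit or blame. The `discount`
    factor is an assumption for illustration, not a value from the paper.
    """
    return [
        final_outcome * discount ** (num_frames - 1 - t)
        for t in range(num_frames)
    ]

# A four-frame episode that ends in a crash (outcome -1.0).
targets = spread_outcome(4, final_outcome=-1.0, discount=0.5)
print(targets)  # [-0.125, -0.25, -0.5, -1.0]
```

Per-frame targets of this kind give a predictor something to learn for every decision, instead of a single signal at the end of the game.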
In either case, whether it's a reinforcement learning problem or something simpler, ESP offers the prospect of augmenting human decision making to a scale previously unavailable, Hodjat told ZDNet.
"Decision problems typically are not huge simply because humans cannot handle very large problems," he observed. "Automating decision making actually makes it possible to scale them up, i.e. tackle larger problems, or expand the features in current problems in order to make better decisions."
With proof of the approach, the challenge for Hodjat and the team is to spread the gospel of evolutionary approaches by signing up customers to use the technology.
"Usually it is relatively easy to discover new opportunities for ESP" with clients, said Hodjat. "We ask the customer to think about what their most impactful decision points are, and then whether there is historical data about such decisions in the past, and whether we can put it together into a context-action-outcome table, or can start collecting such data," he explained.
"Usually in about 30 mins we've identified a couple of such opportunities and can start scoping them out in detail." That's a process that involves both his meetings with customers and also an expanding team of engineers certified to implement the approach. Cognizant has been setting up satellite offices, "pods," as they're known, in offices in Austin, Bangalore, Amsterdam and elsewhere.
The next threshold for the San Francisco unit will be to get real-world evidence that its projects with clients are leading to success and a positive return on investment. Hodjat is confident the payoff is there, though he says a year is too little time to demonstrate that value. Sometime in the coming year, the work should begin to bear fruit.
In the meantime, he is confident the scientific structure expressed in the Flappy Bird challenge is significant.
"We are going to be the leaders in this," Hodjat said, referring to the use of machine learning for business. "In those areas where it is solving practical problems that are meaningful for companies."