Sleight of hand: OpenAI's trick to make its Rubik's robot hand work

A robotic hand solving Rubik’s cube has all the makings of hype headlines, but underlying OpenAI’s latest achievement is some subtler, more interesting science.
Written by Tiernan Ray, Senior Contributing Writer

The news that grabbed the artificial intelligence community, and the popular imagination, on Tuesday was the announcement by the San Francisco-based research institute OpenAI that it had taught a robotic hand to solve a Rubik's cube -- and single-handedly, at that.

The single appendage -- a right hand, in this case -- was spray-painted in the blue and orange colors of the OpenAI logo, and the Twitter post on the matter elicited memes from the second "Terminator" movie, in which Arnold Schwarzenegger reveals the metallic skeleton of his own hand. 

Lost in the shuffle is just what is new here, if anything, and what of it may or may not be machine learning and artificial intelligence -- the science, in other words. 

In a fifty-one-page paper, "Solving Rubik's Cube with a Robot Hand," the authors, Ilge Akkaya and seventeen colleagues, explain their innovation. It's not the robotic hand itself, a product called "Dexterous Hand" from the company Shadow Robot, which has been on the market for some time (and is available for purchase from Shadow's website).

Nor is the idea of simulating robotics inside a computer to train machine learning systems new: OpenAI has used this approach for some time to explore the challenges of robotic movement without the cost of experimenting in the real world, as have other groups, such as professor Sergey Levine's AI lab at UC Berkeley.


Look, Ma, one hand! A combination of neural networks produces a system of controls that can make a robotic hand move a Rubik's cube to a solution, although only some of the time. 


And the neural networks used in this system, both in the computer simulation and in the real-world control of the Dexterous Hand, are not new. OpenAI has been building on its prior work in reinforcement learning and dexterous manipulation. The two components of its reinforcement learning system, a so-called value network and a policy network, are built up of many fully-connected layers and long short-term memory (LSTM) networks, with various modifications made over the years. 

And when the Dexterous Hand manipulates the Rubik's cube, based on instructions from that reinforcement learning system, the whole operation is kept on track by a convolutional neural network that estimates the state of the cube's faces from multiple camera images -- again, an adaptation of past research practices. 

None of that is the real breakthrough here. The real innovation in Tuesday's announcement, from a science standpoint, is the way many versions of possible worlds were created inside the computer simulation, in an automated fashion, using an algorithm called ADR. 

ADR, or "automatic domain randomization," is a way to vary the world the neural network trains in: different appearances of the Rubik's cube, different positions of the robotic hand, and all kinds of physical variables, such as friction and gravity. It's done by creating thousands of variations of the values of those variables inside the computer simulator while the neural network is being trained.
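To make the idea concrete, here is a minimal sketch of sampling one such randomized simulator configuration. The parameter names and ranges are hypothetical stand-ins; the paper's actual parameter set is far larger and covers visual appearance as well as physics:

```python
import random

# Hypothetical physics parameters and plausible ranges, for illustration only.
PARAM_RANGES = {
    "friction": (0.5, 1.5),          # multiplier on default surface friction
    "gravity": (9.0, 10.6),          # m/s^2
    "cube_size_scale": (0.95, 1.05), # relative cube dimensions
}

def sample_environment(ranges):
    """Draw one randomized simulator configuration from the current ranges."""
    return {name: random.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

env = sample_environment(PARAM_RANGES)  # one of thousands of training worlds
```

Each training episode can run in a freshly sampled world, so the policy never sees exactly the same physics twice.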

The key problem was how to make enough random changes in the environment during the training of the robot's policy neural network so that the computer retains a "memory" of lots of possible states of affairs. You could just come up with lots of changes to the environment variables, such as the position the hand is in, but such hand-coding (pun firmly intended) is tedious and time-consuming. 

Instead, ADR changes the variables automatically and iteratively as the policy network is trained to solve the Rubik's cube. ADR, in other words, is a separate piece of code designed to increase the random variation in the training data, making things progressively harder for the policy network. You could think of ADR as a kind of adversary to the neural network -- or as the weights in a gym routine that get gradually heavier over the course of strength training. 
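The gradually-heavier-weights analogy can be sketched as a toy loop. Everything here is illustrative: `evaluate` stands in for an entire train-and-test cycle, and the real ADR algorithm maintains per-boundary performance buffers rather than widening all ranges at once:

```python
import random

def train_with_adr(ranges, evaluate, episodes=100, threshold=0.9, step=0.01):
    """Toy ADR loop: sample an environment from the current ranges, measure
    the policy's success in it, and widen every range once performance
    clears the threshold -- making subsequent environments harder."""
    ranges = {k: [lo, hi] for k, (lo, hi) in ranges.items()}
    for _ in range(episodes):
        env = {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
        if evaluate(env) >= threshold:   # stand-in for a full train/test cycle
            for bounds in ranges.values():
                bounds[0] -= step        # expand the randomization
                bounds[1] += step
    return ranges

# With a policy that always "succeeds," the ranges widen every episode.
final = train_with_adr({"friction": (0.5, 1.5)},
                       evaluate=lambda env: 1.0, episodes=10)
```

The key property is the feedback loop the paper describes: better performance begets more randomization, which in turn demands a more robust policy.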


Star of the show: the real innovation in Tuesday's announcement, from a science standpoint, is the way many versions of possible worlds were created inside the computer simulation, in an automated fashion, using an algorithm called ADR. 


As Akkaya and colleagues describe it, "The distribution over environments is sampled to obtain environments used to generate training data and evaluate model performance [...] As training progresses and model performance improves sufficiently on the initial environment, the distribution is expanded [...] every improvement in the model's performance results in an increase in randomization."

That's all well and good, but, you may ask, how does a robotic hand use this ADR randomness in the real world? 

Using ADR, the real-world Dexterous Hand can adapt to changes, such as when it drops the cube on the floor and the cube is placed back in the hand at a slightly different angle. The authors report that the hand's performance after ADR training is vastly better than under the prior approach of manually crafted randomness, in which only a handful (sorry, again, for the pun) of random variants are thrown at it. 

What's happening, they opine, is the emergence of a kind of "meta-learning." The trained neural network is still, in a sense, "learning" at the time it is tested on the real-world Rubik's cube: it keeps updating its model of what transitions can happen between states of affairs as events unfold in the real world. The authors assert that they know this is happening "inside" the trained network because, after a perturbation -- say, the Dexterous Hand is hit with some object that interrupts its effort -- the robot's performance suddenly plunges, then steadily improves, as if the whole policy network is adjusting to the changed state of affairs. 
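That drop-and-recover pattern can be illustrated with a deliberately tiny stand-in for a recurrent state. This is not the paper's LSTM policy; it is a one-line recurrent update that tracks an unknown environment parameter, whose tracking error (a proxy for performance) spikes after a perturbation and then shrinks as the state re-adapts:

```python
def run(true_param, steps, h=0.0, alpha=0.3):
    """One 'deployment': the recurrent state h is nudged toward the
    environment's true (hidden) parameter at every step; the tracking
    error stands in for the policy's performance gap."""
    errors = []
    for _ in range(steps):
        errors.append(abs(true_param - h))
        h += alpha * (true_param - h)   # recurrent-state update
    return h, errors

h, before = run(true_param=1.0, steps=10)       # settled behavior
_, after = run(true_param=2.0, steps=10, h=h)   # sudden perturbation
# after[0] is large (performance plunges), after[-1] is small again
```

No weights change here, and none change in the deployed policy either: all of the adaptation lives in the evolving recurrent state.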

"This was exactly what we predicted," they write. "If the policy truly learns at test time by updating its recurrent state, we would expect it to become gradually more efficient."

From all that, they conclude, "We find clear signs of emergent meta-learning. Policies trained with ADR are able to adapt at deployment time to the physical reality, which it has never seen during training, via updates to their recurrent state."

So, the star of this show is not robotics, strictly speaking, and it's certainly not the ability to manipulate objects or to solve a Rubik's cube. (In fact, the Dexterous Hand actually fails a great deal of the time, so a lot of work is left to be done.)

No, the real star of the show is a way to automate the construction of many, many simulations of the world that can then be used to force a computer model to make better predictions about situations in the event of uncertainty. It's a trick, but it's also pretty neat science. 
