Watching YouTube videos may someday let robots copy humans
AI scientists at the University of California at Berkeley trained a neural network to reconstruct the acrobatics humans perform in YouTube videos, and to then manipulate a simulated actor to perform those motions. The work has implications for training robotic systems to copy human activity.
Copying human behavior has obvious value for making robots that can move in sophisticated ways. But mimicking humans is nowhere near possible with current robots, which struggle to achieve even basic movements.
New research by artificial intelligence scientists at the University of California at Berkeley may point the way to more effective mimicry. Researchers were able to get a computer-simulated humanoid figure to ape human movements, from back-flips to "Gangnam Style" dance moves, just by feeding the computer YouTube video clips.
The researchers trained a "deep" neural network, using what's known as reinforcement learning, where the humanoid figure in the computer program is rewarded as its limbs increasingly approximate the motion of a human in a video. It's a bit like how Google trained its "AlphaGo" program to find optimal solutions to the strategy game Go.
Such work has been done before, but usually with more controlled types of video, never by just cramming the computer full of YouTube clips.
The authors took their cue from the big-budget movie technique known as motion capture. In motion capture, a human actor is filmed from multiple angles performing various movements while wearing a suit filled with reflective "markers." The markers enable the computer to build a model of the points in space of each limb.
Motion capture has led to brilliant computer-generated characters in big-budget films. In "The Lord of the Rings" trilogy, actor Andy Serkis famously "performed" as the character Gollum by being filmed extensively wearing the motion-capture suit. (Lead animator Joe Letteri penned an excellent essay about the process in Nature magazine in 2013.)
But motion capture is expensive, requiring special suits, a camera setup, dedicated facilities and more.
The Berkeley scientists found something simpler: track movements without any markers by feeding YouTube videos to an AI system.
Peng and colleagues earlier this year offered up a system called "DeepMimic" that still relied on motion capture. Switching to raw YouTube video is therefore an ambitious departure. The only previous study that relied on such video, from 2012, still resorted to using some "prior" information drawn from motion capture. And that study produced simulated motion that was too "robotic."
The new research takes two stages. First, a neural net "reconstructs" what the human in the YouTube video is doing. The authors drew on work from 2014 by Google researchers in which a single image of a person could be analyzed by a convolutional neural net, or CNN, and the position of limbs could be deciphered even though body parts were only partly visible in the image.
Peng and colleagues, in the present study, added a twist: They rotated the image of each frame. By doing so, the computer got better at understanding unusual positions of the human body, such as when someone is upside-down during a back-flip. The resulting "poses" were then assembled into a "trajectory" of limbs from one frame to the next, to reconstruct the entire movement of the human in the video.
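To make the rotation idea concrete, here is a minimal Python sketch of how labeled limb positions might be rotated along with a frame to produce extra training examples. It is an illustration only, not the authors' code: the function names, the choice of four angles and the single "head" keypoint are all assumptions made for the example.

```python
import math

def rotate_keypoints(keypoints, angle_deg, center):
    """Rotate (x, y) keypoints about `center` by `angle_deg`.

    Uses the standard rotation formula. In image coordinates, where y
    grows downward, a positive angle appears clockwise on screen.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    cx, cy = center
    rotated = []
    for x, y in keypoints:
        dx, dy = x - cx, y - cy
        rotated.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return rotated

def augment_pose(keypoints, center, angles=(0, 90, 180, 270)):
    """Return one rotated copy of the keypoints per angle (hypothetical angle set)."""
    return {angle: rotate_keypoints(keypoints, angle, center) for angle in angles}

# Hypothetical example: a head keypoint near the top of a 100x100 frame.
head = [(50.0, 20.0)]
copies = augment_pose(head, center=(50.0, 50.0))
```

In this sketch, the 180-degree copy places the head below the center of the frame, which is roughly what an upright pose looks like mid back-flip. A real pipeline would rotate the image pixels by the same angle so that the labels and the picture stay aligned.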
Once a jumping-jack or a back-flip motion was reconstructed, the second stage took place: A computer-simulated humanoid "character" was made to replicate that movement. The reinforcement learning system gained "rewards" during the training phase as it more accurately copied the movements.
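The reward idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the formula from the paper: poses are represented here as plain lists of joint angles in radians, and the exponential form, the `scale` value and the function names are all chosen for the example.

```python
import math

def imitation_reward(sim_pose, ref_pose, scale=2.0):
    """Reward in (0, 1] that grows as the simulated pose nears the reference.

    `sim_pose` and `ref_pose` are lists of joint angles in radians
    (an assumed representation). `scale` is a hypothetical constant that
    controls how sharply the reward falls off with error.
    """
    if len(sim_pose) != len(ref_pose):
        raise ValueError("poses must have the same number of joints")
    squared_error = sum((s - r) ** 2 for s, r in zip(sim_pose, ref_pose))
    return math.exp(-scale * squared_error)

def trajectory_reward(sim_trajectory, ref_trajectory):
    """Average the per-frame reward over a whole reconstructed motion."""
    rewards = [imitation_reward(s, r)
               for s, r in zip(sim_trajectory, ref_trajectory)]
    return sum(rewards) / len(rewards)

# Hypothetical two-joint example: a close match scores higher than a poor one.
reference = [0.1, 0.2]
close_match = imitation_reward([0.2, 0.2], reference)
poor_match = imitation_reward([0.5, 0.2], reference)
```

A perfect match scores exactly 1.0 and the score falls toward zero as the limbs drift from the reference, so a learning algorithm that maximizes this number is pushed toward copying the motion in the video.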
While all of this is a computer simulation, there are tantalizing suggestions that pertain to robotics. When reconstructing the human motions, the authors created one version of the simulation that uses not a humanoid but instead a simulated version of the "Atlas" robots made by Boston Dynamics, complete with the weight (375 pounds) and height (6 feet) of the robot.
There are limits to what the computer can copy, the authors found. In particular, their animated character didn't do an entirely credible job copying the Gangnam Style dance. It also struggled with what's called a "kip up," where a person starts out on their back and flips their legs over their head to reach a standing position. "We have yet to be able to train policies that can closely reproduce such nimble motions," the authors concede.