Watching YouTube videos may someday let robots copy humans

AI scientists at the University of California at Berkeley trained a neural network to reconstruct the acrobatics humans perform in YouTube videos, and to then manipulate a simulated actor to perform those motions. The work has implications for training robotic systems to copy human activity.
Written by Tiernan Ray, Contributing Writer

Copying human behavior has obvious value for making robots that can move in sophisticated ways. But mimicking humans is nowhere near possible with today's robots, which struggle to achieve even basic movements.

New research by artificial intelligence scientists at the University of California at Berkeley may point the way to more effective mimicry. Researchers were able to get a computer-simulated humanoid figure to ape human movements, from back-flips to "Gangnam Style" dance moves, just by feeding the computer YouTube video clips.

The researchers trained a "deep" neural network, using what's known as reinforcement learning, where the humanoid figure in the computer program is rewarded as its limbs increasingly approximate the motion of a human in a video. It's a bit like how Google trained its "AlphaGo" program to find optimal solutions to the strategy game Go.
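The reward scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: it assumes the simulated figure's state and the reference motion are each summarized as a vector of joint angles, and uses a DeepMimic-style exponentiated negative error so the reward approaches 1 as the match improves (the `scale` constant is a tuning assumption).

```python
import numpy as np

def imitation_reward(sim_joints, ref_joints, scale=2.0):
    """Return a reward in (0, 1]: the closer the simulated figure's
    joint angles are to the reference motion's, the closer the reward
    is to 1. A hypothetical sketch, not the authors' implementation."""
    err = np.sum((np.asarray(sim_joints) - np.asarray(ref_joints)) ** 2)
    return float(np.exp(-scale * err))

# A perfect match earns the maximum reward of 1.0.
print(imitation_reward([0.1, -0.3, 0.5], [0.1, -0.3, 0.5]))
```

During training, the policy controlling the simulated figure is updated to maximize the sum of such per-frame rewards, which is what drives the limbs to track the reconstructed motion.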

Such work has been done before, but usually with more controlled types of video, never by just cramming the computer full of YouTube clips.


The research, available on the arXiv pre-print server and set to appear next month in the Association for Computing Machinery's Transactions on Graphics journal, is authored by Xue Bin Peng, Angjoo Kanazawa, Pieter Abbeel, Sergey Levine, and Jitendra Malik, all of UC Berkeley. Levine's lab is doing extensive work on robot training, so his involvement in the report is significant. (There's also a nice blog post about the work.)

The authors took their cue from the big-budget movie technique known as motion capture. In motion capture, a human actor is filmed from multiple angles performing various movements while wearing a suit fitted with reflective "markers." The markers let a computer build a model of each limb's position in space.

Motion capture has led to brilliant computer-generated characters in big-budget films. In "The Lord of the Rings" trilogy, actor Andy Serkis famously "performed" as the character Gollum by being filmed extensively wearing the motion-capture suit. (Lead animator Joe Letteri penned an excellent essay about the process in Nature magazine in 2013.)

But motion capture is expensive, obviously, requiring special suits, camera setups, facilities, and so on.

The Berkeley scientists found something simpler: tracking movements without any markers at all, just by feeding the AI ordinary YouTube videos.


Peng and colleagues earlier this year offered up a system called "DeepMimic" that still relied on motion capture. Switching to raw YouTube video is therefore an ambitious departure. The only other previous study that relied on such video, from 2012, still resorted to using some "prior" information drawn from motion-capture. And that study produced simulated motion that was too "robotic."

The new research proceeds in two stages. First, a neural net "reconstructs" what the human in the YouTube video is doing. The authors drew on work from 2014 by Google researchers in which a single image of a person could be analyzed by a convolutional neural net, or CNN, and the position of limbs could be deciphered even though body parts were only partly visible in the image.

Peng and colleagues, in the present study, added a twist: They rotated the image of each frame. By doing so, the computer got better at understanding unusual positions of the human body, such as when someone is upside-down during a back-flip. 
Each of these "poses" was then assembled into a "trajectory" of limbs from one frame to the next, reconstructing the entire movement of the human in the video.
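The rotation trick can be illustrated with a small geometric sketch. This is a hypothetical example, not the paper's code: it assumes a pose is represented as 2-D (x, y) joint keypoints, and shows how a frame's keypoints can be rotated about the image center, as one might do to augment a pose estimator's training data with tilted and upside-down bodies.

```python
import numpy as np

def rotate_keypoints(keypoints, angle_deg, center=(0.0, 0.0)):
    """Rotate (x, y) joint keypoints counter-clockwise about `center`.
    Applying this (together with the same rotation to the image) lets
    a pose estimator see back-flip-style orientations during training.
    Function and argument names are illustrative, not the authors' API."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = np.asarray(keypoints, dtype=float)
    return (pts - center) @ rot.T + np.asarray(center)

# Rotating a point on the x-axis by 90 degrees moves it to the y-axis.
print(rotate_keypoints([[1.0, 0.0]], 90.0))
```

A 180-degree rotation of every keypoint (and the frame itself) yields exactly the upside-down configuration seen mid-back-flip, which is why the augmentation helps on acrobatic clips.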

Once a jumping-jack or a back-flip motion was reconstructed, the second stage took place: A computer-simulated humanoid "character" was made to replicate that movement. The reinforcement learning system gains "rewards" during the training phase as it more accurately copies the movements.

A video of the results, on YouTube, of course, is quite fun to watch!

The AI algorithms got so good that they could predict, from a single frame of video, what movement an acrobat would carry out in subsequent frames.


While all of this is a computer simulation, there are tantalizing suggestions that pertain to robotics. When reconstructing the human motions, the authors created one version of the simulation that uses not a humanoid but instead a simulated version of the "Atlas" robots made by Boston Dynamics, complete with the weight (375 pounds) and height (6 feet) of the robot.

There are limits to what the computer can copy, the authors found. In particular, their animated character didn't do an entirely credible job copying the Gangnam Style dance. It also struggled with what's called a "kip up," where a person starts out on their back and flips their legs over their head to reach a standing position. "We have yet to be able to train policies that can closely reproduce such nimble motions," the authors concede.

