China's AI scientists teach a neural net to train itself

Researchers at China's Sun Yat-Sen University, with help from Chinese startup SenseTime, improved upon their own attempt to get a computer to discern human poses in images by adding a bit of self-supervised training. The work suggests continued efforts to limit the reliance on human labels and "ground truth" in AI.

More and more, AI researchers are trying to make machines teach themselves with a minimum of human guidance. So-called self-supervision is an element that can be added to many machine learning tasks so that a computer learns with less human help, perhaps someday with none at all.

Scientists at China's Sun Yat-Sen University and Hong Kong Polytechnic University use self-supervision in a new bit of research to help a computer learn the pose of a human figure in a video clip. 

Understanding what a person is doing in a picture is its own rich vein of machine learning research, useful for a number of things, including video surveillance. But such methods rely on "annotated" data sets where labels are carefully applied to the orientation of the joints of the body.

That's a problem because larger and larger "deep" neural networks are hungry for more and more data, but there isn't always enough labeled data to feed the network.

So, the Sun Yat-Sen researchers set out to show a neural network can refine its understanding by continually comparing the guesses of multiple networks with one another, ultimately lessening the need for the "ground truth" afforded by a labeled data set. 


China's AI scientists show how their machine learning model refined its "prediction" of the 3D pose of an actor from an image by adding some self-supervision code to the last part of the neural network.

(Image: Wang et al. 2019)

As the authors put it, the prior efforts for inferring a human pose have achieved success, but at the expense of a "time-consuming network architecture (e.g., ResNet-50) and limited scalability for all scenarios due to the insufficient 3D pose data."

The authors demonstrate success in beating other AI methods in predicting the pose of a figure across a series of benchmark tests. They also beat their own results from 2017 with the addition of this new self-supervision approach.

The paper, "3D Human Pose Machines with Self-supervised Learning," is posted on the arXiv pre-print server and is authored by Keze Wang, Liang Lin, Chenhan Jiang, Chen Qian, and Pengxu Wei. Notably, Qian is with SenseTime, the Chinese AI startup that sells software for various applications such as facial recognition, and which distributes a machine learning programming framework called "Parrots."

In their original paper from 2017, the authors used an annotated data set, the "MPII Human Pose" data set compiled in 2014 by Mykhaylo Andriluka and colleagues at Germany's Max Planck Institute for Informatics. They used that labeled data set to extract two-dimensional human body parts from still images -- basically, stick-figure drawings of the limbs oriented in space. They then converted those 2D body-part representations into 3D representations that indicate orientation of the limbs in three-dimensional space.

In the new paper, the authors do the same "pre-training" via the MPII data set, to extract the 2D poses from the images. And just as in 2017, they use another data set, "Human3.6M," to extract the ground truth for 3D as well. Human3.6M has 3.6 million images, taken in a laboratory setting, of paid actors carrying out a variety of tasks, from running to walking to smoking to eating.

What's new this time is that in the final part of their neural net, they throw away the 2D and 3D annotations. Instead, they compare the 2D pose implied by the model's 3D prediction with the 2D pose that was produced in the first step. "After initialization, we substitute the predicted 2D poses and 3D poses for the 2D and 3D ground-truth to optimize" the model "in a self-supervised fashion."

They "project the 3D coordinate(s)" of the 3D pose "into the image plane to obtain the projected 2D pose" and then they "minimize the dissimilarity" between this new 2D pose and the first one they had derived "as an optimization objective."
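The projection-and-compare step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the article does not specify the camera model or the dissimilarity measure, so the pinhole projection, the mean joint distance, and the names `project_to_image_plane` and `pose_dissimilarity` are all hypothetical, and the three-joint pose is made-up data.

```python
import math

def project_to_image_plane(joints_3d, focal_length=1.0):
    """Project (x, y, z) joints to (u, v) with a simple pinhole camera.

    Assumption: a pinhole model with the camera at the origin looking
    down the z axis; the paper's actual projection may differ.
    """
    return [(focal_length * x / z, focal_length * y / z) for x, y, z in joints_3d]

def pose_dissimilarity(pose_a, pose_b):
    """Mean Euclidean distance between corresponding 2D joints (assumed metric)."""
    total = sum(math.dist(a, b) for a, b in zip(pose_a, pose_b))
    return total / len(pose_a)

# Made-up three-joint example: a predicted 3D pose and the 2D pose
# estimated earlier in the pipeline.
predicted_3d = [(0.0, 0.0, 2.0), (0.5, 1.0, 2.0), (-0.5, 1.0, 4.0)]
predicted_2d = [(0.0, 0.0), (0.25, 0.5), (-0.125, 0.3)]

projected_2d = project_to_image_plane(predicted_3d)
loss = pose_dissimilarity(projected_2d, predicted_2d)
print(projected_2d)
print(loss)
```

The smaller `loss` is, the better the 3D guess agrees with the earlier 2D guess, and no human label is involved in computing it.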

In a sense, the neural network keeps asking if its 3D model of the body is predicting accurately in three dimensions what it thought at the beginning of the process in two dimensions, learning about how 3D and 2D correspond. 
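To make that feedback loop concrete, here is a toy sketch in Python. It is not the authors' model: their system adjusts a neural network, whereas this example, for simplicity, nudges the 3D joint coordinates directly by gradient descent. The pinhole projection with focal length 1, the fixed depth, the learning rate, and the function names are all assumptions made for illustration.

```python
def reprojection_error(joints_3d, target_2d):
    """Sum of squared differences between projected joints and 2D targets."""
    err = 0.0
    for (x, y, z), (u, v) in zip(joints_3d, target_2d):
        err += (x / z - u) ** 2 + (y / z - v) ** 2
    return err

def refine_step(joints_3d, target_2d, learning_rate=1.0):
    """One gradient-descent step on x and y; depth z is held fixed (assumption)."""
    refined = []
    for (x, y, z), (u, v) in zip(joints_3d, target_2d):
        grad_x = 2.0 * (x / z - u) / z  # derivative of the squared error w.r.t. x
        grad_y = 2.0 * (y / z - v) / z  # derivative of the squared error w.r.t. y
        refined.append((x - learning_rate * grad_x, y - learning_rate * grad_y, z))
    return refined

# Made-up two-joint pose whose projection disagrees with the 2D estimate.
pose_3d = [(0.4, 0.0, 2.0), (1.0, 1.6, 2.0)]
pose_2d = [(0.0, 0.0), (0.25, 0.5)]

errors = [reprojection_error(pose_3d, pose_2d)]
for _ in range(20):
    pose_3d = refine_step(pose_3d, pose_2d)
    errors.append(reprojection_error(pose_3d, pose_2d))

print(errors[0], errors[-1])
```

Each pass shrinks the disagreement between the 3D guess and the 2D guess, which is the sense in which the two predictions supervise each other.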

There is a lot of now-standard machine learning machinery here: A convolutional neural network, or CNN, allows the system to extract the 2D stick figure. That approach is borrowed from an earlier piece of work by Carnegie Mellon researchers in 2014 and a follow-up they did in 2016.


A diagram of the full neural network set-up for 3D Pose Machines, including a convolutional neural network to extract 2D figure understanding, followed by a long short-term memory network to extract temporal information key to 3D understanding, followed by a final self-supervised comparison between predictions to improve the results.

(Image: Wang et al. 2019)

Then, a long short-term memory, or LSTM, a neural network specialized to retain a memory of sequences of events, is used to extract the continuity of the body from multiple sequential video frames to create the 3D model. That component is modeled after an approach developed in 2014 by Alex Graves and colleagues at Google's DeepMind, which had originally been built for speech recognition.
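For readers unfamiliar with how an LSTM carries information from one frame to the next, the following is a minimal sketch of a single LSTM cell using the standard gate equations, written with scalar values instead of vectors. The weights, the input sequence, and the names `lstm_step` and `weights` are arbitrary choices for illustration; the authors' network is far larger and is trained rather than hand-set.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps a gate name to (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))        # how much new information to write
    f = sigmoid(pre("forget"))       # how much old memory to keep
    o = sigmoid(pre("output"))       # how much memory to expose
    g = math.tanh(pre("candidate"))  # the new information itself
    c = f * c_prev + i * g           # updated cell memory
    h = o * math.tanh(c)             # updated output
    return h, c

# Arbitrary illustrative weights, identical for every gate.
weights = {gate: (0.5, 0.5, 0.0) for gate in ("input", "forget", "output", "candidate")}

h, c = 0.0, 0.0
history = []
for frame_feature in [1.0, 0.0, 0.0]:  # a signal in frame 1, then nothing
    h, c = lstm_step(frame_feature, h, c, weights)
    history.append(h)

print(history)
```

Even though the second and third inputs are zero, the outputs stay above zero because the cell memory `c` still carries a trace of the first frame. That retained context is what lets such a network relate a body's position in one video frame to its position in the next.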

What's novel here is imposing self-supervision to make the whole thing hold together without ground-truth labels. By taking this last step, the authors were able to lessen the need for 3D data and instead lean upon 2D images. "The imposed correction mechanism enables us to leverage the external large-scale 2D human pose data to boost 3D human pose estimation," they write.

The authors not only delivered better results on the Human3.6M database, they also saw a dramatic speed-up against the established approaches. Running on a single Nvidia "GTX1080" GPU, it took their neural nets 51 milliseconds to process an image versus as much as 880 milliseconds for other approaches. They also saw a dramatic speed-up versus their prior, 2017 approach. The results validate what they call a "lightweight architecture" for their neural network.
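A quick back-of-the-envelope calculation puts those timings in perspective. The two per-image figures come from the article; the frames-per-second conversion assumes images are processed one at a time with no other overhead, which is a simplification.

```python
ours_ms = 51.0            # per-image time reported for the new approach
slowest_other_ms = 880.0  # upper end reported for other approaches

speedup = slowest_other_ms / ours_ms    # best-case speed-up factor
ours_fps = 1000.0 / ours_ms             # images per second, new approach
other_fps = 1000.0 / slowest_other_ms   # images per second, slowest rival

print(f"speed-up: up to {speedup:.1f}x")
print(f"throughput: {ours_fps:.1f} vs {other_fps:.1f} images per second")
```

In other words, the gap is up to roughly 17 times, or about 20 images per second versus barely more than one.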

The researchers will have plenty of competition for the foreseeable future. Other groups have taken a similarly "lightly supervised" approach to predicting poses, and even to capturing human motion. For example, the robotics laboratory of professor Sergey Levine of UC Berkeley last October reported being able to train simulated robots to imitate human activities as seen in unlabeled YouTube videos. Perhaps the Chinese work and efforts such as Levine's will reach some fusion down the road. In any event, self-supervised learning is clearly a major focus of AI research.
