Nvidia researchers make it easier to train robots to pick up stuff

With a unique combination of synthetic images, the researchers found a way to more easily and more effectively train a computer vision algorithm for robots.

Increasingly, robots are working alongside humans. Those robots, however, don't handle changes to their environment very well. If an object is out of place, it can become difficult for a robot to identify and manipulate that object.

To find and pick up an object in the real world -- even if it's misplaced -- a robot needs a computer vision algorithm that can identify the 3D position and orientation of an object in a scene -- what's known as the " 6-DoF (degrees of freedom) pose."

Researchers have been working for a while to address this challenge, but training these algorithms is still difficult. This week at the Conference on Robot Learning in Zurich, a team of Nvidia researchers is presenting a novel deep learning-based system that may offer a solution.

By training their computer vision algorithm with synthetic images, they've managed to bypass the complex, labor-intensive process of preparing photographic images for training. On top of that, by using a unique combination of synthetic images, the Nvidia team has trained an algorithm that can actually outperform a network trained on real images.

This represents the first time an algorithm trained only with synthetic data has been able to beat a network trained on real images for object pose estimation on several objects of a standard benchmark. This will make training algorithms for robots much easier.

"With synthetic data, we can generate an almost infinite amount with labels that come essentially for free," Stan Birchfield, a lead robotics researcher at Nvidia, explained to ZDNet.

"Ultimately, what we're trying to do is make it possible for a person to teach a robot a new task in a short period of time," Birchfield said. This will unlock the potential for robots to assist people in a variety of settings including factories, the home or health care facilities.

More work was needed in this space because of the nature of computer vision research. While researchers have made significant strides in this field, they typically test their algorithms against fixed data sets.

"That methodology doesn't always translate into the real world and the context of a robotics system," Birchfield said. "We're showing a system that not only demonstrates good quantitative results on a particular data set but also works in the context of robotics system."

The Nvidia team can mount a standard RGB camera to a robot and used the algorithm to enable the robot to see, pick up, and move images.

The researchers trained the network using Nvidia Tesla V100 GPUs on a DGX Station, with the cuDNN-accelerated PyTorch deep. They used a custom plugin developed by Nvidia for Unreal Engine to generate the synthetic data.

In the past, synthetic data was insufficient for training computer vision algorithms because computer-generated images simply didn't look real.

"The trend until recently, about a year or so, was to try to produce images that looked more and more realistic," Birchfield explained. "The problem that researchers found was that to make the images more realistic, they had to hire artists and had to spend lots of time crafting scenes to look exactly like the real world. That reduced the amount of variety -- you could model one particular room, but not a variety of rooms."

The more variety, the better trained the algorithm is.

Last year, researchers started sacrificing some photorealism in favor of variety with "domain randomized" sets of training images -- ones in which the parameters used to generate the images are varied. For instance, Birchfield said, "The lighting is randomized -- there are some light images, some dark images... The objects are placed in nonrealistic ways, like objects just floating in space."

The Nvidia team reached their breakthrough by using a combination of non-photorealistic domain randomized data and photorealistic data to leverage the strengths of both.

"Our hope is other researchers will find this technique useful for their research," Birchfield said.