As artificial intelligence advances, we look to a future with more robots and automations than ever before. They already surround us -- the robot vacuum that can expertly navigate your home, a robot pet companion to entertain your furry friends, and robot lawnmowers to take over weekend chores. We appear to be inching towards living out The Jetsons in real life. But as smart as they appear, these robots have their limitations.
Google DeepMind unveiled RT-2, the first vision-language-action (VLA) model for robot control, which effectively takes the robotics game several levels up. The system was trained on text and images from the internet, much like the large language models behind AI chatbots such as ChatGPT and Bing Chat.
The robots in our homes can carry out the simple tasks they are programmed to perform. Vacuum the floors, for example, and if the left-side sensor detects a wall, try to go around it. But traditional robotic control systems aren't programmed to handle new situations and unexpected changes -- often, they can't perform more than one task at a time.
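To make the contrast concrete, traditional control logic looks something like the following sketch -- every situation must be anticipated by an explicit rule, and anything outside those rules falls through to a default. This is a hypothetical illustration, not any vendor's actual firmware, and all the sensor names are made up:

```python
def vacuum_step(sensors: dict) -> str:
    """Choose the next action from fixed if/else rules (illustrative only)."""
    if sensors.get("left_wall"):   # wall detected by the left-side sensor
        return "turn_right"        # hard-coded avoidance maneuver
    if sensors.get("cliff"):       # stair edge detected
        return "reverse"
    return "forward"               # default when no rule matches

# A situation no rule anticipates (say, a stray pet toy) silently hits
# the default, whether or not "forward" is the right thing to do.
print(vacuum_step({"left_wall": True}))   # turn_right
print(vacuum_step({"pet_toy": True}))     # forward
```

The brittleness is the point: the robot only ever does what a rule already covers, which is exactly the limitation RT-2 is built to move past.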
RT-2 is designed to adapt to new situations over time, learn from multiple data sources like the web and robotics data to understand both language and visual input, and perform tasks it has never encountered nor been trained to perform.
A traditional robot can be trained to pick up a ball, yet stumble when asked to pick up a cube. RT-2's flexible approach means a robot trained on picking up a ball can figure out how to adjust its extremities to pick up a cube, or another toy it has never seen before.
Instead of the time-consuming, real-world training on billions of data points that traditional robots require -- where they have to physically encounter an object and learn how to pick it up -- RT-2 learns from a large body of data and transfers that knowledge into action, performing tasks it has never experienced before.
"RT-2's ability to transfer information to actions shows promise for robots to more rapidly adapt to novel situations and environments," said Vincent Vanhoucke, Google DeepMind's head of robotics. "In testing RT-2 models in more than 6,000 robotic trials, the team found that RT-2 functioned as well as our previous model, RT-1, on tasks in its training data, or 'seen' tasks. And it almost doubled its performance on novel, unseen scenarios to 62% from RT-1's 32%."
The DeepMind team adapted two existing models, Pathways Language and Image Model (PaLI-X) and Pathways Language Model Embodied (PaLM-E), to train RT-2. PaLI-X, trained on massive amounts of online images paired with corresponding descriptions and labels, helps the model process visual data. With PaLI-X, RT-2 can recognize different objects, understand the scene around it for context, and relate visual data to semantic descriptions.
PaLM-E helps RT-2 interpret language, so it can easily understand instructions and relate them to what is around it and what it's currently doing.
As the DeepMind team adapted these two models to work as the backbone for RT-2, it created the new VLA model, enabling a robot to understand language and visual data and subsequently generate the appropriate actions it needs.
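The trick that lets one model do all three jobs is that RT-2 expresses robot actions as strings of number tokens, so the same model that outputs words can also "speak" commands. The sketch below illustrates that idea only -- the function name, the three-value action layout, and the scaling are hypothetical simplifications, not DeepMind's actual encoding:

```python
def decode_action(model_output: str) -> dict:
    """Parse a text action like '0 192 255' into a structured command.

    Hypothetical layout: [terminate_flag, x_displacement, gripper],
    each value discretized into a 0-255 bin.
    """
    terminate, dx, gripper = (int(tok) for tok in model_output.split())
    return {
        "terminate": bool(terminate),
        # map the 0-255 bin back to a continuous displacement in meters,
        # centered on bin 128 and spanning +/- 0.1 m
        "dx_m": (dx - 128) / 128 * 0.1,
        "gripper_closed": gripper > 127,
    }

# The model emits ordinary text; the robot stack decodes it into motion.
print(decode_action("0 192 255"))
```

Because actions are just another kind of text, the web-scale vision and language training carries over to motor control with no separate action head bolted on.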
RT-2 is not a robot in itself -- it's a model that can control robots more efficiently than ever before. An RT-2-enabled robot can perform tasks of varying complexity using visual and language data, like organizing files alphabetically by reading the labels on the documents, sorting them, and putting them away in the correct places.
It could also handle complex tasks. For instance, if you said, "I need to mail this package, but I'm out of stamps," RT-2 could work out what needs to be done first, such as finding a nearby post office or a merchant that sells stamps, then take the package and handle the logistics from there.