Google uses AI language models to improve home helper robots

Large language models can help robots identify the skills they need to complete a certain task.
Written by Liam Tung, Contributing Writer

Researchers at Everyday Robots are tapping large-scale language models to help robots avoid misconstruing human communications in ways that might trigger inappropriate or even dangerous actions.

Google Research and Alphabet-owned Everyday Robots have integrated what they call 'SayCan' -- a method that grounds language models in a robot's pre-trained, real-world skills -- with Google's largest language model, PaLM, or Pathways Language Model.

This combination, called PaLM-SayCan, shows a path forward for simplifying human-to-robot communications and improving robotic task performance.

"PaLM can help the robotic system process more complex, open-ended prompts and respond to them in ways that are reasonable and sensible," explains Vincent Vanhoucke, distinguished scientist and head of robotics at Google Research.

While large language models like OpenAI's GPT-3 can simulate how humans use language and assist programmers through code auto-completion suggestions, as in GitHub's Copilot, these models don't cross over into the physical world that robots may one day operate in within a home setting.

On the robotics side, robots used in factories today are rigidly programmed. Google's research shows how humans could one day use natural language to ask a robot a question that requires the robot to understand the context of the question, and then carry out a reasonable action in a given setting.

For example, today, prompting GPT-3 with "I spilled my drink, can you help?" yields the response: "You could try using a vacuum cleaner." That's a possibly dangerous action. Google's conversational or dialogue-based AI, LaMDA, gives the response: "Do you want me to find a cleaner?", while another model, FLAN, says: "I'm sorry, I didn't mean to spill it."

The team at Google Research and Everyday Robots tested the PaLM-SayCan approach with a robot in a kitchen environment.

Their approach involved 'grounding' PaLM in the context of a robot taking high-level instructions from a human, in which the robot must figure out which actions are useful and which it is actually capable of performing in that environment.

Now, when a Google researcher says "I spilled my drink, can you help?", the robot returns with a sponge and even tries to place the empty can in the right recycling bin. Further training could involve adding a skill to wipe up the spill.

Vanhoucke explains how grounding the language model works in PaLM-SayCan.

"PaLM suggests possible approaches to the task based on language understanding, and the robot models do the same based on the feasible skill set. The combined system then cross-references the two to help identify more helpful and achievable approaches for the robot."

Besides making it easier for people to communicate with robots, this approach also improves the robot's performance and ability to plan and execute tasks. 

In their paper 'Do As I Can, Not As I Say', Google researchers explain how they structure the robot's planning capabilities to identify one of its 'skills' based on a high-level instruction from a human, and then assess how likely each skill is to fulfill that instruction.

"Practically, we structure the planning as a dialog between a user and a robot, in which a user provides the high-level instruction, e.g. 'How would you bring me a coke can?' and the language model responds with an explicit sequence e.g. 'I would: 1. Find a coke can, 2. Pick up the coke can, 3. Bring it to you, 4. Done'."

"In summary, given a high-level instruction, SayCan combines probabilities from a language model (representing the probability that a skill is useful for the instruction) with the probabilities from a value function (representing the probability of successfully executing said skill) to select the skill to perform. This emits a skill that is both possible and useful. The process is repeated by appending the selected skill to the robot's response and querying the models again, until the output step is to terminate."
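The selection loop the paper describes can be sketched in a few lines of Python. This is an illustrative toy, not Google's implementation: the skill list, the scripted language-model scores, and the value-function numbers below are all made up for the example, whereas the real system queries PaLM for likelihoods and uses learned value functions over the robot's state.

```python
# Toy sketch of SayCan's skill-selection loop. All skills and probabilities
# are hypothetical placeholders for this example.

# Hypothetical value function: probability that the robot can successfully
# execute each skill in its current state.
VALUE_FN = {
    "find a sponge": 0.9,
    "pick up the sponge": 0.8,
    "bring it to you": 0.7,
    "done": 1.0,
}

def lm_usefulness(instruction, steps_so_far, skill):
    """Placeholder for the language-model score: probability that a skill is
    a useful next step, given the instruction and the steps chosen so far.
    A real system would query PaLM for the likelihood of the skill's text."""
    scripted = ["find a sponge", "pick up the sponge", "bring it to you", "done"]
    idx = len(steps_so_far)
    next_step = scripted[idx] if idx < len(scripted) else "done"
    return 0.9 if skill == next_step else 0.05

def saycan_plan(instruction, skills, max_steps=10):
    """Greedily pick the skill maximizing p(useful) * p(executable),
    append it to the plan, and repeat until the 'done' skill is emitted."""
    steps = []
    for _ in range(max_steps):
        best = max(
            skills,
            key=lambda s: lm_usefulness(instruction, steps, s) * VALUE_FN[s],
        )
        steps.append(best)
        if best == "done":
            break
    return steps

plan = saycan_plan("I spilled my drink, can you help?", list(VALUE_FN))
print(plan)
# ['find a sponge', 'pick up the sponge', 'bring it to you', 'done']
```

The key design point, per the paper, is the product of the two probabilities: a skill is chosen only if the language model thinks it is useful *and* the value function thinks it is feasible, which is what keeps the robot from attempting helpful-sounding but impossible (or unsafe) actions.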
