"Where's my phone?" is, for many of us, a daily interjection – often followed by desperate calling and frantic sofa searches. Now some new breakthroughs made by Facebook AI researchers suggest that home robots might be able to do the hard work for us, reacting to simple commands such as "bring me my ringing phone".
Virtual assistants as we know them cannot identify a specific sound and then use it as a target to navigate towards. While you could order a robot to "find my phone 25 feet southwest of you and bring it over", there is little an assistant can do if it isn't told exactly where to go.
To address this gap, Facebook's researchers built a new open-source tool called SoundSpaces, designed for so-called "embodied AI" – a field of artificial intelligence that's interested in fitting physical bodies, like robots, with software, before training the systems in real-life environments.
Instead of using static datasets, like most traditional AI methods, embodied AI favours an approach that leverages reinforcement learning, in which robots learn from their interactions with the real, physical world.
In this case, SoundSpaces lets developers train virtual embodied AI systems in 3D environments representing indoor spaces, with highly realistic acoustics that can simulate any sound source – in a two-story house or an office floor, for example.
Incorporating audio sensing in the training enables AI systems not only to identify different sounds, but also to estimate where a sound is coming from and then treat its source as a navigation target.
The algorithm is fed data taken from room acoustics modelling; for example, it can model the acoustic properties of specific surfaces, understand the way sounds move through particular room geometries, or anticipate how audio propagates through walls. On hearing a sound, therefore, the AI system can work out whether the emitting object is far or near, left or right, and then move towards the source.
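To make the "far or near, left or right" idea concrete, here is a minimal sketch in Python. It is not how SoundSpaces works: the trained agents learn these cues from simulated acoustics, whereas this toy simply compares loudness between two channels. The function names and thresholds are hypothetical, chosen for illustration only.

```python
import math

def rms(samples):
    """Root-mean-square level of a list of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def localise(left, right, near_threshold=0.5, balance_margin=0.1):
    """Return coarse (side, distance) labels from a two-channel recording.

    Hypothetical heuristic: the louder channel gives the side, and the
    overall loudness gives a near/far guess. Both thresholds are arbitrary.
    """
    l, r = rms(left), rms(right)
    total = l + r
    if total == 0:
        return ("unknown", "unknown")
    balance = (r - l) / total  # -1 is fully left, +1 is fully right
    if balance > balance_margin:
        side = "right"
    elif balance < -balance_margin:
        side = "left"
    else:
        side = "ahead"
    distance = "near" if max(l, r) >= near_threshold else "far"
    return (side, distance)

# A 440 Hz tone sampled at 16 kHz, louder in the right channel.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
left = [0.2 * s for s in tone]
right = [0.9 * s for s in tone]
print(localise(left, right))  # ('right', 'near')
```

A real system would also have to cope with echoes, occluding walls and competing sounds, which is where the room acoustics modelling described above comes in.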
Facebook's research team tasked an AI system with navigating a given environment to find a sound-emitting object, such as a ringing phone, without pointing the algorithm to any specific goal location. In other words, the virtual assistant can 'hear' and 'see', and can combine the two kinds of sensory data to reach a goal defined by its own perceptions.
In parallel, the researchers released a new tool called SemanticMapNet, to teach virtual assistants how to explore, observe and remember an unknown space, and in this way create a 3D map of their environment that the systems can use to carry out future tasks.
"We had to teach AI to create a top-down map of a space using a first-person point of view, while also building episodic memories and spatio-semantic representations of 3D spaces so it can actually remember where things are," Kristen Grauman, research scientist at Facebook AI Research, told ZDNet. "Unlike any previous approach, we had to create novel forms of memory."
SemanticMapNet will, therefore, allow robots to tell their users whether they locked their front door, or how many chairs are left in the meeting room on the sixth floor.
The technology lets embodied AI systems recognise particular objects, such as a sofa or a kitchen sink, from their first-person view, before mapping them on a 3D representation of the space that is objective and allocentric, meaning that the map is independent of the robot's current location in it.
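The difference between a first-person view and an allocentric map can be illustrated with a few lines of Python. This is a sketch under simple assumptions (a flat floor, a known robot pose, a dictionary standing in for the map), not SemanticMapNet's learned method; `to_world` and `add_observation` are hypothetical helper names.

```python
import math

def to_world(pose, forward, left):
    """Convert an observation in the robot's frame to world coordinates.

    pose is (x, y, heading_radians). A heading of 0 faces +x and angles
    grow counter-clockwise. forward and left are metres from the robot.
    """
    x, y, heading = pose
    wx = x + forward * math.cos(heading) - left * math.sin(heading)
    wy = y + forward * math.sin(heading) + left * math.cos(heading)
    return (wx, wy)

def add_observation(semantic_map, pose, forward, left, label, cell_size=0.5):
    """Store a labelled object in the world-frame grid cell it falls into."""
    wx, wy = to_world(pose, forward, left)
    cell = (math.floor(wx / cell_size), math.floor(wy / cell_size))
    semantic_map[cell] = label
    return cell

semantic_map = {}
# The same sofa, seen from two different robot poses.
cell_a = add_observation(semantic_map, (0.0, 0.0, 0.0), 3.2, 1.2, "sofa")
cell_b = add_observation(semantic_map, (3.2, 4.2, -math.pi / 2), 3.0, 0.0, "sofa")
print(cell_a, cell_b, semantic_map)
```

Both sightings land in the same map cell, even though the sofa sat in a different place in the robot's view each time. That is what "independent of the robot's current location" means in practice.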
Traditional methods, on the other hand, rely on the system's first-person perception throughout the process, which results in errors and inefficiencies. Small objects, for example, are easily missed, while the size of bigger ones is frequently underestimated.
What's more, Facebook's research team fitted their virtual embodied assistants with the ability to anticipate the layout of parts of a room that they cannot see.
Thanks to a protocol called the "occupancy anticipation approach", the AI system can effectively predict parts of the map that it's not directly observing. For example, looking into a dining room, the robot can anticipate that there is free space behind the table, or that the partially visible wall extends to a hallway that is out of view.
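As a rough illustration of what "anticipating" unseen space means, the Python sketch below fills in unobserved cells of a small floor plan. The researchers' system is a trained model; the neighbour-vote rule here is a hand-written stand-in, and the grid symbols and the `anticipate` function are assumptions made for this example.

```python
FREE, OCCUPIED, UNSEEN = ".", "#", "?"

def anticipate(grid):
    """Guess each unseen cell from its observed up/down/left/right neighbours.

    Toy rule: an unseen cell becomes free if more of its observed
    neighbours are free than occupied, occupied if the reverse holds,
    and stays unseen on a tie.
    """
    rows, cols = len(grid), len(grid[0])
    result = [list(row) for row in grid]
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != UNSEEN:
                continue
            neighbours = [
                grid[nr][nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            free = neighbours.count(FREE)
            occupied = neighbours.count(OCCUPIED)
            if free > occupied:
                result[r][c] = FREE
            elif occupied > free:
                result[r][c] = OCCUPIED
    return ["".join(row) for row in result]

# A room in which one stretch of wall and one floor cell were never observed.
observed = [
    "###?#",
    "#..?#",
    "#...#",
    "#####",
]
for row in anticipate(observed):
    print(row)
```

In this example the gap in the top wall is predicted to be wall, and the hidden floor cell is predicted to be free space, much like the dining room case described above.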
Using this technology, the scientists found that the robots outperformed "the best competing method" with over 30% better map accuracy for the same number of movements carried out by the system.
The new tools developed by Facebook's AI team are accessible on AI Habitat, the company's simulation platform that's designed to train embodied AI systems in realistic 3D environments.
Grauman said that the long-term vision for the project is for embodied AI systems to use several "senses", such as vision and hearing, to carry out tasks in real-world settings. Ultimately, this would make virtual assistants more useful, as they could perform a much wider variety of tasks.
"With this project, we are trying to move beyond today's capabilities and into scenarios like asking a home robot 'Can you go check if my laptop is on my desk? If so, bring it to me?' Or, the robot hearing a thud coming from somewhere upstairs, and going to investigate where it is and what it is," said Grauman.
To familiarise virtual assistants with the real world, the Habitat platform includes Facebook Reality Labs' dataset of photo-realistic 3D environments, Replica, which contains detailed reconstructions of various spaces. Habitat is also compatible with existing datasets like Gibson and Matterport3D.
The next step, therefore, will be to transfer the skills developed on the virtual platform to actual robots. Early experiments in transferring skills from Habitat to a physical robot have been described as "promising" by Facebook's researchers.
However, Grauman pointed out that it's hard to tell exactly when we can expect embodied virtual assistants in our homes. While research labs are hard at work setting up the right conditions for the technology, significant challenges remain.
For example, when it comes to advanced applications, robots will have to act on subjective context, personalising their responses to individual preferences. It'll take a while, therefore, before embodied virtual assistants can answer questions like: "Is my favourite pizza on the menu at the new spot in town?"
Still, when applied to technologies like driverless cars, embodied AI could have huge benefits. On-board systems could learn about their environment as they drive, and anticipate objects and obstacles. The technology could also be applied to search-and-rescue robots, which could hear and find people in a crisis.
And back in our homes, one thing is certain: if robots can help us locate a stubbornly ringing phone, any upgrade to existing technologies will be welcome.