Called 'SoundSpaces,' the first audio-visual platform for embodied AI is meant to help train assistants that can accomplish tasks like checking whether you locked the front door or retrieving a cell phone that's ringing in an upstairs bedroom.
AI assistants of the future must learn to plan their route, navigate effectively, look around their physical environment, listen to what's happening around them, and build memories of the 3D space.
These smarter assistants will require new advances in embodied AI, which seeks to teach machines to understand and interact with the complexities of the physical world as people do.
Leveraging SoundSpaces, Facebook introduced a new task for embodied AI: AudioGoal.
"To our knowledge, this is the first attempt to train deep reinforcement learning agents that both see and hear to map novel environments and localize sound-emitting targets," Facebook AI said in a statement.
With this approach, the researchers achieved faster training and higher navigation accuracy than with single-modality counterparts.
'SoundSpaces' provides a new audio sensor, making it possible to insert high-fidelity, realistic simulations of any sound source in an array of real-world scanned environments.
Unlike traditional systems that tackle point-goal navigation, an AudioGoal agent doesn't require a pointer to the goal location.
This means an agent can now act upon "go find the ringing phone" rather than "go to the phone that is 25 feet southwest of your current position." It can discover the goal position on its own using multi-modal sensing.
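To make the contrast concrete, here is a minimal Python sketch of what a point-goal instruction such as "25 feet southwest" amounts to: a displacement vector handed to the agent up front. The `PointGoal` class, the `point_goal_from_bearing` helper, and the compass convention (0 degrees = north, 90 degrees = east) are assumptions made for illustration and are not part of SoundSpaces.

```python
import math
from dataclasses import dataclass


@dataclass
class PointGoal:
    """Hypothetical point-goal: the target's offset from the agent, in feet."""
    east_ft: float
    north_ft: float


def point_goal_from_bearing(distance_ft: float, compass_deg: float) -> PointGoal:
    """Convert a distance and compass bearing into an east/north offset.

    Assumed convention: 0 degrees = north, 90 degrees = east.
    """
    theta = math.radians(compass_deg)
    return PointGoal(
        east_ft=distance_ft * math.sin(theta),
        north_ft=distance_ft * math.cos(theta),
    )


# "25 feet southwest": southwest is a compass bearing of 225 degrees.
goal = point_goal_from_bearing(25.0, 225.0)
print(goal)  # both components are negative: the target lies to the west and south
```

An audio-goal agent receives no such vector. In the article's terms, it has to work out where the ringing phone is from what it sees and hears.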
To build 'SoundSpaces', Facebook used a state-of-the-art algorithm for room acoustics modeling and a "bidirectional path tracing algorithm" to model sound reflections in the room geometry.
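The article does not say how the modeled reflections turn into what the agent hears. A common approach in acoustics, assumed here purely for illustration, is to summarize the room's effect between a source and a listener as an impulse response and convolve the "dry" source sound with it. The toy impulse response below (a direct path plus two weaker echoes) is invented, and `render_at_listener` is a hypothetical helper, not a SoundSpaces function.

```python
import numpy as np


def render_at_listener(dry: np.ndarray, impulse_response: np.ndarray) -> np.ndarray:
    """Convolve a dry source signal with a room impulse response."""
    return np.convolve(dry, impulse_response, mode="full")


sample_rate = 16_000  # samples per second (assumed)
t = np.arange(sample_rate // 10) / sample_rate
dry = np.sin(2 * np.pi * 440.0 * t)  # 0.1 s of a 440 Hz tone

# Toy impulse response: direct sound, then two delayed, attenuated reflections.
rir = np.zeros(800)
rir[0] = 1.0     # direct path
rir[240] = 0.5   # first reflection, 15 ms later
rir[560] = 0.25  # second reflection, 35 ms later

wet = render_at_listener(dry, rir)
print(len(dry), len(rir), len(wet))
```

Under this assumption, swapping in a different impulse response for each source and listener position is what would make the same ringtone sound different from behind a couch than from across an open room.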
With 'SoundSpaces', researchers can train an agent to identify and move toward a sound source even if it's behind a couch, for example, or to respond to sounds it has never heard before, said Facebook AI.