Researchers at the Robotics and Embodied AI Lab at Stanford University set out to change that. They first built an audio data collection system consisting of a GoPro camera and a handgrip with a microphone designed to filter out background noise. Human demonstrators used the gripper to perform various household tasks, and the team then used that data to teach robotic arms to perform the tasks on their own. The team’s new training algorithms help robots pick up cues from audio signals to perform more efficiently.
“So far, robots have been trained on videos that are silent,” says Zeyi Liu, a Ph.D. student at Stanford and lead author of the study. “But there’s so much useful data in the audio.”
To test how much more successful a robot can be if it is able to “listen,” the researchers chose four tasks: flipping a bun in a pan, erasing a blackboard, fastening two strips of Velcro together, and pouring dice from a cup. In each task, sound provides cues that cameras and touch sensors struggle to capture, such as whether the eraser is making proper contact with the board or whether the cup contains dice.
After each task had been performed a few hundred times, the team compared the success rates of training with sound against training with vision alone. The results, published in a paper on arXiv that has not been peer-reviewed, were very promising. When using vision alone in the dice test, the robot could tell whether there were dice in the cup only 27% of the time; with sound included, that rose to 94%.
It’s not the first time sound has been used to train robots, says Shuran Song, head of the lab that produced the study, but it’s a big step toward doing so at scale: “We’re making it easier to use sound collected ‘in the wild,’ instead of being limited to collecting it in the lab, which is more time-consuming.”
The research signals that audio may become a more sought-after data source in the race to train AI robots. Researchers are teaching robots faster than ever before through imitation, showing them hundreds of examples of tasks being performed instead of hand-coding each one. If sound could be collected at scale using devices like the one in the study, it could give robots a whole new “sense,” helping them adapt more quickly to environments where visibility is limited or unhelpful.
“It’s safe to say that sound is the most understudied detection method [in robots],” says Dmitri Berenson, an associate professor of robotics at the University of Michigan, who was not involved in the study. This is because most of the research on training robots to handle objects has involved industrial pick-and-place tasks, such as sorting objects into bins. These tasks do not benefit much from sound, instead relying on tactile or visual sensors. But as robots expand into tasks in homes, kitchens and other environments, sound will become increasingly useful, Berenson says.
Consider a robot trying to find which bag or pocket contains a set of keys, all with limited visibility. “Maybe even before you touch the keys, you can hear them jingling,” says Berenson. “That’s an indication that the keys are in that pocket instead of the others.”
However, sound has limits. The team points out that audio won’t be as useful with soft or flexible objects like clothing, which don’t generate much usable sound. The robots also had difficulty filtering out the noise of their own motors during the tasks, since that noise was not present in the human-generated training data. To fix this, the researchers needed to add robot sounds, such as motor hums and actuator noises, to the training sets so the robots could learn to tune them out.
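The general idea behind that fix is a form of audio augmentation: mix recordings of the robot’s own motor noise into the human-collected clips before training, so the learned policy treats that noise as background rather than as a task cue. The sketch below illustrates the concept; it is a minimal example assuming a shared sample rate and a decibel-based signal-to-noise target, and the function names and parameters are illustrative rather than details from the paper.

```
import numpy as np

SAMPLE_RATE = 16_000  # assumed sample rate shared by demo audio and motor-noise clips


def mix_motor_noise(demo_audio: np.ndarray,
                    motor_noise: np.ndarray,
                    snr_db: float,
                    rng: np.random.Generator) -> np.ndarray:
    """Overlay a random slice of recorded motor noise onto a human demo clip
    at a target signal-to-noise ratio (in dB). Illustrative sketch only."""
    # Take a random segment of motor noise as long as the demo clip,
    # looping the noise recording if it is shorter.
    reps = int(np.ceil(len(demo_audio) / len(motor_noise)))
    noise = np.tile(motor_noise, reps)
    start = rng.integers(0, len(noise) - len(demo_audio) + 1)
    noise = noise[start:start + len(demo_audio)]

    # Scale the noise so the mixture hits the requested SNR.
    signal_power = np.mean(demo_audio ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(signal_power / (noise_power * 10 ** (snr_db / 10)))

    # Keep the mixed waveform in the valid [-1, 1] range.
    return np.clip(demo_audio + scale * noise, -1.0, 1.0)


# Usage sketch: augment a training clip with motor noise at a random SNR.
rng = np.random.default_rng(0)
demo_clip = rng.uniform(-0.1, 0.1, SAMPLE_RATE * 2)   # stand-in for a 2-second demo recording
motor_clip = rng.uniform(-0.05, 0.05, SAMPLE_RATE)    # stand-in for recorded motor noise
augmented = mix_motor_noise(demo_clip, motor_clip, snr_db=rng.uniform(5, 20), rng=rng)
```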
The next step, Liu says, is to see how much better the models can be made with more data, which could mean adding more microphones, collecting spatial audio, and integrating microphones into other types of data collection devices.