To teach an AI agent a new task, such as how to open a kitchen cabinet, researchers often use reinforcement learning—a trial-and-error process where the agent is rewarded for actions that get it closer to the goal.
In many cases, a human expert must carefully design a reward function, which is an incentive mechanism that motivates the agent to explore. The expert must iteratively update this reward function as the agent explores and tries different actions. This can be time-consuming, inefficient and difficult to scale, especially when the task is complex and involves many steps.
Researchers from MIT, Harvard University and the University of Washington have developed a new reinforcement learning approach that does not rely on a specially designed reward function. Instead, it leverages crowdsourced feedback gathered from many non-expert users to guide the agent as it learns to achieve its goal.
While some other methods also attempt to use feedback from non-experts, this new approach allows the AI agent to learn faster, despite the fact that user-generated data is often full of errors. This noisy data may cause other methods to fail.
Additionally, this new approach allows feedback to be collected asynchronously so that non-expert users around the world can contribute to teaching the agent.
“One of the most time-consuming and challenging parts of designing a robotic agent today is engineering the reward function. Today, reward functions are designed by expert researchers – a paradigm that cannot be scaled if we want to teach our robots many different tasks. Our work suggests a way to scale robot learning by crowdsourcing the design of the reward function and enabling non-experts to provide useful feedback,” says Pulkit Agrawal, assistant professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), who leads the Improbable AI Lab at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot learn to perform specific tasks in a user’s home quickly, without the owner having to show the robot physical examples of each task. The robot could explore on its own, with crowd-sourced non-expert feedback guiding its exploration.
“In our method, the reward function guides the agent in what to explore, rather than telling it exactly what to do to complete the task. So even if the human supervision is somewhat imprecise and noisy, the agent is still able to explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant at the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; and others at the University of Washington and MIT. The research will be presented at the Conference on Neural Information Processing Systems next month.
Noisy feedback
One way to collect user feedback for reinforcement learning is to show a user two images of states achieved by the agent and ask which state is closer to the goal. For instance, a robot’s goal might be to open a kitchen cabinet. One image might show that the robot opened the cabinet, while the second might show that it opened the microwave instead. A user would pick the photo of the “better” state.
Some previous approaches try to use this crowdsourced binary feedback to optimize a reward function that the agent would use to learn the task. However, because non-experts are likely to make mistakes, the reward function can become very noisy, so the agent can get stuck and never reach its goal.
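To make that setup concrete, here is a minimal sketch of fitting a score to noisy binary comparisons with a Bradley-Terry-style logistic model. The linear score, the toy 2-D states, and the simulated annotators are all illustrative assumptions, not the researchers’ implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_score(comparisons, dim, lr=0.1, epochs=200):
    """Fit a linear score so that higher-scoring states are the ones
    annotators tend to judge as closer to the goal."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for s_a, s_b, label in comparisons:
            p = sigmoid(w @ (s_a - s_b))          # P(annotator prefers s_a)
            w += lr * (label - p) * (s_a - s_b)   # logistic-regression step
    return w

# Toy data: states in the unit square, goal far away at (10, 10),
# and annotators who flip their answer 10 percent of the time.
goal = np.array([10.0, 10.0])
states = rng.uniform(size=(50, 2))
comparisons = []
for _ in range(300):
    s_a, s_b = states[rng.integers(50)], states[rng.integers(50)]
    label = float(np.linalg.norm(s_a - goal) < np.linalg.norm(s_b - goal))
    if rng.random() < 0.1:                        # simulated annotator mistake
        label = 1.0 - label
    comparisons.append((s_a, s_b, label))

print(train_score(comparisons, dim=2))  # points roughly toward the goal
```

Even in this toy version, the failure mode the researchers describe is visible: the noisy labels mostly average out, but an agent that optimized this learned score exactly could still be led astray, which is why the new method uses such a score only to steer exploration, as described below.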
“Basically, the agent would take the reward function very seriously. It would try to match the reward function perfectly. So instead of directly optimizing the reward function, we just use it to tell the robot which areas to explore,” says Torne.
He and his colleagues decoupled the process into two separate parts, each driven by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On the one hand, a goal selector algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration; in a sense, non-expert users drop breadcrumbs that gradually lead the agent toward its goal.
On the other hand, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of the actions it tries, which are then sent to humans and used to update the goal selector.
This narrows the area the agent needs to explore, steering it toward more promising regions that are closer to its goal. If there is no feedback, or if feedback is slow to arrive, the agent simply keeps learning on its own, albeit more slowly. As a result, feedback can be collected infrequently and asynchronously.
“The exploration loop can keep going autonomously, because it’s just going to explore and learn new things. And then when you get a better signal, it will explore in more specific ways. You can just keep the two loops spinning at their own pace,” adds Torne.
And because the feedback just gently guides the agent’s behavior, it will eventually learn to complete the task even if users provide incorrect answers.
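Taken together, the decoupled structure can be sketched roughly as follows, with a toy gridworld, a stand-in crowd-trained score, and feedback that arrives only occasionally. Every name and dynamic here is a hypothetical illustration, not the team’s code:

```python
import random

random.seed(0)
GOAL = (9, 9)
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def crowd_score(state):
    """Stand-in for the comparison-trained model: higher means
    annotators tend to judge the state as closer to the goal."""
    return -(abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))

def select_goal(frontier, have_feedback):
    # With feedback, bias exploration toward highly ranked states;
    # without it, fall back to plain self-supervised exploration.
    if have_feedback:
        return max(frontier, key=crowd_score)
    return random.choice(frontier)

def rollout(target, steps=20):
    """Noisy walk that drifts toward the selected exploration target."""
    state, visited = (0, 0), [(0, 0)]
    for _ in range(steps):
        if random.random() < 0.7:   # drift toward the target
            dx = (target[0] > state[0]) - (target[0] < state[0])
            dy = (target[1] > state[1]) - (target[1] < state[1])
            move = (dx, 0) if dx else (0, dy)
        else:                       # random exploratory action
            move = random.choice(MOVES)
        state = (min(9, max(0, state[0] + move[0])),
                 min(9, max(0, state[1] + move[1])))
        visited.append(state)
    return visited

frontier = [(0, 0)]
for it in range(100):
    have_feedback = it % 5 == 0     # feedback arrives only now and then
    visited = rollout(select_goal(frontier, have_feedback))
    frontier.extend(visited)        # newly reached states widen the frontier
    if GOAL in visited:
        print(f"reached the goal at iteration {it}")
        break
```

If the comparisons stop arriving, this loop degrades to uniform exploration rather than stalling, which mirrors the asynchronous behavior Torne describes.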
Faster learning
The researchers tested this method in a series of simulated and real-world tasks. In simulation, they used HuGE to efficiently learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.
In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these experiments, they collected data from 109 non-expert users in 13 countries spanning three continents.
In real-world and simulation experiments, HuGE helped agents learn to reach the goal faster than other methods.
The researchers also found that data gathered from non-experts yielded better performance than synthetic data produced and labeled by the researchers themselves. For non-expert users, labeling 30 images or videos took less than two minutes.
“This makes it very promising in terms of being able to scale this method up,” adds Torne.
In related work, presented at the recent Conference on Robot Learning, the researchers enhanced HuGE so that an AI agent can learn to perform a task and then autonomously reset the environment to continue learning. For example, if the agent learns to open a cabinet, the method also teaches it to close the cabinet.
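A toy sketch of that alternation, with a one-dimensional “cabinet angle” standing in for the real environment (the class, dynamics, and success check are hypothetical illustrations):

```python
class ToyCabinet:
    """Hypothetical stand-in environment: angle 0.0 is closed, 1.0 is open."""
    def __init__(self):
        self.angle = 0.0

    def step(self, action):
        self.angle = min(1.0, max(0.0, self.angle + action))

def run_episode(env, target, steps=10, max_action=0.2):
    # Nudge the cabinet toward the target angle, one bounded action at a time.
    for _ in range(steps):
        env.step(max(-max_action, min(max_action, target - env.angle)))
    return abs(env.angle - target) < 0.05   # toy "success" check

env = ToyCabinet()
for episode in range(3):
    opened = run_episode(env, target=1.0)   # forward task: open the cabinet
    closed = run_episode(env, target=0.0)   # reset task: close it again
    print(episode, opened, closed)          # training continues, no human reset
```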
“Now we can have it learn completely autonomously, without needing human resets,” says Torne.
The researchers also emphasize that, in this and other learning approaches, it is important to ensure that AI agents are aligned with human values.
In the future, they want to continue improving HuGE so that the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in applying this method to train multiple agents simultaneously.
This research is funded, in part, by the MIT-IBM Watson AI Lab.