To train agents to interact well with humans, we need to be able to measure progress. But human interaction is complex, and measuring progress is difficult. In this work we developed a method, called the Standardized Test Suite (STS), for evaluating agents in temporally extended, multimodal interactions. We examined interactions in which human participants ask agents to perform tasks and answer questions in a 3D simulated environment.
The STS methodology places agents in a set of behavioral scenarios extracted from real human interaction data. Agents see a replayed scenario context, receive an instruction, and are then given control to complete the interaction offline. These agent continuations are recorded and then sent to human raters to annotate as success or failure. Agents are then ranked according to the proportion of scenarios in which they succeed.
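In rough pseudocode, one pass of this evaluation looks like the sketch below. The `Scenario` fields, the agent methods, and the `rate` function are illustrative placeholders rather than the actual STS interfaces.

```python
from dataclasses import dataclass

# Illustrative placeholders only: Scenario, the agent interface, and rate()
# stand in for the real STS scenario format, agent API, and human raters.

@dataclass
class Scenario:
    context_frames: list  # replayed observations leading up to the instruction
    instruction: str      # e.g. "lift the drum" or "what color is the robot?"

def evaluate_agent(agent, scenarios, rate):
    """Return the proportion of scenarios whose continuations raters mark as success."""
    successes = 0
    for scenario in scenarios:
        agent.reset()
        # Replay the scenario context so the agent observes the same history a
        # human participant would have seen before the instruction was given.
        for frame in scenario.context_frames:
            agent.observe(frame)
        # Hand control to the agent to complete the interaction offline and
        # record its continuation.
        continuation = agent.continue_interaction(scenario.instruction)
        # Human raters annotate the recorded continuation as success or failure.
        successes += int(rate(scenario, continuation))
    return successes / len(scenarios)

def rank_agents(agents, scenarios, rate):
    """Rank agents by their STS success proportion, highest first."""
    scores = {name: evaluate_agent(agent, scenarios, rate) for name, agent in agents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```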
Many of the behaviors that are second nature to humans in our daily interactions are difficult to describe in words and impossible to formalize. So the mechanisms we rely on to solve games (like Atari, Go, DotA, and StarCraft) with reinforcement learning won’t work when we’re trying to teach agents to have fluid and successful interactions with humans. For example, consider the difference between these two questions: “Who won this game of Go?” vs. “What are you looking at?” In the first case, we can write a piece of computer code that counts the stones on the board at the end of the game and determines the winner with certainty. In the second case, we have no idea how to write such a function: the answer may depend on the speakers, the sizes and shapes of the objects involved, whether the speaker is joking, and other aspects of the context in which the utterance is given. People intuitively understand the myriad relevant factors involved in answering this seemingly mundane question.
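To make the contrast concrete, here is a deliberately simplified scorer for the first question. It only counts stones (real Go scoring also counts surrounded territory), but the point stands: such a function can be written at all, whereas no comparable function exists for “What are you looking at?”

```python
def score_final_board(board, komi=7.5):
    """Toy scorer: board is a 2D list with "B", "W", or None in each cell.
    Real Go scoring also counts surrounded territory; this only counts stones."""
    black = sum(cell == "B" for row in board for cell in row)
    white = sum(cell == "W" for row in board for cell in row)
    return "white" if white + komi > black else "black"
```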
Interactive evaluation by human participants can serve as a touchstone for understanding agent performance, but it is noisy and expensive. It is difficult to control the exact instructions people give agents when interacting with them for evaluation, and because this type of assessment happens in real time, it is too slow to rely on for rapid progress. Previous work has relied on proxies for interactive evaluation. Proxies, such as losses and scripted probe tasks (e.g. “lift the x,” where x is chosen randomly from the environment and the success function is painstakingly constructed by hand), are useful for quickly gaining knowledge about agents, but they don’t actually correlate that well with interactive evaluation. Our new method has advantages, most notably providing control and speed in a metric that closely aligns with our ultimate goal: creating agents that interact well with humans.
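For a sense of what such a scripted probe task involves, the sketch below hand-codes a success check for a “lift the x” instruction. The object names and lift-height threshold are invented for illustration; real probe-task success functions are built by hand for each task in the environment.

```python
import random

OBJECTS = ["drum", "teddy bear", "rocket", "book"]  # hypothetical object set
LIFT_HEIGHT = 0.5  # required rise above the start position, in meters (arbitrary)

def sample_probe_task(rng=random):
    """Pick a random target object and phrase the scripted instruction."""
    target = rng.choice(OBJECTS)
    return f"lift the {target}", target

def probe_success(env_state, target):
    """Hand-coded success check: the target must be held by the agent and
    raised sufficiently far above where it started."""
    obj = env_state[target]
    return obj["held_by_agent"] and (obj["height"] - obj["start_height"]) > LIFT_HEIGHT
```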
The development of MNIST, ImageNet, and other human-annotated datasets has been essential to progress in machine learning. These datasets allowed researchers to train and evaluate classification models for a one-time cost of human input. The STS methodology aims to do the same for human-agent interaction research. This evaluation method still requires humans to annotate agent continuations. However, early experiments suggest that it may be possible to automate these annotations, which would allow fast and efficient automated evaluation of interactive agents. In the meantime, we hope that other researchers can use the methodology and system design to accelerate their own research in this area.
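If those early results hold, automated annotation might look something like training a classifier on continuations that humans have already labeled and using it to score new ones. The features and model below are placeholders, not the approach used in our experiments.

```python
from sklearn.linear_model import LogisticRegression

def featurize(continuation):
    # Placeholder features: a real system would summarize the continuation's
    # recorded video, actions, and language.
    return [len(continuation["actions"]), len(continuation["utterances"])]

def train_auto_rater(labeled_continuations):
    """Fit a binary success classifier on (continuation, human_label) pairs."""
    X = [featurize(c) for c, _ in labeled_continuations]
    y = [label for _, label in labeled_continuations]
    return LogisticRegression().fit(X, y)

def auto_rate(model, continuation):
    """Predict success or failure for a new, unlabeled continuation."""
    return bool(model.predict([featurize(continuation)])[0])
```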