Research
Training an AI to communicate in a more useful, correct and harmless way
In recent years, large language models (LLMs) have achieved success at a range of tasks such as question answering, summarization and dialogue. Dialogue is a particularly interesting task because it features flexible and interactive communication. However, LLM-powered dialogue agents can express inaccurate or fabricated information, use discriminatory language, or encourage unsafe behavior.
To create safer dialogue agents, we need to be able to learn from human feedback. Using reinforcement learning based on input from research participants, we explore new methods for training dialogue agents that promise a safer system.
In our most recent paper, we introduce Sparrow – a dialogue agent that is useful and reduces the risk of unsafe and inappropriate answers. Our agent is designed to talk with a user, answer questions, and search the internet using Google when it’s useful to look up evidence to inform its answers.
Our new AI conversation model automatically responds to an initial human prompt.
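To make the “search when useful” behavior concrete, the sketch below shows one way an agent can decide whether to look up evidence before answering. It is a minimal illustration only: the `llm` and `web_search` functions are stand-ins for a language model and a search interface, not Sparrow’s actual components, and the prompting format is invented.

```python
# Minimal sketch of evidence-conditioned answering; not Sparrow's actual code.
# `llm` and `web_search` stand in for a language model API and a search engine,
# stubbed out here so the control flow runs end to end.

def llm(prompt: str) -> str:
    """Placeholder for a language-model call."""
    if "Should the answer cite web evidence" in prompt:
        return "search: how long do elephants live"
    return "African elephants typically live 60-70 years [cited from evidence]."

def web_search(query: str) -> list[str]:
    """Placeholder for a search-engine call returning text snippets."""
    return [f"Snippet retrieved for query: {query}"]

def answer(dialogue_history: str, user_question: str) -> str:
    # Ask the model whether citing web evidence would help, and with what query.
    decision = llm(
        f"{dialogue_history}\nUser: {user_question}\n"
        "Should the answer cite web evidence? Reply 'search: <query>' or 'no search'."
    )
    if decision.startswith("search:"):
        query = decision[len("search:"):].strip()
        evidence = "\n".join(web_search(query)[:2])   # keep the top snippets
        prompt = (f"{dialogue_history}\nEvidence:\n{evidence}\n"
                  f"User: {user_question}\nAgent (citing the evidence):")
    else:
        prompt = f"{dialogue_history}\nUser: {user_question}\nAgent:"
    return llm(prompt)

print(answer("", "How long do elephants live?"))
```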
Sparrow is a research model and proof of concept, designed to train dialogue agents to be more useful, correct, and harmless. By learning these properties in a general dialogue setting, Sparrow advances our understanding of how we can train agents to be safer and more useful – and ultimately, to help build safer and more useful artificial general intelligence (AGI).
Sparrow refuses to answer a potentially harmful question.
How Sparrow works
Training conversational AI is an especially hard problem because it is difficult to pinpoint what makes a dialogue successful. To address this problem, we turn to a form of reinforcement learning (RL) based on people’s feedback, using study participants’ preference feedback to train a model of how useful an answer is.
To get this data, we show our participants multiple model answers to the same question and ask them which answer they like best. Because we show answers with and without evidence retrieved from the web, this model can also determine when an answer should be supported with evidence.
We ask study participants to talk to Sparrow, both naturally and adversarially, continually expanding the dataset used to train Sparrow.
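A common way to turn these pairwise preference judgements into a training signal is a Bradley-Terry style loss: a reward model scores each candidate answer, and the loss encourages the preferred answer to score higher than the alternative. The snippet below is a generic illustration of that idea in plain Python with made-up scores, not the implementation used for Sparrow.

```python
import math

def pairwise_preference_loss(r_preferred: float, r_other: float) -> float:
    """Bradley-Terry / logistic loss: -log P(preferred beats other),
    where P = sigmoid(r_preferred - r_other)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_preferred - r_other))))

# The reward model scores two candidate answers to the same question;
# participants preferred answer A, so training should push its score up.
score_a, score_b = 1.3, 0.4          # hypothetical reward-model outputs
print(pairwise_preference_loss(score_a, score_b))  # small loss: ranking agrees
print(pairwise_preference_loss(score_b, score_a))  # larger loss: ranking disagrees
```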
But increasing usefulness is only part of the story. To make sure the model behaves safely, we must constrain its behavior. And so, we set an initial simple set of rules for the model, such as “don’t make threatening statements” and “don’t make hateful or offensive comments.”
We also provide rules about potentially harmful advice and not claiming to be a person. These rules were informed by studying existing work on language harms and consulting with experts. We then ask our study participants to talk to our system, with the goal of tricking it into breaking the rules. These conversations then allow us to train a separate “rule model” that indicates when Sparrow’s behavior violates any of the rules.
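Conceptually, the rule model is a classifier over a conversation: given the dialogue and a candidate response, it estimates how likely each rule is to be violated, and that estimate can then penalize the reward used during RL. The sketch below only illustrates this combination; the rule list is abridged, `rule_violation_probs` is a hypothetical stand-in for the learned model, and the penalty weighting is invented.

```python
RULES = [
    "Do not make threatening statements.",
    "Do not make hateful or insulting comments.",
    "Do not give potentially harmful advice.",
    "Do not claim to be a person.",
]

def rule_violation_probs(dialogue: str, response: str) -> list[float]:
    """Placeholder for the learned rule model: one violation probability per rule."""
    return [0.01, 0.02, 0.05, 0.00]   # made-up numbers for illustration

def combined_reward(preference_reward: float, dialogue: str, response: str,
                    penalty_weight: float = 2.0) -> float:
    """RL reward: the preference score minus a penalty for likely rule breaks."""
    worst = max(rule_violation_probs(dialogue, response))
    return preference_reward - penalty_weight * worst

# Preference score 1.1, minus a 2.0 * 0.05 penalty for the riskiest rule.
print(combined_reward(1.1, "User: ...", "Agent: ..."))
```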
Towards better artificial intelligence and better judgements
Verifying the correctness of Sparrow’s answers is difficult even for experts. Instead, we ask our participants to determine whether Sparrow’s answers are plausible and whether the evidence Sparrow provides actually supports the answer. According to our participants, Sparrow gives a plausible answer and backs it up with evidence 78% of the time when asked a factual question. This is a big improvement over our baseline models. However, Sparrow is not immune to making mistakes, such as hallucinating facts and giving answers that are sometimes off topic.
Sparrow also has room to improve its rule-following. After training, participants were still able to trick it into breaking our rules 8% of the time, but compared to simpler approaches, Sparrow is better at following our rules under adversarial probing. For example, our original dialogue model broke the rules roughly 3 times as often as Sparrow when our participants tried to trick it into doing so.
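Both evaluation numbers above are simple rates over per-conversation human judgements: the share of factual questions whose answer was rated both plausible and supported by the cited evidence, and the share of adversarial conversations in which any rule was broken. The toy calculation below uses invented judgements purely to show how such rates are computed.

```python
# Toy recomputation of the two headline metrics from per-conversation judgements.
# These boolean lists are invented for illustration; they are not the study data.

plausible   = [True, True, False, True, True]     # answer judged plausible?
supported   = [True, True, False, True, False]    # evidence judged to support it?
rule_broken = [False] * 12 + [True]               # outcomes of adversarial probes

supported_rate = sum(p and s for p, s in zip(plausible, supported)) / len(plausible)
violation_rate = sum(rule_broken) / len(rule_broken)

print(f"plausible and evidence-supported: {supported_rate:.0%}")        # 60%
print(f"rules broken under adversarial probing: {violation_rate:.0%}")  # 8%
```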
Sparrow answers a question and a follow-up question using evidence, then follows the “Don’t impersonate a human” rule when asked a personal question (sample as of September 9, 2022).
Our goal with Sparrow was to create flexible mechanisms for enforcing rules and norms in dialogue agents, but the specific rules we use are preliminary. Developing a better and more complete set of rules will require both the input of experts from many disciplines (including policy makers, social scientists and ethicists) and participatory input from a variety of affected users and groups. We believe that our methods will still apply to a more rigorous set of rules.
Sparrow is a major step forward in understanding how to train dialogue agents to be more useful and safer. However, successful communication between humans and dialogue agents must not only avoid harm, but also align with human values for effective and beneficial communication, as discussed in recent work on aligning language models with human values.
We also emphasize that a good agent will still decline to answer questions in contexts where it is appropriate to defer to humans or where refusing has the potential to deter harmful behavior. Finally, our initial research focused on an English-speaking agent, and further work is needed to ensure similar results in other languages and cultural contexts.
In the future, we hope that conversations between humans and machines can lead to better judgments about AI behavior, allowing humans to align and improve systems that may be too complex to understand without the help of machines.
Want to explore a conversational path to safer AGI? We’re currently hiring research scientists for our Scalable Alignment team.