DeepMind published a series of papers on large language models (LLMs) last year, including an analysis of Gopher, our large language model. Language modeling technology, which is also being developed by many other labs and companies, promises to enhance many applications, from search engines to a new wave of chatbot-style conversational assistants and beyond. One paper in this series presents several reasons why “raw” language models like Gopher do not meet our standards for safely deploying this technology in user-facing applications, especially if guardrails are not in place to manage problematic and potentially harmful behavior.
Our latest work focuses on one of these concerns: language models like Gopher can “hallucinate” facts that look plausible but are actually false. Those familiar with this problem know to do their own fact-checking rather than trusting what the language model says; those who are not may end up believing something untrue. This article describes GopherCite, a model that aims to address this hallucination problem. GopherCite tries to support all of its factual claims with evidence from the web. It uses Google Search to find relevant web pages and quotes a passage that tries to show why its answer is correct. If the system cannot form an answer that is sufficiently supported by evidence, it tells the user “I don’t know” instead of giving an unsupported answer.
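To make this “answer with a quote, or decline” behavior concrete, here is a minimal sketch of such a loop in Python. It is only an illustration of the idea described above, not DeepMind’s implementation: the helper functions `search_web`, `generate_answer_with_quote`, and the scoring threshold are hypothetical placeholders.

```python
# Minimal sketch of a GopherCite-style "answer or abstain" loop.
# search_web and generate_answer_with_quote are hypothetical placeholders,
# standing in for a real search backend and a quote-producing language model.
from dataclasses import dataclass


@dataclass
class SupportedAnswer:
    answer: str
    quote: str
    source_url: str
    score: float  # estimate of how well the quote supports the answer


def search_web(question: str) -> list[str]:
    """Placeholder: return candidate documents for the question."""
    return ["Lake Placid hosted the Winter Olympic Games in 1932 and 1980."]


def generate_answer_with_quote(question: str, document: str) -> SupportedAnswer:
    """Placeholder: a language model would draft an answer plus a verbatim quote."""
    return SupportedAnswer(
        answer="Twice, in 1932 and 1980.",
        quote=document,
        source_url="https://example.org/lake-placid",
        score=0.9,
    )


def answer_or_abstain(question: str, threshold: float = 0.5) -> str:
    """Return the best-supported answer, or decline if nothing clears the threshold."""
    candidates = [
        generate_answer_with_quote(question, doc) for doc in search_web(question)
    ]
    best = max(candidates, key=lambda c: c.score, default=None)
    if best is None or best.score < threshold:
        return "I don't know."
    return f'{best.answer}\n\nSupporting quote ({best.source_url}): "{best.quote}"'


if __name__ == "__main__":
    print(answer_or_abstain("How many times has Lake Placid hosted the Winter Olympics?"))
```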
Supporting simple factual claims with easily verifiable evidence is one step towards making language models more trustworthy, both to users interacting with them and to annotators assessing the quality of their samples. Comparing the behavior of “raw” Gopher with that of our new model helps illustrate this change.
Looking at Gopher’s answer, you’ll notice that it made up a fact (“Lake Placid hosted the 1936 Winter Olympics”) without any warning. With GopherCite’s verified quote from a relevant Wikipedia page displayed alongside its answer, we can confirm that Lake Placid has hosted the Winter Olympics only twice, in 1932 and 1980.
To change Gopher’s behavior in this way, we trained Gopher according to human preferences. We asked participants in a user study to pick their preferred answer from a pair of candidates, according to criteria such as how well the evidence supports the answer given. These labels were used as training data for both supervised learning on top-rated samples and reinforcement learning from human preferences (RLHP). We also took this approach in our recent work on red teaming.
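The core of learning from pairwise preference labels can be illustrated with a toy example. The sketch below fits a linear “reward model” so that the answer raters preferred scores higher than the rejected one, using a standard Bradley-Terry style logistic loss. It is a simplified stand-in with made-up features, not DeepMind’s training code, and the linear model and learning rate are assumptions for the sake of illustration.

```python
# Toy sketch of learning from pairwise human preferences (RLHP-style).
# Raters compare two candidate answers; a reward model is fit so the
# preferred answer receives the higher score (logistic pairwise loss).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature vectors for (preferred, rejected) answer pairs.
preferred = rng.normal(size=(256, 8)) + 0.5
rejected = rng.normal(size=(256, 8))

w = np.zeros(8)          # linear reward model: reward = features @ w
learning_rate = 0.1

for _ in range(200):
    margin = preferred @ w - rejected @ w          # reward difference per pair
    p_correct = 1.0 / (1.0 + np.exp(-margin))      # P(preferred ranked higher)
    # Gradient of the negative log-likelihood of the human preference labels.
    grad = -((1.0 - p_correct)[:, None] * (preferred - rejected)).mean(axis=0)
    w -= learning_rate * grad

print("mean P(preferred > rejected):", p_correct.mean())
```

In the full system, a reward model trained this way can both rerank candidate answers and provide the reward signal for reinforcement learning.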
We are not the only ones interested in this problem of factual inaccuracy in language models. Our colleagues at Google recently made progress on grounding in their latest LaMDA system, having a conversational model interact with Google Search and sometimes share relevant URLs. GopherCite’s training programme uses a similar methodology to LaMDA’s, but a critical difference is that we aim to provide a specific snippet of relevant evidence, rather than simply pointing the user to a URL. Based on motivations similar to ours, OpenAI recently announced work on a closely related system called WebGPT, which also applies RLHP to align the GPT-3 language model. Whereas GopherCite focuses on reading long document inputs, WebGPT carefully curates the context presented to the language model by interacting multiple times with a web browser. It also cites evidence to support its answers. The similarities and differences between these systems and ours are discussed in our paper, and we also demonstrate that GopherCite very often provides convincing evidence for its claims.
We conducted a user study with paid participants to evaluate the model on two types of questions: fact-seeking questions typed into Google Search (released by Google in a dataset called “NaturalQuestions”), and explanation-seeking questions asked by Reddit users on a forum called “/r/eli5” (“Explain it like I’m 5 [years old]”). Participants in our study judged that GopherCite answers fact-seeking questions correctly, and with satisfactory evidence, about 80% of the time, and does so for explanation-seeking questions about 67% of the time. When we allow GopherCite to refrain from answering some questions, its performance improves dramatically on the questions it does choose to answer (see the paper for details). This explicit abstention mechanism is a key contribution of our work, and the sketch below illustrates the trade-off it creates.
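The following toy Python snippet illustrates why abstention helps: as the required support score rises, the model answers fewer questions (coverage drops) but is correct more often on the questions it does answer. The scores and correctness labels here are invented purely for illustration, not results from the paper.

```python
# Illustrative sketch of the answer-or-abstain trade-off: raising the support
# threshold lowers coverage but raises accuracy on the answered subset.
def selective_metrics(predictions, threshold):
    """predictions: list of (support_score, is_correct) pairs."""
    answered = [(s, ok) for s, ok in predictions if s >= threshold]
    coverage = len(answered) / len(predictions)
    accuracy = sum(ok for _, ok in answered) / len(answered) if answered else float("nan")
    return coverage, accuracy


# Made-up (support_score, is_correct) pairs, for illustration only.
fake_predictions = [(0.9, True), (0.8, True), (0.7, False), (0.6, True),
                    (0.4, False), (0.3, True), (0.2, False), (0.1, False)]

for t in (0.0, 0.5, 0.75):
    cov, acc = selective_metrics(fake_predictions, t)
    print(f"threshold={t:.2f}  coverage={cov:.2f}  accuracy={acc:.2f}")
```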
But when we evaluate the model against a set of “adversarial” questions, which try to trick the model into parroting a fiction or misconception that is stated online, GopherCite often falls into the trap. For example, when asked “what does Red Bull give you?”, here’s how it responds:
We believe that this failure mode and others discussed in our paper can be avoided by enriching the setting, moving from a “one-shot” reply to a user’s question to one in which the model can ask clarifying questions of the user and engage in dialogue. For example, we could allow future models to ask the user whether they want an answer that is literally true or one that is true within the confines of the fictional world of a Red Bull advertisement.
In summary, we believe GopherCite is an important step forward, but building it has taught us that citing evidence is only one part of an overall strategy for safety and trustworthiness. More fundamentally, not all claims require evidence, and as we have shown above, not all claims supported by evidence are true. Some claims require multiple pieces of evidence along with a logical argument explaining why the claim follows. We will continue to work in this area and aim to overcome the issues presented here with further research and development as well as dedicated socio-technical research.
Our paper covers much more detail about our methods, experiments, and relevant context from the research literature. We also created an FAQ about GopherCite, answered by the model itself after reading the paper’s introduction (using candidate samples curated by the authors):