In the case of AI Overviews recommending a pizza recipe that contains glue, drawing from a joke post on Reddit, it’s likely that the post seemed relevant to the user’s original query about cheese not sticking to pizza, but something went wrong in the retrieval process, Shah says. “Just because it’s relevant doesn’t mean it’s right, and the generation part of the process doesn’t question that,” he says.
Similarly, if a RAG system comes across conflicting information, such as a policy handbook and an updated version of the same handbook, it has no way to work out which version to draw its answer from. Instead, it may combine information from both to produce a potentially misleading answer.
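To make that failure mode concrete, here is a minimal, hypothetical sketch of the retrieval step in Python. The documents, the query, and the deliberately crude bag-of-words similarity are all invented for illustration, and this is not Google’s system; the point is only that relevance ranking surfaces both versions of the handbook, and nothing in the assembled prompt tells the generator which one is authoritative.

```python
# Illustrative sketch only: hypothetical documents and a deliberately crude
# bag-of-words similarity measure, not any production RAG system.
from collections import Counter
import math

def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two texts."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[w] * wb[w] for w in set(wa) & set(wb))
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(
        sum(v * v for v in wb.values())
    )
    return dot / norm if norm else 0.0

documents = [
    "Policy handbook (2021): employees receive 10 days of paid leave.",
    "Policy handbook (2023, updated): employees receive 15 days of paid leave.",
]
query = "How many days of paid leave does the policy handbook allow?"

# Retrieval ranks by relevance alone. Both versions score highly, so both
# land in the prompt; the generator gets no signal about which is current.
ranked = sorted(documents, key=lambda d: cosine_sim(query, d), reverse=True)
prompt = "Answer using these sources:\n" + "\n".join(ranked) + "\nQuestion: " + query
print(prompt)
```

A generator handed this prompt can fluently mix or average the two figures, which is exactly the kind of misleading blend described above.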
“The large language model generates fluent language based on the provided sources, but fluent language is not the same as correct information,” says Suzan Verberne, a professor at Leiden University who specializes in natural-language processing.
The more specific the topic, the higher the chance of misinformation in a large language model’s output, she says, adding, “This is a problem in the medical domain, but also in education and science.”
According to the Google spokesperson, in many cases when AI Overviews returns incorrect answers, it’s because there isn’t a lot of high-quality information available on the web to show for the query, or because the query most closely matches satirical sites or joke posts.
The spokesperson says the vast majority of AI Overviews provide high-quality information, and that many of the examples of bad AI Overviews answers were in response to uncommon queries, adding that AI Overviews containing potentially harmful, obscene, or otherwise violative content appeared in response to fewer than one in every 7 million unique queries. Google continues to remove AI Overviews for certain queries in accordance with its content policies.
It’s not just bad training data
Although the pizza glue gaffe is a good example of AI Overviews pointing to an unreliable source, the system can also generate misinformation from factually correct sources. Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico, googled “How many Muslim presidents has the US had?” AI Overviews responded: “The United States has had one Muslim president, Barack Hussein Obama.”
While Barack Obama is not Muslim, making AI Overviews’ answer incorrect, the system drew its information from a chapter in an academic book titled Barack Hussein Obama: America’s First Muslim President? So not only did the AI system miss the entire point of the essay, it interpreted it in exactly the opposite of the intended way, Mitchell says. “There are a few problems here for the AI; one is finding a good source that’s not a joke, but another is interpreting what the source is saying correctly,” she adds. “This is something that AI systems have trouble doing, and it’s important to note that even when it does get a good source, it can still make errors.”