When Google announced the launch of its artificial intelligence search feature earlier this month, the company promised, "Google will complete Google searches for you."
This new feature, called "AI Overviews," will provide short summaries generated by artificial intelligence, highlighting key information and links at the top of the search results page.
Unfortunately, artificial intelligence systems are inherently unreliable. Within days of AI Overviews going live in the United States, users began sharing examples of its bizarre answers on social media.
It suggested that users add glue to their pizza or eat at least one small stone every day.
It also claimed that former U.S. President Andrew Johnson obtained a college degree between 1947 and 2012, even though he died in 1875.
On May 30th local time, Google Search Director Liz Reid stated that the company has been making technical improvements to the system to reduce the likelihood of generating incorrect answers, including better mechanisms for detecting nonsensical queries.
The company also restricted the inclusion of irony, humor, and user-generated content in responses, as this information could lead to misleading suggestions.
But why does AI Overviews return unreliable and potentially dangerous information? And can the problem be fixed, if at all?
To understand why artificial intelligence search engines make mistakes, we need to look at how they work.
We know that AI Overviews uses a version of the generative artificial intelligence model Gemini. Gemini belongs to Google's family of large language models (LLMs) and has been customized for Google Search.
The model has been integrated with Google's core web ranking system, designed to extract relevant results from its website index.
Most large language models simply predict the next word (or token) in a sequence, which makes the content they generate read fluently but also leaves them prone to fabricating false information.
They have no ground truth to rely on as evidence; instead, they choose each word purely on the basis of statistical calculation.
This can lead to "hallucinations."
Chirag Shah, a professor at the University of Washington who specializes in online search, said that the Gemini model in AI Overviews likely gets around this issue by using an artificial intelligence technique called retrieval-augmented generation (RAG). The technique allows large models to check specific information sources outside of their training data, such as certain web pages.
Once a user enters a query, it is checked against the documents that make up the system's information sources, and a response is generated.
Because it can match the original query with specific parts of the web page, it can provide the source of the answer, which is something ordinary large models cannot do.
One of the main advantages of retrieval-augmented generation technology is that the responses it generates to user queries should be more timely, accurate, and relevant than the responses of typical models that generate answers based solely on training data. This technology is often used to prevent large models from producing "hallucinations".
However, a Google spokesperson did not confirm whether AI Overviews uses retrieval-augmented generation.
Retrieval-augmented generation is not flawless. For a large model that uses this technique to come up with a good answer, it must both retrieve the information correctly and generate the response correctly.
When one or both of these steps fail, the model will provide a poor answer.
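The two steps described above can be sketched in a few lines of Python. This is only a toy illustration, not how AI Overviews works: the documents, the `.example` source names, the word-overlap scoring, and the template "generator" are all assumptions made for the sake of the example. It shows why retrieval matters: the retriever ranks purely by overlap with the query, so the most "relevant" document is not necessarily a reliable one.

```python
import re

# Hypothetical mini index: made-up documents standing in for web pages.
DOCUMENTS = [
    {"source": "cooking-site.example",
     "text": "Let the pizza rest so the cheese sets and sticks to the sauce."},
    {"source": "forum-joke.example",
     "text": "Cheese not sticking to pizza? Mix glue into the sauce so the cheese sticks."},
    {"source": "gardening.example",
     "text": "Water tomato plants early in the morning."},
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, documents, k=1):
    """Step 1: rank documents by how many query words they share (relevance only)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc["text"])),
        reverse=True,
    )
    return ranked[:k]


def generate_answer(query, retrieved):
    """Step 2: a stand-in for the language model; it repeats the top document
    and cites where it came from, without questioning whether it is true."""
    top = retrieved[0]
    return f"{top['text']} (source: {top['source']})"


def answer(query):
    return generate_answer(query, retrieve(query, DOCUMENTS))


print(answer("cheese not sticking to pizza"))
```

In this sketch the joke post shares the most words with the query, so it is the one retrieved, and the generation step passes it along with a citation attached.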
The AI Overviews recommendation to add glue to pizza stems from a humorous reply on the Reddit forum.
The post probably looked relevant to the user's original query about cheese not sticking to pizza, but something went wrong in the retrieval process.
Just because content is relevant does not mean it is correct, and the generation step of the process does not question it.
Similarly, if a retrieval-augmented generation system encounters conflicting information, such as an old and a new version of a policy manual, it cannot determine which version to draw on when constructing a response.
It may combine information from both, resulting in a potentially misleading answer.
Suzan Verberne, a professor at Leiden University in the Netherlands who specializes in natural language processing, said: "Large language models generate fluent replies based on the sources of information you provide, but fluent replies are not the same as correct information."
She said that the more specific a topic is, the higher the chance of incorrect information appearing in the output of large language models.
And she added: "This issue is not only present in the medical field, but also in the field of education and science."
A Google spokesperson stated that in many instances, when AI Overviews returns incorrect answers, it is because little high-quality information is available on the internet, or because the user's query most closely matches satirical websites or humorous posts.
The spokesperson indicated that AI Overviews provides high-quality information in the vast majority of cases, and many incorrect instances are related to uncommon queries.
The spokesperson added that the probability of AI Overviews including harmful, obscene, or otherwise unacceptable content in its responses is one in seven million, meaning that for every seven million unique queries, there would be one such response.
The spokesperson also stated that Google would continue to remove AI Overviews for certain queries in accordance with its content policy.
Although the "pizza glue" error is a good illustration of how AI Overviews can point to unreliable sources, the system may also generate incorrect information from factually correct sources.
Melanie Mitchell, an artificial intelligence researcher at the Santa Fe Institute in New Mexico, USA, searched "How many Muslim presidents has the United States had?"
AI Overviews responded, "The United States has had one Muslim president, Barack Hussein Obama."
Barack Obama is not a Muslim, so AI Overviews' response is incorrect. However, it drew this information from a book titled "Barack Hussein Obama: The United States' First Muslim President?"
Thus, the artificial intelligence system not only missed the point of the source but also interpreted it in exactly the opposite way from what was intended.
The AI faces several problems here: one is finding a good information source that is not a joke, and another is correctly interpreting what the source says.
This is something that artificial intelligence systems find difficult to do. It is important to note that even when they have a good source of information, they can still make mistakes.
Ultimately, we have to accept that artificial intelligence systems are unreliable. As long as they generate text word by word using probabilities, there will always be a risk of "hallucinations."
Although AI Overviews may improve as Google makes adjustments, we can never be sure if it will be 100% accurate.
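To make the idea of generating text word by word from probabilities concrete, here is a minimal sketch. The tiny probability table is invented purely for illustration and has nothing to do with any real model; actual language models work with far larger vocabularies and learned probabilities. The point is only that an unlikely and wrong continuation still gets sampled some of the time.

```python
import random

# Made-up next-word probabilities, for illustration only.
NEXT_WORD_PROBS = {
    "the": {"cheese": 0.6, "glue": 0.4},
    "cheese": {"sticks": 0.7, "melts": 0.3},
    "glue": {"sticks": 1.0},
    "sticks": {},
    "melts": {},
}


def generate_text(start, rng, max_words=5):
    """Build a phrase one word at a time by sampling from the table."""
    words = [start]
    while len(words) < max_words:
        options = NEXT_WORD_PROBS.get(words[-1], {})
        if not options:  # no known continuation: stop
            break
        choices = list(options)
        weights = [options[word] for word in choices]
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [generate_text("the", rng) for _ in range(1000)]
share_with_glue = sum("glue" in s for s in samples) / len(samples)
print(f"Share of samples mentioning glue: {share_with_glue:.2f}")
```

Every sample is fluent by construction, yet a sizeable share of them follow the "glue" branch simply because the table assigns it some probability.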
The company says it is adding triggering restrictions for queries where AI Overviews has not proved especially helpful, as well as additional "trigger improvements" for health-related queries.
Verberne said that the company could add a step to the information retrieval process that flags risky queries and has the system refuse to generate answers in those cases.
A Google spokesperson stated that the company's goal is not to display AI Overviews for dangerous topics or in vulnerable situations.
Techniques such as reinforcement learning from human feedback, which incorporates that feedback into the training of large models, can also help improve the quality of their answers.
Similarly, large models can be specifically trained to recognize questions they cannot answer. It is also useful to instruct them to carefully evaluate the quality of the retrieved documents before generating an answer; proper guidance helps a great deal.
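The two safeguards mentioned above, flagging risky queries and checking the quality of retrieved documents before answering, could look roughly like the sketch below. Everything in it is hypothetical: the list of risky terms, the `quality` score attached to each document, and the threshold are assumptions chosen for illustration, not anything Google has described.

```python
import re

# Hypothetical list of terms that mark a query as risky.
RISKY_TERMS = {"dosage", "overdose", "poison", "suicide"}

# Hypothetical minimum quality score a document needs to be used.
MIN_QUALITY = 0.5


def is_risky(query):
    """Flag the query if it contains any term from the risky list."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return bool(words & RISKY_TERMS)


def answer_or_refuse(query, retrieved):
    """Return an answer string, or None when the system should stay silent."""
    if is_risky(query):
        return None  # refuse to generate for flagged queries
    trusted = [doc for doc in retrieved if doc["quality"] >= MIN_QUALITY]
    if not trusted:
        return None  # nothing reliable enough to answer from
    best = max(trusted, key=lambda doc: doc["quality"])
    return f"{best['text']} (source: {best['source']})"


# Made-up retrieved documents with assumed quality scores.
docs = [
    {"source": "cooking-site.example", "text": "Let the pizza rest before slicing.", "quality": 0.9},
    {"source": "forum-joke.example", "text": "Add glue to the sauce.", "quality": 0.1},
]
print(answer_or_refuse("how do I keep cheese on pizza", docs))
print(answer_or_refuse("what is a safe dosage?", docs))
```

The design choice here is to prefer silence over a weak answer: the function returns `None` both for flagged queries and when no retrieved document clears the quality threshold.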
Although Google has added a label to the answers in AI Overviews, which reads "Generative AI is experimental," it should consider making it clearer to people that the feature is being tested and emphasize that it is not yet ready to provide completely reliable answers.
"It is still in beta and will be for a while, and before it is no longer in beta, it should be an option, rather than being forced on users as part of the core search," said Shah.