A hallmark of popular generative artificial intelligence programs such as ChatGPT is that they have a time cut-off in terms of which facts they have absorbed. For example, OpenAI recently updated its GPT-4 program to have access to data about events that took place up until April 2023; prior to that update, the tool was trained only on data from as recently as 2021.
AI scientists, however, are working on ways to allow generative AI programs to reliably access ever-changing data about timely and pressing questions, such as, "What is King Gizzard's most recent studio album?" (Answer: The Silver Cord.)
In that spirit, Google and OpenAI this month published a joint effort called FreshLLMs that induces GPT-4 to use information retrieved from Google searches. The core of FreshLLMs is a new method for prompting a language model, called "FreshPrompt," which includes results from a search engine.
By including in the input prompt for GPT-4 the top search results from Google, and then showing a valid answer to a query based on those search results, GPT-4 was induced to use evidence from the Web search to craft its output. The result significantly improved the program's answer to questions involving timely information.
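To make the idea concrete, here is a minimal sketch of how such a prompt might be assembled: one or more worked demonstrations (a query, its search evidence, and a valid answer), followed by the real query and its evidence. This is an illustration under stated assumptions, not the authors' code. The `Evidence` fields, the `build_prompt` helper, the plain-text layout, and the `max_evidences` parameter are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """One search result. The field names are assumptions for this sketch."""
    source: str
    date: str
    snippet: str


def format_evidences(evidences, max_evidences):
    """Render at most max_evidences search results as plain-text blocks."""
    blocks = []
    for ev in evidences[:max_evidences]:
        blocks.append(
            f"source: {ev.source}\ndate: {ev.date}\nsnippet: {ev.snippet}\n"
        )
    return "\n".join(blocks)


def build_prompt(question, evidences, demonstrations, max_evidences=5):
    """Build a few-shot prompt: demonstrations first, then the real query.

    Each demonstration is a dict with hypothetical keys "question",
    "evidences" (a list of Evidence), and "answer".
    """
    parts = []
    for demo in demonstrations:
        demo_evidence = format_evidences(demo["evidences"], max_evidences)
        parts.append(
            f"query: {demo['question']}\n\n{demo_evidence}\n"
            f"answer: {demo['answer']}\n"
        )
    # The final block is left open at "answer:" for the model to complete.
    real_evidence = format_evidences(evidences, max_evidences)
    parts.append(f"query: {question}\n\n{real_evidence}\nanswer:")
    return "\n".join(parts)
```

The resulting string would be sent to the language model as its input; the search step that produces the `Evidence` objects is left out of the sketch on purpose.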
"FreshPrompt significantly improves performance over competing search engine-augmented approaches," write lead author Tu Vu of Google and colleagues, in the research paper, "FreshLLMs: Refreshing large language models with search engine augmentation," which is posted on the arXiv pre-print server.
The FreshPrompt technique, however, is only one part of the story. In order to test how GPT-4 and competing programs perform when using Web data, Vu and colleagues had to come up with a list of questions that would pose a challenge with real-world, up-to-date facts.
To do so, the team -- with the help of colleagues and online freelancers -- wrote questions about "developments in the world" that were crafted to include what they call "fresh knowledge" -- meaning, "knowledge that has changed recently or new events" -- and that were also questions "plausible for a real person to type into a search engine."
They came up with 600 questions, called FreshQA, that range from never-changing -- "Has Virginia Woolf's novel about the Ramsay family entered the public domain in the United States?" -- to fast-changing -- such as "What is Brad Pitt's most recent movie as an actor?" Most but not all answers are sourced from Wikipedia.
The GitHub code for the project links to a Google Doc spreadsheet of the entire FreshQA database of questions. Reading the list of 600 is an instant shot of trivia immersion. "Which author had the most bestselling novels in the United States last year according to Publishers Weekly?" (Answer: Colleen Hoover.) "How many accounts have exceeded 100 million followers on Instagram?" (Answer: 38.)
The authors compiled false-premise questions as well, such as "What year did the first human land on Mars?" To answer these correctly, a program has to recognize that what the question asserts is not actually the case.
Predictably, GPT-4, and other large language models tested, such as Google's Pathways Language Model, PaLM, struggled with the FreshQA questions, and did better when they were given the help of FreshPrompt. "This is mainly due to the lack of access to up-to-date information, as they produce 'outdated' answers," note Vu and team. Many programs will refuse to provide an answer.
Adding the FreshPrompt, they relate, "significantly improves FreshQA accuracy" on GPT-4. The technique "dramatically diminishes the presence of outdated and hallucinated answers," they add. On questions of facts beyond 2022, GPT-4's score goes from an abysmal 8% accuracy to 70.2%. Across all the FreshQA questions, including for older facts, the accuracy rises from 28.6% to 75.6%.
For the false-premise questions, the difference is night and day. The language model has to assert that the question is a false one in order to receive credit. Using the FreshPrompt, GPT-4 went from 33.9% accuracy on false-premise questions to 71%. Granted, that means GPT-4 can still be duped into accepting a false-premise question almost a third of the time.
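As a rough illustration of how such per-category accuracy figures can be tallied, the sketch below assumes each response has already been judged by a grader, and that a response to a false-premise question is marked correct only if it points out the false premise. The record keys `false_premise` and `correct` and the `accuracy` helper are hypothetical; this is not the authors' evaluation code.

```python
def accuracy(records, false_premise=None):
    """Return percent accuracy over judged records.

    records: list of dicts with hypothetical boolean keys
      "false_premise" (is the question built on a false premise?) and
      "correct" (did the grader give the response credit?).
    false_premise: None to score all records, or True/False to score
      only that subset.
    """
    selected = [
        r for r in records
        if false_premise is None or r["false_premise"] == false_premise
    ]
    if not selected:
        # No questions in this subset; report 0.0 rather than divide by zero.
        return 0.0
    return 100.0 * sum(r["correct"] for r in selected) / len(selected)
```

Calling `accuracy(records, false_premise=True)` would give the false-premise score, and `accuracy(records)` the score across all questions.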
The authors found that FreshPrompt was able to surpass other research that also uses search engine queries to "augment" language models. That includes, for example, Perplexity.ai, a combination of GPT-3.5 and Bing Search. The average accuracy of Perplexity, across all FreshQA questions, was 52.2%, only a little bit better than random chance. Again, for GPT-4, using FreshPrompt, the authors were able to get 75.6% accuracy.
One important difference, they note, is how many bits of evidence are included in the FreshPrompt from the Web search. More is better, in general. "Our results suggest that the number of retrieved evidences for each question is the most important ingredient for achieving highest accuracy."
The authors note there are some real challenges moving forward. For one thing, it's time-consuming to keep updating FreshQA, which involves checking that the answers are still relevant. The team expresses a hope that the open-source community can help, or that updating can be automated by generative AI. For the time being, Vu and team have committed to keeping FreshQA fresh.
Disclosure: Tiernan Ray owns no stock in anything that he writes about, and there is no business relationship between Tiernan Ray LLC, the publisher of The Technology Letter, and any of the companies covered.