Amazon proposes a new AI benchmark to measure RAG

Choosing the right algorithm for RAG could yield more AI improvements than scaling to larger and larger language models, say AWS researchers.
Written by Tiernan Ray, Senior Contributing Writer

Figure: An outline of Amazon's proposed benchmarking process for RAG implementations of generative AI. (Image credit: Amazon AWS)

This year is supposed to be the year that generative artificial intelligence (GenAI) takes off in the enterprise, according to many observers. One of the ways this could happen is via retrieval-augmented generation (RAG), a methodology by which an AI large language model is hooked up to a database containing domain-specific content such as company files. 

However, RAG is an emerging technology with its own pitfalls.

For that reason, researchers at Amazon's AWS propose in a new paper to set a series of benchmarks that will specifically test how well RAG can answer questions about domain-specific content. 

"Our method is an automated, cost-efficient, interpretable, and robust strategy to select the optimal components for a RAG system," write lead author Gauthier Guinet and team in the work, "Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation," posted on the arXiv preprint server.

The paper is being presented at the 41st International Conference on Machine Learning, an AI conference that takes place July 21-27 in Vienna.

The basic problem, explain Guinet and team, is that while there are many benchmarks to compare the ability of various large language models (LLMs) on numerous tasks, in the area of RAG specifically there is no "canonical" approach to measurement that is "a comprehensive task-specific evaluation" of the many qualities that matter, including "truthfulness" and "factuality."

The authors believe their automated method creates a certain uniformity: "By automatically generating multiple choice exams tailored to the document corpus associated with each task, our approach enables standardized, scalable, and interpretable scoring of different RAG systems."

To set about that task, the authors generate question-answer pairs by drawing on material from four domains: the troubleshooting documents of AWS on the topic of DevOps; article abstracts of scientific papers from the arXiv preprint server; questions on StackExchange; and filings from the US Securities & Exchange Commission, the chief regulator of publicly listed companies.

They then devise multiple-choice tests for the LLMs to evaluate how close each LLM comes to the right answer. They subject two families of open-source LLMs to these exams -- Mistral, from the French company of the same name, and Meta Platforms's Llama.
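Scoring such an exam comes down to counting how often the model picks the right choice. The following is a minimal sketch of that idea, not the authors' code: the `ExamQuestion` structure, the `exam_accuracy` function, and the `always_first` stand-in for a model call are all hypothetical names invented for illustration, and the two questions are toy data rather than items from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class ExamQuestion:
    """One multiple-choice item. Hypothetical structure for illustration."""
    prompt: str
    choices: Sequence[str]
    correct_index: int  # position of the right answer within `choices`


def exam_accuracy(
    questions: Sequence[ExamQuestion],
    answer_fn: Callable[[ExamQuestion], int],
) -> float:
    """Return the fraction of questions where answer_fn picks the correct choice."""
    if not questions:
        raise ValueError("exam must contain at least one question")
    correct = sum(1 for q in questions if answer_fn(q) == q.correct_index)
    return correct / len(questions)


def always_first(question: ExamQuestion) -> int:
    """Stand-in for a real model call: always picks the first choice."""
    return 0


# Toy exam, made up for this sketch.
toy_exam = [
    ExamQuestion("Which AWS service runs virtual machines?", ["EC2", "S3"], 0),
    ExamQuestion("Which AWS service stores objects?", ["EC2", "S3"], 1),
]

print(exam_accuracy(toy_exam, always_first))
```

In practice, `answer_fn` would wrap a call to an LLM and parse the letter or index it returns. Keeping it as a plain callable means the same scoring loop can be reused for every model and retrieval setup being compared.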

They test the models in three scenarios. The first is a "closed book" scenario, where the LLM has no access at all to RAG data, and has to rely on its pre-trained neural "parameters" -- or "weights" -- to come up with the answer. The second is what's called the "Oracle" form of RAG, where the LLM is given access to the exact document used to generate a question, known as the ground truth.

The third form is "classical retrieval," where the model has to search across the entire data set looking for a question's context, using a variety of algorithms. Several popular retrieval methods are used, including MultiQA, introduced in 2019 by scholars at Tel Aviv University and the Allen Institute for Artificial Intelligence, and an older but very popular approach to information retrieval called BM25.
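To make the three scenarios concrete, here is a rough sketch of how the context handed to the model might differ in each, with a small from-scratch BM25 scorer standing in for the retrieval step. It is an illustration under stated assumptions, not the paper's implementation: the function names, the scenario labels, the simple tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant are all choices made for this example, and the three-document corpus is toy data.

```python
import math
import re
from collections import Counter
from typing import List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, documents: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF that stays non-negative (an assumed variant).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


def select_context(scenario: str, query: str, corpus: List[str],
                   source_index: int) -> Optional[str]:
    """Pick the context given to the model under each scenario (labels are assumed)."""
    if scenario == "closed_book":
        return None  # the model relies on its weights alone
    if scenario == "oracle":
        return corpus[source_index]  # the document the question was generated from
    if scenario == "retrieval":
        scores = bm25_scores(query, corpus)
        return corpus[max(range(len(corpus)), key=scores.__getitem__)]
    raise ValueError(f"unknown scenario: {scenario}")


# Toy corpus, made up for this sketch.
corpus = [
    "Restart the instance to clear the stuck deployment.",
    "Quarterly revenue rose on higher cloud demand.",
    "The abstract describes a new retrieval benchmark.",
]
print(select_context("retrieval", "why is my deployment stuck", corpus, source_index=0))
```

The sketch returns only the single top-scoring document. A real pipeline would typically pass several top-ranked passages to the model and could swap the BM25 scorer for a dense-embedding retriever without changing the surrounding logic.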

They then run the exams and tally the results, which are complex enough to fill numerous charts and tables on the relative strengths and weaknesses of the LLMs and the various RAG approaches. The authors even perform a meta-analysis of their exam questions -- to gauge their utility -- based on the education field's well-known "Bloom's taxonomy."

What matters even more than the data points from the exams are the broad findings that may hold true for RAG irrespective of the implementation details.

One broad finding is that better RAG algorithms can improve an LLM more than, for example, making the LLM bigger. 

"The right choice of the retrieval method can often lead to performance improvements surpassing those from simply choosing larger LLMs," they write.  

That's important given concerns over the spiraling resource intensity of GenAI. If you can do more with less, it's a valuable avenue to explore. It also suggests that the conventional wisdom in AI at the moment, that scaling is always best, is not entirely true when it comes to solving concrete problems.

Just as important, the authors find that if the RAG algorithm doesn't work correctly, it can degrade the performance of the LLM versus the closed-book, plain vanilla version with no RAG. 

"Poorly aligned retriever component can lead to a worse accuracy than having no retrieval at all," is how Guinet and team put it.
