Hooking up generative AI to medical data improved usefulness for doctors

By asking GPT-4 to retrieve recommendations from a database, the model's responses to spontaneous medical questions went from terrible to potentially useful.
Written by Tiernan Ray, Senior Contributing Writer

Outline of the RAG approach used by the Heidelberg scholars.

Heidelberg University Hospital

Generative artificial intelligence (AI) has shown a remarkable ability to answer questions on structured tests, including achieving well above a passing score on the United States Medical Licensing Examination. 

But in an unstructured setting, when AI models are fed a stream of novel questions crafted by humans, the results can be terrible: the models often return several inaccurate or outright false assertions, a phenomenon known as 'hallucination'.


Researchers at Heidelberg University Hospital in Heidelberg, Germany, reported in the prestigious New England Journal of Medicine (NEJM) this week that hooking a generative AI model up to a database of relevant information vastly improved its ability to answer unstructured queries in the domain of oncology, the treatment of cancer.

The approach, known as retrieval-augmented generation (RAG), lets large language models tap into external sources of information, and it dramatically improved the spontaneous question-answering, according to authors Dyke Ferber and the team at Heidelberg in a study they describe this week in NEJM, "GPT-4 for Information Retrieval and Comparison of Medical Oncology Guidelines." (A subscription to NEJM is required to read the full report.)


The study was prompted by the fact that medicine faces a unique information overload: medicine's professional organizations generate new best-practice recommendations all the time. Staying current on those recommendations burdens physicians who are already trying to handle a population that is living longer and expanding the demand for care. 

Groups such as the American Society of Clinical Oncology (ASCO), Ferber and team related, "are releasing updated guidelines at an increasing rate," which requires physicians to "compare multiple documents to find the optimal treatments for their patients, an effort in clinical practice that is set to become more demanding and prevalent, especially with the anticipated global shortage of oncologists."

Ferber and team hypothesized that an AI assistant could help clinicians sort through that expanding literature.

Indeed, they found that GPT-4 can reach levels of accuracy with RAG sufficient to serve at least as a kind of first pass at summarizing relevant recommendations, thus lightening the administrative burden on doctors.


The authors tested OpenAI's GPT-4 by having human oncology experts submit 30 "clinically relevant questions" on pancreatic cancer, metastatic colorectal cancer, and hepatocellular carcinoma, and having the model produce a report in response with statements about recommended approaches for care. 

The results were disastrous for GPT-4 on its own. When asked in the prompt to "provide detailed and truthful information" in response to the 30 questions, the model was in error 47% of the time: of the 163 statements it produced, as reviewed by two trained clinicians with years of experience, 29 were inaccurate and 41 were outright wrong. 

"These results were markedly improved when document retrieval with RAG was applied," the authors reported. GPT-4 using RAG reached 84% accuracy in its statements, with 60 of 71, 62 of 75, and 62 of 72 correct responses to the three areas of cancer covered in the 30 questions. 
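The 84% figure can be checked against the per-tumor counts the authors report; a quick calculation (using only the numbers quoted above) shows where it comes from:

```python
# Recompute the overall RAG accuracy from the study's per-cancer counts:
# 60/71, 62/75, and 62/72 correct statements across the three tumor types.
correct = [60, 62, 62]
total = [71, 75, 72]
accuracy = sum(correct) / sum(total)
print(f"{sum(correct)}/{sum(total)} = {accuracy:.1%}")  # 184/218 = 84.4%
```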

"We showed that enhancing GPT-4 with RAG considerably improved the ability of GPT-4 to provide correct responses to queries in the medical context," wrote Ferber and team, "surpassing a standard approach when using GPT-4 without retrieval augmentation."

To compare native GPT-4 to GPT-4 with RAG, they used two prompting strategies. In its native, non-RAG form, GPT-4 was asked, "Based on what you have learned from medical oncology guidelines, provide detailed and truthful information in response to inquiries from a medical doctor," and then one of the questions about how to treat a particular instance of cancer.


In this native prompting, GPT-4 was tested both with what's called 'zero-shot' question answering, where only the prompt question is offered, and with few-shot prompting, where a document is inserted into the prompt and the model is shown how the document can answer a similar question. 
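The two non-RAG strategies can be sketched as prompt-building functions. The system instruction below is the one quoted from the study; the chat-message layout and the placeholder document and example Q&A pair are illustrative assumptions, not the study's own code:

```python
# Sketch of zero-shot vs. few-shot prompting, as described above.
SYSTEM = ("Based on what you have learned from medical oncology guidelines, "
          "provide detailed and truthful information in response to "
          "inquiries from a medical doctor.")

def zero_shot(question: str) -> list[dict]:
    # Zero-shot: only the instruction and the question itself.
    return [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": question}]

def few_shot(question: str, document: str,
             example_q: str, example_a: str) -> list[dict]:
    # Few-shot: a guideline document plus one worked example
    # precede the real question, showing the model how the
    # document can answer a similar query.
    return [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": f"Document:\n{document}\n\n{example_q}"},
            {"role": "assistant", "content": example_a},
            {"role": "user", "content": question}]
```

Either message list would then be sent to the chat-completion endpoint of the model under test.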

A RAG approach allows GPT-4 to tap into a database of clinical knowledge.

Heidelberg University Hospital

In the RAG approach, the prompt directs GPT-4 to retrieve "chunks" of relevant medical documents provided by ASCO and the European Society for Medical Oncology (ESMO) from a database. Then, the model must reply to a question such as, "What do the documents say about first-line treatment in metastatic MSI tumors?"
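The retrieval step can be illustrated with a minimal sketch. This is not the authors' pipeline: it swaps in a toy bag-of-words similarity where a production system would use dense vector embeddings, but the shape — score guideline chunks against the query, keep the top few, paste them into the prompt — is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real RAG systems use dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Rank guideline chunks by similarity to the query and keep the top k;
    # these would be inserted into the GPT-4 prompt alongside the question.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
```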

The two human clinicians at Heidelberg University Hospital scored the responses for accuracy by manually comparing GPT-4's replies to the supplied documents. 

"They systematically deconstructed each response into discrete statements based on the bullet points provided by GPT-4," wrote Ferber and team. 

"Each statement was carefully evaluated according to its alignment with the respective information from the ASCO and ESMO documents," and, "for each question, the clinicians performed a detailed manual review of the guidelines corresponding to each query to define our ground truth."
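The clinicians did this deconstruction by hand, but the first step — splitting a bullet-pointed GPT-4 reply into discrete statements for scoring — is mechanical enough to sketch. The function below is a hypothetical illustration of that step, not the study's tooling:

```python
def deconstruct(response: str) -> list[str]:
    # Split a reply into discrete statements, one per bullet point,
    # mirroring how the clinicians decomposed responses for scoring.
    return [line.strip().lstrip("-• ").strip()
            for line in response.splitlines()
            if line.strip().startswith(("-", "•"))]
```

Each extracted statement would then be judged for alignment with the ASCO and ESMO source documents.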


That manual evaluation shows an important aspect of the RAG approach, Ferber and team noted: it can be checked. "By providing access to the retrieved guideline documents, the RAG mechanism facilitated accuracy verification, as clinicians could quickly look up the information in the document chunk," they wrote.

The conclusion is promising: "Our model can already serve as a prescreening tool for users such as oncologists with domain expertise," wrote Ferber and team.

There are limitations, however, to RAG. When GPT-4 used RAG to retrieve relevant passages that provided conflicting advice about treatment, the model sometimes replied with inaccurate suggestions. 

"In cases in which GPT-4 must process information from conflicting statements (clinical trials, expert views, and committee recommendations), our current model was not sufficient to reliably produce accurate answers," wrote Ferber and team.

It turns out you still have to do some prompt engineering. Ferber and team were able to mitigate the inaccuracies by first asking GPT-4 to identify the conflicting opinions in the retrieved literature, and then asking it to provide a revised response, which proved correct.
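That two-step mitigation can be sketched as a pair of chained prompts. The `ask` callable below stands in for any chat-completion call, and the prompt wording is an illustrative assumption rather than the study's exact phrasing:

```python
def resolve_conflicts(ask, question: str, chunks: list[str]) -> str:
    # Step 1: ask the model to surface conflicting recommendations
    # in the retrieved guideline chunks.
    context = "\n\n".join(chunks)
    conflicts = ask(f"Documents:\n{context}\n\n"
                    f"List any conflicting recommendations relevant to: {question}")
    # Step 2: ask for a revised answer that explicitly accounts
    # for the conflicts identified in step 1.
    return ask(f"Documents:\n{context}\n\n"
               f"Known conflicts:\n{conflicts}\n\n"
               f"Taking these conflicts into account, answer: {question}")
```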
