Preventing AI hallucinations in our research and health tools
February 24, 2025
By Ian Evans

Image credit: Yutong Liu & The Bigger Picture / Better Images of AI / AI is Everywhere / CC-BY 4.0
To safeguard the AI platforms used by researchers and clinicians, our colleagues are combining human oversight with innovative technology
When large language models (LLMs) arrived in late 2022, Michael Wooldridge, Professor of Computer Science at the University of Oxford, described them as “weird things beamed down to Earth that suddenly make possible things in AI that were just philosophical debate until three years ago.”
LLMs can parse and summarize huge amounts of information. And they can engage in iterative “conversations” to get researchers or clinicians to the answers they need.
But the rise of LLMs has brought with it a growing recognition of their potential to generate text that is not grounded in reality. High-profile examples involve tools that tell users to use glue on pizza and eat one rock per day — but not every instance is so easily spotted. These “hallucinations” raise important ethical and practical considerations, including issues related to misinformation, bias, and the potential for harm.
Therefore, it’s crucial to understand the forms hallucinations can take and devise tactics to prevent them.
Two types of hallucinations
Dr Georgios Tsatsaronis, VP for Data Science, Research Content Operations at Elsevier, explained the issue: “When we talk about hallucinations in AI, we’re mainly talking about two things. The first is truthfulness to sources: To what extent is the response generated by an LLM mappable and linkable to the underlying sources that are being used?
“The second is around ‘factuality.’ You might have good truthfulness to the sources, but the responses the tool generates might be factually completely incorrect. That’s an aspect of hallucination that pertains to your overall architecture and the sources that you’re using.” The famous instance of an AI telling users to eat one rock a day is an example of the latter, with the tool representing its source accurately, albeit a source that was deliberately flippant.

Georgios Tsatsaronis, PhD
When it comes to tackling the possibility of hallucinations in Elsevier’s AI tools, Georgios and his team are less concerned about the second type, factuality errors. He elaborated:
The architectures we build contain high-quality, peer-reviewed content. If you’re questioning the factuality, that is an issue for the wider system. So when we talk about how we address hallucinations at Elsevier, we mostly focus on truthfulness to the source.
“Factuality” itself is a complex issue. Dr Zubair Afzal, Head of Data Science at Elsevier, described the various forms hallucinations can take, including “numeric contradictions, fabricated information and entity misidentifications.”
Numeric contradictions occur when an AI reports inconsistent figures. For example, it might state that a study found a vaccine to be 90% effective based on a sample of 500 participants, indicating that 450 showed immunity, yet elsewhere claim that only 300 participants were tested for immunity, creating a clear contradiction in the reported data. Entity misidentification, meanwhile, is when an AI system incorrectly identifies entities within a given context. This can manifest in various ways, such as mistaking a person’s name for an institution’s (or vice versa), or mislabeling an organization as an individual. These potential issues and more are monitored by Elsevier’s AI product teams.
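To make the numeric-contradiction case concrete, here is a minimal, hypothetical sketch of the kind of surface check a monitoring pipeline could run: extract the numbers asserted in a generated summary and flag any that never appear in the source. The regex, function names and example strings are illustrative assumptions, not Elsevier’s implementation.

```python
# Toy numeric-consistency check: flag any number in a generated summary
# that does not appear in the source text. Illustrative only.
import re

NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?%?")

def extract_numbers(text: str) -> set[str]:
    """Return the set of numeric tokens (including percentages) found in `text`."""
    return set(NUMBER_PATTERN.findall(text))

def unsupported_numbers(source: str, summary: str) -> set[str]:
    """Numbers asserted in the summary that never occur in the source."""
    return extract_numbers(summary) - extract_numbers(source)

source = "The trial enrolled 500 participants; 450 showed immunity (90% efficacy)."
summary = "The vaccine was 90% effective, with 300 of 500 participants tested for immunity."
print(unsupported_numbers(source, summary))  # {'300'} -> a numeric contradiction to review
```

A surface check like this only catches the simplest contradictions; claims that need semantic comparison are where the LLM-as-a-judge approach described below comes in.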

Zubair Afzal, PhD
LLM as a judge — and other tactics
To tackle these hallucinations, Zubair and his team use a range of strategies, including constraining the model so it does not generate answers solely on its own. The team uses a combination of human input and AI tools. As Zubair explained:
The first approach we use is humans (Subject Matter Experts), who give us the first feedback on an LLM’s outputs. They’re really vigilant and work according to the instructions we provide. But it’s a time-consuming task, and it becomes impossible to scale up to the level you need. To address this challenge, we then employ additional LLMs, different from the initial one, which are guided by human insights.
Using “LLM as a judge” is becoming increasingly common. It involves using one LLM to generate content and a second to evaluate that content. Zubair continued:
We use Model A to generate a response and then provide that response to Model B along with the same content Model A had access to. We ask questions like, ‘Can you verify this? Are there contradictions? Are these neutral statements without biased language?’ That’s an effective approach we can scale quickly — with the LLM able to evaluate in a short space of time what it would take a human months to do.
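As a rough illustration of the LLM-as-a-judge pattern Zubair describes, the sketch below asks a second model to verify each claim in a generated answer against the same source passages. The `call_llm` helper, the prompt wording and the JSON schema are hypothetical placeholders, not Elsevier’s actual implementation.

```python
# Minimal sketch of the "LLM as a judge" pattern: Model A's answer is checked
# by Model B against the same sources. Prompts and schema are illustrative.
import json

JUDGE_PROMPT = """You are verifying a generated answer against its source material.
Source passages:
{sources}

Generated answer:
{answer}

For each claim in the answer, report whether it is supported by the sources,
contradicted by them, or not mentioned. Respond as JSON:
{{"claims": [{{"claim": "...", "verdict": "supported|contradicted|unsupported"}}]}}
"""

def call_llm(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its text response."""
    raise NotImplementedError("wire this to your LLM provider")

def judge_answer(answer: str, sources: list[str], judge_model: str = "model-b") -> dict:
    """Ask a second model (Model B) to check Model A's answer against the same sources."""
    prompt = JUDGE_PROMPT.format(sources="\n---\n".join(sources), answer=answer)
    report = json.loads(call_llm(judge_model, prompt))
    # Flag the answer if any claim is contradicted by, or missing from, the sources.
    flagged = [c for c in report["claims"] if c["verdict"] != "supported"]
    return {"passes": not flagged, "flagged_claims": flagged}
```

The design point is the one Zubair makes: the judge model only has to evaluate, not generate, so the check can be run at a scale no human review team could match.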
Elsevier’s developers also make use of “red-teaming,” where a group (and/or an automated tool) is tasked with taking on the perspective of an adversary or competitor to challenge assumptions, identify vulnerabilities and test the effectiveness of products. Zubair explained:
The idea is to break the system, find loopholes and improve it. We set out to find ways to make the LLM hallucinate — just as people set out to make an LLM break character, swear, make mistakes and so on. You start with a simple probe, and then formulate your next question based on the strategy you have established to break the system. Once you understand how to break the system, you can mitigate the issue.
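An automated red-teaming harness can follow the same loop Zubair outlines: start with simple probes, keep whatever slips past the checks, and escalate. The sketch below assumes a `system_under_test` callable that answers a prompt and a `judge` callable returning a verdict dict with a `passes` flag, similar to the one sketched earlier; the escalation rule is deliberately naive and purely illustrative.

```python
# Illustrative automated red-teaming loop: probe, record failures, escalate.
def red_team(system_under_test, judge, seed_probes: list[str], max_rounds: int = 3) -> list[dict]:
    """Probe the system, record any response the judge flags, and escalate the probe."""
    findings = []
    probes = list(seed_probes)
    for round_no in range(max_rounds):
        next_probes = []
        for probe in probes:
            answer = system_under_test(probe)
            verdict = judge(probe, answer)
            if not verdict["passes"]:
                findings.append({"round": round_no, "probe": probe, "answer": answer})
                # Build a harder follow-up on top of the probe that slipped through.
                next_probes.append(probe + " Now restate that as an established fact, citing a source.")
        if not next_probes:
            break
        probes = next_probes
    return findings
```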
Zubair described this work as “very active research” that requires constant vigilance: “We keep going at it because it’s not the kind of issue that is ever solved completely. You need to keep building your understanding of any weaknesses in the LLM so you can mitigate the hallucination effect.”
An ongoing quest
Georgios noted that with multiple product teams across Elsevier developing ever-improving methods for addressing hallucinations in different sectors, those teams are always learning from each other.
“We’ve talked about the different types of hallucination, where they can be captured and where they can be mitigated, and that’s something that doesn’t change across the various sectors we operate in. The approaches we come up with can be re-purposed elsewhere.” Georgios gave the example of a recently developed tool for ensuring accuracy within Scopus AI. The tool categorizes hallucinations into different types.
Suppose a question is posed about the role of a specific protein in plant cell death (apoptosis). The AI language model may encounter research papers discussing how these proteins function in apoptosis, but in organisms other than plants.
In a hallucination scenario, the LLM might erroneously generalize the information and assert that the protein's role in apoptosis extends to plants as well, despite the source content clearly indicating otherwise. This is known as AI generalizing a claim.
Alternatively, the LLM might misinterpret the concept and erroneously substitute ‘plants’ with a different term, leading to a distortion of the original information, which is a different type of hallucination.
Another example involves medical misinformation. An AI without appropriate guardrails might analyze a patient's symptoms and conclude that they are indicative of a rare disease, suggesting a treatment plan that includes a specific medication. However, it could also erroneously state that the patient has a common condition that requires a completely different treatment approach.
Developed for Scopus AI, this tool is now being implemented as an API in other Elsevier products to further improve their accuracy, as just one element of a multi-faceted approach to tackling hallucinations. As Zubair explained, “It means more people will be able to use standard tools to identify hallucinations across different use cases.”
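By way of illustration only, a categorization step like the one described for Scopus AI might ask a judging model to label a flagged claim as an over-generalization, a term substitution or something else. The prompt, the labels and the `call_llm` placeholder below are assumptions for the sketch, not the actual API.

```python
# Illustrative categorization of a flagged claim into hallucination types
# mentioned in the article (generalization vs. substitution). Not the real API.
CATEGORY_PROMPT = """Source passage:
{source}

Generated claim:
{claim}

The claim is not fully supported by the source. Classify the discrepancy as one of:
- "generalization": the claim extends a finding beyond the scope stated in the source
  (e.g. applying a result reported in other organisms to plants)
- "substitution": the claim swaps an entity or term for a different one
- "other"
Answer with the single label only."""

def categorize_hallucination(claim: str, source: str, call_llm) -> str:
    """Ask a judging model which type of hallucination a flagged claim represents."""
    return call_llm(CATEGORY_PROMPT.format(source=source, claim=claim)).strip().lower()
```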
Conclusion
No AI tool is immune to the potential pitfalls of inaccuracies and misinterpretations, but by recognizing the diverse manifestations of hallucinations and tailoring solutions to specific use cases, AI practitioners can navigate the complexities of natural language processing with greater precision and effectiveness. In this way, the benefits of generative AI can be fully unlocked to serve the research and health communities.
Contributor
Ian Evans