From quicksand to bedrock: How data quality shapes AI

21 January 2025

By Keith Hayes II

Aerial photo of a construction site (Source: Richard Newstead/Moment via Getty Images)

“There’s no way to turn bad data into good results,” says engineering expert Chris Cogswell — but you can transform disorganized data into a reliable asset. Here’s how.

GenAI is changing the way we live, work and learn. In science, engineering, healthcare and technology, 72% of researchers and clinicians surveyed in our Insights 2024: Attitudes toward AI report believe AI will have a transformative impact on their area of work.

But in the race to harness the power of AI, many organizations face a common challenge: data quality.

Unless your organization is new, you’re probably sitting on decades of physical documents, conflicting formats and siloed systems that need digitization and integration — a challenge in itself, but when you add in low-quality datasets and a lack of expertise, your foundation can begin to crumble. Without high-quality, well-curated data, even the most advanced AI tools struggle to deliver meaningful, accurate and usable results.

“It won’t really matter how good the AI system you use is if the data is bad,” says Dr Chris Cogswell, Global Engineering Consultant at Elsevier. “The results will always be off by some factor. There’s no way to turn bad data into good results.”

For organizations to unlock AI’s potential, the first step is clear: Transform disorganized data into a reliable asset. That’s why we partnered with the AIChE Institute for Learning & Innovation to tackle the tricky topic of data quality. Our webinar — AI & Knowledge Management: Foundations for Success — is hosted by Chris.

Digging into the value of AI

AI is often discussed in terms of its potential to transform industries, accelerate processes and drive progress — but what does this actually mean?

“Ultimately, the goal that most companies have for their beginning artificial intelligence systems — they want to take all their internal data, all the information about their business … and crunch it all together to find trends that can help them do it better,” Chris says.

AI models have capabilities that extend far beyond human limitations. They excel at processing vast quantities of data, uncovering patterns and generating insights that would otherwise remain hidden. An AI system can analyze complex datasets to reveal correlations and trends that even a team of experts might overlook. This ability to identify potential links between variables makes AI invaluable for critical decision-making, strategy development and innovation.

Time savings are another area where AI delivers measurable value. Computers may have revolutionized data management decades ago, but AI, machine learning and large language models take operational efficiency to the next level.

“I think one of the real areas where AI is going to help revolutionize the industry is making data collection, categorization and retrieval centralized, much easier to control and much less time-consuming.”

Chris Cogswell, PhD, Customer and Engineering Global Consultant at Elsevier

Despite advances in technology, many organizations still rely on manual processes for inputting, sorting and categorizing documents. Properly trained AI models can automate these tasks, freeing up human resources and delivering greater accuracy and consistency.
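
As a rough illustration of what that automation can look like, the sketch below trains a tiny text classifier to sort documents into categories. The document texts, category labels and model choice are invented for illustration, not a description of any particular Elsevier or AIChE tool.

```python
# Minimal sketch of automated document categorization, assuming a small
# set of already-labeled example documents. Texts and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set: (document text, category label)
documents = [
    ("Quarterly maintenance report for reactor unit 3", "maintenance"),
    ("Invoice for replacement heat exchanger parts", "procurement"),
    ("Safety data sheet for solvent mixture", "safety"),
    ("Purchase order for catalyst supplies", "procurement"),
    ("Incident report: minor leak in cooling loop", "safety"),
    ("Scheduled inspection checklist for pumps", "maintenance"),
]
texts, labels = zip(*documents)

# TF-IDF features plus a linear classifier: enough to sort simple documents.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(texts, labels)

# Categorize an unseen document.
print(classifier.predict(["Annual valve inspection and lubrication schedule"]))
```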

Integrating AI into your organization’s knowledge management systems offers another critical advantage: future-proofing. As technology advances, the volume of accessible data grows exponentially, making the ability to adapt swiftly essential. During the presentation, Chris shared a quote by Ed Wright, Principal Information Analyst at Johnson Matthey, from one of our Foundations for Effective AI webinars:

Getting your data house in order is similar to what they say about growing a tree: The best time to start is 20 years ago. The second-best time is right now.

Your present actions affect your organization’s future success, which makes the foundation of your AI initiatives critical.

“AI right now is something of a party trick … but the real value in its application is yet to be found. This is especially true in the engineering, science and technology industries, where the application of these tools often requires significant groundwork to start seeing results that are meaningful and trustworthy.”

Chris Cogswell, PhD, Customer and Engineering Global Consultant at Elsevier

Bad data, big problems

Most of us are familiar with the hallucinations and quirky errors AI models seem to be plagued with. Chris illustrates this issue with a common example: AI’s consistent struggle to understand human hands when generating images.

At first glance, these images register as normal and valid, but upon closer inspection, we quickly spot the extra fingers, twisted structures and other glaring inaccuracies. This simple example underscores a much larger issue regarding complex datasets and knowledge systems.

It might be easy to spot an extra finger on an AI-generated image, but it’s much harder to pinpoint a minuscule error in an enormous dataset that cascades into critical, costly issues. As Chris explains, the right training data is critical.

“It’s not just about having enough data or feeding it lots of data, it’s about pruning data. It’s about having data coming from a trusted source and choosing which of those datasets are worth including.”

The lack of digital data is another major hurdle in building successful AI tools. For many organizations — particularly those with decades of history — knowledge and information are locked away in physical copies and mismatched formats. White papers, journal articles and other physical records present a challenge when attempting to integrate internal data into AI training.

“A lot of companies believe that they can just digitize these internal documents themselves, but digitization is actually a huge technical challenge … which in itself may require the use of multiple algorithmic and AI tools,” Chris explains.
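
To give a sense of why that is, the hedged sketch below shows one small step in such a pipeline: running optical character recognition on a scanned page and flagging low-confidence words for human review. It assumes the open-source Tesseract engine via pytesseract and a placeholder file name, scan.png; a real digitization effort would layer layout analysis, table extraction and validation on top of this.

```python
# Rough sketch of one step in a digitization pipeline: OCR a scanned page
# and flag low-confidence words for review. Assumes Tesseract is installed
# and "scan.png" exists; both are placeholders, not a prescribed setup.
from PIL import Image
import pytesseract

page = Image.open("scan.png")

# image_to_data returns per-word text plus a confidence score from Tesseract.
data = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)

words, needs_review = [], []
for word, conf in zip(data["text"], data["conf"]):
    if not word.strip():
        continue
    words.append(word)
    if float(conf) < 60:  # arbitrary review threshold, for illustration only
        needs_review.append(word)

print(" ".join(words))
print(f"{len(needs_review)} low-confidence words flagged for manual checking")
```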

Poorly digitized or incomplete data riddled with errors and inconsistencies can undermine the foundation of your AI system, rendering it unreliable.

“You should be thinking about how you’ll train this thing you’ve brought into your organization before you bring it into the organization,” Chris says. Training an AI model isn’t a one-time project; it’s an ongoing process that requires consistent updates, quality controls and subtle refinements as it learns and adapts to the environment.

Transforming chaos into control

In the webinar, Cogswell emphasizes that strong AI models are built on flexible design, good data, proper training and expert input.

Design for flexibility

To expand on Ed Wright’s quote, you don’t want to design your AI system for the seed you plant today but for the tree you’ll have in 20 years. Flexibility is a best practice when it comes to any technological initiative but especially for rapidly evolving AI tools. As Chris explains:

You want to have a system that you can change down the line if you need to. That's extremely important because things will change over time. So if you're spending money to make a system that only works for your company today, that’s going to become not useful very quickly.

Focus on training

Proper training is a major focus of Chris’s webinar. “You need to train your AI, and you do this by giving it good data,” he stresses.

He likens an AI system to an expensive puppy. When you bring it into your space, you hope it’s well-trained, but that’s usually not the case. It requires proper, consistent and ongoing training using the right tools and methods:

Oftentimes, an AI system without training … will give you results that look great on the outside but when you dig into it as an expert, or you’re using that information for an important task, you often find very large errors that make those results unusable.

Use high-quality data

“If you want your AI to operate effectively, you need to have good data going in to start that process off right,” Chris says. But what exactly is good data?

Good data is comprehensive, error-free and free of extraneous information that could trigger false results. For example, when discussing Elsevier’s tools, like ScienceDirect, Scopus and Engineering Village, Chris notes that “all of these have data sets, and these data sets have been curated and cleaned to ensure that no erroneous, extraneous or false reports make their way into these sets.”
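
In practice, even a basic quality pass catches many of those problems. The sketch below is a minimal example using pandas; the column names, sample records and value ranges are invented for illustration.

```python
# Minimal data-quality pass over a tabular dataset. Column names,
# value ranges and the sample records are invented for illustration.
import pandas as pd

records = pd.DataFrame([
    {"compound": "A-101", "yield_pct": 87.5, "source": "lab notebook"},
    {"compound": "A-101", "yield_pct": 87.5, "source": "lab notebook"},   # duplicate
    {"compound": "B-202", "yield_pct": None, "source": "legacy scan"},    # missing value
    {"compound": "C-303", "yield_pct": 143.0, "source": "lab notebook"},  # impossible value
    {"compound": "D-404", "yield_pct": 91.2, "source": "lab notebook"},
])

cleaned = (
    records
    .drop_duplicates()                              # remove exact duplicates
    .dropna(subset=["compound", "yield_pct"])       # require key fields
    .query("0 <= yield_pct <= 100")                 # drop out-of-range values
)

print(cleaned)
```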

Partner with experts

Most organizations don’t have the in-house skills and know-how to build and train complex AI models. In fact, 50% of respondents in our webinar poll claim they use a mixed approach when implementing AI tools — a combination of in-house and partner expertise.

Partnering with an expert in the field is crucial for tackling the unknown unknowns, as Chris Cogswell puts it. The known knowns are the information readily available within your organization and systems, but “it’s those unknown unknowns, the things you don’t know ahead of time, that can cause these problems,” he explains. “And that’s where an organization like Elsevier, with access to these large datasets, can be truly useful.”

How Elsevier curates high-quality, usable datasets

“As the world has digitized, Elsevier has been at the forefront of creating ontologies and taxonomies for scientific literature,” Chris says.

Elsevier’s approach to building high-quality, trusted datasets starts with data scaffolding — a process of structuring data and filling in gaps using internal and external sources.
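
A toy version of that gap-filling step might look like the sketch below, which fills missing fields in an internal record from a curated external reference. The record, field names and lookup table are invented for illustration.

```python
# Toy sketch of data scaffolding: fill missing fields in internal records
# from an external reference source. Field names and values are invented.
internal_record = {"compound": "acetic acid", "cas_number": None, "boiling_point_c": None}

# Stand-in for a curated external source keyed by compound name.
external_reference = {
    "acetic acid": {"cas_number": "64-19-7", "boiling_point_c": 118.1},
}

def scaffold(record: dict, reference: dict) -> dict:
    """Return a copy of the record with missing fields filled from the reference."""
    filled = dict(record)
    lookup = reference.get(record["compound"], {})
    for field, value in lookup.items():
        if filled.get(field) is None:
            filled[field] = value
    return filled

print(scaffold(internal_record, external_reference))
# {'compound': 'acetic acid', 'cas_number': '64-19-7', 'boiling_point_c': 118.1}
```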

The next steps are data enrichment and knowledge graph construction, which use ontological detail to add context and provide deeper insights. This can include creating connections between terms, finding hidden links or figuring out why certain terms are linked. “The knowledge graph is the linkages between documents or terms used to find insights in how they link together,” Chris explains.
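
The sketch below shows the shape of that idea on a tiny term-document graph built with networkx: two terms never appear in the same document, yet the graph surfaces an indirect link between them. The documents, terms and edges are invented for illustration.

```python
# Tiny illustrative knowledge graph: terms linked to the documents that
# mention them. Documents, terms and edges are invented for illustration.
import networkx as nx

graph = nx.Graph()
graph.add_edges_from([
    ("catalyst degradation", "paper_001"),
    ("paper_001", "palladium"),
    ("palladium", "paper_002"),
    ("paper_002", "reactor fouling"),
])

# A hidden link: these two terms never co-occur in a single document,
# but the graph connects them through a shared intermediate term.
path = nx.shortest_path(graph, "catalyst degradation", "reactor fouling")
print(" -> ".join(path))
# catalyst degradation -> paper_001 -> palladium -> paper_002 -> reactor fouling
```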

This process allows Elsevier to take a vast quantity of information from numerous trusted sources and make it readable and usable for training AI models. For example, our chemical and bioreactivity datasets draw from a wide array of sources and can be tailored to your integration preferences.

“We have a long history in building ontologies and using them to create successful products,” Chris says. “And those are the kind of services we can provide to your organization, either as a data service vendor or as a part of your push towards creating AI data retrieval tools.”

Defy the unknown unknowns with Elsevier

When it comes to AI tools, data and training, Chris's message is clear: Good data improves your outcomes, but getting your data in shape is not an easy task — especially if you lack the resources and expertise required.

“At Elsevier, we partner with clients to build knowledge graphs using our own internal data, as well as the data internal to your organization,” Chris says. “We apply our skills in categorization, ontology building and knowledge graph development to ultimately build training data sets that will allow your AI tools to operate in the most successful way possible.”

Explore Elsevier’s datasets and learn how SciBite semantic technology can help you build a strong foundation for AI design.

Contributor

Keith Hayes II is Portfolio Marketing Manager for Elsevier’s Engineering portfolio.
