AI in R&D
Explore the building blocks of effective AI projects
AI building blocks
Across industry, R&D leaders are exploring how artificial intelligence (AI) can reinvent scientific research. AI, including machine learning (ML) and generative AI (GAI), has the potential to considerably shorten cost- and resource-intensive processes in many disciplines.
To meet this AI future head-on, organizations need to combine high-quality data with the best data tooling and data infrastructure. This solid foundation for AI comprises four essential building blocks:
The most relevant and comprehensive data
Domain expertise from humans in the loop
Robust data structuring and rich ontologies
Data management and enrichment
By applying these building blocks, R&D organizations will build a solid innovation base and counteract potential pitfalls in the rush to embrace AI.
The right data for AI
The nuance of scientific questions in fields from drug discovery to materials development demands high-quality, verified training data. The right data provides greater confidence in AI outcomes.
Sourcing the best quality data
AI models require high-quality data from a variety of sources that are relevant to research questions. Sources can include third-party databases/datasets, open source/public access databases, published literature, and internal and proprietary data. For example, a predictive AI chemistry model requires a breadth of inputs that includes not only internal data, such as information on failed reactions, but also published literature. A model informed by incomplete data will produce inferior results whose shortcomings may not be immediately identified, leading to incorrect and expensive decisions.
Some estimates suggest that asset damage stemming from decisions made by AI agents without human oversight could reach $100 billion by 2030.1 Such risks make businesses wary, leading them to want AI to serve as a co-pilot rather than a “black box” autopilot. Moreover, large language models (LLMs) and GAI models are known to “hallucinate” to fill gaps in data, and in areas such as drug discovery, missing important information carries potential safety and health risks.
The importance of data provenance
Reusability issues have been a common stumbling block in R&D for many decades. Researchers operating in highly regulated industries such as life sciences must be confident in the provenance of the high-quality datasets they source for use in AI models. This helps ensure reusability and provides an auditable data trail of evidence-based decision making to regulatory agencies. Organizations require detailed background on the source of datasets, which means creating policies and practices that codify responsible AI practices and clearly document the origin of datasets from third-party providers as well as internal data.2 This is essential for producing trustworthy, verifiable and reusable research.3
Avoiding the use of only abstracts
Scientific literature is an essential source of data for AI model building. However, models should be trained on the full text of articles and not only on abstracts. Often, a paper’s abstract does not represent all the findings contained in an article, and certain types of information — such as adverse events, mutations and cell processes — are less likely to be included in an abstract. Moreover, when a data pipeline pulls only from abstracts, it misses co-occurrences that appear only in the full text and can take years to surface in abstracts.4
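The gap can be illustrated with a toy co-occurrence check. This is a minimal sketch with invented entity names and a deliberately simplified notion of co-occurrence (entities appearing in the same text), not a real text-mining pipeline:

```python
from itertools import combinations

# Invented example: an entity mentioned only in the Methods section of the
# full text ("GeneB") never appears in the abstract.
abstract_entities = {"GeneA", "DiseaseX"}
full_text_entities = {"GeneA", "DiseaseX", "GeneB"}

def cooccurrences(entities):
    """All unordered entity pairs that co-occur in one text."""
    return set(combinations(sorted(entities), 2))

# Pairs an abstract-only pipeline would never see.
missed = cooccurrences(full_text_entities) - cooccurrences(abstract_entities)
print(missed)
```

Here the abstract yields a single pair, while the full text adds two more; at corpus scale, those extra pairs are exactly the relationships that take years to surface in abstracts.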
Over-reliance on public and open access data
Repositories of publicly available data are often used for AI training. Similarly, open access literature and data can be employed for model training. These sources are valuable but limited; using only publicly available or open access data risks missing important information contained elsewhere. For example, one study found that 45% of relationships relevant to drug repurposing projects for rare diseases can be found only in controlled access sources.
Checklist: The most relevant and comprehensive data
☑ The quantity and diversity of data ensure confidence in model training.
☑ The data sources are high quality, up to date and verified.
☑ There is no over-reliance on abstracts and open access data.
Domain expertise and knowledge from humans in the loop
Scientific research demands domain expertise. No off-the-shelf, general-purpose AI can solve specific research questions and problems. Similarly, AI for AI’s sake will only lead to time and money spent without relevant business outcomes.
Identifying and defining problems that will benefit from AI
The decisions and predictions that researchers make — such as which protein site a molecule will bind to — involve precise variations and require a high degree of accuracy and specificity. Finding answers to these questions starts with investing in domain-specific knowledge to identify which use cases can benefit from the application of AI in the first place.
For example, technology experts who understand complex metadata used in a field such as biology can construct relevant models. Metadata could include “…the solubility and stability of compounds, possible contaminants, variation in temperature and humidity during the experiments, sources of reagents and other materials, and expiration dates.” (Makarov et al.)
Determining required capabilities and research context
Data scientists who also have domain expertise understand the context of questions asked in relation to the data available. Their insights enable research organizations to better understand which AI approaches will be effective and which are likely to fail. (Holzinger et al.) By tapping into expert technical knowledge, companies avoid spending time and money building solutions that will not actually solve problems. Domain experts further ensure vocabularies and ontologies are constructed to structure datasets so that queries return relevant results without missing essential data.
Collaborating with researchers to access relevant datasets
Technologists with knowledge of a scientific domain can advise research organizations on where to source the best datasets to build a specialist model. They can then further refine and improve datasets to make them machine-readable because they have the chemistry, biology and materials understanding to know which facts are relevant. The other important area of domain expertise is a comprehensive understanding of data licensing, copyright and intellectual property legislation. This avoids legal or regulatory issues emerging — for example, companies may be unaware they lack text-mining rights on a third-party dataset.
Checklist: Domain expertise and knowledge from humans in the loop
☑ R&D experts are working with AI as a co-pilot and feel augmented by technology.
☑ You have identified specific use cases and workflows that will benefit from AI.
☑ You have written and implemented responsible AI and robust data provenance policies.
☑ Domain experts and data scientists collaborate on data access and data skills as required.
☑ You have implemented metrics and KPIs to measure and quantify outcomes.
Robust data structuring and rich ontologies
Complex datasets from multiple sources used in scientific R&D workflows require structuring and normalizing before insights can be revealed and applied. It is not a matter of simply taking the data and plugging it in.
The power of ontologies in AI
Much of the data that R&D organizations source are not AI-ready. Data are siloed and stored in myriad formats with insufficient metadata, making it difficult to retrieve, analyze and use in AI applications. Ontologies are human-generated, machine-readable descriptions of categories and their associated sub-types.5 Ontologies also define semantic relationships to other classes and capture synonyms, which is essential where there are multiple ways to describe the same entity in scientific literature and other datasets.
In the life sciences, for example, the same gene can be referred to in different ways. Consider PSEN1, which can also be PSNL1 or Presenilin-1. The controlled language and vocabulary delivered by rich ontologies harmonizes data to make it ready for AI model building.
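As a minimal sketch of how a controlled vocabulary harmonizes such variants, synonyms can be mapped to a single canonical symbol. The mapping below is a hand-written illustration, not a real ontology, which would be far larger and curated by domain experts:

```python
# Illustrative controlled vocabulary: synonym -> canonical gene symbol.
GENE_VOCABULARY = {
    "PSEN1": "PSEN1",
    "PSNL1": "PSEN1",
    "Presenilin-1": "PSEN1",
}

def canonicalize(term: str) -> str:
    """Return the canonical symbol for a term, matching case-insensitively."""
    lookup = {k.lower(): v for k, v in GENE_VOCABULARY.items()}
    return lookup.get(term.lower(), term)

# Mentions from different sources resolve to one entity.
mentions = ["PSNL1", "Presenilin-1", "PSEN1"]
print({canonicalize(m) for m in mentions})  # {'PSEN1'}
```

Once every mention collapses to one identifier, records from literature, databases and internal systems can be joined reliably, which is what makes the data AI-ready.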
Constructing domain-specific taxonomies and knowledge graphs
Whereas ontologies define multidimensional relationships, taxonomies define and group classes within a single specific domain. Taxonomies and ontologies are used in the creation of knowledge graphs, a powerful data representation that connects entities and relationships into a network of facts. Knowledge graphs are a purpose-built solution that can handle domain-specific terminology and deliver results that go beyond the “flat” search of a relational database. There can be considerable interplay between knowledge graphs and large language models (LLMs) to the benefit of researchers.6 LLMs aid in the generation of a knowledge graph and lower the barrier to entry when it comes to interrogating graphs, enabling users of all experience levels to benefit.
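At its simplest, a knowledge graph can be sketched as a set of subject-relation-object triples that support multi-hop queries a flat table cannot easily express. The entities and relation names below are illustrative examples chosen for this sketch:

```python
# Minimal knowledge graph sketch: facts as (subject, relation, object) triples.
triples = [
    ("aspirin", "inhibits", "COX-1"),
    ("aspirin", "inhibits", "COX-2"),
    ("COX-2", "involved_in", "inflammation"),
]

def neighbors(entity, relation=None):
    """Objects linked from an entity, optionally filtered by relation type."""
    return [o for s, r, o in triples
            if s == entity and (relation is None or r == relation)]

# Two-hop traversal: which processes do aspirin's targets participate in?
processes = {p for target in neighbors("aspirin", "inhibits")
             for p in neighbors(target, "involved_in")}
print(processes)  # {'inflammation'}
```

The two-hop query is the kind of connected question — drug to target to biological process — that motivates graph representations in applications such as drug repurposing.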
Read more about knowledge graphs and their role in R&D.
The role of data science and technology experts
Structuring data using ontologies and taxonomies is highly specialized work. Few R&D organizations have employees with the right mix of skills needed for these tasks, and many organizations lack the technological maturity at the required scale.
For example, organizations may have the right dataset and knowledgeable chemists or biologists who can understand the inputs. These people are experts in their field but have little experience in data-specific tasks. External data scientists play a crucial role in aiding companies to structure data for successful AI, particularly for niche and specific use cases, such as drug repurposing or new materials development. Collaborating with technology experts also ensures that proprietary data and IP are well protected and remain within the organization’s “firewall,” preventing the inadvertent sharing of data in insecure public platforms.
Checklist: Robust data structuring and rich ontologies
☑ You have implemented a framework for data integration and normalization.
☑ You have used domain-specific ontologies to structure data.
☑ You have constructed and applied domain-specific taxonomies.
Data management and enrichment
Continuous data management and enrichment ensure long-term AI success and reduce the time and resources needed to clean and prepare data for model building.
Mitigating challenges around data integration
Datasets come at different levels of AI readiness and in multiple formats and structures. For example, formats could include experimental data in an electronic lab notebook, real-world data from a clinical study, textual references from scientific literature, instrument readings from a machine sensor, or patent data. R&D teams must embed data management practices that normalize and integrate both internal and external data. Creating such a data lifecycle means investing in frameworks for data management, including ontologies and taxonomies.
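One common pattern for such integration is a per-source adapter that maps each format onto a shared schema. The source formats, field names and values below are assumptions invented for this sketch:

```python
# Sketch: normalizing heterogeneous records into one shared schema.
def from_eln(record):
    """Adapter for a hypothetical electronic lab notebook export."""
    return {"source": "ELN",
            "compound": record["cmpd_name"],
            "value": record["yield_pct"]}

def from_literature(record):
    """Adapter for a hypothetical text-mined literature extraction."""
    return {"source": "literature",
            "compound": record["entity"],
            "value": record["reported_yield"]}

raw = [
    ({"cmpd_name": "compound-7", "yield_pct": 82.5}, from_eln),
    ({"entity": "compound-7", "reported_yield": 80.0}, from_literature),
]

normalized = [adapter(record) for record, adapter in raw]
print(normalized)
```

Because every adapter emits the same fields, downstream model-building code sees one uniform dataset regardless of where each record originated.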
Data and semantic enrichment to enhance results
Semantic enrichment empowers R&D organizations to release the full potential of structured and unstructured data in public and proprietary datasets. The process transforms text into clean, contextualized data, free from ambiguity, through annotation and the tagging of concepts and metadata. For example, semantic enrichment software can recognize and extract relevant terms or patterns in text and harmonize synonyms, such as “heart attack” and “myocardial infarction.” This approach eliminates “noise” and reduces AI hallucinations.
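A minimal sketch of that tagging step follows. The synonym table is a small hand-written stand-in; a real enrichment pipeline would draw its vocabulary from a curated ontology and use far more robust term recognition:

```python
import re

# Illustrative synonym table: surface term -> harmonized concept label.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
}

def annotate(text: str):
    """Tag known terms in free text and return harmonized concept labels."""
    found = set()
    for term, concept in SYNONYMS.items():
        # Whole-word, case-insensitive match against the raw text.
        if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            found.add(concept)
    return sorted(found)

print(annotate("Patient history includes a heart attack in 2019."))
# ['myocardial infarction']
```

Both surface forms now yield one concept label, so a downstream model or query sees a single, unambiguous entity instead of scattered synonyms.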
The role of custom services in data management
Just as building ontologies and taxonomies typically requires the help of outside experts, so does managing and semantically enriching datasets. Domain experts can create training datasets and build context-aware custom vocabularies that implement a shared language across all research functions. Vocabularies can include an organization’s bespoke terms, such as product names, as well as recognized concepts and terms used in its scientific discipline and industry, including by regulatory bodies. This approach ensures that R&D organizations both use new data in their AI applications and unlock value from legacy data that may go back many years.7
Checklist: Data management and enrichment
☑ There is a clear strategy for continuous data lifecycle management.
☑ Datasets are semantically enriched and contextualized.
☑ You are able to use legacy and existing data effectively in new applications.
References

1. What Generative AI Means for Business, Gartner. https://www.gartner.com/en/insights/generative-ai-for-business
2. Vladimir A. Makarov, Terry Stouch, Brandon Allgood, Chris D. Willis, Nick Lynch, Best practices for artificial intelligence in life sciences research, Drug Discovery Today, Vol 25, Issue 5, 2021. https://www.sciencedirect.com/science/article/abs/pii/S1359644621000477
3. Andreas Holzinger, Katharina Keiblinger, Petr Holub, Kurt Zatloukal, Heimo Müller, AI for life: Trends in artificial intelligence for biotechnology, New Biotechnology, Vol 74, 2023. https://www.sciencedirect.com/science/article/pii/S1871678423000031
4. Full-text scientific literature data from Elsevier, Elsevier.com. https://www.elsevier.com/en-gb/solutions/datasets/full-text-journals-data
5. Ann-Marie Roche, Harnessing ontologies for pharma: Dr Jane Lomax on the synergy of AI and scientific expertise, Elsevier Connect, February 2024. https://www.elsevier.com/en-gb/connect/harnessing-ontologies-for-pharma-dr-jane-lomax-on-the-synergy-of-ai-and-scientific-expertise
6. Joe Mullen, How knowledge graphs can supercharge drug repurposing, Elsevier Connect, February 2024. https://www.elsevier.com/en-gb/connect/how-knowledge-graphs-can-supercharge-drug-repurposing
7. Ann-Marie Roche, Worth its weight in gold: getting your legacy data in order, Elsevier Connect, March 2024. https://www.elsevier.com/connect/worth-its-weight-in-gold-getting-your-legacy-data-in-order