Worth its weight in gold: getting your legacy data in order
March 21, 2024
By Ann-Marie Roche

Daniel Allan/Image Source via Getty Images
Your R&D-driven company is sitting on reams of research data stashed in countless silos and formats – and laced with ever-evolving jargon. Where do you begin when you want to set your data free?
The expression “data is the new gold” resonates with the people of Johnson Matthey – a global company that started as a gold assayer for the Bank of England in 1817. Now focused on sustainable technologies that are catalyzing the net zero transition, “JM” has just achieved a milestone in safeguarding its intellectual property and making it accessible for researchers and algorithms to trigger future innovation.
Using Elsevier’s SciBite, JM is taking control of its “unstructured data problem” with data science and AI technologies unlocking – and interconnecting – all this knowledge. We spoke with three digital players at JM about the journey.
Webinar: Foundations for effective AI
Dive deeper into how Johnson Matthey is leveraging data, technology and tooling to pursue sustainable technologies using SciBite. Dr Nathan Barrow, Ed Wright and Owen Jones talk about how they got the foundational elements in place to drive more effective AI-driven outcomes.
Digital power trio
“Getting your data house in order is similar to what they say about growing a tree: the best time to start is 20 years ago,” said Principal Information Analyst Ed Wright. “The second-best time is right now.”
Ed, along with a number of other in-house digital champions, saw the potential – and long-term value – of organizing the company’s data using the FAIR data principles of Findability, Accessibility, Interoperability and Reusability.
Leading the charge, Dr Nathan Barrow oversees the overall digital transformation of JM’s R&D space – seeking out the most suitable technologies while driving the cultural change required to get the most out of this tech. In other words, he helps employees transition from legacy systems to more modern ones – while keeping the data in these older systems accessible.
As Data Science Strategy Lead for one of JM’s teams, Owen Jones oversaw a specific use case. He worked to establish the pipeline that prepared and brought together different data sources spanning a dizzying range of time periods, formats and terminologies.
Meanwhile, Ed was the driving force in untangling the complexities of the chemicals industry to build standardized vocabularies and ontologies that organize this disparate data into a unified – and FAIR – whole.

Nathan Barrow, PhD, R&D Digital Transformation Lead, Johnson Matthey
Making data part of the culture
While many were involved, these three represent the core challenges a company is likely to encounter when taking control of its data. And all three regard what they’ve already achieved as a career highlight.
“One of the things I am most proud of is getting the SciBite platform into Catalyst Technologies and raising the awareness that we had an unstructured data problem that we need to tackle,” said Owen.
“We managed to turn this into a thing and change the culture,” Ed added.
And now, with the proofs of concept and first use cases deemed a success, it’s onward and upward to other data deposits and other departments – while bringing the latest AI innovations into the loop.

Owen Jones, Data Science Strategy Lead, Johnson Matthey
Cutting-edge for two centuries
JM has been innovating for over 200 years. The company expanded from assaying gold to refining precious metals and beyond – pivoting as new challenges and opportunities arose. Today, it ranks as a global leader in sustainable technologies that help some of the world’s leading energy, chemicals and automotive companies decarbonize, reduce emissions and achieve their sustainability goals. “We’ve proven we’re not afraid to make these kinds of big changes to maintain a more focused strategy,” Nathan said.
“Our expertise in precious metals still underpins our technologies: it’s about making every milligram count,” Ed explained. “And with every application being quite niche, these all require their own development and innovation. And we’re still very tightly linked to this idea of questioning how we can make the most of everything. How can we adjust it, tweak it and keep moving it forward?”
In this way, employees were very open to trying something new if it meant streamlining their research.
However, SciBite still needed to prove its worth.
So much information, so little access
“Chemists have been filling up notebooks – both paper and digital versions – for decades and decades,” noted Nathan. “And since this was highly valuable intellectual property, they were sent off-site to a locked container. This, of course, makes it very difficult to actually go back and find the information that you’re looking for. And when your colleagues retire, it’s almost impossible to actually go back and find the important information they captured so diligently 20 years ago – but that’s still relevant for today.
“To avoid replicating all of that clever work that happened before, my job is to digitalize the chemistry and science that happens at JM. I am bringing in new tools and software so that the data is all captured not by the chemists but automatically by the instruments collecting the data. Then, the chemists can add extra information in terms of context and whether they thought the experiment worked or not.”
But while researchers switched to this new electronic lab notebook, there were still two obsolete electronic lab notebooks with legacy information on them. “These databases represent over 16 years and countless millions worth of research,” Nathan said. “So we needed a way for people to search and find the documents they needed – while going beyond a simple search on title or possibly abstract. Other solutions just didn’t do the job and lacked semantic search capabilities.”
The fallibility of human search engines
Meanwhile, the problem went beyond just two obsolete electronic lab notebook systems. “When it comes down to it, JM has got a huge wealth of knowledge stored in a few individuals,” said Ed. “And the way you work is you go and have a chat with that person. And then they’ll point you in the right direction – perhaps towards a certain report in a certain filing cabinet.
“But when the (COVID) lockdown happened, either you could no longer get to that person, or they could not get into their filing cabinet. Suddenly, there was this realization: ‘Hold on, we can’t work this way anymore. And, actually, these people are also moving steadily toward retirement. What happens then?’”
When a plan comes together
Happily, it soon became apparent that SciBite had the solution they needed. “I had already encountered SciBite at a conference around 2016, so quite a while ago,” said Ed. “And it’s been a progression since then with the pace picking up with COVID.”
“We could lay the foundation during the lockdown period – when people had more time on their hands. Everything seemed to converge nicely,” said Nathan. “As part of our proof of concept, we could put both notebooks into a single server, and since the information was now FAIR – and accessible for all – people had access to not only their information but more information. So suddenly, there was less barrier of letting go of their old system since they could all access their data.”
“Everyone was enthusiastic from the very first test when they could search for their own obscure terms that only existed in the JM universe,” said Ed. “People were very interested, and we realized this is absolutely the right tool and this gave us the confidence to deploy further.”
A very specific (but universal) use case
“I was actually convinced when I first saw SciBite’s demo video,” said Owen. “It was easy and straightforward to use. And I must say, since we have 1,300 researchers, it was appealing that we could license it for the whole department and not by user.”
“We in the Catalyst Technologies department had very much the same issues as the rest of the company,” said Owen. “Namely, collating legacy documents and figuring out how we can find all these old documents and bits of knowledge. Just do the math: many of those 1,300 scientists have been working here 20-30-40-50 years. That’s a lot of reports. And they are scattered throughout our digital infrastructure.”
“And with the central R&D problem around replacing old electronic lab notebook systems, we quickly realized SciBite could also solve our problems,” said Owen. “Now it’s been rolled out for six months with over 300,000 documents. People are using it and finding stuff they couldn’t find before. And we want to keep adding new data sources. We actually don’t even know what our 100% is. People are still coming and saying, ‘Hey, we have this library over here where we’ve been keeping documents for the past 20 years.’”
Owen is also eyeing the “orange notebooks” of lore – those notebooks that documented all the experiments from the pre-digital age. “As you can imagine, these handwritten notebooks can be a mess, but we’ve already done some extraction experiments, and I am hopeful we can get there.”
DIY ontologies
The project’s biggest challenge was, and remains, building the ontologies – the actual codification of all of JM’s facts. And while the SciBite team helped lay the groundwork for this process, it became a largely in-house effort. “SciBite is quite life-sciences focused, so a lot of the built-in ontologies are not applicable to catalyst technology,” Owen said. “This is something that really kept Ed busy.”
“It’s really part of my larger job as Principal Information Analyst,” Ed noted. “I work in a team that essentially provides intelligence for the company. My role is to see how we can use all of the new data becoming available through government and other open-source resources. I also see how we can use digital tools to better work with more conventional sources such as patents.”
In this case, the intelligence gathering is happening inside the company. “And to move forward, you need to do the standardization; that’s where you get into the ontologies, for which SciBite’s CENtree Ontology Manager is very useful,” Ed explained. “This in turn moves you into the world of knowledge graphs, where you can connect equivalent concepts across different data sources.
“And this is essential for a company like JM. We’re full of jargon. Every department has its own terminology. And we have 200 years of evolving jargon and 200 years of mergers, acquisitions and divestments – all with their own systems and nomenclature. So there’s a lot to sort out.”
Happily, once it’s sorted, it’s done.
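To make the idea of connecting equivalent concepts concrete, here is a minimal Python sketch of vocabulary-based tagging. It is not SciBite's or CENtree's API: the concept IDs, the two example terms and the helper names `build_index` and `tag_text` are all hypothetical, chosen only to show how different jargon can resolve to one shared concept.

```python
import re

# Hypothetical mini-vocabulary: one concept ID, several surface forms.
# IDs, labels and synonyms are illustrative assumptions, not JM data.
VOCABULARY = {
    "CHEM:0001": {
        "label": "platinum group metal",
        "synonyms": ["PGM", "platinum-group metal", "platinum group metals"],
    },
    "CHEM:0002": {
        "label": "steam methane reforming",
        "synonyms": ["SMR", "steam reforming"],
    },
}


def build_index(vocabulary):
    """Map every lower-cased label and synonym to its concept ID."""
    index = {}
    for concept_id, entry in vocabulary.items():
        for term in [entry["label"], *entry["synonyms"]]:
            index[term.lower()] = concept_id
    return index


def tag_text(text, index):
    """Return the set of concept IDs whose terms appear in the text."""
    found = set()
    for term, concept_id in index.items():
        # Match whole terms only, ignoring case.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(concept_id)
    return found


INDEX = build_index(VOCABULARY)

# Two invented report snippets that use different jargon for the same ideas.
old_report = "PGM loading was reduced in the SMR trial."
new_report = "Lower platinum group metal content for steam methane reforming."

old_tags = tag_text(old_report, INDEX)
new_tags = tag_text(new_report, INDEX)
```

Both snippets end up with the same concept IDs, so a search on either wording could surface both reports. A production ontology would also carry hierarchy and relationships, which this flat lookup leaves out.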
Short-term drudgery for long-term payback
But how do you avoid this relative drudgery of embedding this metadata – all that data that organizes your data – for the future?
“People who are writing reports today need to think more about how someone reads and uses their report in the future,” said Owen. “How are these readers going to find it? How are they going to reuse your data and your knowledge?”
“That’s the tricky part,” said Nathan. “In our new system, we are asking our scientists to add more metadata and context to their experiments. Once we’ve got a critical mass of information in the system, we can start using that structured data and layering it with AI. And then their lives are going to be a lot faster and easier. But our researchers are not feeling this yet. But we know we’ll get there!”
Onward and upward
In fact, with success, JM can expand on its ambitions. “I’d like to see more documents, reports and data sources, with more parts of JM starting to adopt it,” said Owen. “Technically, it would also be nice to see generative AI put on top to make it even more accessible. Using SciBite as the retrieval piece of a retrieval-augmented generation (RAG) system helps us reuse all our semantic knowledge and document sets with new AI tools. And that’s something we’re planning to do internally.”
“I hope all of the company’s main teams will have their taxonomies and ontologies in the next few years,” said Ed. “And I’d love for us to have a knowledge graph based on the tagging and everything we do in the back end. Then, we can start to put more of these AI approaches over the top end of it and really make all of this unstructured data readily accessible to the latest data science approaches.”
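For readers unfamiliar with RAG, the retrieval step Owen describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not JM's or SciBite's implementation: the `Document` class, the concept IDs, the sample texts and the functions `retrieve` and `build_prompt` are all hypothetical, and no language model is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    concepts: frozenset  # concept IDs assigned by an upstream tagging step


def retrieve(query_concepts, documents, top_k=2):
    """Rank documents by how many concept IDs they share with the query."""
    scored = []
    for doc in documents:
        overlap = len(query_concepts & doc.concepts)
        if overlap:
            scored.append((overlap, doc))
    # Highest overlap first; ties broken by document ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [doc for _, doc in scored[:top_k]]


def build_prompt(question, retrieved):
    """Assemble the text that would be passed to a generative model."""
    context = "\n".join(f"[{doc.doc_id}] {doc.text}" for doc in retrieved)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Invented documents, already tagged with hypothetical concept IDs.
DOCS = [
    Document("R-001", "PGM loading study for reforming.",
             frozenset({"CHEM:0001", "CHEM:0002"})),
    Document("R-002", "Notes on reactor maintenance.",
             frozenset({"CHEM:0003"})),
    Document("R-003", "Steam reforming pilot results.",
             frozenset({"CHEM:0002"})),
]

question = "What do we know about PGM use in steam reforming?"
hits = retrieve(frozenset({"CHEM:0001", "CHEM:0002"}), DOCS)
prompt = build_prompt(question, hits)
```

The point of the design is that the generative model only sees documents the semantic layer has already selected, so the existing tagging work is reused rather than replaced.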
Advice to other legacy companies
“I would say: think big, but start small,” said Nathan. “Have a well-defined small use case that you can show success with, and then move from there. You can’t boil the ocean.”
Ed agrees. “I think it’s about accepting that it’s a journey – and that it’s better not to put it off. Plant that tree now!”
After all, like a gold mine, it won’t dig itself.
Contributor

Ann-Marie Roche
Senior Director of Customer Engagement Marketing
Elsevier