Introduction
To solidify its position as an integral tool for translational sciences and enhance market adoption, metabolomics must confidently encompass a broad spectrum of metabolites while ensuring reproducibility. Advancements in both instrumentation and software have streamlined the process of generating data for untargeted metabolomics studies, making it a relatively standard practice. Consequently, there has been a surge in the number of research cores offering untargeted metabolomics services over the last decade. However, intrinsic challenges remain within this domain.
Notably, the untargeted workflows adopted by many laboratories do not address the various challenges associated with processing data and providing comprehensive experimental coverage. Typically, these workflows generate extensive lists of features with minimal annotation. Researchers then commonly apply a reductionist approach, which focuses on a subset of metabolites or pathways to the exclusion of others, or on metabolites exhibiting statistically significant differences among study groups. However, this practice fails to provide a holistic understanding of metabolome changes.
The failure of researchers to transparently convey the challenges intrinsic to metabolomics, coupled with the clinical community’s rapid drive towards incorporating new technologies, has led to an initial misalignment between expectations and “metabolomic deliverables” [1]. The aim of this blog is to convey how the diverse strategies employed for data processing and annotation heavily influence the outcome and translatability of data.
To address the divide between expectation and deliverable, it is essential to understand which software tools were employed, because the number of features that translate into actionable biological insight depends directly on the informatics methods used. We will discuss two essential areas where informatics pipelines differ drastically in how they annotate metabolites from raw spectral data: data processing and metabolite elucidation. Finally, we advise a careful approach when matching expectations with deliverables, as achieving this goal requires a facility that has deep expertise, extensive experience, and a vision to provide data that ensures the translatability of biological insights.
Data Processing
Strong data collection and processing are the cornerstone of untargeted LC-MS based metabolomics and establish a sound basis for identifying significant changes accurately. A series of critically important steps is performed sequentially; if these steps are not done correctly, they can introduce errors in metabolite annotation that propagate through the entire data processing workflow and affect subsequent data interpretation [2]. These steps include noise reduction by smoothing or filtering, baseline correction, deconvolution to detect and define thousands of spectral features from a single sample, and finally, peak matching and retention time alignment of spectra across multiple samples.
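To make the first of these steps concrete, here is a minimal sketch of smoothing, baseline correction, and peak picking on a single synthetic trace. It is an illustration under stated assumptions, not the algorithm of any particular software package: the moving average, rolling-minimum baseline, and local-maximum picker are deliberately simple stand-ins, and the trace, window sizes, and height threshold are made-up values.

```python
import math


def moving_average(signal, window=5):
    """Noise reduction: average each point with its neighbours."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def subtract_baseline(signal, window=41):
    """Baseline correction: subtract a rolling minimum from each point."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(signal[i] - min(signal[lo:hi]))
    return out


def pick_peaks(signal, min_height):
    """Peak picking: local maxima at or above min_height."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] >= min_height and signal[i - 1] < signal[i] >= signal[i + 1]
    ]


def gaussian(i, center, height, width=4.0):
    return height * math.exp(-((i - center) ** 2) / (2 * width ** 2))


# Synthetic trace (made-up numbers): two peaks at scans 60 and 140,
# a drifting baseline, and small alternating noise.
raw = [
    50 + 0.2 * i + gaussian(i, 60, 1000) + gaussian(i, 140, 600) + 3 * (-1) ** i
    for i in range(200)
]

smoothed = moving_average(raw)
corrected = subtract_baseline(smoothed)
peaks = pick_peaks(corrected, min_height=100)
print(peaks)
```

Each function has at least one parameter that the analyst must choose, which is where differences between workflows begin.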
Given the volume of the data generated, sophisticated automatic data processing strategies are necessary. Several “plug and play” open-source software tools are available: XCMS [3], MZmine [4], MetAlign [5], MetaboAnalyst [6], and MS-DIAL [7]. These options suit researchers who have a general understanding of the underlying principles involved in LC-MS data but cannot build their own custom workflow. Since the underlying algorithms differ, and each software package includes several different user-input parameters, the outcome of a metabolomic study will vary depending on how the data are processed.
Although parameter selection can greatly influence the end results, there is often little guidance or consensus on how to make choices that maintain the fidelity of results to the composition of the sample. Adding to the complexity is the non-harmonized use of instrumentation across metabolomic studies, an under-recognized but significant source of variation. There remains little uniformity or standardization of data processing strategies between laboratories and research groups, which often makes it difficult to independently replicate metabolomics findings.
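A toy example shows how much a single user-input parameter can matter. The sketch below counts "features" in the same trace at three noise thresholds; the trace values and thresholds are invented for illustration, and the peak rule (a simple local maximum) is an assumption rather than the method used by any of the tools named above.

```python
def count_features(trace, noise_threshold):
    """Count local maxima whose intensity is at or above the noise threshold."""
    return sum(
        1
        for i in range(1, len(trace) - 1)
        if trace[i] >= noise_threshold and trace[i - 1] < trace[i] >= trace[i + 1]
    )


# Made-up intensity trace containing large and small local maxima.
trace = [5, 9, 4, 30, 12, 80, 300, 90, 15, 22, 11, 150, 40, 8, 12, 6]

counts = {threshold: count_features(trace, threshold) for threshold in (10, 25, 100)}
print(counts)  # {10: 5, 25: 3, 100: 2}
```

The raw data never changed, yet the reported feature count ranges from two to five depending on one setting.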
This variability is evident in a recent comparative study of four software tools (MZmine 2, enviMass, Compound Discoverer™, and XCMS), which demonstrated low coherence between all four tools [2]. The overlap of features was approximately 10%, and, for each software tool, between 40% and 55% of features did not match those identified by any of the others. Further comparative analyses have demonstrated similar findings with other software programs [8–10].
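Overlap figures of this kind depend on how two features are judged to be "the same". The sketch below is one plausible way to compute them, assuming each feature is an (m/z, retention time) pair and that a match requires agreement within a ppm tolerance and a retention time tolerance. The tool names, feature values, and tolerances are hypothetical, and this is not the procedure used in the cited study.

```python
def same_feature(a, b, ppm=5.0, rt_tol=0.1):
    """True if two (mz, rt) features agree within the given tolerances."""
    mz_a, rt_a = a
    mz_b, rt_b = b
    return abs(mz_a - mz_b) / mz_a * 1e6 <= ppm and abs(rt_a - rt_b) <= rt_tol


def overlap_summary(feature_lists, ppm=5.0, rt_tol=0.1):
    """Per tool: fraction of features found by all other tools, and by none."""
    summary = {}
    for name, feats in feature_lists.items():
        others = [f for n, f in feature_lists.items() if n != name]
        in_all = unique = 0
        for feat in feats:
            hits = [
                any(same_feature(feat, g, ppm, rt_tol) for g in other)
                for other in others
            ]
            in_all += all(hits)
            unique += not any(hits)
        summary[name] = {
            "n": len(feats),
            "in_all": in_all / len(feats),
            "unique": unique / len(feats),
        }
    return summary


# Hypothetical feature tables (m/z, retention time in minutes) from three tools.
feature_lists = {
    "tool_a": [(180.0634, 2.10), (132.1019, 5.42), (301.1410, 7.80), (455.2900, 9.15)],
    "tool_b": [(180.0636, 2.12), (132.1019, 5.40), (250.1000, 3.30)],
    "tool_c": [(180.0633, 2.08), (301.1412, 7.85), (600.4000, 11.00)],
}

summary = overlap_summary(feature_lists)
for tool, stats in summary.items():
    print(tool, stats)
```

Tightening or loosening `ppm` and `rt_tol` changes the reported overlap, so the matching criteria should be stated alongside any comparison.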
The variance observed among these software tools can be partly explained by their capability to manage redundant features. Redundancy may arise from non-biological sources, such as artifacts and contaminants, or from biological origins, where degeneracy indicates the presence of multiple features originating from a single analyte. This encompasses isotopes, fragments, adducts, and oligomers. Overall, this gives insight into one underlying cause of why it is difficult to reproduce untargeted metabolomic analysis, even from the same raw data files.
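As an illustration of degeneracy, the sketch below flags co-eluting features whose m/z difference matches a known isotope, adduct, or neutral-loss spacing, and then merges linked features into groups. The mass differences are standard monoisotopic values; the feature table, tolerances, and the short list of relationships are assumptions chosen to keep the example small, and real tools consider many more relationships.

```python
# Mass differences (Da) between related singly charged ions.
KNOWN_DELTAS = {
    "13C isotope": 1.003355,
    "NH4 vs H adduct": 17.026549,
    "H2O loss (possible in-source fragment)": 18.010565,
    "Na vs H adduct": 21.981944,
}


def find_related_pairs(features, mz_tol=0.003, rt_tol=0.05):
    """Return (lighter_id, heavier_id, label) for co-eluting related features."""
    pairs = []
    feats = sorted(features, key=lambda f: f["mz"])
    for i, light in enumerate(feats):
        for heavy in feats[i + 1:]:
            if abs(light["rt"] - heavy["rt"]) > rt_tol:
                continue
            delta = heavy["mz"] - light["mz"]
            for label, expected in KNOWN_DELTAS.items():
                if abs(delta - expected) <= mz_tol:
                    pairs.append((light["id"], heavy["id"], label))
    return pairs


def group_features(features, pairs):
    """Merge linked features into groups with a small union-find."""
    parent = {f["id"]: f["id"] for f in features}

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    for a, b, _ in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for f in features:
        groups.setdefault(find(f["id"]), []).append(f["id"])
    return list(groups.values())


# Hypothetical feature table: four signals from one hexose plus one unrelated feature.
features = [
    {"id": "F1", "mz": 163.0601, "rt": 3.20},  # water-loss fragment
    {"id": "F2", "mz": 181.0707, "rt": 3.20},  # [M+H]+
    {"id": "F3", "mz": 182.0740, "rt": 3.21},  # 13C isotope of [M+H]+
    {"id": "F4", "mz": 203.0526, "rt": 3.19},  # [M+Na]+
    {"id": "F5", "mz": 150.0583, "rt": 6.40},  # unrelated
]

pairs = find_related_pairs(features)
groups = group_features(features, pairs)
print(pairs)
print(groups)
```

Here five detected features collapse to two groups, so a tool that handles redundancy differently would report a different feature count from identical raw data.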
Metabolite Elucidation
Annotation strategies in metabolomics are intricately tied to the confidence level of annotation (Figure 1). Different approaches exist, each influencing the reliability and interpretability of results. The strategy based on authentic standard compounds is the earliest developed route to assigning molecular structures to mass spectra and is sufficient for “Level 1” annotations (confident 2D structure annotations) [11]. Structure annotations that match mass spectra, along with retention time and fragmentation, against public or commercial reference spectral libraries such as HMDB [12], METLIN [13], MoNA [14], or GNPS [15] are “Level 2” annotations (probable structures). The third strategy uses quantum chemistry, heuristic-based methods, chemical reaction-based methods, and machine learning to predict in silico mass spectra for a molecular library or to annotate the substructures of query mass spectra; it requires only a molecular structure library rather than reference spectral libraries [16]. These in silico approaches have been developed with the aim of expediting the decoding of the dark (i.e., uncharacterized) metabolome.
A recent study demonstrated that virtually all fragment ion structure annotations in three major in silico MS2 libraries (HMDB, METLIN, and mzCloud) are incorrect and cautioned researchers against their use for structure annotation of MS/MS ions [17]. METLIN, led by Gary Siuzdak, shared the same sentiment and removed all in silico-generated data in 2020; it now provides only experimental MS/MS data.
Although this publication sounded the alarm on relying on in silico libraries, recent advancements in the application of network and graph-based methods for metabolomics data analysis have introduced the potential for a more systematic approach to leveraging those resources for exploration of the dark metabolome. Given the concerns over data quality, these approaches must be applied in the context of early discovery, and their limitations must be acknowledged when deducing novel insights. For translational sciences, where precision and accuracy are paramount, robust methodologies that yield Level 1 annotations are indispensable for confidently interpreting complex data sets.
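The difference between the first two levels can be sketched in code. The example below scores a query MS/MS spectrum against library entries with a binned cosine similarity and assigns Level 1 only when the match is to an in-house authentic standard that also agrees on retention time; any other spectral match is labeled Level 2. The compound names, spectra, field names, and thresholds are all hypothetical, and the level rule is a simplification of the reporting standards cited above.

```python
import math


def bin_spectrum(peaks, bin_width=0.01):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def cosine_similarity(spec_a, spec_b, bin_width=0.01):
    """Cosine similarity between two binned MS/MS spectra (0 to 1)."""
    a, b = bin_spectrum(spec_a, bin_width), bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def annotate(query, library, min_cosine=0.7, mz_tol=0.005, rt_tol=0.1):
    """Return the best annotation, preferring Level 1 and then higher cosine."""
    best = None
    for entry in library:
        if abs(entry["precursor_mz"] - query["precursor_mz"]) > mz_tol:
            continue
        score = cosine_similarity(query["msms"], entry["msms"])
        if score < min_cosine:
            continue
        rt_ok = entry.get("rt") is not None and abs(entry["rt"] - query["rt"]) <= rt_tol
        level = 1 if entry["source"] == "in_house_standard" and rt_ok else 2
        candidate = (level, -score, entry["name"])
        if best is None or candidate < best:
            best = candidate
    if best is None:
        return None
    return {"name": best[2], "level": best[0], "cosine": -best[1]}


# Made-up query and library entries.
query = {
    "precursor_mz": 181.0707,
    "rt": 3.20,
    "msms": [(85.0284, 100), (127.0390, 45), (163.0601, 30)],
}
library = [
    {"name": "compound_A (authentic standard)", "source": "in_house_standard",
     "precursor_mz": 181.0706, "rt": 3.22,
     "msms": [(85.0285, 95), (127.0391, 50), (163.0600, 25)]},
    {"name": "compound_A (public spectrum)", "source": "public_library",
     "precursor_mz": 181.0708, "rt": None,
     "msms": [(85.0286, 90), (127.0389, 55), (163.0602, 20)]},
    {"name": "compound_B (public spectrum)", "source": "public_library",
     "precursor_mz": 181.0712, "rt": None,
     "msms": [(69.0335, 100), (97.0284, 60)]},
]

with_standard = annotate(query, library)
public_only = annotate(query, library[1:])
print(with_standard)
print(public_only)
```

The same query spectrum earns a different confidence level depending on what it is matched against, which is why the size and provenance of the reference library matter as much as the matching score.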
There is increasing awareness in the international metabolomics community of the need for stringent quality assurance (QA) and quality control (QC) processes to ensure data quality and reproducibility [18,19]. At Metabolon, our philosophy is driven by the desire to provide clients with the highest data quality. Metabolon overcomes the variability issues observed in data processing and annotation through robust QA/QC procedures, a large Level 1 library, and a patented method that uses machine learning algorithms to match experimental data with our internal Level 1 library to ensure accuracy. These processes underpin the “chemocentric” approach that distinguishes Metabolon from the rest of the industry.
Our chemocentric approach focuses on the actual metabolites detected rather than the copious amounts of irrelevant ion-features that are the focus of an ion-centric approach. The chemocentric approach allows for extremely efficient data processing workflows while simultaneously reducing the number of false positives and misidentifications in the data on which statistical analyses are performed, providing clients with consistent and reproducible results.
Our aim is to enable researchers to draw meaningful conclusions that facilitate actionable insights. For pharmaceutical and biotech companies, this includes elucidating drug mechanisms of action, patient stratification, PK/PD, and safety and toxicity studies. For academic researchers, this includes gaining a deeper understanding of chronic diseases; for bioinformaticians, it includes building robust statistical methodologies and creating algorithms for multi-omics studies.
Pursuit of New Metabolites
Finally, “unknown metabolite” classifications are inevitable, and we urge researchers to take caution when evaluating unknown metabolites that appear promising. Rarely are these features metabolites; instead, they are usually signals associated with redundant features, as mentioned earlier. For example, a report by Sindelar and Patti highlights a three-year pursuit of a “promising feature” that repeatedly demonstrated significant association but ultimately turned out to be an in-source fragment [20]. Nevertheless, while our decades of experience have shown that most “unknowns” that do not match external databases or our internal library are non-relevant or non-biological features, a small number of features are actually relevant.
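One inexpensive screen before investing in an unknown is to check whether it co-elutes with, and tracks the intensity of, an already annotated feature across samples. The sketch below is a heuristic under stated assumptions: the feature names, intensities, and cutoffs are invented, and a high correlation only suggests a possible in-source fragment or adduct, since biologically related metabolites can also correlate.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def possible_parents(unknown, annotated, rt_tol=0.05, min_r=0.95):
    """Names of annotated features that co-elute and correlate with the unknown."""
    flagged = []
    for feat in annotated:
        if abs(feat["rt"] - unknown["rt"]) > rt_tol:
            continue
        if pearson(unknown["intensities"], feat["intensities"]) >= min_r:
            flagged.append(feat["name"])
    return flagged


# Made-up intensities across five samples.
unknown = {"rt": 4.51, "intensities": [120, 240, 60, 180, 300]}
annotated = [
    {"name": "metabolite_X", "rt": 4.50, "intensities": [1000, 2050, 480, 1500, 2600]},
    {"name": "metabolite_Y", "rt": 4.52, "intensities": [500, 480, 510, 495, 505]},
    {"name": "metabolite_Z", "rt": 9.80, "intensities": [1100, 2200, 550, 1650, 2750]},
]

flagged = possible_parents(unknown, annotated)
print(flagged)  # ['metabolite_X']
```

A flagged parent is a prompt to inspect the spectra and elution profiles more closely, not proof that the unknown is an artifact.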
As the industry leader in metabolomics, Metabolon produces the most reproducible data of the highest quality in the industry. With over 20 years of experience, we know first-hand the pitfalls and difficulties in LC-MS data processing and annotation. We have developed customizable methods that are continually refined through the application of machine learning algorithms, allowing for quicker time to actionable insights and answers to scientific questions.
References
1. Cheng S, Shah SH, Corwin EJ, et al. Potential Impact and Study Considerations of Metabolomics in Cardiovascular Health and Disease: A Scientific Statement From the American Heart Association. Circulation: Cardiovascular Genetics. 2017;10(2):e000032. doi:10.1161/HCG.0000000000000032
2. Hohrenk LL, Itzel F, Baetz N, Tuerk J, Vosough M, Schmidt TC. Comparison of Software Tools for Liquid Chromatography–High-Resolution Mass Spectrometry Data Processing in Nontarget Screening of Environmental Samples. Anal Chem. 2020;92(2):1898-1907. doi:10.1021/acs.analchem.9b04095
3. Tautenhahn R, Patti GJ, Rinehart D, Siuzdak G. XCMS Online: A Web-Based Platform to Process Untargeted Metabolomic Data. Anal Chem. 2012;84(11):5035-5039. doi:10.1021/ac300698c
4. Schmid R, Heuckeroth S, Korf A, et al. Integrative analysis of multimodal mass spectrometry data in MZmine 3. Nat Biotechnol. 2023;41(4):447-449. doi:10.1038/s41587-023-01690-2
5. LaPierre N, Alser M, Eskin E, Koslicki D, Mangul S. Metalign: efficient alignment-based metagenomic profiling via containment min hash. Genome Biology. 2020;21(1):242. doi:10.1186/s13059-020-02159-0
6. Pang Z, Zhou G, Ewald J, et al. Using MetaboAnalyst 5.0 for LC–HRMS spectra processing, multi-omics integration and covariate adjustment of global metabolomics data. Nat Protoc. 2022;17(8):1735-1761. doi:10.1038/s41596-022-00710-w
7. Tsugawa H, Nakabayashi R, Mori T, et al. A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. Nat Methods. 2019;16(4):295-298. doi:10.1038/s41592-019-0358-2
8. Gürdeniz G, Kristensen M, Skov T, Dragsted LO. The Effect of LC-MS Data Preprocessing Methods on the Selection of Plasma Biomarkers in Fed vs. Fasted Rats. Metabolites. 2012;2(1):77-99. doi:10.3390/metabo2010077
9. Li Z, Lu Y, Guo Y, Cao H, Wang Q, Shui W. Comprehensive evaluation of untargeted metabolomics data processing software in feature detection, quantification and discriminating marker selection. Analytica Chimica Acta. 2018;1029:50-57. doi:10.1016/j.aca.2018.05.001
10. Myers OD, Sumner SJ, Li S, Barnes S, Du X. Detailed Investigation and Comparison of the XCMS and MZmine 2 Chromatogram Construction and Chromatographic Peak Detection Methods for Preprocessing Mass Spectrometry Metabolomics Data. Anal Chem. 2017;89(17):8689-8695. doi:10.1021/acs.analchem.7b01069
11. Sumner LW, Amberg A, Barrett D, et al. Proposed minimum reporting standards for chemical analysis. Metabolomics. 2007;3(3):211-221. doi:10.1007/s11306-007-0082-2
12. Wishart DS, Guo A, Oler E, et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022;50(D1):D622-D631. doi:10.1093/nar/gkab1062
13. Xue J, Guijas C, Benton HP, Warth B, Siuzdak G. METLIN MS2 molecular standards database: a broad chemical and biological resource. Nat Methods. 2020;17(10):953-954. doi:10.1038/s41592-020-0942-5
14. Horai H, Arita M, Kanaya S, et al. MassBank: a public repository for sharing mass spectral data for life sciences. Journal of Mass Spectrometry. 2010;45(7):703-714. doi:10.1002/jms.1777
15. Wang M, Carver JJ, Phelan VV, et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat Biotechnol. 2016;34(8):828-837. doi:10.1038/nbt.3597
16. Tian Z, Liu F, Li D, Fernie AR, Chen W. Strategies for structure elucidation of small molecules based on LC–MS/MS data from complex biological samples. Computational and Structural Biotechnology Journal. 2022;20:5085-5097. doi:10.1016/j.csbj.2022.09.004
17. van Tetering L, Spies S, Wildeman QDK, et al. A spectroscopic test suggests that fragment ion structure annotations in MS/MS libraries are frequently incorrect. Commun Chem. 2024;7(1):1-11. doi:10.1038/s42004-024-01112-7
18. Evans AM, O’Donovan C, Playdon M, et al. Dissemination and analysis of the quality assurance (QA) and quality control (QC) practices of LC–MS based untargeted metabolomics practitioners. Metabolomics. 2020;16(10):113. doi:10.1007/s11306-020-01728-5
19. Mosley JD, Schock TB, Beecher CW, et al. Establishing a framework for best practices for quality assurance and quality control in untargeted metabolomics. Metabolomics. 2024;20(2):20. doi:10.1007/s11306-023-02080-0
20. Sindelar M, Patti GJ. Chemical Discovery in the Era of Metabolomics. J Am Chem Soc. 2020;142(20):9097-9105. doi:10.1021/jacs.9b13198