In her book, The Invention of Nature, Andrea Wulf quotes the great naturalist and explorer Alexander von Humboldt as describing the world “as a living whole, not a dead aggregate.”1 The majesty of Chimborazo and the vibrance of the Amazon led him to an interconnected view of nature, corroborated by centuries of observation and experimentation.
Nevertheless, the “dead aggregate” remained thrilling: in 2000, 10 years after the advent of the Human Genome Project, President Bill Clinton declared that the project was poised to “revolutionize the diagnosis, prevention, and treatment of most, if not all, human diseases.”2 Francis Collins, then director of the National Human Genome Research Institute, was equally bullish, declaring that “Genetic prediction of individual risks… will reach the medical mainstream in the next decade or so. The development of designer drugs… will follow soon after.”3 While the nearly $3 billion, 13-year project has been undeniably useful, it fell short of those lofty expectations.4 Drug development, instead of accelerating, seems to have stalled.5 Companies, patients who need life-saving therapies, and industries that depend on technological innovation all need faster, cheaper results.
The Human Genome Project was merely a harbinger, and the era of biological big data is just beginning. Price tags are falling fast: in 2023, Illumina’s NovaSeq X advertised a cost of roughly $200 per human genome, which – even excluding the cost of sample acquisition and preparation – is a minute fraction of the HGP’s cost just 20 years earlier. As a result, the largest repository of genetic sequences, GenBank, contained sequences from 504,000 formally described species as of 2023,6 and the UK Biobank has collected over 500,000 participant datasets since 2006.7 This is just the tip of the iceberg; genomic datasets are growing at an exponential rate and are beginning to include innovative releases like the draft human pangenome.8 The resulting genomics market, valued at $33.90 billion in 2023 and forecast to grow at a 16.6% compound annual growth rate (CAGR) over the next decade,9 is a strong indicator that biological big data is here to stay.
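As a quick sanity check on that forecast, the decade-out figure implied by the stated CAGR can be computed directly; the sketch below assumes simple annual compounding from the 2023 baseline and is illustrative only.

```python
# Compound the 2023 genomics market estimate at the forecast CAGR.
# Assumes simple annual compounding over a 10-year horizon (illustrative only).
base_2023_usd_bn = 33.90   # 2023 market size, USD billions
cagr = 0.166               # forecast compound annual growth rate
years = 10

projected = base_2023_usd_bn * (1 + cagr) ** years
print(f"Implied ~2033 market size: ${projected:.1f}B")  # ~ $157B
```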
Attempts to read a human’s past, present, and future from their As, Ts, Cs, and Gs have stalled and diverged into a variety of broad assays. Researchers have found significant benefits in different profiling techniques, generally arranged by the Central Dogma, the model by which biology is best understood: DNA is transcribed into RNA, which is translated into proteins that form the functional basis of life. Assays have stratified into growing sectors in transcriptomics, proteomics, and metabolomics – corresponding to the biochemical entities of RNA, proteins, and metabolites – each a market estimated in the billions of dollars and growing fast. Given that the Central Dogma is an oversimplification of biological processes, and that new assays measuring things like the microbiome and the epigenome are surfacing every day, the potential for life sciences data generation seems unlimited.
What’s Driving Momentum?
Companies continue to commit billions of dollars and millions of hours to omics data collection for three major reasons. First, researchers have substantially more assets at their disposal, especially given the now-understood limitations of genomics alone. Large datasets can be employed to improve assets like biomarkers, directly benefiting patients while generating intellectual property. Second, the nature of these big data experiments has fundamentally reconfigured the hypothesis-driven research landscape. Instead of experiments operating to reject or confirm a hypothesis, these big datasets function as laboratories, leading to a variety of hypothesis-generating experiments. Third, the advent of artificial intelligence is enabling novel means of interrogating the data. The petabytes of data stored in individual databases can now be classified, scoured, and corroborated in innumerable ways.
The single-omic field’s early optimism has proved Pollyannaish. Instead, researchers have moved towards a more comprehensive understanding of these complex systems using a combination of multiple omics in the expanding and intuitively named field of “multiomics.” In some cases, this simply means analyzing two datasets in tandem; for example, do the same pathways appear enriched in both the metabolomic and the proteomic data? In more complicated studies, researchers can use advanced statistical methods like neural networks or graph learning to integrate these datasets.10 No matter the means, the ends are the same: multiomics gives researchers greater access to complex biological phenomena measured at scale. The resulting workflows can offer corroborated mechanistic understandings of systems, potential alternative drug targets, and patient stratification in ways that are inaccessible from any single dataset.
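To make the simplest “in tandem” case concrete, the sketch below tests whether the pathways enriched in a metabolomic analysis overlap with those enriched in a proteomic analysis more often than chance would predict, using a hypergeometric test. The pathway names, list sizes, and background count are hypothetical placeholders, not results from any real study.

```python
from scipy.stats import hypergeom

# Hypothetical enriched-pathway lists from two separate single-omic analyses.
metabolomic_hits = {"Glycolysis", "TCA cycle", "Urea cycle", "Bile acid metabolism"}
proteomic_hits = {"Glycolysis", "TCA cycle", "Oxidative phosphorylation", "Proteasome"}
background_pathways = 300  # total pathways tested in both analyses (assumed)

shared = metabolomic_hits & proteomic_hits

# Probability of seeing at least this many shared pathways by chance, modeling the
# proteomic hits as a draw from the background in which the metabolomic hits are "successes".
p_overlap = hypergeom.sf(len(shared) - 1, background_pathways,
                         len(metabolomic_hits), len(proteomic_hits))

print("Shared enriched pathways:", sorted(shared))
print(f"Hypergeometric p-value for the overlap: {p_overlap:.3g}")
```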
Omics All the Way Down
Multiomics evolved naturally from single-omic experiments, which in turn evolved from more specific assays. Biological assays – commercialized as biomarkers or research aids – are quietly part of everyday life for many people. People with diabetes monitor their blood sugar with a continuous glucose monitor (CGM), while a metabolic blood panel – measuring key electrolytes, creatinine, glucose, and urea – can identify dysfunction in key organs before it becomes catastrophic. Beyond monitoring, biomarkers can be useful for diagnosis, prognosis, susceptibility, and other functions.11 The nasal swabs that became necessary starting in 2019 also qualify: researchers were able to amplify and align exogenous viral RNA against SARS-CoV-2’s many strains to rapidly identify infection before it spreads, and other biomarkers may give insight into patients’ differential responses to infection.12 Biomarkers are a diverse field that is difficult to delimit; even so, they represent an estimated market of $90.2 billion in 2024 with a CAGR of 13.6%.13
Given the commercialization potential of these specific measurements, researchers are scrambling to find new ways to qualify function, dysfunction, or disease early and comprehensively. Improvements in cost and analytical capacity have enabled wider research strategies: what began as measuring one or a few metabolites, sequences, or proteins in a biospecimen has grown into many thousands of measurements. The same human genome that took over thirteen years to sequence can now be had in less than a week. Over 5,000 metabolites that might have taken months each to elucidate can be delivered as high-quality matches in under five weeks by Metabolon. Measuring at scale has finally met industry’s tempo and cost, opening a vast array of new options for researchers.
Beyond the obvious “why not?”, these wider profiling techniques have multiple benefits. Much is unknown: for example, while those same CGMs are essential for Type 1 diabetics, other, less treatable diseases might require detection or qualification beyond monitoring. Research has demonstrated that combinations of biomarkers can be more precise than carbohydrate antigen profiling in ovarian and pancreatic cancers.14,15 Researchers can operate in a context guided by the circumstances of the disease: carbohydrate antigen 19-9 (CA 19-9) makes sense as a biomarker for cancer because of its consistent presentation in multiple aggressive and late-stage tumors both in vivo and in vitro.16 But by the time something like CA 19-9 is measurable, valuable time has already been lost.17 Rather than relying on late-stage biomarkers, advances in technology can simultaneously use genomic, transcriptomic, metabolomic, and proteomic data to probe the subtle, direct effects of disease. For patients facing low survival rates, high expenses, and medications with adverse side effects, even weeks of earlier detection can mean thousands of dollars saved and years more of quality life.
The New Laboratory
With over 63,000 known human genes, more than 100,000 postulated human proteins, and a nearly unlimited number of endogenous and exogenous human metabolites, this field is just getting started. Untargeted commercial transcriptomics assays typically provide 10,000-25,000 annotated sequences, and mass spectrometric (MS) assays for metabolomics and proteomics provide several thousand annotated entities. Notably, the process of ushering “raw” analytical data to quality biochemical annotation is complex and continuously improving as researchers better understand the data being collected. The underlying data, now required by many publications and databases in addition to the final annotated version, can provide ongoing insights as processing techniques mature. These ongoing innovations have created a new type of experimentation: massive data collection that provides a new laboratory in which to perform hypothesis-directed research.
Public repositories and biobanks are proving to be crucial instruments for companies of any size and maturity. For startups, having researchers who are familiar with accessing and analyzing repositories like GenBank, the Gene Expression Omnibus (GEO), and the Metabolomics Workbench is essential to providing the initial insights that direct expensive lab operations. Large companies fund and leverage these datasets for crucial operations like de-risking clinical trials,18 identifying novel targets,19 and providing reference cohort data against which to more effectively analyze their patient samples.20 Researchers can also use these datasets to test published or commercial biological results, bringing greater scientific rigor to the field. This healthy cycle of research is facilitated by the increasingly “FAIR” (findable, accessible, interoperable, and reusable) datasets that these centralized repositories provide, allowing for a globally collaborative library.21
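As a minimal sketch of that kind of programmatic access, the snippet below uses Biopython’s Entrez utilities to search GEO DataSets for studies matching a topic and print short summaries. The email address, search term, and result fields shown are assumptions for illustration; NCBI requires a real contact email, and the summary fields returned can vary.

```python
from Bio import Entrez

Entrez.email = "researcher@example.com"  # placeholder; NCBI requires a contact email

# Search GEO DataSets (db="gds") for a hypothetical topic of interest.
search = Entrez.esearch(db="gds",
                        term="pancreatic cancer[Title] AND Homo sapiens[Organism]",
                        retmax=5)
result = Entrez.read(search)
search.close()

# Retrieve short summaries for the matching records.
summary = Entrez.esummary(db="gds", id=",".join(result["IdList"]))
records = Entrez.read(summary)
summary.close()

for rec in records:
    print(rec.get("Accession"), "-", rec.get("title"))
```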
As well-engineered as these public repositories are, understanding their provenance, analytical settings, and processing steps is crucial to obtaining robust biological insights. These public datasets are prone to a “tragedy of the commons”: they can be clouded by incorrect metadata, poor run quality, or unsuitable data pipelines. Metabolomics data from mass spectrometry, for example, can depend on a variety of analytical factors like instrument settings and platform, and inter-laboratory variability is common.22 Sample preparation also has major impacts on data quality because of the transient, sensitive nature of metabolites, and experience with a variety of matrices is key to success.23 The processing pipelines, too, require careful alignment to the analytical and experimental design. Aligning metabolomics data to validated standards measured on the same platform is essential to ensuring reliable data collection.24 While the new data laboratory is ushering in an era of broad collaboration and discovery, understanding data quality and partnering with good data providers is essential to maintaining its utility.
Biological Intelligence
Supposing that the data are high-quality and well-engineered, artificial intelligence is facilitating the next frontier of research. Data are relatively easy to generate, and established instruments and workflows can provide gigabytes of data in a few hours. The resulting files are where some of the greatest opportunities are found: the raw files contain untapped multitudes of information known as analytical “dark matter,”25 while the processed data can be fed into state-of-the-art modeling paradigms to achieve insights. As with all big data, the establishment of strong, FAIR data processes and the creation of large datasets is giving birth to a new era of research.
Data from omics analytical platforms are not straightforward to analyze and are subject to a tortuous route of processing. Starting this process from scratch requires a significant degree of collaboration, expertise, and computational power, and it has therefore produced any number of intermediate, semi-raw data formats.26 For example, FASTQ files are widely regarded as “raw” data but are really a useful intermediate that retains most of the data coming directly off the Illumina device. Metabolomics data are provided in several different formats, including the aptly named .raw format, but are more accessible in open forms like .mzML. These raw or raw-lite formats are essential for ongoing insights: as processing algorithms improve and reference libraries expand, more and more information can surface. Ongoing partnerships to develop machine learning models tailored to consistent, validated data acquisition processes, like Metabolon’s, can continue to extract value from even old datasets.
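To make the “raw-lite” idea concrete, here is a dependency-free sketch that walks a FASTQ file and reports the mean base quality of each read. It assumes the common four-line record layout and Phred+33 quality encoding; the file path is a placeholder.

```python
from statistics import mean

def read_fastq(path):
    """Yield (read_id, sequence, quality_string) from a FASTQ file (4 lines per record)."""
    with open(path) as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break                      # end of file
            seq = fh.readline().rstrip()   # base calls
            fh.readline()                  # '+' separator line (ignored)
            qual = fh.readline().rstrip()  # per-base quality string
            yield header[1:], seq, qual

def mean_phred(quality_string, offset=33):
    """Mean Phred score, assuming Phred+33 (Sanger / modern Illumina) encoding."""
    return mean(ord(ch) - offset for ch in quality_string)

for read_id, seq, qual in read_fastq("example_reads.fastq"):  # placeholder path
    print(f"{read_id}\tlength={len(seq)}\tmean_Q={mean_phred(qual):.1f}")
```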
On the other end, biological insights can be combined in increasingly innovative ways to yield more comprehensive conclusions. Classification, regression, and unsupervised learning algorithms are increasingly applied to biological data, yielding a cottage industry of biomarker discovery engines. Working with multiple layers of omics data at scale has consistently produced better-performing models than single-omic data.14,27 Knowledge graphs and large language models are among the most important emerging tools for organizing exceedingly complex data into fathomable forms.28,29 Over time, acquiring more and more high-quality data from many levels of experimentation, and backing it with substantial computational tools, is the future of biology.
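The sketch below illustrates this pattern in its simplest form: a classifier trained on one simulated omics layer versus the same classifier trained on two concatenated layers (naive early integration), using scikit-learn. The data are synthetic and the feature counts arbitrary; real multi-omics integration typically uses more sophisticated methods, as discussed above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulate 200 samples with two "omics layers" sharing one phenotype label.
X_all, y = make_classification(n_samples=200, n_features=120, n_informative=30,
                               random_state=0)
X_metabolomic, X_proteomic = X_all[:, :60], X_all[:, 60:]  # arbitrary split into layers

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

acc_single = cross_val_score(model, X_metabolomic, y, cv=5).mean()
acc_multi = cross_val_score(model, np.hstack([X_metabolomic, X_proteomic]), y, cv=5).mean()

print(f"Single-layer CV accuracy:        {acc_single:.2f}")
print(f"Concatenated two-layer accuracy: {acc_multi:.2f}")
```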
Metabolomics
Metabolomics remains a relatively untapped and challenging opportunity for researchers. Where the genome exists as a static blueprint of the cell, the metabolome provides a counterpoint as the transient, interactive layer of biology. It is the closest layer of biology to the phenotype, presenting the final “decisions” of cells as they respond to internal and external signals.28,30 These external signals also operate as a measurable interface between an organism and its environment. Metabolomic profiling can interrogate host-microbiome interactions by identifying substances crucial to the health of the microbiome and its influence on the host.31 In cases where even small compounds can pose health risks, measuring the exposome through metabolomics has provided evidence that chemicals like PFAS are, in some ways, the new Broad Street pump handle.32 And, where humans fail to report their nutrition accurately, the metabolome can provide real data that improves the quality of experimental conclusions.33 The sheer scale of the metabolome and the expertise required to measure it well are the main obstacles to adoption, but its role as the interface of biology makes it essential to multiomics and biological discovery.
Conclusions
Biological data at scale has long existed as a “dead aggregate.” While some interesting conclusions can be found in it, advances in the scale, sensitivity, and analysis of the assays that comprise multiomics are breathing new life into it. Accurately reflecting the complexity and interconnectedness of the biological systems that researchers study is already demonstrating consistent, quantitative benefits. As access to new processing tools and more curated data commons improves, the concurrent acceleration of biological insights will result in better outcomes for patients, less risky drug development pipelines, and more mechanistic understanding of the underlying biological processes. Metabolomics plays a key role as the interface between an individual’s phenotype and its environment, but it is a demanding field requiring substantial collaboration and expertise. Even so, multiomics is the new frontier for research and promises substantial breakthroughs in the years to come.
References
- Wulf A. The Invention of Nature: Alexander von Humboldt’s New World. Alfred A. Knopf; 2015.
- National Human Genome Research Institute. Draft of the Human Genome Sequence Announcement at the White House. Published online 2000.
- Collins FS, McKusick VA. Implications of the Human Genome Project for Medical Science. JAMA. 2001;285(5):540-544. doi:10.1001/jama.285.5.540
- Hall SS. Revolution Postponed: Why the Human Genome Project Has Been Disappointing. Sci Am. Published online October 2010.
- Congressional Budget Office. Research and Development in the Pharmaceutical Industry; 2021.
- Sayers EW, Cavanaugh M, Clark K, et al. GenBank 2023 update. Nucleic Acids Res. 2023;51(D1):D141-D144.
- Sudlow C, Gallacher J, Allen N, et al. UK Biobank: An Open Access Resource for Identifying the Causes of a Wide Range of Complex Diseases of Middle and Old Age. PLoS Med. 2015;12(3):e1001779. doi:10.1371/journal.pmed.1001779
- Liao WW, Asri M, Ebler J, et al. A draft human pangenome reference. Nature. 2023;617(7960):312-324. doi:10.1038/s41586-023-05896-x
- Nova One Advisor. Genomics Market Size to Hit USD 157.47 Billion by 2033. BioSpace.
- Athieniti E, Spyrou GM. A guide to multi-omics data collection and integration for translational medicine. Comput Struct Biotechnol J. 2023;21:134-149. doi:10.1016/j.csbj.2022.11.050
- Califf RM. Biomarker definitions and their applications. Exp Biol Med. 2018;243(3):213-221. doi:10.1177/1535370217750088
- Bodaghi A, Fattahi N, Ramazani A. Biomarkers: Promising and valuable tools towards diagnosis, prognosis and treatment of Covid-19 and other diseases. Heliyon. 2023;9(2):e13323. doi:10.1016/j.heliyon.2023.e13323
- PMI. Biomarker Market Size & Share to Exceed USD 288.5 Billion by 2034, at CAGR of 13.6%. Yahoo! Finance.
- Kang KN, Koh EY, Jang JY, Kim CW. Multiple biomarkers are more accurate than a combination of carbohydrate antigen 125 and human epididymis protein 4 for ovarian cancer screening. Obstet Gynecol Sci. 2022;65(4):346-354. doi:10.5468/ogs.22017
- Kane LE, Mellotte GS, Mylod E, et al. Diagnostic Accuracy of Blood-based Biomarkers for Pancreatic Cancer: A Systematic Review and Meta-analysis. Cancer Research Communications. 2022;2(10):1229-1243. doi:10.1158/2767-9764.CRC-22-0190
- Lee T, Teng TZJ, Shelat VG. Carbohydrate antigen 19-9 – tumor marker: Past, present, and future. World J Gastrointest Surg. 2020;12(12):468-490.
- Berger AC, Garcia M, Hoffman JP, et al. Postresection CA 19-9 Predicts Overall Survival in Patients With Pancreatic Cancer Treated With Adjuvant Chemoradiation: A Prospective Validation by RTOG 9704. Journal of Clinical Oncology. 2008;26(36):5918-5922. doi:10.1200/JCO.2008.18.6288
- Huerga I. Q&A: De-risking clinical trials with real-world data. Tempus.com.
- Ivanisevic T, Sewduth RN. Multi-Omics Integration for the Design of Novel Therapies and the Identification of Novel Biomarkers. Proteomes. 2023;11(4):34.
- Boer AC, Burgers LE, Mangnus L, et al. Using a reference when defining an abnormal MRI reduces false-positive MRI results—a longitudinal study in two cohorts at risk for rheumatoid arthritis. Rheumatology. 2017;56(10):1700-1706. doi:10.1093/rheumatology/kex235
- Wilkinson MD, Dumontier M, Aalbersberg IjJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018. doi:10.1038/sdata.2016.18
- Lin Y, Caldwell GW, Li Y, Lang W, Masucci J. Inter-laboratory reproducibility of an untargeted metabolomics GC–MS assay for analysis of human plasma. Sci Rep. 2020;10(1):10918. doi:10.1038/s41598-020-67939-x
- González-Domínguez R, González-Domínguez Á, Sayago A, Fernández-Recamales Á. Recommendations and Best Practices for Standardizing the Pre-Analytical Processing of Blood and Urine Samples in Metabolomics. Metabolites. 2020;10(6).
- Sumner LW, Amberg A, Barrett D, et al. Proposed minimum reporting standards for chemical analysis. Metabolomics. 2007;3(3):211-221. doi:10.1007/s11306-007-0082-2
- Ross JL. The Dark Matter of Biology. Biophys J. 2016;111(5):909-916.
- Krassowski M, Das V, Sahu SK, Misra BB. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front Genet. 2020;11.
- Karaman ED, Işık Z. Multi-Omics Data Analysis Identifies Prognostic Biomarkers across Cancers. Medical Sciences. 2023;11(3).
- Wörheide MA, Krumsiek J, Kastenmüller G, Arnold M. Multi-omics integration in biomedical research – A metabolomics-centric review. Anal Chim Acta. 2021;1141:144-162. doi:10.1016/j.aca.2020.10.038
- Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12(1):3445. doi:10.1038/s41467-021-23774-w
- Rattray NJW, Deziel NC, Wallach JD, et al. Beyond genomics: understanding exposotypes through metabolomics. Hum Genomics. 2018;12(1):4. doi:10.1186/s40246-018-0134-x
- Bauermeister A, Mannochio-Russo H, Costa-Lotufo L V, Jarmusch AK, Dorrestein PC. Mass spectrometry-based metabolomics in microbiome investigations. Nat Rev Microbiol. 2022;20(3):143-160. doi:10.1038/s41579-021-00621-9
- Fenton SE, Ducatman A, Boobis A, et al. Per- and Polyfluoroalkyl Substance Toxicity and Human Health Review: Current State of Knowledge and Strategies for Informing Future Research. Environ Toxicol Chem. 2021;40(3):606-630. doi:10.1002/etc.4890
- Adjoian TK, Firestone MJ, Eisenhower D, Yi SS. Validation of self-rated overall diet quality by Healthy Eating Index-2010 score among New York City adults, 2013. Prev Med Rep. 2016;3:127-131. doi:10.1016/j.pmedr.2016.01.001