Introduction to Multiomic Data Analysis
The field of bioinformatics has progressed to the point where we have reached a consensus on the best methods for specific data analysis and preparation tasks. This can be seen in genomics, with software such as GATK, Sentieon, and DRAGEN1–3 commonly providing robust and computationally efficient solutions for variant calling. Similarly, in transcriptomics, STAR has emerged as the most popular tool for alignment (with alternative methods like Salmon for niche use). edgeR and DESeq2 (which have converged on similar answers to normalization and differential expression) are considered the interchangeable gold standard tools for tertiary analyses4,5. Many emerging technologies, such as the proteomics approaches employed by companies like Olink, come with proprietary data analysis packages that provide the recommended (though not necessarily optimal) data analysis solutions. However, this consensus has not yet emerged in some key areas, such as the integration of multiple omics (multiomics) technologies, where the huge diversity of possible combinations poses unique challenges.
The Challenge of Integrating Multiomics Technologies
The integration of omics data from multiple modalities is a complex topic with multiple competing approaches where there is not yet consensus on the best answers to many questions. The continued emergence of new data modalities (such as single-cell and spatial technologies) creates even more ambiguity regarding the optimal approaches that is unlikely to be resolved for years to come. The primary challenge in merging data from various omics disciplines arises from their distinct data distributions and other unique features intrinsic to each discipline. These characteristics include variations in data format, scale, and measurement techniques, which significantly influence data analysis and interpretation. For example, metabolomics usually utilizes mass spectrometry or other similar methods, which have a lower limit of detection. Therefore, many metabolites that do not meet this threshold are often imputed, leading to many readings at the same level for some metabolites. Gene variant data is generally binary and heavily zero-inflated, a problem shared by newer modalities such as single-cell and spatial data. Normalizing and standardizing all of these data types to ensure comparability poses challenges to establishing standardized workflows.
Embeddings in Omics Data Integration
Despite this ambiguity, however, a number of approaches are certainly here to stay. The first and most prevalent of these is the usage of embeddings. Embeddings are an approach to omics integration that compresses the signals associated with sets of highly correlated variables (think levels of transcripts and proteins associated with pathways or processes that are closely related and correlated) into a smaller number of dimensions. Samples can then be analyzed on a number of simpler spectra representing important aspects of variability between samples. The factors driving each of these compressed variables can be interpreted to discover what each compressed variable represents. Essentially every compressed variable will have an associated set of weights that show how much each molecular signal (i.e., gene, protein, variant, metabolite) contributed to it. Most of the common methods used in multiomic embedding generation involve matrix factorization (employed by methods such as MOFA2), canonical correlation analysis, or principal component analysis6,7. Variants of PLS-DA (employed by DIABLO) also exist for supervised analyses8. Most embedding approaches allow an embedding to be frozen so that new samples containing the same variables can be projected into it, making embeddings an excellent tool for developing quick and effective classifiers at low computational cost.
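The core mechanics described above (compressing correlated features into a few factors, inspecting the per-feature weights, and projecting new samples into a frozen embedding) can be sketched in a few lines of NumPy. This is a minimal PCA-style illustration on synthetic data, not the MOFA2 or DIABLO implementations; all matrix shapes and variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy multiomic matrix: 20 samples x 50 features (e.g., scaled transcript
# and protein abundances concatenated column-wise). Purely synthetic.
X = rng.normal(size=(20, 50))

# Fit a PCA-style embedding: center the data, then take the top
# components from the singular value decomposition.
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
k = 3                      # number of compressed variables (factors)
components = Vt[:k]        # (k, 50) loading matrix: per-feature weights

# Embed the training samples into the factor space.
Z = (X - mean) @ components.T          # (20, k)

# "Freeze" the embedding (store `mean` and `components`) and project a
# new sample measured on the same features into the same factor space.
x_new = rng.normal(size=(1, 50))
z_new = (x_new - mean) @ components.T  # (1, k)

# The loading matrix shows which molecules drive each factor: here,
# the five features with the largest absolute weight on factor 0.
top_drivers = np.argsort(-np.abs(components[0]))[:5]
```

Because projection is just a centered matrix product against stored loadings, scoring new samples is cheap, which is what makes frozen embeddings attractive as classifiers.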
Limitations and Challenges of Embedding-Based Approaches
While embeddings are one of the most prevalent methods of integrating omics data (and analyzing it in general), their usage comes with characteristics that can make their interpretation challenging, and the diversity of options makes arriving at an optimal answer difficult. As there are typically very large numbers of entities (genes, proteins, metabolites, etc.) involved in omics experiments, the interpretation of any given compressed variable can be noisy, and without established information linking molecules across multiple omics modalities, the information can be very hard to interpret. Embedding-based approaches are also particularly subject to scaling and normalization issues where multiple omic data types are used. Since the particulars of each omics type differ, much care must be taken to apply appropriate standardization: this makes all data types comparable, ensures actual integration of signals, and prevents scenarios where covariance between signals from different sources (e.g., metabolomics and transcriptomics) cannot be found because of the way abundances are distributed in those technologies.
Recently, more advanced generative methods (such as autoencoders and variational autoencoders) have come into use, particularly in single-cell and spatial contexts, where data size is typically much larger than in other omics technologies. Some recent studies have shown the efficacy of these generative approaches in single omics contexts, demonstrating that the compressed variables generated via autoencoders are meaningfully mapped to disease-related processes and could be used to isolate disease subtypes9. These generative methods handle the non-linearity common in complex biological data better than linear methods, so they may more readily capture important variation across the omics layers of a data set while providing interpretability comparable to other embedding methods. The primary downside of generative approaches is the large amount of data required to adequately train them; however, continued development will likely decrease the amount of data required for effective generative embeddings.
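To make the autoencoder idea concrete, the sketch below trains a deliberately minimal linear autoencoder in plain NumPy: an encoder matrix compresses samples to two latent variables, a decoder reconstructs them, and both are updated by gradient descent on the reconstruction error. Real applications use deep non-linear networks in frameworks such as PyTorch; the data, sizes, and learning rate here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 100 samples x 30 features with hidden low-dimensional structure.
latent = rng.normal(size=(100, 2))
X = latent @ rng.normal(size=(2, 30)) + 0.05 * rng.normal(size=(100, 30))

# Minimal linear autoencoder: encoder W_e compresses to 2 dimensions,
# decoder W_d reconstructs; trained by gradient descent on squared error.
n_in, n_hid = 30, 2
W_e = rng.normal(scale=0.1, size=(n_in, n_hid))
W_d = rng.normal(scale=0.1, size=(n_hid, n_in))
lr = 1e-3

loss_first = None
for step in range(500):
    Z = X @ W_e                 # encode: (100, 2) compressed variables
    X_hat = Z @ W_d             # decode: reconstruction of the input
    err = X_hat - X
    loss = (err ** 2).mean()
    if loss_first is None:
        loss_first = loss
    # Gradients of the (scaled) reconstruction error w.r.t. each matrix.
    grad_Wd = Z.T @ err / len(X)
    grad_We = X.T @ (err @ W_d.T) / len(X)
    W_d -= lr * grad_Wd
    W_e -= lr * grad_We

loss_final = loss
```

The columns of `Z` play the same role as the compressed variables of a linear embedding; adding non-linear activations between the layers is what lets real autoencoders capture the non-linear structure discussed above.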
Despite some of the difficulties associated with embeddings, many of these methods have shown utility over a long period of time, as the ability to detect groups of samples or features that behave similarly over large multiomic experiments is vital to the interpretation of many experimental designs. One previous study using Metabolon data used the MOFA algorithm to discover novel blood biomarkers for tuberculosis in children10. In practice, this enables the detection of disease subtypes or pathways that behave differently in response to a disease or drug treatment. Additionally, embedding-based approaches have been established and improved over time and are quite suitable for implementation in reproducible bioinformatics pipelines.
Local Integration Methods in Multiomic Analysis
Rather than integrating all modalities at once, associations between any two specific omics layers are often quantified directly. These methods are often referred to as ‘local’ integration methods (as opposed to ‘global’ methods such as the usage of embeddings). The most common method of local integration is to quantify the effect of one omics layer on another via some form of regression model. The most well-known of these methods is expression quantitative trait locus (eQTL) analysis, which is commonly used to associate changes in DNA sequence with expression of RNA molecules11. The same fundamental approach can be used to associate any two omics layers. The linear mixed-effects models commonly used in this kind of analysis are capable of accounting for covariates, which is vital in the analysis of dynamic and fast-changing systems, such as the metabolome. This kind of local integration is often constrained to narrow sets of features due to the high computational costs involved. Because there are typically upwards of hundreds of thousands of variants detected in genome sequencing analyses, it usually makes sense to restrict the analysis to a narrower set of genes or regions of interest. Due to the inherently standardizable nature of these kinds of analyses, they are excellent candidates for implementation as part of a standard platform.
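The regression at the heart of an eQTL-style scan can be sketched as a per-variant ordinary least squares fit of expression on genotype dosage. This is a simplified illustration on synthetic data: production analyses use linear mixed models with covariates and multiple-testing correction, and the effect sizes and sample counts below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Toy data: genotype dosages (0/1/2 copies of the alternate allele) for
# 3 variants, and expression of one gene truly driven by variant 0.
G = rng.integers(0, 3, size=(n, 3)).astype(float)
expr = 1.5 * G[:, 0] + rng.normal(scale=1.0, size=n)

def eqtl_scan(genotypes, expression):
    """Per-variant OLS: slope and t-statistic of expression ~ dosage."""
    results = []
    for j in range(genotypes.shape[1]):
        g = genotypes[:, j]
        X = np.column_stack([np.ones_like(g), g])   # intercept + dosage
        beta, *_ = np.linalg.lstsq(X, expression, rcond=None)
        resid = expression - X @ beta
        dof = len(g) - 2
        sxx = (g - g.mean()) @ (g - g.mean())
        se = np.sqrt(resid @ resid / dof / sxx)     # standard error of slope
        results.append((beta[1], beta[1] / se))     # (effect size, t-stat)
    return results

results = eqtl_scan(G, expr)
```

Swapping the genotype matrix for, say, metabolite abundances gives the same machinery for associating any two omics layers, which is why this family of analyses standardizes so well.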
Network Analysis in Multiomic Integration
Networks are one of the most flexible and useful ways of collecting and understanding data. As networks often better reflect the reality of many biological systems than simple binary comparisons or embeddings, they are often considered to provide the most comprehensive and accurate understanding of those systems. Networks also bring the advantage of being able to represent any kind of connection between any kind of node, meaning that networks can be assembled through combinations of statistical connections between data modalities and results of binary comparisons, allowing for an extremely comprehensive understanding of systems biology. Diverse subtypes of networks can also be used depending on the research questions. Structured hierarchical networks can help reveal the causal chain of activations in response to a drug, while unstructured networks may be more suitable to develop an understanding of a disease, as they can be used to assess connectedness between different aspects of disease (for example, immune responses and glucose metabolism). Projects such as the AD atlas, a results-based network for Alzheimer’s disease (AD), have shown the utility of these approaches, developing a large, comprehensible, and interactive network of AD-related biological features that provides excellent utility for investigators seeking to better understand AD12. Network analysis based on networks constructed from simple similarity metrics between molecular abundances (such as Pearson correlation, cosine distance, or Euclidean distance) is frequently employed in the analysis of single omics technologies to find contextual relationships between the different variables of an omics experiment and to find groups of variables that behave similarly.
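Building such a similarity network from an abundance matrix takes only a correlation computation and a threshold. The sketch below is a minimal single-omics illustration on synthetic data; the threshold of 0.6 and the co-regulated feature structure are assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_features = 50, 6

# Toy abundances: features 0-2 share a common driver (co-regulated),
# features 3-5 are independent noise.
driver = rng.normal(size=n_samples)
X = rng.normal(scale=0.3, size=(n_samples, n_features))
X[:, :3] += driver[:, None]

# Correlation network: nodes are features, edges connect pairs whose
# absolute Pearson correlation exceeds a chosen threshold.
corr = np.corrcoef(X.T)                                 # (6, 6) matrix
adj = (np.abs(corr) > 0.6) & ~np.eye(n_features, dtype=bool)
degree = adj.sum(axis=1)          # simple connectedness per node
```

The co-regulated features end up as a connected cluster while the independent ones stay isolated, which is exactly the "groups of variables that behave similarly" that single-omics network analysis looks for.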
However, when multiple omics modalities are considered, this process grows more complicated, as the complex non-linear ways in which biological molecules may interact can result in a high rate of false positive detection and crowded, difficult-to-interpret networks. Network fusion is one approach designed to help control this. The Similarity Network Fusion (SNF) approach13 is one common method that has been previously used as the basis for robust clustering analysis to detect AD subtypes14. The size of networks (particularly multiomic networks with many connections between each omics layer) creates difficulties in their computational analysis, as potentially millions of paths and connections may need to be considered. It can also be difficult to assess the relative importance of nodes in overall network structures, as simple connectedness and centrality statistics used to assess nodes can often miss important bridging nodes that may not rank highly according to these statistics.
Many of the difficulties of networks can be overcome using more modern approaches. Random walks15, for example, can provide more robust rankings of the importance of nodes and inherently account for network size while not being overly biased by things like connectedness. Generative AI has also done much to improve the interpretability of networks. Graph neural networks can be used to predict the properties of nodes or edges based on learned network structures, which might be used to classify genes as potential drug targets or form the basis for disease classification algorithms16, while large language models (LLMs) have shown promise at interpreting and explaining networks in a human-comprehensible manner through integration with knowledge graphs17, although this approach has not yet demonstrated efficacy in a biological context. With the addition of these new tools to the network analysis toolkit, networks are quickly becoming one of the most promising tools to help understand the complexity of biological systems through multiomic data.
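The random-walk idea can be illustrated with a PageRank-style walk with restart: a walker follows random edges, occasionally teleporting to a random node, and the stationary visit probabilities rank the nodes. This is a simplified sketch on a hand-built toy graph, not the cited biological random walk methods; the topology and damping factor are assumptions:

```python
import numpy as np

# Toy network: two triangles {0,1,2} and {5,6,7} bridged via node 4,
# plus a leaf node 3. Adjacency is symmetric and unweighted.
edges = [(0, 1), (0, 2), (1, 2), (2, 4), (4, 5),
         (5, 6), (5, 7), (6, 7), (1, 3)]
n = 8
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

# Random walk with restart: with probability `damping` the walker follows
# a random edge; otherwise it restarts at a uniformly random node.
damping = 0.85
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
rank = np.full(n, 1.0 / n)
for _ in range(100):                    # power iteration to stationarity
    rank = (1 - damping) / n + damping * (P.T @ rank)
rank /= rank.sum()
```

Unlike raw degree, the stationary distribution reflects how reachable a node is from the whole network, which is why walk-based scores can surface bridging nodes that plain centrality statistics miss.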
Meta Integration: Utilizing Established Knowledge
Another important approach to omics integration is the usage of established knowledge about the relationships between molecular signals to provide context and interpretability to biological data. This method might better be called ‘meta’ integration. One form this can take is multiomic enrichment analysis, such as the approach applied by the multiGSEA algorithm18, where selected molecules (across multiple modalities) are statistically associated with pathways or diseases according to a set of curated categories. As the databases typically involved here are publicly available, such an analysis would be a prime candidate for integration into standardized multiomic platforms.
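The basic statistic behind over-representation-style enrichment can be sketched with a one-sided hypergeometric test using only the standard library. Note that multiGSEA itself uses a rank-based GSEA approach and aggregates across omics layers; this is only the simplest form of the enrichment idea, and the counts below are invented for illustration:

```python
from math import comb

def hypergeom_enrichment(hits_in_pathway, pathway_size, hits_total, universe):
    """One-sided hypergeometric test: probability of seeing at least
    `hits_in_pathway` pathway members among `hits_total` significant
    molecules drawn from a universe of `universe` measured molecules."""
    p = 0.0
    upper = min(pathway_size, hits_total)
    for k in range(hits_in_pathway, upper + 1):
        p += (comb(pathway_size, k)
              * comb(universe - pathway_size, hits_total - k)
              / comb(universe, hits_total))
    return p

# Hypothetical example: 1,000 measured molecules (genes, proteins, and
# metabolites mapped to a shared pathway database), 50 significant hits,
# and a 40-member pathway containing 12 of those hits (vs. ~2 expected).
p_enriched = hypergeom_enrichment(12, 40, 50, 1000)
p_background = hypergeom_enrichment(2, 40, 50, 1000)
```

Mapping molecules from different modalities onto a shared pathway vocabulary before testing is what turns this single-omics statistic into a multiomic one.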
These ‘established knowledge’ based methods are disadvantaged by the fact that not all the relevant information is established in the literature, and in some cases it may never be established with sufficient confidence. Fraudulent or inaccurate relationships may also end up published in the literature. With regard to metabolomics, the nature of mass spectrometry means that some highly context-specific molecules may never be recorded in this system of established relationships, limiting to some degree the utility of these methods in metabolomics approaches that do not solely utilize panels of selected molecules. However, this context-specific knowledge provides an advantage to organizations that can assemble and utilize it. For example, an organization that can assemble a strong knowledge database of the metabolites specific to a large range of bacteria may be able to offer targeted metabolomics as a detection platform for those bacteria, providing a cheaper alternative to metagenomics.
Emerging Trends and Future Directions in Multiomic Integration
Overall, while many approaches to multiomic integration are continuing to emerge, the families of approaches being employed are beginning to settle. For the foreseeable future, multiomic analyses will employ appropriate embedding methods to understand overall data structure, various forms of linear models to relate molecules to each other across modalities, and correlation networks to understand patterns in how molecules are co-expressed across omics data types. The development of new methods to enable understanding and use of multiomic networks is likely to be key in the coming years; however, it is difficult to predict which methods will emerge as the best options in this area.
Aside from exciting developments in multiomic networks, the broad families of approaches to multiomics analysis have largely already been defined. This has created a space for the development of bioinformatics services that can provide access to integrated workflows utilizing these gold standard approaches via established computational infrastructure. This paves the way for a change in how bioinformatics is performed across the industry, creating spaces for developers that focus on integrated platforms that can provide the best solutions to established problems.
Opportunities in Metabolomics and Future Directions
Metabolomics is an emerging field that has the potential to bring significant value to multiomic contexts. Metabolomics is capable of filling gaps that are not accounted for by other omics technologies, providing valuable additional information that can be used to better understand and interpret experiments. Additionally, the fast-changing nature of the metabolome makes it more suited to things like longitudinal experimental designs, where changes in response to a drug or a disease may be tracked over time. Metabolon strives to be a key voice in the handling and interpretation of metabolomics. By leveraging decades of institutional knowledge, Metabolon has gained an understanding of the best practices required to consistently understand and derive value from metabolomics data. Metabolon is uniquely placed to leverage this understanding to help solve the multiomics problems that are the key to developing solid understandings of systems biology.
References
- Freed D, Aldana R, Weber J, Edwards J. The Sentieon Genomics Tools – A fast and accurate solution to variant calling from next-generation sequence data. Published online 2017. doi:10.1101/115717
- Van der Auwera GA, O’Connor BD. Genomics in the Cloud: Using Docker, GATK, and WDL in Terra. 1st ed. O’Reilly Media; 2020.
- Goyal A, Kwon HJ, Lee K, et al. Ultra-Fast Next Generation Human Genome Sequencing Data Processing Using DRAGEN™ Bio-IT Processor for Precision Medicine. Open Journal of Genetics. 2017;7(1):9-19. doi:10.4236/ojgen.2017.71002
- Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology. 2014;15(12):550. doi:10.1186/s13059-014-0550-8
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26(1):139-140. doi:10.1093/bioinformatics/btp616
- Argelaguet R, Velten B, Arnol D, et al. Multi‐Omics Factor Analysis—a framework for unsupervised integration of multi‐omics data sets. Molecular Systems Biology. 2018;14(6):e8124. doi:10.15252/msb.20178124
- Jiang MZ, Aguet F, Ardlie K, et al. Canonical correlation analysis for multi-omics: Application to cross-cohort analysis. PLOS Genetics. 2023;19(5):e1010517. doi:10.1371/journal.pgen.1010517
- Singh A, Shannon CP, Gautier B, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35(17):3055-3062. doi:10.1093/bioinformatics/bty1054
- Gomari DP, Schweickart A, Cerchietti L, et al. Variational autoencoders learn transferrable representations of metabolomics data. Commun Biol. 2022;5(1):1-9. doi:10.1038/s42003-022-03579-3
- Dutta NK, Tornheim JA, Fukutani KF, et al. Integration of metabolomics and transcriptomics reveals novel biomarkers in the blood for tuberculosis diagnosis in children. Sci Rep. 2020;10:19527. doi:10.1038/s41598-020-75513-8
- Nica AC, Dermitzakis ET. Expression quantitative trait loci: present and future. Philosophical Transactions of the Royal Society B: Biological Sciences. 2013;368(1620):20120362. doi:10.1098/rstb.2012.0362
- Wörheide MA, Krumsiek J, Nataf S, et al. An Integrated Molecular Atlas of Alzheimer’s Disease. Published online September 22, 2021:2021.09.14.21263565. doi:10.1101/2021.09.14.21263565
- Wang B, Mezlini AM, Demir F, et al. Similarity network fusion for aggregating data types on a genomic scale. Nat Methods. 2014;11(3):333-337. doi:10.1038/nmeth.2810
- Yang M, Matan-Lithwick S, Wang Y, De Jager PL, Bennett DA, Felsky D. Multi-omic integration via similarity network fusion to detect molecular subtypes of ageing. Brain Communications. 2023;5(2):fcad110. doi:10.1093/braincomms/fcad110
- Gentili M, Martini L, Sponziello M, Becchetti L. Biological Random Walks: multi-omics integration for disease gene prioritization. Bioinformatics. 2022;38(17):4145-4152. doi:10.1093/bioinformatics/btac446
- Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12(1):3445. doi:10.1038/s41467-021-23774-w
- Pan S, Luo L, Wang Y, Chen C, Wang J, Wu X. Unifying Large Language Models and Knowledge Graphs: A Roadmap. IEEE Trans Knowl Data Eng. Published online 2024:1-20. doi:10.1109/TKDE.2024.3352100
- Canzler S, Hackermüller J. multiGSEA: a GSEA-based pathway enrichment analysis for multi-omics data. BMC Bioinformatics. 2020;21(1):561. doi:10.1186/s12859-020-03910-x