Guide to Multiomics
Chapter 2 — Designing a Multiomics Study
In this chapter we provide an overview of some of the key challenges associated with analyzing multiomics datasets and how to design a robust multiomics study with these challenges in mind.
How Research Benefits from a Multiomics Approach
As discussed briefly in the previous chapter, multiomics research studies, by providing a holistic view of an organism or ecosystem, significantly deepen our understanding of biology. By helping to identify the flow of information between omics layers1, they can begin to unravel cause-effect relationships1,2. These characteristics make multiomics analyses particularly powerful for improving disease prediction and prognosis and for facilitating the development of improved therapeutic strategies1.
Large-scale multiomics data repositories and cohorts have unequivocally demonstrated the power of multiomic analysis1. For example, The Cancer Genome Atlas (TCGA), which combines genomics, transcriptomics, epigenomics, and proteomics data, has been instrumental in identifying distinct subtypes of breast cancer3 and in describing multiple pathways that drive ovarian cancer and impact therapeutic strategies4. Newer cohorts integrating metabolomics and/or microbiome data with other omics data have yielded important insights into the genetic control of molecular traits and heritability of the gut microbiome (MuTHER study5) and into genotype-phenotype relationships in cardiometabolic diseases6.
Arriving at this higher level of understanding, however, is a challenge. In multiomics research studies, researchers are no longer tasked with uncovering biological insight from one type of dataset. Instead, they must combine multiple layers of biology, captured with disparate data types and formats that on their own can be challenging to analyze. Multiomics analysis can be a daunting, time-consuming, and, unfortunately, expensive task. Nevertheless, armed with an understanding of the various challenges facing multiomics analysis, researchers can design and execute robust multiomics studies with significant discovery potential.
The Challenges of Multiomics Analysis
There are several key challenges of multiomics analysis, including the vast amount of data produced by modern high-throughput techniques, heterogeneity in the data, missing data points, and the integration of different types of data so that biologically relevant observations can be made1,2,7,8. Additionally, a lack of universal standards, not only among the tools used for integrative analysis but also in validating machine learning and other artificial intelligence approaches to data analysis, can complicate integrative analysis7,8. Each of these challenges is discussed briefly below.
Figure 1. The complexity of multiomics: a combination of omics-driven biology, data science, informatics, statistics, and computational sciences8.
Data volume/complexity
High-throughput sequencing, mass spectrometry, and other techniques have drastically increased the amount of data available to researchers. This phenomenon is a double-edged sword: more data does increase the likelihood of identifying novel associations (particularly rare ones), but it also makes analyzing datasets to identify those associations far more difficult.
Usually, each individual omics dataset requires its own data scaling, normalization, and/or transformation steps, which must be performed prior to integration with other omics data in the study7. In multiomics studies, more samples are also needed to increase power (more detail in the study design section below), so studies comprising thousands of samples are not unusual. Studies at this scale require significant computational and data storage resources2,7.
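To make the idea of per-layer preprocessing concrete, here is a minimal Python sketch, assuming a log2 transform followed by z-scoring is appropriate for both layers. That choice, the function names, and the toy values are all illustrative assumptions; real pipelines select transformations per platform.

```python
import math
from statistics import mean, pstdev

def log2_transform(values, pseudocount=1.0):
    """Log2-transform non-negative counts or intensities; the pseudocount avoids log(0)."""
    return [math.log2(v + pseudocount) for v in values]

def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        # A constant feature carries no variation; return zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sd for v in values]

# Hypothetical toy data: one feature per layer, measured on the same four samples.
rna_counts = [0, 3, 15, 63]                            # raw read counts
metabolite_intensity = [1.0e5, 2.0e5, 4.0e5, 8.0e5]    # raw peak areas

# Each layer is transformed on its own before the layers are compared or combined.
rna_scaled = zscore(log2_transform(rna_counts))
met_scaled = zscore(log2_transform(metabolite_intensity, pseudocount=0.0))
```

Although the raw values differ by several orders of magnitude, both features end up on the same unitless scale (roughly -1.34, -0.45, 0.45, 1.34), which is what makes them comparable downstream.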
Data heterogeneity
Not only are a lot of data points produced in multiomics studies, but each individual omics technique produces different amounts of data and in different formats. For example, an RNA-seq approach can yield thousands of transcripts and their isoforms, while proteomics and metabolomics techniques may produce just a few hundred to a few thousand features8. And because these data are generated using a range of different platforms, data formats and storage requirements also differ significantly1—and must be harmonized prior to analysis. Additionally, inconsistency in sample IDs, lack of standard nomenclature, and other technical inconsistencies can lead to additional discrepancies across different omics datasets, further complicating data integration and analysis2.
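As a small illustration of the sample ID problem described above, the following Python sketch normalizes IDs before intersecting them across layers. The normalization rules (uppercasing, unifying separators) and the example IDs are hypothetical; a real study would derive its rules from its own sample manifest.

```python
import re

def normalize_sample_id(raw_id):
    """Trim, uppercase, and unify separators so 'pt-001 ' and 'PT_001' match."""
    cleaned = raw_id.strip().upper()
    return re.sub(r"[\s\-\.]+", "_", cleaned)

def shared_samples(*layers):
    """Return the sorted normalized IDs that appear in every omics layer."""
    id_sets = [{normalize_sample_id(s) for s in layer} for layer in layers]
    return sorted(set.intersection(*id_sets))

# Hypothetical IDs as they might arrive from two different platforms.
transcriptomics_ids = ["PT_001", "PT_002", "PT_003"]
metabolomics_ids = ["pt-001 ", "pt.003", "pt-004"]

common = shared_samples(transcriptomics_ids, metabolomics_ids)
```

Without normalization the two lists share no IDs at all; after it, two samples can be matched across layers.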
Missing data points
Despite the large amounts of data produced in omics studies, missing data points are a significant issue impacting multiomics analysis. These missing data points occur at the level of the individual omics dataset. For example, the field of genomics has focused mostly on protein-coding regions of the genome, leaving significant gaps in our knowledge of noncoding DNA and of how these regions function during transcription and translation7. Metabolomics and proteomics suffer most significantly from missing data points due to the limitations associated with mass spectrometry, including varying ionization efficiencies, in-source fragmentation, and the presence of numerous isomers, which prevent the confident identification of a significant number of features7. Orthogonal separation techniques have been developed to increase confidence in feature identification; however, a significant amount of “dark matter” exists, particularly in the metabolomics field9.
Single-cell omics techniques, which are still in their infancy and will be discussed in greater detail in Chapter 8 of this guide, also suffer from missing data points7,10-12. For one, the volume of starting material is significantly reduced in single-cell studies, so missing value rates arising from low capture efficiency, technical variation, and/or stochastic gene expression can be as high as 30%7. Along those lines, dropout issues caused by the inability of many samples to fully represent their target population are prevalent in single-cell RNA-sequencing, the main omics technique used in single-cell analysis10. This is a particularly significant problem when targeting lowly expressed or rare genes.
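A common first step when handling missing values is simply to measure them. The sketch below, in Python, computes a per-feature missing rate and drops features above a cutoff. The 30% cutoff, the feature names, and the use of None for a missing value are assumptions made for illustration, not a recommended standard.

```python
def missing_rate(values):
    """Fraction of entries that are None (not detected or not measured)."""
    return sum(v is None for v in values) / len(values)

def filter_features(matrix, max_missing=0.3):
    """Keep only features whose missing rate is at or below max_missing."""
    return {
        feature: values
        for feature, values in matrix.items()
        if missing_rate(values) <= max_missing
    }

# Hypothetical feature-by-sample table with five samples per feature.
features = {
    "gene_A": [5.1, 4.8, None, 5.3, 5.0],    # 1 of 5 missing
    "gene_B": [None, None, 2.2, None, 2.0],  # 3 of 5 missing
    "gene_C": [7.0, 7.2, 6.9, 7.1, 7.3],     # complete
}

kept = filter_features(features)
```

Here gene_B is dropped and the other two features are retained. Whether to filter, impute, or model missingness explicitly depends on why the values are missing, so a cutoff like this is only a starting point.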
Data integration and analysis
Data volume, heterogeneity, and gaps all make integrating various multiomics datasets for analysis challenging. However, many other technical and biological factors also complicate the integration and analysis of multiple omics datasets. Biological variability in and of itself is one of the most important of these factors, as variations in sex, diet, age, and other environmental factors can cause significant molecular fluctuations that can mask true biological signatures7.
Additionally, the relationship between genes, transcripts, proteins, and metabolites is more complex than simple one-to-one relationships2,8,13. ID conversion—the correlation of identities of the same objects across multiple omics layers2,14—is thus not only necessary but difficult. Often, IDs must be mapped to various databases, which may not cover all omics of interest or may have ID inconsistencies between them (e.g. KEGG GENE based on RefSeq can lead to outdated IDs in one database after changes have been made to another)8.
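The one-to-many nature of ID conversion can be sketched in a few lines of Python. The identifiers and the mapping table below are entirely hypothetical placeholders; in practice the pairs would come from a database export, and this sketch does not reproduce any particular database's API.

```python
from collections import defaultdict

# Hypothetical mapping table of (gene ID, protein ID) pairs.
gene_to_protein_pairs = [
    ("GENE1", "PROT1a"),
    ("GENE1", "PROT1b"),  # one gene mapping to two protein entries
    ("GENE2", "PROT2"),
]

def build_mapping(pairs):
    """Collect every target ID for each source ID, preserving one-to-many links."""
    mapping = defaultdict(set)
    for source_id, target_id in pairs:
        mapping[source_id].add(target_id)
    return mapping

def convert_ids(ids, mapping):
    """Return (converted, unmapped) so that IDs without a match are reported, not lost."""
    converted = {i: sorted(mapping[i]) for i in ids if i in mapping}
    unmapped = [i for i in ids if i not in mapping]
    return converted, unmapped

mapping = build_mapping(gene_to_protein_pairs)
converted, unmapped = convert_ids(["GENE1", "GENE2", "GENE3"], mapping)
```

Keeping the unmapped list explicit is a deliberate design choice: IDs that fail to convert, for example because one database has been updated and another has not, are a source of silent data loss if they are simply skipped.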
Although ideally multiple omics techniques would be performed on the exact same set of samples, this isn’t always the case. For example, GWAS and expression data are often collected on different samples2, requiring genetic signatures to be inferred. It may also be impossible to perform all omics techniques on every sample when only rare samples or small sample volumes are available. Multiple layers of inference across multiomics layers can lead to significant noise and prevent robust data analysis and interpretation.
Designing a Robust Multiomics Study
Despite the challenges associated with multiomics research, it is possible to execute a successful research study. Additionally, as analysis techniques are continually developed and improved, existing datasets can be, and are likely to be, reanalyzed later with advanced techniques15.
There are several considerations to keep in mind when designing a multiomics study, which we break down below:
The scientific question
The most important consideration when designing a multiomics study is your scientific question. Complex diseases or environmental perturbations, for example, will necessarily require more omics approaches applied to the exact same samples, data collection at multiple time points, and samples collected from different locations2. In cases where a reliable animal model exists, researchers may want to use an animal model rather than collect human samples to minimize sources of noise and reduce the number of samples needed2.
Sources of variation
It is critical for researchers to identify sources of variation and minimize them as much as possible. These include both biological and technological sources, such as batch effects, missing data points, data heterogeneity, and analytical variation2,7,8. It’s important to identify and address these sources at all steps, from sample processing to data acquisition and analysis15-18.
Technical variation and missing data points can be minimized by being familiar with the limitations of each individual omics technique that you are using in your study19 and knowing how to address them. For example, a tiered system of metabolite identification confidence has been developed, and choosing a vendor that provides Level 1 and 2 metabolite identifications can ensure the highest quality metabolomics data7. Metabolon has the largest Level 1 metabolite database available and has worked with hundreds of customers and collaborators on a wide range of projects, including several multiomics research studies.
The data itself also must be manipulated prior to analysis and this should be part of your study design. Being aware of the different omics data outputs and input file format requirements for the various analysis tools will enable you to adequately prepare for data transformation, mapping, filtering, normalization, removal of batch effects, and quality checks1,2,7.
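As one simple illustration of batch-effect removal, the Python sketch below subtracts each batch's mean from its samples. Per-batch mean centering is an assumption chosen for brevity, not a method named in this guide, and the values and batch labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def center_by_batch(values, batches):
    """Subtract each batch's mean from its samples so batches share a common center."""
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}
    return [value - batch_means[batch] for value, batch in zip(values, batches)]

# Hypothetical measurements of one feature; batch B reads about 10 units higher than batch A.
values = [10.0, 12.0, 20.0, 22.0]
batches = ["A", "A", "B", "B"]

corrected = center_by_batch(values, batches)
```

After centering, the within-batch differences are preserved while the offset between batches is gone. This only works safely when biological groups are balanced across batches; if all cases sit in one batch and all controls in another, centering removes the biology along with the batch effect, which is why batch layout belongs in the study design.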
Sample size and power
As with any scientific study, multiomics research studies must be adequately powered, and power is strongly affected by background noise, effect size, and sample size2. Tarazona and colleagues have described a method for estimating optimal sample size for multiomics experiments and built an open-source tool called MultiPower1 that researchers can use to perform power and sample size estimations for their multiomics study designs19.
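To show how effect size and sample size trade off, here is a rough Python sketch using the textbook normal approximation for a two-group comparison of a single feature, n per group = 2 * ((z for alpha/2 + z for power) / d)^2, where d is the standardized effect size. This is not the MultiPower method and does not account for multiple omics layers; the function name and defaults are assumptions.

```python
import math
from statistics import NormalDist

def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate samples per group for a two-sided, two-group comparison.

    effect_size is the standardized difference in means (Cohen's d).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

n_medium = samples_per_group(0.5)  # moderate effect
n_large = samples_per_group(1.0)   # large effect

# Testing many features requires a stricter threshold; a Bonferroni-style
# correction for 1,000 features (an illustrative number) raises the requirement.
n_corrected = samples_per_group(0.5, alpha=0.05 / 1000)
```

Under this approximation a moderate effect (d = 0.5) needs 63 samples per group and a large effect (d = 1.0) needs 16, and the requirement grows again once the significance threshold is tightened for many features. Noisier layers reduce the standardized effect size, which is one reason the layer with the weakest signal tends to drive the sample size of the whole study.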
Advanced analytical techniques
Advanced statistical methods and artificial intelligence/machine learning techniques are necessary to accurately analyze multiomics datasets. There is a wide variety of tools available to help researchers analyze their data. Researchers should familiarize themselves with these tools, or with data scientists who can help them analyze their data, before beginning their study1,7,8. Tool choice may impact all other aspects of study design, so analysis tools and techniques should be selected during study design.
Conclusions
Multiomics research studies are a powerful way to gain a holistic understanding of biology and the world around us. Unlike individual omics studies, they can help us identify cause-effect relationships. Analyzing multiple complex datasets as one whole is no easy task; however, a variety of tools, techniques, and experts are available to help you design and execute a robust multiomics study with significant discovery potential.
References
- Subramanian I, Verma S, Kumar S, et al. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights. 2020;14: 1177932219899051. doi: 10.1177/1177932219899051
- Hasin Y, Seldin M, and Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18(1):83. doi: 10.1186/s13059-017-1215-1
- Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61-70. doi: 10.1038/nature11412
- Zhang H, Liu T, Zhang Z, et al. Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell. 2016;166(3):755-765. doi: 10.1016/j.cell.2016.05.069
- Nica AC, Parts L, Glass D, et al. The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS Genet. 2011;7(2):e1002003. doi: 10.1371/journal.pgen.1002003
- Laakso M, Kuusisto J, Stancakova A, et al. The metabolic syndrome in men study: a resource for studies of metabolic and cardiovascular diseases. J Lipid Res. 2017;58(3):481–93. doi: 10.1194/jlr.O072629
- Odenkirk MT, Reif DM, and Baker ES. Multiomic Big Data Analysis Challenges: Increasing Confidence in the Interpretation of Artificial Intelligence Assessments. Anal Chem. 2021;93(22):7763–7773. doi: 10.1021/acs.analchem.0c04850
- Krassowski M, Das V, Sahu SK, et al. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front Genet. 2020;11: 610798. doi: 10.3389/fgene.2020.610798
- da Silva RR, Dorrestein PC, and Quinn RA. Illuminating the dark matter in metabolomics. Proc Natl Acad Sci U S A. 2015;112(41):12549–12550. doi: 10.1073/pnas.1516878112
- Ma A, McDermaid A, Xu J, et al. Integrative Methods and Practical Challenges for Single-cell Multi-omics. Trends Biotechnol. 2020;38(9):1007–1022. doi: 10.1016/j.tibtech.2020.02.013
- Yang MC, Weissman SM, Yang W, et al. MISC: missing imputation for single-cell RNA sequencing data. BMC Syst Biol. 2018;12(Suppl 7):114. doi: 10.1186/s12918-018-0638-y
- Hicks SC, Townes FW, Teng M, et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19(4):562-578. doi: 10.1093/biostatistics/kxx053
- Collins FS, Green ED, Guttmacher AE, et al. A vision for the future of genomics research. Nature. 2003;422(6934):835–47. doi: 10.1038/nature01626
- Yugi K, Kubota H, Hatano A, et al. Trans-Omics: how to reconstruct biochemical networks across multiple ‘omic’ layers. Trends Biotechnol. 2016;34:276–90. doi: 10.1016/j.tibtech.2015.12.013
- Gilad Y and Mizrahi-Man O. A reanalysis of mouse ENCODE comparative gene expression data. F1000Res. 2015;4:121. doi: 10.12688/f1000research.6536.1
- Peixoto L, Risso D, Poplawski SG, et al. How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets. Nucleic Acids Res. 2015;43(16):7664–74. doi: 10.1093/nar/gkv736
- SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903–14. doi: 10.1038/nbt.2957
- Hartley SW and Mullikin JC. QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments. BMC Bioinformatics. 2015;16:224. doi: 10.1186/s12859-015-0670-5
- Tarazona S, Balzano-Nogueira L, Gómez-Cabrero D, et al. Harmonization of quality metrics and power calculation in multi-omic studies. Nat Commun. 2020;11:3092. doi: 10.1038/s41467-020-16937-8