Guide to Multiomics

Chapter 2 — Designing a Multiomics Study

In this chapter we provide an overview of some of the key challenges associated with analyzing multiomics datasets and how to design a robust multiomics study with these challenges in mind.

How Research Benefits from a Multiomics Approach

As discussed briefly in the previous chapter, multiomics research studies provide a holistic view of an organism or ecosystem and thereby significantly deepen our understanding of biology. By helping to identify the flow of information between omics layers¹, they can begin to unravel cause-effect relationships¹,². These characteristics make multiomics analyses particularly powerful for improving disease prediction and prognosis and for facilitating the development of improved therapeutic strategies¹.

Large-scale multiomics data repositories and cohorts have demonstrated the power of multiomic analysis¹. For example, The Cancer Genome Atlas (TCGA), which combines genomics, transcriptomics, epigenomics, and proteomics data, has been instrumental in identifying distinct subtypes of breast cancer³ and in describing multiple pathways that drive ovarian cancer and impact therapeutic strategies⁴. Newer cohorts integrating metabolomics and/or microbiome data with other omics data have yielded important insights into the genetic control of molecular traits and heritability of the gut microbiome (MuTHER study⁵) and into genotype-phenotype relationships in cardiometabolic diseases⁶.

Arriving at this higher level of understanding, however, is a challenge. In multiomics research studies, researchers are no longer tasked with uncovering biological insight from one type of dataset. Instead, they must combine multiple layers of biology, captured with disparate data types and formats that on their own can be challenging to analyze. Multiomics analysis can be a daunting, time-consuming, and, unfortunately, expensive task. Nevertheless, armed with an understanding of the various challenges facing multiomics analysis, researchers can design and execute robust multiomics studies with significant discovery potential.

The Challenges of Multiomics Analysis

There are several key challenges of multiomics analysis, including the vast amount of data produced by modern high-throughput techniques, heterogeneity in the data, missing data points, and the integration of different types of data so that biologically relevant observations can be made¹,²,⁷,⁸. Additionally, a lack of universal standards, both for the tools used in integrative analysis and for validating machine learning and other artificial intelligence approaches to data analysis, can complicate integrative analysis⁷,⁸. Each of these challenges is discussed briefly below.

Figure 1. The complexity of multiomics: a combination of omics-driven biology, data science, informatics, statistics, and computational sciences⁸.

Data volume/complexity

High-throughput sequencing, mass spectrometry, and other techniques have drastically increased the amount of data available to researchers. This phenomenon is a double-edged sword: more data does increase the likelihood of identifying novel associations (particularly rare ones), but it also makes analyzing datasets to identify those associations far more difficult.

Usually, each individual omics dataset requires its own data scaling, normalization, and/or transformation steps, which must be performed prior to integration with other omics data in the study⁷. In multiomics studies, more samples are also needed to increase power (more detail in the study design section below), so studies comprising thousands of samples are not uncommon. This requires significant computational and data storage resources²,⁷.
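To make the idea of per-layer preprocessing concrete, here is a minimal Python sketch. It assumes, purely for illustration, that count-like data (such as RNA-seq) is log2-transformed before z-scoring while an already roughly symmetric layer is only z-scored. The function names, the pseudocount, and the toy values are all hypothetical, and real pipelines typically rely on modality-specific normalization methods rather than this simple scheme.

```python
import math
from statistics import mean, pstdev


def log2_transform(values, pseudocount=1.0):
    """Log2-transform non-negative values; the pseudocount avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]


def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def normalize_layer(matrix, log_transform):
    """Normalize one omics layer given as {feature name: per-sample values}."""
    normalized = {}
    for feature, values in matrix.items():
        if log_transform:
            values = log2_transform(values)
        normalized[feature] = zscore(values)
    return normalized


# Hypothetical toy data: three samples per feature.
rna_counts = {"GENE_A": [10, 100, 1000], "GENE_B": [5, 5, 5]}
metabolite_levels = {"MET_X": [0.8, 1.0, 1.2]}

rna_norm = normalize_layer(rna_counts, log_transform=True)
met_norm = normalize_layer(metabolite_levels, log_transform=False)
```

After this step, every feature in both layers is on a comparable scale, which is the precondition for most integration methods. Which transformation suits which layer remains a study-specific decision.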

Data heterogeneity

Not only are a lot of data points produced in multiomics studies, but each individual omics technique produces different amounts of data and in different formats. For example, an RNA-seq approach can yield thousands of transcripts and their isoforms, while proteomics and metabolomics techniques may produce just a few hundred to a few thousand features⁸. And because these data are generated using a range of different platforms, data formats and storage requirements also differ significantly¹—and must be harmonized prior to analysis. Additionally, inconsistency in sample IDs, lack of standard nomenclature, and other technical inconsistencies can lead to additional discrepancies across different omics datasets, further complicating data integration and analysis².
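As an illustration of the sample ID problem, the following Python sketch harmonizes IDs across two layers and reports which samples are present in both. The cleaning rule (uppercase, trim, strip separators) and the example IDs are assumptions made for this sketch. A rule this aggressive can merge IDs that are genuinely different, so any real mapping should be checked against the study's sample manifest.

```python
import re


def canonical_id(raw_id):
    """Uppercase, trim, and strip separators so 'pt-001 ' and 'PT_001' match."""
    return re.sub(r"[\s_\-.]+", "", raw_id.strip().upper())


def shared_samples(*layers):
    """Return canonical IDs found in every layer, plus per-layer leftovers."""
    canonical_sets = [{canonical_id(s) for s in layer} for layer in layers]
    common = set.intersection(*canonical_sets)
    dropped = [sorted(ids - common) for ids in canonical_sets]
    return sorted(common), dropped


# Hypothetical sample IDs as they might appear in two vendor exports.
rna_ids = ["PT_001", "PT_002", "PT_003"]
prot_ids = ["pt-001", "pt-003 ", "pt-004"]

common, dropped = shared_samples(rna_ids, prot_ids)
print("Shared samples:", common)
print("Only in one layer:", dropped)
```

Reporting the dropped IDs alongside the shared ones makes it easier to tell a true missing sample from a naming mismatch.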

Missing data points

Despite the large amounts of data produced in omics studies, missing data points are a significant issue impacting multiomics analysis. These missing data points occur at the level of the individual omics dataset. For example, the field of genomics has focused mostly on protein-coding regions of the genome, leaving significant gaps in our understanding of noncoding DNA and of the roles these regions play in transcription and translation⁷. Metabolomics and proteomics suffer most significantly from missing data points due to the limitations associated with mass spectrometry, including varying ionization efficiencies, in-source fragmentation, and the presence of numerous isomers, which prevent the confident identification of a significant number of features⁷. Orthogonal separation techniques have been developed to increase confidence in feature identification; however, a significant amount of “dark matter” exists, particularly in the metabolomics field⁹.

Single-cell omics techniques, which are still in their infancy and will be discussed in greater detail in Chapter 8 of this guide, also suffer from missing data points⁷,¹⁰⁻¹². For one, the amount of starting material is significantly reduced in single-cell studies, and missing value rates arising from low capture efficiency, technical variation, and/or stochastic gene expression can be as high as 30%⁷. Along those lines, dropout issues, caused by the inability of many samples to fully represent their target population, are prevalent in single-cell RNA-sequencing, the main omics technique used in single-cell analysis¹⁰. This is a particularly significant problem when targeting lowly expressed or rare genes.
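A common first step in dealing with missing data is simply to measure it. The Python sketch below computes the fraction of missing values per feature and separates features above a cutoff. The use of None to mark a missing value, the 0.5 cutoff, and the feature names are assumptions chosen for illustration; an appropriate cutoff, and whether to filter or impute, depends on the omics layer and the study.

```python
def missing_rate(values):
    """Fraction of entries that are None (our marker for a missing value)."""
    return sum(v is None for v in values) / len(values)


def filter_features(matrix, max_missing=0.5):
    """Split features into kept and removed, recording each missing rate."""
    kept, removed = {}, {}
    for feature, values in matrix.items():
        rate = missing_rate(values)
        if rate <= max_missing:
            kept[feature] = rate
        else:
            removed[feature] = rate
    return kept, removed


# Hypothetical metabolite table: four samples per feature.
toy = {
    "MET_A": [1.2, None, 0.9, 1.1],
    "MET_B": [None, None, None, 2.0],
    "MET_C": [1.0, 2.0, 3.0, 4.0],
}

kept, removed = filter_features(toy, max_missing=0.5)
print("Kept:", kept)
print("Removed:", removed)
```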

Data integration and analysis

Data volume, heterogeneity, and gaps all make integrating various multiomics datasets for analysis challenging. However, many other technical and biological factors also complicate the integration and analysis of multiple omics datasets. Biological variability is one of the most important of these factors, as variations in sex, diet, age, and other environmental factors can cause significant molecular fluctuations that can mask true biological signatures⁷.

Additionally, the relationships among genes, transcripts, proteins, and metabolites are more complex than simple one-to-one mappings²,⁸,¹³. ID conversion, the matching of identities of the same objects across multiple omics layers²,¹⁴, is thus both necessary and difficult. Often, IDs must be mapped to various databases, which may not cover all omics of interest or may have ID inconsistencies between them (e.g., because KEGG GENE is based on RefSeq, changes made in one database can leave outdated IDs in the other)⁸.
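The sketch below shows, in Python, one way to make the one-to-many nature of ID conversion visible instead of silently picking a single match. The identifiers are invented placeholders, not real database accessions, and the mapping table stands in for whatever cross-reference file a given database provides.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Collect every target ID seen for each source ID."""
    mapping = defaultdict(set)
    for source_id, target_id in pairs:
        mapping[source_id].add(target_id)
    return mapping


def classify_ids(query_ids, mapping):
    """Sort query IDs into unique, ambiguous (one-to-many), and unmapped."""
    report = {"unique": {}, "ambiguous": {}, "unmapped": []}
    for query_id in query_ids:
        targets = mapping.get(query_id, set())
        if len(targets) == 1:
            report["unique"][query_id] = next(iter(targets))
        elif targets:
            report["ambiguous"][query_id] = sorted(targets)
        else:
            report["unmapped"].append(query_id)
    return report


# Hypothetical gene-to-protein cross-references.
pairs = [
    ("GENE_1", "PROT_1A"),
    ("GENE_1", "PROT_1B"),
    ("GENE_2", "PROT_2"),
]
report = classify_ids(["GENE_1", "GENE_2", "GENE_3"], build_mapping(pairs))
print(report)
```

Keeping ambiguous and unmapped IDs in separate buckets lets the analyst decide how to handle them, rather than losing features without noticing.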

Although ideally multiple omics techniques would be performed on the exact same set of samples, this isn’t always the case. For example, GWAS and expression data are often collected on different samples², requiring genetic signatures to be inferred. It may also be impossible to perform all omics techniques on every sample when only rare samples or small sample volumes are available. Multiple layers of inference across multiomics layers can lead to significant noise and prevent robust data analysis and interpretation.

Designing a Robust Multiomics Study

Despite the challenges associated with multiomics research, it is possible to execute a successful research study. Additionally, as analysis techniques are continually developed and improved, existing datasets can be, and are likely to be, reanalyzed later with advanced techniques¹⁵.

There are several considerations to keep in mind when designing a multiomics study, which we break down below:

The scientific question

The most important consideration when designing a multiomics study is your scientific question. Complex diseases or environmental perturbations, for example, will necessarily require more omics approaches applied to the exact same samples, data collection at multiple time points, and samples collected from different locations². In cases where a reliable animal model exists, researchers may want to use it rather than collect human samples, both to minimize sources of noise and to reduce the number of samples needed².

Sources of variation

It is critical for researchers to identify sources of variation and minimize them as much as possible. These include both biological and technical sources, such as batch effects, missing data points, data heterogeneity, and analytical variation²,⁷,⁸. It’s important to identify and address these sources at all steps, from sample processing to data acquisition and analysis¹⁵⁻¹⁸.

Technical variation and missing data points can be minimized by being familiar with the limitations of each individual omics technique that you are using in your study¹⁹ and knowing how to address them. For example, a tiered system of metabolite identification confidence has been developed, and choosing a vendor that provides Level 1 and 2 metabolite identifications can ensure the highest quality metabolomics data⁷. Metabolon has the largest Level 1 metabolite database available and has worked with hundreds of customers and collaborators on a wide range of projects, including several multiomics research studies.

The data must also be processed prior to analysis, and this processing should be part of your study design. Being aware of the different omics data outputs and of the input file format requirements for the various analysis tools will enable you to adequately prepare for data transformation, mapping, filtering, normalization, removal of batch effects, and quality checks¹,²,⁷.
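To show what removing a batch effect can mean in its simplest form, the Python sketch below subtracts each batch's mean from its samples and adds back the overall mean. This per-batch mean centering is an assumption chosen for brevity and is not the method of any particular tool; dedicated batch-correction methods model the data more carefully. It is also only safe when batches are balanced with respect to the biological groups, because otherwise it removes real signal along with the batch shift.

```python
from collections import defaultdict
from statistics import mean


def center_by_batch(values, batches):
    """Remove per-batch mean shifts for one feature, keeping the grand mean."""
    by_batch = defaultdict(list)
    for value, batch in zip(values, batches):
        by_batch[batch].append(value)
    batch_means = {batch: mean(vals) for batch, vals in by_batch.items()}
    grand_mean = mean(values)
    return [
        value - batch_means[batch] + grand_mean
        for value, batch in zip(values, batches)
    ]


# Hypothetical feature measured in two batches; batch B reads about 10 higher.
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
batches = ["A", "A", "A", "B", "B", "B"]

corrected = center_by_batch(values, batches)
print(corrected)
```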

Sample size and power

As with any scientific study, multiomics research studies must be adequately powered, and power is strongly affected by background noise, effect size, and sample size². Tarazona and colleagues have described a method for estimating optimal sample size for multiomics experiments and built an open-source tool called MultiPower¹ that researchers can use to perform power and sample size estimations for their multiomics study designs¹⁹.
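The relationship among effect size, significance level, power, and sample size can be illustrated with a textbook calculation. The Python sketch below uses the standard normal approximation for a two-sided, two-group comparison of means for a single feature, n = 2 × ((z₁₋α/₂ + z_power) / d)², where d is the standardized effect size. It is not the MultiPower method, which accounts for the properties of each omics layer; the function name and default values are assumptions for this example.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, alpha=0.05, power=0.8):
    """Approximate samples per group for a two-sided, two-sample comparison.

    effect_size is the standardized mean difference (Cohen's d).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# One feature tested at the conventional 5% level.
n_single = samples_per_group(0.5)

# The same effect with a Bonferroni-adjusted level for 1,000 tested features,
# showing how multiple testing raises the required sample size.
n_adjusted = samples_per_group(0.5, alpha=0.05 / 1000)

print(n_single, n_adjusted)
```

Because omics studies test many features at once, the adjusted calculation is usually the more relevant one, and it helps explain why multiomics studies often need far more samples than a single-endpoint experiment.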

Advanced analytical techniques

Advanced statistical methods and artificial intelligence/machine learning techniques are necessary to accurately analyze multiomics datasets. There is a wide variety of tools available to help researchers analyze their data. Researchers should familiarize themselves with these tools, or work with data scientists who can help them analyze their data, before beginning their study¹,⁷,⁸. Tool choice may impact all other aspects of study design, so analysis tools and techniques should be selected during study design.

Conclusions

Multiomics research studies are a powerful way to gain a holistic understanding of biology and the world around us. Unlike individual omics studies, they can help us identify cause-effect relationships. Analyzing multiple complex datasets as one whole is no easy task; however, a variety of tools, techniques, and experts are available to help you design and execute a robust multiomics study with significant discovery potential.

References

1. Subramanian I, Verma S, Kumar S, et al. Multi-omics Data Integration, Interpretation, and Its Application. Bioinform Biol Insights. 2020;14:1177932219899051. doi: 10.1177/1177932219899051
2. Hasin Y, Seldin M, and Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18(1):83. doi: 10.1186/s13059-017-1215-1
3. Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61-70. doi: 10.1038/nature11412
4. Zhang H, Liu T, Zhang Z, et al. Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell. 2016;166(3):755-765. doi: 10.1016/j.cell.2016.05.069
5. Nica AC, Parts L, Glass D, et al. The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS Genet. 2011;7(2):e1002003. doi: 10.1371/journal.pgen.1002003
6. Laakso M, Kuusisto J, Stancakova A, et al. The metabolic syndrome in men study: a resource for studies of metabolic and cardiovascular diseases. J Lipid Res. 2017;58(3):481-93. doi: 10.1194/jlr.O072629
7. Odenkirk MT, Reif DM, and Baker ES. Multiomic Big Data Analysis Challenges: Increasing Confidence in the Interpretation of Artificial Intelligence Assessments. Anal Chem. 2021;93(22):7763-7773. doi: 10.1021/acs.analchem.0c04850
8. Krassowski M, Das V, Sahu SK, et al. State of the Field in Multi-Omics Research: From Computational Needs to Data Mining and Sharing. Front Genet. 2020;11:610798. doi: 10.3389/fgene.2020.610798
9. da Silva RR, Dorrestein PC, and Quinn RA. Illuminating the dark matter in metabolomics. Proc Natl Acad Sci U S A. 2015;112(41):12549-12550. doi: 10.1073/pnas.1516878112
10. Ma A, McDermaid A, Xu J, et al. Integrative Methods and Practical Challenges for Single-cell Multi-omics. Trends Biotechnol. 2020;38(9):1007-1022. doi: 10.1016/j.tibtech.2020.02.013
11. Yang MC, Weissman SM, Yang W, et al. MISC: missing imputation for single-cell RNA sequencing data. BMC Syst Biol. 2018;12(Suppl 7):114. doi: 10.1186/s12918-018-0638-y
12. Hicks SC, Townes FW, Teng M, et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics. 2018;19(4):562-578. doi: 10.1093/biostatistics/kxx053
13. Collins FS, Green ED, Guttmacher AE, et al. A vision for the future of genomics research. Nature. 2003;422(6934):835-47. doi: 10.1038/nature01626
14. Yugi K, Kubota H, Hatano A, et al. Trans-Omics: how to reconstruct biochemical networks across multiple ‘omic’ layers. Trends Biotechnol. 2016;34:276-90. doi: 10.1016/j.tibtech.2015.12.013
15. Gilad Y and Mizrahi-Man O. A reanalysis of mouse ENCODE comparative gene expression data. F1000Res. 2015;4:121. doi: 10.12688/f1000research.6536.1
16. Peixoto L, Risso D, Poplawski SG, et al. How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets. Nucleic Acids Res. 2015;43(16):7664-74. doi: 10.1093/nar/gkv736
17. SEQC/MAQC-III Consortium. A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol. 2014;32(9):903-14. doi: 10.1038/nbt.2957
18. Hartley SW and Mullikin JC. QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments. BMC Bioinformatics. 2015;16:224. doi: 10.1186/s12859-015-0670-5
19. Tarazona S, Balzano-Nogueira L, Gómez-Cabrero D, et al. Harmonization of quality metrics and power calculation in multi-omic studies. Nat Commun. 2020;11:3092. doi: 10.1038/s41467-020-16937-8
