Review
2019 Jul 12;9(7):143.
doi: 10.3390/metabo9070143.

Statistical Workflow for Feature Selection in Human Metabolomics Data

Joseph Antonelli et al. Metabolites.

Abstract

High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation, and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for more standardization of, as well as advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.

Keywords: high-dimensional data; large-scale metabolomics; statistical methods.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Metabolite data transformation and centering. A frequently used approach for managing metabolite data collected in a large human cohort study involves log-transforming each metabolite measure and centering the data on the plate median to account for batch-to-batch variation. Interestingly, variable transformation can reveal multi-modal distributions.
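The transformation described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function name `log_center_by_plate`, the natural-log choice, and the example intensities and plate labels are all assumptions made for the sketch.

```python
import math
from collections import defaultdict
from statistics import median


def log_center_by_plate(values, plates):
    """Log-transform raw intensities, then subtract each plate's median log value.

    Hypothetical helper for illustration; assumes strictly positive intensities.
    """
    if len(values) != len(plates):
        raise ValueError("values and plates must have the same length")
    logged = [math.log(v) for v in values]

    # Group the log values by plate so each plate gets its own median.
    by_plate = defaultdict(list)
    for x, p in zip(logged, plates):
        by_plate[p].append(x)
    plate_median = {p: median(xs) for p, xs in by_plate.items()}

    # Centering removes a multiplicative plate-level shift in the raw data.
    return [x - plate_median[p] for x, p in zip(logged, plates)]


# Made-up intensities for one metabolite on two plates; plate "B" reads
# twice as high as plate "A", standing in for a batch effect.
raw = [100.0, 200.0, 400.0, 200.0, 400.0, 800.0]
plate = ["A", "A", "A", "B", "B", "B"]
centered = log_center_by_plate(raw, plate)
```

After centering, samples at the same relative position within their plate receive the same value, so the plate-level shift no longer appears in the data.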
Figure 2
Actual and simulated metabolomics data. Previously analyzed data, or prior detailed knowledge of the structure of metabolomics data collected from an existing large epidemiologic cohort study (a), can be used to construct simulated data that mimic the data structure observed from real measures (b). These simulated data can be used to estimate statistical power, based on one or more methods of analysis, for planning the design of a future study.
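The idea in the Figure 2 caption, estimating power from simulated data, can be illustrated with a deliberately simplified sketch. It assumes a single log-scale metabolite that is normally distributed with unit variance in two equal-sized groups, a two-sample z-test (normal approximation), and a Bonferroni threshold for `n_tests` metabolites. It does not reproduce the correlation structure of real metabolomics data, and the function name and all parameter values are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def simulated_power(n_per_group, effect, n_tests, n_sims=500, alpha=0.05, seed=0):
    """Estimate power for one metabolite by repeated simulation.

    Assumptions (illustrative only): unit-variance normal data, a mean shift
    of `effect` in cases, and a two-sided z-test at a Bonferroni-adjusted level.
    """
    rng = random.Random(seed)
    # Two-sided critical value after Bonferroni correction for n_tests features.
    z_crit = NormalDist().inv_cdf(1 - alpha / (2 * n_tests))
    hits = 0
    for _ in range(n_sims):
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        cases = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        # Standard error of the difference in means for equal group sizes.
        se = ((variance(controls) + variance(cases)) / n_per_group) ** 0.5
        z = (mean(cases) - mean(controls)) / se
        if abs(z) > z_crit:
            hits += 1
    # Fraction of simulated studies in which the effect was detected.
    return hits / n_sims


# Compare two candidate study sizes for the same assumed effect.
power_small = simulated_power(n_per_group=50, effect=0.5, n_tests=100)
power_large = simulated_power(n_per_group=200, effect=0.5, n_tests=100)
power_null = simulated_power(n_per_group=200, effect=0.0, n_tests=100)
```

In practice the simulated data would be generated to match the distributions and correlations observed in an existing cohort, and the test would be replaced by whichever analysis method is planned for the future study.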
Figure 3
Using multiple statistical methods to evaluate results in a real-life application involving analyses of large cohort metabolite data. We related a panel of bioactive lipid molecule metabolites (i.e., eicosanoids) to putative derivative substrates (i.e., eicosapentaenoic acid (EPA) and docosahexaenoic acid (DHA)), which were considered in all analyses as the outcomes of clinical relevance and interest. We used multiple different statistical methods and compared results. Metabolites are denoted by mass-to-charge (m/z) ratio and retention time (rt, in minutes) using the m/z_rt convention, and are listed in rank order for each outcome (EPA or DHA) according to performance metrics provided by each model.
