Microbiome Datasets Are Compositional: And This Is Not Optional

Gregory B Gloor¹, Jean M Macklaim¹, Vera Pawlowsky-Glahn², Juan J Egozcue³

Affiliations

¹ Department of Biochemistry, University of Western Ontario, London, ON, Canada.
² Departments of Computer Science, Applied Mathematics, and Statistics, Universitat de Girona, Girona, Spain.
³ Department of Applied Mathematics, Universitat Politècnica de Catalunya, Barcelona, Spain.

PMID: 29187837
PMCID: PMC5695134
DOI: 10.3389/fmicb.2017.02224

Review

Front Microbiol. 2017 Nov 15:8:2224. doi: 10.3389/fmicb.2017.02224. eCollection 2017.

Abstract

Datasets collected by high-throughput sequencing (HTS) of 16S rRNA gene amplimers, metagenomes, or metatranscriptomes are commonplace and are used to study human disease states, ecological differences between sites, and the built environment. There is increasing awareness that microbiome datasets generated by HTS are compositional because they have an arbitrary total imposed by the instrument. However, many investigators are either unaware of this or assume specific properties of the compositional data. The purpose of this review is to alert investigators to the dangers inherent in ignoring the compositional nature of the data, and to point out that HTS datasets derived from microbiome studies can and should be treated as compositions at all stages of analysis. We briefly introduce compositional data, illustrate the pathologies that occur when compositional data are analyzed inappropriately, and finally give guidance and point to resources and examples for the analysis of microbiome datasets using compositional data analysis.

Keywords: Bayesian estimation; compositional data; correlation; count normalization; high-throughput sequencing; microbiota; relative abundance.

Figures

**Figure 1**
High-throughput sequencing data are compositional. **(A)** illustrates that the data observed after sequencing a set of nucleic acids from a bacterial population cannot inform on the absolute abundance of molecules. The number of counts in a high-throughput sequencing (HTS) dataset reflects the proportion of counts per feature (OTU, gene, etc.) per sample, multiplied by the sequencing depth. Therefore, only the relative abundances are available. The bar plots in **(B)** show the difference between the count of molecules and the proportion of molecules for two features, A (red) and B (gray), in three samples. The top bar graphs show the total counts for the three samples, and the height of each colored segment shows the absolute count of that feature. When the three samples are sequenced, we lose the absolute count information and have only relative abundances, proportions, or “normalized counts”, as shown in the bottom bar graph. Note that features A and B in samples 2 and 3 appear with the same relative abundances, even though the counts in the environment are different. The table in **(C)** shows the real and perceived changes for each sample if we transition from one sample to another.
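The point made in panel (B) can be sketched in a few lines of Python. This is only an illustration: the absolute counts, the sequencing depth, and the helper name `close` are hypothetical and are not taken from the figure. It shows two samples whose absolute counts differ but whose proportions, and therefore expected read counts at a fixed depth, are the same.

```python
import math

def close(counts):
    """Divide each count by the sample total so the parts sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute molecule counts for features A and B in the environment.
sample_2 = [200, 100]   # total of 300 molecules
sample_3 = [100, 50]    # total of 150 molecules: half as many of each feature

props_2 = close(sample_2)
props_3 = close(sample_3)

# Assumed sequencing depth imposed by the instrument (arbitrary value).
depth = 10_000
expected_reads_2 = [p * depth for p in props_2]
expected_reads_3 = [p * depth for p in props_3]

# The proportions agree even though the absolute counts differ twofold.
same = all(math.isclose(a, b) for a, b in zip(props_2, props_3))
print(props_2, props_3, same)
```

Because the expected reads depend only on the proportions and the chosen depth, the twofold difference in absolute abundance between the two hypothetical samples cannot be recovered from the sequencing output alone.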

**Figure 2**
The standard microbiome analysis tool kit and the compositional replacements. A simplified standard microbiome computational workflow is illustrated. The initial normalization steps are not formally equivalent since compositional data are inherently “normalized”, and read count normalization is unnecessary. The other steps are functionally equivalent and substitute a compositionally appropriate approach for one that is not.
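As a rough sketch of why read count normalization becomes unnecessary under a compositional approach, the Python below applies the centered log-ratio (clr) transform, one commonly used log-ratio transform from compositional data analysis, to the same hypothetical sample at two sequencing depths. The caption does not name specific replacement methods, so the choice of clr and of the Aitchison distance here is an assumption made for illustration, as are the example counts and function names.

```python
import math

def clr(counts):
    """Centered log-ratio: log of each part minus the mean log of all parts.

    Counts must be strictly positive; zero handling is out of scope here.
    """
    logs = [math.log(c) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr-transformed samples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

# Hypothetical read counts: the same composition at 10x different depth.
shallow = [10, 20, 70]
deep = [100, 200, 700]

clr_shallow = clr(shallow)
clr_deep = clr(deep)
distance = aitchison_distance(shallow, deep)

print(clr_shallow)
print(clr_deep)
print(distance)
```

Multiplying every count by a constant shifts each log and the mean log by the same amount, so the clr values are unchanged up to floating-point rounding and the distance between the two depths is effectively zero.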
