. 2022 Apr 26;7(2):e0016722.

doi: 10.1128/msystems.00167-22. Epub 2022 Apr 4.

Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

Qiyun Zhu^#^{1

2

3}, Shi Huang^#^{3

4

5}, Antonio Gonzalez³, Imran McGrath^{4

6}, Daniel McDonald³, Niina Haiminen⁷, George Armstrong^{3

4

8}, Yoshiki Vázquez-Baeza⁴, Julian Yu^{1

2}, Justin Kuczynski⁹, Gregory D Sepich-Poore¹⁰, Austin D Swafford⁴, Promi Das^{3

11}, Justin P Shaffer³, Franck Lejzerowicz^{3

4}, Pedro Belda-Ferre^{3

4}, Aki S Havulinna^{12

13}, Guillaume Méric^{14

15}, Teemu Niiranen^{12

16

17}, Leo Lahti¹⁸, Veikko Salomaa¹², Ho-Cheol Kim¹⁹, Mohit Jain^{20

21}, Michael Inouye^{14

22}, Jack A Gilbert^{3

4

11}, Rob Knight^{3

10

23}

Affiliations

¹ School of Life Sciences, Arizona State Universitygrid.215654.1, Tempe, Arizona, USA.
² Center for Fundamental and Applied Microbiomics, Arizona State Universitygrid.215654.1, Tempe, Arizona, USA.
³ Department of Pediatrics, School of Medicine, University of California San Diego, La Jolla, California, USA.
⁴ Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, California, USA.
⁵ Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China.
⁶ Division of Biological Sciences, University of California San Diego, La Jolla, California, USA.
⁷ IBM T. J. Watson Research Center, Yorktown Heights, New York, USA.
⁸ Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, California, USA.
⁹ Google, Mountain View, California, USA.
¹⁰ Department of Bioengineering, University of California San Diego, La Jolla, California, USA.
¹¹ Scripps Institution of Oceanography, University of California San Diego, La Jolla, California, USA.
¹² Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland.
¹³ Institute for Molecular Medicine Finland, HiLIFE, University of Helsinki, Helsinki, Finland.
¹⁴ Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia.
¹⁵ Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia.
¹⁶ Department of Internal Medicine, University of Turku, Turku, Finland.
¹⁷ Division of Medicine, Turku University Hospital, Finland.
¹⁸ Department of Computing, University of Turku, Turku, Finland.
¹⁹ IBM Almaden Research Center, San Jose, California, USA.
²⁰ Department of Medicine, University of California San Diego, La Jolla, California, USA.
²¹ Department of Pharmacology, University of California San Diego, La Jolla, California, USA.
²² Department of Public Health and Primary Care, Cambridge University, Cambridge, United Kingdom.
²³ Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA.

^# Contributed equally.

PMID: 35369727
PMCID: PMC9040630
DOI: 10.1128/msystems.00167-22

Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

Qiyun Zhu et al. mSystems. 2022.

. 2022 Apr 26;7(2):e0016722.

doi: 10.1128/msystems.00167-22. Epub 2022 Apr 4.

Authors

Affiliations

¹ School of Life Sciences, Arizona State Universitygrid.215654.1, Tempe, Arizona, USA.
² Center for Fundamental and Applied Microbiomics, Arizona State Universitygrid.215654.1, Tempe, Arizona, USA.
³ Department of Pediatrics, School of Medicine, University of California San Diego, La Jolla, California, USA.
⁴ Center for Microbiome Innovation, Jacobs School of Engineering, University of California San Diego, La Jolla, California, USA.
⁵ Faculty of Dentistry, The University of Hong Kong, Hong Kong SAR, China.
⁶ Division of Biological Sciences, University of California San Diego, La Jolla, California, USA.
⁷ IBM T. J. Watson Research Center, Yorktown Heights, New York, USA.
⁸ Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, California, USA.
⁹ Google, Mountain View, California, USA.
¹⁰ Department of Bioengineering, University of California San Diego, La Jolla, California, USA.
¹¹ Scripps Institution of Oceanography, University of California San Diego, La Jolla, California, USA.
¹² Department of Public Health and Welfare, Finnish Institute for Health and Welfare, Helsinki, Finland.
¹³ Institute for Molecular Medicine Finland, HiLIFE, University of Helsinki, Helsinki, Finland.
¹⁴ Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia.
¹⁵ Department of Infectious Diseases, Central Clinical School, Monash University, Melbourne, Victoria, Australia.
¹⁶ Department of Internal Medicine, University of Turku, Turku, Finland.
¹⁷ Division of Medicine, Turku University Hospital, Finland.
¹⁸ Department of Computing, University of Turku, Turku, Finland.
¹⁹ IBM Almaden Research Center, San Jose, California, USA.
²⁰ Department of Medicine, University of California San Diego, La Jolla, California, USA.
²¹ Department of Pharmacology, University of California San Diego, La Jolla, California, USA.
²² Department of Public Health and Primary Care, Cambridge University, Cambridge, United Kingdom.
²³ Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA.

^# Contributed equally.

PMID: 35369727
PMCID: PMC9040630
DOI: 10.1128/msystems.00167-22

Erratum in

Correction for Zhu et al., "Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy".
Zhu Q, Huang S, Gonzalez A, McGrath I, McDonald D, Haiminen N, Armstrong G, Vázquez-Baeza Y, Yu J, Kuczynski J, Sepich-Poore GD, Swafford AD, Das P, Shaffer JP, Lejzerowicz F, Belda-Ferre P, Havulinna AS, Méric G, Niiranen T, Lahti L, Salomaa V, Kim H-C, Jain M, Inouye M, Gilbert JA, Knight R. Zhu Q, et al. mSystems. 2024 Nov 19;9(11):e0128924. doi: 10.1128/msystems.01289-24. Epub 2024 Oct 18. mSystems. 2024. PMID: 39422474 Free PMC article. No abstract available.

Abstract

We introduce the operational genomic unit (OGU) method, a metagenome analysis strategy that directly exploits sequence alignment hits to individual reference genomes as the minimum unit for assessing the diversity of microbial communities and their relevance to environmental factors. This approach is independent of taxonomic classification, granting the possibility of maximal resolution of community composition, and organizes features into an accurate hierarchy using a phylogenomic tree. The outputs are suitable for contemporary analytical protocols for community ecology, differential abundance, and supervised learning while supporting phylogenetic methods, such as UniFrac and phylofactorization, that are seldom applied to shotgun metagenomics despite being prevalent in 16S rRNA gene amplicon studies. As demonstrated in two real-world case studies, the OGU method produces biologically meaningful patterns from microbiome data sets. Such patterns further remain detectable at very low metagenomic sequencing depths. Compared with taxonomic unit-based analyses implemented in currently adopted metagenomics tools, and the analysis of 16S rRNA gene amplicon sequence variants, this method shows superiority in informing biologically relevant insights, including stronger correlation with body environment and host sex on the Human Microbiome Project data set and more accurate prediction of human age by the gut microbiomes of Finnish individuals included in the FINRISK 2002 cohort. We provide Woltka, a bioinformatics tool to implement this method, with full integration with the QIIME 2 package and the Qiita web platform, to facilitate adoption of the OGU method in future metagenomics studies. IMPORTANCE Shotgun metagenomics is a powerful, yet computationally challenging, technique compared to 16S rRNA gene amplicon sequencing for decoding the composition and structure of microbial communities. Current analyses of metagenomic data are primarily based on taxonomic classification, which is limited in feature resolution. To solve these challenges, we introduce operational genomic units (OGUs), which are the individual reference genomes derived from sequence alignment results, without further assigning them taxonomy. The OGU method advances current read-based metagenomics in two dimensions: (i) providing maximal resolution of community composition and (ii) permitting use of phylogeny-aware tools. Our analysis of real-world data sets shows that it is advantageous over currently adopted metagenomic analysis methods and the finest-grained 16S rRNA analysis methods in predicting biological traits. We thus propose the adoption of OGUs as an effective practice in metagenomic studies.

Keywords: UniFrac; metagenomics; operational genomic unit; reference phylogeny; supervised learning; taxonomy independent.

PubMed Disclaimer

Conflict of interest statement

The authors declare no conflict of interest.

Figures

**FIG 1**
Feature resolution impacts community ecology analysis in small conceptual examples. (A) Synthetic data set involving three microbial communities, each having 12 unique read hits, as represented by black circles in the frequency table, to a total of 10 reference genomes (OGUs), classified under five species, three genera, and one family, as noted on the left. A phylogenetic tree of the 10 genomes is shown on the right. In this simplified case, the phylogeny is not much more complex than the taxonomy (with three more edges); however, the taxonomic assignment and the phylogenetic placement of genome O5 are not consistent. (B) Beta diversity of the data set. The three samples (circles) are connected by edges representing the pairwise distances calculated by Bray-Curtis (BC) or weighted UniFrac (WU) on the frequency table. For the latter measure, either the taxonomy or the phylogeny was used to quantify the hierarchical relationships among OGUs, as noted in the parentheses. The edge lengths were normalized so that their sum is equal in each graph. This synthetic case study demonstrates that different resolutions of features and feature structures can lead to very different conclusions regarding sample relationships.

**FIG 2**
Analysis of the HMP metagenomes reveals clustering by body environment and differentiation by host sex. The analysis was performed on 210 samples subsampled to one million paired-end shotgun reads each. (A) Heat map of the OGU table randomly subsampled to 10 samples and 100 OGUs. (B) Heat map of a species-level taxonomic profile inferred by Bracken, subsampled to the same 10 samples and 100 randomly selected species. Empty cells represent zero values. The feature count and table density indicated in the y axis describe the entire (not subsampled) table. (C) PCoA based on OGUs using weighted UniFrac calculated with the WoL reference phylogeny based on the OGU table. Samples (dots) are colored by body site and shaped by host sex. (D) PCoA using the currently adopted method (CAM): Bray-Curtis calculated on species-level taxonomic units identified by Bracken, which shows a diagonal pattern that aligns all samples of the four nonoral body sites in one plane. (E) Proportions of community structure variance explained by the first three axes of PCoA. (F) Mean ratio of the beta diversity distances from any oral sample to a sample of the two other oral sites versus to that of nonoral body sites. The lower the mean ratio is, the more similar communities of the three oral sites are to each other in the background of multiple body environments. The bold line in each box represents the median. The whiskers represent 1.5 times the interquartile range (IQR). (G and H) PERMANOVA pseudo-F statistics indicating the differentiation of community structures by body site (G) and by host sex (H). The larger F is, the more distinct the community structures are between groups versus within groups. The y axis is aligned to an F value of 1.0 which indicates no difference. For panel G, all statistics have a P value of 0.001. For H, an asterisk indicates a P value of ≤0.05. (I) Differentially abundant phylogenetic clades by host sex inferred using PhyloFactor and visualized using EMPress on the WoL reference phylogeny. The tree was subsetted to include only OGUs detected in the data set. The top 20 clades by effect size are colored (full details provided in Fig. S1 and S2). The top five clades are numbered 1 through 5 by decreasing effect size, circled, and labeled with corresponding taxonomic annotations. The small color ring represents phylum-level annotations. The inner and outer bar plot rings indicate the OGU counts split by body site (using the same color scheme as in panels A and B) and by host sex, respectively.

**FIG 3**
Analysis of the FINRISK metagenomes showing superior prediction accuracy over taxonomic units and 16S rRNA data. (A) The empirical error (MAE; a lower error indicates better performance) in predicting host chronological age using microbiome features at distinct taxonomic ranks in paired 16S rRNA amplicon and shallow shotgun metagenomics data with a random forest regressor. “None” represents the taxonomy-free, finest-possible level (ASV for 16S, OGU for shotgun). Small circles indicate MAEs in all iterations of 5-fold cross validation. Large circles and error bars indicate means and standard deviations of the five MAEs. (B) Scatterplot of the actual age versus the predicted age by the best-performing model with OGU features in the 5-fold cross-validation. The black line was generated using ggplot2’s local polynomial regression fitting. (C) Phylogenomic tree of 169 OGUs with importance scores of ≥0.1 in the prediction model. The tree was subsampled based on the WoL reference phylogeny and drawn to scale (branch lengths represent mutations per site). Branch colors indicate the mean importance score of all descendants of the clade. Taxonomic labels are displayed where needed. Circles and lines with stops are displayed where needed to assist location of taxonomic labels to target branches or clades.

**FIG 4**
Impact of alignment artifacts on the result of the OGU analysis. (A to D) Robustness of analysis results on the HMP data set. Four commonly used beta diversity measures were tested: Jaccard (unweighted, tree free), Bray-Curtis (weighted, tree free), unweighted UniFrac (tree aware) and weighted UniFrac (tree aware). (A and B) Impact of the maximum number (k) of alignments to consider for each query sequence. This k value corresponds to Bowtie2’s “-k” parameter. “All” instructs Bowtie2 to return all possible alignments. (C and D) Impact of the feature filtering threshold. The original OGU table was filtered to remove OGUs that were below the per-sample relative abundance threshold indicated at the x axis. “None” indicates no filtering (original table). (A and C) PERMANOVA pseudo-F statistics for the separation between samples from male and female hosts. (B and D) Mean ratios of the distances from any oral sample to samples of the two other oral sites versus to that of nonoral body sites. The error bars indicate the means and standard deviations of the ratios.

**FIG 5**
Relationship between OGU taxonomy and community composition. (A to C) Presence of OGU-informed taxonomic units in Bracken-inferred taxonomic profiles using the HMP data set (n = 210). (A) Per-sample fraction of OGUs whose corresponding taxonomic units at different ranks were found by Bracken. (B) Presence/absence of corresponding species of all OGUs in all samples (n = 307,060) in the Bracken result. The y axis indicates the frequency of individual OGUs per sample (out of 1 million paired-end reads). Outliers are not displayed due to the sample size. The bold line is a logistic regression curve, with coefficient and intercept annotated. (C) Per-sample fraction of OGUs filtered by different relative abundance thresholds whose corresponding species were found by Bracken. (D to F) Evaluation of the accuracy of OGU or species assignments and abundance estimation using a simulated metagenomic data set (n = 25) by the area under the precision/recall curve (AUPR) (D), Bray-Curtis (E), and weighted UniFrac (F) against the ground truth. Three tools—SHOGUN, Bracken, and MIDAS—were evaluated. The bold line and the whiskers represent median and 1.5 times the IQR, respectively. Each dot represents a simulated metagenomic sample.

**FIG 6**
Comparison of beta diversity analysis results on the HMP data set using alternative protocols. (A to D) Comparison of the OGU method with five taxonomic profilers. The OGU table was extracted from SHOGUN’s alignment file using Woltka and analyzed using weighted UniFrac with the WoL reference phylogeny. It was compared with species-level profiles analyzed using Bray-Curtis. (E to H) Comparison of six reference genome databases used for generating the OGU table. The six databases are ordered by their volume from small (left) to large (right). Each OGU table was analyzed using either Bray-Curtis or weighted UniFrac with the original phylogenetic tree provided in the database. (I to L) Comparison of three sequence aligners used for generating the OGU table. (A, E, and I) Proportions of variance explained by the first three axes of PCoA. (B, F, and J) PERMANOVA pseudo-F statistics for the separation between samples from male and female hosts. (C, G, and K) P values of the corresponding PERMANOVA tests. (D, H, and L) Mean ratios of the distances from any oral sample to samples of the two other oral sites versus to that of nonoral body sites. The bold line and the whiskers represent median and 1.5 times the IQR, respectively.

**FIG 7**
Impact of sequencing depth on the result of the OGU analysis. The original OGU table generated from 1 million paired-end reads of the HMP data set was randomly subsampled to each of the sampling depths indicated at the x axis, which is equivalent to the number of paired-end reads in the original sequencing data (see Materials and Methods). (A) Pearson correlation coefficient (r) between the composition (frequencies) of the original OGU table and the subsampled tables. To ensure visual resolution of the boxes, three outliers of 1,000, 200 and 100, respectively, are not visible in the range of the y axis. (B) Pearson’s r between the ratios of distances from any oral sample to sample of the two other oral sites versus to that of nonoral body sites. (C and D) PERMANOVA pseudo-F statistics on body site (C) and host sex (D). The results presented in panels B, C, and D were calculated from 10 replicates of random subsampling at each depth. The dashed lines indicate the statistic calculated on the original table. In all panels, the bold line in each box represents the median. The whiskers represent 1.5 times the IQR. Note that the ranges of the y axes differ between panels.

**FIG 8**
Impact of classification resolution and reference genome pool on the analysis results on the HMP data set. All feature tables were generated using the SHOGUN/Woltka protocol. (A to C) Comparison of community profiles classified to OGUs or each of the seven standard taxonomic ranks. (A) Feature counts and table densities of the profiles. (D to I) Comparison of OGU-level and species-level profiles generated using a gradient of sizes of the reference genome pool sampled from the same genome collection. (E) Numbers of genomes and species in the three genome pools: “half,” “core,” and “double.” Note that “core” is the original WoL database, and it is a superset of “half” and a subset of “double.” (I) Numbers of OGUs and species classified in the profiles. (B, E, and H) PERMANOVA pseudo-F statistics for the separation between samples from male and female hosts. (C, F, and I) P values of the corresponding PERMANOVA tests.

**FIG 9**
Procedure for generating OGU tables, as implemented in Woltka. This program serves as an interface connecting upstream sequence alignment and downstream community diversity analysis. The inputs are sequence alignments, and the outputs are feature tables (in this case, OGU tables). The pipeline contains multiple procedures, with rich format support and flexible options to address diverse user needs.

See this image and copyright information in PMC

References

1. Hugenholtz P, Chuvochina M, Oren A, Parks DH, Soo RM. 2021. Prokaryotic taxonomy and nomenclature in the age of big sequence data. ISME J 15:1879–1892. doi: 10.1038/s41396-021-00941-x. - DOI - PMC - PubMed
1. Huson DH, Auch AF, Qi J, Schuster SC. 2007. MEGAN analysis of metagenomic data. Genome Res 17:377–386. doi: 10.1101/gr.5969107. - DOI - PMC - PubMed
1. Ounit R, Wanamaker S, Close TJ, Lonardi S. 2015. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16:236. doi: 10.1186/s12864-015-1419-2. - DOI - PMC - PubMed
1. Kim D, Song L, Breitwieser FP, Salzberg SL. 2016. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res 26:1721–1729. doi: 10.1101/gr.210641.116. - DOI - PMC - PubMed
1. Wood DE, Lu J, Langmead B. 2019. Improved metagenomic analysis with Kraken 2. Genome Biol 20:257. doi: 10.1186/s13059-019-1891-0. - DOI - PMC - PubMed

Publication types

Actions
Actions
Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

Affiliations

Phylogeny-Aware Analysis of Metagenome Community Ecology Based on Matched Reference Genomes while Bypassing Taxonomy

Authors

Affiliations

Erratum in

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

Substances

Grants and funding

LinkOut - more resources

Full Text Sources

Molecular Biology Databases

Research Materials