Brief Bioinform. 2024 Jul 25;25(5):bbae424.
doi: 10.1093/bib/bbae424.

CAIM: coverage-based analysis for identification of microbiome

Daniel A Acheampong et al. Brief Bioinform.

Abstract

Accurate taxonomic profiling of microbial taxa in a metagenomic sample is vital to gain insights into microbial ecology. Recent advancements in sequencing technologies have contributed tremendously toward understanding these microbes at species resolution through a whole shotgun metagenomic approach. In this study, we developed a new bioinformatics tool, coverage-based analysis for identification of microbiome (CAIM), for accurate taxonomic classification and quantification within both long- and short-read metagenomic samples using an alignment-based method. CAIM depends on two different containment techniques to identify species in metagenomic samples and uses their genome coverage information, rather than the traditional relative-abundance approach, to filter out false positives. In addition, we propose a nucleotide-count-based abundance estimation, which yields a lower root mean square error than the traditional read-count approach. We evaluated the performance of CAIM on 28 metagenomic mock communities and 2 synthetic datasets by comparing it with other top-performing tools. CAIM performed consistently better than the other tools across datasets, both in identifying microbial taxa and in estimating relative abundances. CAIM was then applied to a real dataset sequenced on both Nanopore (with and without amplification) and Illumina sequencing platforms, and the taxonomic profiles were highly similar between the platforms. Lastly, CAIM was applied to fecal shotgun metagenomic datasets of 232 colorectal cancer patients and 229 controls obtained from 4 different countries and of 44 primary liver cancer patients and 76 controls. Models using the genome-coverage cutoff showed better predictive performance than those using the relative-abundance cutoffs in discriminating colorectal cancer and primary liver cancer patients from healthy controls, with high-confidence species markers.
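The two ideas in the abstract, filtering species by genome coverage and estimating abundance from nucleotide counts instead of read counts, can be illustrated with a minimal Python sketch. It is not CAIM's implementation: the input dictionary, the field names, the function names, and the toy numbers are all hypothetical, and the abundance here is simply aligned bases divided by total aligned bases, which is an assumption about how a nucleotide-count estimate might be formed. The 15% default mirrors the genome-coverage cutoff quoted in the Figure 3 caption.

```python
from math import sqrt

# Hypothetical per-species alignment summary (toy numbers, not from the paper).
alignments = {
    "species_A": {"genome_length": 1000, "covered_bases": 800,
                  "reads": 2, "aligned_bases": 3000},
    "species_B": {"genome_length": 1000, "covered_bases": 400,
                  "reads": 2, "aligned_bases": 1000},
    "species_C": {"genome_length": 1000, "covered_bases": 50,
                  "reads": 1, "aligned_bases": 100},
}


def filter_by_coverage(stats, min_coverage=0.15):
    """Keep species whose covered fraction of the genome reaches the cutoff."""
    return {
        name: s
        for name, s in stats.items()
        if s["covered_bases"] / s["genome_length"] >= min_coverage
    }


def relative_abundance(stats, key):
    """Relative abundance from a count field ('reads' or 'aligned_bases')."""
    total = sum(s[key] for s in stats.values())
    return {name: s[key] / total for name, s in stats.items()}


def rmse(estimated, truth):
    """Root mean square error over the union of taxa in both profiles."""
    taxa = sorted(set(estimated) | set(truth))
    squared = [(estimated.get(t, 0.0) - truth.get(t, 0.0)) ** 2 for t in taxa]
    return sqrt(sum(squared) / len(taxa))


kept = filter_by_coverage(alignments)            # species_C is dropped (5% coverage)
by_reads = relative_abundance(kept, "reads")     # read-count estimate
by_nucleotides = relative_abundance(kept, "aligned_bases")  # nucleotide-count estimate
```

In this toy case the two retained species have the same number of reads but different read lengths, so the read-count estimate splits them evenly while the nucleotide-count estimate does not. With long reads of variable length, the two estimates can diverge in this way.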

Keywords: bioinformatics; gut microbiome; metagenome; metagenome coverage; taxonomic identification.

Figures

Figure 1
CAIM workflow: (1) High-quality metagenomic reads are screened against sketched reference genomes. The reference genomes are further (2) reduced and (3) refined, and are later mapped against the metagenomic reads. (4) Unmapped reads are sketched and screened against reference genomes to obtain refined genomes, which are later aligned with the unmapped reads. The aligned results are then merged for taxonomic classification.
Figure 2
Bar plot showing the sizes of (a) mock community and (b) real datasets used in this study. Asterisks (*) represent mock community datasets sequenced in our laboratory and the plus signs (+) represent synthetic datasets. All the real datasets were sequenced in our laboratory from native metagenome DNA, with amplified metagenome DNA indicated by the at sign (@). (c) Comparison of RMSE between the nucleotide-based and read-based abundance estimates under the perfect reference genome scenario. (d) Average F1 scores based on the genome-coverage cutoff and relative-abundance cutoff for short reads and long reads on different subsampled mock community datasets. (e and f) The average F1 score for all the mock datasets at different genome-coverage and relative-abundance cutoffs for CAIM with and without the refinement step in Fig. 1; (formula image) corresponds to abundance cutoff values of (formula image) and genome coverage cutoff of (formula image). CAIM_NoR: CAIM without refinement; CAIM_WiR: CAIM with refinement.
Figure 3
Comparison of different classifiers on (a) short- and (b) long-read metagenomic mock samples by estimating their F1 scores at various relative-abundance cutoffs (%); a genome-coverage cutoff of 15% was used for CAIM. (c) The RMSE plot for the various taxonomic classifiers on the mock community datasets based on the highest F1 score from the previous panels.
Figure 4
Comparison of the abundance of species in common from the gut microbiomes of different organisms sequenced on both the Illumina and Nanopore sequencing platforms, shown as (a) scatter plots with Spearman’s rho correlation coefficient, (b) alpha diversity, and (c) beta diversity plots. (d) Krona plots of the taxonomic composition of the sample HumanM show a high similarity between the Illumina and Nanopore sequencing platforms.
Figure 5
Predictive performance of RF, SVM, and LASSO predictive models using taxonomic profiles derived from CAIM and CAIM_abun relative abundances. (a) Cross-prediction models where we trained the model on the CRC datasets on the y-axis and tested on the x-axis datasets. The diagonal represents the within-dataset prediction where we trained the model on 80% of the dataset and tested on the remaining 20%. (b) Average cross-prediction model AUC values for the different models when trained on the x-axis datasets. (c) Average AUC values for the different models considered when we leave one dataset out (x-axis) and train the model on the other three. We performed a 10-fold cross validation repeated 100 times and reported the average AUC values as shown. (d) Number of important features identified in the LODO settings using RF when the model was trained on the other three and the country shown is exempted from training with their AUC values when predicted with those features for CAIM and CAIM_abun. (b and c). (e) Bump charts illustrating the top five important features in all the RF models trained for the LODO setting and their ranks in the other scenarios (y-axis). The x-axis represents the training sets or datasets left out of the model. (f) Comparison of AUC of prediction models of LASSO, SVM, and RF based on taxonomic profiles generated by CAIM and MetaPhlAn3 of the Thai primary liver cancer dataset.
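Two evaluation ideas recur in the captions above: the F1 score of a predicted species set against a known mock community, and the leave-one-dataset-out (LODO) setting in which a model is trained on three cohorts and tested on the fourth. The following Python sketch shows one way each could be computed. The function names, cohort labels, and sample identifiers are hypothetical placeholders, and the sketch only produces the train/test partitions; it does not fit the RF, SVM, or LASSO models.

```python
def f1_score(predicted, truth):
    """F1 of a predicted species set against the expected species set."""
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def lodo_splits(datasets):
    """Yield (held_out_name, training_samples, test_samples) per dataset.

    `datasets` maps a cohort name to its list of samples; each cohort is
    held out once while all remaining cohorts are pooled for training.
    """
    for held_out in datasets:
        train = [
            sample
            for name, samples in datasets.items()
            if name != held_out
            for sample in samples
        ]
        yield held_out, train, list(datasets[held_out])


# Hypothetical cohorts standing in for the four CRC datasets.
cohorts = {
    "cohort_1": ["s1", "s2"],
    "cohort_2": ["s3"],
    "cohort_3": ["s4", "s5"],
    "cohort_4": ["s6"],
}
splits = list(lodo_splits(cohorts))

# Toy mock community: two of three predicted species are correct,
# and one expected species is missed.
score = f1_score({"sp_A", "sp_B", "sp_C"}, {"sp_A", "sp_B", "sp_D"})
```

Each split keeps every sample of the held-out cohort out of training, which is what distinguishes LODO from the within-dataset 80/20 split described for panel (a).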
