Biomedicines. 2024 Oct 15;12(10):2350.
doi: 10.3390/biomedicines12102350.

Correction of Batch Effect in Gut Microbiota Profiling of ASD Cohorts from Different Geographical Origins

Matteo Scanu et al. Biomedicines. 2024.

Abstract

Background: To date, there have been numerous metataxonomic studies on gut microbiota (GM) profiling based on the analyses of data from public repositories. However, differences in study population and wet and dry pipelines have produced discordant results. Herein, we propose a biostatistical approach to remove these batch effects for the GM characterization in the case of autism spectrum disorders (ASDs).

Methods: An original dataset of GM profiles from patients with ASD was ecologically characterized and compared with GM public digital profiles of age-matched neurotypical controls (NCs). Also, GM data from seven case-control studies on ASD were retrieved from the NCBI platform and exploited for analysis. Hence, on each dataset, conditional quantile regression (CQR) was performed to reduce the batch effects originating from both technical and geographical confounders affecting the GM-related data. This method was further applied to the whole dataset matrix, obtained by merging all datasets. The ASD GM markers were identified by the random forest (RF) model.

Results: We observed a different GM profile in patients with ASD compared with NC subjects. Moreover, a significant reduction of technical- and geographical-dependent batch effects in all datasets was achieved. We identified Bacteroides_H, Faecalibacterium, Gemmiger_A_73129, Blautia_A_141781, Bifidobacterium_388775, and Phocaeicola_A_858004 as robust GM bacterial biomarkers of ASD. Finally, our validation approach provided evidence of the validity of the CQR method, showing high values of accuracy, specificity, sensitivity, and AUC-ROC.

Conclusions: Herein, we proposed an updated biostatistical approach to reduce the technical and geographical batch effects that may negatively affect the description of bacterial composition in microbiota studies.

Keywords: autism spectrum disorders (ASDs); batch effect normalization; gut microbiota; intestinal biomarkers; machine learning; quantile regression.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Graphical summary. (A) Scheme of the comparative workflow for ASD and NC groups. Bacterial DNA was extracted from 82 fecal samples, and the V3–V4 hypervariable region of the 16S rRNA gene was amplified and sequenced on the Illumina MiSeq platform. Amplicon sequence variants (ASVs) were obtained from a total of 150 fastq files (82 ASD fastq files and 68 age-matched NC fastq files selected from the PRJNA280490 BioProject) and were taxonomically assigned using the Greengenes database v2022.10. Ecological and univariate analyses were conducted for statistical comparisons. (B) Workflow of the batch effect correction. In the left panel (Discovery Phase), ASVs that classify an individual as either autistic or neurotypical control are selected by applying conditional quantile regression (CQR) and random forest (RF) models to 16S rRNA sequencing datasets. In the right panel (Validation Phase), the selected set of ASVs and the CQR method are validated using the Italian validation dataset and the whole validation dataset, respectively.
Figure 2
Comparison of gut microbiota composition between ASD and NC groups. Alpha diversity was evaluated by the Shannon–Wiener, Simpson, and Chao1 indexes, and the Mann–Whitney test was used to compare ASD (red) and NC (green) groups (p-value > 0.05) (A–C). Principal coordinate analysis (PCoA) shows the dissimilarity between ASD and NC groups calculated by the Bray–Curtis dissimilarity algorithm (PERMANOVA test, p-value < 0.05) (D). Univariate analysis performed with linear discriminant analysis effect size (LEfSe) shows the genera that differ significantly between ASD and NC groups, with an LDA value > 3 (p-adjusted < 0.05) (E).
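The three alpha diversity indexes named in this caption can be illustrated with a minimal sketch. It assumes a plain list of per-taxon read counts for one sample, the natural-log form of Shannon–Wiener, the Gini–Simpson form (1 minus the sum of squared proportions), and the classic Chao1 estimator with its bias-corrected fallback when no doubletons are present. The function names are hypothetical, and the authors' actual software and index variants are not stated here.

```python
import math


def shannon(counts):
    """Shannon-Wiener index H = -sum(p * ln p) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Gini-Simpson index 1 - sum(p^2); an assumed variant, others exist."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Chao1 richness estimate from singleton (F1) and doubleton (F2) counts."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    if doubletons > 0:
        return observed + singletons ** 2 / (2 * doubletons)
    # Bias-corrected form used when there are no doubletons.
    return observed + singletons * (singletons - 1) / 2


if __name__ == "__main__":
    sample = [5, 1, 1, 2, 2, 0]  # made-up counts, for illustration only
    print(shannon(sample), simpson(sample), chao1(sample))
```

Per-sample index values computed this way would then be compared between groups with a rank-based test such as Mann–Whitney, as the caption describes.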
Figure 3
PCoA plot of the Bray–Curtis dissimilarity in Italian and Chinese datasets. Principal coordinate analysis (PCoA) was performed on dissimilarity matrices produced by the Bray–Curtis algorithm. In the left panel, the biplots show the PCoA applied to the Italian dataset before and after technical batch correction for the comparison between BioProjects (A,B) and between ASD and NC groups (C,D). In the right panel, the biplots show the PCoA applied to the Chinese dataset before and after technical batch correction for the comparison between BioProjects (E,F) and between ASD and NC groups (G,H). The R² values, calculated by the PERMANOVA test, are statistically significant (p-value < 0.05).
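As a rough illustration of the quantities reported in this caption, the sketch below computes the Bray–Curtis dissimilarity between two count vectors and the PERMANOVA R² (the share of the total sum of squared distances explained by a grouping such as BioProject or ASD/NC). It assumes samples are given as equal-length lists of counts. The function names are hypothetical, and the permutation step that yields the p-value is omitted for brevity.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def permanova_r2(samples, groups):
    """R^2 = 1 - SS_within / SS_total, computed on squared pairwise distances."""
    n = len(samples)
    d2 = [[bray_curtis(samples[i], samples[j]) ** 2 for j in range(n)]
          for i in range(n)]
    ss_total = sum(d2[i][j] for i in range(n) for j in range(i + 1, n)) / n
    if ss_total == 0:
        return 0.0  # all samples identical: nothing to explain
    ss_within = 0.0
    for g in set(groups):
        idx = [i for i, label in enumerate(groups) if label == g]
        pair_sum = sum(d2[i][j] for k, i in enumerate(idx) for j in idx[k + 1:])
        ss_within += pair_sum / len(idx)
    return 1.0 - ss_within / ss_total


if __name__ == "__main__":
    # Made-up counts: two samples per hypothetical batch.
    samples = [[10, 0], [10, 0], [0, 10], [0, 10]]
    batches = ["A", "A", "B", "B"]
    print(permanova_r2(samples, batches))
```

Under this reading, a successful batch correction would be expected to lower the R² attributed to BioProject while preserving the R² attributed to the ASD/NC grouping.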
Figure 4
PCoA plot of the Bray–Curtis dissimilarity in the whole dataset. Principal coordinate analysis (PCoA) was performed on dissimilarity matrices produced by the Bray–Curtis algorithm. The biplots show the PCoA performed on dissimilarity matrices before and after technical and geographical batch normalization for the comparison between BioProjects (A,B) and between ASD and NC groups (C,D). The R² values, calculated by the PERMANOVA test, are statistically significant (p-value < 0.05).
Figure 5
Random forest model applied to bacterial matrices merged by geographical origin. The importance of the top 25 genera in the predictive model applied to the Chinese (A), Italian (B), and Korean (C) matrices was evaluated using the mean decrease in Gini coefficient. For each RF model, the accuracy, sensitivity, specificity, and AUC-ROC values are reported. The Venn diagram (D) shows the number of unique and shared most important features between datasets.
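The Gini-based importance mentioned in this caption rests on a simple quantity: how much a split on one feature lowers the Gini impurity of the class labels. The sketch below shows that single-split calculation under the usual definition. A random forest aggregates such decreases for each feature across all splits and trees, which is not reproduced here. The function names and the toy labels are assumptions for illustration only.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted


if __name__ == "__main__":
    node = ["ASD", "ASD", "NC", "NC"]
    # A split that separates the classes perfectly removes all impurity.
    print(gini_decrease(node, ["ASD", "ASD"], ["NC", "NC"]))
    # A split that leaves both children mixed removes none.
    print(gini_decrease(node, ["ASD", "NC"], ["ASD", "NC"]))
```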
Figure 6
Random forest applied to the entire count matrix at the genus level. The importance of the top 25 genera in the predictive model was evaluated using the mean decrease in Gini coefficient (A). The accuracy, sensitivity, specificity, and AUC-ROC values of the RF model are reported (B).
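For readers unfamiliar with the four performance measures listed in this caption, the following minimal sketch shows how they are conventionally derived from true labels, predicted labels, and classifier scores. It assumes ASD is treated as the positive class and computes AUC-ROC through its rank interpretation (the probability that a random positive scores higher than a random negative, with ties counted as one half). The function names and example values are hypothetical and are not taken from the study.

```python
def confusion_metrics(y_true, y_pred, positive="ASD"):
    """Accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc_roc(y_true, scores, positive="ASD"):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y_true = ["ASD", "ASD", "NC", "NC"]       # made-up labels
    y_pred = ["ASD", "NC", "NC", "NC"]        # made-up predictions
    scores = [0.9, 0.4, 0.5, 0.1]             # made-up ASD probabilities
    print(confusion_metrics(y_true, y_pred))
    print(auc_roc(y_true, scores))
```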
