Nat Commun. 2016 Sep 16;7:12797.
doi: 10.1038/ncomms12797.

Sequence element enrichment analysis to determine the genetic basis of bacterial phenotypes

John A Lees et al. Nat Commun. 2016.

Abstract

Bacterial genomes vary extensively in terms of both gene content and gene sequence. This plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens Streptococcus pneumoniae and Streptococcus pyogenes, SEER identifies relevant previously characterized resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of S. pyogenes. We thus demonstrate that our method can answer important biologically and medically relevant questions.
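The core idea in the abstract, counting k-mers across genomes and testing each for enrichment in a phenotype, can be illustrated with a minimal sketch. This is not the SEER implementation: it assumes fixed-length k-mers rather than variable-length ones, uses a one-sided Fisher's exact test with a Bonferroni threshold as a stand-in for the association tests, and makes no correction for clonal population structure. All function names, the toy sequences, and the `alpha` parameter are hypothetical.

```python
from collections import defaultdict
from math import comb


def kmer_presence(genomes, k):
    """Map each k-mer to the set of sample names whose sequence contains it."""
    presence = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            presence[seq[i:i + k]].add(name)
    return presence


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher's exact test for enrichment of a k-mer in cases.

    2x2 table: a = cases with k-mer,    b = controls with k-mer,
               c = cases without k-mer, d = controls without k-mer.
    Returns P(X >= a) under the hypergeometric null.
    """
    n_cases, n_with, n_total = a + c, a + b, a + b + c + d
    upper = min(n_cases, n_with)
    tail = sum(
        comb(n_cases, x) * comb(n_total - n_cases, n_with - x)
        for x in range(a, upper + 1)
    )
    return tail / comb(n_total, n_with)


def enriched_kmers(genomes, phenotype, k, alpha=0.05):
    """Return {k-mer: p-value} for k-mers enriched in samples with the phenotype.

    Assumption: a Bonferroni threshold over all distinct k-mers is used here
    purely for illustration; no population-structure correction is applied.
    """
    cases = {s for s, has_trait in phenotype.items() if has_trait}
    controls = set(phenotype) - cases
    presence = kmer_presence(genomes, k)
    threshold = alpha / len(presence)
    hits = {}
    for kmer, carriers in presence.items():
        a = len(carriers & cases)
        b = len(carriers & controls)
        p = fisher_enrichment_p(a, b, len(cases) - a, len(controls) - b)
        if p <= threshold:
            hits[kmer] = p
    return hits


# Toy data (hypothetical): six "resistant" and six "susceptible" samples
# that differ by a single base.
genomes = {f"case{i}": "CCCCGATTACACCCC" for i in range(6)}
genomes.update({f"ctrl{i}": "CCCCGTTTACACCCC" for i in range(6)})
phenotype = {name: name.startswith("case") for name in genomes}

hits = enriched_kmers(genomes, phenotype, k=5)
for kmer, p in sorted(hits.items()):
    print(kmer, p)
```

On the toy data, only the k-mers that overlap the differing base and occur in the case sequence pass the threshold. In real bacterial collections, related isolates share many k-mers by descent, which is why the abstract stresses correcting for clonal population structure; this sketch omits that step.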


Figures

Figure 1. Power to find associations versus number of samples.
Using simulations and subsamples of the population as described in the methods, power is shown for (a) detecting gene presence/absence at different odds ratios; (b) using all informative k-mers versus a single length; and (c) detecting k-mers near, in the correct gene, or containing the causal variant for trimethoprim resistance. All curves are logistic fits to the mean power over 100 subsamples.
Figure 2. Fine mapping trimethoprim resistance.
The locus pictured contains 72 significant k-mers, the most of any gene cluster. Coverage over the locus is pictured at the bottom of the figure. Shown above the genes are high-quality missense SNPs, plotted using their P value for affecting protein function as predicted by SIFT. Scale bar is 200 base pairs.

