Appl Environ Microbiol. 2005 Dec;71(12):7724-36.
doi: 10.1128/AEM.71.12.7724-7736.2005.

At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies

Kevin E Ashelford et al. Appl Environ Microbiol. 2005 Dec.

Abstract

A new method for detecting chimeras and other anomalies within 16S rRNA sequence records is presented. Using this method, we screened 1,399 sequences from 19 phyla, as defined by the Ribosomal Database Project, release 9, update 22, and found 5.0% to harbor substantial errors. Of these, 64.3% were obvious chimeras, 14.3% were unidentified sequencing errors, and 21.4% were highly degenerate. In all, 11 phyla contained obvious chimeras, accounting for 0.8 to 11% of the records for these phyla. Many chimeras (43.1%) were formed from parental sequences belonging to different phyla. While most comprised two fragments, 13.7% were composed of at least three fragments, often from three different sources. A separate analysis of the Bacteroidetes phylum (2,739 sequences) also revealed 5.8% of records to be anomalous, of which 65.4% were apparently chimeric. Overall, we conclude that, as a conservative estimate, 1 in every 20 public database records is likely to be corrupt. Our results support concerns recently expressed over the quality of the public repositories. With 16S rRNA sequence data increasingly playing a dominant role in bacterial systematics and environmental biodiversity studies, it is vital that steps be taken to improve screening of sequences prior to submission. To this end, we have implemented our method as a program with a simple-to-use graphical user interface that is capable of running on a range of computer platforms. The program is called Pintail, is released under the terms of the open-source GNU General Public License, and is freely available from our website at http://www.cardiff.ac.uk/biosi/research/biosoft/.
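The percentages in the abstract can be turned into approximate record counts. The sketch below is an inference from the rounded percentages only; the abstract itself reports no counts, and all variable names are illustrative.

```python
# Approximate counts implied by the abstract's rounded percentages.
# These are inferred values, not figures quoted from the paper.
screened = 1399
anomalous = round(screened * 0.050)  # records with substantial errors

breakdown = {
    "obvious chimeras": 0.643,
    "unidentified sequencing errors": 0.143,
    "highly degenerate": 0.214,
}
implied_counts = {
    label: round(anomalous * share) for label, share in breakdown.items()
}

# Separate Bacteroidetes analysis (2,739 sequences).
bacteroidetes_anomalous = round(2739 * 0.058)
bacteroidetes_chimeric = round(bacteroidetes_anomalous * 0.654)

for label, count in implied_counts.items():
    print(f"{label}: about {count} of {anomalous}")
print(f"Bacteroidetes: about {bacteroidetes_chimeric} of {bacteroidetes_anomalous} chimeric")
```

The three implied categories sum back to the implied total, which suggests the reported percentages are internally consistent.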

Figures

FIG. 1.
Program screenshot illustrating a typical analysis. In this example, query AY693838 (top left) is compared with subject AJ551147 (bottom left), generating a plot of evolutionary distances that demonstrates high similarity between these two sequences at the 5′ end only. AY693838, entered into the NCBI database on 30 August 2004, is classified by the RDP as belonging to the proposed new OP11 phylum. AJ551147, in contrast, belongs to the β-Proteobacteria genus Janthinobacterium.
FIG. 2.
Typical 16S rRNA gene sequence comparison plots generated by Pintail (all graphs generated with window size 300 and step size 25). (A to C) Plots between pairs of trusted sequences of increasing evolutionary distance, while D to F show examples where the query sequence is a chimera. Observed percentage differences between sequences are plotted as black lines. Gray lines show the expected percentage differences for the sequence pairs. Light gray shading indicates expected percentage differences ±5%. Escherichia coli ATCC 11775T (X80725) is compared to Escherichia vulneris ATCC 33821T (X80734) (A), Pseudomonas aeruginosa LMG 1242T (Z76651) (B), and Aquifex pyrophilus (T) Kol5a (M83548) (C). (D to F) Three typical chimeric patterns. (D) The three-fragment Nitrospira chimeric sequence AY373422 (estimated breakpoints, 340 and 740) is compared to its BLAST-identified nearest neighbor, X82559. (E) The three-fragment chimeric record U10877 generated from Riemerella anatipestifer (T) ATCC 11845 is shown to diverge from the sequence of its nearest neighbor, R. anatipestifer strain 115/02 (AY856450) around E. coli positions 790 to 1130. (F) The two-fragment Fusobacteria chimeric sequence AY548989 (estimated breakpoint, 800) is compared to the sequence from its nearest neighbor, AY548984.
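The windowed comparison described in this caption can be illustrated with a minimal sketch. It assumes two pre-aligned sequences of equal length and uses a plain mismatch percentage per window; it is not Pintail's actual distance calculation, and it does not model the expected-difference curve. The function name and the gap handling are assumptions made for illustration.

```python
def windowed_percent_difference(query, subject, window=300, step=25):
    """Percent mismatches per sliding window for two aligned sequences.

    Illustrative only: simple mismatch counting, not Pintail's own metric.
    """
    if len(query) != len(subject):
        raise ValueError("sequences must be aligned to the same length")
    profile = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        s = subject[start:start + window]
        # Assumption: alignment columns with a gap in either sequence are ignored.
        pairs = [(a, b) for a, b in zip(q, s) if a != "-" and b != "-"]
        if not pairs:
            continue
        mismatches = sum(a != b for a, b in pairs)
        profile.append((start, 100.0 * mismatches / len(pairs)))
    return profile


# Toy two-fragment "chimera": identical to the subject for the first 600
# positions, then entirely different.
subject = "ACGT" * 300
query = subject[:600] + "TGCA" * 150
profile = windowed_percent_difference(query, subject)
print(profile[0], profile[-1])
```

In this toy case the profile stays at zero across the shared fragment and rises once windows cross position 600, which is the kind of step change the chimeric panels (D to F) display.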
FIG. 3.
Variable regions within the 16S rRNA gene and locations of chimeric breakpoints. (A) The frequency of occurrence of the most common nucleotide residue at each base position within the 16S rRNA gene, as determined from 4,383 RDP-listed type strains, with E. coli U00096 as a reference. These frequencies are measures of variability within the gene. (B) Smoothing the data, by taking the mean frequency within a window of 50 bases, moving one base at a time along the gene, creates the plot shown in panel B. The locations of the hypervariable regions are labeled, with gray bars on the x axis defining these regions as V1 to V9 (the Comparative RNA Web Site [http://www.rna.icmb.utexas.edu/]). (C) Histogram of all chimera breakpoints identified in this study and that of Hugenholtz and Huber (8).
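The two calculations behind panels A and B can be sketched as follows, assuming the sequences are already aligned to a common coordinate system. The function names are hypothetical, and details such as gap handling and how the smoothed value is positioned within the window are not stated in the caption, so the choices here are assumptions.

```python
from collections import Counter


def most_common_residue_frequency(aligned_seqs):
    """Per-column frequency of the most common residue (panel A style)."""
    length = len(aligned_seqs[0])
    freqs = []
    for i in range(length):
        column = [seq[i] for seq in aligned_seqs]
        top_count = Counter(column).most_common(1)[0][1]
        freqs.append(top_count / len(column))
    return freqs


def moving_mean(values, window=50):
    """Mean over a window that advances one position at a time (panel B style)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Toy alignment of four short sequences; real input would be thousands of
# full-length 16S rRNA gene sequences.
toy = ["ACGT", "ACGA", "ACTA", "AGTA"]
freqs = most_common_residue_frequency(toy)
smoothed = moving_mean(freqs, window=2)
print(freqs, smoothed)
```

Low values mark variable positions, so dips in the smoothed curve correspond to the hypervariable regions labeled V1 to V9.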
FIG. 4.
DE values generated from a type strain data set containing 2,022 16S rRNA gene sequences without any degenerate base positions (see text). A DE value was generated for each of the 2,043,231 pairwise sequence comparisons and plotted against the evolutionary distance between sequences. (A) The data set prior to the removal of the 15 anomalous sequences (see text); (B) the plot after removal; (C) the quantile values used to describe these data and incorporated into the Pintail program as a means of calibration.
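The number of pairwise comparisons quoted in this caption matches the number of unordered pairs that can be drawn from 2,022 sequences, n(n - 1)/2. A quick check (variable names are illustrative):

```python
import math

n_sequences = 2022
# Unordered pairs of distinct sequences: n * (n - 1) / 2.
n_pairs = math.comb(n_sequences, 2)
print(n_pairs)
```

This confirms that each pair was counted once, with no self-comparisons.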
FIG. 5.
Illustrating procedure 2 for unambiguously confirming a chimeric sequence (all graphs were generated with window size 300 and step size 25). (A) In this example, the query, an Acidobacteria sp. (AF523990), is compared to its nearest neighbor (AF523976) identified by BlastN search, and an anomaly at the 5′ end is identified. (B) AF523976 is next compared to its nearest neighbor, AY234512, to confirm that it is reliable. No anomaly is detected. (C) As a final check, AF523990 is compared to AY234512; as expected, the 5′ end anomalous feature is seen. (D) To determine whether this anomaly is chimeric, the identified 5′ region is excised, a BLAST search is undertaken, and the identified nearest neighbor (in this case Actinobacteria X68459) is compared to AF523990. Again, an anomaly is detected, but this time the reverse of that seen in panel A, clearly indicating our query to be a chimera. (E) Comparing X68459 with its neighbor, AF498683, confirms its reliability, and as expected, (F) comparing the original query with AF498683 generates the same profile as that seen in panel D. The chimeric breakpoint can be estimated by superimposing A on D.
FIG. 6.
Analysis of the three-fragment chimera AF254401 (all graphs were generated with window size 100 and step size 25). The query is shown compared to AF323775 (A), AF323760 (B), and M88719 (C).
FIG. 7.
Distribution of sequence anomalies within the nineteen Bacteria phyla, as defined by the Ribosomal Database Project (3). Numbers in brackets after the phylum (or candidate division) name are the total number of sequences within that phylum present in RDP release 9, update 22, of September 2004.
FIG. 8.
First appearance in the NCBI database of the anomalous records identified by this study.

References

    1. Altschul, S., T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
    2. Benson, D. A., I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, B. A. Rapp, and D. L. Wheeler. 2000. GenBank. Nucleic Acids Res. 28:15-18.
    3. Cole, J., B. Chai, T. Marsh, R. Farris, Q. Wang, S. Kulum, S. Chandra, D. McGarrell, T. Schmidt, G. Garrity, and J. Tiedje. 2003. The Ribosomal Database Project (RDP-II): previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 31:442-443.
    4. Fox, J. L. 2005. Ribosomal gene milestone met, already left in dust. ASM News 71:6-7.
    5. Garrity, G. M., M. Winters, A. W. Kuo, and D. Searles. 2002. Taxonomic outline of the prokaryotes, p. 49-66. Bergey's manual of systematic bacteriology, 2nd ed. Springer-Verlag, New York, N.Y.