BMC Biol. 2006 Dec 7:4:41.
doi: 10.1186/1741-7007-4-41.

Composition-based statistics and translated nucleotide searches: improving the TBLASTN module of BLAST

E Michael Gertz et al. BMC Biol. 2006.

Abstract

Background: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server.
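As an illustration of what "translated in all six frames" means, the following minimal sketch translates a DNA string in three offsets on each strand using the standard genetic code. It is not the TBLASTN implementation; the function names and the frame labels ("+1" to "-3") are assumptions made for this example, and unknown codons are simply written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; a trailing partial codon is dropped."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # "X" marks codons with ambiguous bases
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map hypothetical frame labels (+1..+3, -1..-3) to translated sequences."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, protein in six_frame_translations("ATGGCCTAA").items():
        print(label, protein)
```

A protein query would then be aligned against each of the six translated sequences; stop codons appear as "*" in the translations.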

Results: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy.
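The statistical-accuracy test can be sketched as follows, under stated assumptions. A query is scrambled by permuting its residues, which preserves its composition; an E-value is converted to a P-value with the standard relation P = 1 - exp(-E); and the number of queries with P-value at most x is compared with the count expected if P-values were uniformly distributed. How the paper assigns one P-value to each query is not stated in this abstract, so the helper names and inputs below are hypothetical.

```python
import bisect
import math
import random


def permute_sequence(seq, rng):
    """Scramble a sequence while keeping its residue composition unchanged."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def evalue_to_pvalue(evalue):
    """Standard conversion P = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-evalue)


def cumulative_counts(pvalues, thresholds):
    """For each threshold x, count the queries whose P-value is <= x."""
    ordered = sorted(pvalues)
    return [bisect.bisect_right(ordered, x) for x in thresholds]


def ideal_counts(num_queries, thresholds):
    """Expected counts when P-values are uniform on [0, 1] (assumed ideal line)."""
    return [num_queries * x for x in thresholds]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the scramble is reproducible
    print(permute_sequence("MKTAYIAKQR", rng))
    # Hypothetical per-query E-values, for illustration only.
    pvalues = [evalue_to_pvalue(e) for e in (0.01, 0.3, 1.2, 4.0)]
    thresholds = [0.01, 0.1, 0.5, 1.0]
    print(cumulative_counts(pvalues, thresholds))
    print(ideal_counts(len(pvalues), thresholds))
```

Observed counts well above the ideal counts at small x would indicate that reported E-values are too optimistic for random matches.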

Conclusion: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.

Figures

Figure 1
Statistical accuracy of three variants of TBLASTN. One thousand queries were randomly selected from mouse proteins, permuted, and aligned to human nuclear DNA. For each variant, we plot against x the number of queries with P-value less than or equal to x. The solid line is the theoretically ideal distribution of these values.
Figure 2
A portion of the ROC curves for three variants of TBLASTN. The ROC curves were generated by analyzing the results of aligning 102 queries against the yeast genome. The ROC-250 score for each version of TBLASTN is included in the legend in parentheses after the name of the version. True positives are plotted against false positives, on a linear scale. The total number of true positives possible in this test set was 988. Inset: part of the same ROC curves, plotted on a different scale to show the separation between curves.
Figure 3
A semi-log plot of a portion of the ROC curves for three variants of TBLASTN. The same data as in Figure 2, shown as a semi-log plot using the scales of coverage and errors per query.
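For readers unfamiliar with ROC-n scores such as the ROC-250 values mentioned in the Figure 2 caption, here is a minimal sketch using the commonly stated definition: hits pooled over all queries are ranked by E-value, and for each of the first n false positives the number of true positives ranked above it is summed, then divided by n times the total number of possible true positives. This is not the authors' evaluation code. The function names, the input format, and the handling of lists with fewer than n false positives are assumptions of this example.

```python
def roc_n(ranked_labels, n, total_true):
    """ROC-n score for labels ordered best hit first (True = true positive).

    Assumption: if fewer than n false positives occur, each missing false
    positive is credited with all true positives that were found.
    """
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    area += (n - false_seen) * true_seen
    return area / (n * total_true)


def roc_n_from_hits(hits, n, total_true):
    """Rank hypothetical (evalue, is_true) pairs by ascending E-value, then score."""
    ranked = [is_true for _, is_true in sorted(hits, key=lambda hit: hit[0])]
    return roc_n(ranked, n, total_true)


if __name__ == "__main__":
    # Made-up hits for illustration only; not data from the paper.
    hits = [(1e-30, True), (1e-12, True), (0.002, False), (0.01, True), (0.5, False)]
    print(roc_n_from_hits(hits, n=2, total_true=3))
```

A score of 1.0 means every possible true positive was ranked above the first false positive. In the setting of the captions above, n would be 250 and total_true would be 988.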

