Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2012 Sep 15;28(18):i438-i443.
doi: 10.1093/bioinformatics/bts417.

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

Affiliations

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

J Patrik Koskinen et al. Bioinformatics. .

Abstract

Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects.

Results: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50-100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases.

Availability and implementation: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans.

Contact: liisa.holm@helsinki.fi.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
A classification of approaches used by representative application programs for protein sequence retrieval. This work focuses on word filters based on suffix arrays or k-mer counting. Related approaches or data structures are shown in gray and not expanded. SANS is simple and fast but has a limited application range. Other fast programs calculate explicit alignments for a filtered subset of database proteins. SSEARCH calculates the optimal alignment between the query and all database proteins
Fig. 2.
Fig. 2.
Problem formulations with typical application domains in italics
Fig. 3.
Fig. 3.
Schematic illustration of suffix array (SA) and inverse suffix array (ISA) on text ‘BANANAS$’. For example, the suffixes ‘ANANAS$’, ‘ANAS$’ and ‘AS$’ are adjacent in the lexically ordered suffix array
Fig. 4.
Fig. 4.
Correctness of mappings in metagenome benchmark. The top hit of 3.6 million queries were evaluated using BLAST (e-value < 1) as reference of truth. USEARCH was run with default parameters (a = 1, r = 8). SANS was run with window size 1. KSEARCHx were run with word size x
Fig. 5.
Fig. 5.
Relative sensitivity of SSEARCH, BLAST, SANS and KSEARCH on the genome benchmark. The top-1000 hits per query are evaluated. Sensitivity is calculated for different bins of sequence identity. Sensitivities within each bin are scaled linearly so that BLAST sensitivity is one. Sequence identities of the TP hits were taken from SSEARCH. SSEARCH and BLAST hits are sorted by e-value. SANS was run with W = H = 1000. SANS+greedy was run with W = H = 2000 and the top-2000 hits were reranked by greedy alignment score, keeping the top-1000. KSEARCH(k) were run with word length k
Fig. 6.
Fig. 6.
Distribution of AUC values from 4173 query-specific precision-recall plots of the genome benchmark. The top-1000 hits per query were evaluated. AUC values are binned into 10 evenly spaced intervals between one (bottom, perfect result) and zero (gray at top, total failure). KSEARCH runs are labeled with word size (k). SANS runs are labeled with the window size (w). SANS(1000) has the highest average

References

    1. Altschul S.F., et al. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. - PubMed
    1. Bejerano G., Yona G. The Proceedings of RECOMB 1999. Lyon, France: ACM Press; 1999. Modeling protein families using probabilistic suffix trees; pp. 15–24.
    1. Burkhard S., et al. RECOMB'99 Proceedings of the third annual international conference on Computational molecular biology. Lyon, France: 1999. q-gram based database searching using a suffix array (QUASAR) pp. 77–83.
    1. Califano A., Rigoutsos I. FLASH: A fast look-up algorithm for string homology. In: Hunter L., et al., editors. Proceedings of the first International Conference on Intelligent Systems for Molecular Biology. Bethesda, Maryland, USA: 1993. pp. 56–64. - PubMed
    1. Devos D., Valencia A. Practical limits of function prediction. Proteins. 2000;41:98–07. - PubMed

Publication types

Substances