. 2012 Sep 15;28(18):i438-i443.

doi: 10.1093/bioinformatics/bts417.

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

J Patrik Koskinen¹, Liisa Holm

Affiliations

PMID: 22962464
PMCID: PMC3436844
DOI: 10.1093/bioinformatics/bts417

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

J Patrik Koskinen et al. Bioinformatics. 2012.

. 2012 Sep 15;28(18):i438-i443.

doi: 10.1093/bioinformatics/bts417.

Authors

J Patrik Koskinen¹, Liisa Holm

Affiliation

¹ Department of Biosciences, Division of Genetics. University of Helsinki, 00014 Helsinki, Finland.

PMID: 22962464
PMCID: PMC3436844
DOI: 10.1093/bioinformatics/bts417

Abstract

Motivation: The genomic era in molecular biology has brought on a rapidly widening gap between the amount of sequence data and first-hand experimental characterization of proteins. Fortunately, the theory of evolution provides a simple solution: functional and structural information can be transferred between homologous proteins. Sequence similarity searching followed by k-nearest neighbor classification is the most widely used tool to predict the function or structure of anonymous gene products that come out of genome sequencing projects.

Results: We present a novel word filter, suffix array neighborhood search (SANS), to identify protein sequence similarities in the range of 50-100% identity with sensitivity comparable to BLAST and 10 times the speed of USEARCH. In contrast to these previous approaches, the complexity of the search is proportional only to the length of the query sequence and independent of database size, enabling fast searching and functional annotation into the future despite rapidly expanding databases.

Availability and implementation: The software is freely available to non-commercial users from our website http://ekhidna.biocenter.helsinki.fi/downloads/sans.

Contact: liisa.holm@helsinki.fi.

PubMed Disclaimer

Figures

**Fig. 1.**
A classification of approaches used by representative application programs for protein sequence retrieval. This work focuses on word filters based on suffix arrays or k-mer counting. Related approaches or data structures are shown in gray and not expanded. SANS is simple and fast but has a limited application range. Other fast programs calculate explicit alignments for a filtered subset of database proteins. SSEARCH calculates the optimal alignment between the query and all database proteins

**Fig. 2.**
Problem formulations with typical application domains in *italics*

**Fig. 3.**
Schematic illustration of suffix array (SA) and inverse suffix array (ISA) on text ‘BANANAS$’. For example, the suffixes ‘ANANAS$’, ‘ANAS$’ and ‘AS$’ are adjacent in the lexically ordered suffix array

**Fig. 4.**
Correctness of mappings in metagenome benchmark. The top hit of 3.6 million queries were evaluated using BLAST (e-value < 1) as reference of truth. USEARCH was run with default parameters (a = 1, r = 8). SANS was run with window size 1. KSEARCHx were run with word size x

**Fig. 5.**
Relative sensitivity of SSEARCH, BLAST, SANS and KSEARCH on the genome benchmark. The top-1000 hits per query are evaluated. Sensitivity is calculated for different bins of sequence identity. Sensitivities within each bin are scaled linearly so that BLAST sensitivity is one. Sequence identities of the TP hits were taken from SSEARCH. SSEARCH and BLAST hits are sorted by e-value. SANS was run with W = H = 1000. SANS+greedy was run with W = H = 2000 and the top-2000 hits were reranked by greedy alignment score, keeping the top-1000. KSEARCH(k) were run with word length k

**Fig. 6.**
Distribution of AUC values from 4173 query-specific precision-recall plots of the genome benchmark. The top-1000 hits per query were evaluated. AUC values are binned into 10 evenly spaced intervals between one (bottom, perfect result) and zero (gray at top, total failure). KSEARCH runs are labeled with word size (k). SANS runs are labeled with the window size (w). SANS(1000) has the highest average

See this image and copyright information in PMC

References

1. Altschul S.F., et al. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. - PubMed
1. Bejerano G., Yona G. The Proceedings of RECOMB 1999. Lyon, France: ACM Press; 1999. Modeling protein families using probabilistic suffix trees; pp. 15–24.
1. Burkhard S., et al. RECOMB'99 Proceedings of the third annual international conference on Computational molecular biology. Lyon, France: 1999. q-gram based database searching using a suffix array (QUASAR) pp. 77–83.
1. Califano A., Rigoutsos I. FLASH: A fast look-up algorithm for string homology. In: Hunter L., et al., editors. Proceedings of the first International Conference on Intelligent Systems for Molecular Biology. Bethesda, Maryland, USA: 1993. pp. 56–64. - PubMed
1. Devos D., Valencia A. Practical limits of function prediction. Proteins. 2000;41:98–07. - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions
Actions

Substances

Actions

LinkOut - more resources

Full Text Sources
Research Materials
- NCI CPTC Antibody Characterization Program
Miscellaneous
- NCI CPTAC Assay Portal

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

Affiliation

SANS: high-throughput retrieval of protein sequences allowing 50% mismatches

Authors

Affiliation

Abstract

Figures

References

Publication types

MeSH terms

Substances

LinkOut - more resources

Full Text Sources

Research Materials

Miscellaneous