Review
Curr Protoc Bioinformatics. 2013 Jun:Chapter 3:3.1.1-3.1.8.
doi: 10.1002/0471250953.bi0301s42.

An introduction to sequence similarity ("homology") searching

William R Pearson. Curr Protoc Bioinformatics. 2013 Jun.

Abstract

Sequence similarity searching, typically with BLAST, is the most widely used and most reliable strategy for characterizing newly determined sequences. Sequence similarity searches can identify "homologous" proteins or genes by detecting excess similarity, that is, statistically significant similarity that reflects common ancestry. This unit provides an overview of how homology is inferred from significant similarity, and introduces the other units in this chapter, which provide more detail on effective strategies for identifying homologs.


Figures

Figure 1
The distribution of real and expected similarity scores. The human dual specificity protein phosphatase 12 (DUS12_HUMAN) was compared to 38,114 human RefSeq proteins using the SSEARCH program. The distribution of bit-scores for all 38,114 alignments is shown (squares, □), together with the mathematically expected distribution of normalized similarity scores (z-scores, or standard deviations above and below the mean of 0), which is derived from the size of the database using the extreme-value distribution. The close agreement between the observed and expected distributions reflects the observation that scores for unrelated sequences are indistinguishable from random (mathematically generated) scores, so sequences that share statistically significant similarity can be inferred to be not unrelated, that is, homologous.
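
To make the link between bit-scores, the extreme-value distribution, and database size concrete, the following is a minimal Python sketch. It assumes one common form of the extreme-value statistics for local alignment scores, in which the chance of a single alignment between sequences of lengths m and n reaching bit-score b is 1 - exp(-m n 2^-b), and the expected number of such scores in a search is that probability multiplied by the number of database entries. The function names and the sequence lengths in the usage lines are hypothetical and chosen only for illustration; only the database size (38,114) comes from the caption, and the bit-scores are arbitrary values, not results from the search.

```python
import math


def pairwise_pvalue(bit_score, query_len, subject_len):
    """Probability that one alignment scores >= bit_score by chance.

    Assumes the extreme-value form p = 1 - exp(-m * n * 2**-b).
    """
    expected_hits = query_len * subject_len * 2.0 ** (-bit_score)
    # -expm1(-x) equals 1 - exp(-x) and stays accurate when x is tiny.
    return -math.expm1(-expected_hits)


def database_evalue(bit_score, query_len, subject_len, db_entries):
    """Expected number of chance alignments scoring >= bit_score in a search."""
    return pairwise_pvalue(bit_score, query_len, subject_len) * db_entries


if __name__ == "__main__":
    # Hypothetical lengths; 38,114 is the database size given in the caption.
    for bits in (20.0, 30.0, 40.0, 50.0):
        e_value = database_evalue(bits, query_len=340, subject_len=400,
                                  db_entries=38114)
        print(f"bit-score {bits:5.1f} -> E-value {e_value:.3g}")
```

Under these assumptions each additional bit roughly halves the expected number of chance matches, which is why a score far out in the tail of the expected distribution is taken as evidence that the two sequences are not unrelated.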
