Proc Natl Acad Sci U S A. 2009 Mar 10;106(10):3770-5.
doi: 10.1073/pnas.0810767106. Epub 2009 Feb 20.

Sequence context-specific profiles for homology searching


A Biegert et al. Proc Natl Acad Sci U S A. 2009.

Abstract

Sequence alignment and database searching are essential tools in biology because a protein's function can often be inferred from homologous proteins. Standard sequence comparison methods use substitution matrices to find the alignment with the best sum of similarity scores between aligned residues. These similarity scores do not take the local sequence context into account. Here, we present an approach that derives context-specific amino acid similarities from short windows centered on each query sequence residue. Our results demonstrate that the sequence context contains much more information about the expected mutations than just the residue itself. By employing our context-specific similarities (CS-BLAST) in combination with NCBI BLAST, we increase the sensitivity more than 2-fold on a difficult benchmark set, without loss of speed. Alignment quality is likewise improved significantly. Furthermore, we demonstrate considerable improvements when applying this paradigm to sequence profiles: Two iterations of CSI-BLAST, our context-specific version of PSI-BLAST, are more sensitive than 5 iterations of PSI-BLAST. The paradigm for biological sequence comparison presented here is very general. It can replace substitution matrices in sequence- and profile-based alignment and search methods for both protein and nucleotide sequences.
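As a point of reference for the standard approach the abstract describes, the following is a minimal sketch of local alignment by dynamic programming (Smith-Waterman with a linear gap penalty). The scoring function `toy_score` and the gap value are illustrative assumptions, not a real substitution matrix such as BLOSUM62. Note that `score(a, b)` sees only the two aligned residues, never their neighbors, which is the limitation the context-specific approach addresses.

```python
def local_alignment_score(query, target, score, gap=-4):
    """Best local alignment score (Smith-Waterman, linear gap penalty).

    `score(a, b)` returns the similarity of residues a and b. It depends
    only on the residue pair, not on the surrounding sequence context.
    """
    prev = [0] * (len(target) + 1)
    best = 0
    for a in query:
        curr = [0]
        for j, b in enumerate(target, start=1):
            cell = max(
                0,                          # start a new local alignment
                prev[j - 1] + score(a, b),  # align a with b
                prev[j] + gap,              # gap in target
                curr[j - 1] + gap,          # gap in query
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def toy_score(a, b):
    """Hypothetical scoring: +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


identical = local_alignment_score("ACGT", "ACGT", toy_score)
embedded = local_alignment_score("XXACGTXX", "ACGT", toy_score)
unrelated = local_alignment_score("AAA", "TTT", toy_score)
```

A context-specific method would replace `score(a, b)` with a score that also depends on the residues surrounding `a` in the query.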

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig. 1.
Method of context-specific sequence comparison. (A) Sequence search/alignment algorithms find the path that maximizes the sum of similarity scores (color-coded blue to red). Substitution matrix scores are equivalent to profile scores if the sequence profile (colored histogram) is generated from the query sequence by adding artificial mutations with the substitution matrix pseudocount scheme. Histogram bar heights represent the fraction of amino acids in profile columns. (B) Computation of context-specific pseudocounts. The expected mutations (i.e., pseudocounts) for a residue (highlighted in yellow) are calculated based on the sequence context around it (red box). Library profiles contribute to the context-specific sequence profile with weights determined by their similarity to the sequence context (see percentages). The resulting profile can be used to jump-start PSI-BLAST, which will then perform a sequence-to-sequence search with context-specific amino acid similarities. (C) Positional window weights are chosen to decrease exponentially with the distance from the center position to model the decreasing information value of farther positions for the central profile column.
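The computation in panels B and C can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the two-letter alphabet, the two library profiles, the priors, and the values of `w_center` and `beta` are all hypothetical. Each library profile is weighted by how well it explains the query window, with window positions down-weighted exponentially by their distance from the center, and the central columns of the library profiles are then mixed with those weights.

```python
import math


def window_weights(half_width, w_center=1.0, beta=0.5):
    """Positional weights that decay exponentially away from the center."""
    return [w_center * beta ** abs(j) for j in range(-half_width, half_width + 1)]


def context_pseudocounts(window, library, weights):
    """Mix the central columns of library profiles for one query window.

    window  : string of odd length W (the sequence context)
    library : list of (prior, columns); columns is a list of W dicts
              mapping residue -> probability
    weights : W positional weights, e.g. from window_weights()
    Returns (profile weights, expected-mutation profile for the center).
    """
    log_scores = []
    for prior, columns in library:
        s = math.log(prior)
        for w, col, res in zip(weights, columns, window):
            s += w * math.log(col[res])
        log_scores.append(s)
    # Normalize in log space for numerical stability.
    top = max(log_scores)
    unnorm = [math.exp(s - top) for s in log_scores]
    total = sum(unnorm)
    post = [u / total for u in unnorm]

    center = len(window) // 2
    alphabet = library[0][1][center].keys()
    profile = {
        a: sum(p * cols[center][a] for p, (_, cols) in zip(post, library))
        for a in alphabet
    }
    return post, profile


# Hypothetical two-profile library over a toy alphabet {A, P}.
proline_rich = [{"A": 0.2, "P": 0.8}] * 3
alanine_rich = [{"A": 0.8, "P": 0.2}] * 3
library = [(0.5, proline_rich), (0.5, alanine_rich)]

post, profile = context_pseudocounts("PPP", library, window_weights(1))
```

For the window `PPP`, the proline-rich profile receives most of the weight, so the resulting central column favors proline more strongly than an equal mixture of the two profiles would.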
Fig. 2.
Context information improves search performance and alignment quality. (A) Homology detection benchmark on SCOP20 dataset: true positives (pairs from the same SCOP superfamily) versus false positives (pairs from different folds). CS-BLAST detects 138% more true positives than BLAST at 10% error rate. (B) CS-BLAST has better average alignment sensitivity and precision than BLAST over the entire range of sequence identities of the aligned pairs. (C) Actual versus reported E-values on the SCOP20 dataset show that CS-BLAST E-values are too optimistic by a factor of 3 to 5. (D) Same benchmark as A (note different y-scales), but comparing CSI-BLAST with PSI-BLAST for one to five iterations. Two CSI-BLAST iterations are more sensitive than five PSI-BLAST iterations.
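To make the "true positives at 10% error rate" statistic concrete, here is a minimal sketch of how such a number could be read off a ranked hit list. It assumes the error rate is defined as FP / (TP + FP) over the hits accepted so far; that definition, the function names, and the example data are assumptions for illustration and are not taken from the benchmark itself.

```python
def true_positives_at_error_rate(ranked_labels, max_error_rate=0.10):
    """Largest number of true positives reachable while the cumulative
    error rate FP / (TP + FP) stays at or below max_error_rate.

    ranked_labels: iterable of booleans, best-scoring hit first;
                   True = same superfamily, False = different fold.
    """
    tp = fp = best = 0
    for is_true in ranked_labels:
        if is_true:
            tp += 1
        else:
            fp += 1
        if fp / (tp + fp) <= max_error_rate:
            best = tp
    return best


def percent_more(new, old):
    """Relative gain in percent; 138% more means 2.38 times as many."""
    return 100 * (new - old) / old


# Made-up ranked list: 9 true hits, 1 false, 5 true, 3 false.
labels = [True] * 9 + [False] + [True] * 5 + [False] * 3
tp_at_10 = true_positives_at_error_rate(labels)
gain = percent_more(238, 100)
```

Read this way, a gain of 138% corresponds to a factor of 2.38, consistent with the "more than 2-fold" sensitivity increase stated in the abstract.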
Fig. 3.
Proline-rich region in human transcription factor SOX-9. The mutation profile computed with substitution matrix pseudocounts (Left) overestimates the conservation in this region. The context-specific profile (Right) shows weaker conservation of prolines, alanines, and glutamines, and increased presence of these residues in neighboring columns.
Fig. 4.
Computation of the library of context profiles representing local sequence contexts. From a database (NR30) of 1.5M groups of aligned sequences covering the NR database, we select the 50,000 most diverse alignments and enrich these with homologs from a single BLAST search. The alignments are converted to sequence profiles and 1M profile windows are randomly sampled and used to train K context profiles (K = 500, 1,000, 2,000, 4,000) with the expectation maximization algorithm.
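The training step can be illustrated with a heavily simplified sketch: expectation maximization for a mixture of K multinomial profiles fitted to count vectors. Unlike the context profiles in the caption, each profile here is a single column rather than a multi-column window, and the toy data, pseudocount, seed, and iteration count are assumptions chosen only to keep the example small. It should not be read as the authors' training procedure.

```python
import math
import random


def em_mixture(counts, k, iterations=50, seed=0, pseudo=1e-3):
    """Fit k multinomial profiles to count vectors with EM.

    counts: list of equal-length lists of non-negative counts.
    Returns (priors, profiles); each profile sums to 1.
    """
    rng = random.Random(seed)
    n_sym = len(counts[0])
    profiles = []
    for _ in range(k):
        raw = [rng.random() + 0.5 for _ in range(n_sym)]
        total = sum(raw)
        profiles.append([r / total for r in raw])
    priors = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each profile for each count vector.
        resp = []
        for c in counts:
            logs = [
                math.log(priors[j])
                + sum(n * math.log(p) for n, p in zip(c, profiles[j]))
                for j in range(k)
            ]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate priors and profiles from weighted counts.
        for j in range(k):
            weight = sum(r[j] for r in resp)
            priors[j] = (weight + pseudo) / (len(counts) + k * pseudo)
            raw = [
                pseudo + sum(r[j] * c[a] for r, c in zip(resp, counts))
                for a in range(n_sym)
            ]
            total = sum(raw)
            profiles[j] = [x / total for x in raw]
    return priors, profiles


# Toy data: two clearly separated kinds of columns over a 2-letter alphabet.
data = [[9, 1]] * 10 + [[1, 9]] * 10
priors, profiles = em_mixture(data, k=2)
```

On this toy input the two learned profiles should settle near the two kinds of columns in the data, one dominated by the first symbol and one by the second.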
