Nucleic Acids Res. 2010 Apr;38(7):2177-89.
doi: 10.1093/nar/gkp1219. Epub 2010 Jan 11.

Homologous over-extension: a challenge for iterative similarity searches

Mileidy W Gonzalez et al. Nucleic Acids Res. 2010 Apr.

Abstract

We have characterized a novel type of PSI-BLAST error, homologous over-extension (HOE), using embedded Pfam domain queries in searches against a reference library containing Pfam-annotated UniProt sequences and random synthetic sequences. PSI-BLAST makes two types of errors: alignments to non-homologous regions, and HOE alignments that begin in a homologous region but extend beyond the homology into neighboring sequence regions. When the neighboring sequence region contains a non-homologous domain, PSI-BLAST can incorporate the unrelated sequence into its position-specific scoring matrix, which then finds non-homologous proteins with significant expectation values. HOE accounts for the largest fraction of the initial false positive (FP) errors, and the largest fraction of FPs at iteration 5. In searches against complete protein sequences, 5-9% of alignments at iteration 5 are non-homologous. HOE frequently begins in a partial protein domain; when partial domains are removed from the library, HOE errors decrease from 16 to 3% of weighted coverage (hard queries; from 35 to 5% for sampled queries) and no-error searches increase from 2 to 58% of weighted coverage (hard queries; from 16 to 78% for sampled queries). When HOE is reduced by not extending previously found sequences, PSI-BLAST specificity improves 4-8-fold, with little loss in sensitivity.


Figures

Figure 1.
True positive and two types of PSI-BLAST errors. PSI-BLAST alignments are classified after comparing the alignment boundaries to the embedding boundaries in the query sequence, and to the annotated domain boundaries in the library sequence. (A) TP—an alignment is classified as a TP if at least 50% of the aligned residues overlap the Pfam annotation. Two types of FP errors can occur: (B) non-homologous FP alignments (NH-FP) and (C) homologous over-extension (HOE) FPs. Non-homologous FPs map entirely to random (and thus unrelated) sequences (B, left), to non-domain regions of the protein (B, middle), or to regions covered by unrelated Pfam domains (B, right). Homologous over-extension FPs occur when two homologous domains align, but the alignments overextend, so that more than 50% of the alignment is outside the homologous region. Both the library domain (C, left and middle) and the query domain (C, right) can overextend.
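The classification rule in this caption can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names, the inclusive coordinate convention, and the tuple layout for annotations are all assumptions, and the sketch looks only at the library side (it does not distinguish HOE-L from HOE-Q or handle homologous families that carry different Pfam accessions).

```python
def overlap(a_start, a_end, b_start, b_end):
    """Number of residues shared by two inclusive intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def classify_alignment(aln_start, aln_end, domains, query_family):
    """Classify an alignment on a library sequence as TP, HOE, or NH-FP.

    domains: list of (family, start, end) annotations on the library
    sequence (hypothetical layout; same-family annotations are assumed
    not to overlap each other).
    """
    aln_len = aln_end - aln_start + 1
    # Aligned residues that fall inside a domain homologous to the query.
    homologous = sum(
        overlap(aln_start, aln_end, start, end)
        for family, start, end in domains
        if family == query_family
    )
    if 2 * homologous >= aln_len:
        return "TP"      # at least 50% of the alignment is homologous
    if homologous > 0:
        return "HOE"     # starts in the homolog, but mostly outside it
    return "NH-FP"       # no overlap with a homologous domain at all


# Hypothetical library sequence with two annotated domains.
annotations = [("PF00668", 1, 80), ("PF00550", 90, 200)]
print(classify_alignment(10, 109, annotations, "PF00668"))
print(classify_alignment(60, 159, annotations, "PF00668"))
print(classify_alignment(100, 150, annotations, "PF00668"))
```

In the second call only 21 of 100 aligned residues lie inside the homologous domain, so the alignment is over-extended rather than simply wrong, which is the distinction panels (B) and (C) draw.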
Figure 2.
Distribution of initial alignment errors. The expectation values for the first FP errors are shown for each of 50 hard (A and B) or randomly-sampled (C and D) searches, classified by error type (i.e. HOE-L, HOE-Q and NH). FP E()-values are plotted from lowest (most-significant) to highest in each of the four panels; thus, query families are ordered differently in each panel. The iteration number for the first independent FP type (when two FP types occur for a query in the same iteration, the less significant FP is not considered independent) with the lowest E()-value (expectation) for each error type is also shown. (A) FP E()-values and error types with embedded hard queries against the standard domain library. (B) Searches with non-embedded hard queries against the long-domain library. (C) Searches with the embedded randomly sampled queries against the standard domain library [the E()-value for the lowest HOE-L first FP in this panel is E() = 5 × 10^−70, but it is graphed at E() = 10^−50]. (D) Searches with non-embedded randomly sampled queries against the long-domain library.
Figure 3.
Iterative growth of a homologous over-extension. (A) The raw PSI-BLAST output of a search querying a PF00668 embedded domain against the standard curated Pfam library at iteration 5. The portion of the alignment that contains the PF00668 homologous domain is shown in blue, while the over-extension onto the structurally unrelated PF00550 domain is shown in red. (B) A diagram that tracks the progression of the alignments shown in (A) from iterations 2 through 5. The alignment on the partial PF00668 domain begins as the first FP in the search at iteration 3, and continues to overextend further onto the unrelated PF00550 domain (in red) in subsequent iterations. By iteration 5, the entire unrelated PF00550 domain is covered by the overextended alignment.
Figure 4.
Error-free family coverage before the first FP. (A–D) show the weighted fraction of family coverage for each FP type on two datasets, hard families (A and B) and sampled families (C and D). The FP types are HOE on the library sequence (HOE-L, filled down triangles), HOE on the query sequence (HOE-Q, filled up triangles) and non-homologous error (non-hom, filled diamonds). Performance against the standard library (all domains, panels A and C) and the long-domain library (B and D) is shown. Coverage for searches that converged without any error (no error) is plotted with an ‘X’. Results are weighted so that each of the 50 searches contributes 2% of the coverage. The weighted coverage of each search was calculated as $\frac{1}{50} \times \frac{E_{f,t}}{FP_f} \times \frac{TP_f}{N_f}$, where $E_{f,t}$ is the number of errors of type $t$ (NH, HOE-Q, HOE-L) in family $f$; $FP_f$ is the total number of FP errors for the family in the error iteration; $TP_f$ is the number of TPs found before the first error iteration; and $N_f$ is the total family size in the complete or long-domain database. Thus, a search that achieves 50% family coverage before the first FP and had an equal number of HOE-L and NH errors would contribute 0.005 on both the HOE-L and the NH curves. The total family coverage in the iteration before the first FP for each search is also shown (open squares).
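The worked example in this caption (50% coverage, equal numbers of HOE-L and NH errors, 0.005 on each curve) is consistent with a per-search weight that multiplies three terms: the 2% share of each search, the fraction of first-iteration errors of a given type, and the family coverage before the first error. The sketch below assumes that product form; the function and parameter names are hypothetical and not taken from the paper.

```python
def weighted_coverage(errors_of_type, total_fp, tp_before_error,
                      family_size, n_searches=50):
    """Contribution of one search to the curve for one error type.

    Assumed form: (1 / n_searches) * (errors of this type / all FPs in
    the error iteration) * (TPs before the first error / family size).
    """
    search_weight = 1.0 / n_searches            # each search counts 2%
    type_share = errors_of_type / total_fp      # split among error types
    coverage = tp_before_error / family_size    # error-free coverage
    return search_weight * type_share * coverage


# Caption's example: 50% coverage, errors split evenly between
# HOE-L and NH (here 3 of 6 each; the counts are made up).
hoe_l = weighted_coverage(errors_of_type=3, total_fp=6,
                          tp_before_error=50, family_size=100)
nh = weighted_coverage(errors_of_type=3, total_fp=6,
                       tp_before_error=50, family_size=100)
print(round(hoe_l, 6), round(nh, 6))
```

Under this assumed form, the contributions of all error types for one search sum to 2% times that search's error-free coverage, so the per-type curves partition the total coverage shown by the open squares.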
Figure 5.
Family coverage at iteration 5. The frequency of TPs and FPs found at iteration 5, weighted by the number of alignments from each search, is shown. Both hard (A, B) and randomly sampled (C, D) families were tested against the standard library (all domains, A, C) and the long-domain library (B, D). The weighted FP frequency at iteration 5 is [formula image], where [formula image] is the number of FPs of the specified type (HOE-Q, HOE-L, NH, or total errors) in family f at iteration 5, [formula image] is the total number of FPs at iteration 5, and [formula image] is the number of TPs found for the family at iteration 5. Filled squares plot the total weighted coverage of all three error types: HOE-Q (up-triangle), HOE-L (down-triangle) and NH (diamond). Total family coverage (open squares) is defined as [formula image], where [formula image] is the total number of homologs in the family. With this weighting, a family that finds all of its homologs without any errors will contribute 0.02 to the coverage; a family that finds half of its homologs and an equal number of non-homologs will get a weighted frequency of (0.02 × 0.333) for the HOE-L, HOE-Q or non-hom. error type. For this figure and Figure 7, an HOE-L or HOE-Q alignment is counted both as a TP, reflecting the homologous alignment, and as a FP, because more than half of the alignment is outside the homologous domain; NH alignments are counted only as FPs.
Figure 6.
Difference between tree coverage and family coverage by iteration. A comparison between family and tree coverage for the non-embedded searches of two queries from each of 50 families against the long-domain library. Two queries per family were chosen based on tree location: one domain query from a populated and another from a deserted area of the tree. The number of searches where tree coverage was larger than family coverage is plotted by iteration.
Figure 7.
Sensitivity and specificity of PSI-BLAST and PSI-BLAST noExt. Weighted fractional family coverage for TPs is plotted as a function of weighted fractional FPs at iteration 5 using a threshold of E() < 0.005. (A) Performance of unmodified PSI-BLAST with the −t 1 composition adjustment (long dashed line), unmodified PSI-BLAST with the default −t 2 composition adjustment (short dashed line), and PSI-BLAST noExt (solid line) on the hard queries. (B) Performance of PSI-BLAST −t 1, PSI-BLAST −t 2 and PSI-BLAST noExt on the randomly sampled queries. As in Figures 4 and 5, each family contributes 2% to the fraction of TPs (weighted). On the y-axis, fraction TPs (weighted) is calculated as [formula image], where [formula image] is the number of TPs at iteration 5, and [formula image] is the total number of homologs in family f. Likewise, fraction FPs (weighted) is calculated as [formula image], where [formula image] is the number of FPs at iteration 5. For this figure and Figure 5, an HOE-L or HOE-Q alignment is counted both as a TP and as a FP.

