Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Mar;21(3):487-93.
doi: 10.1101/gr.113985.110. Epub 2011 Jan 5.

Adaptive seeds tame genomic sequence comparison

Affiliations

Adaptive seeds tame genomic sequence comparison

Szymon M Kiełbasa et al. Genome Res. 2011 Mar.

Abstract

The main way of analyzing biological sequences is by comparing and aligning them to each other. It remains difficult, however, to compare modern multi-billionbase DNA data sets. The difficulty is caused by the nonuniform (oligo)nucleotide composition of these sequences, rather than their size per se. To solve this problem, we modified the standard seed-and-extend approach (e.g., BLAST) to use adaptive seeds. Adaptive seeds are matches that are chosen based on their rareness, instead of using fixed-length matches. This method guarantees that the number of matches, and thus the running time, increases linearly, instead of quadratically, with sequence length. LAST, our open source implementation of adaptive seeds, enables fast and sensitive comparison of large sequences with arbitrarily nonuniform composition.

PubMed Disclaimer

Figures

Figure 1.
Figure 1.
Number of exact matches between genomic sequences as a function of match length. (A) Matches of size 7–35 bases between the human X chromosome (151 million bases) and the mouse X chromosome (162 million bases). (B) Matches between the genomes of P. falciparum (23 million bases) and P. yoelii (20 million bases). (C) Matches between the human X chromosome and the mouse X chromosome, of seeds that occur at most 10 times in the mouse X chromosome. (D) Matches between P. falciparum and P. yoelii, of seeds that occur at most 10 times in P. falciparum. Lines show expected frequencies for uniformly random sequences.
Figure 2.
Figure 2.
Dot-plots of matches (without extensions) identified by adaptive and fixed-length seeds when comparing the P. yoelii contig with the region of the P. falciparum genome where the MB2 homologous genes are known to exist. Their exact locations are indicated by the dashed boxes in blue. The colored dots in both panels indicate the number of hits in the plot area. As the frequency threshold f for adaptive seeds increases or the length l for fixed-length seeds decreases, the number of hits increases. Caveat: the area each color appears to occupy does not exactly correlate with the number of hits—for each graph, the color with the lower number of hits is drawn over the other color, and also nearby hits cannot be resolved visually at this resolution.
Figure 3.
Figure 3.
Performance comparison of adaptive seeds (circles) versus fixed-length seeds (crosses). Black denotes results obtained for contiguous seeds and unmasked sequences (l or f parameters are shown next to each data point). Red shows the effect of spaced seeds, subset seeds, or repeat-masking for the sequence alignment of: (A) H. sapiens promoters to the M. musculus genome (and spaced seed 111010010100110); (B) D. melanogaster protein sequences to those of C. elegans (and subset seeds); (C) P. yoelii contigs to the P. falciparum genome (and soft-masking with Tandem Repeats Finder); and (D) short DNA reads from 454 Life Sciences (Roche) GS 20 for A. thaliana against its genome (and soft-masking with WindowMasker).
Figure 4.
Figure 4.
LASTZ and BLASTN use fixed-length seeds and present similar performance to LAST with fixed-length seeds. (Circles) LAST adaptive seeds; (crosses) LAST fixed-length seeds. (A) P. yoelii contigs are aligned to P. falciparum chromosomes using the same score matrix as in Figure 3C. Triangles show performance of LASTZ executed with corresponding parameters for different LASTZ-seed lengths. (B) A simpler match/mismatch scoring scheme is used. Squares present the performance of BLASTN and the numbers correspond to BLASTN-seed lengths.

References

    1. Abouelhoda MI, Kurtz S, Ohlebusch E 2004. Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms 2: 53–86
    1. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402 - PMC - PubMed
    1. Batzer MA, Deininger PL 2002. Alu repeats and human genomic diversity. Nat Rev Genet 3: 370–379 - PubMed
    1. Benson G 1999. Tandem Repeats Finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573–580 - PMC - PubMed
    1. Carlton J, Silva J, Hall N 2005. The genome of model malaria parasites, and comparative genomics. Curr Issues Mol Biol 7: 23–37 - PubMed

Publication types