Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2018 Sep 1;34(17):i748-i756.
doi: 10.1093/bioinformatics/bty597.

A fast adaptive algorithm for computing whole-genome homology maps

Affiliations

A fast adaptive algorithm for computing whole-genome homology maps

Chirag Jain et al. Bioinformatics. .

Abstract

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute and memory intensive. In addition, current practical methods lack any guarantee on the characteristics of output alignments, thus making them hard to tune for different application requirements.

Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using kmer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher scoring alignment intervals, we develop a plane-sweep based filtering technique which is theoretically optimal and practically efficient. Implementation of these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving significant improvement in memory-usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice the number of bases than the current UCSC browser's segmental duplication annotation.

Availability and implementation: https://github.com/marbl/MashMap.

PubMed Disclaimer

Figures

Fig. 1.
Fig. 1.
A local alignment depicting the inclusion of a length l0/2 fragment of the query sequence
Fig. 2.
Fig. 2.
Probability of mapping at least one seed fragment for two different error-rate thresholds εmax=10%,20%. As true error rate ε decreases, the probability values accordingly improve as expected. Similarly, longer alignments spanning more fragments are more likely to be reported. Most importantly, all the sensitivity scores are consistently above 90%. To compute the probability values, sketch size for Minhash based Jaccard estimation was assumed as 200, and the k-mer size was set to 16. These parameter values are internally computed by Mashmap (Jain et al., 2017)
Fig. 3.
Fig. 3.
Left figure is a toy example to illustrate line segments corresponding to multiple local alignments obtained between a query and reference sequence. Each alignment segment is labeled with an alignment score. Suppose we want to filter best mappings for the query sequence. These segments are laid out as weighted intervals over the query sequence (right figure). In the above case, two intervals marked with a cross are completely subsumed by higher scoring intervals, and therefore, will be labeled as redundant by our filtering heuristic
Fig. 4.
Fig. 4.
Wall time of Mashmap2 decreases with increasing length or identity thresholds using dataset D3 and eight CPU threads. In this experiment, identity and length thresholds were fixed to 90% and 10 Kbp while varying the other parameter. Memory-usage also follows a similar trend (data not shown)
Fig. 5.
Fig. 5.
Visualization of ≥1 Kbp duplications in the human genome computed using Mashmap2. Alignments are colored based on their lengths: blue 1–5 Kbp, red 5–10 Kbp, black >10 Kbp. Majority of blue and red mappings occur due to SINEs and LINEs repeats, respectively. Right plot is a magnification of ≥1 Kbp duplications within chromosome 7. Chromosome 7 is known to be one of the most duplicated human chromosomes. Large clustered duplications in red circle are associated with Williams-Beuren syndrome (Hillier et al., 2003)
Fig. 6.
Fig. 6.
Recall scores of duplications computed using Mashmap2 against the UCSC segmental duplication database. Above 90% recall scores are achieved on each chromosome consistently. The red dotted line shows the aggregate recall score of 97.15% for the complete genome
Fig. 7.
Fig. 7.
Comparison of genomic coverage between the UCSC Segmental Duplication database and Mashmap2 output alignments. Both methods reported equal coverage 83% on mitochondrial chromosome (not shown above to keep the plot legible). Coverage of duplications computed using our method is significantly higher, owing to its exhaustive search of all repeats with 1 Kbp length and 90% identity without repeat masking

References

    1. Altschul S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. - PMC - PubMed
    1. Bailey J.A., et al. (2001) Segmental duplications: organization and impact within the current human genome project assembly. Genome Res., 11, 1005–1017. - PMC - PubMed
    1. Bailey J.A., et al. (2002) Recent segmental duplications in the human genome. Science, 297, 1003–1007. - PubMed
    1. Berman P., et al. (1999) Winnowing sequences from a database search. In: Proceedings of the Third Annual International Conference on Computational Molecular Biology. ACM, pp. 50–58. - PubMed
    1. Bray N., et al. (2003) AVID: a global alignment program. Genome Res., 13, 97–102. - PMC - PubMed

Publication types