Bioinformatics. 2018 Sep 1;34(17):i748-i756.

doi: 10.1093/bioinformatics/bty597.

A fast adaptive algorithm for computing whole-genome homology maps

Chirag Jain^{1,2}, Sergey Koren², Alexander Dilthey^{2,3}, Adam M Phillippy², Srinivas Aluru¹

Affiliations

¹ School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA.
² Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
³ Institute of Medical Microbiology, University Hospital of Düsseldorf, Düsseldorf, Germany.

PMID: 30423094
PMCID: PMC6129286
DOI: 10.1093/bioinformatics/bty597

Chirag Jain et al. Bioinformatics. 2018.

Abstract

Motivation: Whole-genome alignment is an important problem in genomics for comparing different species, mapping draft assemblies to reference genomes and identifying repeats. However, for large plant and animal genomes, this task remains compute- and memory-intensive. In addition, current practical methods offer no guarantees on the characteristics of output alignments, which makes them hard to tune for different application requirements.

Results: We introduce an approximate algorithm for computing local alignment boundaries between long DNA sequences. Given a minimum alignment length and an identity threshold, our algorithm computes the desired alignment boundaries and identity estimates using k-mer-based statistics, and maintains sufficient probabilistic guarantees on the output sensitivity. Further, to prioritize higher-scoring alignment intervals, we develop a plane-sweep-based filtering technique that is theoretically optimal and practically efficient. Implementing these ideas resulted in a fast and accurate assembly-to-genome and genome-to-genome mapper. As a result, we were able to map an error-corrected whole-genome NA12878 human assembly to the hg38 human reference genome in about 1 min total execution time and <4 GB memory using eight CPU threads, achieving a significant improvement in memory usage over competing methods. Recall accuracy of computed alignment boundaries was consistently found to be >97% on multiple datasets. Finally, we performed a sensitive self-alignment of the human genome to compute all duplications of length ≥1 Kbp and ≥90% identity. The reported output achieves good recall and covers twice as many bases as the current UCSC browser's segmental duplication annotation.

Availability and implementation: https://github.com/marbl/MashMap.

Figures

**Fig. 1.**
A local alignment depicting the inclusion of a length $l_{0} / 2$ fragment of the query sequence

**Fig. 2.**
Probability of mapping at least one seed fragment for two different error-rate thresholds $\epsilon_{max} = 10\%, 20\%$. As the true error rate $\epsilon$ decreases, the probability values improve, as expected. Similarly, longer alignments spanning more fragments are more likely to be reported. Most importantly, all the sensitivity scores are consistently above 90%. To compute the probability values, the sketch size for MinHash-based Jaccard estimation was assumed to be 200, and the k-mer size was set to 16. These parameter values are internally computed by Mashmap (Jain *et al.*, 2017)
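The MinHash-based Jaccard estimation mentioned in this caption can be illustrated with a minimal sketch. It is not the Mashmap implementation: the function names, the BLAKE2b hash, and the omission of canonical (reverse-complement) k-mers are assumptions made for illustration. The Jaccard-to-identity conversion uses the Mash distance formula, $1 + \frac{1}{k}\ln\frac{2j}{1+j}$, which is assumed here to be the relevant k-mer statistic.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (no canonical k-mers; an assumption)."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def sketch(seq, k=16, s=200):
    """Bottom-s MinHash sketch: the s smallest k-mer hash values."""
    return sorted(kmer_hashes(seq, k))[:s]


def jaccard_estimate(sketch_a, sketch_b, s=200):
    """Estimate the Jaccard similarity from two bottom-s sketches."""
    union = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not union:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union if h in shared) / len(union)


def identity_estimate(j, k=16):
    """Convert a Jaccard estimate j to an identity estimate (Mash distance formula)."""
    if j <= 0.0:
        return 0.0
    return max(0.0, 1.0 + math.log(2.0 * j / (1.0 + j)) / k)
```

With `k=16` and `s=200` as in the caption, two identical sequences give a Jaccard estimate of 1 and therefore an identity estimate of 1, and lower Jaccard values map to lower identity estimates.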

**Fig. 3.**
The left figure is a toy example illustrating line segments that correspond to multiple local alignments between a query and a reference sequence. Each alignment segment is labeled with an alignment score. Suppose we want to filter the best mappings for the query sequence. These segments are laid out as weighted intervals over the query sequence (right figure). In this case, the two intervals marked with a cross are completely subsumed by higher-scoring intervals and will therefore be labeled as redundant by our filtering heuristic
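The filtering idea in this caption can be sketched as follows. This is a simplified quadratic-time illustration, not the plane-sweep algorithm of the paper, and the redundancy rule is an assumed reading of the caption: an interval is kept only if it attains the highest score at some position of the query. The name `filter_best` and the `(start, end, score)` tuple layout are hypothetical.

```python
from typing import List, Tuple

# (start, end, score) over query coordinates, half-open [start, end), start < end
Interval = Tuple[int, int, float]


def filter_best(intervals: List[Interval]) -> List[Interval]:
    """Keep intervals that are top-scoring on at least one stretch of the query."""
    # Consecutive endpoints split the query into elementary segments on which
    # the set of covering intervals does not change.
    points = sorted({p for start, end, _ in intervals for p in (start, end)})
    keep = [False] * len(intervals)
    for left, right in zip(points, points[1:]):
        active = [
            i for i, (start, end, _) in enumerate(intervals)
            if start <= left and right <= end
        ]
        if not active:
            continue
        best = max(intervals[i][2] for i in active)
        for i in active:
            if intervals[i][2] == best:
                keep[i] = True  # top-scoring somewhere, so not redundant
    return [iv for iv, kept in zip(intervals, keep) if kept]


mappings = [(0, 100, 50.0), (10, 40, 20.0), (60, 90, 30.0), (80, 150, 40.0)]
print(filter_best(mappings))  # [(0, 100, 50.0), (80, 150, 40.0)]
```

In this toy input, the intervals scoring 20 and 30 lie entirely under the interval scoring 50 and are dropped, while the interval scoring 40 survives because it is the best mapping on the stretch from 100 to 150.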

**Fig. 4.**
Wall time of Mashmap2 decreases with increasing length or identity thresholds using dataset D3 and eight CPU threads. In this experiment, identity and length thresholds were fixed to 90% and 10 Kbp while varying the other parameter. Memory-usage also follows a similar trend (data not shown)

**Fig. 5.**
Visualization of ≥1 Kbp duplications in the human genome computed using Mashmap2. Alignments are colored based on their lengths: blue 1–5 Kbp, red 5–10 Kbp, black >10 Kbp. The majority of blue and red mappings are due to SINE and LINE repeats, respectively. The right plot is a magnification of ≥1 Kbp duplications within chromosome 7, which is known to be one of the most duplicated human chromosomes. The large clustered duplications in the red circle are associated with Williams-Beuren syndrome (Hillier *et al.*, 2003)

**Fig. 6.**
Recall scores of duplications computed using Mashmap2 against the UCSC segmental duplication database. Recall scores above 90% are consistently achieved on each chromosome. The red dotted line shows the aggregate recall score of 97.15% for the complete genome

**Fig. 7.**
Comparison of genomic coverage between the UCSC Segmental Duplication database and Mashmap2 output alignments. Both methods reported equal coverage (83%) on the mitochondrial chromosome (not shown, to keep the plot legible). Coverage of duplications computed using our method is significantly higher, owing to its exhaustive search of all repeats with ≥1 Kbp length and ≥90% identity without repeat masking
