Nucleic Acids Res. 2018 Nov 16;46(20):e120.
doi: 10.1093/nar/gky677.

Umap and Bismap: quantifying genome and methylome mappability

Mehran Karimzadeh et al. Nucleic Acids Res.

Abstract

Short-read sequencing enables assessment of genetic and biochemical traits of individual genomic regions, such as the location of genetic variation, protein binding and chemical modifications. Every region in a genome assembly has a property called 'mappability', which measures the extent to which it can be uniquely mapped by sequence reads. In regions of lower mappability, estimates of genomic and epigenomic characteristics from sequencing assays are less reliable. These regions have increased susceptibility to spurious mapping from reads from other regions of the genome with sequencing errors or unexpected genetic variation. Bisulfite sequencing approaches used to identify DNA methylation exacerbate these problems by introducing large numbers of reads that map to multiple regions. Both to correct assumptions of uniformity in downstream analysis and to identify regions where the analysis is less reliable, it is necessary to know the mappability of both ordinary and bisulfite-converted genomes. We introduce the Umap software for identifying uniquely mappable regions of any genome. Its Bismap extension identifies mappability of the bisulfite-converted genome. A Umap and Bismap track hub for human genome assemblies GRCh37/hg19 and GRCh38/hg38, and mouse assemblies GRCm37/mm9 and GRCm38/mm10 is available at https://bismap.hoffmanlab.org for use with genome browsers.

Figures

Figure 1.
Mappability of the genome by Umap. (A) The Umap workflow identifies all unique k-mers of a genome given a read length of k. (B) Mappability of the human genome and methylome for read lengths between 24 and 100. (C) All of the uniquely mappable reads in two regions with high and low multi-read mappability are shown. In Case 1 (blue), all possible reads covering the region are uniquely mappable. In Case 2 (magenta), only two reads out of 10 are uniquely mappable.
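
The two mappability measures in this caption can be illustrated on a toy sequence. The following is a minimal sketch, not the Umap implementation: it assumes that a position is single-read mappable when at least one unique k-mer overlaps it, and that multi-read mappability is the fraction of overlapping k-mers that are unique (consistent with "two reads out of 10" in panel C). It considers the forward strand only and uses exact string matching in place of an aligner; all function names are hypothetical.

```python
from collections import Counter


def unique_kmer_starts(genome: str, k: int) -> list[bool]:
    """For each k-mer start position, report whether that k-mer occurs exactly once."""
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return [counts[kmer] == 1 for kmer in kmers]


def single_read_mappability(unique: list[bool], k: int, length: int) -> list[bool]:
    """Assumed definition: a position is mappable if any unique k-mer overlaps it."""
    covered = [False] * length
    for start, is_unique in enumerate(unique):
        if is_unique:
            for pos in range(start, start + k):
                covered[pos] = True
    return covered


def multi_read_mappability(unique: list[bool], k: int, length: int) -> list[float]:
    """Assumed definition: fraction of the k-mers overlapping a position that are unique."""
    scores = []
    for pos in range(length):
        first = max(0, pos - k + 1)          # leftmost k-mer start that still covers pos
        last = min(pos, len(unique) - 1)     # rightmost k-mer start that covers pos
        overlapping = unique[first:last + 1]
        scores.append(sum(overlapping) / len(overlapping))
    return scores


# Toy example: the 4-mer "ACGT" occurs twice, so reads starting at 0 and 4 are ambiguous.
toy_genome = "ACGTACGTTT"
toy_unique = unique_kmer_starts(toy_genome, 4)
toy_single = single_read_mappability(toy_unique, 4, len(toy_genome))
toy_multi = multi_read_mappability(toy_unique, 4, len(toy_genome))
```

In this toy case every position except the first is single-read mappable, while multi-read mappability varies from 0 to 1 depending on how many of the overlapping 4-mers are unique.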
Figure 2.
Mappability of the methylome by Bismap. Bismap identifies uniquely mappable k-mers of a bisulfite-converted genome. It simulates the same changes that may occur in bisulfite treatment on the + strand (C→T) and − strand (G→A). To account for sequence of the − strand, we generate an extra set of reverse-complemented chromosomes and then simulate bisulfite conversion on these chromosomes. We do not simulate reverse complementation after bisulfite conversion, because the experimental protocol does not involve post-conversion DNA amplification. We then align k-mers by disabling complement search and combine the resulting data to quantify the mappability of a bisulfite-converted genome.
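
The conversion steps described in this caption can be sketched on a toy sequence. This is a simplified illustration under stated assumptions, not the Bismap code: it assumes full conversion of every C, counts exact k-mer matches instead of running an aligner, and treats a k-mer as unique when it occurs once across the two converted sequences combined. The function names are hypothetical, and positions reported for the − strand index into the reverse-complemented sequence.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def bisulfite_convert(seq: str) -> str:
    """Simulate full bisulfite conversion (every C read as T)."""
    return seq.replace("C", "T")


def bismap_unique_kmers(genome: str, k: int) -> dict[str, list[bool]]:
    """Flag unique k-mers on the converted + strand and converted - strand."""
    converted = {
        "+": bisulfite_convert(genome),
        # Reverse-complement first, then convert; in forward coordinates this
        # corresponds to a G-to-A change on the original sequence.
        "-": bisulfite_convert(reverse_complement(genome)),
    }
    counts = Counter()
    for seq in converted.values():
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # No reverse-complement search: k-mers are only compared as written.
    return {
        strand: [counts[seq[i:i + k]] == 1 for i in range(len(seq) - k + 1)]
        for strand, seq in converted.items()
    }


toy_result = bismap_unique_kmers("AACG", 2)
```

For the toy input, the converted sequences are "AATG" and "TGTT"; the 2-mer "TG" then appears in both, so it is no longer unique.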
Figure 3.
Mappability of ChIP-seq peaks in 1193 ENCODE datasets. (A) Single-read mappability and (B) multi-read mappability for narrow peaks identified in ENCODE ChIP-seq datasets. (C) An NRF1 narrow peak identified by MACS (purple) that is not uniquely mappable in the experiment with read length of 36 bp. The red bar in peaks indicates the summit. Signal tracks (gray) show two different replicates of this ChIP-seq experiment in K562 chronic myeloid leukemia cells (ENCODE accessions ENCSR000EHH and ENCSR494TDU, with read lengths of 36 and 100 bp, respectively). Umap tracks show single-read and multi-read mappability for two different read lengths of 36 and 100 bp.
Figure 4.
Mappability of the CpG island annotations. (A) Single-read mappability and (B) multi-read mappability of CpG islands, CpG shores, CpG shelves and CpG resorts for a variety of read lengths. For comparison, asterisks indicate the average mappability of the whole genome at each read length. (C) A CpG island that is not uniquely mappable with a read length of 100 bp by Umap and Bismap. In Bismap single-read mappability tracks, chevrons pointing right indicate mappability of the + strand and chevrons pointing left indicate mappability of − strand. Multi-read mappability is calculated based on reads that are uniquely mappable on both + strand and − strand.
Figure 5.
Mappability of differentially methylated regions of mouse mammary basal and luminal alveolar tissues. (A) Single-read and (B) multi-read mappability of differentially methylated regions. (C) A differentially methylated region identified with 50-nt sequencing reads that are not uniquely mappable (purple). None of the sequencing reads that overlap this differentially methylated region uniquely map to the bisulfite-converted genome, although they all map uniquely to the unmodified genome.
Figure 6.
Mappability of targeted methylation assays. Multi-read mappability of probes in (A) the Illumina Infinium HumanMethylation27 (27K) BeadChip and (B) the Illumina Infinium HumanMethylation450 (450K) BeadChip. (C) Multi-read mappability of CpG dinucleotides found in DiseaseMeth RRBS datasets.
Figure 7.
Limitations of paired-end sequencing. (A) Empirical cumulative distribution function of the length of a region mapped by paired-end reads in an experiment with 150 bp paired-end sequencing (ENCFF721VIZ). The plotted curve shows the proportion of regions (y-axis) that are shorter than some length (x-axis). This shows that 87.5% of mapped fragments are smaller than 300 bp. (B) Number of transcript components not uniquely mappable with 400-mers. (C) Number of RepeatMasker repeat elements not uniquely mappable with 400-mers. LTR, long terminal repeat; RC, rolling circle; rRNA, ribosomal RNA; scRNA, small conditional RNA; snRNA, small nuclear RNA; srpRNA, signal recognition particle RNA; tRNA, transfer RNA.
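
The empirical cumulative distribution function in panel A reports, for each length, the proportion of mapped regions shorter than that length. A minimal sketch of that calculation follows; the fragment lengths are made-up toy values, not data from ENCFF721VIZ, and the function name is hypothetical.

```python
from bisect import bisect_left


def proportion_shorter(lengths: list[int], threshold: int) -> float:
    """Empirical proportion of fragments strictly shorter than the threshold."""
    ordered = sorted(lengths)
    # bisect_left returns how many sorted values fall strictly below the threshold.
    return bisect_left(ordered, threshold) / len(ordered)


# Hypothetical fragment lengths in bp, for illustration only.
toy_lengths = [120, 180, 240, 310, 400]
toy_fraction = proportion_shorter(toy_lengths, 300)
```

Evaluating the function over a range of thresholds traces out the curve shown in the figure.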
Figure 8.
Bland–Altman density plots comparing Umap and GEM-mappability scores. For several read lengths (10, 100, 1000, 10 000 and 100 000) and k-mer sizes (24, 36, 50 and 100), we randomly selected 2400 regions (100 regions from each chromosome) and compared GEM-mappability with (A) Umap multi-read mappability and (B) Umap single-read mappability.
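
A Bland–Altman comparison plots, for each region, the difference between two scores against their mean. The sketch below computes those coordinates and the mean difference (bias) for paired scores; the input values are hypothetical placeholders, not Umap or GEM output, and the function name is an illustrative choice.

```python
from statistics import mean


def bland_altman_points(a: list[float], b: list[float]) -> tuple[list[float], list[float], float]:
    """Return per-pair means, per-pair differences (a - b), and the mean difference."""
    if len(a) != len(b):
        raise ValueError("score lists must be paired and of equal length")
    means = [(x + y) / 2 for x, y in zip(a, b)]
    diffs = [x - y for x, y in zip(a, b)]
    return means, diffs, mean(diffs)


# Hypothetical per-region scores from two methods, for illustration only.
toy_first = [1.0, 0.5, 0.0]
toy_second = [1.0, 0.25, 0.0]
toy_means, toy_diffs, toy_bias = bland_altman_points(toy_first, toy_second)
```

A density plot of the differences against the means then shows where the two scores agree and where they diverge.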
