Bioinformatics. 2018 Jul 1;34(13):i202-i210.
doi: 10.1093/bioinformatics/bty264.

AmpUMI: design and analysis of unique molecular identifiers for deep amplicon sequencing


Kendell Clement et al. Bioinformatics.

Abstract

Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments.

Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for determining the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on the frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user-specified parameters for use in downstream analysis.
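As a rough illustration of the design question, the following is a minimal sketch of a birthday-problem calculation, not the model from the paper. It assumes the worst case in which every fragment carries the same allele, so any repeated UMI is a collision, and it assumes UMIs are drawn uniformly from all 4^k sequences of length k. The function names `prob_no_collision` and `min_umi_length` are hypothetical and are not part of AmpUMI.

```python
import math


def prob_no_collision(n: int, k: int) -> float:
    """Probability that n fragments all receive distinct UMIs of length k.

    Assumption (not from the paper): UMIs are drawn uniformly and
    independently from the 4**k possible sequences.
    """
    num_umis = 4 ** k
    if n > num_umis:
        # More fragments than UMIs: a collision is certain.
        return 0.0
    # P = prod_{i=0}^{n-1} (1 - i / num_umis), summed in log space
    # to avoid underflow for large n.
    log_p = math.fsum(math.log1p(-i / num_umis) for i in range(n))
    return math.exp(log_p)


def min_umi_length(n: int, target: float, max_k: int = 30) -> int:
    """Smallest UMI length k whose no-collision probability reaches target."""
    for k in range(1, max_k + 1):
        if prob_no_collision(n, k) >= target:
            return k
    raise ValueError("no UMI length up to max_k reaches the target")


if __name__ == "__main__":
    for k in range(14, 19):
        print(k, prob_no_collision(100_000, k))
```

Under these assumptions the probability rises with k for a fixed number of fragments. Real libraries with several alleles at unequal frequencies would tolerate shorter UMIs than this worst-case bound suggests.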

Availability and implementation: AmpUMI is open-source and freely available at http://github.com/pinellolab/AmpUMI.


Figures

Fig. 1.
(a) Schematic showing utility of UMIs in identifying PCR duplicates. In libraries using UMIs, a short sequence of random nucleotides is added to each DNA fragment before PCR amplification. All PCR products of that read will contain the same UMI. After library sequencing, DNA fragments with the same sequence (shown as square nucleotides on the right part of the read) can be identified as either PCR duplicates or not, based on the UMI sequence (shown as rounded nucleotides on the left part of the read). (b) Outline of a standard experiment utilizing UMI technology. The steps shown in gray are computational processing steps, and are the procedures performed by our software, AmpUMI
Fig. 2.
Association between distortion of allelic frequency and UMI length. Colored bars show simulated allelic fractions of four alleles after deduplication of reads with the same UMI and allele. Simulated samples consisted of n=100 000 reads and were drawn from a population with an allelic diversity given by m=(0.1,0.1,0.3,0.5). Reads were generated using UMIs between 1 bp and 18 bp long. For each UMI length, the average simulated proportion of each allele is shown after removing UMI-allele collisions. One hundred samples were simulated for each UMI length. The right column marked 'Truth' shows the underlying allelic diversity from which the simulated samples were drawn. Dots connected by lines show the allele frequency predicted by our model [Equation (12)] and are in complete concordance with the simulation results. The gray histogram at the bottom of the plot shows the TAFD [Equation (13)] for each UMI length
Fig. 3.
Distribution of collisions in simulated populations. The numbers of collisions in the simulated samples used in Figure 2 were aggregated by UMI length. Boxplots show the median (thick line), interquartile range (box) and the range of the data (whiskers). The count of collisions is defined as the number of simulated reads (UMI-molecule combinations) that had already been observed in the simulation
Fig. 4.
Probability of having no UMI collisions for Case 2 (worst case scenario): The probability of no collision as a function of sample size n for five consecutive UMI lengths, k=14,…,18 such that J=4kn (colored curves). The vertical dotted line shows the n=100 000 sample size referenced in Figures 2 and 5
Fig. 5.
Probability of having no UMI collisions in simulated samples. We simulated samples of size n=100 000, with DNA fragments randomly selected from a set containing five unique fragments, each with a random fraction of presence in the sample. Simulated DNA fragments were paired with a given set of UMIs, and the rate of UMI collisions was measured. The average percent of all 1000 simulated samples having no collisions is shown with the blue line. Three were carried out with 1000 samples each. The red reference line is computed by our model, and shows the values in Figure 4
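
The kind of simulation described in the Fig. 2 caption can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' code: each read draws an allele according to the given frequencies and a UMI uniformly from the 4^k possible sequences, and reads sharing both UMI and allele are collapsed to one. The function name `dedup_allele_fractions` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter


def dedup_allele_fractions(n, freqs, k, seed=0):
    """Simulate n reads and return allele fractions after deduplication.

    Assumptions (not from the paper): alleles are drawn independently
    with probabilities `freqs`; UMIs are drawn uniformly from 4**k
    sequences and represented here as integers.
    """
    rng = random.Random(seed)
    num_umis = 4 ** k
    # Draw one allele index per read according to the allele frequencies.
    alleles = rng.choices(range(len(freqs)), weights=freqs, k=n)
    # Deduplicate: reads with the same (allele, UMI) pair collapse to one.
    unique_pairs = {(allele, rng.randrange(num_umis)) for allele in alleles}
    counts = Counter(allele for allele, _ in unique_pairs)
    total = len(unique_pairs)
    return [counts[i] / total for i in range(len(freqs))]


if __name__ == "__main__":
    truth = (0.1, 0.1, 0.3, 0.5)
    for k in (1, 4, 8, 12, 18):
        print(k, dedup_allele_fractions(100_000, truth, k))
```

With very short UMIs, every allele saturates the available UMIs and the deduplicated fractions are pushed toward equality. With long UMIs, few reads collide and the fractions stay close to the input frequencies.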
