Bioinformatics. 2018 Jul 1;34(13):i202-i210.

doi: 10.1093/bioinformatics/bty264.

AmpUMI: design and analysis of unique molecular identifiers for deep amplicon sequencing

Kendell Clement^{1,2,3}, Rick Farouni^{1,2,3}, Daniel E Bauer^{4,5,6}, Luca Pinello^{1,2,3}

Affiliations

¹ Molecular Pathology Unit and Cancer Center, Massachusetts General Hospital, Boston, MA, USA.
² Department of Pathology, Harvard Medical School, Boston, MA, USA.
³ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Division of Hematology/Oncology, Boston Children's Hospital; Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.
⁵ Department of Pediatrics, Harvard Medical School, Boston, MA, USA.
⁶ Harvard Stem Cell Institute, Cambridge, MA, USA.

PMID: 29949956
PMCID: PMC6022702
DOI: 10.1093/bioinformatics/bty264

Kendell Clement et al. Bioinformatics. 2018.

. 2018 Jul 1;34(13):i202-i210.

doi: 10.1093/bioinformatics/bty264.

Authors

Kendell Clement^{1

2

3}, Rick Farouni^{1

2

3}, Daniel E Bauer^{4

5

6}, Luca Pinello^{1

2

3}

Affiliations

¹ Molecular Pathology Unit and Cancer Center, Massachusetts General Hospital, Boston, MA, USA.
² Department of Pathology, Harvard Medical School, Boston, MA, USA.
³ Broad Institute of MIT and Harvard, Cambridge, MA, USA.
⁴ Division of Hematology/Oncology, Boston Children's Hospital; Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA, USA.
⁵ Department of Pediatrics, Harvard Medical School, Boston, MA, USA.
⁶ Harvard Stem Cell Institute, Cambridge, MA, USA.

PMID: 29949956
PMCID: PMC6022702
DOI: 10.1093/bioinformatics/bty264

Abstract

Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments.

Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for determining the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on the frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user-specified parameters for use in downstream analysis.

Availability and implementation: AmpUMI is open-source and freely available at http://github.com/pinellolab/AmpUMI.

Figures

**Fig. 1.**
(a) Schematic showing utility of UMIs in identifying PCR duplicates. In libraries using UMIs, a short sequence of random nucleotides is added to each DNA fragment before PCR amplification. All PCR products of that read will contain the same UMI. After library sequencing, DNA fragments with the same sequence (shown as square nucleotides on the right part of the read) can be identified as either PCR duplicates or not, based on the UMI sequence (shown as rounded nucleotides on the left part of the read). (b) Outline of a standard experiment utilizing UMI technology. The steps shown in gray are computational processing steps, and are the procedures performed by our software, AmpUMI.
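The deduplication idea in panel (a) can be sketched in a few lines of Python. This is a minimal illustration, not AmpUMI's implementation: it assumes the UMI occupies the first `umi_len` bases of each read, ignores sequencing errors and base qualities, and the function names `split_umi` and `deduplicate` are hypothetical.

```python
from collections import Counter


def split_umi(read: str, umi_len: int) -> tuple[str, str]:
    """Split a read into (UMI, amplicon), assuming the UMI is the first umi_len bases."""
    return read[:umi_len], read[umi_len:]


def deduplicate(reads: list[str], umi_len: int) -> Counter:
    """Count each distinct (UMI, amplicon) pair once and tally molecules per amplicon."""
    seen = set()
    molecules = Counter()
    for read in reads:
        pair = split_umi(read, umi_len)
        if pair not in seen:
            # First time this UMI is seen with this amplicon: a new source molecule.
            seen.add(pair)
            molecules[pair[1]] += 1
        # Otherwise the read is treated as a PCR duplicate and skipped.
    return molecules


# Toy input: the first two reads share both UMI and amplicon (a PCR duplicate).
reads = ["AAGTACGT", "AAGTACGT", "CCGTACGT", "AAGTTCGT"]
counts = deduplicate(reads, umi_len=2)
```

In this toy input the amplicon `GTACGT` is supported by two distinct UMIs and therefore counts as two molecules, while the repeated `AA` read contributes only once.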

**Fig. 2.**
Association between distortion of allelic frequency and UMI length. Colored bars show simulated allelic fractions of four alleles after deduplication of reads with the same UMI and allele. Simulated samples consisted of $n =$ 100 000 reads and were drawn from a population with an allelic diversity given by $m = (0.1, 0.1, 0.3, 0.5)$. Reads were generated using UMIs between 1 bp and 18 bp long, and 100 samples were simulated for each UMI length. For each UMI length, the average simulated proportion of each allele is shown after removing UMI-allele collisions. The right column marked 'Truth' shows the underlying allelic diversity from which the simulated samples were drawn. Dots connected by lines show the allele frequency predicted by our model [Equation (12)] and are in complete concordance with the simulation results. The gray histogram at the bottom of the plot shows the TAFD [Equation (13)] for each UMI length.
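The flattening of allele fractions at short UMI lengths can be reproduced with a simple occupancy calculation. The sketch below is an assumption-laden stand-in, not the paper's Equation (12), which is not shown on this page: it assumes UMIs are drawn uniformly from $4^{k}$ sequences, uses the standard expected number of distinct UMIs per allele, and approximates the expected fraction by a ratio of expectations. The function name `expected_fractions` is hypothetical.

```python
import math


def expected_fractions(n: int, fractions: tuple, umi_len: int) -> list:
    """Approximate allele fractions after deduplicating (UMI, allele) pairs.

    Assumes uniform random UMIs over J = 4**umi_len sequences.
    """
    J = 4 ** umi_len
    distinct = []
    for m in fractions:
        n_a = n * m  # expected number of molecules carrying this allele
        # Expected distinct UMIs among n_a draws: J * (1 - (1 - 1/J)**n_a),
        # written with log1p/expm1 so it stays accurate when J is very large.
        distinct.append(-J * math.expm1(n_a * math.log1p(-1.0 / J)))
    total = sum(distinct)
    return [d / total for d in distinct]


truth = (0.1, 0.1, 0.3, 0.5)
short = expected_fractions(100_000, truth, umi_len=1)
long_ = expected_fractions(100_000, truth, umi_len=18)
```

With a 1 bp UMI every allele saturates the four available UMIs, so all four fractions collapse toward 0.25. With an 18 bp UMI the fractions stay close to the underlying values.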

**Fig. 3.**
Distribution of collisions in simulated populations. The collision counts from each simulated sample used in Figure 2 were aggregated by UMI length. Boxplots show the median (thick line), interquartile range (box) and the range of the data (whiskers). The count of collisions is defined as the number of simulated reads (UMI-molecule combinations) that had already been observed in the simulation.
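The collision count defined in this caption can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the authors' simulation code: UMIs are modeled as uniform random integers in the range $[0, 4^{k})$, alleles are drawn independently with the given fractions, and `count_collisions` is a hypothetical name.

```python
import random


def count_collisions(n: int, fractions: tuple, umi_len: int, rng: random.Random) -> int:
    """Count simulated reads whose (UMI, allele) pair was already observed."""
    J = 4 ** umi_len
    # Draw an allele index for every read according to the allele fractions.
    alleles = rng.choices(range(len(fractions)), weights=fractions, k=n)
    seen = set()
    collisions = 0
    for allele in alleles:
        pair = (rng.randrange(J), allele)  # UMI encoded as an integer
        if pair in seen:
            collisions += 1
        else:
            seen.add(pair)
    return collisions


rng = random.Random(0)
for k in (4, 8, 12):
    print(k, count_collisions(10_000, (0.1, 0.1, 0.3, 0.5), k, rng))
```

Repeating the call many times per UMI length and collecting the counts gives the kind of per-length distribution that a boxplot summarizes.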

**Fig. 4.**
Probability of having no UMI collisions for Case 2 (worst case scenario): The probability of no collision as a function of sample size n for five consecutive UMI lengths, $k = 14, \dots, 18$ such that $J = 4^{k} \leq n$ (colored curves). The vertical dotted line shows the n = 100 000 sample size referenced in Figures 2 and 5.
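A curve of this kind can be computed with the classical birthday-problem product. The sketch below assumes that the worst case means every molecule carries the same allele, so any repeated UMI is a collision, and that UMIs are uniform over $J = 4^{k}$ sequences; whether this matches the paper's Case 2 exactly is an assumption. The names `prob_no_collision` and `min_umi_length` and the default `target` are hypothetical.

```python
import math


def prob_no_collision(n: int, umi_len: int) -> float:
    """P(all n uniform random UMIs are distinct) = prod_{i<n} (1 - i/J), J = 4**umi_len."""
    J = 4 ** umi_len
    if n > J:
        return 0.0  # more molecules than UMIs: a repeat is unavoidable
    # Sum logs instead of multiplying to avoid underflow for large n.
    log_p = sum(math.log1p(-i / J) for i in range(n))
    return math.exp(log_p)


def min_umi_length(n: int, target: float = 0.95, max_len: int = 40) -> int:
    """Smallest UMI length whose no-collision probability reaches the target."""
    for k in range(1, max_len + 1):
        if prob_no_collision(n, k) >= target:
            return k
    raise ValueError("no UMI length up to max_len reaches the target probability")


k_needed = min_umi_length(100_000, target=0.95)
```

The search is linear in the UMI length because the probability is monotonically increasing in $k$; a stricter `target` simply pushes the answer to a longer UMI.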

**Fig. 5.**
Probability of having no UMI collisions in simulated samples. We simulated samples of size $n =$ 100 000, with DNA fragments randomly selected from a set containing five unique fragments, each with a random fraction of presence in the sample. Simulated DNA fragments were paired with a given set of UMIs, and the rate of UMI collisions was measured. The average percent of all 1000 simulated samples having no collisions is shown with the blue line. Three such simulations were carried out with 1000 samples each. The red reference line is computed by our model, and shows the values in Figure 4.
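An empirical estimate of this kind can be obtained by Monte Carlo. The sketch below is a simplified illustration, not the authors' procedure: it ignores fragment identity and treats any repeated UMI within a sample as a collision, draws UMIs uniformly, and uses much smaller sample sizes than the figure so that it runs quickly. The function name `fraction_without_collision` is hypothetical.

```python
import random


def fraction_without_collision(n: int, umi_len: int, n_samples: int,
                               rng: random.Random) -> float:
    """Fraction of simulated samples in which all n UMIs are distinct."""
    J = 4 ** umi_len
    clean = 0
    for _ in range(n_samples):
        umis = [rng.randrange(J) for _ in range(n)]
        if len(set(umis)) == n:  # no UMI appears twice in this sample
            clean += 1
    return clean / n_samples


# Small toy settings: 20 molecules, 5 bp UMIs, 2000 simulated samples.
estimate = fraction_without_collision(20, 5, 2000, random.Random(0))
```

The estimate can be compared against the analytic product $\prod_{i=0}^{n-1} (1 - i/J)$ for the same $n$ and $J$, which is the comparison the blue and red lines make at full scale.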
