A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments

Vikas Bansal¹

PMID: 28361665
PMCID: PMC5374682
DOI: 10.1186/s12859-017-1471-9

Vikas Bansal. BMC Bioinformatics. 2017 Mar 14;18(Suppl 3):43. doi: 10.1186/s12859-017-1471-9.

Affiliation

¹ Department of Pediatrics, School of Medicine, University of California San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA. vibansal@cs.ucsd.edu.

Abstract

Background: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data, and estimating the PCR duplication rate is important for assessing the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore overestimate the PCR duplication rate for DNA-seq and RNA-seq experiments.

Results: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes Project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets that contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates, likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes Project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression, and identified outlier samples with a 2-fold greater PCR duplication rate than other samples.

Conclusions: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates.

Keywords: Heterozygosity; High-throughput sequencing; Mathematical modeling; Natural duplicates; PCR duplicates; RNA-seq.
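The core idea of the abstract, using heterozygous sites to separate natural duplicates from PCR duplicates, can be illustrated with a minimal Python sketch. It is not the author's implementation. It assumes reads are given as hypothetical `(chrom, start, end, allele)` tuples, considers only clusters of exactly two reads, and assumes that two independent fragments at a heterozygous site carry different alleles with probability 1/2 (no allelic bias, no sequencing errors). Under those assumptions, the fraction of size-2 clusters that are natural duplicates is roughly twice the fraction with discordant alleles.

```python
from collections import defaultdict

def cluster_reads(reads):
    """Group reads by alignment coordinates (chrom, start, end).

    Each read is a hypothetical (chrom, start, end, allele) tuple, where
    allele is 0 (reference) or 1 (variant) at a heterozygous site.
    """
    clusters = defaultdict(list)
    for chrom, start, end, allele in reads:
        clusters[(chrom, start, end)].append(allele)
    return clusters

def natural_fraction_size2(clusters):
    """Estimate the fraction of size-2 clusters that are natural duplicates.

    Assumption: independent fragments show different alleles half the time,
    so the natural fraction is about 2 * (discordant fraction), capped at 1.
    """
    pairs = [alleles for alleles in clusters.values() if len(alleles) == 2]
    if not pairs:
        return None
    discordant = sum(1 for a in pairs if a[0] != a[1])
    return min(1.0, 2.0 * discordant / len(pairs))

# Toy data: four duplicate pairs (one with discordant alleles) and one singleton.
reads = [
    ("chr1", 100, 300, 0), ("chr1", 100, 300, 1),
    ("chr1", 150, 350, 0), ("chr1", 150, 350, 0),
    ("chr1", 200, 400, 1), ("chr1", 200, 400, 1),
    ("chr1", 250, 450, 1), ("chr1", 250, 450, 1),
    ("chr1", 500, 700, 0),
]
clusters = cluster_reads(reads)
natural = natural_fraction_size2(clusters)
print(f"clusters: {len(clusters)}, estimated natural fraction (size 2): {natural}")
```

Larger clusters require a fuller model of how many unique fragments each cluster contains, which this sketch deliberately leaves out.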

Figures

**Fig. 1**
Illustration of paired-end reads covering a heterozygous SNV (reference allele is denoted by 0 and the variant allele as 1) in a diploid genome. The reads can be grouped into clusters of different sizes based on their alignment coordinates. Two reads that start and end at the same position but carry different alleles (0 and 1) at the heterozygous site (a) are highly likely to correspond to natural duplicates, i.e. independent DNA fragments. In contrast, a pair of read duplicates that have identical alleles at the heterozygous site (b) could correspond to PCR duplicates or natural duplicates

**Fig. 2**
Overview of computational method for estimating the PCR duplication rate using clusters of duplicate reads that overlap heterozygous variant sites. C_i corresponds to the clusters of read duplicates with i reads and U_i is the average number of unique DNA fragments for clusters of size i

**Fig. 3**
Box-plot showing the error in the estimation of the PCR duplication rate using our method on simulated data with varying levels of PCR duplicates (0 to 0.4). Data was simulated with a fixed sampling read duplication rate (plots shown for values of 0.2 and 0.4). For each combination of values, 50 simulated datasets were used to assess the error of the estimated PCR duplication rate

**Fig. 4**
Comparison of the estimated PCR duplication rate on 40 exome datasets from the 1000 Genomes Project analyzed as paired-end (PE) reads and single-end (SE) reads. The two plots correspond to the analysis using exome variant calls and Omni genotype calls. For visual clarity, two outlier samples with a high PCR duplication rate (>0.12) are not shown

**Fig. 5**
Comparison of the read duplication rate and the estimated PCR duplication rate for 40 RNA-seq samples from the Geuvadis project. Three samples with much higher PCR duplication rates than the remaining samples are highlighted
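The quantities named in the Fig. 2 caption suggest how an overall rate could be assembled once the per-size estimates are available. The following Python sketch is an assumption-laden illustration, not the paper's estimator: it treats C_i as the count of clusters of size i, and it defines the PCR duplication rate as the fraction of reads that are not unique DNA fragments. The function names and the toy numbers are hypothetical.

```python
def read_duplication_rate(C):
    """Fraction of reads that duplicate another read's coordinates.

    C maps cluster size i to the number of clusters of that size (assumed
    reading of C_i). Every cluster contributes one non-duplicate read.
    """
    total_reads = sum(i * n for i, n in C.items())
    total_clusters = sum(C.values())
    return 1.0 - total_clusters / total_reads

def pcr_duplication_rate(C, U):
    """Fraction of reads that are PCR copies (assumed definition).

    U maps cluster size i to the average number of unique DNA fragments
    in clusters of that size (U_i in the Fig. 2 caption).
    """
    total_reads = sum(i * n for i, n in C.items())
    unique_fragments = sum(n * U[i] for i, n in C.items())
    return 1.0 - unique_fragments / total_reads

# Toy numbers, chosen only for illustration.
C = {1: 800, 2: 80, 3: 10}
U = {1: 1.0, 2: 1.5, 3: 2.0}

dup_rate = read_duplication_rate(C)
pcr_rate = pcr_duplication_rate(C, U)
natural_share = 1.0 - pcr_rate / dup_rate  # share of duplicates that are natural
print(f"read dup rate={dup_rate:.4f}, PCR dup rate={pcr_rate:.4f}, "
      f"natural share={natural_share:.2f}")
```

Comparing the two rates in this way mirrors the contrast drawn in Fig. 5 between the read duplication rate and the estimated PCR duplication rate.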
