FastUniq: a fast de novo duplicates removal tool for paired short reads

Haibin Xu¹, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, Shilin Chen

Affiliations

Affiliation

¹ National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People's Republic of China.

PMID: 23284954
PMCID: PMC3527383
DOI: 10.1371/journal.pone.0052249

FastUniq: a fast de novo duplicates removal tool for paired short reads

Haibin Xu et al. PLoS One. 2012.

. 2012;7(12):e52249.

doi: 10.1371/journal.pone.0052249. Epub 2012 Dec 20.

Authors

Haibin Xu¹, Xiang Luo, Jun Qian, Xiaohui Pang, Jingyuan Song, Guangrui Qian, Jinhui Chen, Shilin Chen

Affiliation

¹ National Engineering Laboratory for Breeding of Endangered Medicinal Materials, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, People's Republic of China.

PMID: 23284954
PMCID: PMC3527383
DOI: 10.1371/journal.pone.0052249

Abstract

The presence of duplicates introduced by PCR amplification is a major issue in paired short reads from next-generation sequencing platforms. These duplicates might have a serious impact on research applications, such as scaffolding in whole-genome sequencing and discovering large-scale genome variations, and are usually removed. We present FastUniq as a fast de novo tool for removal of duplicates in paired short reads. FastUniq identifies duplicates by comparing sequences between read pairs and does not require complete genome sequences as prerequisites. FastUniq is capable of simultaneously handling reads with different lengths and results in highly efficient running time, which increases linearly at an average speed of 87 million reads per 10 minutes. FastUniq is freely available at http://sourceforge.net/projects/fastuniq/.

PubMed Disclaimer

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

**Figure 1. The processing flow chart for FastUniq.**
Step 1: import all read pairs into memory; Step 2: sort read pairs based on nucleotide sequences; Step 3: identify duplicates in sorted read pairs and output the unique sequences.

**Figure 2. FastUniq three-tier architecture for storage of read pairs.**
The high-tier objective was to store hundreds of millions or more of paired reads. Data for each read pair composed of two reads are stored in a middle-tier ‘fastq_pair’ object, and data for each read are stored in a basic-tier ‘fastq’ object.

**Figure 3. Results of duplicates removal for Illumina sequencing libraries from *Acropora digitifera* corresponding to multiple insert sizes.**
(A) The number of read pairs before and after duplicates removal using FastUniq or the mapping-based pipeline for each library. (B) The percentage of duplicates in the results of the mapping-based pipeline identified using FastUniq or fastx_collapser for each library.

**Figure 4. Running time performance of FastUniq.**
The running time is measured by the ‘time’ command in the Linux operating system.

See this image and copyright information in PMC

References

1. Li R, Fan W, Tian G, Zhu H, He L, et al. (2010) The sequence and de novo assembly of the giant panda genome. Nature 463: 311–317. - PMC - PubMed
1. Shinzato C, Shoguchi E, Kawashima T, Hamada M, Hisata K, et al. (2011) Using the Acropora digitifera genome to understand coral responses to environmental change. Nature 476: 320–323. - PubMed
1. Hohenlohe PA, Bassham S, Etter PD, Stiffler N, Johnson EA, et al. (2010) Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genet 6: e1000862. - PMC - PubMed
1. Thomas RK, Baker AC, Debiasi RM, Winckler W, Laframboise T, et al. (2007) High-throughput oncogene mutation profiling in human cancer. Nat Genet 39: 347–351. - PubMed
1. Lu T, Lu G, Fan D, Zhu C, Li W, et al. (2010) Function annotation of the rice transcriptome at single-nucleotide resolution by RNA-seq. Genome Res 20: 1238–1249. - PMC - PubMed

Publication types

Actions

MeSH terms

Actions
Actions
Actions

LinkOut - more resources

Full Text Sources
Other Literature Sources
- The Lens - Patent Citations Database

Save citation to file

Email citation

Add to Collections

Add to My Bibliography

Your saved search

Create a file for external citation management software

Your RSS Feed

FastUniq: a fast de novo duplicates removal tool for paired short reads

Affiliation

FastUniq: a fast de novo duplicates removal tool for paired short reads

Authors

Affiliation

Abstract

Conflict of interest statement

Figures

References

Publication types

MeSH terms

LinkOut - more resources

Full Text Sources

Other Literature Sources