Accurate estimation of short read mapping quality for next-generation genome sequencing

doi:10.1093/bioinformatics/bts408

Bioinformatics. 2012 Sep 15;28(18):i349-i355.

Matthew Ruffalo¹, Mehmet Koyutürk, Soumya Ray, Thomas LaFramboise

PMID: 22962451
PMCID: PMC3436835
DOI: 10.1093/bioinformatics/bts408

Matthew Ruffalo et al. Bioinformatics. 2012.

Affiliation

¹ Department of Electrical Engineering & Computer Science, Case Western Reserve University, Cleveland, OH 44106, USA. matthew.ruffalo@case.edu

Abstract

Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy, and the qualities of many mappings are underestimated, encouraging researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants.

Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and on the alignment (number of matches, mismatches and deletions; the mapping quality score returned by the alignment tool, if available; and the number of mappings) as features for classification, and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality.
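The general idea of the approach can be illustrated with a minimal sketch. This is not LoQuM's implementation: the feature scaling, the toy training rows, the learning rate and the function names (`train_logistic`, `predict_correct_prob`) are all assumptions made for illustration. It fits a plain logistic regression by batch gradient descent on a handful of made-up simulated mappings whose correctness is known, then scores a mapping that an aligner reported with quality zero.

```python
import math

# Hypothetical, hand-made training rows (NOT data from the paper).
# Each row: [mean base quality / 40, mismatches / 5,
#            aligner mapping quality / 60, 1 / number of mappings]
TOY_ROWS = [
    [0.95, 0.0, 1.0, 1.00],
    [0.90, 0.2, 0.6, 1.00],
    [0.85, 0.0, 0.0, 0.50],  # aligner reported quality 0, mapping is correct
    [0.80, 0.2, 0.0, 0.50],
    [0.50, 0.8, 0.0, 0.10],
    [0.45, 1.0, 0.0, 0.20],
    [0.60, 0.6, 0.3, 0.25],
    [0.40, 0.8, 0.1, 0.10],
]
# 1 = mapping is correct; known only because the reads are simulated.
TOY_LABELS = [1, 1, 1, 1, 0, 0, 0, 0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_correct_prob(weights, bias, x):
    """Estimated probability that a mapping with features x is correct."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_correct_prob(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


if __name__ == "__main__":
    w, b = train_logistic(TOY_ROWS, TOY_LABELS)
    # Score the zero-quality but correct mapping from the toy set.
    print(predict_correct_prob(w, b, TOY_ROWS[2]))
```

In this toy setting the model can assign a high probability of correctness to a mapping whose aligner-reported quality is zero, because the other features (few mismatches, high base quality) carry the signal.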

Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can 'resurrect' many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms.

Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/.

Contact: matthew.ruffalo@case.edu.

Figures

**Fig. 1.**
Direct comparison of the theoretical accuracy at each quality score Q_m against each tool's actual accuracy. The mapping quality Q_m is defined as the log-scaled probability P that the mapping is incorrect: Q_m = −10 log₁₀ P, giving a theoretical accuracy A for each quality score: A = 1 − P = 1 − 10^(−Q_m/10)

**Fig. 2.**
The machine learning framework for recalibrating mapping quality scores. Rectangles represent data, blue rounded rectangles represent available hardware and software and green ellipses represent computational methods implemented within LoQuM

**Fig. 3.**
Boxplots of per-base quality statistics provided by the FastQC tool (Andrews, 2010) for the MDAMB468 cell line (Sun *et al.*, 2011) across a random sample of the 33 million reads. The x-axis is the base position in each read, and the y-axis is the base quality score Q_b. The blue line shows the mean base quality

**Fig. 4.**
Precision versus recall for the alignment tool BWA, using the raw mapping qualities in (a), and the output of LoQuM in (b). The color of the curve denotes a threshold on mapping quality or prediction output; decreasing this threshold typically increases recall but decreases precision

**Fig. 5.**
Precision versus recall for the alignment tool SOAP2, using the raw mapping qualities in (a), and the output of LoQuM in (b). The color of the curve denotes a threshold on mapping quality or prediction output; decreasing this threshold typically increases recall but decreases precision

**Fig. 6.**
Precision versus recall for the alignment tool Novoalign, using the raw mapping qualities in (a) and the output of LoQuM in (b). The color of the curve denotes a threshold on mapping quality or prediction output; decreasing this threshold typically increases recall but decreases precision

**Fig. 7.**
Comparison of reported accuracy versus theoretical accuracy for ART's simulated reads. The x-axis is the output of the logistic regression classifier p after inversion and negative log-scaling: Q = −10log₁₀(1 − p). This corresponds to the mapping quality score Q_m in Equation (2)
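The quantities used in the captions above can be computed in a few lines. The sketch below is illustrative only: the function names, the quality cap of 60 and the exact definitions of precision and recall (fraction of retained mappings that are correct, and fraction of correct mappings that are retained) are assumptions, not details taken from the paper.

```python
import math


def mapq_to_accuracy(q_m):
    """Theoretical accuracy A = 1 - 10^(-Q_m/10) for mapping quality Q_m."""
    return 1.0 - 10.0 ** (-q_m / 10.0)


def prob_to_mapq(p_correct, cap=60.0):
    """Q = -10 log10(1 - p) for a classifier output p.

    The cap is an assumption: p = 1 would otherwise give infinite quality.
    """
    p_wrong = 1.0 - p_correct
    if p_wrong <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(p_wrong))


def precision_recall(scores, is_correct, threshold):
    """Precision and recall when keeping mappings with score >= threshold."""
    kept = [ok for s, ok in zip(scores, is_correct) if s >= threshold]
    true_pos = sum(kept)
    total_correct = sum(is_correct)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_correct if total_correct else 0.0
    return precision, recall


if __name__ == "__main__":
    print(mapq_to_accuracy(20))   # quality 20 means a 1% error probability
    print(prob_to_mapq(0.99))
    # Hypothetical scores and correctness labels for four mappings.
    print(precision_recall([0, 10, 30, 40], [True, False, True, True], 20))
```

Note that a mapping quality of 0 corresponds to a theoretical accuracy of 0, so a correct mapping reported with quality 0 is lost at any positive threshold; lowering the threshold recovers it at the cost of also admitting incorrect mappings.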

Cited by

A framework for organizing cancer-related variations from existing databases, publications and NGS data using a High-performance Integrated Virtual Environment (HIVE).
Wu TJ, Shamsaddini A, Pan Y, Smith K, Crichton DJ, Simonyan V, Mazumder R. Database (Oxford). 2014 Mar 25;2014:bau022. doi: 10.1093/database/bau022. Print 2014. PMID: 24667251.

SeqControl: process control for DNA sequencing.
Chong LC, Albuquerque MA, Harding NJ, Caloian C, Chan-Seng-Yue M, de Borja R, Fraser M, Denroche RE, Beck TA, van der Kwast T, Bristow RG, McPherson JD, Boutros PC. Nat Methods. 2014 Oct;11(10):1071-5. doi: 10.1038/nmeth.3094. Epub 2014 Aug 31. PMID: 25173705.

Comparison of single-nucleotide variants identified by Illumina and Oxford Nanopore technologies in the context of a potential outbreak of Shiga toxin-producing Escherichia coli.
Greig DR, Jenkins C, Gharbia S, Dallman TJ. Gigascience. 2019 Aug 1;8(8):giz104. doi: 10.1093/gigascience/giz104. PMID: 31433830.

Epigenomic profiling of primary gastric adenocarcinoma reveals super-enhancer heterogeneity.
Ooi WF, Xing M, Xu C, Yao X, Ramlee MK, Lim MC, Cao F, Lim K, Babu D, Poon LF, Lin Suling J, Qamra A, Irwanto A, Qu Zhengzhong J, Nandi T, Lee-Lim AP, Chan YS, Tay ST, Lee MH, Davies JO, Wong WK, Soo KC, Chan WH, Ong HS, Chow P, Wong CY, Rha SY, Liu J, Hillmer AM, Hughes JR, Rozen S, Teh BT, Fullwood MJ, Li S, Tan P. Nat Commun. 2016 Sep 28;7:12983. doi: 10.1038/ncomms12983. PMID: 27677335.

Re-alignment of the unmapped reads with base quality score.
Peng X, Wang J, Zhang Z, Xiao Q, Li M, Pan Y. BMC Bioinformatics. 2015;16 Suppl 5(Suppl 5):S8. doi: 10.1186/1471-2105-16-S5-S8. Epub 2015 Mar 18. PMID: 25860434.

References

1. Alkan C., et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 2009;41:1061–1067.
2. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. Available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ (last accessed September 12, 2011).
3. Hach F., et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat. Methods. 2010;7:576–577.
4. Homer N., Nelson S. F. Improved variant discovery through local realignment of short-read next-generation sequencing data using SRMA. Genome Biol. 2010;11:R99.
5. Huang W., et al. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–594.
