Bioinformatics. 2012 Sep 15;28(18):i349-i355.
doi: 10.1093/bioinformatics/bts408.

Accurate estimation of short read mapping quality for next-generation genome sequencing

Matthew Ruffalo et al. Bioinformatics.

Abstract

Motivation: Several software tools specialize in the alignment of short next-generation sequencing reads to a reference sequence. Some of these tools report a mapping quality score for each alignment; in principle, this quality score tells researchers the likelihood that the alignment is correct. However, the reported mapping quality often correlates weakly with actual accuracy, and the qualities of many mappings are underestimated, encouraging researchers to discard correct mappings. Further, these low-quality mappings tend to correlate with variations in the genome (both single nucleotide and structural), and such mappings are important in accurately identifying genomic variants.

Approach: We develop a machine learning tool, LoQuM (LOgistic regression tool for calibrating the Quality of short read mappings), to assign reliable mapping quality scores to mappings of Illumina reads returned by any alignment tool. LoQuM uses statistics on the read (base quality scores reported by the sequencer) and the alignment (number of matches, mismatches and deletions, mapping quality score returned by the alignment tool, if available, and number of mappings) as features for classification, and uses simulated reads to learn a logistic regression model that relates these features to actual mapping quality.
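The general idea of learning a logistic regression model from labeled simulated mappings can be sketched as follows. This is only an illustrative toy, not LoQuM's implementation: the three features, their scaling, the training data, and the function names `train_logistic` and `predict` are all assumptions made for the example.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, x):
    """Probability that a mapping is correct; w[0] is the bias term."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights by batch gradient descent on the log loss."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(rows) for wi, g in zip(w, grad)]
    return w

# Hypothetical, hand-made features per mapping (not from the paper):
# (mean base quality / 40, mismatches / 5, 1 / number of mappings)
rows = [
    (0.95, 0.0, 1.0), (0.90, 0.2, 1.0), (0.85, 0.0, 0.5),
    (0.50, 0.8, 0.2), (0.60, 0.6, 0.25), (0.40, 1.0, 0.1),
]
# With simulated reads the true origin is known, so each mapping
# can be labeled correct (1) or incorrect (0).
labels = [1, 1, 1, 0, 0, 0]

w = train_logistic(rows, labels)
p_good = predict(w, (0.92, 0.0, 1.0))  # unique, clean alignment
p_bad = predict(w, (0.45, 0.8, 0.1))   # noisy, multiply mapped read
print(p_good, p_bad)
```

The key point the sketch tries to convey is that simulated reads supply ground-truth labels, so the model's output can be interpreted as an estimated probability that a mapping is correct.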

Results: We test the predictions of LoQuM on an independent dataset generated by the ART short read simulation software and observe that LoQuM can 'resurrect' many mappings that are assigned zero quality scores by the alignment tools and are therefore likely to be discarded by researchers. We also observe that the recalibration of mapping quality scores greatly enhances the precision of called single nucleotide polymorphisms.

Availability: LoQuM is available as open source at http://compbio.case.edu/loqum/.

Contact: matthew.ruffalo@case.edu.

Figures

Fig. 1.
Direct comparison of the theoretical accuracy at each quality score $Q_m$ against each tool's actual accuracy. The mapping quality $Q_m$ is defined as the log-scaled probability $P$ that the mapping is incorrect: $Q_m = -10\log_{10} P$, giving a theoretical accuracy $A$ for each quality score: $A = 1 - P = 1 - 10^{-Q_m/10}$
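As a quick numerical illustration of the relationship in this caption, the following minimal sketch converts a mapping quality score into its implied error probability and theoretical accuracy. The function names are chosen for this example only.

```python
def error_probability(qm):
    """P = 10^(-Qm/10): implied probability that the mapping is incorrect."""
    return 10.0 ** (-qm / 10.0)

def theoretical_accuracy(qm):
    """A = 1 - P = 1 - 10^(-Qm/10)."""
    return 1.0 - error_probability(qm)

# Print the implied accuracy for a few quality scores.
for qm in (0, 10, 20, 30):
    print(qm, theoretical_accuracy(qm))
```

Under this definition a score of 0 implies no confidence at all in the mapping, while scores of 10, 20 and 30 imply accuracies of roughly 90%, 99% and 99.9%.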
Fig. 2.
The machine learning framework for recalibrating mapping quality scores. Rectangles represent data, blue rounded rectangles represent available hardware and software and green ellipses represent computational methods implemented within LoQuM
Fig. 3.
Boxplots of per-base quality statistics provided by the FastQC tool (Andrews, 2010) for the MDAMB468 cell line (Sun et al., 2011) across a random sample of the 33 million reads. The x-axis is the base position in each read, and the y-axis is the base quality score Qb. The blue line shows the mean base quality
Fig. 4.
Precision versus recall for the alignment tool BWA, using the raw mapping qualities in (a), and the output of LoQuM in (b). The color of the curve denotes a threshold on mapping quality or prediction output; decreasing this threshold typically increases recall but decreases precision
Fig. 5.
Precision versus recall for the alignment tool SOAP2, using the raw mapping qualities in (a), and the output of LoQuM in (b). The color of the curve denotes a threshold on mapping quality or prediction output; decreasing this threshold typically increases recall but decreases precision
Fig. 6.
Precision versus recall for the alignment tool Novoalign, using the raw mapping qualities in (a) and the output of LoQuM in (b). The color of the curve denotes a threshold on mapping quality or prediction output; decreasing this threshold typically increases recall but decreases precision
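The threshold trade-off described in the precision versus recall captions can be illustrated with a small sketch. It assumes the common definitions (precision is the fraction of retained mappings that are correct; recall is the fraction of all correct mappings that are retained), which may differ in detail from the paper's; the scores and labels are made up.

```python
def precision_recall(scores, is_correct, threshold):
    """Keep mappings with score >= threshold and measure precision/recall."""
    kept = [c for s, c in zip(scores, is_correct) if s >= threshold]
    true_pos = sum(kept)               # correct mappings that were kept
    total_correct = sum(is_correct)    # all correct mappings
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / total_correct if total_correct else 0.0
    return precision, recall

# Made-up mapping qualities and ground-truth labels for six mappings.
scores = [0, 0, 10, 20, 37, 37]
is_correct = [True, False, True, False, True, True]

for threshold in (30, 10, 0):
    print(threshold, precision_recall(scores, is_correct, threshold))
```

In this toy data, lowering the threshold from 30 to 0 raises recall from 0.5 to 1.0 while precision falls from 1.0 to about 0.67, which mirrors the "typically increases recall but decreases precision" behavior noted in the captions.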
Fig. 7.
Comparison of reported accuracy versus theoretical accuracy for ART's simulated reads. The x-axis is the output of the logistic regression classifier $p$ after inversion and negative log-scaling: $Q = -10\log_{10}(1 - p)$. This corresponds to the mapping quality score $Q_m$ in Equation (2)
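The transformation in this caption, from a classifier probability to a log-scaled quality, might be sketched as below. The cap `max_q` is an assumption added here to keep the value finite when the classifier outputs exactly 1; the source does not say how that case is handled.

```python
import math

def prob_to_quality(p, max_q=60.0):
    """Q = -10 * log10(1 - p), where p is the predicted probability
    that the mapping is correct. max_q is a hypothetical cap."""
    p_error = 1.0 - p
    if p_error <= 0.0:
        return max_q  # avoid log10(0) when p == 1
    return max(0.0, min(max_q, -10.0 * math.log10(p_error)))

for p in (0.0, 0.9, 0.99, 1.0):
    print(p, prob_to_quality(p))
```

A prediction of 0.99 therefore maps to a quality of about 20, placing the classifier output on the same scale as the mapping qualities reported by alignment tools.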

References

    1. Alkan C., et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat. Genet. 2009;41:1061–1067.
    2. Andrews S. FastQC: a quality control tool for high throughput sequence data. 2010. Available at http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ (last accessed September 12, 2011).
    3. Hach F., et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat. Methods. 2010;7:576–577.
    4. Homer N., Nelson S. F. Improved variant discovery through local realignment of short-read next-generation sequencing data using SRMA. Genome Biol. 2010;11:R99.
    5. Huang W., et al. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–594.
