Bioinformatics. 2011 Jan 1;27(1):130-1.
doi: 10.1093/bioinformatics/btq614. Epub 2010 Nov 18.

SAMStat: monitoring biases in next generation sequencing data

Timo Lassmann et al. Bioinformatics.

Abstract

Motivation: The sequence alignment/map format (SAM) is a commonly used format for storing the alignments between millions of short reads and a reference genome. Often, certain positions within the reads are inherently more likely to contain errors because of the protocols used to prepare the samples. Such biases can have adverse effects on both mapping rate and accuracy. To understand the relationship between potential protocol biases and poor mapping, we wrote SAMStat, a simple C program that plots nucleotide overrepresentation and other statistics for mapped and unmapped reads in a concise HTML page. Collecting such statistics also makes it easy to highlight problems in the data processing and enables non-experts to track data quality over time.
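As an illustration of the kind of statistic described above, the following is a minimal Python sketch (not SAMStat's C implementation) that counts nucleotides at the first few read positions separately for mapped and unmapped reads. It relies only on the SAM column layout and the standard FLAG bits 0x4 (unmapped) and 0x10 (reverse strand); the function names `read_as_sequenced` and `position_counts` and the `max_pos` parameter are hypothetical and chosen for this example.

```python
from collections import Counter, defaultdict

UNMAPPED_FLAG = 0x4   # SAM FLAG bit: read is unmapped
REVERSE_FLAG = 0x10   # SAM FLAG bit: SEQ is stored reverse-complemented

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def read_as_sequenced(seq, flag):
    """Return the read in its original sequencing orientation."""
    if flag & REVERSE_FLAG:
        return seq.translate(_COMPLEMENT)[::-1]
    return seq


def position_counts(sam_lines, max_pos=5):
    """Count bases at the first `max_pos` read positions.

    Returns {"mapped": {pos: Counter}, "unmapped": {pos: Counter}},
    with 0-based positions. Header lines and records without a
    stored sequence are skipped.
    """
    counts = {"mapped": defaultdict(Counter), "unmapped": defaultdict(Counter)}
    for line in sam_lines:
        if line.startswith("@"):          # SAM header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[9] == "*":
            continue
        flag = int(fields[1])
        seq = read_as_sequenced(fields[9], flag).upper()
        group = "unmapped" if flag & UNMAPPED_FLAG else "mapped"
        for pos, base in enumerate(seq[:max_pos]):
            counts[group][pos][base] += 1
    return counts
```

Comparing the two groups position by position is one simple way to see whether a particular base is overrepresented at the start of reads that fail to map.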

Results: We demonstrate that studying sequence features in mapped data can be used to identify biases particular to one sequencing protocol. Once identified, such biases can be considered in the downstream analysis or even be removed by read trimming or filtering techniques.
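The abstract does not specify which trimming or filtering technique is meant. As one hedged example, a fixed number of leading bases could be removed from each read, and reads that become too short could be discarded. The sketch below assumes reads are held as `(name, sequence, quality)` tuples; the helper names and the `n_leading` and `min_len` parameters are hypothetical.

```python
def trim_leading(seq, qual, n_leading):
    """Drop the first `n_leading` bases and their quality characters."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n_leading:], qual[n_leading:]


def trim_and_filter(reads, n_leading, min_len):
    """Trim every read, keeping only those with at least `min_len` bases left.

    `reads` is an iterable of (name, seq, qual) tuples.
    """
    kept = []
    for name, seq, qual in reads:
        trimmed_seq, trimmed_qual = trim_leading(seq, qual, n_leading)
        if len(trimmed_seq) >= min_len:
            kept.append((name, trimmed_seq, trimmed_qual))
    return kept
```

The number of bases to trim would be chosen from the observed bias, for example the positions at which mismatches cluster.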

Availability: SAMStat is open source and freely available as a C program running on all Unix-compatible platforms. The source code is available from http://samstat.sourceforge.net.

Contact: timolassmann@gmail.com.


Figures

Fig. 1.
A selection of SAMStat's HTML output. (a) Mapping statistics. More than half of the reads are mapped with a high mapping accuracy (red), while 9.9% of the reads remain unmapped (black). (b) Bar charts showing the distribution of mismatches and insertions along the read for alignments with the highest mapping accuracy [shown in red in (a)]. The colors indicate the mismatched nucleotides found in the read or the nucleotides inserted into the read. (c, d and e) Frequency of mismatches at the start of reads with mapping accuracies 1e−3 ≤ P < 1e−2, 1e−2 ≤ P < 0.5 and 0.5 ≤ P < 1, respectively [shown in orange, yellow and blue in (a)]. The fraction of mismatches involving Gs at positions 2–5 increases. (f) Percentage of ‘GG’ dinucleotides at positions 1–5 in reads split up by mapping quality intervals. The background color highlights large percentages. The first and last rows, for dinucleotides ‘GT’ and ‘GC’, are shown for comparison.
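The intervals in the caption are stated as error probabilities P, whereas SAM records store a Phred-scaled mapping quality, MAPQ = −10 log10(P). The sketch below shows how reads could be assigned to such intervals and how a dinucleotide percentage like the one in panel (f) could be computed. It is an illustration under several assumptions: the top bin (P < 1e−3) is inferred rather than stated in the caption, the last bin is widened to P ≥ 0.5 so that MAPQ 0 (P = 1) is included, and "positions 1–5" is read as all dinucleotides lying entirely within the first five bases. All function names are hypothetical.

```python
from collections import defaultdict

# (exclusive upper bound on P, label); bins are checked in order.
BINS = [
    (1e-3, "P < 1e-3"),          # assumed top bin, not stated in the caption
    (1e-2, "1e-3 <= P < 1e-2"),
    (0.5, "1e-2 <= P < 0.5"),
]
LAST_BIN = "P >= 0.5"            # widened so that MAPQ 0 (P = 1) is covered


def mapq_to_error_prob(mapq):
    """Convert a Phred-scaled mapping quality to an error probability."""
    return 10 ** (-mapq / 10)


def accuracy_bin(mapq):
    """Return the label of the error-probability interval for a MAPQ value."""
    p = mapq_to_error_prob(mapq)
    for upper, label in BINS:
        if p < upper:
            return label
    return LAST_BIN


def dinucleotide_percent(reads, dinuc="GG", window=5):
    """Percentage of `dinuc` among dinucleotides in the first `window` bases.

    `reads` is an iterable of (mapq, seq) pairs; results are keyed by bin label.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for mapq, seq in reads:
        label = accuracy_bin(mapq)
        head = seq[:window].upper()
        for i in range(len(head) - 1):
            totals[label] += 1
            if head[i:i + 2] == dinuc:
                hits[label] += 1
    return {label: 100.0 * hits[label] / totals[label] for label in totals}
```

A table of these percentages, one row per dinucleotide and one column per interval, would correspond in spirit to panel (f).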

