BMC Bioinformatics. 2013;14 Suppl 11(Suppl 11):S3.
doi: 10.1186/1471-2105-14-S11-S3. Epub 2013 Sep 13.

Efficient digest of high-throughput sequencing data in a reproducible report

Zhe Zhang et al. BMC Bioinformatics. 2013.

Abstract

Background: High-throughput sequencing (HTS) technologies are spearheading the accelerated development of biomedical research. Processing and summarizing the large amount of data generated by HTS presents a non-trivial challenge to bioinformatics. A commonly adopted standard is to store sequencing reads aligned to a reference genome in SAM (Sequence Alignment/Map) or BAM (Binary Alignment/Map) files. Quality control of SAM/BAM files is a critical checkpoint before downstream analysis. The goal of the current project is to facilitate and standardize this process.

Results: We developed bamchop, a robust program that efficiently summarizes key statistical metrics of HTS data stored in BAM files and presents the results visually in a formatted report. The report documents various aspects of the HTS data, such as sequencing quality, mapping to a reference genome, sequencing coverage, and base frequency. Bamchop uses the R language and Bioconductor packages to calculate the statistical metrics, and the Sweave utility with associated LaTeX markup for documentation. Bamchop's efficiency and robustness were tested on BAM files generated by local sequencing facilities and the 1000 Genomes Project. Source code, instructions, and example reports of bamchop are freely available from https://github.com/CBMi-BiG/bamchop.

Conclusions: Bamchop enables biomedical researchers to quickly and rigorously evaluate HTS data by providing a convenient synopsis and user-friendly reports.

Figures

Figure 1
System architecture of bamchop program.
Figure 2
Estimation of summary statistics by randomly selected sequencing reads. The x-axis indicates the number of reads selected from a BAM file while the y-axis represents the values of eight summary statistics estimated using the selected reads. Each "violin" in the plots represents the distribution of estimated statistics based on 100 resamplings while the horizontal lines correspond to the global averages.
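The resampling idea behind Figure 2 can be illustrated with a minimal sketch. This is not bamchop's code (bamchop is written in R); the function name `resample_estimates`, the synthetic per-read GC percentages, and the sample sizes are all assumptions made for illustration.

```python
import random
import statistics


def resample_estimates(values, n_reads, n_resamples=100, seed=0):
    """Estimate a mean from n_reads randomly chosen reads, repeated n_resamples times.

    Hypothetical helper: each estimate is the mean of one random subset,
    drawn without replacement, mirroring the 100 resamplings per violin.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(values, n_reads))
        for _ in range(n_resamples)
    ]


# Synthetic stand-in for one per-read statistic (GC percentage); not real data.
_rng = random.Random(1)
gc_percent = [_rng.gauss(45.0, 8.0) for _ in range(20000)]
global_mean = statistics.fmean(gc_percent)

for n in (100, 1000, 10000):
    estimates = resample_estimates(gc_percent, n)
    spread = statistics.pstdev(estimates)
    print(f"n={n:>6}  mean of estimates={statistics.fmean(estimates):.2f}  spread={spread:.3f}")
print(f"global mean={global_mean:.2f}")
```

The spread of the estimates shrinks as more reads are selected, which is why a modest random subset can stand in for the full BAM file when estimating global statistics.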
Figure 3
This graphic index represents the sequencing depth along chromosomes. It can be used to quickly identify large regions with extraordinarily high or low depth.
Figure 4
Sequencing depth of an exome sequencing sample. (A) The number and percentage of genomic locations with sequencing depth exceeding given values. (B) Mean sequencing depth of different genomic features. As expected for this exome sample, the exons have much higher mean depth than the other regions.
Figure 5
Sequencing quality. (A) Global distribution of single-base quality scores. Most bases scored around 30 (error probability p = 0.001), suggesting high overall quality of the HTS data. (B) Heat map of sequencing quality as a function of read position. The blue solid line and black dashed line indicate the position-specific medians and means of scores, respectively. This figure suggests that the sequencing quality of this sample dropped substantially after read position 80.
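The link between a quality score of 30 and p = 0.001 follows from the standard Phred definition, Q = -10 log10(p). A minimal sketch of that conversion follows; the function names are hypothetical, and the ASCII offset of 33 is the encoding used for the QUAL field in SAM/BAM files.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability back to a Phred quality score."""
    return -10 * math.log10(p)


def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string (Phred+33, as in SAM/BAM) into integer scores."""
    return [ord(ch) - offset for ch in qual]


print(phred_to_error_prob(30))
print(decode_quality_string("I?5+"))
```

Under this definition each 10-point increase in score corresponds to a tenfold drop in the error probability.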
Figure 6
Read mapping statistics. (A) Mapping quality thresholds. The majority of reads (~95%) have the best mapping score (mapq = 70) assigned by the alignment program, suggesting high confidence in the mapping results. (B) Base-level mismatch information, where M = matched bases; I = inserted bases; D = deleted bases; and S = soft-clipped bases due to mismatches. (C) Duplicate mapping (multiple reads mapped to the same genomic location). The x-axis represents the number of reads sharing the same mapping location and the y-axis represents the total number of such locations. (D) Insertion size. When the BAM file includes information about paired-end reads, bamchop also summarizes the distribution of the distance between the mapped locations of the read pairs, which is known as insertion size. The insertion size equals the size of the DNA fragment in the sequencing library that was sequenced as a pair.
Figure 7
Base frequency. (A) Frequency of N (uncalled) bases due to low quality or ambiguity. (B) Expected versus observed base frequency. (C) Distribution of per-read GC percentages. (D) Position-specific base frequencies at the beginning of reads, which show a bias toward G and A at the first sequenced positions.
Figure 8
The runtime of bamchop depends on the total number of mapped reads in each BAM file. Diamonds represent the BAM files described in Table 1. The baseline runtime of bamchop is about 11 minutes, and each additional 100 million reads requires about 10.5 more minutes to finish.
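The linear relationship described in the Figure 8 caption can be turned into a rough back-of-the-envelope estimate. The sketch below simply encodes the two approximate figures from the caption; the function name is hypothetical, and actual runtimes will depend on hardware and data, so treat the result as indicative only.

```python
BASE_MINUTES = 11.0               # approximate baseline runtime from the caption
MINUTES_PER_100M_READS = 10.5     # approximate extra time per 100 million mapped reads


def estimate_runtime_minutes(mapped_reads):
    """Rough linear runtime estimate (minutes) for a BAM file with the given mapped reads."""
    return BASE_MINUTES + MINUTES_PER_100M_READS * (mapped_reads / 100_000_000)


for reads in (50_000_000, 200_000_000, 1_000_000_000):
    print(f"{reads:>13,} reads -> about {estimate_runtime_minutes(reads):.1f} minutes")
```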
