PLoS One. 2011 Jan 31;6(1):e16685.
doi: 10.1371/journal.pone.0016685.

Detection and removal of biases in the analysis of next-generation sequencing reads


Schraga Schwartz et al. PLoS One. 2011.

Abstract

Since the emergence of next-generation sequencing (NGS) technologies, great effort has been put into the development of tools for analysis of the short reads. In parallel, knowledge is increasing regarding biases inherent in these technologies. Here we discuss four different biases we encountered while analyzing various Illumina datasets. These biases are due to both biological and statistical effects that in particular affect comparisons between different genomic regions. Specifically, we encountered biases pertaining to the distributions of nucleotides across sequencing cycles, to mappability, to contamination of pre-mRNA with mRNA, and to non-uniform hydrolysis of RNA. Most of these biases are not specific to one analyzed dataset, but are present across a variety of datasets and within a variety of genomic contexts. Importantly, some of these biases correlated in a highly significant manner with biological features, including transcript length, gene expression levels, conservation levels, and exon-intron architecture, misleadingly increasing the credibility of results due to them. We also demonstrate the relevance of these biases in the context of analyzing an NGS dataset mapping transcriptionally engaged RNA polymerase II (RNAPII) in the context of exon-intron architecture, and show that elimination of these biases is crucial for avoiding erroneous interpretation of the data. Collectively, our results highlight several important pitfalls, challenges and approaches in the analysis of NGS reads.

Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. Examination of nucleotide biases within reads across different datasets of deep-sequencing experiments.
For each dataset, we present sequence logos of the first twenty positions of all reads that could be aligned to the reference genome (left panel), and positional nucleotide charts (right panel). In the sequence logos, the height of each letter is proportional to the frequency of the corresponding base at the given position, and bases are listed in descending order of frequency from top to bottom. The positional nucleotide charts display the frequency of each base at each position. Data for additional datasets are presented in Supporting Information S1. (A) Data for RNA-seq reads from human lymph node obtained from . (B) Data for RNA-seq reads from human lymphoblastoid tissue obtained from . (C) Data for RNA-seq reads from CD4 cells were obtained from . (D) Data for genomic reads from PhiX control lanes following 26 cycles were from . (E) Data for ChIP-seq lane mapping PAF binding sites in human CD4 cells were from .
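The positional nucleotide charts described in this caption can be approximated with a short tally. The following is a minimal sketch, not the authors' pipeline: the function name `positional_base_frequencies`, the `n_positions` parameter, and the toy reads are all hypothetical, and reads are assumed to be plain strings already filtered to those that aligned.

```python
from collections import Counter

def positional_base_frequencies(reads, n_positions=20):
    """Return one dict per read position mapping A/C/G/T to its frequency.

    Assumption: `reads` is an iterable of sequence strings. Bases other
    than A/C/G/T (e.g. N) count toward the total but are not reported.
    """
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read[:n_positions].upper()):
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        # Positions that no read reaches yield an empty dict.
        freqs.append({b: c[b] / total for b in "ACGT"} if total else {})
    return freqs

# Hypothetical toy reads, for illustration only.
toy_reads = ["ACGT", "ACGA", "TCGA", "ACTA"]
freqs = positional_base_frequencies(toy_reads, n_positions=4)
```

In an unbiased library the four frequencies would be expected to stay roughly constant across positions; a marked deviation in the first cycles is the kind of pattern the figure is meant to expose.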
Figure 2. Mappability within genomic regions.
(A) Mean mappability density values within internal exons and within the exons and introns flanking them. Error bars represent the standard error of the mean (SEM). (B) Mappability in the region surrounding exon/intron junction. The dashed line represents the exon/intron junction. (C) Mappability in the region surrounding exon/intron junction as a function of total transcript length. Each exon was distributed into one of five bins based on the length of the transcript containing it. (D) Mappability in the region surrounding exon/intron junction as a function of exon conservation level, divided into five bins. (E) Mappability in the region surrounding exon/intron junction as a function of transcript expression level, divided into five bins. Transcript expression levels were obtained from . (F) Mappability in the regions surrounding transcription start and end sites. (G) Mappability in the regions surrounding CD box snoRNA start and end sites. (H) Mappability in the regions surrounding tRNA start and end sites.
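Panel A reports a mean with standard-error bars per region. As a hedged illustration of that summary statistic only, the sketch below computes a mean and SEM from a list of per-region mappability values. The function name `mean_and_sem` and the toy values are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the caption does not say which estimator was used.

```python
import math
import statistics

def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers.

    Assumption: SEM = sample standard deviation / sqrt(n).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate the SEM")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem

# Hypothetical mappability densities for four exons.
toy_mappability = [0.8, 0.9, 1.0, 0.9]
mean_value, sem_value = mean_and_sem(toy_mappability)
```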
Figure 3. Localization of GRO-seq reads along exons and introns.
Exons were aligned by their 3′ss (left panel) or by their 5′ss (right panel). The dashed line represents the exon/intron junction. Exons were divided into five bins based on microarray-based transcript expression levels in lung fibroblasts obtained from . Insets present blowups of the regions marked by black rectangles.
Figure 4. Control analyses of GRO-seq reads.
(A) Analysis of 36,905 exonic composition regions (ECRs) obtained from . ECRs were defined as exon-sized regions within intronic or intergenic regions with sequence content similar to that of exons, flanked by regions with intronic sequence content. (B) Analysis of 49,276 pseudo-exons obtained from . Pseudo-exons were defined as regions with a length distribution similar to that of exons, flanked by relatively strong splicing signals. (C) Sequence logos of all aligned GRO-seq reads, aligned by their 5′ end, as in Figure 1. (D) Positional nucleotide charts for GRO-seq reads, as in Figure 1. (E) Alignment of GRO-seq reads in the 200 nt surrounding transcription start and end sites (left and right panels, respectively). (F) Analysis as in Figure 1 following normalization of all read counts by the relative frequency of the nucleotide at the first position of each read.
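Panel F mentions normalizing read counts by the relative frequency of each read's first nucleotide. The caption does not give the exact formula, so the sketch below is one plausible reading, stated as an assumption: each read is weighted by the inverse of the fraction of reads that start with the same base. The names `first_base_weights` and `weighted_start_counts`, and the toy data, are hypothetical.

```python
from collections import Counter, defaultdict

def first_base_weights(reads):
    """Weight per first base = 1 / (fraction of reads starting with it)."""
    first = Counter(r[0].upper() for r in reads if r)
    total = sum(first.values())
    return {base: total / n for base, n in first.items()}

def weighted_start_counts(starts, reads):
    """Sum the first-base weights of the reads that begin at each position.

    Assumption: `starts[i]` is the mapped start coordinate of `reads[i]`.
    """
    weights = first_base_weights(reads)
    counts = defaultdict(float)
    for pos, read in zip(starts, reads):
        if read:
            counts[pos] += weights[read[0].upper()]
    return dict(counts)

# Hypothetical toy data: three reads start with A, one with T.
toy_reads = ["ACG", "ACT", "AGG", "TCA"]
toy_starts = [0, 0, 1, 1]
weights = first_base_weights(toy_reads)
normalized = weighted_start_counts(toy_starts, toy_reads)
```

With this weighting, reads beginning with an over-represented base contribute less than one count each, so a position is no longer favored merely because the sequence there starts with a preferred nucleotide.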
Figure 5. Analysis of effect of contamination of run-on experiment with mature RNA.
(A) Mean GRO-seq read densities within exons and their flanking exons and introns as a function of expression levels obtained from , following normalization by mappability. (B) Analysis as in panel A, showing mappability values at a single base pair level. The dotted rectangle marks the region harboring the ∼30 terminal nt. (C) Analysis as in B, but incorporating reads obtained from exon-exon junctions. (D) Exons were divided into 200 equally sized bins based on gene expression levels derived from . The percentage of exons with reads overlapping the junction between the central exon and the exon upstream of it is plotted for each bin.
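The binning in panel D can be illustrated as follows. This is a sketch under stated assumptions rather than the authors' code: exons are represented as hypothetical `(expression, has_junction_read)` pairs, they are ranked by expression, split into equally sized bins, and the percentage with a junction-spanning read is reported per bin. The function name `percent_with_junction_reads` is also hypothetical.

```python
def percent_with_junction_reads(exons, n_bins):
    """Percentage of exons with a junction-spanning read, per expression bin.

    Assumption: `exons` is a list of (expression_level, has_junction_read)
    tuples. Bins are filled in order of increasing expression; when the
    exon count is not divisible by `n_bins`, the first bins get one extra.
    """
    ranked = sorted(exons, key=lambda e: e[0])
    size, extra = divmod(len(ranked), n_bins)
    percentages = []
    start = 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        chunk = ranked[start:end]
        if chunk:
            hits = sum(1 for _, has_read in chunk if has_read)
            percentages.append(100.0 * hits / len(chunk))
        else:
            percentages.append(None)  # more bins than exons
        start = end
    return percentages

# Hypothetical toy exons, split into 4 bins instead of 200.
toy_exons = [(1, False), (2, False), (3, False), (4, True),
             (5, True), (6, True), (7, False), (8, True)]
per_bin = percent_with_junction_reads(toy_exons, n_bins=4)
```

A percentage that rises with expression level would be consistent with the contamination by mature RNA that this figure examines, since spliced junction reads can only come from processed transcripts.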
Figure 6. Plots showing GRO-seq read distribution along the start and end sites of various non-coding RNA genes.
The name of the RNA gene family and number of genes analyzed per family are indicated in red within the left and right panels, respectively.
