BMC Bioinformatics. 2011 Jul 19;12:290.
doi: 10.1186/1471-2105-12-290.

Bias detection and correction in RNA-Sequencing data

Wei Zheng et al. BMC Bioinformatics. 2011.

Abstract

Background: High throughput sequencing technology provides unprecedented opportunities to study transcriptome dynamics. Compared to microarray-based gene expression profiling, RNA-Seq has many advantages, such as high resolution, low background, and the ability to identify novel transcripts. Moreover, for genes with multiple isoforms, the expression of each isoform may be estimated from RNA-Seq data. Despite these advantages, recent work revealed that base-level read counts from RNA-Seq data may not be randomly distributed and can be affected by local nucleotide composition. It was not clear, however, how this base-level read count bias affects gene-level expression estimates.

Results: In this paper, using five published RNA-Seq data sets from different biological sources and with different data preprocessing schemes, we showed that commonly used estimates of gene expression levels from RNA-Seq data, such as reads per kilobase of gene length per million reads (RPKM), are biased with respect to gene length, GC content and dinucleotide frequencies. We examined the biases directly at the gene level and proposed a simple approach based on a generalized additive model (GAM) to correct different sources of bias simultaneously. Compared to previously proposed base-level correction methods, our method reduces bias in gene-level expression estimates more effectively.
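
The RPKM measure named above can be written out directly. The following is a minimal sketch of the standard definition (reads mapped to a gene, divided by gene length in kilobases and by total mapped reads in millions); the function name `rpkm` and its argument names are illustrative and do not come from the paper.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene length per million mapped reads.

    RPKM = read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = read_count * 1e9 / (gene_length_bp * total_mapped_reads)
    """
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total mapped reads must be positive")
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2 kb gene in a library of 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))  # prints 25.0
```

Because the normalization is a simple division by length and library size, any residual dependence of the result on gene length or sequence composition is not removed by RPKM itself, which is the kind of bias the abstract describes.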

Conclusions: Our method identifies and corrects different sources of bias in gene-level expression measures from RNA-Seq data, providing more accurate estimates of gene expression levels. This method should prove useful in meta-analysis of gene expression levels using different platforms or experimental protocols.
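
To illustrate what correcting several bias sources simultaneously can look like, the sketch below fits an additive model with one cubic polynomial term per bias factor by ordinary least squares and keeps the residuals as corrected log expression values. This is a simplified stand-in for a generalized additive model (which would typically use penalized smooth terms), not the authors' implementation; the function name `additive_bias_correction`, the polynomial degree, and the synthetic data are all assumptions made for the example. It requires NumPy.

```python
import numpy as np


def additive_bias_correction(log_expr, covariates, degree=3):
    """Remove additive, smooth-ish effects of several covariates at once.

    Fits log_expr ~ intercept + sum_j poly_j(covariate_j) by least squares,
    where poly_j is a polynomial of the given degree (a crude substitute for
    the smooth terms of a GAM), and returns the residuals shifted back to the
    original mean so the overall expression scale is preserved.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    columns = [np.ones_like(log_expr)]  # intercept
    for x in covariates:
        x = np.asarray(x, dtype=float)
        z = (x - x.mean()) / x.std()  # standardize for numerical stability
        for d in range(1, degree + 1):
            columns.append(z ** d)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, log_expr, rcond=None)
    fitted = design @ coef
    return log_expr - fitted + log_expr.mean()


if __name__ == "__main__":
    # Synthetic genes whose log expression depends on log length and GC content.
    rng = np.random.default_rng(0)
    n_genes = 2000
    log_length = rng.normal(7.5, 1.0, n_genes)
    gc_content = rng.uniform(0.3, 0.7, n_genes)
    log_expr = (2.0 + 0.8 * (log_length - 7.5) - 3.0 * (gc_content - 0.5)
                + rng.normal(0.0, 0.5, n_genes))

    corrected = additive_bias_correction(log_expr, [log_length, gc_content])
    print("correlation with log length before:",
          np.corrcoef(log_expr, log_length)[0, 1])
    print("correlation with log length after: ",
          np.corrcoef(corrected, log_length)[0, 1])
```

Fitting all covariates in one model, rather than correcting them one at a time, matters when the bias factors are correlated with each other, as GC content and GC-related dinucleotide frequencies are.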

Figures

Figure 1
Differences in experimental protocols for RNA-Seq. Major steps are compared for the standard Illumina RNA-Seq sample preparation protocol in Marioni et al. [7], Bullard et al. [20] and Lee et al. [23] (left), the alternative RNA-Seq protocols in Nagalakshmi et al. [24] (middle), and the FRT-Seq protocol in Mamanova et al. [25] (right).
Figure 2
Bias plots for MAQC data (Procedure 1, gene-level). Genes were grouped into bins according to log gene length, GC content, and dinucleotide frequencies, and the median expression levels in log(FPKM) units versus median bias factors were plotted for MAQC2 brain and UHR samples before and after GAM correction. Each bin contains 500 genes. Data were processed by Procedure 1. This data set showed a strong linear relationship between expression levels and gene length, GC content, and the dinucleotide frequencies that are related to GC content (i.e. AA, AT, TA, TT, GG, GC, CG, CC). Moreover, the patterns from the two different biological samples (brain and UHR) were very similar. After GAM correction, the bias patterns diminished.
Figure 3
GAM is robust to sequencing depth and gene expression levels. Seven lanes of MAQC2 UHR data were used to perform GAM correction, adding one lane at a time. The log fold changes of estimated gene expression levels before and after GAM correction were calculated for genes with high, medium and low expression. Fractions of genes with log fold change within ± 5% of the final value were plotted. Overall, the correction was robust to sequencing depth, with ~80% of genes showing a fold change within ± 5% of the final estimates using only one lane. Moreover, genes with lower expression were only slightly more sensitive to sequencing depth.
Figure 4
Platform and sample specific biases. Bias patterns of gene expression levels in terms of log gene length, GC content and dinucleotide frequencies for 571 and 606 genes measured by both Taqman RT-PCR (expression levels in ΔCT units) and RNA-Seq platforms (expression levels in log FPKM units) and expressed above the detection thresholds in brain and UHR samples. The expression levels were rescaled to a mean of 0 and a standard deviation of 1 on the Y-axes. In most bias plots, the fitted lowess curves for the sequencing platform (blue and red) were separated from those for the RT-PCR platform (green and black), which indicates platform specific biases. Only a few plots (e.g. for AG and GA dinucleotide frequencies) showed separation between brain samples and UHR samples, which indicates sample specific biases.
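
The binning behind the bias plots in Figure 2 (genes grouped by a bias factor, 500 genes per bin, with the median factor plotted against the median expression) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function name `binned_medians` is hypothetical, genes are assumed to be sorted by the bias factor before being split into consecutive bins, and a final bin with fewer than `bin_size` genes is kept as-is because the caption does not say how leftovers were handled.

```python
from statistics import median


def binned_medians(bias_factor, log_expr, bin_size=500):
    """Return one (median bias factor, median log expression) point per bin.

    Genes are sorted by the bias factor (e.g. log gene length or GC content)
    and split into consecutive bins of `bin_size` genes. The last bin may be
    smaller when the gene count is not a multiple of `bin_size` (assumption).
    """
    if len(bias_factor) != len(log_expr):
        raise ValueError("bias_factor and log_expr must have the same length")
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    pairs = sorted(zip(bias_factor, log_expr))
    points = []
    for start in range(0, len(pairs), bin_size):
        chunk = pairs[start:start + bin_size]
        points.append((median(f for f, _ in chunk),
                       median(e for _, e in chunk)))
    return points


if __name__ == "__main__":
    # Tiny toy input with bins of 3 genes instead of 500.
    gc_content = [0.45, 0.38, 0.41, 0.62, 0.58, 0.55]
    log_fpkm = [1.2, 0.8, 1.0, 2.1, 1.9, 1.7]
    for factor, expr in binned_medians(gc_content, log_fpkm, bin_size=3):
        print(factor, expr)
```

Plotting the returned points for each bias factor, before and after correction, gives a quick visual check of whether a trend between the factor and expression remains.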

References

    1. Reinartz J, Bruyns E, Lin JZ, Burcham T, Brenner S, Bowen B, Kramer M, Woychik R. Massively parallel signature sequencing (MPSS) as a tool for in-depth quantitative gene expression profiling in all organisms. Brief Funct Genomic Proteomic. 2002;1(1):95–104. doi: 10.1093/bfgp/1.1.95.
    2. Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE. Using the transcriptome to annotate the genome. Nat Biotechnol. 2002;20(5):508–512. doi: 10.1038/nbt0502-508.
    3. Velculescu VE, Zhang L, Vogelstein B, Kinzler KW. Serial analysis of gene expression. Science. 1995;270(5235):484–487. doi: 10.1126/science.270.5235.484.
    4. Adams MD, Kelley JM, Gocayne JD, Dubnick M, Polymeropoulos MH, Xiao H, Merril CR, Wu A, Olde B, Moreno RF. Complementary DNA sequencing: expressed sequence tags and human genome project. Science. 1991;252(5013):1651–1656. doi: 10.1126/science.2047873.
    5. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. doi: 10.1038/nrg2484.
