RNA-Seq gene expression estimation with read mapping uncertainty

Bo Li et al. Bioinformatics. 2010 Feb 15;26(4):493-500.
doi: 10.1093/bioinformatics/btp692. Epub 2009 Dec 18.

Abstract

Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.

Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20-25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.
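As an illustration of what handling read mapping uncertainty with a statistical model can look like, the following is a minimal sketch of an expectation-maximization (EM) loop that fractionally assigns multi-mapping reads to isoforms and then sums isoform levels to obtain gene levels. It is not the authors' implementation: it assumes a much simpler model than the one in the paper (no transcript lengths, no sequencing errors, no non-uniform read distributions), and the names `em_abundances`, `gene_level`, `read_compat` and `transcript_to_gene` are hypothetical.

```python
def em_abundances(read_compat, n_transcripts, n_iters=200):
    """Estimate the fraction of reads originating from each transcript.

    read_compat: one list per read, holding the indices of the
                 transcripts (isoforms) that the read maps to.
    Simplifying assumption: every compatible transcript is equally
    likely to have produced a read, apart from its current abundance.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # uniform start
    for _ in range(n_iters):
        counts = [0.0] * n_transcripts
        # E-step: split each read across its candidate transcripts
        # in proportion to the current abundance estimates.
        for hits in read_compat:
            total = sum(theta[t] for t in hits)
            if total == 0.0:
                continue  # unmapped read, or all candidates at zero
            for t in hits:
                counts[t] += theta[t] / total
        # M-step: renormalize the expected counts into fractions.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


def gene_level(theta, transcript_to_gene):
    """Gene expression as the sum of its isoform expression levels."""
    genes = {}
    for t, g in enumerate(transcript_to_gene):
        genes[g] = genes.get(g, 0.0) + theta[t]
    return genes


# Toy data: three reads map uniquely to isoform 0, one read maps
# to both isoform 0 and isoform 1 of the same (hypothetical) gene.
reads = [[0], [0], [0], [0, 1]]
theta = em_abundances(reads, n_transcripts=2)
genes = gene_level(theta, ["geneA", "geneA"])
```

In this toy case the ambiguous read is attributed almost entirely to isoform 0, because the uniquely mapping reads already support it; a method that discards multi-mapping reads would ignore that read altogether.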

Figures

Fig. 1. The graphical model for RNA-Seq data used by our method.
Fig. 2. Gene expression estimation accuracy varies with read length given fixed base throughput (T). The curves are (1) mouse liver, T = 375 × 10^6, (2) mouse liver, T = 750 × 10^6, (3) mouse liver, T = 1.5 × 10^7, (4) mouse brain, T = 750 × 10^6 and (5) maize, T = 750 × 10^6. The τ MPE was calculated with respect to the true expression values for all genes with true level at least 1 TPM.
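
The error measure in the Fig. 2 caption can be sketched as follows. This assumes that MPE stands for median percent error and that TPM (transcripts per million) is the transcript fraction τ scaled by 10^6; neither expansion is spelled out in the text above, and the function names are hypothetical.

```python
from statistics import median


def to_tpm(tau):
    """Convert transcript fractions (summing to 1) to TPM (assumed scaling)."""
    return [x * 1e6 for x in tau]


def median_percent_error(estimated, true, min_true=1.0):
    """Median of 100 * |estimate - truth| / truth over the genes whose
    true level is at least min_true (here: 1 TPM, as in the caption).
    Raises statistics.StatisticsError if no gene passes the filter."""
    errors = [
        100.0 * abs(e - t) / t
        for e, t in zip(estimated, true)
        if t >= min_true
    ]
    return median(errors)


# Toy values in TPM; the third gene is excluded (true level below 1 TPM).
true_tpm = [1.0, 4.0, 0.2, 10.0]
est_tpm = [1.1, 2.0, 0.5, 9.0]
mpe = median_percent_error(est_tpm, true_tpm)
```

Filtering on the true level keeps very lowly expressed genes, whose percent errors are unstable, from dominating the summary.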

