RNA-Seq gene expression estimation with read mapping uncertainty

Bo Li¹, Victor Ruotti, Ron M Stewart, James A Thomson, Colin N Dewey

PMID: 20022975
PMCID: PMC2820677
DOI: 10.1093/bioinformatics/btp692

Bo Li et al. Bioinformatics. 2010 Feb 15;26(4):493-500. doi: 10.1093/bioinformatics/btp692. Epub 2009 Dec 18.

Affiliation

¹ Department of Computer Sciences, University of Wisconsin, Madison, WI 53706, USA.

Abstract

Motivation: RNA-Seq is a promising new technology for accurately measuring gene expression levels. Expression estimation with RNA-Seq requires the mapping of relatively short sequencing reads to a reference genome or transcript set. Because reads are generally shorter than the transcripts from which they are derived, a single read may map to multiple genes and isoforms, complicating expression analyses. Previous computational methods either discard reads that map to multiple locations or allocate them to genes heuristically.

Results: We present a generative statistical model and associated inference methods that handle read mapping uncertainty in a principled manner. Through simulations parameterized by real RNA-Seq data, we show that our method is more accurate than previous methods. Our improved accuracy is the result of handling read mapping uncertainty with a statistical model and the estimation of gene expression levels as the sum of isoform expression levels. Unlike previous methods, our method is capable of modeling non-uniform read distributions. Simulations with our method indicate that a read length of 20-25 bases is optimal for gene-level expression estimation from mouse and maize RNA-Seq data when sequencing throughput is fixed.
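As a rough illustration of how a statistical model can allocate multi-mapped reads rather than discard them, the sketch below runs a basic expectation-maximization (EM) loop over read-to-isoform compatibilities and then sums isoform estimates into gene estimates. It is not the authors' implementation. It assumes uniform read start positions, error-free alignments, and at least one compatible isoform per read. The function names and the input formats (`read_compat`, `eff_len`, `isoform_to_gene`) are hypothetical, introduced here only for illustration.

```python
from collections import defaultdict


def em_read_fractions(read_compat, eff_len, n_iter=200):
    """Estimate the fraction of reads derived from each isoform.

    read_compat[r] lists the isoform indices that read r aligns to
    (hypothetical input format). eff_len[i] is the number of possible
    read start positions on isoform i (assumed uniform start model).
    """
    n_isoforms = len(eff_len)
    n_reads = float(len(read_compat))
    theta = [1.0 / n_isoforms] * n_isoforms  # uniform starting guess
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read across its compatible isoforms in
        # proportion to theta_i * P(read | isoform i) = theta_i / eff_len_i.
        for isoforms in read_compat:
            weights = [theta[i] / eff_len[i] for i in isoforms]
            total = sum(weights)
            for i, w in zip(isoforms, weights):
                counts[i] += w / total
        # M-step: new estimates are the normalized expected read counts.
        theta = [c / n_reads for c in counts]
    return theta


def to_transcript_fractions(theta, eff_len):
    """Convert read fractions to transcript fractions by length-normalizing."""
    scaled = [t / l for t, l in zip(theta, eff_len)]
    total = sum(scaled)
    return [s / total for s in scaled]


def gene_level(values, isoform_to_gene):
    """Sum isoform-level estimates to obtain gene-level estimates."""
    genes = defaultdict(float)
    for i, value in enumerate(values):
        genes[isoform_to_gene[i]] += value
    return dict(genes)
```

For example, with three equal-length isoforms, six reads unique to isoform 0, two reads unique to isoform 2, and two reads compatible with both, the loop converges to read fractions of 0.75, 0 and 0.25. The two ambiguous reads are shared in proportion to the evidence from the unique reads instead of being thrown away or split evenly.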

Figures

**Fig. 1.**
The graphical model for RNA-Seq data used by our method.

**Fig. 2.**
Gene expression estimation accuracy varies with read length given fixed base throughput (T). The curves are (1) mouse liver, T=375 × 10⁶, (2) mouse liver, T=750 × 10⁶, (3) mouse liver, T=1.5 × 10⁹, (4) mouse brain, T=750 × 10⁶ and (5) maize, T=750 × 10⁶. The τ MPE was calculated with respect to the true expression values for all genes with a true level of at least 1 TPM.
