Genome Biol. 2015 Sep 3;16(1):177.
doi: 10.1186/s13059-015-0734-x.

Errors in RNA-Seq quantification affect genes of relevance to human disease

Christelle Robert et al. Genome Biol. 2015.

Abstract

Background: RNA-Seq has emerged as the standard for measuring gene expression and is an important technique often used in studies of human disease. Gene expression quantification involves comparison of the sequenced reads to a known genomic or transcriptomic reference. The accuracy of that quantification relies on there being enough unique information in the reads to enable bioinformatics tools to accurately assign the reads to the correct gene.

Results: We apply 12 common methods to estimate gene expression from RNA-Seq data and show that there are hundreds of genes whose expression is underestimated by one or more of those methods. Many of these genes have been implicated in human disease, and we describe their roles. We go on to propose a two-stage analysis of RNA-Seq data in which multi-mapped or ambiguous reads can instead be uniquely assigned to groups of genes. We apply this method to a recently published mouse cancer study, and demonstrate that we can extract relevant biological signal from data that would otherwise have been discarded.

Conclusions: For hundreds of genes in the human genome, RNA-Seq is unable to measure expression accurately. These genes are enriched for gene families, and many of them have been implicated in human disease. We show that it is possible to use data that may otherwise have been discarded to measure group-level expression, and that such data contains biologically relevant information.
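
The two-stage idea described in the Results can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `build_multimap_groups`, the input dictionary layout, and the gene names are all hypothetical, and the sketch assumes each read has already been reduced to the list of genes it aligns to. Reads that hit exactly one gene are counted for that gene (stage 1); reads that hit several genes are counted once for the group formed by that exact set of genes (stage 2), instead of being discarded.

```python
from collections import Counter


def build_multimap_groups(read_to_genes):
    """Split reads into per-gene counts and per-group counts.

    read_to_genes: dict mapping a read ID to the list of gene IDs that the
    read aligns to (hypothetical input format, assumed for illustration).
    """
    unique_counts = Counter()  # stage 1: reads assignable to one gene
    group_counts = Counter()   # stage 2: reads assignable to one gene group
    for genes in read_to_genes.values():
        gene_set = frozenset(genes)  # order and repeats do not matter
        if not gene_set:
            continue  # unmapped read: nothing to count
        if len(gene_set) == 1:
            unique_counts[next(iter(gene_set))] += 1
        else:
            group_counts[gene_set] += 1
    return unique_counts, group_counts


# Toy input with made-up read and gene names.
example = {
    "r1": ["GeneA"],
    "r2": ["GeneA", "GeneB"],
    "r3": ["GeneB", "GeneA"],
    "r4": ["GeneC"],
    "r5": ["GeneB", "GeneC"],
}
unique_counts, group_counts = build_multimap_groups(example)
```

In this toy input, reads `r2` and `r3` fall into the same group because the group key is the set of genes, regardless of the order in which alignments are reported. The sketch does not model reads that map to several locations within a single gene; it simply treats them as unique.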

Figures

Fig. 1
Comparison of methods on global simulated data. a Scatter plots comparing FPKM for each of the 12 methods against the known FPKM from simulated data. The red line indicates the y = x line. b Histograms of read counts for each of the 12 methods. All methods should have a single peak at 1000. c A heatmap of read counts from 843 grossly underestimated genes and 187 grossly overestimated genes. Black and darker colours indicate read counts close to 1000 (accurate); green colours indicate underestimation and red colours overestimation
Fig. 2
General characteristics of problematic genes. Boxplots comparing the length of the shortest exon, the length of the longest exon, the mean exon length, the total number of exons, the transcript length, transcript percentage GC, the number of reads overlapping from the STAR alignment and the number of reads overlapping the TopHat alignment for the 958 problematic genes and the 18,696 other genes
Fig. 3
Comparison of methods on difficult genes. a Scatter plots comparing observed FPKM for each of the 12 methods against the known FPKM from simulated data. The red line indicates the y = x line. b Scatter plots comparing observed read counts for each of the 12 methods against the known read counts from simulated data. The red line indicates the y = x line
Fig. 4
Principal component analysis (PCA) of mouse cancer study. a PCA of tumour (red) and normal (blue) RNA-Seq datasets from each of five cell types. Input data are log(FPKM) values after mapping data using STAR and counting only uniquely mapped reads against known mouse genes (stage 1 analysis). b PCA of tumour (red) and normal (blue) RNA-Seq datasets from each of five cell types. Input data are log(FPM) values of reads that cannot be assigned to a single gene but can be uniquely assigned to a multi-map group (MMG). The reads used in b are only those reads discarded from a
Fig. 5
Heatmap of novel multi-map groups (MMGs). A heatmap of the log FPM (fragments per million) values for 672 differentially expressed MMGs that do not contain any genes present in the list of differentially expressed genes from an analysis of unique counts. The heatmap demonstrates that MMGs which are exclusive of differentially expressed genes from unique counts can be used to separate tumour from normal samples
Fig. 6
Comparison of read counts for (a) ENSMUSG00000024121 and (b) MG4194. Read counts expressed as a percentage of the mapped reads for gene ENSMUSG00000024121, and MG4194, a single-gene MMG that contains only ENSMUSG00000024121. ENSMUSG00000024121 was not found to be differentially expressed by the unique read analysis, but MG4194 was found to be differentially expressed by the MMG analysis. Black bars represent tumour samples, white bars normal samples
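
The captions above report expression as FPKM (for genes) and FPM (for MMGs), usually on a log scale. As a rough sketch of the standard definitions, assuming fragment counts and the total number of mapped fragments are already known, the normalisations could be computed as below. The function names are hypothetical, and the log base and pseudocount are assumptions, since the captions do not state which were used.

```python
import math


def fpm(fragments, total_mapped_fragments):
    """Fragments per million mapped fragments."""
    return fragments * 1e6 / total_mapped_fragments


def fpkm(fragments, feature_length_bp, total_mapped_fragments):
    """Fragments per kilobase of feature per million mapped fragments."""
    return fragments * 1e9 / (feature_length_bp * total_mapped_fragments)


def log_expr(value, pseudocount=1.0):
    """Log2 of an expression value.

    The base (2) and the pseudocount (1.0) are assumptions made so that
    zero values stay finite; they are not taken from the source.
    """
    return math.log2(value + pseudocount)


# Made-up numbers: 1000 fragments on a 2 kb gene, 10 million mapped fragments.
gene_fpkm = fpkm(1000, 2000, 10_000_000)
group_fpm = fpm(50, 1_000_000)
```

FPM omits the length term, which suits a group of genes where no single feature length applies.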
