Brief Funct Genomics. 2017 Jul 1;16(4):194-204.
doi: 10.1093/bfgp/elw035.

Multi-perspective quality control of Illumina RNA sequencing data analysis

Quanhu Sheng et al. Brief Funct Genomics. 2017.

Abstract

Quality control (QC) is a critical step in RNA sequencing (RNA-seq). Yet, it is often ignored or conducted on a limited basis. Here, we present a multi-perspective strategy for QC of RNA-seq experiments. The QC of RNA-seq can be divided into four related stages: (1) RNA quality, (2) raw read data (FASTQ), (3) alignment and (4) gene expression. We illustrate the importance of conducting QC at each stage of an RNA-seq experiment and demonstrate our recommended RNA-seq QC strategy. Furthermore, we discuss the major and often neglected quality issues associated with the three major types of RNA-seq: mRNA, total RNA and small RNA. This RNA-seq QC overview provides comprehensive guidance for researchers who conduct RNA-seq experiments.

Keywords: RNA-seq; alignment; gene expression; quality control; raw data; small RNA-seq; total RNA-seq.

Figures

Figure 1.
The overall workflow of RNA-seq QC. (A colour version of this figure is available online at: http://bfg.oxfordjournals.org)
Figure 2.
(A) RNA with good quality, RIN = 10. (B) RNA with feasible quality, RIN = 6.9. (C) RNA with poor quality, RIN = 1.8. (D) Typical small RNA quality, RIN is usually <5 for small RNA. (A colour version of this figure is available online at: http://bfg.oxfordjournals.org)
Figure 3.
This figure was produced from QC3. (A) Example of a long RNA-seq sample with expected base quality scores. Read 2 tends to have a slightly lower median base score than read 1, but this is not usually a quality concern. (B) Example of a long RNA-seq sample with a potential base quality problem, as denoted by the sudden drops in median base quality in read 2 of paired-end sequencing. (C) Example of an sRNA-seq sample with expected quality scores. Owing to trimming, the reads of sRNA-seq are of unequal length, which causes the quality to drop more dramatically toward the end of the read cycles. (D) Example of a long RNA-seq sample with expected nucleotide distribution, as denoted by the stable nucleotide distribution across the samples. (E) Example of a long RNA-seq sample with a potential nucleotide distribution issue, as denoted by the unstable distribution across the cycles. (F) Example nucleotide distribution from an sRNA-seq sample (same sample as C). Large variation in nucleotide distribution can be observed, which is typical for sRNA-seq data. (A colour version of this figure is available online at: http://bfg.oxfordjournals.org)
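The per-cycle quantities described in this caption (median base quality and nucleotide distribution at each read cycle) can be sketched in a few lines. This is a minimal illustration, not the QC3 implementation: the function names `parse_fastq` and `per_cycle_stats` are hypothetical, and the quality strings are assumed to use the Phred+33 encoding.

```python
from collections import Counter
from statistics import median


def parse_fastq(lines):
    """Yield (sequence, quality_string) pairs from FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for _header in it:
        seq = next(it).strip()
        next(it)  # skip the '+' separator line
        qual = next(it).strip()
        yield seq, qual


def per_cycle_stats(records, phred_offset=33):
    """Return (median quality per cycle, nucleotide fractions per cycle).

    phred_offset=33 is an assumption (Phred+33 encoding).
    Reads may have unequal length, as in trimmed sRNA-seq data.
    """
    quals = []  # quals[i] holds the Phred scores observed at cycle i
    bases = []  # bases[i] counts the nucleotides observed at cycle i
    for seq, qual in records:
        for i, (base, q) in enumerate(zip(seq, qual)):
            if i == len(quals):
                quals.append([])
                bases.append(Counter())
            quals[i].append(ord(q) - phred_offset)
            bases[i][base] += 1
    medians = [median(q) for q in quals]
    fractions = [
        {b: n / sum(c.values()) for b, n in c.items()} for c in bases
    ]
    return medians, fractions


# Toy input: two reads of unequal length, the second with a low-quality last base.
toy_fastq = ["@r1", "ACGT", "+", "IIII", "@r2", "ACG", "+", "II#"]
medians, fractions = per_cycle_stats(parse_fastq(toy_fastq))
```

A sudden drop in `medians` within one read, or per-cycle `fractions` that fluctuate in long RNA-seq data, would correspond to the patterns in panels B and E. Later cycles are supported by fewer reads when lengths are unequal, so their statistics are noisier.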
Figure 4.
(A) Example of a small RNA sample with a potential quality issue based on the read length distribution after trimming. The high peak at zero indicates that the majority of the reads are adapter sequences. (B) Example of a small RNA sample with the expected read length distribution after trimming. We observe a high peak at 22 for miRNA and another at 33 for tRNA.
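The read length check described in this caption can be sketched as follows. This is an illustration only, assuming the adapter-trimmed reads are already available as strings; `length_distribution` and `summarize_lengths` are hypothetical helper names, not part of any tool named in the paper.

```python
from collections import Counter


def length_distribution(trimmed_reads):
    """Count how many trimmed reads have each length (0 = nothing left after trimming)."""
    return Counter(len(read) for read in trimmed_reads)


def summarize_lengths(dist):
    """Report the fraction of empty reads and the most common read length."""
    total = sum(dist.values())
    if total == 0:
        return None
    mode_length, _count = dist.most_common(1)[0]
    return {
        "fraction_empty": dist.get(0, 0) / total,
        "mode_length": mode_length,
    }


# Toy input: five 22-nt reads, three 33-nt reads, two reads that were all adapter.
toy_reads = ["A" * 22] * 5 + ["A" * 33] * 3 + [""] * 2
dist = length_distribution(toy_reads)
summary = summarize_lengths(dist)
```

A `mode_length` of 0, or a large `fraction_empty`, would match the problem shown in panel A. Peaks near 22 and 33 would match the expected pattern in panel B.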
Figure 5.
(A) Expected insert size distribution from RNA-seq data. A peak should be observed between 100 and 200 nucleotides, as this is the targeted fragment size in most RNA-seq kits. The high peak for insert sizes >1000 nucleotides is caused by improperly mapped pairs. These are usually errors caused by sequencing or alignment, or they indicate structural variations. (B) A graphical explanation of why insert size estimation in the SAM/BAM file is inaccurate. (A colour version of this figure is available online at: http://bfg.oxfordjournals.org)
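One way to inspect the insert size distribution described in this caption is to read the template length (TLEN, column 9) and the FLAG (column 2) from SAM records. The sketch below is a plain-text illustration under stated assumptions, not the authors' method: the function names are hypothetical, the 1000-nucleotide cutoff is taken from the caption, and only positive TLEN values are kept so that each pair is counted once.

```python
from statistics import median


def template_lengths(sam_lines):
    """Yield (TLEN, is_proper_pair) for records with a positive TLEN."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        if tlen > 0:  # the mate carries the negative value; count each pair once
            yield tlen, bool(flag & 0x2)  # 0x2: aligned as a proper pair


def summarize_insert_sizes(sam_lines, large_cutoff=1000):
    """Summarize template lengths; large_cutoff follows the >1000 nt peak in the caption."""
    sizes = []
    proper = 0
    for tlen, is_proper in template_lengths(sam_lines):
        sizes.append(tlen)
        proper += is_proper
    if not sizes:
        return None
    n = len(sizes)
    return {
        "pairs": n,
        "median": median(sizes),
        "fraction_over_cutoff": sum(s > large_cutoff for s in sizes) / n,
        "proper_pair_fraction": proper / n,
    }


# Toy SAM records (tab-separated); read names and coordinates are made up.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t50M\t=\t200\t150\t*\t*",
    "r1\t147\tchr1\t200\t60\t50M\t=\t100\t-150\t*\t*",
    "r2\t99\tchr1\t300\t60\t50M\t=\t430\t180\t*\t*",
    "r3\t65\tchr1\t100\t60\t50M\t=\t5000\t4950\t*\t*",
]
insert_summary = summarize_insert_sizes(toy_sam)
```

A median in the 100 to 200 range with a small `fraction_over_cutoff` would be consistent with panel A. For real data, a library such as pysam would normally replace the manual parsing of BAM files.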
Figure 6.
(A and B) Examples of the expected read count distribution after variance stabilization normalization, where all samples share similar distributions. (C and D) Examples of uneven read count distributions after normalization. This might indicate a batch effect or outlier samples. (A colour version of this figure is available online at: http://bfg.oxfordjournals.org)
Figure 7.
Top miRNAs detected in each sample. Left: human small RNA data set with high detected abundance of hsa-miR-486-5p. Right: mouse small RNA data set with high detected abundance of mmu-miR-486a-5p. (A colour version of this figure is available online at: http://bfg.oxfordjournals.org)
