Genome Biol. 2015 Jun 23;16(1):131.
doi: 10.1186/s13059-015-0697-y.

The impact of read length on quantification of differentially expressed genes and splice junction detection

Sagar Chhangawala et al. Genome Biol.

Abstract

Background: The initial next-generation sequencing technologies produced reads of 25 or 36 bp, and only from a single end of the library sequence. Currently, it is possible to reliably produce 300 bp paired-end sequences for RNA expression analysis. As read lengths have steadily increased, people have assumed that longer reads are more informative and that paired-end reads produce better results than single-end reads. We used paired-end 101 bp reads and trimmed them to simulate different read lengths, and also separated the pairs to produce single-end reads. For each read length and paired status, we evaluated differential expression levels between two standard samples and compared the results to those obtained by qPCR.
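The trimming step described in the Background can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the function names, the choice to trim from the 3' end, the use of mate 1 as the single-end read, and the default set of target lengths are all assumptions made here.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# A FASTQ record as (header, sequence, quality string).
FastqRecord = Tuple[str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from standard four-line FASTQ text."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip stray blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # the '+' separator line carries no data we need
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def trim_record(record: FastqRecord, length: int) -> FastqRecord:
    """Keep the first `length` bases and their qualities (assumed 3' trimming)."""
    header, seq, qual = record
    return header, seq[:length], qual[:length]


def simulate_read_lengths(
    records: Iterable[FastqRecord],
    lengths: Tuple[int, ...] = (25, 50, 75, 100),  # assumed target lengths
) -> Dict[int, List[FastqRecord]]:
    """Return one trimmed copy of the read set per requested length."""
    records = list(records)
    return {n: [trim_record(r, n) for r in records] for n in lengths}


def to_single_end(
    pairs: Iterable[Tuple[FastqRecord, FastqRecord]],
) -> List[FastqRecord]:
    """Simulate single-end data by keeping only mate 1 (an assumption)."""
    return [mate1 for mate1, _mate2 in pairs]
```

Applying `simulate_read_lengths` to each mate file separately keeps the pairs in register, because trimming never drops or reorders records.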

Results: We found that, with the exception of 25 bp reads, read length makes little difference to the detection of differential expression. Once single-end reads reach a length of 50 bp, the results do not change substantially for any length up to, and including, 100 bp paired-end. However, splice junction detection significantly improves as the read length increases, with 100 bp paired-end reads showing the best performance. We performed the same analysis on two ENCODE samples and found consistent results, confirming that our conclusions have broad application.

Conclusions: A researcher could save substantial resources by using 50 bp single-end reads for differential expression analysis instead of using longer reads. However, splicing detection is unquestionably improved by paired-end and longer reads. Therefore, an appropriate read length should be used based on the final goal of the study.

Figures

Fig. 1
Mapping statistics of all samples and read lengths. (a) Each sample and read length is plotted with its respective percentage of uniquely mapped, multi-mapped and unmapped reads. The 25 bp reads have the lowest percentage of uniquely mapped reads across all samples. Single-end reads also have a higher percentage of multi-mapped reads and a slightly lower percentage of uniquely mapped reads. (b) The number of splice junctions detected for each sample is plotted. The 25 bp samples detected the fewest junctions, and single-end reads detected significantly fewer junctions overall than paired-end reads. The error bars represent the highest and lowest number of splice junctions detected across replicates.
Fig. 2
Determination of differentially expressed genes according to read length and differential expression method. (a) Single-end read samples. The number of orphan genes (read-length-specific genes) in the overlap of the top 200 genes sorted by -Log2-based fold change (-Log2FC; down-regulated), +Log2FC (up-regulated) and p value. (b) Paired-end read samples. The number of orphan genes (read-length-specific genes) in the overlap of the top 200 genes sorted by -Log2FC, +Log2FC and p value. (c) Single-end read samples. The plot shows the agreement for the top 200 differentially expressed genes by different read length. (d) Paired-end read samples. The plot shows the agreement for the top 200 differentially expressed genes by different read length.
Fig. 3
Comparison of previously reported qPCR results with our DEG results. (a) Pearson correlation between Log2FC of genes according to various differential expression methods and qPCR. Single-end 25 bp reads have the worst correlation when using DESeq and edgeR. (b) Root mean square deviation (RMSD) between Log2FC and qPCR. Single-end 25 bp reads give results farthest from the true values. (c) Common genes between the top 200 genes identified by various differential expression methods and qPCR sorted by +Log2FC. (d) Same as (c), except sorted by -Log2FC. The overlap of common genes improves as read length increases. However, the gain is not significant for reads >50 bp for paired-end and >75 bp for single-end reads.
Fig. 4
Splice junction agreement and inter-replicate reproducibility. (a) Number of known and novel junctions that were orphans (read-length-specific junctions) according to the read length in a specific sample. (b) Percentage of known junctions that were common when paired-end and single-end samples of the same read length were intersected. Error bars represent the range of all the replicates.
Fig. 5
Common splice junctions detected with different read lengths. (a) Percentage of splice junctions detected with all four read lengths. (b) Percentage of splice junctions detected with all read lengths except 25 bp.
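The two agreement measures named in the Fig. 3 caption, Pearson correlation and root mean square deviation (RMSD) between RNA-seq Log2FC and qPCR Log2FC, can be computed along the following lines. This is a minimal sketch using the textbook definitions; the function names and the example values are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rmsd(x: Sequence[float], y: Sequence[float]) -> float:
    """Root mean square deviation between paired values."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty series of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical Log2FC values for the same genes, in the same order.
rnaseq_log2fc = [1.8, -0.9, 2.6, 0.1]
qpcr_log2fc = [2.0, -1.0, 2.5, 0.0]

r = pearson(rnaseq_log2fc, qpcr_log2fc)
deviation = rmsd(rnaseq_log2fc, qpcr_log2fc)
```

A higher `r` and a lower `deviation` both indicate closer agreement with qPCR. The two can disagree, since correlation ignores a constant offset or scale difference that RMSD penalizes, which is presumably why the figure reports both.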
