Microb Genom. 2020 Jan;6(1):e000320.
doi: 10.1099/mgen.0.000320.

Cost effective, experimentally robust differential-expression analysis for human/mammalian, pathogen and dual-species transcriptomics

Amol C Shetty et al. Microb Genom. 2020 Jan.

Abstract

As sequencing read length has increased, researchers have quickly adopted longer reads for their experiments. Here, we examine 14 pathogen or host-pathogen differential gene expression data sets to assess whether using longer reads is warranted. A variety of data sets was used to assess what genomic attributes might affect the outcome of differential gene expression analysis including: gene density, operons, gene length, number of introns/exons and intron length. No genome attribute was found to influence the data in principal components analysis, hierarchical clustering with bootstrap support, or regression analyses of pairwise comparisons that were undertaken on the same reads, looking at all combinations of paired and unpaired reads trimmed to 36, 54, 72 and 101 bp. Read pairing had the greatest effect when there was little variation in the samples from different conditions or in their replicates (e.g. little differential gene expression). But overall, 54 and 72 bp reads were typically most similar. Given differences in costs and mapping percentages, we recommend 54 bp reads for organisms with no or few introns and 72 bp reads for all others. In a third of the data sets, read pairing had absolutely no effect, despite paired reads having twice as much data. Therefore, single-end reads seem robust for differential-expression analyses, but in eukaryotes paired-end reads are likely desired to analyse splice variants and should be preferred for data sets that are acquired with the intent to be community resources that might be used in secondary data analyses.

Keywords: RNA-Seq; dual species RNA-Seq; sequencing; transcriptomics.
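The comparison design described in the abstract (the same reads trimmed to 36, 54, 72 and 101 bp, analysed as paired or unpaired) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the `(sequence, quality)` tuple layout and the choice to keep the first bases of each read are assumptions made for illustration only.

```python
from itertools import product

# Read lengths and pairing statuses named in the abstract and figure captions.
READ_LENGTHS = (36, 54, 72, 101)
PAIRINGS = ("paired", "first_in_pair", "second_in_pair")


def trim(seq, qual, length):
    """Keep the first `length` bases and their qualities.

    Assumption: trimming removes bases from the end of the read; the
    source does not state which end was trimmed.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[:length], qual[:length]


def make_subsets(read1, read2):
    """Build every (length, pairing) variant of one read pair.

    `read1` and `read2` are hypothetical (sequence, quality) tuples.
    Returns a dict keyed by (length, pairing) whose values are tuples
    holding one or two trimmed reads.
    """
    subsets = {}
    for length, pairing in product(READ_LENGTHS, PAIRINGS):
        r1 = trim(read1[0], read1[1], length)
        r2 = trim(read2[0], read2[1], length)
        if pairing == "paired":
            subsets[(length, pairing)] = (r1, r2)
        elif pairing == "first_in_pair":
            subsets[(length, pairing)] = (r1,)
        else:
            subsets[(length, pairing)] = (r2,)
    return subsets


if __name__ == "__main__":
    pair = (("A" * 101, "I" * 101), ("C" * 101, "I" * 101))
    variants = make_subsets(*pair)
    for key in sorted(variants):
        print(key, [len(seq) for seq, _ in variants[key]])
```

Because every variant is derived from the same underlying reads, any difference in the downstream results can be attributed to read length or pairing rather than to sampling.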

Conflict of interest statement

The authors declare that there are no conflicts of interest.

Figures

Fig. 1.
The percentages of reads mapping (circles), reads mapping uniquely (triangles) and reads not mapping uniquely (squares) are compared for 36, 54, 72 and 101 bp reads for the human (a), mouse (b), Aspergillus (c), Candida/host (d), Candida only (e) and E. coli (f) data sets. Results are compared for mappings with the paired reads (red), only the first read in the pair (green) and only the second read in the pair (blue).
Fig. 2.
A PCA was undertaken for a vector representing data for the different read lengths (green, 36 bp; blue, 54 bp; magenta, 72 bp; purple, 101 bp), replicates and biological conditions. Four representative results are illustrated with E. coli paired-end data (circle, DMEM; triangle, LB) (a), Candida/human first-in-pair single-end reads (circle, 5h_c; triangle, 5h_oc) (b), CSHL ENCODE human first-in-pair single-end reads (circle, IMR-90; triangle, NHD) (c) and Wolbachia paired-end reads (circle, adult females; triangle, adult males) (d). All PCA plots for read length are provided in Additional Files S1–S12 and those for pairing status are provided in Additional Files S13–S24.
Fig. 3.
Hierarchical clustering using pvclust for bootstrap support was undertaken for a vector representing data for each sample at different read lengths. Samples are labelled according to the key in Table S2, followed by the read length (36, 54, 72 and 101 bp). Four representative results are illustrated here with E. coli paired-end data (a), Candida/human first-in-pair single-end reads (b), ENCODE human first-in-pair single-end reads (c) and Wolbachia paired-end reads (d). In the E. coli data, read length did not affect the clustering of the data, while the largest effect of read length was observed with the Wolbachia data. All hierarchical clustering plots for read length are provided in Additional Files S1–S12 and those for pairing status are provided in Additional Files S13–S24.
Fig. 4.
The differentially expressed genes identified in E. coli (L vs M) using an adjusted P value (FDR) cut-off ≤0.05 for paired-end reads at varying read lengths within a data set were compared using Pearson’s correlation implemented in the R statistical tool and illustrated as a matrix of scatterplots. The diagonal represents the histogram of log-transformed fold-changes within the comparison. The lower plots represent the correlation between comparisons with singleton differentially expressed genes identified for comparisons on the x-axis (pink) and y-axis (green). Genes with FDR >0.05 in both comparisons are not shown. The upper portion of the plot lists the corresponding Pearson’s correlation coefficient and the number of singleton differentially expressed genes identified in each comparison.
Fig. 5.
A PCA was undertaken for a vector representing data for the different pairing statuses (paired end, green; first-in-pair single end, blue; second-in-pair single end, pink) for the biological samples and their replicates. Six representative results are illustrated with 72 bp E. coli data for minimal media (circles) and rich media (triangles) (a), 72 bp data for ticks differentially infected with E. chaffeensis strain Arkansas (circle) and strain Heartland (triangle) (b), 101 bp data for ticks differentially infected with E. chaffeensis strain Arkansas (circle) and strain Heartland (triangle) (c), 101 bp Candida data from rhr2_comp (circle) and rh2_exp (triangle) (d), 72 bp B. malayi data from adult females (circle) and adult males (triangle) (e) and 72 bp H. pylori data from a 24 hour time point (circle) and a 2 hour time point (triangle) (f). All PCA plots for read length are provided in Additional Files S1–S12 and those for pairing status are provided in Additional Files S13–S24.
Fig. 6.
Hierarchical clustering using pvclust for bootstrap support was undertaken for a vector representing data for each sample at different read lengths with heatmaps illustrating the DESeq normalized read counts of the samples. Samples are labelled according to the key in Table S2, followed by the read length (36, 54, 72 and 101 bp). Four representative results are illustrated here with wBm paired-end data (a), B. malayi paired-end data (b), ENCODE human paired-end data (c) and E. coli paired-end data (d). Little variation is seen in the biological samples and their replicates from Wolbachia, as opposed to E. coli, which likely explains why the read length has a strong effect in the wBm data relative to the E. coli data.
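The pairwise comparison described in the Fig. 4 caption (Pearson's correlation of log fold-changes, an FDR cut-off of ≤0.05, and counts of singleton differentially expressed genes) can be sketched in plain Python as follows. The authors used the R statistical tool; this is only an illustrative re-expression, and the input layout (a dict mapping gene to a `(log_fold_change, fdr)` tuple) and the function names are assumptions. Whether the published correlation was computed over exactly the genes significant in at least one comparison is also an assumption, inferred from the caption's note that genes with FDR >0.05 in both comparisons are not shown.

```python
import math

FDR_CUTOFF = 0.05  # adjusted P value cut-off stated in the Fig. 4 caption


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def compare(res_a, res_b, cutoff=FDR_CUTOFF):
    """Compare two differential-expression results.

    `res_a` and `res_b` map gene -> (log_fold_change, fdr); this layout
    is hypothetical. Genes above the cut-off in both results are dropped.
    Singletons are genes called significant in only one result.
    """
    shared = sorted(set(res_a) & set(res_b))
    sig_a = {g for g in shared if res_a[g][1] <= cutoff}
    sig_b = {g for g in shared if res_b[g][1] <= cutoff}
    kept = [g for g in shared if g in sig_a or g in sig_b]
    r = pearson([res_a[g][0] for g in kept], [res_b[g][0] for g in kept])
    return {
        "r": r,
        "n_genes": len(kept),
        "singletons_a": sorted(sig_a - sig_b),
        "singletons_b": sorted(sig_b - sig_a),
    }
```

Running `compare` on the results from two read lengths (for example 54 bp and 72 bp) would give one cell of a correlation matrix like the one the caption describes.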
