Microbiol Spectr. 2025 Jan 7;13(1):e0151324.
doi: 10.1128/spectrum.01513-24. Epub 2024 Dec 3.

RNA-seq reproducibility of Pseudomonas aeruginosa in laboratory models of cystic fibrosis

Rebecca P Duncan et al. Microbiol Spectr.

Abstract

Reproducibility is a fundamental expectation in science and enables investigators to have confidence in their research findings and the ability to compare data from disparate sources, but evaluating reproducibility can be elusive. For example, generating RNA sequencing (RNA-seq) data includes multiple steps where variance can be introduced. Thus, it is unclear if RNA-seq data from different sources can be validly compared. While most studies on RNA-seq reproducibility focus on eukaryotes, we evaluate bias in bacteria using Pseudomonas aeruginosa gene expression data from five laboratory models of cystic fibrosis. We leverage a large data set that includes samples prepared in three different laboratories and paired data sets where the same sample was sequenced using at least two different sequencing pipelines. We report here that expression data are highly reproducible across laboratories. In addition, while samples sequenced with different sequencing pipelines showed significantly more variance in expression profiles than between labs, gene expression was still highly reproducible between sequencing pipelines. Further investigation of expression differences between two sequencing pipelines revealed that library preparation methods were the largest source of error, though analyses to identify the source of this variance were inconclusive. Consistent with the reproducibility of expression between sequencing pipelines, we found that different pipelines detected over 80% of the same differentially expressed genes with large expression differences between conditions. Thus, bacterial RNA-seq data from different sources can be validly compared, facilitating the ability to advance understanding of bacterial behavior and physiology using the wide array of publicly available RNA-seq data sets.

Importance

RNA sequencing (RNA-seq) has revolutionized biology, but many steps in RNA-seq workflows can introduce variance, potentially compromising reproducibility. While reproducibility in RNA-seq has been thoroughly investigated in eukaryotes, less is known about pipelines and workflows that introduce variance and biases in bacterial RNA-seq data. By leveraging Pseudomonas aeruginosa transcriptomes in cystic fibrosis models from different laboratories and sequenced with different sequencing pipelines, we directly assess sources of bacterial RNA-seq variance. RNA-seq data were highly reproducible, with the largest variance due to sequencing pipelines, specifically library preparation. Different sequencing pipelines detected overlapping differentially expressed genes, especially those with large expression differences between conditions. This study confirms that different approaches to preparing and sequencing bacterial RNA libraries capture comparable transcriptional profiles, supporting investigators' ability to leverage diverse RNA-seq data sets to advance their science.

Keywords: Pseudomonas aeruginosa; RNA-seq; SCFM2; cystic fibrosis; epithelial cell model; reproducibility.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Fig 1
Steps that can introduce variance when producing RNA-seq data from PAO1 grown in SCFM2. Except for the PAO1 strain used and the data analysis step (white text in black boxes), which were the same for all samples and datasets reported here, all steps can introduce variance into RNA-seq data. Variance in sample preparation can be introduced by different investigators, media batches, days when samples are prepared, or laboratories. Downstream variance can be introduced by ribosomal RNA (rRNA) depletion and library preparation kits, and/or sequencing platforms used. In this study, we were able to distinguish between sample preparation, which collectively includes culture preparation, culturing, and RNA extraction, and sequencing pipeline, which collectively includes rRNA depletion, cDNA library preparation, size selection, and sequencing. The sequencing pipeline shown here is based on sequencing pipeline A (see Table 2 and Table S1). O/N, overnight. This figure was created in BioRender.
Fig 2
RNA-seq data are highly reproducible between labs and sequencing pipelines. (A–D) Representative plots of VST-normalized gene expression between two SCFM2 datasets for each of the four comparison categories. Spearman correlation coefficients (ρ) are given in each plot. Representative correlation plots show the median correlation coefficients of each comparison category. (E) Spearman correlation coefficients plotted by comparison type for SCFM2. Each data point represents the correlation coefficient for a pair of data sets. Different letters above comparison types indicate statistically significantly different medians between those types based on a Kruskal-Wallis test followed by a Dunn’s multiple comparisons test (a and b: P < 0.05 for replicates vs intra-lab comparisons and P < 0.0001 for replicates vs inter-lab comparisons; a–c: P < 0.0001, b and c: P < 0.001 for intra-lab comparisons vs pipeline comparisons and P < 0.01 for inter-lab comparisons vs pipeline comparisons). No significant difference was found between intra-lab and inter-lab comparisons. The arrow points to the correlation coefficient between data sets that were sequenced with sequencing pipelines B and E (outlined in black for clarity), which were sequenced at the same site with the same Illumina platform but had different ribosomal RNA depletion and library preparation procedures (see Materials and Methods and Dataset S1). The three circled data points correspond to correlation coefficients for paired data sets sequenced with sequencing pipelines B and D. (F) Spearman correlation coefficients plotted by comparison type for airway epithelial cell models. Data point shape represents model, and data point color represents replicate sample comparisons (orange) or sequencing pipeline comparisons (pink). AEC, CF airway epithelial cell model. epiSCFM2, CF airway epithelial cell-SCFM2 model. Asterisks represent statistically significantly different rank sums in correlation coefficients between the two comparison types based on a Mann-Whitney U test (**P < 0.01; ***P < 0.001). In all panels, data points are color-coded by comparison type (orange: replicates, blue: intra-lab comparisons, green: inter-lab comparisons, and pink: sequencing pipeline comparisons). In panels E and F, medians ± interquartile range are shown.
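The pairwise comparison behind each data point in panels E and F is a Spearman correlation between the expression values of two data sets. The following is a minimal sketch of that calculation, not the authors' code: it assumes the two inputs are already-normalized expression values listed in the same gene order, and the function names `average_ranks` and `spearman_rho` are illustrative.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical expression values for five genes in two data sets.
dataset_1 = [8.1, 5.3, 12.0, 7.7, 9.4]
dataset_2 = [7.9, 5.0, 11.5, 8.2, 9.0]
rho = spearman_rho(dataset_1, dataset_2)
```

Because the statistic depends only on ranks, it is unaffected by any monotonic rescaling of either data set, which is one reason rank correlation suits comparisons across pipelines with different depth or scaling.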
Fig 3
Operons are more highly expressed in pipeline A than in pipeline B. (A) Average Log10 (TPM-normalized expression) of all P. aeruginosa operons in sequencing pipelines A or B. Shape and color denote the pipeline. (B) Average Log10 (TPM-normalized expression) in pipelines A and B of P. aeruginosa operons highly expressed in pipeline A. Highly expressed operons were determined by ranking each operon’s average TPM-normalized expression in pipeline A and calculating the inflection point of the curve (Fig. S4A). Median ± interquartile range is shown. Significance was determined using a Mann-Whitney U test. ****P < 0.0001; ***P < 0.001.
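For readers unfamiliar with the TPM (transcripts per million) normalization named in this caption, the sketch below shows the standard calculation: divide each count by feature length, then scale so the values sum to one million. It is an illustration under stated assumptions rather than the authors' pipeline: the function names, the pseudocount of 1 added before the log, and the choice to take the log of the mean across samples are all assumptions, since the caption does not specify them.

```python
import math
from typing import Dict, Sequence


def tpm(counts: Dict[str, float], lengths: Dict[str, float]) -> Dict[str, float]:
    """Convert raw read counts to TPM using per-feature lengths."""
    # Length-normalize first so long features are not favored.
    rates = {feature: counts[feature] / lengths[feature] for feature in counts}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("all counts are zero; TPM is undefined")
    # Scale so that the values within one sample sum to one million.
    return {feature: rate / total * 1e6 for feature, rate in rates.items()}


def log10_mean(values: Sequence[float], pseudocount: float = 1.0) -> float:
    """log10 of the mean across samples (the pseudocount is an assumption)."""
    return math.log10(sum(values) / len(values) + pseudocount)


# Hypothetical counts and lengths for two operons in one sample.
sample = tpm({"opA": 100, "opB": 300}, {"opA": 1000, "opB": 3000})
```

In this toy sample both operons have the same count per unit length, so each receives half of the million despite the threefold difference in raw counts.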
Fig 4
Relationship between operon length and expression variance due to size selection. (A) Frequency of operons as a function of length for all operons and operons with the highest expression difference between pipelines A and B (see Fig. S6). (B) Frequency of operons with highest expression difference between pipelines A and B with y axis adjusted. Color and shape of data points denote the sequencing pipeline where the operons were more highly expressed. Bin widths for operon length in frequency distributions were set to 100. (C) Average lengths of operons with the highest expression differences between pipelines A and B. Median and interquartile range of operon length for each set of operons are shown. Significance between pipelines was determined by a Mann-Whitney U test (P = 0.07).
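The frequency distributions in panels A and B group operon lengths into fixed-width bins of 100. A minimal sketch of that binning follows; the function name `length_histogram` and the example lengths are hypothetical, and the caption does not state the length units.

```python
from collections import Counter
from typing import Dict, Iterable


def length_histogram(lengths: Iterable[int], bin_width: int = 100) -> Dict[int, int]:
    """Count how many lengths fall into each bin, keyed by the bin's lower edge."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Integer division maps each length to the lower edge of its bin.
    counts = Counter((length // bin_width) * bin_width for length in lengths)
    return dict(sorted(counts.items()))


# Hypothetical operon lengths.
histogram = length_histogram([50, 99, 100, 250, 299])
```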
Fig 5
Sequencing pipelines detect similar differentially expressed genes. (A) Average Log2(fold change) for each gene between SCFM2 and epiSCFM2 using samples that were prepared in the same lab and sequenced with both pipelines A and B (Dataset S1). Significance, defined as adjusted P value < 0.001 and |LFC| ≥ 2, is indicated by color and shape (see key). The significance category “Not significant” includes genes with adjusted P values > 0.001 and genes with |LFC| < 2. (B) Venn diagram showing overlap in differentially expressed genes (DEGs) between sequencing pipelines at a cutoff of |LFC| ≥ 2. Shown in parentheses are the number of DEGs detected by both pipelines at |LFC| ≥ 2 combined with DEGs that are detected by one pipeline at |LFC| ≥ 2 and one pipeline at |LFC| ≥ 1.5 to include DEGs just below the cutoff. (C) Venn diagram showing overlap in DEGs between sequencing pipelines at a cutoff of |LFC| ≥ 4. Shown in parentheses are the number of DEGs detected by both pipelines at |LFC| ≥ 4 combined with DEGs that are detected by one pipeline at |LFC| ≥ 4 and one pipeline at |LFC| ≥ 2 to include DEGs just below the cutoff.
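The Venn counts in panels B and C, including the relaxed counts in parentheses, can be expressed as set operations. The sketch below is one reading of the caption, not the authors' analysis code. It assumes each pipeline's results are available as a mapping from gene to (log2 fold change, adjusted P value), and that the adjusted P cutoff of 0.001 also applies at the relaxed |LFC| threshold; the names `degs` and `overlap_counts` and the example values are hypothetical.

```python
from typing import Dict, Set, Tuple

# gene -> (log2 fold change, adjusted P value)
Results = Dict[str, Tuple[float, float]]


def degs(results: Results, lfc_cutoff: float, padj_cutoff: float = 0.001) -> Set[str]:
    """Genes passing both the adjusted P value and the |LFC| cutoffs."""
    return {
        gene
        for gene, (lfc, padj) in results.items()
        if padj < padj_cutoff and abs(lfc) >= lfc_cutoff
    }


def overlap_counts(a: Results, b: Results, strict: float, relaxed: float) -> Dict[str, int]:
    """Venn counts at the strict cutoff, plus the relaxed shared count."""
    a_strict, b_strict = degs(a, strict), degs(b, strict)
    a_relaxed, b_relaxed = degs(a, relaxed), degs(b, relaxed)
    # Shared if one pipeline passes the strict cutoff and the other passes at
    # least the relaxed cutoff. This includes genes strict in both pipelines.
    shared_relaxed = (a_strict & b_relaxed) | (a_relaxed & b_strict)
    return {
        "only_a": len(a_strict - b_strict),
        "only_b": len(b_strict - a_strict),
        "both": len(a_strict & b_strict),
        "both_relaxed": len(shared_relaxed),
    }


# Hypothetical results for four genes from two pipelines.
pipeline_a = {"g1": (2.5, 1e-5), "g2": (2.1, 1e-4), "g3": (3.0, 0.01), "g4": (-2.2, 1e-6)}
pipeline_b = {"g1": (2.2, 1e-5), "g2": (1.7, 1e-4), "g3": (3.0, 1e-5), "g4": (-1.0, 1e-6)}
venn = overlap_counts(pipeline_a, pipeline_b, strict=2.0, relaxed=1.5)
```

In this toy example, `g2` falls just under the strict cutoff in pipeline B, so it counts as unique to pipeline A in the Venn diagram but is recovered in the relaxed shared count.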
