Genes (Basel). 2022 Dec 14;13(12):2362.
doi: 10.3390/genes13122362.

A Framework for Comparison and Assessment of Synthetic RNA-Seq Data

Felitsiya Shakola et al. Genes (Basel). 2022.

Abstract

Methods for generating synthetic bulk and single-cell RNA-seq data are growing in number and have diverse applications. They are often aimed at benchmarking bioinformatics algorithms for purposes such as sample classification, differential expression analysis, correlation and network studies, and the optimization of data integration and normalization techniques. Here, we propose a general framework for comparing synthetically generated RNA-seq data and selecting a data-generating tool suited to a specific set of study goals. Because many methods for synthetic RNA-seq data generation exist, researchers can use the proposed framework to make an informed choice of the RNA-seq data simulation algorithm and software best suited to their scientific questions of interest.

Keywords: RNA-seq; comparative study; differential expression; sample classification; simulated data.

Conflict of interest statement

The authors declare no conflict of interest.

Figures

Figure 1
Methodological framework for synthetic RNA-seq data generation for benchmarking of algorithms for statistical and pattern recognition analyses.
Figure 2
Application of the framework for comparison of bulk RNA-seq generators. Different colors indicate different types of methods. compcodeR and powsimR are parametric, SPsimSeq is semiparametric, and SimSeq and seqgendiff are nonparametric.
Figure 3
Q-Q plots of five synthetic data samples generated by the respective data generators (y-axis) vs. five samples used as input for those data generators (x-axis), with (A–E) NGSSPPG1 samples as input data; (F–J) NGSSPPG2 samples as input data; (K–O) AD samples as input data. The axes represent the quantiles of the respective distributions. Blue: linear regression line. Red: diagonal line.
Figure 4
Dispersion vs. BCV plots of the: (A) NGSSPPG1; (B) NGSSPPG2; and (C) AD datasets and the datasets generated with these as input. Black dots: gene-wise dispersion estimates. Red curve: fitted mean-dispersion relationship. Blue circles: final dispersion estimates.
Figure 5
Mean-variance plots of the: (A) NGSSPPG1; (B) NGSSPPG2; and (C) AD datasets and the datasets generated with these as input.
Figure 6
Feature-feature correlation plots of the (A) NGSSPPG1; (B) NGSSPPG2; and (C) AD data and the datasets generated with these as input.
Figure 7
Volcano plots for: (A) NGSSPPG1; (B) NGSSPPG2; and (C) AD datasets and the datasets generated with these as input. Blue dots: transcripts with p-value < 0.05, denoting significant differential expression. Red dots: transcripts with p-value < 0.05 and log2(FoldChange) > 1 or with p-value < 0.05 and log2(FoldChange) < −1.
Figure 8
PCA plots for: (A) NGSSPPG1; (B) NGSSPPG2; and (C) AD datasets and the datasets generated with these as input.
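
The Q-Q comparison described for Figure 3 (quantiles of synthetic samples against quantiles of the input samples, with a regression line judged against the diagonal) can be illustrated with a minimal sketch. This is not the authors' code: the function names `qq_points` and `fit_line`, the log2(count + 1) scale, the choice of 100 quantiles, and the toy counts are all assumptions made for illustration.

```python
import math
import random
from statistics import quantiles


def qq_points(input_counts, synthetic_counts, n_quantiles=100):
    """Pair up quantiles: input data on x, synthetic data on y.

    Counts are put on a log2(count + 1) scale first (an assumption,
    chosen only to tame the long right tail of count data).
    """
    x = [math.log2(c + 1) for c in input_counts]
    y = [math.log2(c + 1) for c in synthetic_counts]
    qx = quantiles(x, n=n_quantiles, method="inclusive")
    qy = quantiles(y, n=n_quantiles, method="inclusive")
    return list(zip(qx, qy))


def fit_line(points):
    """Ordinary least-squares slope and intercept through the Q-Q points."""
    n = len(points)
    mean_x = sum(px for px, _ in points) / n
    mean_y = sum(py for _, py in points) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in points)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy counts, not data from the paper.
    input_counts = [int(rng.expovariate(1 / 200)) for _ in range(2000)]
    synthetic_counts = [int(rng.expovariate(1 / 200)) for _ in range(2000)]

    slope, intercept = fit_line(qq_points(input_counts, synthetic_counts))
    # A slope near 1 and an intercept near 0 mean the regression line
    # lies close to the diagonal, i.e. the distributions are similar.
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Under these assumptions, a generator whose output follows the input distribution closely would give a fitted line that nearly coincides with the diagonal, while a systematic shift or change in spread would show up as a nonzero intercept or a slope away from 1.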
