Cross-platform normalization of microarray and RNA-seq data for machine learning applications
- PMID: 26844019
- PMCID: PMC4736986
- DOI: 10.7717/peerj.1621
Abstract
Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exist in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data. We developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models constructed from legacy platforms. We evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Our evaluation included both supervised and unsupervised machine learning approaches. We found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances. We also provide a TDM package for the R programming language.
Keywords: Cross-platform normalization; Distribution; Gene expression; Machine learning; Microarray; Nonparanormal transformation; Normalization; Quantile normalization; RNA-sequencing; Training.
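To make two of the baseline transformations named in the abstract concrete, the following is a minimal Python sketch of a log2 transformation and of quantile normalization against a reference distribution. It is not the TDM algorithm and not the authors' R package. The function names (`log2_transform`, `quantile_normalize_to_reference`), the pseudocount of 1, and the choice to pool all reference values into one target distribution are assumptions made for illustration.

```python
import numpy as np


def log2_transform(counts, pseudocount=1.0):
    """Log2-transform non-negative expression values.

    The pseudocount (an assumption here) avoids taking log2 of zero.
    """
    return np.log2(np.asarray(counts, dtype=float) + pseudocount)


def quantile_normalize_to_reference(target, reference):
    """Map each sample (column) of `target` onto the distribution of `reference`.

    Both inputs are genes x samples arrays. All reference values are pooled
    into a single empirical distribution (an illustrative simplification).
    Each target value is replaced by the reference quantile at the same
    within-sample rank, so rank order inside a sample is preserved.
    Ties are broken by position rather than averaged.
    """
    target = np.asarray(target, dtype=float)
    ref_sorted = np.sort(np.asarray(reference, dtype=float).ravel())
    ref_probs = np.linspace(0.0, 1.0, ref_sorted.size)

    n_genes = target.shape[0]
    out = np.empty_like(target)
    for j in range(target.shape[1]):
        # Rank of each gene within sample j (0 = lowest expression).
        ranks = np.argsort(np.argsort(target[:, j], kind="stable"), kind="stable")
        # Convert ranks to probabilities in [0, 1].
        probs = ranks / (n_genes - 1) if n_genes > 1 else np.zeros(1)
        # Look up the matching quantile of the reference distribution.
        out[:, j] = np.interp(probs, ref_probs, ref_sorted)
    return out


if __name__ == "__main__":
    # Toy data: 3 genes x 2 samples on each platform (made-up numbers).
    microarray = np.array([[2.0, 4.0], [6.0, 8.0], [10.0, 12.0]])
    rnaseq_counts = np.array([[100, 5], [0, 50], [10, 500]])
    normalized = quantile_normalize_to_reference(
        log2_transform(rnaseq_counts), microarray
    )
    print(normalized)
```

In this sketch the RNA-seq values end up on the same scale as the reference data while keeping their within-sample ordering, which is the general idea behind rescaling new data before applying a model trained on a legacy platform.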
Conflict of interest statement
The authors declare that they have no competing interests.