PLoS One. 2021 Sep 7;16(9):e0256648.
doi: 10.1371/journal.pone.0256648. eCollection 2021.

Random forest-integrated analysis in AD and LATE brain transcriptome-wide data to identify disease-specific gene expression

Xinxing Wu et al. PLoS One.

Abstract

Alzheimer's disease (AD) is a complex neurodegenerative disorder that affects thinking, memory, and behavior. Limbic-predominant age-related TDP-43 encephalopathy (LATE) is a recently identified, common neurodegenerative disease that mimics the clinical symptoms of AD. The development of drugs to prevent or treat these neurodegenerative diseases has been slow, partly because the genes associated with them are incompletely understood. A notable hindrance from a data analysis perspective is that the clinical samples for patients and controls are usually highly imbalanced, which makes it challenging to apply most existing machine learning algorithms directly to such datasets. Meeting this challenge is critical: more specific identification of disease-associated genes may enable new insights into underlying disease-driving mechanisms, help find biomarkers, and in turn improve prospects for effective treatment strategies. To detect disease-associated genes from imbalanced transcriptome-wide data, we proposed an integrated multiple random forests (IMRF) algorithm. IMRF is effective in differentiating putative genes associated with subjects having LATE and/or AD from controls based on transcriptome-wide data, thereby enabling effective discrimination between these samples. Several forms of validation testify to the effectiveness of IMRF in identifying genes with altered expression in LATE and/or AD: cross-domain verification of our method on other datasets, improved and competitive classification performance using the identified genes, effectiveness on testing data with a classifier that is completely independent of decision trees and random forests, and relationships with prior AD and LATE studies on genes linked to neurodegeneration.
We conclude that IMRF, as an effective feature selection algorithm for imbalanced data, shows promise for facilitating the development of new gene biomarkers as well as targets for effective strategies of disease prevention and treatment.
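
The abstract does not spell out how IMRF integrates its forests, so the following is only a minimal sketch of one generic way to obtain feature importances from multiple random forests on imbalanced data: each forest is trained on a class-balanced undersample, and the impurity-based importances are averaged. It is not the authors' algorithm. The function name `averaged_balanced_importances` and its parameters are hypothetical, and the sketch assumes NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def averaged_balanced_importances(X, y, n_forests=10, n_trees=100, seed=0):
    """Average feature importances over forests fit on balanced undersamples.

    Illustrative only: a generic balanced-subsampling scheme, not IMRF itself.
    """
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()  # size of the smallest class
    total = np.zeros(X.shape[1])
    for i in range(n_forests):
        # Draw n_min samples from every class so each forest sees balanced data.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
            for c in classes
        ])
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed + i)
        forest.fit(X[idx], y[idx])
        total += forest.feature_importances_
    return total / n_forests


# Toy usage: 180 controls vs. 20 cases, with only feature 0 informative.
rng = np.random.default_rng(0)
y_demo = np.array([0] * 180 + [1] * 20)
X_demo = rng.normal(size=(200, 8))
X_demo[:, 0] += 3.0 * y_demo
importances = averaged_balanced_importances(X_demo, y_demo, n_forests=5, n_trees=30)
top_features = np.argsort(importances)[::-1][:3]  # indices of the top-ranked features
```

Ranking features by the averaged importance and keeping the top few is one simple way to turn such scores into a feature selection step; the number of features to keep is a free choice in this sketch.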

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Overall scheme of IMRF.
As an illustration, we show the use of IMRF on a synthetic dataset with or without tiny black points for visualization.
Fig 2. The procedure for calculation of feature importances from multiple RFs.
Fig 3. Demographics for the stratified study population of RNA array expression.
(a) Distribution with respect to four classes, LATE+AD, pure LATE, pure AD, and control, in sex. The vertical axis represents the number of samples. (b) Age distribution with respect to the four classes. The vertical axis represents the age of samples. The horizontal axes for (a) and (b) denote different classes.
Fig 4. Supervised feature selection on MNIST and synthetic data.
(a) MNIST with the digits 1 and 9; (b) MNIST with the digits 3 and 8; (c) Four classes of noise background images with or without black points; (d) Four classes of noise background images with or without cross black points. The black point in the middle of the right side is a common black point for classes 1 and 2; (e) Using classes 1 and 2 in Table 3 in Section 3 of S1 File for classification and feature selection; (f) Use classes 1 and 2 in Table 3 in Section 3 of S1 File for classification and feature selection. In (a)-(f), the selected features are marked in red for visualization. Best viewed with color when zoomed in.
Fig 5. The 31 genes selected from 48,803 genes by IMRF.
Red vertical lines with gene names represent the IMRF-identified genes.
Fig 6. Comparison of F1 scores and accuracies by SVM on the total and IMRF-selected genes.
(a) Class-wise F1 scores and overall accuracy for four-class classification; (b) Accuracy for three scenarios of binary classification.
Fig 7. Comparison of F1 scores and accuracy for three scenarios of binary classification using the total genes and using the IMRF-selected genes.
(a) LATE+AD vs. pure LATE; (b) LATE+AD vs. pure AD; (c) pure LATE vs. pure AD.
Fig 8. SVM classification performance in F1 score using the original number of genes and using the selected genes by different RF-based algorithms.
Fig 9. The ratios of genes with p-value ⩾ 0.05 vs. p-value < 0.05 for 31 selected genes by different algorithms.
Fig 10. SVM classification performance in F1 score on the original number of genes and the selected genes by different feature selection algorithms.
Without (a) or with (b) using SMOTE as a preprocessing procedure to counteract the class imbalance.
Fig 11. Schematic representation of the p-values of the IMRF-selected genes for four classes and six pair-wise classes.
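
Several of the captions above report class-wise F1 scores and overall accuracy. As a reminder of how these two metrics are defined, here is a minimal sketch in plain Python. The function names and the toy labels are illustrative and not taken from the paper.

```python
from collections import Counter


def classwise_f1(y_true, y_pred):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN) for each class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels (made up for illustration).
truth = ["AD", "AD", "LATE", "control"]
preds = ["AD", "LATE", "LATE", "control"]
f1_by_class = classwise_f1(truth, preds)
overall_accuracy = accuracy(truth, preds)
```

Reporting F1 per class alongside accuracy matters for imbalanced data, because a classifier can reach high accuracy by favoring the majority class while scoring poorly on the minority classes.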
