Sci Rep. 2023 Jan 21;13(1):1225.
doi: 10.1038/s41598-023-28536-w.

Artificial neural network identified the significant genes to distinguish Idiopathic pulmonary fibrosis

Zhongzheng Li et al. Sci Rep.

Abstract

Idiopathic pulmonary fibrosis (IPF) is a progressive interstitial lung disease that causes irreversible damage to lung tissue and is characterized by excessive deposition of extracellular matrix (ECM) and remodeling of the lung parenchyma. Diagnosing IPF is currently complex and is usually done by a multidisciplinary team of clinicians, radiologists and pathologists who work together to decide on an effective treatment, so novel, practical methods for IPF diagnosis are needed. This study provides a new diagnostic model for IPF based on machine learning. Six genes, CDH3, DIO2, ADAMTS14, HS6ST2, IL13RA2, and IGFL2, were identified with a random forest classifier from the genes differentially expressed in IPF patients compared with healthy subjects in existing gene expression databases. An artificial neural network model for IPF diagnosis was constructed from these genes and validated on separate public datasets with satisfactory diagnostic accuracy. The six genes were significantly correlated with lung function, and two of them, CDH3 and DIO2, were further found to be significantly associated with survival. Taken together, the artificial neural network model identified significant genes that distinguish IPF patients from healthy people, and it has potential for the molecular diagnosis of IPF.

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1
Flow chart.
Figure 2
Differential gene expression analysis in IPF. (A) Volcano plot of the differential expression analysis results. The abscissa is log2 fold change and the ordinate is −log10(adj.P value). Genes in the upper right have an adj.P value less than 0.05 and a fold change greater than 2, indicating significant DEGs with higher expression. Genes in the upper left have an adj.P value less than 0.05 and a fold change less than −2, indicating significant DEGs with reduced expression. The gray dots represent the remaining stable genes. (B) Heatmap of DEGs. Colors from red to blue indicate high to low expression. In the upper part of the heatmap, the red band indicates the disease samples and the blue band indicates the normal samples. (C, D) Bar graphs of the functional enrichment results from the Metascape tool. The x-axis represents −log10(adj P) values and the y-axis represents enriched pathways. Pathways with −log10(P value) > 2.5 are marked and shown in the figure. (C) Enriched pathways that were significantly upregulated in IPF patients compared to healthy controls. (D) Enriched pathways that were significantly downregulated in IPF patients compared to healthy controls.
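The thresholds in panel (A) can be read as a simple three-way rule. The following is a minimal sketch of that rule, not the authors' code. It assumes the cutoff of 2 is applied to the plotted log2 fold change value, which the caption does not state explicitly; the gene names and numbers are hypothetical.

```python
import math

def classify_gene(log2_fc, adj_p, fc_cutoff=2.0, p_cutoff=0.05):
    """Label one gene 'up', 'down', or 'stable' using the caption's thresholds.

    Assumption: fc_cutoff is compared against the log2 fold change axis value.
    """
    if adj_p < p_cutoff and log2_fc > fc_cutoff:
        return "up"
    if adj_p < p_cutoff and log2_fc < -fc_cutoff:
        return "down"
    return "stable"

def volcano_point(log2_fc, adj_p):
    """Return the (x, y) position of a gene on the volcano plot."""
    return log2_fc, -math.log10(adj_p)

# Hypothetical values for illustration only, not results from the study.
genes = {
    "GENE_A": (3.1, 0.001),   # large increase, significant
    "GENE_B": (-2.6, 0.01),   # large decrease, significant
    "GENE_C": (0.4, 0.20),    # small change, not significant
    "GENE_D": (2.8, 0.30),    # large change, but not significant
}
labels = {name: classify_gene(fc, p) for name, (fc, p) in genes.items()}
```

A gene needs to pass both the significance and the fold change cutoff to be colored as a DEG, so GENE_D stays in the gray "stable" group despite its large fold change.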
Figure 3
Random forest screening for DEGs. (A) The effect of the number of decision trees on the error rate. The x-axis represents the number of decision trees, and the y-axis represents the error rate. When the number of decision trees is about 500, the error rate is relatively stable. (B) Results of the Gini coefficient method in the random forest classifier. The x-axis represents the importance index, and the y-axis represents the genetic variables. The top 20 genes are ranked and displayed by importance coefficient. (C) The unsupervised clustering heatmap shows the hierarchical clustering results in GSE47460 for the six significant genes selected by the random forest. In the upper part of the heatmap, the red band in the status module represents normal samples, and the blue band represents disease samples; the color in the age module gradually changes from white to green, representing the increasing age of the sample; the light green band in the gender module represents male samples, and the purple strip represents female samples; in the GOLD stage module, the green strip means At Risk, the green strip means Moderate COPD, the purple strip means Severe COPD, and the rose-red strip means unknown; in the smoking history module, the yellow strip means current smoker, the green strip means ever smoked, the blue strip means never smoked, and the orange strip means unknown.
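Panel (B) ranks genes by a Gini-based importance. The sketch below shows only the building block of that score, the decrease in Gini impurity produced by one split on one gene; in a random forest this quantity is typically accumulated over all splits and trees. The expression values and the split threshold are hypothetical, and the code is an illustration rather than the authors' implementation.

```python
def gini_impurity(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting samples at values <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted_child_impurity = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_child_impurity

# Hypothetical expression values for six samples (illustration only).
status = ["normal", "normal", "normal", "IPF", "IPF", "IPF"]
informative_gene = [1.0, 1.2, 0.9, 3.1, 2.8, 3.4]    # separates the groups
uninformative_gene = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]  # unrelated to status

informative_score = gini_decrease(informative_gene, status, threshold=2.0)
uninformative_score = gini_decrease(uninformative_gene, status, threshold=2.0)
```

A gene whose expression cleanly separates IPF from normal samples removes all impurity in one split, so it scores higher than a gene whose values are unrelated to disease status.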
Figure 4
Construction of the artificial neural network model. (A) ROC curve results from five-fold cross-validation of the model in GSE32537. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (B) Results of neural network visualization.
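To make the idea of a six-gene neural network classifier concrete, here is a minimal sketch of a forward pass through a small feedforward network. The architecture (one hidden layer with two sigmoid units), the weights, and the input scaling are all assumptions made for illustration; the page does not describe the network used in the study.

```python
import math

# The six signature genes named in the abstract, used as input features.
GENES = ["CDH3", "DIO2", "ADAMTS14", "HS6ST2", "IL13RA2", "IGFL2"]

def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """One hidden layer followed by a single sigmoid output unit."""
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)

# Hypothetical parameters, not the trained weights from the study.
hidden_weights = [[1.0] * len(GENES), [0.5] * len(GENES)]
hidden_biases = [-3.0, -1.5]
output_weights = [2.0, 2.0]
output_bias = -2.0

# Inputs are assumed to be expression values scaled to the range [0, 1].
high_expression = [1.0] * len(GENES)
low_expression = [0.0] * len(GENES)

score_high = forward(high_expression, hidden_weights, hidden_biases,
                     output_weights, output_bias)
score_low = forward(low_expression, hidden_weights, hidden_biases,
                    output_weights, output_bias)
```

The output is a score between 0 and 1 that can be thresholded into a predicted class, which is what the ROC analysis in panel (A) evaluates.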
Figure 5
Model accuracy verification. (A) Verification of the ROC curve results in GSE47460. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (B) GSE47460 confusion matrix result. The x-axis represents the predicted results, and the y-axis represents the actual results. (C) Verification of the ROC curve results in GSE110147. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (D) GSE110147 confusion matrix result. The x-axis represents the predicted results, and the y-axis represents the actual results. (E) Verification of the ROC curve results in GSE53845. The points marked on the ROC curve are the optimal threshold points, and the values in parentheses represent sensitivity and specificity. The AUC value is the area under the ROC curve. (F) GSE53845 confusion matrix result. The x-axis represents the predicted results, and the y-axis represents the actual results.
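The quantities in these panels (sensitivity, specificity, AUC, an optimal threshold point, and a confusion matrix) can all be computed from predicted scores and true labels. The sketch below assumes the optimal point is chosen by Youden's J statistic (sensitivity + specificity − 1), which the caption does not specify; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for each distinct score.

    A sample is predicted positive when its score is >= threshold.
    labels use 1 for IPF and 0 for healthy.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
        points.append((threshold, tp / positives, tn / negatives))
    return points

def auc(scores, labels):
    """AUC as the chance a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_threshold(scores, labels):
    """Pick the ROC point that maximizes Youden's J (an assumption here)."""
    return max(roc_points(scores, labels), key=lambda pt: pt[1] + pt[2] - 1)

def confusion_matrix(scores, labels, threshold):
    """Count true/false positives and negatives at a given threshold."""
    matrix = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for s, y in zip(scores, labels):
        predicted = 1 if s >= threshold else 0
        if predicted == 1:
            matrix["TP" if y == 1 else "FP"] += 1
        else:
            matrix["FN" if y == 1 else "TN"] += 1
    return matrix

# Hypothetical model scores for five IPF and four healthy samples.
scores = [0.9, 0.8, 0.7, 0.65, 0.4, 0.6, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0]

area = auc(scores, labels)
threshold, sensitivity, specificity = best_threshold(scores, labels)
matrix = confusion_matrix(scores, labels, threshold)
```

The pair reported in parentheses on each ROC curve corresponds to `sensitivity` and `specificity` at the chosen threshold, and the confusion matrix panels tabulate the same predictions against the actual labels.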
Figure 6
Survival predictive analysis. (A) CDH3 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (B) ADAMTS14 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (C) IL13RA2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (D) HS6ST2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (E) DIO2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. (F) IGFL2 as a prognostic factor to evaluate the prognosis and survival status of IPF patients. The x-axis represents time and the y-axis represents survival probability. The yellow line represents the high gene expression group, and the blue line represents the low gene expression group. Each point on the curve represents the patients' survival rate at that time point.
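Survival curves of this kind are commonly drawn with the Kaplan-Meier estimator after splitting patients into high and low expression groups. The sketch below assumes both of those choices (Kaplan-Meier estimation and a median split), neither of which is stated in the caption, and uses hypothetical follow-up data.

```python
import statistics

def split_by_median(expression):
    """Assign each patient to 'high' or 'low' by the median expression value."""
    median = statistics.median(expression)
    return ["high" if value > median else "low" for value in expression]

def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    events[i] is 1 if the patient died at times[i] and 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times and event flags for one expression group.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)

# Hypothetical expression values for six patients.
groups = split_by_median([0.2, 1.5, 0.7, 2.1, 0.4, 1.9])
```

Running `kaplan_meier` once per group gives the two step curves that are plotted against each other in each panel.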
Figure 7
Six signature genes were significantly associated with clinical features. (A) The heatmap illustrates the computationally derived meta lung function variable combining multiple lung function parameters. In the upper part of the heatmap, the color in the meta lung function module gradually changes from white to green, representing an increase in the sample's meta lung function; the blue bars in the gender module represent male samples, and the red bars represent female samples; the color in the age module gradually changes from white to purple, representing the increasing age of the sample. On the right side of the heatmap are the clinical indicators DLCO, FVC (pred), FVC (post), FEV1 (pred), and FEV1 (post). Pred, predicted; Post, post-bronchodilator. (B) The scatter plots show the positive correlation of the indicated genes with meta lung function. The x-axis represents gene expression, and the y-axis represents meta lung function.
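The correlation shown in panel (B) can be quantified with a correlation coefficient. The sketch below computes Pearson's r, which is an assumption since the caption does not name the coefficient used; the paired values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical paired measurements (illustration only).
gene_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
meta_lung_function = [2.1, 3.9, 6.2, 8.0, 9.8]

r = pearson_r(gene_expression, meta_lung_function)
```

A value of r close to 1 indicates that the meta lung function variable rises almost linearly with gene expression, and a value close to −1 indicates the opposite trend.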
