BMC Bioinformatics. 2021 Jan 22;22(1):27.

doi: 10.1186/s12859-021-03972-5.

Mining influential genes based on deep learning

Lingpeng Kong^#¹, Yuanyuan Chen^#², Fengjiao Xu², Mingmin Xu¹, Zutan Li¹, Jingya Fang¹, Liangyun Zhang³, Cong Pian⁴

Affiliations

¹ College of Agriculture, Nanjing Agricultural University, Nanjing, Jiangsu, 210095, China.
² Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China.
³ Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China. zlyun@njau.edu.cn.
⁴ Department of Mathematics, College of Science, Nanjing Agricultural University, Nanjing, 210095, China. piancong@njau.edu.cn.

^# Contributed equally.

PMID: 33482718
PMCID: PMC7821411
DOI: 10.1186/s12859-021-03972-5

Lingpeng Kong et al. BMC Bioinformatics. 2021.

Abstract

Background: Large-scale gene expression profiling has been successfully applied to the discovery of functional connections among diseases, genetic perturbations, and drug actions. To address the cost of ever-expanding gene expression profiling, a low-cost, high-throughput reduced-representation expression profiling method called L1000 was proposed, with which one million profiles were produced. Although a set of ~1000 carefully chosen landmark genes that can capture ~80% of the information in the whole genome has been identified for use in L1000, the robustness of using these landmark genes to infer target genes is not satisfactory. Therefore, more efficient computational methods are still needed to mine the influential genes in the genome more deeply.
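To make the idea of inferring target genes from landmark genes concrete, the following is a minimal sketch on synthetic data. It is not the L1000 pipeline: ordinary least squares is used only as a simple stand-in inference model, and all names (`simulate`, `fit_linear`, `per_gene_mae`, `per_gene_pcc`) and matrix sizes are assumptions made for illustration. Each target gene is scored with the two metrics named in the Results, MAE and PCC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for expression data: rows are samples, columns are genes.
n_train, n_test, n_landmark, n_target = 200, 50, 10, 5
true_weights = rng.normal(size=(n_landmark, n_target))  # hypothetical ground truth


def simulate(n_samples):
    """Draw landmark expression and noisy, linearly dependent target expression."""
    landmarks = rng.normal(size=(n_samples, n_landmark))
    noise = 0.1 * rng.normal(size=(n_samples, n_target))
    return landmarks, landmarks @ true_weights + noise


def add_intercept(X):
    """Append a column of ones so the fit includes an intercept term."""
    return np.hstack([X, np.ones((X.shape[0], 1))])


def fit_linear(X, Y):
    """Least-squares fit of Y ~ [X, 1] @ W; the last row of W is the intercept."""
    W, *_ = np.linalg.lstsq(add_intercept(X), Y, rcond=None)
    return W


def per_gene_mae(Y_true, Y_pred):
    """Mean absolute error of each target gene (one value per column)."""
    return np.mean(np.abs(Y_true - Y_pred), axis=0)


def per_gene_pcc(Y_true, Y_pred):
    """Pearson correlation between true and predicted values of each target gene."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    return (yt * yp).sum(axis=0) / np.sqrt((yt**2).sum(axis=0) * (yp**2).sum(axis=0))


X_train, Y_train = simulate(n_train)
X_test, Y_test = simulate(n_test)

W = fit_linear(X_train, Y_train)
Y_pred = add_intercept(X_test) @ W

mae = per_gene_mae(Y_test, Y_pred)
pcc = per_gene_pcc(Y_test, Y_pred)
print("MAE per target gene:", np.round(mae, 3))
print("PCC per target gene:", np.round(pcc, 3))
```

Lower MAE and higher PCC for a given landmark set would indicate that the set carries more information about the target genes, which is the sense in which landmark sets are compared here.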

Results: Here, we propose a computational framework based on deep learning to mine a subset of genes that can cover more genomic information. Specifically, an AutoEncoder framework is first constructed to learn the non-linear relationships between genes, and DeepLIFT is then applied to calculate gene importance scores. Using this data-driven approach, we obtained a new landmark gene set. The results show that our landmark genes predict target genes more accurately and robustly than those of L1000 according to two metrics: mean absolute error (MAE) and Pearson correlation coefficient (PCC). This indicates that the landmark genes detected by our method contain more genomic information.
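The scoring step can be illustrated with a small sketch of DeepLIFT's Rescale rule applied to a single ReLU encoder layer. This is an illustration under stated assumptions, not the authors' implementation: the encoder weights are random rather than trained, the reference input is taken to be the mean expression profile, the per-gene score is the mean absolute contribution summed over hidden units, and the function name `deeplift_rescale` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_genes, n_hidden = 64, 20, 4

# Hypothetical, untrained encoder weights; a real run would train the AutoEncoder first.
W = rng.normal(scale=0.5, size=(n_genes, n_hidden))
b = rng.normal(scale=0.1, size=n_hidden)

X = rng.normal(size=(n_samples, n_genes))  # synthetic expression matrix
x_ref = X.mean(axis=0)                     # assumed reference: mean expression profile


def relu(z):
    return np.maximum(z, 0.0)


def deeplift_rescale(X, x_ref, W, b, eps=1e-8):
    """Contribution of each input gene to each hidden unit under the Rescale rule.

    Returns an array of shape (n_samples, n_genes, n_hidden).
    """
    z, z_ref = X @ W + b, x_ref @ W + b
    dz = z - z_ref                    # change in pre-activation vs. reference
    dh = relu(z) - relu(z_ref)        # change in activation vs. reference
    # ReLU multiplier: delta-out / delta-in, falling back to the gradient
    # when delta-in is numerically zero.
    nonzero = np.abs(dz) > eps
    safe_dz = np.where(nonzero, dz, 1.0)
    mult = np.where(nonzero, dh / safe_dz, (z > 0).astype(float))
    dx = X - x_ref
    # Linear layer contribution (dx_i * W_ij) scaled by the ReLU multiplier of unit j.
    return dx[:, :, None] * W[None, :, :] * mult[:, None, :]


contrib = deeplift_rescale(X, x_ref, W, b)

# Assumed aggregation: sum absolute contributions over hidden units, average over samples.
gene_scores = np.abs(contrib).sum(axis=2).mean(axis=0)

top_k = 5
landmark_idx = np.argsort(gene_scores)[::-1][:top_k]
print("indices of the top-scoring genes:", landmark_idx)
```

A useful sanity check for any DeepLIFT-style attribution is the summation-to-delta property: for each sample and hidden unit, the contributions summed over genes should equal the change in that unit's activation relative to the reference.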

Conclusions: We believe that our proposed framework is well suited to the analysis of large-scale biological data. Furthermore, the landmark genes inferred in this study can be used to expand gene expression profiles at scale, facilitating research into functional connections.

Keywords: AutoEncoder; Deep learning; DeepLIFT; Landmark genes.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
The workflow for mining influential genes based on deep learning. a The architecture and parameter settings of the AutoEncoder. b Application of DeepLIFT to compute the importance scores in the Encoder network, and use of D-GEX as a baseline method to predict target genes for performance evaluation

**Fig. 2**
Performance evaluation of the AutoEncoder model in both gene (a) and sample dimensions (b). a The density plots of the predictive error (MAE) and the similarity (PCC) of all genes. b The circular diagram of clustering for three types of samples, including normal (Normal), lung adenocarcinoma (ADC) and lung squamous cell carcinoma (SCC)

**Fig. 3**
Density plots (a, c) and scatter plots (b, d) comparing the landmark genes inferred by our method (labelled "D1000") with those of L1000 (labelled "L1000") in terms of MAE (a, b) and PCC (c, d). In b and d, each dot represents a predicted target gene, and a red dot indicates that D1000 performs better than L1000

**Fig. 4**
Cross-platform generalization analysis of the landmark genes inferred from our method

**Fig. 5**
Enriched GO molecular function terms obtained using the landmark genes as a set

Grants and funding

11571173/Natural Science Foundation of Jilin Province (CN)
