BMC Genomics. 2013 Dec 21;14:910.

doi: 10.1186/1471-2164-14-910.

A new computational strategy for predicting essential genes

Jian Cheng, Wenwu Wu, Yinwen Zhang, Xiangchen Li, Xiaoqian Jiang, Gehong Wei¹, Shiheng Tao

Affiliation

¹ College of Life Science, State Key Laboratory of Crop Stress Biology for Arid Areas, Northwest A&F University, Yangling, Shaanxi, China. weigehong@nwsuaf.edu.cn.

PMID: 24359534
PMCID: PMC3880044
DOI: 10.1186/1471-2164-14-910

Jian Cheng et al. BMC Genomics. 2013.


Abstract

Background: Determination of the minimum gene set for cellular life is one of the central goals in biology. Genome-wide essential gene identification has progressed rapidly in certain bacterial species; however, it remains difficult to achieve in most eukaryotic species. Several computational models have recently been developed to integrate gene features and used as alternatives to transfer gene essentiality annotations between organisms.

Results: We first collected features that were widely used by previous predictive models and assessed the relationships between gene features and gene essentiality using a stepwise regression model. We found two issues that could significantly reduce model accuracy: (i) multicollinearity among gene features and (ii) diverse and even contrasting correlations between gene features and gene essentiality within and among different species. To address these issues, we developed a novel model called the feature-based weighted Naïve Bayes model (FWM), which is based on Naïve Bayes classifiers, logistic regression, and a genetic algorithm. The proposed model assesses features and filters out the effects of multicollinearity and diversity. The performance of FWM was compared with that of other popular models, such as the support vector machine, Naïve Bayes model, and logistic regression model, by applying each to reciprocally predict essential genes among and within 21 species. Our results showed that FWM significantly improves the accuracy and robustness of essential gene prediction.
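As an illustration of the multicollinearity issue named in (i), the sketch below flags pairs of gene features whose Pearson correlation is very high. It is not the authors' procedure: the feature names, the toy values, and the 0.9 threshold are all assumptions chosen only for this example.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinear_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold."""
    flagged = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical per-gene feature values (not data from the paper).
features = {
    "protein_length": [100, 200, 300, 400, 500],
    "gene_length": [310, 605, 910, 1200, 1515],
    "gc_content": [0.40, 0.55, 0.38, 0.52, 0.45],
}
flagged = collinear_pairs(features)
print(flagged)
```

In this toy input only the two length features are flagged, which is the kind of redundancy that can distort a model treating features as independent.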

Conclusions: FWM can remarkably improve the accuracy of essential gene prediction and may be used as an alternative method for other classification work. This method can contribute substantially to the knowledge of the minimum gene sets required for living organisms and the discovery of new drug targets.


Figures

**Figure 1**
**Flow chart for constructing FWM and assessing its performance in predicting essential genes between and within species. (A)** FWM construction. During essential gene prediction from species 1 to species 2, the goal of FWM is to calculate the score vector S_i and the weighted coefficient vector W. To calculate S_i, we mainly employ kernel density estimation (KDE) combined with Naïve Bayes estimation (see Methods). When calculating W, we first collect prior information (e.g., known essential genes in species 2 or from a closely related species); this information is used as training-prediction dataset to assess W in combination with the training set. Finally, we calculate the posterior probability of the genes in species 2 belonging to essential genes based on the weighted Naïve Bayes (WNB) method. **(B)** FWM performance for predicting essential genes between and within species. To assess the performance of FWM within species (e.g., *SCE*–*SCE* or *SPO*–*SPO*), 20%, 50%, and 80% of the whole genes were randomly selected as the training set, respectively, and the rest as testing set. We used the training set itself as a training-prediction set to calculate weights; the AUC score for the testing set was then calculated through the WNB method. Finally, the process was replicated 1,000 times to obtain the corresponding AUC distributions. To predict essential genes between species (e.g., *SCE*–*SPO* or *SPO*–*SCE*), all of the genes in *SCE* (or *SPO*) were selected as the training set, 20% (or 50%, 80%) of the *SPO* (or *SCE*) genes were randomly selected as the training-prediction set, and the rest of the genes were designated as the testing set. Similar to the comparison within species, AUC distributions were obtained by replicating the process 1,000 times.
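The scoring step in panel A can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a Gaussian kernel with a fixed bandwidth for the KDE, takes each feature's score to be a log-likelihood ratio between essential and non-essential genes, and combines the scores in a common weighted Naïve Bayes form (prior log-odds plus the weighted sum of scores). The weights are set by hand here, whereas the caption describes estimating W from a training-prediction set. All function names and toy values are hypothetical.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a density function estimated from `sample` with a Gaussian kernel."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return density

def fit_scores(train_X, train_y, bandwidth=1.0):
    """Build one log-likelihood-ratio scorer per feature from class-wise KDEs."""
    eps = 1e-12  # guards against log(0) far from the training data
    scorers = []
    for k in range(len(train_X[0])):
        ess = [x[k] for x, y in zip(train_X, train_y) if y == 1]
        non = [x[k] for x, y in zip(train_X, train_y) if y == 0]
        p_e, p_n = gaussian_kde(ess, bandwidth), gaussian_kde(non, bandwidth)
        scorers.append(
            lambda v, p_e=p_e, p_n=p_n: math.log((p_e(v) + eps) / (p_n(v) + eps))
        )
    prior = sum(train_y) / len(train_y)  # fraction of essential genes
    return scorers, prior

def posterior(x, scorers, prior, weights):
    """Weighted Naive Bayes posterior probability that gene x is essential."""
    logit = math.log(prior / (1 - prior))
    logit += sum(w * s(v) for w, s, v in zip(weights, scorers, x))
    return 1 / (1 + math.exp(-logit))

# Toy training genes with two features; label 1 = essential (invented values).
train_X = [(5.0, 0.3), (6.0, 0.9), (7.0, 0.5), (1.0, 0.4), (2.0, 0.8), (3.0, 0.6)]
train_y = [1, 1, 1, 0, 0, 0]
scorers, prior = fit_scores(train_X, train_y)
weights = [1.0, 0.0]  # assumption: feature 2 is down-weighted to zero
print(posterior((6.0, 0.5), scorers, prior, weights))
print(posterior((2.0, 0.5), scorers, prior, weights))
```

Setting a weight to zero removes that feature's contribution entirely, which is how a weighting scheme of this kind can suppress redundant or misleading features.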

**Figure 2**
**Essential gene prediction within and between species by NBM and FWM.** A, C, and E show the AUC distributions within species (*SCE*–*SCE*), which are generated by randomly selecting 20% **(A)**, 50% **(C)**, and 80% **(E)** of the *SCE* genes as training data. B, D, and F show the AUC distributions between species (*SCE*–*SPO*), which are generated by randomly selecting 20% **(B)**, 50% **(D)**, and 80% **(F)** of the *SPO* genes as a training-prediction set to estimate the weight vector W. Blue and red lines represent the distributions obtained by NBM and FWM, respectively.

**Figure 3**
**Comparison of FWM with LRM, NBM, and SVM.** Four AUC matrices among the 21 species are produced using the four methods. The AUC scores (m_ij) at the same position in the four matrices are then sorted and replaced with rank markers for the four methods (first for the maximum AUC score, then second and third, and fourth for the minimum score). By calculating the frequency of each rank (i.e., first, second, third, and fourth) in the four matrices, performance distributions for the four methods were generated. AUC scores ranked first or second are classified as high-quality performance, while those ranked third or fourth are classified as low-quality performance. Significant differences are tested by Fisher’s exact test, and the results are shown in the lower triangular table.
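The ranking procedure in this caption can be sketched as below. The method labels `M1` to `M4` and the 2×2 matrices are placeholders invented for the example (the paper uses four 21×21 matrices), ties are assumed absent, and the significance test is left out.

```python
from collections import Counter

def rank_frequencies(auc_by_method):
    """Count how often each method takes rank 1..n across all matrix cells."""
    methods = list(auc_by_method)
    reference = auc_by_method[methods[0]]
    counts = {m: Counter() for m in methods}
    for i in range(len(reference)):
        for j in range(len(reference[i])):
            # Highest AUC in this cell gets rank 1, lowest gets the last rank.
            ordered = sorted(methods, key=lambda m: auc_by_method[m][i][j], reverse=True)
            for rank, m in enumerate(ordered, start=1):
                counts[m][rank] += 1
    return counts

def high_low(counts):
    """Split four-way ranks into (high quality: 1st or 2nd, low quality: 3rd or 4th)."""
    return {m: (c[1] + c[2], c[3] + c[4]) for m, c in counts.items()}

# Invented AUC matrices for four placeholder methods.
auc_by_method = {
    "M1": [[0.90, 0.80], [0.85, 0.88]],
    "M2": [[0.85, 0.82], [0.80, 0.70]],
    "M3": [[0.80, 0.70], [0.75, 0.80]],
    "M4": [[0.70, 0.75], [0.82, 0.60]],
}
summary = high_low(rank_frequencies(auc_by_method))
print(summary)
```

The resulting high/low counts for any two methods form a 2×2 contingency table, which is the shape of input that Fisher's exact test expects.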

**Figure 4**
**Comparison of FWM and NBM in stepwise discriminant.** Examples of *SCE–ECO* **(A)**, *ECO–SSA* **(B)**, *SSA–SPO* **(C)**, and *SPO–SCE* **(D)** are plotted. The labels on the X-axis from left to right indicate the order in which features were selected into the model according to their prediction effects. The values above the X-axis represent the individual prediction effect of the corresponding feature. FWM indicates feature-based weighted Naïve Bayes model and NBM indicates Naïve Bayes model.

**Figure 5**
**Comparison of ROC curves between FWM and NBM.** Examples of *SCE–ECO* **(A)**, *ECO–SSA* **(B)**, *SSA–SPO* **(C)**, and *SPO–SCE* **(D)** are plotted. TPR (sensitivity) is plotted on the Y-axis and FPR (1 − specificity) is plotted on the X-axis with threshold values from 0 to 1. Blue lines represent ROC curves generated by NBM and red lines represent ROC curves generated by FWM. FWM indicates feature-based weighted Naïve Bayes model, NBM indicates Naïve Bayes model, and AUC indicates area under the curve.
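For readers unfamiliar with how such curves are built, the sketch below computes ROC points and the AUC from predicted probabilities by sweeping the decision threshold. The scores and labels are invented for the example, and the trapezoidal rule is one standard way to obtain the area; the paper's own tooling is not stated here.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the threshold step by step."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Invented posterior probabilities and true labels (1 = essential).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print(points)
print(auc(points))
```

Here 8 of the 9 essential/non-essential pairs are ordered correctly by score, so the area comes out at 8/9.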


Cited by

Network-based features enable prediction of essential genes across diverse organisms.
Azhagesan K, Ravindran B, Raman K. PLoS One. 2018 Dec 13;13(12):e0208722. doi: 10.1371/journal.pone.0208722. PMID: 30543651. Free PMC article.

Machine learning methods for predicting essential metabolic genes from Plasmodium falciparum genome-scale metabolic network.
Isewon I, Binaansim S, Adegoke F, Emmanuel J, Oyelade J. PLoS One. 2024 Dec 23;19(12):e0315530. doi: 10.1371/journal.pone.0315530. PMID: 39715240. Free PMC article.

A Comprehensive Overview of Online Resources to Identify and Predict Bacterial Essential Genes.
Peng C, Lin Y, Luo H, Gao F. Front Microbiol. 2017 Nov 27;8:2331. doi: 10.3389/fmicb.2017.02331. PMID: 29230204. Free PMC article. Review.

Predicting Essential Genes and Proteins Based on Machine Learning and Network Topological Features: A Comprehensive Review.
Zhang X, Acencio ML, Lemke N. Front Physiol. 2016 Mar 8;7:75. doi: 10.3389/fphys.2016.00075. PMID: 27014079. Free PMC article. Review.

Bacterial genome reductions: Tools, applications, and challenges.
LeBlanc N, Charles TC. Front Genome Ed. 2022 Aug 31;4:957289. doi: 10.3389/fgeed.2022.957289. PMID: 36120530. Free PMC article. Review.


