BMC Genomics. 2013 Dec 21:14:910.
doi: 10.1186/1471-2164-14-910.

A new computational strategy for predicting essential genes

Jian Cheng et al. BMC Genomics.

Abstract

Background: Determination of the minimum gene set for cellular life is one of the central goals in biology. Genome-wide essential gene identification has progressed rapidly in certain bacterial species; however, it remains difficult to achieve in most eukaryotic species. Several computational models have recently been developed to integrate gene features and used as alternatives to transfer gene essentiality annotations between organisms.

Results: We first collected features that were widely used by previous predictive models and assessed the relationships between gene features and gene essentiality using a stepwise regression model. We found two issues that could significantly reduce model accuracy: (i) multicollinearity among gene features and (ii) diverse and even contrasting correlations between gene features and gene essentiality, both within and among species. To address these issues, we developed a novel model called the feature-based weighted Naïve Bayes model (FWM), which is based on Naïve Bayes classifiers, logistic regression, and a genetic algorithm. The proposed model assesses features and filters out the effects of multicollinearity and diversity. The performance of FWM was compared with that of other popular models, namely support vector machine, Naïve Bayes model, and logistic regression model, by applying each model to reciprocally predict essential genes among and within 21 species. Our results showed that FWM significantly improves the accuracy and robustness of essential gene prediction.
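To make the multicollinearity issue concrete, the sketch below flags pairs of gene features whose values are strongly correlated. It is not the stepwise regression analysis described in the abstract; it swaps in a plain pairwise Pearson correlation screen. The feature names, the toy values, and the 0.9 threshold are all hypothetical and chosen only for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def collinear_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for feature pairs with |r| >= threshold.

    `features` maps a feature name to its per-gene values. The threshold
    is an illustrative assumption, not a value taken from the paper.
    """
    names = sorted(features)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) >= threshold:
                flagged.append((a, b, r))
    return flagged


# Hypothetical toy data: five genes, three made-up features.
toy_features = {
    "expression": [1.0, 2.0, 3.0, 4.0, 5.0],
    "codon_bias": [2.1, 3.9, 6.2, 8.0, 9.9],  # roughly 2 x expression
    "paralog_count": [3.0, 1.0, 4.0, 1.0, 5.0],
}

flagged = collinear_pairs(toy_features)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```

In this toy set only the "expression" and "codon_bias" pair is flagged. A Naïve Bayes classifier that treated both as independent evidence would effectively count the same signal twice, which is one plausible reading of why per-feature weighting could help.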

Conclusions: FWM can remarkably improve the accuracy of essential gene prediction and may be used as an alternative method for other classification work. This method can contribute substantially to the knowledge of the minimum gene sets required for living organisms and the discovery of new drug targets.

Figures

Figure 1
Flow chart for constructing FWM and assessing its performance in predicting essential genes between and within species. (A) FWM construction. When predicting essential genes from species 1 to species 2, the goal of FWM is to calculate the score vector Si and the weighted coefficient vector W. To calculate Si, we mainly employ kernel density estimation (KDE) combined with Naïve Bayes estimation (see Methods). To calculate W, we first collect prior information (e.g., known essential genes in species 2 or in a closely related species); this information is used as a training-prediction dataset to estimate W in combination with the training set. Finally, we calculate the posterior probability that each gene in species 2 is essential using the weighted Naïve Bayes (WNB) method. (B) FWM performance for predicting essential genes between and within species. To assess the performance of FWM within species (e.g., SCE–SCE or SPO–SPO), 20%, 50%, or 80% of all genes were randomly selected as the training set, and the rest were used as the testing set. We used the training set itself as the training-prediction set to calculate the weights; the AUC score for the testing set was then calculated using the WNB method. Finally, the process was replicated 1,000 times to obtain the corresponding AUC distributions. To predict essential genes between species (e.g., SCE–SPO or SPO–SCE), all of the genes in SCE (or SPO) were selected as the training set, 20% (or 50%, 80%) of the SPO (or SCE) genes were randomly selected as the training-prediction set, and the remaining genes were designated as the testing set. As in the within-species comparison, AUC distributions were obtained by replicating the process 1,000 times.
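The scoring step in panel (A) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-feature score is taken to be a log-likelihood ratio between KDE-estimated densities for essential and non-essential genes, the posterior is a logistic function of the weighted sum of scores, and the weights are supplied by hand (the abstract says they are derived with logistic regression and a genetic algorithm, which is not reproduced here). All function names, the bandwidth, and the toy data are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate as a function of x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def fit_scorers(train_x, train_y, bandwidth=0.5):
    """Build one log-likelihood-ratio scorer per feature.

    train_x: list of feature vectors; train_y: 1 = essential, 0 = not.
    """
    eps = 1e-12  # keeps the ratio finite where a density is ~0
    scorers = []
    for j in range(len(train_x[0])):
        ess = [row[j] for row, y in zip(train_x, train_y) if y == 1]
        non = [row[j] for row, y in zip(train_x, train_y) if y == 0]
        kde_e = gaussian_kde(ess, bandwidth)
        kde_n = gaussian_kde(non, bandwidth)
        scorers.append(
            lambda x, e=kde_e, n=kde_n: math.log((e(x) + eps) / (n(x) + eps))
        )
    return scorers


def posterior(x, scorers, weights, prior=0.5):
    """Weighted Naive Bayes posterior P(essential | x) under the assumptions above."""
    logit = math.log(prior / (1.0 - prior)) + sum(
        w * s(v) for w, s, v in zip(weights, scorers, x)
    )
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical toy training set: feature 0 separates the classes,
# feature 1 carries no signal.
train_x = [[2.0, 0.1], [2.2, 0.9], [1.8, 0.5],
           [0.1, 0.2], [0.0, 0.8], [-0.2, 0.5]]
train_y = [1, 1, 1, 0, 0, 0]

scorers = fit_scorers(train_x, train_y)
weights = [1.0, 0.0]  # hand-picked: trust feature 0, ignore feature 1

p_high = posterior([2.0, 0.5], scorers, weights)
p_low = posterior([0.0, 0.5], scorers, weights)
print(f"{p_high:.4f} {p_low:.4f}")
```

Setting a weight to zero removes that feature's contribution entirely, and setting all weights to one recovers an ordinary Naïve Bayes posterior, which is the sense in which the weighting generalizes the unweighted model in this sketch.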
Figure 2
Essential gene prediction within and between species by NBM and FWM. A, C, and E show the AUC distributions within species (SCE–SCE), which are generated by randomly selecting 20% (A), 50% (C), and 80% (E) of the SCE genes as training data. B, D, and F show the AUC distributions between species (SCE–SPO), which are generated by randomly selecting 20% (B), 50% (D), and 80% (F) of the SPO genes as a training-prediction set to estimate the weight vector W. Blue and red lines represent the distributions obtained by NBM and FWM, respectively.
Figure 3
Comparison of FWM with LRM, NBM, and SVM. Four AUC matrices among the 21 species are produced using the four methods. The AUC scores (mij) at the same position in the four matrices are then sorted and replaced with rank markers for the four methods (first for the maximum AUC score, then second and third, and finally fourth for the minimum score). By calculating the frequency of each rank (i.e., first, second, third, and fourth) in the four matrices, performance distributions for the four methods were generated. AUC scores ranked first or second can be classified as high-quality performance, while those ranked third or fourth can be classified as low-quality performance. Significant differences are tested by Fisher’s exact test, and the results are shown in the lower triangular table.
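The ranking procedure in this caption can be sketched as below. The 2 × 2 matrices and their AUC values are made up for illustration (the paper uses matrices over 21 species), ties are broken by the order in which methods are listed, which is an assumption the caption does not address, and the Fisher's exact test step is not included.

```python
from collections import Counter


def rank_frequencies(auc_by_method):
    """Count how often each method takes each rank across matrix cells.

    auc_by_method maps a method name to a matrix (list of rows) of AUC
    scores; all matrices must have the same shape. Rank 1 is the highest
    AUC in a cell. Ties fall back to the dictionary's insertion order.
    """
    methods = list(auc_by_method)
    shape_ref = auc_by_method[methods[0]]
    counts = {m: Counter() for m in methods}
    for i in range(len(shape_ref)):
        for j in range(len(shape_ref[i])):
            ordered = sorted(
                methods, key=lambda m: auc_by_method[m][i][j], reverse=True
            )
            for rank, method in enumerate(ordered, start=1):
                counts[method][rank] += 1
    return counts


def high_quality(counts):
    """Number of cells in which each method ranked first or second."""
    return {m: c[1] + c[2] for m, c in counts.items()}


# Hypothetical AUC matrices (values invented for the example).
toy_auc = {
    "FWM": [[0.90, 0.85], [0.80, 0.88]],
    "NBM": [[0.80, 0.86], [0.70, 0.75]],
    "LRM": [[0.85, 0.80], [0.75, 0.80]],
    "SVM": [[0.70, 0.70], [0.78, 0.70]],
}

counts = rank_frequencies(toy_auc)
summary = high_quality(counts)
print(summary)
```

The resulting high-quality versus low-quality counts for a pair of methods form a 2 × 2 contingency table, which is the kind of input a Fisher's exact test takes.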
Figure 4
Comparison of FWM and NBM in stepwise discriminant. Examples of SCE–ECO (A), ECO–SSA (B), SSA–SPO (C), and SPO–SCE (D) are plotted. The labels on the X-axis from left to right indicate the order in which the features are selected into the model according to their prediction effects. The values above the X-axis represent the singular prediction effect of the corresponding feature. FWM indicates feature-based weighted Naïve Bayes model and NBM indicates Naïve Bayes model.
Figure 5
Comparison of ROC curves between FWM and NBM. Examples of SCE–ECO (A), ECO–SSA (B), SSA–SPO (C), and SPO–SCE (D) are plotted. TPR (sensitivity) is plotted on the Y-axis and FPR (1 − specificity) is plotted on the X-axis, with threshold values from 0 to 1. Blue lines represent ROC curves generated by NBM and red lines represent ROC curves generated by FWM. FWM indicates feature-based weighted Naïve Bayes model, NBM indicates Naïve Bayes model, and AUC indicates area under the curve.
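For readers unfamiliar with how a ROC curve and its AUC are obtained from predicted probabilities, here is a small self-contained sketch. It sweeps the decision threshold over the observed scores and integrates with the trapezoidal rule; the scores and labels are invented, and nothing here is taken from the paper's own code.

```python
def roc_points(scores, labels):
    """Return ROC points as (FPR, TPR) pairs, starting at (0, 0).

    scores: predicted probability of being essential, one per gene.
    labels: 1 = essential, 0 = non-essential. Genes with equal scores
    are added together so that ties produce a single diagonal step.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Lower the threshold to this score: every gene at it is called positive.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical predictions for six genes.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]

points = roc_points(toy_scores, toy_labels)
area = auc(points)
print(f"AUC = {area:.4f}")
```

The AUC equals the fraction of (essential, non-essential) gene pairs in which the essential gene receives the higher score, which is 8 of 9 pairs in this toy example.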
