BMC Bioinformatics. 2010 Jan 18;11 Suppl 1(Suppl 1):S58.
doi: 10.1186/1471-2105-11-S1-S58.

Active machine learning for transmembrane helix prediction

Hatice U Osmanbeyoglu et al. BMC Bioinformatics.

Abstract

Background: About 30% of genes code for membrane proteins, which are involved in a wide variety of crucial biological functions. Despite their importance, experimentally determined membrane protein structures account for only about 1.7% of the protein structures deposited in the Protein Data Bank, owing to the difficulty of crystallizing membrane proteins. Algorithms that can identify proteins whose high-resolution structures would aid in predicting the structure of many previously unresolved proteins are therefore of potentially high value. Active machine learning is a supervised machine learning approach well suited to this domain, where there are a large number of sequences but very few have known corresponding structures. In essence, active learning seeks to identify proteins whose structure, if revealed experimentally, is maximally predictive of others.

Results: An active learning approach is presented for selecting a minimal set of proteins whose structures can aid in determining the transmembrane (TM) helices of the remaining proteins. TMpro, an algorithm for high-accuracy TM helix prediction that we previously developed, is coupled with active learning. We show that, with a well-designed selection procedure, high accuracy can be achieved with only a few proteins. TMpro trained with a single protein achieved an F-score of 94% on the benchmark evaluation and 91% on the MPtopo dataset; these correspond to state-of-the-art accuracies for TM helix prediction, which are usually achieved by training with over 100 proteins.
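
The F-score quoted above combines the precision and recall of predicted TM segments into one number. The following is a minimal sketch in Python, assuming the standard harmonic-mean definition; the abstract does not state how a predicted segment is matched to an observed one, so the function takes segment counts as inputs. The function name, parameter names, and example counts are hypothetical and are not taken from the paper.

```python
def segment_f_score(correct: int, predicted: int, observed: int) -> float:
    """Harmonic mean of segment-level precision and recall.

    correct   -- predicted TM segments that match an observed segment
                 (the matching criterion is an assumption left to the caller)
    predicted -- total TM segments predicted
    observed  -- total TM segments in the reference annotation
    """
    if predicted == 0 or observed == 0:
        return 0.0
    precision = correct / predicted
    recall = correct / observed
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
example = segment_f_score(correct=9, predicted=10, observed=10)
```

With equal precision and recall the F-score equals both; when they differ, the harmonic mean is pulled toward the lower of the two, which penalizes a predictor that over- or under-calls helices.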

Conclusion: Active learning is suitable for bioinformatics applications in which manually characterized data are not a comprehensive representation of all possible data and may in fact be a very sparse subset thereof. It aids in selecting data instances that, when characterized experimentally, can improve the accuracy of computational characterization of the remaining raw data. The results presented here also demonstrate that the feature extraction method of TMpro is well designed, achieving very good separation between TM and non-TM segments.

Figures

Figure 1
The coverage of the SOM network over the data. Only 1000 data points are shown, for clarity.
Figure 2
Segment-level TM prediction F-score results for MPtopo. (A) Random, (B) Node-coverage, (C) Confusion-rated, (D) Node-coverage and confusion-rated. TMpro achieves high segment-level accuracy (F-score) even when the classifier is trained with just one protein selected by the active learning algorithms. Node-coverage shows the best performance.
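
The abstract does not define the node-coverage criterion named in Figure 2. One plausible reading, given purely as an assumption, is that each protein's feature vectors are mapped to nodes of the SOM and proteins are preferred when they touch many nodes not yet covered. The Python sketch below illustrates that reading as a greedy selection; the function name, the protein identifiers, and the node assignments are all hypothetical, and this is not presented as the authors' procedure.

```python
from typing import Dict, List, Set


def select_by_node_coverage(
    protein_nodes: Dict[str, Set[int]], budget: int
) -> List[str]:
    """Greedily pick up to `budget` proteins that cover the most new SOM nodes.

    protein_nodes maps a protein ID to the set of SOM node indices that
    its feature vectors fall on (assumed to be computed elsewhere).
    """
    remaining = dict(protein_nodes)
    covered: Set[int] = set()
    selected: List[str] = []
    for _ in range(budget):
        if not remaining:
            break
        # Sort IDs first so that ties are broken alphabetically.
        best = max(sorted(remaining), key=lambda p: len(remaining[p] - covered))
        if not remaining[best] - covered:
            break  # no candidate adds an uncovered node
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical node assignments for three proteins.
toy = {"P1": {1, 2, 3}, "P2": {3, 4}, "P3": {1, 2}}
chosen = select_by_node_coverage(toy, budget=2)
```

Under this reading, a single well-chosen protein can already span much of the map, which would be consistent with the caption's observation that training on one selected protein gives a high F-score.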
