Sci Rep. 2018 Aug 29;8(1):13009.
doi: 10.1038/s41598-018-31395-5.

A novel logistic regression model combining semi-supervised learning and active learning for disease classification

Hua Chai et al. Sci Rep. 2018.

Abstract

Traditional supervised classifiers need many labeled samples to perform well, but many biological datasets contain only a few labeled samples, with the rest unlabeled. Labeling these samples manually is difficult or expensive. Techniques such as active learning and semi-supervised learning have been proposed to exploit unlabeled samples and improve model performance. However, active learning models can be short-sighted or biased and still require some manual labeling effort, while semi-supervised learning methods are easily affected by noisy samples. In this paper we propose a novel logistic regression model based on the complementarity of active learning and semi-supervised learning, which uses the unlabeled samples at the least cost to improve disease classification accuracy. In addition, a mechanism for updating pseudo-labeled samples is designed to reduce the number of falsely pseudo-labeled samples. The experimental results show that the new model achieves better performance than widely used semi-supervised learning and active learning methods in disease classification and gene selection.
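
The general idea of pairing the two techniques can be illustrated with a minimal sketch. This is not the authors' algorithm, whose details the abstract does not give. It assumes a plain, unregularized logistic regression fitted by gradient descent, uncertainty sampling as the active learning step, and confidence-thresholded pseudo-labels that are rebuilt every round as a stand-in for the update mechanism. All function names, the threshold `tau`, the query budget, and the toy data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    # w[0] is the bias; the remaining entries weight the features.
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


def fit_logreg(X, y, epochs=300, lr=0.5):
    # Batch gradient descent on the logistic loss (assumption: no penalty term).
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            err = predict_proba(w, x) - t
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def ssl_al_train(X_lab, y_lab, X_unl, oracle, rounds=5, queries_per_round=2, tau=0.9):
    # oracle(i) returns the true label of X_unl[i]; it models the manual labeling cost.
    X_lab, y_lab = list(X_lab), list(y_lab)
    pool = list(range(len(X_unl)))  # indices that are still unlabeled
    n_queries = 0
    w = fit_logreg(X_lab, y_lab)
    for _ in range(rounds):
        probs = {i: predict_proba(w, X_unl[i]) for i in pool}
        # Active learning step: ask the oracle about the least confident samples.
        ranked = sorted(pool, key=lambda i: abs(probs[i] - 0.5))
        for i in ranked[:queries_per_round]:
            X_lab.append(X_unl[i])
            y_lab.append(oracle(i))
            pool.remove(i)
            n_queries += 1
        # Semi-supervised step: pseudo-label only confident samples. The set is
        # rebuilt from scratch each round, so earlier pseudo-labels can be revised.
        pseudo = [(X_unl[i], 1 if probs[i] >= 0.5 else 0)
                  for i in pool if max(probs[i], 1.0 - probs[i]) >= tau]
        w = fit_logreg(X_lab + [x for x, _ in pseudo],
                       y_lab + [t for _, t in pseudo])
    return w, n_queries


def make_blobs(n_per_class, rng):
    # Hypothetical toy data: two Gaussian clusters standing in for two classes.
    X, y = [], []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    rng = random.Random(0)
    X_lab, y_lab = make_blobs(2, rng)    # very few labeled samples
    X_unl, y_unl = make_blobs(30, rng)   # larger unlabeled pool
    w, n_queries = ssl_al_train(X_lab, y_lab, X_unl, oracle=lambda i: y_unl[i])
    print("oracle queries used:", n_queries)
```

In this sketch the oracle is consulted only for the samples the current model is least sure about, while confident predictions are reused as training labels at no labeling cost. That is one plausible reading of the complementarity the abstract refers to.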

Conflict of interest statement

The authors declare no competing interests.

Figures

Figure 1. The workflow of the proposed logistic regression model combining SSL and AL.
Algorithm 1. The algorithm of the semi-supervised logistic regression model.
Figure 2. The classification accuracy of different methods in simulation experiments.
Figure 3. The ROC curves of different methods in simulation experiments.
Figure 4. ROC curves obtained by different methods on real datasets: (a) DLBCL, (b) Prostate, (c) GSE21050, (d) GSE32603.
Figure 5. The number of genes selected by different methods on real datasets.
