Comput Biol Med. 2025 Oct;197(Pt A):110963.
doi: 10.1016/j.compbiomed.2025.110963. Epub 2025 Aug 25.

Automated labeling in medical data: A semi-supervised density-based approach for efficient diagnosis model development

Lincy Meera Mathews et al. Comput Biol Med. 2025 Oct.

Abstract

Background: In the rapidly expanding landscape of medical data acquisition, automated diagnosis and analysis models are needed to support healthcare practitioners. Formulating a diagnosis model normally requires labeling the entire dataset manually. Both the machine learning and the human intervention tasks involved are demanding, expensive, and error-prone.

Method: To reduce this effort, the presented work aims to improve the performance of semi-supervised learning by automating the labeling process, thereby decreasing development cost. The approach is demonstrated on benchmark medical datasets in which only a small subset of the samples is labeled. Labeling is carried out by identifying peak-density samples and constructing density clusters from the unlabeled data. The distribution of samples within the clusters is then analyzed to identify high- and low-confidence regions. The samples within the regions are appended to the labeled dataset and mapped to the class of the peak sample. This smaller subset of the data is selected for manual labeling, and its labels can then be propagated to the rest of the data, thus minimizing the project budget.
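
The abstract does not give the details of the proposed algorithm, so the following is only a minimal sketch of the general idea (density peaks, density clusters, a confidence region around each peak, and propagation of the peak's class), not the authors' SSDCCR implementation. It uses a generic density-peak heuristic. The function name `density_peak_propagate`, the Gaussian-kernel density, the quantile-based cutoff `dc_quantile`, the quantile-based confidence radius `conf_quantile`, and the callback `peak_label_fn` (a stand-in for manually labeling a peak sample) are all assumptions introduced for illustration.

```python
import numpy as np

def density_peak_propagate(X, n_peaks, peak_label_fn,
                           dc_quantile=0.1, conf_quantile=0.5):
    """Illustrative sketch only; all parameters are hypothetical choices.

    peak_label_fn(i) stands in for manually labeling sample i and is
    assumed to return an integer class.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Full pairwise Euclidean distance matrix (O(n^2) memory).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Kernel width taken from a quantile of the pairwise distances.
    dc = max(np.quantile(dist[np.triu_indices(n, k=1)], dc_quantile), 1e-12)
    # Local density: Gaussian kernel sum, excluding the sample itself.
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0

    order = np.argsort(-rho)            # samples from densest to sparsest
    delta = np.zeros(n)                 # distance to nearest denser sample
    parent = np.full(n, -1)             # index of that denser sample
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j

    # Peaks are dense samples that lie far from any denser sample.
    gamma = rho * delta
    gamma[order[0]] = np.inf            # the densest sample is always a peak
    peaks = np.argsort(-gamma)[:n_peaks]

    # Each remaining sample joins the cluster of its nearest denser sample.
    cluster = np.full(n, -1)
    cluster[peaks] = np.arange(len(peaks))
    for i in order:
        if cluster[i] == -1:
            cluster[i] = cluster[parent[i]]

    labels = np.empty(n, dtype=int)
    high_conf = np.zeros(n, dtype=bool)
    for c, p in enumerate(peaks):
        members = np.flatnonzero(cluster == c)
        # Only the peak is labeled "manually"; its class is propagated.
        labels[members] = peak_label_fn(int(p))
        # High-confidence region: members closest to the peak.
        radius = np.quantile(dist[p, members], conf_quantile)
        high_conf[members[dist[p, members] <= radius]] = True
    return labels, high_conf, peaks
```

In this sketch only `n_peaks` samples ever reach `peak_label_fn`, which is where the labeling cost would be saved. Samples outside the `high_conf` mask could be held back or reviewed rather than trusted outright; how the paper treats its low-confidence regions is not specified in the abstract.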

Results and conclusion: The results suggest that the proposed SSDCCR (Semi-Supervised Density-Based Clustering with a Confidence Region) outperforms existing algorithms across multiple health datasets, with a significant increase of at least 2 percent in accuracy. The approach is scalable to larger datasets and memory efficient, with lower complexity.

Keywords: Confidence regions; Density-based clustering; Peak samples; Semi-supervised learning.

Conflict of interest statement

Declaration of competing interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
