Front Genet. 2023 Oct 5:14:1243874.

doi: 10.3389/fgene.2023.1243874. eCollection 2023.

TextNetTopics Pro, a topic model-based text classification for short text by integration of semantic and document-topic distribution information

Daniel Voskergian¹, Burcu Bakir-Gungor², Malik Yousef³ ⁴

Affiliations

¹ Computer Engineering Department, Faculty of Engineering, Al-Quds University, Jerusalem, Palestine.
² Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Kayseri, Türkiye.
³ Department of Information Systems, Zefat Academic College, Zefat, Israel.
⁴ Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel.

PMID: 37867598
PMCID: PMC10585361
DOI: 10.3389/fgene.2023.1243874

Daniel Voskergian et al. Front Genet. 2023.

Abstract

With the exponential growth in the daily publication of scientific articles, automatic classification and categorization can assist in assigning articles to predefined categories. Article titles are concise descriptions of an article's content and carry valuable information for document classification and categorization. However, the shortness, data sparseness, limited word occurrences, and inadequate contextual information of scientific document titles hinder the direct application of conventional text mining and machine learning algorithms to these short texts, making their classification a challenging task. This study first explores the performance of our earlier approach, TextNetTopics, on short text. Second, we propose an advanced version called TextNetTopics Pro, a novel short-text classification framework that combines lexical features organized in topics of words with the topic distributions extracted by a topic model to alleviate the data-sparseness problem when classifying short texts. We evaluate the proposed approach using nine state-of-the-art short-text topic models on two publicly available datasets of scientific article titles as short-text documents: the first relates to the biomedical field, and the second to computer science publications. Additionally, we compare the predictive performance of the models generated with and without the abstracts. Finally, we demonstrate the robustness and effectiveness of the proposed approach in handling imbalanced data, particularly in the classification of Drug-Induced Liver Injury articles as part of the CAMDA challenge. Taking advantage of the semantic information detected by topic models proved to be a reliable way to improve the overall performance of ML classifiers.

Keywords: feature selection; short text; sparse data; text classification; topic modeling; topic projection; topic selection.
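The feature combination described in the abstract (words of top-ranked topics plus a document-topic distribution) can be illustrated with a minimal sketch. This is not the authors' implementation: the topic-word weights, the choice of top-ranked topics, the `words_per_topic` cutoff, and all function names below are hypothetical, and the topic distribution is approximated by normalized sums of word weights rather than by inference with one of the short-text topic models evaluated in the paper.

```python
from collections import Counter


def topic_distribution(tokens, topic_word_weights):
    """Approximate a document-topic vector (TD features).

    Assumption: each topic is a dict of word -> weight, and a document's
    affinity to a topic is the sum of the weights of its tokens.
    """
    scores = [sum(weights.get(t, 0.0) for t in tokens)
              for weights in topic_word_weights]
    total = sum(scores)
    k = len(topic_word_weights)
    if total == 0:
        # No known word in the title: fall back to a uniform distribution.
        return [1.0 / k] * k
    return [s / total for s in scores]


def topical_vocabulary(topic_word_weights, top_topics, words_per_topic):
    """Collect the highest-weighted words of the selected topics (TW vocabulary)."""
    vocabulary = []
    for idx in top_topics:
        weights = topic_word_weights[idx]
        ranked = sorted(weights, key=weights.get, reverse=True)
        for word in ranked[:words_per_topic]:
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary


def build_features(title, topic_word_weights, top_topics, words_per_topic=2):
    """Concatenate topical-word counts (TW) with the topic distribution (TD)."""
    tokens = title.lower().split()
    vocabulary = topical_vocabulary(topic_word_weights, top_topics, words_per_topic)
    counts = Counter(tokens)
    tw = [counts[w] for w in vocabulary]
    td = topic_distribution(tokens, topic_word_weights)
    return vocabulary, tw + td


# Hypothetical toy topics; a real run would obtain them from a topic model.
toy_topics = [
    {"liver": 0.5, "injury": 0.3, "drug": 0.2},
    {"network": 0.6, "learning": 0.4},
]
vocab, features = build_features("Drug induced liver injury", toy_topics, top_topics=[0])
print(vocab, features)
```

The resulting vector has one entry per selected topical word followed by one entry per topic, so a sparse title still receives dense topic-level features that a downstream classifier can use.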

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**FIGURE 1**
Workflow of TextNetTopics Pro.

**FIGURE 2**
TextNetTopics performance over accumulated top-ranked topics for the CAMDA dataset using various short-text topic models in the T component. Symbols along the line represent the number of accumulated topics.

**FIGURE 3**
TextNetTopics performance over 140 features/terms for the CAMDA dataset using various short-text topic models in the T component.

**FIGURE 4**
TextNetTopics performance over accumulated top-ranked topics for the arXiv dataset using various short-text topic models in the T component. Symbols along the line represent the number of accumulated topics.

**FIGURE 5**
TextNetTopics performance over 140 features/terms for the arXiv dataset using various short-text topic models in the T component.

**FIGURE 6**
Classification performance of CAMDA dataset when utilizing topical words (TW) extracted by TextNetTopics, topic distribution features (TD) generated by Topic Models, and our proposed approach, combining words of top-ranked topics extracted by TextNetTopics with topic distribution features (TW + TD). The light-colored columns represent the highest achieved values.

**FIGURE 7**
Classification performance of our proposed approach over the CAMDA dataset, compared with taking all preprocessed terms with the semantic features.

**FIGURE 8**
Classification performance on the arXiv dataset when utilizing topical words (TW) extracted by TextNetTopics, topic distribution features (TD) generated by Topic Models, and our proposed approach, combining words of top-ranked topics extracted by TextNetTopics with topic distribution features (TW + TD). The light-colored columns represent the highest achieved values.

**FIGURE 9**
Classification performance on the arXiv dataset when utilizing our proposed approach *versus* taking all preprocessed terms with the semantic features.

**FIGURE 10**
Performance of TextNetTopics over accumulated top-ranked topics using various short-text topic models in the T component on regular-sized text, i.e., titles + abstract (CAMDA dataset). Symbols along the line represent the number of accumulated topics.

**FIGURE 11**
Performance of TextNetTopics Pro over accumulated topic distributions with top-ranked topics using various short-text topic models in the T component on regular-sized text, i.e., titles + abstract (CAMDA dataset). Symbols along the line represent the number of accumulated topics.

Cited by

Topic selection for text classification using ensemble topic modeling with grouping, scoring, and modeling approach.
Voskergian D, Jayousi R, Yousef M. Sci Rep. 2024 Oct 9;14(1):23516. doi: 10.1038/s41598-024-74022-2. PMID: 39384798. Free PMC article.
RCE-IFE: recursive cluster elimination with intra-cluster feature elimination.
Kuzudisli C, Bakir-Gungor B, Qaqish B, Yousef M. PeerJ Comput Sci. 2025 Feb 7;11:e2528. doi: 10.7717/peerj-cs.2528. eCollection 2025. PMID: 40062294. Free PMC article.
