2020 Jul 14:3:42.
doi: 10.3389/frai.2020.00042. eCollection 2020.

Using Topic Modeling Methods for Short-Text Data: A Comparative Analysis

Rania Albalawi et al. Front Artif Intell.

Abstract

With the growth of online social network platforms and applications, large amounts of textual user-generated content are created daily in the form of comments, reviews, and short-text messages. As a result, users often find it challenging to extract useful information from such content or to learn more about the topic being discussed. Machine learning and natural language processing algorithms are used to analyze the massive amount of textual social media data available online; among them, topic modeling techniques have gained popularity in recent years. This paper surveys topic modeling and its common application areas, methods, and tools. We also examine and compare five frequently used topic modeling methods, as applied to short textual social data, to show their practical benefits in detecting important topics. These methods are latent semantic analysis, latent Dirichlet allocation, non-negative matrix factorization, random projection, and principal component analysis. Two textual datasets were selected to evaluate the performance of these methods based on topic quality and standard statistical evaluation metrics: recall, precision, F-score, and topic coherence. In this evaluation, latent Dirichlet allocation and non-negative matrix factorization delivered more meaningful extracted topics and obtained good results. The paper sheds light on some common topic modeling methods in a short-text context and provides direction for researchers who seek to apply these methods.

Keywords: natural language processing; online social networks; short text; topic modeling; user-generated content.
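
The abstract names recall, precision, and F-score among the evaluation metrics. As a minimal sketch of how these three quantities relate, the following computes them for a set of predicted items against a set of relevant items. The function name `precision_recall_f1` and the document identifiers are hypothetical and chosen only for illustration; the abstract does not describe the exact evaluation protocol, so this should not be read as the authors' procedure.

```python
def precision_recall_f1(predicted, relevant):
    """Return (precision, recall, F1) for two collections of item IDs.

    Hypothetical helper for illustration; not taken from the paper.
    """
    predicted, relevant = set(predicted), set(relevant)
    true_positives = len(predicted & relevant)

    # Precision: fraction of predicted items that are actually relevant.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of relevant items that were predicted.
    recall = true_positives / len(relevant) if relevant else 0.0
    # F1: harmonic mean of precision and recall (0.0 when both are zero).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example: documents assigned to a topic vs. documents
    # that truly belong to it.
    assigned = ["d1", "d2", "d3", "d4"]
    truth = ["d2", "d3", "d5"]
    p, r, f = precision_recall_f1(assigned, truth)
    print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
```

In this made-up example, two of the four assigned documents are correct and two of the three true documents are found, so precision is 1/2, recall is 2/3, and the F-score is 4/7.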

Figures

Figure 1. The steps involved in a text mining process (Kaur and Singh, 2019).
Figure 2. Topic modeling for text data.
Figure 3. SVD of the LSA topic modeling method (Neogi et al., 2020).
Figure 4. The original structure of the LDA topic model.
Figure 5. The original structure of the NMF topic model (Chen et al., 2019).
Figure 6. The F-score average results with different numbers of features f = 10, 100, 1,000, 10,000 (20-newsgroup dataset).

References

    1. Ahmed Taloba I., Eisa D. A., Safaa Ismail S. I. (2018). A comparative study on using principle component analysis with different text classifiers. Int. J. Comp. Appl. 180, 1–6. doi: 10.5120/ijca2018916800
    2. Albalawi R., Yeap T. H. (2019). “ChatWithRec: Toward a real-time conversational recommender system,” in ISERD 174th International Conference. The International Conference on Computer Science, Machine Learning and Big Data (ICCSMLBD) (New York, NY), 67–71. Available online at: http://www.worldresearchlibrary.org/up_proc/pdf/3216-157319215067-71.pdf
    3. Albalawi R., Yeap T. H., Benyoucef M. (2019). “Toward a real-time social recommendation system,” in MEDES'19 (Limassol, Cyprus), 336–340. doi: 10.1145/3297662.3365789
    4. Alghamdi R., Alfalqi K. (2015). A survey of topic modeling in text mining. Int. J. Adv. Comp. Sci. Appl. 6, 147–153. doi: 10.14569/IJACSA.2015.060121
    5. Anantharaman A., Jadiya A., Siri C. T. S., Bharath Nvs A., Mohan B. (2019). “Performance evaluation of topic modeling algorithms for text classification,” in 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) (Tirunelveli).