A survey on text classification: Practical perspectives on the Italian language
- PMID: 35793328
- PMCID: PMC9258888
- DOI: 10.1371/journal.pone.0270904
Abstract
Text Classification methods have improved at an unparalleled speed over the last decade thanks to the success of deep learning. Historically, state-of-the-art approaches have been developed for and benchmarked against English datasets, while other languages have had to catch up and deal with inevitable linguistic challenges. This paper offers a survey with practical and linguistic connotations, showcasing the complications and challenges tied to the application of modern Text Classification algorithms to languages other than English. We approach this subject from the perspective of the Italian language, and we discuss in detail the scarcity of task-specific datasets as well as the computational cost of modern approaches. We substantiate this by providing an extensively researched list of available datasets in Italian and comparing it with a similarly compiled list for French. To simulate a real-world practical scenario, we apply a number of representative methods to custom-tailored multilabel classification datasets in Italian, French, and English. We conclude by discussing results, future challenges, and research directions from a linguistically inclusive perspective.
Conflict of interest statement
The authors have declared that no competing interests exist.
