Leveraging machine learning approaches for predicting potential Lyme disease cases and incidence rates in the United States using Twitter
- PMID: 37845666
- PMCID: PMC10578027
- DOI: 10.1186/s12911-023-02315-z
Abstract
Background: Lyme disease is one of the most commonly reported infectious diseases in the United States (US), accounting for more than [Formula: see text] of all vector-borne diseases in North America.
Objective: In this paper, self-reported tweets on Twitter were analyzed in order to predict potential Lyme disease cases and accurately assess incidence rates in the US.
Methods: The study was done in three stages: (1) Approximately 1.3 million tweets were collected and pre-processed to extract the most relevant Lyme disease tweets with geolocations. A subset of tweets was semi-automatically labelled as relevant or irrelevant to Lyme disease using a set of precise keywords, and the remaining portion was manually labelled, yielding a curated labelled dataset of 77,500 tweets. (2) This labelled dataset was used to train, validate, and test various combinations of NLP word embedding methods and prominent ML classification models, such as TF-IDF and logistic regression, Word2vec and XGBoost, and BERTweet, among others, to identify potential Lyme disease tweets. (3) Lastly, the presence of spatio-temporal patterns in the US over a 10-year period was studied.
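The semi-automatic labelling step in stage (1) can be pictured with a minimal sketch. The abstract does not list the keywords the authors used, so the patterns below, the function name `semi_auto_label`, and the example tweets are all hypothetical; the sketch only shows the general idea of labelling what precise keywords can decide and deferring everything else to manual review.

```python
import re

# Hypothetical keyword patterns; the study's actual keyword set is not given in the abstract.
RELEVANT_PATTERNS = [
    r"\bi (have|had|got) lyme\b",
    r"\bdiagnosed with lyme\b",
    r"\bmy lyme (disease|diagnosis)\b",
]
IRRELEVANT_PATTERNS = [
    r"\blyme regis\b",  # a town in England, not the disease
    r"\bold lyme\b",    # a town in Connecticut, not the disease
]

def semi_auto_label(tweet):
    """Return 1 (relevant), 0 (irrelevant), or None (left for manual labelling)."""
    text = tweet.lower()
    if any(re.search(p, text) for p in IRRELEVANT_PATTERNS):
        return 0
    if any(re.search(p, text) for p in RELEVANT_PATTERNS):
        return 1
    return None  # no precise keyword matched

# Made-up example tweets for illustration only.
tweets = [
    "I was diagnosed with Lyme last summer after a tick bite",
    "Lovely weekend walking the coast at Lyme Regis",
    "Ticks are everywhere this year",
]
labels = [semi_auto_label(t) for t in tweets]
print(labels)
```

Tweets that come back as `None` correspond to the portion that, per the abstract, was labelled by hand.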
Results: Preliminary results showed that BERTweet outperformed all tested NLP classifiers for identifying Lyme disease tweets, achieving the highest classification accuracy and F1-score of [Formula: see text]. There was also a consistent pattern indicating that the West and Northeast regions of the US had a higher tweet rate over time.
Conclusions: We focused on the less-studied problem of using Twitter data as a surveillance tool for Lyme disease in the US. Several crucial findings have emerged from the study. First, there is a fairly strong correlation between classified tweet counts and Lyme disease counts, with both following similar trends. Second, in 2015 and early 2016, social media networks such as Twitter were essential in raising public awareness of Lyme disease. Third, counties with a high incidence rate were not necessarily associated with a high tweet rate, and vice versa. Fourth, BERTweet can be used as a reliable NLP classifier for detecting relevant Lyme disease tweets.
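The two quantities compared in these conclusions, a population-normalized rate and a correlation between tweet counts and case counts, can be illustrated with a short sketch. The abstract does not state which correlation measure or rate denominator the authors used, so Pearson's r and a per-100,000 rate are assumptions here, and every number below is made-up toy data, not a result from the study.

```python
from math import sqrt

def rate_per_100k(count, population):
    """Events per 100,000 population (assumed normalization)."""
    return 100_000 * count / population

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Toy yearly counts, invented purely for illustration.
tweet_counts = [10, 20, 30, 40]
case_counts = [12, 18, 33, 41]

r = pearson(tweet_counts, case_counts)
toy_rate = rate_per_100k(50, 200_000)  # 50 events in a population of 200,000
print(round(r, 3), toy_rate)
```

A high r at the national level says nothing about individual counties, which is consistent with the third finding: a county's tweet rate and incidence rate can diverge even when the aggregate series track each other.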
Keywords: BERT; Incidence Rate; Lyme disease; Twitter.
© 2023. BioMed Central Ltd., part of Springer Nature.
Conflict of interest statement
The authors declare no competing interests.