What is in a food store name? Leveraging large language models to enhance food environment data
- PMID: 39712471
- PMCID: PMC11660183
- DOI: 10.3389/frai.2024.1476950
Abstract
Introduction: It is not uncommon to repurpose administrative food data to create food environment datasets in health department and research settings; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is rate-limiting to essential operations and analysis. We show that such categorizations should be viewed as a classification problem well addressed by recent advances in natural language processing and deep learning, notably the advent of large language models (LLMs).
Methods: To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.
Results: Based on this rubric, the model was tuned and evaluated (F1micro = 0.710, F1macro = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complementary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.
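For readers unfamiliar with the two F1 figures reported above, the following is a minimal sketch of how micro- and macro-averaged F1 are conventionally computed for a single-label, multi-class classifier. It is not the authors' evaluation code. The function name `f1_scores` and the class labels (`grocery`, `bakery`, `pharmacy`) are hypothetical and do not correspond to the paper's 10 food retail classes.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Tally per-class true positives, false positives, and false negatives.
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Hypothetical labels for illustration only (not data from the study).
y_true = ["grocery", "grocery", "bakery", "pharmacy", "bakery", "grocery"]
y_pred = ["grocery", "bakery", "bakery", "pharmacy", "grocery", "grocery"]

micro, macro = f1_scores(y_true, y_pred)
print(f"F1micro = {micro:.3f}, F1macro = {macro:.3f}")
```

Micro-averaging weights every store equally, so frequent classes dominate the score, whereas macro-averaging weights every class equally. When the two values are close, as in the reported 0.710 and 0.709, that is usually read as a sign that performance is not driven by a few large classes.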
Discussion: This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.
Keywords: administrative food data; deep learning; food environment classification; food store name; health department; large language models; machine learning; natural language processing.
Copyright © 2024 Etheredge, Hosmer, Crossa, Suss and Torrey.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Similar articles
- Assessing the Retail Food Environment in Madrid: An Evaluation of Administrative Data against Ground Truthing. Int J Environ Res Public Health. 2019 Sep 21;16(19):3538. doi: 10.3390/ijerph16193538. PMID: 31546670. Free PMC article.
- Comparing Pre-trained and Feature-Based Models for Prediction of Alzheimer's Disease Based on Speech. Front Aging Neurosci. 2021 Apr 27;13:635945. doi: 10.3389/fnagi.2021.635945. eCollection 2021. PMID: 33986655. Free PMC article.
- Evaluating large language models for health-related text classification tasks with public social media data. J Am Med Inform Assoc. 2024 Oct 1;31(10):2181-2189. doi: 10.1093/jamia/ocae210. PMID: 39121174. Free PMC article.
- A systematic review of factors that influence food store owner and manager decision making and ability or willingness to use choice architecture and marketing mix strategies to encourage healthy consumer purchases in the United States, 2005-2017. Int J Behav Nutr Phys Act. 2019 Jan 14;16(1):5. doi: 10.1186/s12966-019-0767-8. PMID: 30642352. Free PMC article.
- Leveraging Citizen Science for Healthier Food Environments: A Pilot Study to Evaluate Corner Stores in Camden, New Jersey. Front Public Health. 2018 Mar 26;6:89. doi: 10.3389/fpubh.2018.00089. eCollection 2018. PMID: 29632857. Free PMC article. Review.