What is in a food store name? Leveraging large language models to enhance food environment data
- PMID: 39712471
- PMCID: PMC11660183
- DOI: 10.3389/frai.2024.1476950
What is in a food store name? Leveraging large language models to enhance food environment data
Abstract
Introduction: It is not uncommon to repurpose administrative food data to create food environment datasets in the health department and research settings; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is rate-limiting to essential operations and analysis. We show that such categorizations should be viewed as a classification problem well addressed by recent advances in natural language processing and deep learning-with the advent of large language models (LLMs).
Methods: To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.
Results: Based on this rubric, the model was tuned and evaluated (F1micro = 0.710, F1macro = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complimentary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.
Discussion: This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.
Keywords: administrative food data; deep learning; food environment classification; food store name; health department; large language models; machine learning; natural language processing.
Copyright © 2024 Etheredge, Hosmer, Crossa, Suss and Torrey.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Figures
References
LinkOut - more resources
Full Text Sources
