What is in a food store name? Leveraging large language models to enhance food environment data

Analee J Etheredge¹, Samuel Hosmer¹, Aldo Crossa¹, Rachel Suss¹, Mark Torrey¹

PMID: 39712471
PMCID: PMC11660183
DOI: 10.3389/frai.2024.1476950

Analee J Etheredge et al. Front Artif Intell. 2024 Dec 6:7:1476950. doi: 10.3389/frai.2024.1476950. eCollection 2024.

Affiliation

¹ Center for Population Health Data Science, NYC Department of Health and Mental Hygiene, New York City, NY, United States.

Abstract

Introduction: Administrative food data are commonly repurposed to create food environment datasets in health department and research settings; however, the available administrative data are rarely categorized in a way that supports meaningful insight or action, and ground-truthing or manually reviewing an entire city or neighborhood is rate-limiting to essential operations and analysis. We show that such categorization should be viewed as a classification problem, one well addressed by recent advances in natural language processing and deep learning, in particular the advent of large language models (LLMs).

Methods: To demonstrate how to automate the process of categorizing food stores, we use the foundation model BERT to give a first approximation to such categorizations: a best guess by store name. First, 10 food retail classes were developed to comprehensively categorize food store types from a public health perspective.

Results: Based on this rubric, the model was tuned and evaluated (F1_micro = 0.710, F1_macro = 0.709) on an extensive storefront directory of New York City. Second, the model was applied to infer insights from a large, unlabeled dataset using store names alone, aiming to replicate known temporospatial patterns. Finally, a complementary application of the model as a data quality enhancement tool was demonstrated on a secondary, pre-labeled restaurant dataset.
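For readers unfamiliar with the two F1 figures reported above, the following is a minimal sketch of how micro- and macro-averaged F1 differ for a single-label, multi-class classifier. It is not the authors' evaluation code: the function name `f1_micro_macro`, the three class names, and the toy labels are illustrative assumptions, and the paper's actual rubric has 10 classes.

```python
from collections import Counter


def f1_micro_macro(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_n, fp_n, fn_n):
        denom = 2 * tp_n + fp_n + fn_n
        return 2 * tp_n / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical, not from the paper).
y_true = ["grocery", "grocery", "convenience",
          "restaurant", "restaurant", "restaurant"]
y_pred = ["grocery", "convenience", "convenience",
          "restaurant", "restaurant", "grocery"]

micro, macro = f1_micro_macro(y_true, y_pred)
print(f"F1_micro = {micro:.3f}, F1_macro = {macro:.3f}")
```

In the single-label setting micro F1 equals overall accuracy, while macro F1 weights every class equally regardless of size. Micro and macro scores that are nearly equal, as reported here, suggest that performance is not dominated by a few large classes.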

Discussion: This novel application of an LLM to the enumeration of the food environment allowed for marked gains in efficiency compared to manual, in-person methods, addressing a known challenge to research and operations in a local health department.

Keywords: administrative food data; deep learning; food environment classification; food store name; health department; large language models; machine learning; natural language processing.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Sensitivity and recall for the hold-out test set. Dashed vertical lines show the sensitivity cutoffs adopted for food environment classification, which were applied to our sensitivity analyses: <20% very poor, 21–30% poor, 31–50% fair, 51–70% moderate, 71–90% good, and >90% excellent (Paquet et al., 2008; Bishop et al., 2021).
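The cutoffs in the Figure 1 caption can be read as a simple lookup from a per-class sensitivity to a qualitative rating. The sketch below is one possible reading, not code from the paper: the function name `sensitivity_band` is hypothetical, and because the caption does not say how boundary or fractional values (for example exactly 20%, or 20.5%) are handled, each band is assumed here to include its upper bound.

```python
def sensitivity_band(sensitivity_pct):
    """Map a sensitivity in percent (0-100) to a qualitative rating.

    Assumption: each band includes its upper bound, so 20.0 is rated
    "very poor" and 70.0 is rated "moderate".
    """
    if not 0 <= sensitivity_pct <= 100:
        raise ValueError("sensitivity must be between 0 and 100 percent")
    bands = [
        (20, "very poor"),
        (30, "poor"),
        (50, "fair"),
        (70, "moderate"),
        (90, "good"),
    ]
    for upper_bound, rating in bands:
        if sensitivity_pct <= upper_bound:
            return rating
    return "excellent"


# Hypothetical per-class sensitivities, for illustration only.
for value in (15, 45, 71, 95):
    print(value, sensitivity_band(value))
```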

**Figure 2**
NYC maps for mean annual change in Grocery and Specialty Stores and Convenience Stores, 2019-2021. Hexbin size: 2640 ft. with overlay border 2010 NTAs **(A)** NYC Grocery and Specialty Stores: Mean businesses per hexbin, 3.7 (SD=0.55) **(B)** NYC Convenience Stores: Mean businesses per hexbin, 7.4 (SD=0.82) **(C)** Chinatown and Sunset Park Feature - Grocery and Specialty Stores: Mean businesses per hexbin, 3.7 (SD=0.55) **(D)** Chinatown and Sunset Park Feature - Convenience Stores: Mean businesses per hexbin, 7.4 (SD=0.82).

**Figure 3**
Heatmap of service description tag frequency in the restaurants dataset as compared to the fast food and restaurant classifier labels. Counts are overlaid for clarity.

**Figure 4**
Frequencies of venue tags in the restaurant dataset: the 11 most frequent of the 41 tags belonging to this variable.

References

1. Agurs-Collins T., Alvidrez J., ElShourbagy Ferreira S., Evans M., Gibbs K., Kowtha B., et al. (2024). Perspective: nutrition health disparities framework: a model to advance health equity. Adv. Nutr. 15:100194. doi: 10.1016/j.advnut.2024.100194
2. Bishop T. R. P., von Hinke S., Hollingsworth B., Lake A. A., Brown H., Burgoine T. (2021). Automatic classification of takeaway food outlet cuisine type using machine (deep) learning. Mach. Learn. Appl. 6:100106. doi: 10.1016/j.mlwa.2021.100106
3. Block J. P., Subramanian S. (2015). Moving beyond “food deserts”: reorienting United States policies to reduce disparities in diet quality. PLoS Med. 12:e1001914. doi: 10.1371/journal.pmed.1001914
4. Boise S., Crossa A., Etheredge A. J., McCulley E. M., Lovasi G. S. (2023). Concepts, characterizations, and cautions: a public health guide and glossary for planning food environment measurement. Open Public Health J. 16, 1–17. doi: 10.2174/18749445-v16-230821-2023-51
5. Braid L., Oliva R., Nichols K., Reyes A., Guzman J., Goldman R. E., et al. (2022). Community perceptions in New York City: sugar-sweetened beverage policies and programs in the first 1000 days. Matern. Child Health J. 26, 193–204. doi: 10.1007/s10995-021-03255-8
