Sci Data. 2024 Feb 22;11(1):221.
doi: 10.1038/s41597-024-02958-1.

Ethnicity data resource in population-wide health records: completeness, coverage and granularity of diversity

Marta Pineda-Moncusí et al. Sci Data.

Abstract

Intersectional social determinants including ethnicity are vital in health research. We curated a population-wide data resource of self-identified ethnicity data from over 60 million individuals in primary care in England, linking it to hospital records. We assessed ethnicity data in terms of completeness, consistency, and granularity and found that one in ten individuals do not have ethnicity information recorded in primary care. By linking to hospital records, ethnicity data were completed for 94% of individuals. By reconciling SNOMED-CT concepts and census-level categories into a consistent hierarchy, we organised more than 250 ethnicity sub-groups including and beyond "White", "Black", "Asian", "Mixed" and "Other", and found them to be distributed in proportions similar to the general population. This large observational dataset presents an algorithmic hierarchy to represent self-identified ethnicity data collected across heterogeneous healthcare settings. Accurate and easily accessible ethnicity data can lead to a better understanding of population diversity, which is important to address disparities and influence policy recommendations that can translate into better, fairer health for all.

Conflict of interest statement

AA is supported by Health Data Research UK (HDR-9006), which receives its funding from the UK Medical Research Council (MRC, grant MR/V028367/1); and Administrative Data Research UK, which is funded by the ESRC (grant ES/S007393/1). SD is supported by: BigData@Heart Consortium, funded by the Innovative Medicines Initiative-2 Joint Undertaking under grant agreement 116074, The British Heart Foundation Data Science Centre (grant No SP/19/3/34678, awarded to Health Data Research (HDR) UK), NIHR Biomedical Research Centre at University College London (UCL) Hospital NHS Trust, the NIHR-UKRI CONVALESCENCE study, BHF Accelerator Award (AA/18/6/24223). CT is supported by a UCL UKRI Centre for Doctoral Training in AI-enabled Healthcare studentship (EP/S021612/1), MRC Clinical Top-Up and a studentship from the NIHR Biomedical Research Centre at University College London Hospital NHS Trust. KK is the director of Centre for Ethnic Health Research, and trustee of South Asian Health Foundation. SK has received research grant funding from the UKRI and Alan Turing Institute for this work, and from Amgen and UCB Biopharma, and Bill & Melinda Gates Foundation outside of this work. DPA’s research group has received grant/s from Amgen, Chiesi-Taylor, Lilly, Janssen, Novartis, and UCB Biopharma. His research group has received consultancy fees from Astra Zeneca and UCB Biopharma. Amgen, Astellas, Janssen, Synapse Management Partners and UCB Biopharma have funded or supported training programmes organised by SK and DPA’s department. The remaining authors have nothing to declare.

Figures

Fig. 1
How ethnicity is collected in the UK and typically used for research. The A-Z letters are the nomenclature observed in the data to represent the NHS ethnicity codes. Abbreviations: High-level ethnicity groups, general ethnicity classification groups from the Office for National Statistics commonly used in research; NHS, National Health Service in the UK; SNOMED, SNOMED-CT records containing ethnicity concepts.
Fig. 2
Visual representation of the hierarchy between the three ethnicity classifications, from the broadest to the most specific: High-level ethnicity groups, NHS ethnicity codes and SNOMED concepts. The A-Z letters are the nomenclature observed in the data to represent the NHS ethnicity codes. The colours of the High-level ethnicity groups show how the NHS ethnicity codes and SNOMED-CT concepts can be aggregated into this 6-category classification. The letters C and T are shown in a different colour from their concepts, Chinese and Gypsy/Irish Traveller, respectively: the colours of the concepts represent the current aggregation algorithm available in the NHS England SDE, whilst the colours of the letters show the aggregation suggested by the UK Office for National Statistics. Abbreviations: *, the Unknown category is not always included; NHS, National Health Service in the UK; SNOMED, SNOMED-CT records containing ethnicity codes; SDE, Secure Data Environment.
Fig. 3
Decision tree of the preferred source of ethnicity. Solid arrows mark the preferred option whilst dashed arrows indicate the alternative route. Abbreviations: GDPPR, General Practice Extraction Service (GPES) Data for Pandemic Planning and Research; HES-APC, Hospital Episode Statistics Admitted Patient Care; SNOMED, SNOMED-CT records containing ethnicity codes.
Fig. 4
Flow chart of the availability of ethnicity records for individuals present in GDPPR. Abbreviations: GDPPR, General Practice Extraction Service (GPES) Data for Pandemic Planning and Research; HES-APC, Hospital Episode Statistics Admitted Patient Care; NHS, National Health Service in the UK; SNOMED, SNOMED-CT records containing ethnicity concepts; NA, ethnicity not available.
Fig. 5
Sankey plot showing potential discrepancies between SNOMED concepts and NHS ethnicity codes mapping. Abbreviations: NHS, National Health Service in the UK; SNOMED, SNOMED-CT records containing ethnicity concepts.
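
The hierarchy in Fig. 2 and the decision tree in Fig. 3 can be illustrated with a minimal Python sketch. It is not the authors' algorithm. The SNOMED identifiers are hypothetical placeholders, the NHS codes are a small illustrative subset, and the source order (GDPPR SNOMED concept, then GDPPR NHS code, then HES-APC) is an assumption, since the caption does not spell out the order.

```python
from typing import Optional, Tuple

# Hypothetical placeholder identifiers, NOT real SNOMED-CT concept IDs.
SNOMED_TO_NHS = {
    "concept-white-british": "A",
    "concept-indian": "H",
    "concept-caribbean": "M",
}

# Illustrative subset of NHS ethnicity codes rolled up to high-level groups.
NHS_TO_HIGH_LEVEL = {
    "A": "White",
    "H": "Asian",
    "M": "Black",
    "Z": "Unknown",  # "Not stated" carries no usable ethnicity
}


def high_level_from_nhs(code: Optional[str]) -> Optional[str]:
    """Return the high-level group for an NHS code, or None if unusable."""
    if code is None:
        return None
    group = NHS_TO_HIGH_LEVEL.get(code)
    return None if group in (None, "Unknown") else group


def resolve_ethnicity(
    gdppr_snomed: Optional[str],
    gdppr_nhs_code: Optional[str],
    hes_apc_nhs_code: Optional[str],
) -> Tuple[str, str]:
    """Pick the first source with a usable record (assumed preference order)."""
    candidates = [
        ("GDPPR SNOMED", SNOMED_TO_NHS.get(gdppr_snomed) if gdppr_snomed else None),
        ("GDPPR NHS code", gdppr_nhs_code),
        ("HES-APC", hes_apc_nhs_code),
    ]
    for source, nhs_code in candidates:
        group = high_level_from_nhs(nhs_code)
        if group is not None:
            return source, group
    return "none", "Unknown"
```

In this sketch, an individual with no usable primary care record falls back to the hospital record, which mirrors how linkage to hospital records raises completeness in the abstract.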
