EBioMedicine. 2023 Jan:87:104413.
doi: 10.1016/j.ebiom.2022.104413. Epub 2022 Dec 21.

Generalisable long COVID subtypes: findings from the NIH N3C and RECOVER programmes


Justin T Reese et al. EBioMedicine. 2023 Jan.

Abstract

Background: Stratification of patients with post-acute sequelae of SARS-CoV-2 infection (PASC, or long COVID) would allow precision clinical management strategies. However, long COVID is incompletely understood and characterised by a wide range of manifestations that are difficult to analyse computationally. Additionally, the generalisability of machine learning classification of COVID-19 clinical outcomes has rarely been tested.

Methods: We present a method for computationally modelling PASC phenotype data based on electronic healthcare records (EHRs) and for assessing pairwise phenotypic similarity between patients using semantic similarity. Our approach defines a nonlinear similarity function that maps from a feature space of phenotypic abnormalities to a matrix of pairwise patient similarity that can be clustered using unsupervised machine learning.

Findings: We found six clusters of PASC patients, each with distinct profiles of phenotypic abnormalities, including clusters with distinct pulmonary, neuropsychiatric, and cardiovascular abnormalities, and a cluster associated with broad, severe manifestations and increased mortality. There was significant association of cluster membership with a range of pre-existing conditions and measures of severity during acute COVID-19. We assigned new patients from other healthcare centres to clusters by maximum semantic similarity to the original patients, and showed that the clusters were generalisable across different hospital systems. The increased mortality rate originally identified in one cluster was consistently observed in patients assigned to that cluster in other hospital systems.

Interpretation: Semantic phenotypic clustering provides a foundation for assigning patients to stratified subgroups for natural history or therapy studies on PASC.

Funding: NIH (TR002306/OT2HL161847-01/OD011883/HG010860), U.S.D.O.E. (DE-AC02-05CH11231), Donald A. Roux Family Fund at Jackson Laboratory, Marsico Family at CU Anschutz.

Keywords: COVID-19; Human Phenotype Ontology; Long COVID; Machine learning; Precision medicine; Semantic similarity.


Conflict of interest statement

Declaration of interests: T. Bergquist received other support from the Bill and Melinda Gates Foundation. H. Davis received support from the Balvi Foundation and is a cofounder of the Patient Led Research Collaborative. The other authors declare that they have no other competing interests.

Figures

Fig. 1
Cohort construction. Patients with long COVID (U09.9 diagnosis) were extracted from the much larger dataset of the N3C. Long COVID patients were selected from the six data partners that provided data for at least 300 U09.9 patients and had an average of at least 7 long COVID HPO terms per patient. The data partner with the most U09.9 patients (data partner 1) was chosen for clustering, and additional U09.9 patients from five other data partners (data partners 2–6) were chosen to assess generalizability.
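
The generalisability check described in the abstract (assigning patients from data partners 2–6 to the clusters found in data partner 1 by maximum semantic similarity) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it reads "maximum similarity" as "take the cluster of the single most similar original patient", which is an assumption, and it uses Jaccard overlap as a stand-in for the ontology-based similarity. The function names and toy term sets are hypothetical.

```python
def jaccard(a, b):
    """Stand-in similarity (assumption); the paper uses HPO-based semantic similarity."""
    return len(a & b) / len(a | b)

def assign_cluster(new_patient, clustered_patients, sim=jaccard):
    """Return the cluster label of the most similar already-clustered patient.

    clustered_patients: list of (set_of_terms, cluster_label) pairs.
    """
    best_label, best_score = None, -1.0
    for terms, label in clustered_patients:
        score = sim(new_patient, terms)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data: two original patients, one per cluster.
original = [
    ({"Dyspnea", "Cough"}, "pulmonary"),
    ({"Anxiety", "Insomnia"}, "neuropsychiatric"),
]
new_label = assign_cluster({"Dyspnea", "Fatigue"}, original)
```

Any pairwise similarity function with the same call shape can be passed as `sim`, so an ontology-aware measure can replace the stand-in without changing the assignment logic.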
Fig. 2
Calculating patient semantic similarity based on HPO phenotypes. A) HPO terms are arranged in a directed acyclic graph with specific terms such as Bradycardia (HP:0001662) being related to more general terms (here: Arrhythmia; HP:0011675) by subtype relations. An excerpt of the entire ontology (15,247 terms) is shown. B) Example showing a pair of patients with relatively high phenotypic similarity; for each of the HPO terms in patient 1, the best match is sought in patient 2. If an exact match is not found, the algorithm searches for the most informative common ancestor (MICA) in the ontology; the information content (a measure of specificity) of the exact matching term or of the most specific ancestor term is then used as the score for that match. For instance, Visual hallucinations (HP:0002367) and Auditory hallucinations (HP:0008765) are not an exact match, so the information content of their MICA Hallucinations (HP:0000738) is chosen. Hallucinations (HP:0000738) is still relatively specific (and shown in grey), while the MICA of Angina pectoris (HP:0001681) and Hypotension (HP:0002615) is more general (shown in red) and contributes less to the matching score. C) Example of a pair of patients with a relatively lower similarity due to fewer (specific) exact matches and one unmatched term. The pairwise similarity is calculated in this way for all pairs of patients to construct the similarity matrix that is used for clustering (Fig. 3).
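
The matching procedure in this caption can be illustrated with a small sketch. It is written under several assumptions that the caption does not confirm: information content is taken as the log of the inverse annotation frequency in a toy cohort (a Resnik-style definition), best-match scores are averaged and then symmetrised, and the six-term hierarchy is a hypothetical excerpt keyed by term name rather than HPO ID.

```python
import math

# Hypothetical toy excerpt of an is-a hierarchy (child -> parents).
PARENTS = {
    "Visual hallucinations": ["Hallucinations"],
    "Auditory hallucinations": ["Hallucinations"],
    "Hallucinations": ["Phenotypic abnormality"],
    "Bradycardia": ["Arrhythmia"],
    "Arrhythmia": ["Phenotypic abnormality"],
    "Phenotypic abnormality": [],
}

def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def information_content(patients):
    """IC(t) = log(N / number of patients annotated with t or a descendant)."""
    counts = {t: 0 for t in PARENTS}
    for terms in patients:
        implied = set().union(*(ancestors(t) for t in terms))
        for t in implied:
            counts[t] += 1
    n = len(patients)
    return {t: math.log(n / c) for t, c in counts.items() if c > 0}

def mica_ic(t1, t2, ic):
    """IC of the most informative common ancestor (the term itself if t1 == t2)."""
    return max(ic.get(t, 0.0) for t in ancestors(t1) & ancestors(t2))

def directed_similarity(p1, p2, ic):
    """Average, over terms in p1, of the best match found in p2."""
    return sum(max(mica_ic(a, b, ic) for b in p2) for a in p1) / len(p1)

def similarity(p1, p2, ic):
    """Symmetrised best-match average (an assumed aggregation)."""
    return 0.5 * (directed_similarity(p1, p2, ic) + directed_similarity(p2, p1, ic))

# Hypothetical toy cohort.
patients = [
    {"Visual hallucinations", "Bradycardia"},
    {"Auditory hallucinations", "Bradycardia"},
    {"Arrhythmia"},
    {"Visual hallucinations"},
]
ic = information_content(patients)
s_close = similarity(patients[0], patients[1], ic)  # one exact match, one MICA match
s_far = similarity(patients[0], patients[2], ic)    # only a general MICA match
```

In this toy cohort the pair sharing an exact Bradycardia match scores higher than the pair that matches only through the more general Arrhythmia term, mirroring the contrast between panels B and C.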
Fig. 3
Patient similarity matrix illustrating long COVID subtypes in data partner 1. A heatmap representing the 6 clusters created by k-means clustering is shown. Cluster hierarchy was calculated using the nearest point algorithm and Euclidean distance.
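
As a rough illustration of the k-means step, the sketch below clusters the rows of a small similarity matrix, treating each patient's row of similarities as that patient's feature vector. That choice of features, the deterministic farthest-point initialisation, and the toy 6 × 6 matrix are all assumptions made for the example; the caption states only that k-means was used and that the displayed hierarchy was computed separately with the nearest point algorithm and Euclidean distance, which is not reproduced here.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def kmeans(rows, k, iterations=100):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        # Seed the next centroid with the row farthest from its nearest centroid.
        far = max(rows, key=lambda r: min(euclidean(r, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda j: euclidean(r, centroids[j])) for r in rows
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical similarity matrix with two visible blocks of patients.
S = [
    [1.00, 0.90, 0.80, 0.10, 0.20, 0.10],
    [0.90, 1.00, 0.85, 0.15, 0.10, 0.20],
    [0.80, 0.85, 1.00, 0.20, 0.10, 0.10],
    [0.10, 0.15, 0.20, 1.00, 0.90, 0.80],
    [0.20, 0.10, 0.10, 0.90, 1.00, 0.85],
    [0.10, 0.20, 0.10, 0.80, 0.85, 1.00],
]
labels = kmeans(S, k=2)
```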
Fig. 4
Phenotypically characterising long COVID subtype clusters. Shown are the most frequently co-occurring combinations of categories of HPO terms representing long COVID phenotypic features for patients in the overall cohort (A) and for each of the 6 clusters (B). Only categories that were significantly correlated with cluster membership are shown (chi-squared test, p < 0.00001). For the overall population of patients in data partner 1 and for each cluster, the frequency of each category of long COVID HPO terms (left) and the frequency of the three most common combinations of HPO categories (top) are shown. (Six combinations are shown for cluster 5 because of a tie.) Notably, most clusters contain some widely shared features, but also distinguishing features such as symptoms in the pulmonary, neuropsychiatric, and cardiovascular systems. Data are shown as UpSet plots, which visualise set intersections in a matrix layout; the bars above the matrix show the number of patients with the combination indicated by the black dots.
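
The counts behind an UpSet-style display can be sketched as a tally of exact category combinations per patient. This is only an illustration of the counting step, assuming each patient is represented by the set of HPO categories in which they have at least one term; the category names and the helper `top_combinations` are hypothetical, and no plotting is attempted.

```python
from collections import Counter

def top_combinations(patients, n=3):
    """Count each exact combination of categories and return the n most common."""
    combos = Counter(frozenset(categories) for categories in patients)
    return combos.most_common(n)

# Hypothetical toy cluster: each set lists the categories present in one patient.
cluster = [
    {"Pulmonary", "Constitutional"},
    {"Pulmonary", "Constitutional"},
    {"Pulmonary"},
    {"Neuropsychiatric", "Constitutional"},
    {"Pulmonary", "Constitutional"},
]
top = top_combinations(cluster, n=2)
```

Combinations with equal counts are returned in first-seen order, so a tie at the cut-off has to be handled explicitly if all tied combinations should be reported.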
Fig. 5
Summary of phenotypic feature distribution in the six clusters. A) The HPO terms corresponding to different phenotypic features are grouped in HPO categories shown on the left. Categories are colour-coded and are in the same order as shown in panel B. Laboratory abnormalities are grouped together because of their association with severe COVID-19 (see text). HPO terms are shown if at least 20% of patients in at least one cluster had the corresponding phenotypic feature and if Pearson's chi-squared test found a significant difference (p < 0.00001) in the phenotypic feature distribution. B) Post hoc analysis of categories of long COVID HPO phenotypic features by cluster. For each category of long COVID HPO phenotypic feature, we performed a post hoc analysis (pairwise chi-squared test with Bonferroni correction) to assess differences between clusters. For each category, the percentage of patients from each cluster who have at least one HPO term in the given category is shown, and red and blue cells mark the CLD group having the highest and lowest proportion, respectively. Letters a–e indicate CLD groups between which differences for the given category are statistically significant according to post hoc analysis (Methods).
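
The post hoc step named in panel B (pairwise chi-squared tests with Bonferroni correction) can be sketched as below. This is a minimal version under stated assumptions: each pairwise comparison is a 2 × 2 Pearson test without continuity correction, the p-value uses the closed form for one degree of freedom, and the per-cluster counts are invented for illustration. The function names are hypothetical, and the code assumes no row or column of a table sums to zero.

```python
import math
from itertools import combinations

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

def pairwise_bonferroni(counts):
    """counts: {cluster: (n_with_feature, n_total)} -> {(c1, c2): adjusted p}."""
    pairs = list(combinations(sorted(counts), 2))
    adjusted = {}
    for c1, c2 in pairs:
        w1, t1 = counts[c1]
        w2, t2 = counts[c2]
        _, p = chi2_2x2(w1, t1 - w1, w2, t2 - w2)
        # Bonferroni: multiply by the number of comparisons, capped at 1.
        adjusted[(c1, c2)] = min(1.0, p * len(pairs))
    return adjusted

# Hypothetical counts of patients with at least one term in a category.
counts = {1: (80, 100), 2: (20, 100), 3: (78, 100)}
adjusted = pairwise_bonferroni(counts)
```

With these invented counts, clusters 1 and 2 differ clearly after correction while clusters 1 and 3 do not, which is the kind of distinction the letter groups in panel B encode.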
