EClinicalMedicine. 2023 Sep 14:64:102210.

doi: 10.1016/j.eclinm.2023.102210. eCollection 2023 Oct.

Characterization of long COVID temporal sub-phenotypes by distributed representation learning from electronic health record data: a cohort study

Arianna Dagliati¹, Zachary H Strasser², Zahra Shakeri Hossein Abad³, Jeffrey G Klann², Kavishwar B Wagholikar², Rebecca Mesa¹, Shyam Visweswaran⁴, Michele Morris⁴, Yuan Luo⁵, Darren W Henderson⁶, Malarkodi Jebathilagam Samayamuthu⁴, Bryce W Q Tan⁷, Guillaume Verdy⁸, Gilbert S Omenn⁹, Zongqi Xia¹⁰, Riccardo Bellazzi¹; Consortium for Clinical Characterization of COVID-19 by EHR (4CE); Shawn N Murphy¹¹, John H Holmes¹², Hossein Estiri²

Collaborators

James R Aaron, Giuseppe Agapito, Adem Albayrak, Giuseppe Albi, Mario Alessiani, Anna Alloni, Danilo F Amendola, François Angoulvant, Li Llj Anthony, Bruce J Aronow, Fatima Ashraf, Andrew Atz, Paul Avillach, Paula S Azevedo, James Balshi, Brett K Beaulieu-Jones, Douglas S Bell, Antonio Bellasi, Riccardo Bellazzi, Vincent Benoit, Michele Beraghi, José Luis Bernal-Sobrino, Mélodie Bernaux, Romain Bey, Surbhi Bhatnagar, Alvar Blanco-Martínez, Clara-Lea Bonzel, John Booth, Silvano Bosari, Florence T Bourgeois, Robert L Bradford, Gabriel A Brat, Stéphane Bréant, Nicholas W Brown, Raffaele Bruno, William A Bryant, Mauro Bucalo, Emily Bucholz, Anita Burgun, Tianxi Cai, Mario Cannataro, Aldo Carmona, Charlotte Caucheteux, Julien Champ, Jin Chen, Krista Y Chen, Luca Chiovato, Lorenzo Chiudinelli, Kelly Cho, James J Cimino, Tiago K Colicchio, Sylvie Cormont, Sébastien Cossin, Jean B Craig, Juan Luis Cruz-Bermúdez, Jaime Cruz-Rojo, Arianna Dagliati, Mohamad Daniar, Christel Daniel, Priyam Das, Batsal Devkota, Audrey Dionne, Rui Duan, Julien Dubiel, Scott L DuVall, Loic Esteve, Hossein Estiri, Shirley Fan, Robert W Follett, Thomas Ganslandt, Noelia García- Barrio, Lana X Garmire, Nils Gehlenborg, Emily J Getzen, Alon Geva, Tobias Gradinger, Alexandre Gramfort, Romain Griffier, Nicolas Griffon, Olivier Grisel, Alba Gutiérrez-Sacristán, Larry Han, David A Hanauer, Christian Haverkamp, Derek Y Hazard, Bing He, Darren W Henderson, Martin Hilka, Yuk-Lam Ho, John H Holmes, Chuan Hong, Kenneth M Huling, Meghan R Hutch, Richard W Issitt, Anne Sophie Jannot, Vianney Jouhet, Ramakanth Kavuluru, Mark S Keller, Chris J Kennedy, Daniel A Key, Katie Kirchoff, Jeffrey G Klann, Isaac S Kohane, Ian D Krantz, Detlef Kraska, Ashok K Krishnamurthy, Sehi L'Yi, Trang T Le, Judith Leblanc, Guillaume Lemaitre, Leslie Lenert, Damien Leprovost, Molei Liu, Ne Hooi Will Loh, Qi Long, Sara Lozano-Zahonero, Yuan Luo, Kristine E Lynch, Sadiqa Mahmood, Sarah E Maidlow, Adeline Makoudjou, Alberto Malovini, 
Kenneth D Mandl, Chengsheng Mao, Anupama Maram, Patricia Martel, Marcelo R Martins, Jayson S Marwaha, Aaron J Masino, Maria Mazzitelli, Arthur Mensch, Marianna Milano, Marcos F Minicucci, Bertrand Moal, Taha Mohseni Ahooyi, Jason H Moore, Cinta Moraleda, Jeffrey S Morris, Michele Morris, Karyn L Moshal, Sajad Mousavi, Danielle L Mowery, Douglas A Murad, Shawn N Murphy, Thomas P Naughton, Carlos Tadeu Breda Neto, Antoine Neuraz, Jane Newburger, Kee Yuan Ngiam, Wanjiku Fm Njoroge, James B Norman, Jihad Obeid, Marina P Okoshi, Karen L Olson, Gilbert S Omenn, Nina Orlova, Brian D Ostasiewski, Nathan P Palmer, Nicolas Paris, Lav P Patel, Miguel Pedrera-Jiménez, Emily R Pfaff, Ashley C Pfaff, Danielle Pillion, Sara Pizzimenti, Hans U Prokosch, Robson A Prudente, Andrea Prunotto, Víctor Quirós-González, Rachel B Ramoni, Maryna Raskin, Siegbert Rieg, Gustavo Roig-Domínguez, Pablo Rojo, Paula Rubio-Mayo, Paolo Sacchi, Carlos Sáez, Elisa Salamanca, Malarkodi Jebathilagam Samayamuthu, L Nelson Sanchez-Pinto, Arnaud Sandrin, Nandhini Santhanam, Janaina Cc Santos, Fernando J Sanz Vidorreta, Maria Savino, Emily R Schriver, Petra Schubert, Juergen Schuettler, Luigia Scudeller, Neil J Sebire, Pablo Serrano-Balazote, Patricia Serre, Arnaud Serret-Larmande, Mohsin Shah, Zahra Shakeri Hossein Abad, Domenick Silvio, Piotr Sliz, Jiyeon Son, Charles Sonday, Andrew M South, Anastasia Spiridou, Zachary H Strasser, Amelia Lm Tan, Bryce Wq Tan, Byorn Wl Tan, Suzana E Tanni, Deanne M Taylor, Ana I Terriza-Torres, Valentina Tibollo, Patric Tippmann, Emma Ms Toh, Carlo Torti, Enrico M Trecarichi, Yi-Ju Tseng, Andrew K Vallejos, Gael Varoquaux, Margaret E Vella, Guillaume Verdy, Jill-Jênn Vie, Shyam Visweswaran, Michele Vitacca, Kavishwar B Wagholikar, Lemuel R Waitman, Xuan Wang, Demian Wassermann, Griffin M Weber, Martin Wolkewitz, Scott Wong, Zongqi Xia, Xin Xiong, Ye Ye, Nadir Yehya, William Yuan, Alberto Zambelli, Harrison G Zhang, Daniela Zo Ller, Valentina Zuccaro, Chiara Zucco

Affiliations

¹ Department of Electrical Computer and Biomedical Engineering, University of Pavia, Pavia, Italy.
² Department of Medicine, Massachusetts General Hospital, Boston, United States.
³ University of Toronto, Dalla Lana School of Public Health, Toronto, Canada.
⁴ Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, United States.
⁵ Department of Preventive Medicine, Northwestern University, Chicago, United States.
⁶ University of Kentucky, Center for Clinical and Translational Science, Lexington, United States.
⁷ National University Hospital, Singapore Department of Medicine, Singapore.
⁸ Bordeaux University Hospital, IAM Unit, Bordeaux, France.
⁹ University of Michigan, Department of Computational Medicine and Bioinformatics, Internal Medicine, Human Genetics, and School of Public Health, Ann Arbor, United States.
¹⁰ University of Pittsburgh Department of Neurology, Pittsburgh, United States.
¹¹ Department of Neurology, Massachusetts General Hospital, Boston, United States.
¹² University of Pennsylvania Perelman School of Medicine, Department of Biostatistics, Epidemiology, and Informatics, Institute for Biomedical Informatics, Philadelphia, United States.

PMID: 37745021
PMCID: PMC10511779
DOI: 10.1016/j.eclinm.2023.102210

Arianna Dagliati et al. EClinicalMedicine. 2023.

Abstract

Background: Characterizing Post-Acute Sequelae of COVID (SARS-CoV-2 Infection), or PASC, has been challenging due to the multitude of sub-phenotypes, temporal attributes, and definitions. Scalable characterization of PASC sub-phenotypes can enhance screening capacities, disease management, and treatment planning.

Methods: We conducted a retrospective multi-centre observational cohort study, leveraging longitudinal electronic health record (EHR) data of 30,422 patients from three healthcare systems in the Consortium for the Clinical Characterization of COVID-19 by EHR (4CE). From the total cohort, we applied a deductive approach on 12,424 individuals with follow-up data and developed a distributed representation learning process for providing augmented definitions for PASC sub-phenotypes.

Findings: Our framework characterized seven PASC sub-phenotypes. We estimated that, on average, 15.7% of the hospitalized COVID-19 patients were likely to suffer from at least one PASC symptom and almost 5.98% had multiple symptoms. Joint pain and dyspnea had the highest average prevalence, at 5.45% and 4.53%, respectively.

Interpretation: We provided a scalable framework to every participating healthcare system for estimating PASC sub-phenotypes prevalence and temporal attributes, thus developing a unified model that characterizes augmented sub-phenotypes across the different systems.

Funding: Authors are supported by the National Institute of Allergy and Infectious Diseases, National Institute on Aging, National Center for Advancing Translational Sciences, National Medical Research Council, National Institute of Neurological Disorders and Stroke, European Union, and National Institutes of Health.

Keywords: COVID-19; Electronic health records; PASC; Post-acute sequelae of SARS-CoV-2; SARS-CoV-2.

Conflict of interest statement

Riccardo Bellazzi is a shareholder of Biomeris s.r.l. Gilbert Omenn holds patents for U.S. Application No. 16/169,048 Filed: 24-October-2018 and License 2023–0632 with Radial Therapeutics, Inc.; Invention Disclosure No. 2022-382.

Figures

**Fig. 1**
**Overview of the Deductive Study Pipeline in Phase 1 of the Study.** MLHO leverages the informatics infrastructures developed by the 4CE for a distributed study of PASC sub-phenotypes in a deductive data-driven pipeline, in which we augmented clinical knowledge using an iterative approach.

**Fig. 2**
**The data-driven process for enriching initial PASC sub-phenotype definitions.** Leveraging the initial PASC sub-phenotype definitions, we developed a distributed representation learning process that identifies additional EHR data elements (i.e., encounter records) associated with a patient having a diagnosis code for a PASC problem 90 days or longer after COVID-19 hospitalization. The process included the following steps: 1. The 4CE data model is transformed to MLHO input; 2. EHR data are time stamped based on the index date into pre-COVID, acute + phase, and post-COVID; 3. Using the initial data elements, we identified potential patients with specific symptoms after a SARS-CoV-2 infection; 4. The initial (core) features are removed and MLHO is applied to identify data elements during the post-COVID and acute + phase that can predict the label for a given phenotype; 5. Step 4 is iterated 5 times to compute the MLHO confidence score, which quantifies the number of times a feature is identified as a predictor for a prediction/classification task.
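
The confidence score in step 5 can be illustrated with a minimal sketch. It is not the MLHO implementation: `select_features` is a hypothetical stand-in for one MLHO run, `toy_selector` and the records are invented for illustration, and resampling the data between iterations is an assumption, since the caption only states that step 4 is repeated 5 times.

```python
import random
from collections import Counter


def confidence_scores(select_features, records, n_iter=5, seed=0):
    """Count how many of n_iter runs pick each feature as a predictor.

    select_features is a hypothetical stand-in for one MLHO run: it takes a
    list of records and returns the features it selected.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_iter):
        # Assumption: each iteration sees a resample (with replacement).
        sample = [rng.choice(records) for _ in records]
        # set() ensures a feature is counted at most once per iteration.
        counts.update(set(select_features(sample)))
    return dict(counts)


def toy_selector(sample, min_share=0.5):
    """Toy rule (not MLHO): keep features seen in >= half of labelled patients."""
    positives = [features for features, label in sample if label == 1]
    if not positives:
        return set()
    seen = Counter(f for features in positives for f in features)
    return {f for f, c in seen.items() if c / len(positives) >= min_share}


# Invented records: (set of EHR features, label for the sub-phenotype).
records = [
    ({"cough", "fatigue"}, 1),
    ({"cough"}, 1),
    ({"fatigue", "rash"}, 0),
    ({"cough", "rash"}, 1),
]
scores = confidence_scores(toy_selector, records)
```

Each value in `scores` lies between 1 and `n_iter`; features never selected are simply absent.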

**Fig. 3**
**Illustration of the Louvain method used to cluster features.** This figure shows the graph structure used to cluster core and MLHO features. Nodes annotated with f represent the features, and t nodes represent the time. The weight of each connection represents the percentage of patients diagnosed with the corresponding feature f at time t. In this example, clusters are separated using different colors.
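
To make the graph structure concrete, the sketch below builds a small weighted feature-time graph and computes the modularity of a given partition, which is the quantity the Louvain method tries to increase. It does not run Louvain itself, and the node names, weights, and partition are invented for illustration.

```python
from collections import defaultdict


def modularity(edges, community):
    """Weighted modularity of a partition of an undirected graph.

    edges: dict mapping (feature_node, time_node) -> weight
           (here: percentage of patients with feature f at time t)
    community: dict mapping every node -> community label
    """
    total = sum(edges.values())
    inside = defaultdict(float)  # edge weight with both ends in the community
    degree = defaultdict(float)  # summed weighted degree of the community
    for (u, v), w in edges.items():
        degree[community[u]] += w
        degree[community[v]] += w
        if community[u] == community[v]:
            inside[community[u]] += w
    return sum(inside[c] / total - (degree[c] / (2 * total)) ** 2 for c in degree)


# Invented example: three features linked to three time windows.
edges = {("f1", "t1"): 40, ("f2", "t1"): 30, ("f3", "t2"): 20, ("f3", "t3"): 10}
two_clusters = {"f1": "A", "f2": "A", "t1": "A", "f3": "B", "t2": "B", "t3": "B"}
one_cluster = {node: "A" for node in two_clusters}

q_split = modularity(edges, two_clusters)
q_single = modularity(edges, one_cluster)
```

Here the two-cluster partition scores higher than the single-cluster one, so a modularity-maximizing method would prefer it.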

**Fig. 4**
**Schematic construction of the augmented definition for a PASC sub-phenotype.** An augmented definition for a PASC sub-phenotype encompassed time-stamped features from patients' EHRs. Core features (initial EHR markers) have an a priori temporal definition of being recorded for the first time 90 days or longer after the hospitalization. MLHO features (new EHR markers) can be observed any time post hospitalization, but are time stamped to capture the temporal relationships with the core features.

**Fig. 5**
**Prevalence estimates for the overall PASC phenotype and specific PASC sub-phenotypes in the hospitalized population.** Each plot reports on the horizontal axes the prevalence values as percentages of subjects identified by CORE and/or MLHO features over the total of COVID-19 hospitalized subjects. Each row represents a site via lollipop plots, reporting lower limit (green, col1), upper limit (red, col2) and average (gray, col3) values. Vertical lines represent average prevalence across hospitals, using as weight the number of subjects enrolled in the analyses by each site.
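
The cross-site average described in this caption is a mean weighted by the number of subjects per site. A minimal sketch follows; the site values are invented and are not results from the study.

```python
def weighted_prevalence(sites):
    """Average prevalence across sites, weighted by subjects per site.

    sites: list of (prevalence_percent, n_subjects) tuples.
    """
    total = sum(n for _, n in sites)
    if total == 0:
        raise ValueError("at least one site must have subjects")
    return sum(p * n for p, n in sites) / total


# Invented example: a small site at 10% and a larger site at 20%.
example_sites = [(10.0, 100), (20.0, 300)]
average = weighted_prevalence(example_sites)  # (10*100 + 20*300) / 400 = 17.5
```

The larger site pulls the average toward its own prevalence, which is why the weighted value differs from the unweighted mean of 15%.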

**Fig. 6**
**PASC sub-phenotype features temporal distribution.** For each PASC sub-phenotype we report the number of features in each 30-day time window. The plot, which reports days on the y-axis, illustrates kernel densities on the right of each PASC sub-phenotype, mean and standard deviation (the points and the intervals over the violin plots), and jittered raw data points on the left. Temporal distributions of PASC features were compared by pairwise Wilcoxon test, with a Bonferroni correction; p-values for significant results (<0.05) are reported on the vertical lines that connect different PASC sub-phenotypes.
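
The Bonferroni correction mentioned in this caption multiplies each raw p-value by the number of comparisons and caps the result at 1. The sketch below shows only that adjustment step; the pair labels and raw p-values are invented, and the Wilcoxon tests that would produce them are not reproduced here.

```python
def bonferroni(p_values):
    """Bonferroni-adjust a dict of raw p-values, one per pairwise comparison."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}


# Invented raw p-values for three pairwise comparisons.
raw = {"A vs B": 0.01, "A vs C": 0.02, "B vs C": 0.5}
adjusted = bonferroni(raw)

# Keep the comparisons that remain below 0.05 after adjustment.
significant = [pair for pair, p in adjusted.items() if p < 0.05]
```

In this example only "A vs B" stays below the 0.05 threshold after correction.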

**Fig. 7**
**Clustered presentation and temporal distribution of the core and MLHO features.** The clusters are defined using Louvain clustering. Each node of the clustering graph is presented as ***(f,t,p)***, where f represents the feature, t represents the time and p shows the percentage of patients. Blank squares represent missing values and the gradient-colored dots show the value of p. The diamonds next to the features on the y-axis define the type of each feature (i.e., core vs. MLHO) and the sparklines on the right side present the overall temporal distribution of each feature.
