Joining Datasets Without Identifiers: Probabilistic Linkage of Virtual Pediatric Systems and PEDSnet
- PMID: 32511201
- DOI: 10.1097/PCC.0000000000002380
Abstract
Objectives: To 1) probabilistically link two important pediatric data sources, Virtual Pediatric Systems and PEDSnet, 2) evaluate linkage accuracy overall and in patients with severe sepsis or septic shock, and 3) identify variables important to linkage accuracy.
Design: Retrospective linkage of prospectively collected datasets from Virtual Pediatric Systems, Inc (Los Angeles, CA) and the PEDSnet consortium.
Setting: Single-center academic PICU.
Patients: All PICU encounters between January 1, 2012, and December 31, 2017, that were deterministically matched between the two datasets.
Interventions: None.
Measurements and main results: We abstracted records corresponding to PICU encounters from Virtual Pediatric Systems and PEDSnet and probabilistically linked them using 44 features shared by the two datasets. We generated a gold standard deterministic linkage using protected health information elements, which were then removed from the datasets. We then calculated log-likelihood ratios for all candidate pairs of subjects and selected optimal pairs in a two-stage algorithm. A total of 22,051 gold standard PICU encounter pairs were identified over the study period. The optimal linkage model demonstrated excellent discrimination (area under the receiver operating characteristic curve > 0.99); 19,801 cases (89.9%) were matched with 13 false positives. Adding two protected health information dates (admission month, birth day-of-year) increased the number of cases matched to 20,189 (91.6%), with three false positives. Restricting to patients with a Virtual Pediatric Systems diagnosis of severe sepsis or septic shock (n = 1,340 [6.1%]) matched 1,250 cases (93.2%) with zero false positives. A greater number of laboratory values present in the first 12 hours of admission significantly increased log-likelihood ratios, suggesting stronger candidate pair matching.
Conclusions: We demonstrated the use of probabilistic linkage to accurately join two complementary pediatric critical care datasets at a single academic PICU in the absence of protected health information. Combining datasets with curated diagnoses and granular measurements can validate patient acuity metrics and facilitate multicenter machine learning algorithms. We anticipate these methods will generalize to other common PICU diagnoses.
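The candidate-pair scoring described under Measurements and main results can be illustrated with a minimal sketch. The abstract does not specify the scoring model or the two-stage selection, so this example assumes a Fellegi-Sunter style sum of per-feature log weights and substitutes a simple greedy one-to-one selection for the authors' algorithm. The feature names (`sex`, `admit_weekday`, `first_lactate`), the agreement probabilities, the threshold, and the function names are all hypothetical.

```python
import math

# Hypothetical agreement probabilities (illustrative values, not from the study):
# m = P(feature agrees | same encounter), u = P(feature agrees | different encounters)
M_U = {
    "sex": (0.99, 0.50),
    "admit_weekday": (0.95, 0.14),
    "first_lactate": (0.90, 0.05),
}


def log_likelihood_ratio(rec_a, rec_b, m_u=M_U):
    """Sum per-feature log2 weights, skipping features missing in either record."""
    total = 0.0
    for feature, (m, u) in m_u.items():
        a, b = rec_a.get(feature), rec_b.get(feature)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def link(records_a, records_b, threshold=0.0):
    """Score every pair, then greedily keep the best one-to-one pairs above threshold."""
    scored = [
        (log_likelihood_ratio(ra, rb), ia, ib)
        for ia, ra in records_a.items()
        for ib, rb in records_b.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    used_a, used_b, links = set(), set(), {}
    for score, ia, ib in scored:
        if score <= threshold:
            break  # remaining pairs score too low to accept
        if ia in used_a or ib in used_b:
            continue  # each record may be linked at most once
        links[ia] = ib
        used_a.add(ia)
        used_b.add(ib)
    return links


# Toy de-identified records keyed by hypothetical source-specific identifiers.
vps = {
    "v1": {"sex": "F", "admit_weekday": 2, "first_lactate": 1.8},
    "v2": {"sex": "M", "admit_weekday": 5, "first_lactate": None},
}
pedsnet = {
    "p1": {"sex": "M", "admit_weekday": 5, "first_lactate": 3.1},
    "p2": {"sex": "F", "admit_weekday": 2, "first_lactate": 1.8},
}

links = link(vps, pedsnet)
print(links)
```

In this toy data, the pair with a laboratory value present on both sides scores higher than the pair where it is missing, which is consistent in spirit with the reported effect of early laboratory values on log-likelihood ratios.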
Comment in
- Fuzzy Matchmaking: How Two Records Became One. Pediatr Crit Care Med. 2020 Sep;21(9):848-849. doi: 10.1097/PCC.0000000000002392. PMID: 32890090.
