J Med Internet Res. 2019 Apr 9;21(4):e13043.

doi: 10.2196/13043.

Health Care and Precision Medicine Research: Analysis of a Scalable Data Science Platform

Jacob McPadden^#¹, Thomas JS Durant^#²,³, Dustin R Bunch², Andreas Coppi³, Nathaniel Price⁴, Kris Rodgerson⁴, Charles J Torre Jr⁴, William Byron⁴, Allen L Hsiao¹, Harlan M Krumholz³,⁵,⁶, Wade L Schulz²,³

Affiliations

¹ Department of Pediatrics, Yale University School of Medicine, New Haven, CT, United States.
² Department of Laboratory Medicine, Yale University School of Medicine, New Haven, CT, United States.
³ Center for Outcomes Research and Evaluation, Yale-New Haven Hospital, New Haven, CT, United States.
⁴ Yale New Haven Health Information Technology Services, New Haven, CT, United States.
⁵ Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, United States.
⁶ Department of Health Policy and Management, Yale School of Public Health, New Haven, CT, United States.

^# Contributed equally.

PMID: 30964441
PMCID: PMC6477571
DOI: 10.2196/13043

Jacob McPadden et al. J Med Internet Res. 2019.

Abstract

Background: Health care data are increasing in volume and complexity. Storing and analyzing these data to implement precision medicine initiatives and data-driven research has exceeded the capabilities of traditional computer systems. Modern big data platforms must be adapted to the specific demands of health care and designed for scalability and growth.

Objective: The objectives of our study were to (1) demonstrate the implementation of a data science platform built on open source technology within a large, academic health care system and (2) describe 2 computational health care applications built on such a platform.

Methods: We deployed a data science platform based on several open source technologies to support real-time, big data workloads. We developed data-acquisition workflows for Apache Storm and NiFi in Java and Python to capture patient monitoring and laboratory data for downstream analytics.

Results: Emerging data management approaches, along with open source technologies such as Hadoop, can be used to create integrated data lakes to store large, real-time datasets. This infrastructure also provides a robust analytics platform where health care and biomedical research data can be analyzed in near real time for precision medicine and computational health care use cases.

Conclusions: The implementation and use of integrated data science platforms offer organizations the opportunity to combine traditional datasets, including data from the electronic health record, with emerging big data sources, such as continuous patient monitoring and real-time laboratory results. These platforms can enable cost-effective and scalable analytics for the information that will be key to the delivery of precision medicine initiatives. Organizations that can take advantage of the technical advances found in data science platforms will have the opportunity to provide comprehensive access to health care data for computational health care and precision medicine research.

Keywords: big data; computational health care; data science; medical informatics computing; monitoring, physiologic.

©Jacob McPadden, Thomas JS Durant, Dustin R Bunch, Andreas Coppi, Nathaniel Price, Kris Rodgerson, Charles J Torre Jr, William Byron, Allen L Hsiao, Harlan M Krumholz, Wade L Schulz. Originally published in the Journal of Medical Internet Research (http://www.jmir.org), 09.04.2019.

Conflict of interest statement

Conflicts of Interest: HMK was a recipient of a research grant, through Yale, from Medtronic and the US Food and Drug Administration to develop methods for postmarket surveillance of medical devices; is a recipient of research agreements with Medtronic and Johnson & Johnson (Janssen), through Yale, to develop methods of clinical trial data sharing; works under contract with the US Centers for Medicare & Medicaid Services to develop and maintain performance measures that are publicly reported; chairs a Cardiac Scientific Advisory Board for UnitedHealth Group Inc; is a participant and participant representative of the IBM Watson Health Life Sciences Board; is a member of the Advisory Board for Element Science, Inc, and the Physician Advisory Board for Aetna Inc; and is the founder of Hugo, a personal health information platform. WLS is a consultant for Hugo, a personal health information platform.

Figures

**Figure 1**
Baikal platform architecture. Cluster services are monitored, deployed, and provisioned by Ambari management console. Workflow management and configuration synchronization are handled by Zookeeper and Oozie. Data storage frameworks include Hadoop Distributed File System (HDFS) and a nonrelational database: Elasticsearch. Kafka messaging queues are used for incoming data with subsequent ingest and processing handled by Storm, Sqoop, and NiFi. Analytics can be performed by Spark and Hive. Kerberos and Ranger are used to secure cluster applications. Lastly, Docker Swarm is used to deploy custom applications that can be run within the data science platform. YARN: Yet Another Resource Negotiator.

**Figure 2**
System architecture for continuous patient monitoring. Multiple, increasing sources of clinical data (A) acquire and transmit the data to aggregation servers, which then forward Health-Level 7 (HL7) messages to an emissary service (B), where data are normalized and securely forwarded in standardized JSON format to the Baikal system (C) for denormalization, processing, and storage in the Hadoop Distributed File System (HDFS). Traditional historic databases (D) are individually prepared for ingestion in the Baikal system and storage in HDFS. The resulting data lake allows for integrated, distributed analytics by end users.
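The denormalization step named in this caption can be illustrated with a minimal sketch. It assumes a hypothetical nested JSON message layout (`patient_id`, `device`, `timestamp`, and an `observations` list); none of these field names come from the article, and the actual Baikal message schema is not described here.

```python
import json


def denormalize(message):
    """Flatten one nested monitoring message into one flat row per observation.

    The field names used here are hypothetical; the real message schema
    is not given in the source.
    """
    # Fields shared by every observation in the message (everything but the list).
    shared = {key: value for key, value in message.items() if key != "observations"}
    # Copy the shared fields onto each observation so every row stands alone.
    return [dict(shared, **obs) for obs in message.get("observations", [])]


# Synthetic example message, standing in for JSON forwarded by an emissary service.
raw = json.dumps({
    "patient_id": "P001",
    "device": "bedside-monitor",
    "timestamp": "2019-01-01T00:00:00Z",
    "observations": [
        {"code": "HR", "value": 72, "unit": "bpm"},
        {"code": "SpO2", "value": 98, "unit": "%"},
    ],
})

rows = denormalize(json.loads(raw))
for row in rows:
    print(row)
```

Flat rows of this kind are easier to store in columnar or row-oriented files and to query without joins, at the cost of repeating the shared fields in every row.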

**Figure 3**
Comparison of storage and read/write efficiency. Avro increases storage space and write time modestly while significantly reducing read time. The addition of Snappy compression increases write time minimally, while significantly decreasing storage space and maintaining minimal read time. The resulting combination optimizes for single archival write with multiple read usage. CSV: comma-separated values. Error bars represent standard error.

**Figure 4**
System architecture for laboratory data monitoring. Health-Level 7 (HL7) observations and results messages generated by laboratory information system and laboratory middleware systems are received by the clinical integration engine Cloverleaf (A). HL7 messages are received and validated by a custom emissary service (B) and mapped to JSON documents, which are submitted to a Kafka message queue for downstream processing (C). Custom Python (version 2.7) scripts are executed in NiFi to denormalize messages and calculate quality improvement metrics. Raw HL7 messages are stored in a Hadoop Distributed File System (HDFS). Processed messages and quality improvement metrics are routed to Elasticsearch (D) for real-time analysis and Kibana (E) for visualization.
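The mapping from HL7 messages to JSON documents described in this caption might look roughly like the sketch below. It is not the authors' emissary service: the function name `hl7_to_json`, the output field names, and the sample message are all illustrative assumptions, and it is written for Python 3 rather than the Python 2.7 scripts mentioned above. It relies only on the standard HL7 version 2 layout, in which segments are separated by carriage returns, fields by `|`, and components by `^`.

```python
import json


def hl7_to_json(message):
    """Map a pipe-delimited HL7 v2 result message to a small JSON document.

    Hypothetical helper: only the PID and OBX segments are read, and the
    output field names are chosen for illustration.
    """
    doc = {"patient_id": None, "results": []}
    for segment in message.strip().split("\r"):
        fields = segment.split("|")
        if fields[0] == "PID":
            # PID-3 holds the patient identifier; keep its first component.
            doc["patient_id"] = fields[3].split("^")[0]
        elif fields[0] == "OBX":
            # OBX-3 is the observation identifier (code^name^coding system),
            # OBX-5 the observed value, OBX-6 the units.
            code, name = fields[3].split("^")[:2]
            doc["results"].append({
                "code": code,
                "name": name,
                "value": fields[5],
                "units": fields[6],
            })
    return json.dumps(doc)


# Synthetic message for illustration only; not taken from the study data.
sample = "\r".join([
    "MSH|^~\\&|LIS|LAB|EMISSARY|HOSP|20190101120000||ORU^R01|MSG0001|P|2.3",
    "PID|1||12345^^^HOSP^MR",
    "OBX|1|NM|2345-7^Glucose^LN||95|mg/dL|70-99|N|||F",
])

document = json.loads(hl7_to_json(sample))
print(document)
```

A production service would also validate the message, handle escape sequences and repeated fields, and publish the resulting document to a message queue; those steps are omitted here.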
