Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2019 Apr 9;21(4):e13043.
doi: 10.2196/13043.

Health Care and Precision Medicine Research: Analysis of a Scalable Data Science Platform

Affiliations

Health Care and Precision Medicine Research: Analysis of a Scalable Data Science Platform

Jacob McPadden et al. J Med Internet Res. .

Abstract

Background: Health care data are increasing in volume and complexity. Storing and analyzing these data to implement precision medicine initiatives and data-driven research has exceeded the capabilities of traditional computer systems. Modern big data platforms must be adapted to the specific demands of health care and designed for scalability and growth.

Objective: The objectives of our study were to (1) demonstrate the implementation of a data science platform built on open source technology within a large, academic health care system and (2) describe 2 computational health care applications built on such a platform.

Methods: We deployed a data science platform based on several open source technologies to support real-time, big data workloads. We developed data-acquisition workflows for Apache Storm and NiFi in Java and Python to capture patient monitoring and laboratory data for downstream analytics.

Results: Emerging data management approaches, along with open source technologies such as Hadoop, can be used to create integrated data lakes to store large, real-time datasets. This infrastructure also provides a robust analytics platform where health care and biomedical research data can be analyzed in near real time for precision medicine and computational health care use cases.

Conclusions: The implementation and use of integrated data science platforms offer organizations the opportunity to combine traditional datasets, including data from the electronic health record, with emerging big data sources, such as continuous patient monitoring and real-time laboratory results. These platforms can enable cost-effective and scalable analytics for the information that will be key to the delivery of precision medicine initiatives. Organizations that can take advantage of the technical advances found in data science platforms will have the opportunity to provide comprehensive access to health care data for computational health care and precision medicine research.

Keywords: big data; computational health care; data science; medical informatics computing; monitoring, physiologic.

PubMed Disclaimer

Conflict of interest statement

Conflicts of Interest: HMK was a recipient of a research grant, through Yale, from Medtronic and the US Food and Drug Administration to develop methods for postmarket surveillance of medical devices; is a recipient of research agreements with Medtronic and Johnson & Johnson (Janssen), through Yale, to develop methods of clinical trial data sharing; works under contract with the US Centers for Medicare & Medicaid Services to develop and maintain performance measures that are publicly reported; chairs a Cardiac Scientific Advisory Board for UnitedHealth Group Inc; is a participant and participant representative of the IBM Watson Health Life Sciences Board; is a member of the Advisory Board for Element Science, Inc, and the Physician Advisory Board for Aetna Inc; and is the founder of Hugo, a personal health information platform. WLS is a consultant for Hugo, a personal health information platform.

Figures

Figure 1
Figure 1
Baikal platform architecture. Cluster services are monitored, deployed, and provisioned by Ambari management console. Workflow management and configuration synchronization are handled by Zookeeper and Oozie. Data storage frameworks include Hadoop Distributed File System (HDFS) and a nonrelational database: Elasticsearch. Kafka messaging queues are used for incoming data with subsequent ingest and processing handled by Storm, Sqoop, and NiFi. Analytics can be performed by Spark and Hive. Kerberos and Ranger are used to secure cluster applications. Lastly, Docker Swarm is used to deploy custom applications that can be run within the data science platform. YARN: Yet Another Resource Negotiator.
Figure 2
Figure 2
System architecture for continuous patient monitoring. Multiple, increasing sources of clinical data (A) acquire and transmit the data to aggregation servers, which then forward Health-Level 7 (HL7) messages to an emissary service (B), where data are normalized and securely forwarded in standardized JSON format to the Baikal system (C) for denormalization, processing, and storage in the Hadoop Distributed File System (HDFS). Traditional historic databases (D) are individually prepared for ingestion in the Baikal system and storage in HDFS. The resulting data lake allows for integrated, distributed analytics by end users.
Figure 3
Figure 3
Comparison of storage and read/write efficiency. Avro increases storage space and write time modestly while significantly reducing read time. The addition of Snappy compression increases write time minimally, while significantly decreasing storage space and maintaining minimal read time. The resulting combination optimizes for single archival write with multiple read usage. CSV: comma-separated values. Error bars represent standard error.
Figure 4
Figure 4
System architecture for laboratory data monitoring. Health-Level 7 (HL7) observations and results messages generated by laboratory information system and laboratory middleware systems are received by the clinical integration engine Cloverleaf (A). HL7 messages are received and validated by a custom emissary service (B) and mapped to JSON documents, which are submitted to a Kafka message queue for downstream processing (C). Custom Python (version 2.7) scripts are executed in NiFi to denormalize messages and calculate quality improvement metrics. Raw HL7 messages are stored in a Hadoop Distributed File System (HDFS). Processed messages and quality improvement metrics are routed to Elasticsearch (D) for real-time analysis and Kibana (E) for visualization.

References

    1. EMC . The digital universe driving data growth in healthcare. Hopkinton, MA: Dell Inc; 2014. [2018-10-03]. https://www.emc.com/analyst-report/digital-universe-healthcare-vertical-... .
    1. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, Hung Byers A. Big data: the next frontier for innovation, competition, and productivity. New York, NY: McKinsey Global Institute; 2011. Jun, [2019-02-04]. https://www.mckinsey.com/~/media/McKinsey/Business%20Functions/McKinsey%... .
    1. Collins FS, Varmus H. A new initiative on precision medicine. N Engl J Med. 2015 Feb 26;372(9):793–5. doi: 10.1056/NEJMp1500523. - DOI - PMC - PubMed
    1. Jameson JL, Longo DL. Precision medicine--personalized, problematic, and promising. N Engl J Med. 2015 Jun 04;372(23):2229–34. doi: 10.1056/NEJMsb1503104. - DOI - PubMed
    1. Gligorijević V, Malod-Dognin N, Pržulj N. Integrative methods for analyzing big data in precision medicine. Proteomics. 2016 Mar;16(5):741–58. doi: 10.1002/pmic.201500396. - DOI - PubMed

MeSH terms