2024 Jul 30:11:1393123.
doi: 10.3389/fmed.2024.1393123. eCollection 2024.

A scalable and transparent data pipeline for AI-enabled health data ecosystems

Tuncay Namli et al. Front Med (Lausanne).

Abstract

Introduction: Transparency and traceability are essential for establishing trustworthy artificial intelligence (AI). A lack of transparency in the data preparation process is a significant obstacle to developing reliable AI systems and can lead to problems with reproducibility, model debugging, bias and fairness, and regulatory compliance. We introduce a formal data preparation pipeline specification, with a focus on traceability, to improve upon the manual and error-prone data extraction processes used in AI and data analytics applications.

Methods: We propose a declarative language to define the extraction of AI-ready datasets from health data adhering to a common data model, particularly data conforming to HL7 Fast Healthcare Interoperability Resources (FHIR). We use FHIR profiling to develop a common data model tailored to an AI use case, which enables the explicit declaration of the needed information, such as phenotype and AI feature definitions. In our pipeline model, we convert complex, high-dimensional electronic health record data, represented as irregularly sampled time series, into a flat structure by defining a target population, feature groups, and final datasets. Our design considers the requirements of various AI use cases from different projects, which led to the implementation of many feature types exhibiting intricate temporal relations.
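The flattening step described above can be illustrated with a minimal Python sketch. It is not the authors' declarative language or their implementation: the names `Observation`, `FeatureDef`, and `build_dataset`, the look-back window semantics, and the sample values are all hypothetical, chosen only to show how irregularly sampled records might be reduced to one flat row per member of a target population.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Callable, Optional


@dataclass(frozen=True)
class Observation:
    """One irregularly timed measurement (hypothetical, simplified record)."""
    patient_id: str
    code: str
    time: datetime
    value: float


@dataclass(frozen=True)
class FeatureDef:
    """Declarative feature: aggregate one code over a look-back window."""
    name: str
    code: str
    window: timedelta
    aggregate: Callable[[list[float]], float]


def build_dataset(
    population: dict[str, datetime],
    observations: list[Observation],
    features: list[FeatureDef],
) -> list[dict[str, Optional[float]]]:
    """Return one flat row per patient in the population.

    `population` maps each eligible patient to a reference time
    (for example, the start of surgery). Only observations inside
    [reference - window, reference) contribute to a feature.
    """
    ordered = sorted(observations, key=lambda o: o.time)
    rows = []
    for pid, ref_time in population.items():
        row: dict = {"patient_id": pid}
        for f in features:
            values = [
                o.value
                for o in ordered
                if o.patient_id == pid
                and o.code == f.code
                and ref_time - f.window <= o.time < ref_time
            ]
            # Missing data stays explicit instead of being imputed here.
            row[f.name] = f.aggregate(values) if values else None
        rows.append(row)
    return rows


# Illustrative data only; codes and values are made up.
surgery = datetime(2024, 1, 10, 8, 0)
population = {"p1": surgery, "p2": surgery}
observations = [
    Observation("p1", "creatinine", surgery - timedelta(hours=5), 1.4),
    Observation("p1", "creatinine", surgery - timedelta(hours=30), 1.0),
    Observation("p1", "creatinine", surgery + timedelta(hours=2), 2.0),
    Observation("p2", "heart_rate", surgery - timedelta(hours=1), 80.0),
]
features = [
    FeatureDef("creatinine_mean_48h", "creatinine", timedelta(hours=48), mean),
    FeatureDef("creatinine_last_48h", "creatinine", timedelta(hours=48),
               lambda v: v[-1]),
]
dataset = build_dataset(population, observations, features)
```

In this sketch the post-surgery creatinine value is excluded because it falls after the reference time, and patient `p2` receives `None` for both creatinine features, so missingness remains visible in the flat dataset.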

Results: We implement a scalable, high-performance feature repository to execute the data preparation pipeline definitions. This software not only ensures reliable, fault-tolerant distributed processing to produce AI-ready datasets together with their metadata, including many statistics, but also serves as a pluggable component of a decision support application based on a trained AI model, automatically preparing the feature values of individual entities during online prediction. We deployed and tested the proposed methodology and its implementation in three different research projects. We present the developed FHIR profiles as a common data model, together with the feature group definitions and feature definitions within a data preparation pipeline used while training an AI model for "predicting complications after cardiac surgeries".
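As a rough illustration of dataset metadata of the kind mentioned above, the following Python sketch computes a few per-feature summary statistics for a flat dataset. The abstract does not say which statistics the feature repository produces, so the choice of count, missing count, minimum, maximum, and mean is an assumption, and `describe_dataset` and the sample rows are hypothetical.

```python
from statistics import mean
from typing import Optional


def describe_dataset(
    rows: list[dict], feature_names: list[str]
) -> dict[str, dict[str, Optional[float]]]:
    """Summarize each feature column; None marks a missing value."""
    stats = {}
    for name in feature_names:
        present = [r[name] for r in rows if r.get(name) is not None]
        stats[name] = {
            "count": len(present),
            "missing": len(rows) - len(present),
            "min": min(present) if present else None,
            "max": max(present) if present else None,
            "mean": mean(present) if present else None,
        }
    return stats


# Illustrative rows only; feature names and values are made up.
rows = [
    {"patient_id": "p1", "age": 64, "creatinine_mean_48h": 1.2},
    {"patient_id": "p2", "age": 71, "creatinine_mean_48h": None},
    {"patient_id": "p3", "age": 58, "creatinine_mean_48h": 0.9},
]
metadata = describe_dataset(rows, ["age", "creatinine_mean_48h"])
```

Storing such a summary next to the dataset gives a reviewer a quick check on coverage and missingness for each feature without reopening the source records.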

Discussion: Implementation across various pilot use cases demonstrated that our framework has the breadth and flexibility needed to define a diverse array of features, each tailored to specific temporal and contextual criteria.

Keywords: FHIR; artificial intelligence; dataset; harmonization; health data spaces; interoperability; transparency.

Conflict of interest statement

TN, AS, SG, and GE were employed by the company Software Research and Development Consulting. The study presented in this manuscript was conducted within the scope of a research study. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

FIGURE 1
Population definition schema. “*” indicates the cardinality of the corresponding element: the element is an array with 0 or more entries.
FIGURE 2
FeatureGroup definition schema. “*” indicates the cardinality of the corresponding element: the element is an array with 0 or more entries.
FIGURE 3
FeatureSet definition schema. “*” indicates the cardinality of the corresponding element: the element is an array with 0 or more entries.
FIGURE 4
Remainder of the FeatureSet definition schema. “*” indicates the cardinality of the corresponding element: the element is an array with 0 or more entries.
FIGURE 5
Population definition schema.
