Commun Eng. 2025 Jul 21;4(1):130.
doi: 10.1038/s44172-025-00450-1.

Scorecard for synthetic medical data evaluation

Ghada Zamzmi et al. Commun Eng.

Abstract

Although interest in synthetic medical data (SMD) for developing and testing artificial intelligence (AI) methods is growing, the absence of a comprehensive framework to evaluate the quality and applicability of SMD hinders its wider adoption. Here, we outline an evaluation framework designed to meet the unique requirements of medical applications. We also introduce the SMD scorecard, a comprehensive report accompanying artificially generated datasets. This scorecard provides a quantitative assessment of SMD across seven criteria (7 Cs), complemented by a descriptive section that contains all relevant information about the dataset. The SMD scorecard provides a practical framework for evaluating and reporting the quality of synthetic data, which can benefit SMD developers and users.

Conflict of interest statement

Competing interests: The authors declare no competing interests.

Figures

Fig. 1. Evaluation challenges in synthetic medical images and textual data.
(a) A synthetic image produced by a generative AI model (adopted from Ref.), with high visual fidelity but containing structural inconsistencies such as broken ligaments. (b) Synthetic chest X-rays featuring misplaced medical devices (pacemakers and tubes) located outside anatomically plausible regions. (c) Digital mammograms that score highly on fidelity and statistical metrics but display clinically implausible artifacts, including abnormal breast shape and the erroneous presence of multiple nipple-like structures (indicated by the red arrows). (d) Textual outputs from LLMs in a medical query task, where the LLM gives incorrect responses to critical clinical questions; such errors can often be detected through direct comparison with ground truth. (e) In the summarization task, however, the LLM generates a symptom (leg swelling) not present in the original EHR note, which is harder to detect using similarity- or overlap-based metrics.
Fig. 2. Scorecard for evaluating and reporting synthetic medical data.
Congruence measures the alignment between the distributions of synthetic and real data. Coverage highlights the variability in synthetic data, demonstrated using convex hull volume, where synthetic data shows a smaller spread compared to patient data. Constraint evaluates adherence to clinical context (e.g., tumor size and density) by identifying valid versus invalid data points. Completeness assesses the presence of necessary information for the intended use. Compliance ensures adherence to local and global standards such as de-identification, privacy preservation, and file format consistency. Comprehension evaluates the accessibility and clarity of SMD generation processes and documentation. The bottom panel illustrates the consistency across subgroups (e.g., demographic or disease-specific groups). Essential descriptive information about the synthetic data is presented in the middle panel.
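As a rough illustration of two of the criteria in the caption above, the following Python sketch computes a coverage ratio from convex hulls and a constraint pass rate from a validity rule. It is a minimal sketch under stated assumptions, not the authors' implementation: it works on two-dimensional feature points (so the hull "volume" is an area), and the function names, the example feature pairs (tumor size, density), and the plausibility rule are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def convex_hull(points: Sequence[Point]) -> List[Point]:
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o: Point, a: Point, b: Point) -> float:
        # z-component of (a - o) x (b - o); positive for a left turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other
    return lower[:-1] + upper[:-1]


def hull_area(points: Sequence[Point]) -> float:
    """Area of the convex hull via the shoelace formula (0.0 if degenerate)."""
    hull = convex_hull(points)
    if len(hull) < 3:
        return 0.0
    twice_area = 0.0
    for i, (x1, y1) in enumerate(hull):
        x2, y2 = hull[(i + 1) % len(hull)]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def coverage_ratio(synthetic: Sequence[Point], real: Sequence[Point]) -> float:
    """Hull area of the synthetic data relative to the hull area of the real data."""
    real_area = hull_area(real)
    if real_area == 0.0:
        raise ValueError("real data must span a non-degenerate region")
    return hull_area(synthetic) / real_area


def constraint_rate(points: Sequence[Point], is_valid: Callable[[Point], bool]) -> float:
    """Fraction of data points that satisfy a clinical validity rule."""
    if not points:
        raise ValueError("no data points given")
    return sum(1 for p in points if is_valid(p)) / len(points)


# Hypothetical feature points; values are illustrative only.
real = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0), (2.0, 2.0)]
synthetic = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
print(coverage_ratio(synthetic, real))  # 0.25


# Hypothetical rule: tumor size must be positive, density must lie in [0, 1].
def is_plausible(p: Point) -> bool:
    size, density = p
    return size > 0 and 0.0 <= density <= 1.0


samples = [(12.0, 0.4), (-3.0, 0.5), (8.0, 1.7), (20.0, 0.9)]
print(constraint_rate(samples, is_plausible))  # 0.5
```

A coverage ratio below 1 corresponds to the situation the caption describes, where synthetic data shows a smaller spread than patient data. For higher-dimensional features, a library hull routine would be a more practical choice than this hand-written 2D version.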

References

    1. Noorbakhsh-Sabet, N., Zand, R., Zhang, Y. & Abedi, V. Artificial intelligence transforms the future of health care. Am. J. Med. 132, 795–801 (2019).
    2. Sizikova, E. et al. Synthetic data in radiological imaging: Current state and future outlook. BJR|Artificial Intelligence ubae007 (2024).
    3. Borji, A. Pros and cons of GAN evaluation measures: New developments. Comput. Vis. Image Underst. 215, 103329 (2022).
    4. Chang, Y. et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 15, 1–45 (2024).
    5. Dankar, F. K., Ibrahim, M. K. & Ismail, L. A multi-dimensional evaluation of synthetic data generators. IEEE Access 10, 11147–11158 (2022).