Nat Commun. 2023 Oct 12;14(1):6403.
doi: 10.1038/s41467-023-41980-6.

Simulation of undiagnosed patients with novel genetic conditions

Emily Alsentzer et al. Nat Commun.

Abstract

Rare Mendelian disorders pose a major diagnostic challenge and collectively affect 300–400 million patients worldwide. Many automated tools aim to uncover causal genes in patients with suspected genetic disorders, but evaluation of these tools is limited by the lack of comprehensive benchmark datasets that include previously unpublished conditions. Here, we present a computational pipeline that simulates realistic clinical datasets to address this deficit. Our framework jointly simulates complex phenotypes and challenging candidate genes and produces patients with novel genetic conditions. We demonstrate the similarity of our simulated patients to real patients from the Undiagnosed Diseases Network and evaluate common gene prioritization methods on the simulated cohort. These prioritization methods recover known gene-disease associations but perform poorly on diagnosing patients with novel genetic disorders. Our publicly available dataset and codebase can be used by medical genetics researchers to evaluate, compare, and improve tools that aid in the diagnostic process.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Identification and categorization of causal disease genes.
(a) Genomic variation uncovered in an affected patient through DNA sequencing is investigated using variant-level and gene-level evidence in order to identify the gene variant that is most likely responsible for causing the patient’s symptoms. Here, we depict a subset of relevant information that a care team may use to make this assessment. (b) The causal gene responsible for a patient’s disorder can be categorized based on the extent of medical knowledge that exists about the gene and its associated disorder. Intuitively, diagnosing patients where less is known about their causal gene and disease (bottom category) is a more challenging task than diagnosing patients where more is known about the causal gene and disease (top category). The protein structure pictured is PDB: ID3B. Icons are from Microsoft PowerPoint.
Fig. 2. Simulation process generates patients with multiple phenotype terms and candidate genes.
(a) Patients are first assigned a true disease and initialized with a gene known to cause that disease (blue circle) as well as with positive and negative phenotypes associated with that disease (gray diamonds). Phenotype terms are then randomly removed through phenotype dropout, randomly altered to be less specific according to their position in an ontology relating phenotype terms, and augmented with terms randomly selected by prevalence in a medical claims database. Finally, strong distractor candidate genes and relevant additional phenotypes are generated based on six distractor gene modules. (b) The six distractor gene modules are inspired by genes that are frequently considered in current clinical genomic workflows and are designed to generate highly plausible, yet ultimately non-causal, genes for each patient. Four of the distractor gene modules are defined by the overlap, or lack thereof, between the phenotypes associated with the distractor gene and the phenotypes associated with the patient’s causal gene. The remaining two modules are defined by tissue expression similar to that of the true disease gene or solely by their frequent erroneous prioritization in computational pipelines.
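The three phenotype-altering steps in this caption (dropout, generalization to a less specific ontology term, and prevalence-weighted noise) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `perturb_phenotypes`, the probabilities, and the toy ontology `TOY_PARENT` are all assumptions made for illustration, and real use would draw on an actual phenotype ontology and claims-derived prevalences.

```python
import random

# Hypothetical toy ontology mapping child term -> parent term (not real ontology data).
TOY_PARENT = {
    "focal seizure": "seizure",
    "seizure": "neurological abnormality",
    "ataxia": "neurological abnormality",
}


def perturb_phenotypes(terms, parent, noise_terms, noise_weights,
                       p_drop=0.3, p_generalize=0.3, n_noise=1, rng=None):
    """Apply dropout, generalization, and noise augmentation to a list of terms.

    All parameter names and default values are illustrative assumptions.
    """
    rng = rng or random.Random()
    # 1. Phenotype dropout: each term is removed with probability p_drop.
    kept = [t for t in terms if rng.random() >= p_drop]
    # 2. Generalization: with probability p_generalize, replace a term by its
    #    parent in the ontology (terms without a parent are left unchanged).
    kept = [parent.get(t, t) if rng.random() < p_generalize else t for t in kept]
    # 3. Noise augmentation: sample extra terms weighted by their prevalence.
    noise = rng.choices(noise_terms, weights=noise_weights, k=n_noise)
    return sorted(set(kept) | set(noise))
```

Passing a seeded `random.Random` instance as `rng` makes a simulated patient reproducible; the distractor gene modules described in panel (b) are not covered by this sketch.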
Fig. 3. Simulated patients mimic real-world patients.
Diagnosed, real-world patients from the Undiagnosed Diseases Network (orange) and a disease-matched cohort of simulated patients (teal) have similar numbers of (a) candidate genes per patient (average μ of 13.13 vs. 13.94) and (b) positive phenotype terms per patient (average of 24.08 vs. 21.57). (c) Real patients (orange) and simulated patients (teal) are indistinguishable based on their annotated positive phenotype terms within each Orphanet disease category, as visualized using non-linear dimension reduction via a Uniform Manifold Approximation and Projection (UMAP) plot. The horizontal and vertical axes are uniform across all plots. The number of real patients within each disease category, n, is listed in the corner of each plot; there are 20 simulated patients for each real patient. (d) For each real-world patient, all simulated patients in the disease-matched cohort are ranked randomly (black) and by the Jaccard similarity of their phenotype terms to the query real-world patient (purple). The Empirical Cumulative Distribution Function (ECDF) plot shows that the basic Jaccard similarity metric is able to retrieve simulated patients with the same disease as the query real patient more accurately than if the simulated patients were retrieved randomly. (e) The distributions of shortest path distances between all non-causal candidate and true causal genes in a gene–gene interaction network are indistinguishable between real-world and simulated patients. n is the number of patients in each patient category. Source data are provided as a Source Data file.
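The Jaccard-based retrieval in panel d can be sketched as follows. The Jaccard similarity of two term sets A and B is |A ∩ B| / |A ∪ B|; the function names and the dictionary layout of the simulated cohort are assumptions for illustration rather than the paper's code.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of terms."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets are treated as having similarity 0.0 (a convention chosen here).
    return len(a & b) / len(union) if union else 0.0


def rank_by_jaccard(query_terms, simulated):
    """Order simulated patient ids by decreasing similarity to the query patient.

    `simulated` is a hypothetical dict mapping patient id -> phenotype terms.
    """
    return sorted(simulated,
                  key=lambda pid: jaccard(query_terms, simulated[pid]),
                  reverse=True)
```

For a query patient with terms {"seizure", "ataxia"}, a simulated patient sharing both terms ranks ahead of one sharing a single term, which in turn ranks ahead of one sharing none.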
Fig. 4. Ability of computational approaches to rank causal genes differs across disease-gene categories.
We group simulated patients and real-world UDN patients into five categories based on their type of causal gene-disease association (patient counts in Table 1). These categories, described in detail in Fig. 1b, are illustrated in the blue header bars above each plot and ordered from left to right by decreasing amount of existing knowledge of the association in the underlying knowledge graph. Each panel (a–e) shows performance on patients in a single category. We run nine gene ranking algorithms implemented in six prioritization tools on the phenotype terms and candidate gene list for each simulated and real-world patient within each causal gene-disease category. These algorithms are separated into those that directly consider patient–gene phenotypic similarity (G1: Phrank–Gene, G2: ERIC–Gene), those that compute patient–disease phenotypic similarity (D1: Phrank–Disease, D2: ERIC–Disease, D3: Phenomizer, D4: LIRICAL), and those that consider additional interaction edges, such as gene–gene edges, interactions in other species, or predicted edges (I1: Phenolyzer, I2: HiPhive, I3: ERIC–Predicted). We show here the ability of these methods to correctly rank each patient’s causal gene within the top k ranked genes for varying values of k. For visual clarity, the color and width of each stacked bar section correspond to the causal gene rank grouping. The average rank of the causal gene is italicized above each bar. Dashed lines denote the average percent of patients where the causal gene appeared in the top 10 across ten random rankings of the candidate genes. Boxen plots displaying the distributions of causal gene ranks can be found in Supplementary Fig. 2. Source data are provided as a Source Data file.
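The quantities reported in this figure (fraction of patients with the causal gene in the top k, average causal gene rank, and the random-ranking baseline shown as dashed lines) can be computed along the lines of the sketch below. The function names and input layout are assumptions for illustration; the prioritization tools themselves are not reproduced here.

```python
import random


def causal_gene_rank(ranked_genes, causal_gene):
    """1-based position of the causal gene in a ranked candidate list."""
    return ranked_genes.index(causal_gene) + 1


def top_k_fraction(ranks, k):
    """Fraction of patients whose causal gene is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_rank(ranks):
    """Average causal gene rank across patients."""
    return sum(ranks) / len(ranks)


def random_baseline(candidate_lists, causal_genes, k=10, n_repeats=10, seed=0):
    """Average top-k fraction over n_repeats random orderings of each list."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_repeats):
        ranks = []
        for genes, causal in zip(candidate_lists, causal_genes):
            shuffled = list(genes)   # copy so the caller's list is not reordered
            rng.shuffle(shuffled)
            ranks.append(causal_gene_rank(shuffled, causal))
        fractions.append(top_k_fraction(ranks, k))
    return sum(fractions) / len(fractions)
```

Defaults of `k=10` and `n_repeats=10` mirror the caption's "top 10 across ten random rankings"; the seed is only there to make the baseline repeatable.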
Fig. 5. Pipeline components increase the difficulty of causal gene identification in simulated patients.
We run a gene prioritization algorithm on patients simulated by our pipeline when varying subsets of pipeline components are included. We report the fraction of simulated patients where the causal gene was prioritized within the top k ranked genes for varying k (horizontal axis for all plots) when different components of the simulation pipeline are included (vertical axis for all plots). The average rank of the causal gene is listed in italics at the base of each bar. The color and width of each stacked bar section correspond to the causal gene rank grouping. We show gene prioritization performance on simulated patients produced when the following components are included in the simulation pipeline: (a) neither phenotype- nor gene-based components (i.e., candidate genes sampled randomly and phenotype terms unaltered from initialization), all standalone phenotype-altering components alone, all distractor gene modules alone, or all pipeline components together; (b) a “gene-only” version of distractor gene modules and each possible combination of subsets of phenotype-altering components; (c) all three standalone phenotype-altering components and all but one distractor gene module at a time. Note that in (b), horizontal purple lines in the vertical axis labels are for visual clarity, whereas in (c), horizontal black lines in the vertical axis signify set difference. Source data are provided as a Source Data file.
