BMC Genomics. 2023 Aug 16;24(1):460.

doi: 10.1186/s12864-023-09561-5.

Rare disease variant curation from literature: assessing gaps with creatine transport deficiency in focus

Affiliations

¹ Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research, Frederick, MD, 21702, USA.
² Division of Preclinical Innovation, Therapeutic Development Branch, Therapeutics for Rare and Neglected Diseases (TRND) Program, National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, MD, 20892, USA.
³ Division of Translational Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, 20892, USA.
⁴ Division of Preclinical Innovation, Therapeutic Development Branch, Therapeutics for Rare and Neglected Diseases (TRND) Program, National Center for Advancing Translational Sciences, National Institutes of Health, Bethesda, MD, 20892, USA. elizabeth.ottinger@nih.gov.
⁵ Advanced Biomedical Computational Science, Frederick National Laboratory for Cancer Research, Frederick, MD, 21702, USA. uma.mudunuri@nih.gov.

PMID: 37587458
PMCID: PMC10433598
DOI: 10.1186/s12864-023-09561-5

Erica L Lyons et al. BMC Genomics. 2023.

Abstract

Background: Approximately 4-8% of the world's population suffers from a rare disease. Rare diseases are often difficult to diagnose, and many do not have approved therapies. Genetic sequencing has the potential to shorten the current diagnostic process, increase mechanistic understanding, and facilitate research on therapeutic approaches, but it is limited by the difficulty of interpreting the pathogenicity of novel variants and of communicating known causative variants. It is unknown how many published rare disease variants are currently accessible in the public domain.

Results: This study investigated the translation of knowledge of variants reported in published manuscripts to publicly accessible variant databases. Variants, symptoms, biochemical assay results, and protein function from literature on the SLC6A8 gene associated with X-linked Creatine Transporter Deficiency (CTD) were curated and reported as a highly annotated dataset of variants with clinical context and functional details. Variants were harmonized, their availability in existing variant databases was analyzed, and pathogenicity assignments were compared with impact algorithm predictions. Of the pathogenic variants found in PubMed articles, 24% were not captured in any database used in this analysis, while only 65% of the published variants received an accurate pathogenicity prediction from at least one impact prediction algorithm.

Conclusions: Despite being published in the literature, pathogenicity data on patient variants may remain inaccessible for genetic diagnosis, therapeutic target identification, mechanistic understanding, or hypothesis generation. Clinical and functional details presented in the literature are important for making pathogenicity assessments. Impact predictions remain imperfect but are improving, especially for single nucleotide exonic variants; however, such predictions are less accurate or unavailable for intronic and multi-nucleotide variants. Text mining workflows that use natural language processing to identify diseases, genes, and variants, combined with impact prediction algorithms and integrated with details on clinical phenotypes and functional assessments, might be a promising approach to scaling literature mining of variants and assigning correct pathogenicity. The curated variant list created by this effort includes context details intended to improve any such efforts on variant curation for rare diseases.

Keywords: CTD; Gene variant; Literature curation; Rare disease; SLC6A8; Text mining; Variant database.

Conflict of interest statement

The authors declare no competing interests.

Figures

**Fig. 1**
Creatine Synthesis. Creatine can be synthesized in cells or transported via the creatine transporter *SLC6A8*. Human metabolic synthesis of creatine from arginine and glycine via *AGAT* and *GAMT* is shown. Loss of function in *SLC6A8* causes X-linked Creatine Transport Deficiency (CTD).

**Fig. 2**
Curated *SLC6A8* Variants. Variants curated from PubMed publications are displayed by subtype. (A) Percentage of curated variants that were exonic, intronic, large deletions, or in the 5’ or 3’ untranslated regions (UTR), shown for total, pathogenic, and benign variants. The percentage is shown inside the bar. (B) Single nucleotide (SNV) and multi nucleotide (MNV) variants. (C) The number of variants per subtype. The majority of variants curated from PubMed were exonic SNVs. Pathogenic*: Pathogenic variants except large deletions.

**Fig. 3**
Variants in *SLC6A8*. The curated published single nucleotide exonic variant positions are shown on the 2-D (A) and 3-D (B) models of the structure of *SLC6A8*. The source for the 3-D model was AlphaFold, which was developed by DeepMind and EMBL-EBI. Orange: Likely Pathogenic and Pathogenic. Gray: Uncertain significance. Blue: Likely Benign and Benign. Full variant details can be found in the Supplemental Information. (C) Pathogenic variants are displayed as a lollipop plot along the protein sequence; variants with available impaired creatine uptake rates are plotted above the line and variants with unmeasured creatine uptake are displayed below the line.

**Fig. 4**
Curated Variant List Overlap with LOVD and ClinVar databases. *SLC6A8* variants curated from PubMed publications were compared with those recorded in LOVD (A) and ClinVar (B). Percent present (above bars) and total number (below x axis) of our curated list of 185 published variants are shown for total and by subtype. The percent within the bar shows the percent that matched the pathogenicity classification. (C) Overlap of total and pathogenic* variants in the curated list and ClinVar database. Only 53 of the 185 published variants were present in ClinVar. Pathogenic*: Pathogenic variants except large deletions.

**Fig. 5**
Curated *SLC6A8* Variants in Population Databases dbSNP, gnomAD, and 1,000 Genomes. Both percent (above bar) and number of variants (below x axis) are shown. Pathogenic*: Pathogenic variants except large deletions

**Fig. 6**
Curated *SLC6A8* Variants Impact Prediction. (A) The number of predictions made from the total number of curated variants and (B) the accuracy of the prediction for *in silico* pathogenicity predictors are shown. The percentage is above the bar and the number is below the axis. Pathogenic*: Pathogenic variants except large deletions

**Fig. 7**
Methodology Workflow. (A) The curator discovered published rare disease genetic variants by searching PubMed for the gene name, disease synonyms, and biological pathways. Variants were compiled into a spreadsheet that included the symptoms and phenotype associated with each variant and details that informed a pathogenicity assignment. The cDNA or protein variant names were harmonized to standard genetic notation. Impact analysis compared pathogenicity predictions from algorithms with those reported in the literature, and the curated list was compared with public data sources to find the overlap and gaps in published variants. (B) The categories of information collected and the impact analysis algorithms and databases consulted. The full information is available as supplemental data.
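The comparison steps in the workflow (harmonize names, check database presence, score predictions against curated classifications) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Variant` class, the `normalize`, `database_coverage`, and `prediction_accuracy` helpers, and every variant name below are hypothetical placeholders, and the toy `normalize` function only stands in for real harmonization to standard notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str           # cDNA-style name (made-up placeholder, not a real SLC6A8 variant)
    curated_class: str  # classification assigned from the literature


def normalize(name: str) -> str:
    """Toy harmonization: drop whitespace and force a lowercase 'c.' prefix."""
    name = name.strip().replace(" ", "")
    if name[:2].lower() == "c.":
        name = "c." + name[2:]
    return name


def database_coverage(curated, database_names):
    """Split curated variants into those present in and missing from a database."""
    db = {normalize(n) for n in database_names}
    present = [v for v in curated if normalize(v.name) in db]
    missing = [v for v in curated if normalize(v.name) not in db]
    return present, missing


def prediction_accuracy(curated, predictions):
    """Return (number predicted, number correct); unpredicted variants are skipped."""
    scored = [
        (v, predictions[normalize(v.name)])
        for v in curated
        if normalize(v.name) in predictions
    ]
    correct = sum(1 for v, predicted in scored if predicted == v.curated_class)
    return len(scored), correct


# Made-up example data, for illustration only.
curated = [
    Variant("c.100A>G", "pathogenic"),
    Variant("C.200C>T ", "pathogenic"),   # inconsistent notation, fixed by normalize
    Variant("c.300G>A", "benign"),
    Variant("c.400+1G>T", "pathogenic"),  # intronic: no prediction available here
]
database = ["c.100A>G", "c.300G>A", "c.999T>C"]
predictions = {
    "c.100A>G": "pathogenic",
    "c.200C>T": "benign",
    "c.300G>A": "benign",
}

present, missing = database_coverage(curated, database)
n_predicted, n_correct = prediction_accuracy(curated, predictions)
print(f"in database: {len(present)}/{len(curated)}")
print(f"predicted: {n_predicted}/{len(curated)}, correct: {n_correct}/{n_predicted}")
```

Keeping "no prediction made" separate from "prediction made but wrong" mirrors the two panels of Fig. 6, which report the number of predictions and their accuracy separately.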
