BMC Genomics. 2023 Aug 16;24(1):460.
doi: 10.1186/s12864-023-09561-5.

Rare disease variant curation from literature: assessing gaps with creatine transport deficiency in focus

Erica L Lyons et al. BMC Genomics.

Abstract

Background: Approximately 4-8% of the world's population suffers from a rare disease. Rare diseases are often difficult to diagnose, and many do not have approved therapies. Genetic sequencing has the potential to shorten the current diagnostic process, increase mechanistic understanding, and facilitate research on therapeutic approaches, but it is limited by the difficulty of interpreting the pathogenicity of novel variants and of communicating known causative variants. It is unknown how many published rare disease variants are currently accessible in the public domain.

Results: This study investigated the translation of knowledge of variants reported in published manuscripts to publicly accessible variant databases. Variants, symptoms, biochemical assay results, and protein function from literature on the SLC6A8 gene associated with X-linked Creatine Transporter Deficiency (CTD) were curated and reported as a highly annotated dataset of variants with clinical context and functional details. Variants were harmonized, their availability in existing variant databases was analyzed, and pathogenicity assignments were compared with impact algorithm predictions. Of the pathogenic variants found in PubMed articles, 24% were not captured in any database used in this analysis. Only 65% of the published variants received an accurate pathogenicity prediction from at least one impact prediction algorithm.
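The database-availability comparison described above can be illustrated with a minimal sketch. It assumes variants have already been harmonized to a single notation so that string identity means "same variant". The function name `uncaptured_fraction`, the database names, and the variant identifiers are hypothetical placeholders, not the study's code or data.

```python
def uncaptured_fraction(curated, databases):
    """Return (fraction, missing): the share of curated variants that are
    absent from every database, plus the set of those missing variants.

    curated   -- set of harmonized variant identifiers (hypothetical)
    databases -- dict mapping a database name to its set of identifiers
    """
    if not curated:
        raise ValueError("curated set is empty")
    # Union of everything recorded in any database.
    captured = set().union(*databases.values())
    missing = curated - captured
    return len(missing) / len(curated), missing


# Toy placeholders only; these are not real SLC6A8 variants.
curated = {"var_A", "var_B", "var_C", "var_D"}
databases = {
    "db_one": {"var_A", "var_B"},
    "db_two": {"var_B", "var_C"},
}
fraction, missing = uncaptured_fraction(curated, databases)
print(f"{fraction:.0%} of curated variants are in no database: {sorted(missing)}")
```

Set arithmetic only works after harmonization, since the same variant written in two notations would otherwise be counted as missing.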

Conclusions: Despite being published in the literature, pathogenicity data on patient variants may remain inaccessible for genetic diagnosis, therapeutic target identification, mechanistic understanding, or hypothesis generation. Clinical and functional details presented in the literature are important for making pathogenicity assessments. Impact predictions remain imperfect but are improving, especially for single nucleotide exonic variants; however, such predictions are less accurate or unavailable for intronic and multi-nucleotide variants. Developing text mining workflows that use natural language processing to identify diseases, genes, and variants, combined with impact prediction algorithms and integrated with details on clinical phenotypes and functional assessments, might be a promising approach to scaling literature mining of variants and assigning correct pathogenicity. The curated variant list created by this effort includes context details to improve any such efforts on variant curation for rare diseases.

Keywords: CTD; Gene variant; Literature curation; Rare disease; SLC6A8; Text mining; Variant database.

Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1
Creatine Synthesis. Creatine can be synthesized in cells or transported via the creatine transporter SLC6A8. Human metabolic synthesis of creatine from arginine and glycine via AGAT and GAMT is shown. Loss of function in SLC6A8 causes X-linked Creatine Transport Deficiency (CTD).
Fig. 2
Curated SLC6A8 Variants. Variants curated from PubMed publications are displayed by subtype. (A) Percentage of curated variants that were exonic, intronic, large deletions, or in the 5’ or 3’ untranslated regions (UTRs), shown for total, pathogenic, and benign variants. The percentage is shown inside the bar. (B) Percentage of curated variants that were single nucleotide (SNV) or multi-nucleotide (MNV) variants. (C) The number of variants per subtype. The majority of variants curated from PubMed were exonic SNVs. Pathogenic*: Pathogenic variants except large deletions.
Fig. 3
Variants in SLC6A8. The positions of the curated published single nucleotide exonic variants are shown on the 2-D (A) and 3-D (B) models of the structure of SLC6A8. The source for the 3-D model was AlphaFold, which was developed by DeepMind and EMBL-EBI. Orange: Likely Pathogenic and Pathogenic. Gray: Uncertain significance. Blue: Likely Benign and Benign. Full variant details can be found in the Supplemental Information. (C) Pathogenic variants are displayed as a lollipop plot along the protein sequence; variants with available impaired creatine uptake rates are plotted above the line, and variants with unmeasured creatine uptake are displayed below the line.
Fig. 4
Curated Variant List Overlap with LOVD and ClinVar databases. SLC6A8 variants curated from PubMed publications were compared with those recorded in LOVD (A) and ClinVar (B). Percent present (above bars) and total number (below x axis) of our curated list of 185 published variants are shown for total and by subtype. The percent within the bar shows the percent that matched the pathogenicity classification. (C) Overlap of total and pathogenic* variants in the curated list and ClinVar database. Only 53 of the 185 published variants were present in ClinVar. Pathogenic*: Pathogenic variants except large deletions
Fig. 5
Curated SLC6A8 Variants in Population Databases dbSNP, gnomAD, and 1,000 Genomes. Both percent (above bar) and number of variants (below x axis) are shown. Pathogenic*: Pathogenic variants except large deletions
Fig. 6
Curated SLC6A8 Variants Impact Prediction. (A) The number of predictions made from the total number of curated variants and (B) the accuracy of the prediction for in silico pathogenicity predictors are shown. The percentage is above the bar and the number is below the axis. Pathogenic*: Pathogenic variants except large deletions
Fig. 7
Methodology Workflow. (A) The curator discovered published rare disease genetic variants by searching PubMed for the gene name, disease synonyms, and biological pathways. Variants were compiled into a spreadsheet that included symptoms and phenotype associated with each variant and details that informed a pathogenicity assignment. The cDNA or protein variant names were harmonized to standard genetic notation. Impact analysis comparing pathogenicity predictions from algorithms to the ones reported in literature and comparisons with public data sources for finding the overlap and gaps of published variants were performed. (B) The categories of information collected and impact analysis algorithms and databases consulted. The full information is available as supplemental data
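The impact analysis step, which compares algorithm predictions with the pathogenicity reported in the literature, can be sketched as follows. This is an illustrative assumption about how such a comparison could be tallied, not the authors' implementation. The function name `summarize_predictions`, the tool names, the labels, and the variant identifiers are all hypothetical, and a missing or `None` entry is taken to mean that the tool made no prediction.

```python
def summarize_predictions(curated, predictions):
    """Compare predictor calls with curated pathogenicity labels.

    curated     -- dict: variant identifier -> curated label
    predictions -- dict: tool name -> dict of variant identifier -> label
                   (a missing entry or None means no prediction was made)
    Returns (per_tool, any_correct_fraction).
    """
    per_tool = {}
    for tool, calls in predictions.items():
        # Variants for which this tool produced any prediction.
        scored = [v for v in curated if calls.get(v) is not None]
        correct = sum(calls[v] == curated[v] for v in scored)
        per_tool[tool] = {
            "n_predicted": len(scored),
            "accuracy": correct / len(scored) if scored else None,
        }
    # Variants matched correctly by at least one tool.
    any_correct = sum(
        any(calls.get(v) == label for calls in predictions.values())
        for v, label in curated.items()
    )
    return per_tool, any_correct / len(curated)


# Toy placeholders only; not real variants, tools, or results.
curated = {
    "var_A": "pathogenic",
    "var_B": "pathogenic",
    "var_C": "benign",
    "var_D": "pathogenic",
}
predictions = {
    "tool_one": {"var_A": "pathogenic", "var_B": "benign", "var_C": "benign"},
    "tool_two": {"var_A": "pathogenic", "var_B": "pathogenic", "var_D": None},
}
per_tool, any_correct_fraction = summarize_predictions(curated, predictions)
print(per_tool)
print(f"correct in at least one tool: {any_correct_fraction:.0%}")
```

Reporting the number of predictions separately from accuracy matters because a tool that returns nothing for intronic or multi-nucleotide variants would otherwise look more accurate than it is across the whole list.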
