Bioinformatics. 2018 Jan 1;34(1):80-87.
doi: 10.1093/bioinformatics/btx541.

tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine


Chih-Hsuan Wei et al. Bioinformatics. 2018.

Abstract

Motivation: Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques, but they are of limited use because their mutation extraction results are not standardized or integrated with curated data.

Results: We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably with the state of the art. Next, we applied our approach to all of PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and are not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating and prioritizing variants in genomic research.
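The selection described in the Results (RS numbers absent from ClinVar, then the rare subset with MAF ≤ 0.01) can be sketched as a set-membership filter followed by a threshold filter. This is only a minimal illustration and not the authors' pipeline: the record layout, the function names, and the RS identifiers and gene names in the toy data are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class TextMinedVariant:
    """Hypothetical record for one text-mined variant (layout is an assumption)."""
    rsid: str
    gene: str
    maf: Optional[float]  # minor allele frequency; None when unknown


def novel_variants(mined: Iterable[TextMinedVariant],
                   clinvar_rsids: Set[str]) -> List[TextMinedVariant]:
    """Keep text-mined variants whose RS number is not in the ClinVar set."""
    return [v for v in mined if v.rsid not in clinvar_rsids]


def rare_variants(variants: Iterable[TextMinedVariant],
                  max_maf: float = 0.01) -> List[TextMinedVariant]:
    """Keep variants with a known MAF at or below the threshold."""
    return [v for v in variants if v.maf is not None and v.maf <= max_maf]


# Toy data: identifiers and gene names are placeholders, not real records.
mined = [
    TextMinedVariant("rs0000001", "GENE_A", 0.002),
    TextMinedVariant("rs0000002", "GENE_A", 0.25),
    TextMinedVariant("rs0000003", "GENE_B", None),
    TextMinedVariant("rs0000004", "GENE_C", 0.01),
]
clinvar_rsids = {"rs0000002"}

novel = novel_variants(mined, clinvar_rsids)
rare_novel = rare_variants(novel)
genes = sorted({v.gene for v in rare_novel})
print(len(novel), len(rare_novel), genes)
```

Variants with an unknown MAF are excluded from the rare set here; whether the original study treated them that way is not stated in the abstract.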

Availability and implementation: The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

Contact: zhiyong.lu@nih.gov.


Figures

Fig. 1. An example of detecting variation mentions, rewriting them in HGVS form, and normalizing to a unique RS number
Fig. 2. The overall workflow of our mutation normalization process
Fig. 3. An example of the dictionary-lookup step
Fig. 4. The distribution of PMID and RSID links
Fig. 5. The overlap between tmVar, dbSNP and ClinVar results. The numbers refer to PMID-RSID pairs
Fig. 6. The total counts of pathogenic and benign ClinVar variants by (a) functional consequence and (b) population frequency (MAF)
Fig. 7. Enrichment of RS attributes in the tmVar versus the dbSNP RS set. The percentage of total RS having each attribute (i.e. CDS-missense, Pathogenic, or Rare MAF) was first calculated separately for the tmVar and dbSNP RS datasets (% tmVar and % dbSNP). The enrichment ratio for each attribute was then computed by dividing the tmVar percentage by the corresponding dbSNP percentage and plotted along the Y-axis
Fig. 8. The total counts of tmVar RS that already existed in ClinVar (yellow) and additional novel variants (blue) for each ACMG gene. All genes shown, except MYL3 and TMEM43, contain between 1 and 51 novel variants (Color version of this figure is available at Bioinformatics online.)
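The enrichment ratio described in the Fig. 7 caption is the tmVar percentage for an attribute divided by the corresponding dbSNP percentage. A small sketch of that arithmetic follows; the function names and all counts are hypothetical and are not taken from the paper.

```python
def percentage(count_with_attribute: int, total: int) -> float:
    """Percentage of RS in a dataset that carry a given attribute."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count_with_attribute / total


def enrichment_ratio(tmvar_count: int, tmvar_total: int,
                     dbsnp_count: int, dbsnp_total: int) -> float:
    """Ratio of % tmVar to % dbSNP for one attribute (e.g. CDS-missense)."""
    tmvar_pct = percentage(tmvar_count, tmvar_total)
    dbsnp_pct = percentage(dbsnp_count, dbsnp_total)
    if dbsnp_pct == 0:
        raise ValueError("dbSNP percentage is zero; ratio is undefined")
    return tmvar_pct / dbsnp_pct


# Made-up counts for illustration only:
# 200 of 1000 tmVar RS (20%) versus 50 of 5000 dbSNP RS (1%).
ratio = enrichment_ratio(200, 1000, 50, 5000)
print(ratio)
```

A ratio above 1 means the attribute is more common among text-mined RS than in dbSNP overall; a ratio below 1 means it is less common.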


