Bioinformatics. 2018 Jan 1;34(1):80-87.

doi: 10.1093/bioinformatics/btx541.

tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine

Chih-Hsuan Wei¹, Lon Phan¹, Juliana Feltz¹, Rama Maiti¹, Tim Hefferon¹, Zhiyong Lu¹

PMID: 28968638
PMCID: PMC5860583
DOI: 10.1093/bioinformatics/btx541

Chih-Hsuan Wei et al. Bioinformatics. 2018.

Affiliation

¹ National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), Bethesda, MD 20894, USA.

Abstract

Motivation: Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature. Previous studies have attempted to harvest and unlock such information with text-mining techniques, but they are of limited use because their mutation extraction results are not standardized or integrated with curated data.

Results: We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably with the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating and prioritizing variants in genomic research.
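The novelty analysis described in the Results (text-mined RS numbers not found in ClinVar, then restricted to rare variants with MAF ≤ 0.01) can be pictured as two filtering steps. The following is only a minimal sketch on toy data: the `TextMinedVariant` record, the `novel_rare_variants` function, and every RS number and gene name are hypothetical and do not reflect the tmVar 2.0 output format or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TextMinedVariant:
    """Hypothetical record for one text-mined variant (not the tmVar format)."""
    rsid: str
    gene: str
    maf: Optional[float]  # minor allele frequency; None when unknown


def novel_rare_variants(mined, clinvar_rsids, maf_cutoff=0.01):
    """Return (variants absent from ClinVar, the rare subset of those)."""
    novel = [v for v in mined if v.rsid not in clinvar_rsids]
    # Variants without a known MAF are not counted as rare.
    rare = [v for v in novel if v.maf is not None and v.maf <= maf_cutoff]
    return novel, rare


# Toy data, invented for illustration only.
mined = [
    TextMinedVariant("rs0000001", "GENE_A", 0.002),
    TextMinedVariant("rs0000002", "GENE_B", 0.25),
    TextMinedVariant("rs0000003", "GENE_A", None),
    TextMinedVariant("rs0000004", "GENE_C", 0.01),
]
clinvar_rsids = {"rs0000002"}

novel, rare = novel_rare_variants(mined, clinvar_rsids)
print(len(novel), "novel RS in", len({v.gene for v in novel}), "genes")
print(len(rare), "rare RS in", len({v.gene for v in rare}), "genes")
```

Counting distinct genes alongside distinct RS numbers mirrors how the abstract reports both figures; treating an unknown MAF as "not rare" is a conservative assumption of this sketch, not a statement about the paper's procedure.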

Availability and implementation: The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.

Contact: zhiyong.lu@nih.gov.

Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.

Figures

**Fig. 1.**
An example of detecting variation mentions, rewriting in HGVS forms, and normalizing to a unique RS number

**Fig. 2.**
The overall workflow of our mutation normalization process

**Fig. 3.**
An example of the dictionary-lookup step

**Fig. 4.**
The distribution of PMID and RSID links

**Fig. 5.**
The overlap between tmVar, dbSNP and ClinVar results. The numbers refer to PMID-RSID pairs

**Fig. 6.**
The total counts of pathogenic and benign ClinVar variants by (a) functional consequence and (b) population frequency (MAF)

**Fig. 7.**
Enrichment of RS attributes in the tmVar versus the dbSNP RS set. The percentages (% tmVar or % dbSNP) of total RS having each attribute (i.e. CDS-missense, Pathogenic, or Rare MAF) were first calculated for the tmVar and dbSNP RS datasets separately. The enrichment ratio for each attribute was then computed by dividing the tmVar percentage by the corresponding dbSNP percentage and plotted along the Y-axis

**Fig. 8.**
The total counts of tmVar RS that already existed in ClinVar (yellow) and additional novel variants (blue) for each ACMG gene. All genes shown, except two (MYL3 and TMEM43), contain from 1 to 51 novel variants. (Color version of this figure is available at *Bioinformatics* online.)
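The enrichment ratio described in the Fig. 7 caption (the percentage of RS with an attribute in the tmVar set divided by the corresponding percentage in the dbSNP set) can be written out as a short calculation. This is a sketch under stated assumptions: the function names, the dictionary layout mapping each RS number to a set of attribute labels, and the toy values are all hypothetical, not taken from the paper's data or code.

```python
def percent_with_attribute(rs_attributes, attribute):
    """Percentage of RS entries whose attribute set contains `attribute`."""
    total = len(rs_attributes)
    if total == 0:
        raise ValueError("RS set is empty")
    hits = sum(1 for attrs in rs_attributes.values() if attribute in attrs)
    return 100.0 * hits / total


def enrichment_ratio(tmvar_rs, dbsnp_rs, attribute):
    """tmVar percentage divided by dbSNP percentage for one attribute."""
    dbsnp_pct = percent_with_attribute(dbsnp_rs, attribute)
    if dbsnp_pct == 0:
        raise ValueError("attribute absent from the dbSNP set; ratio undefined")
    return percent_with_attribute(tmvar_rs, attribute) / dbsnp_pct


# Toy data, invented for illustration: RS number -> set of attribute labels.
tmvar_rs = {
    "rs1": {"Pathogenic", "CDS-missense"},
    "rs2": {"Pathogenic"},
    "rs3": {"Rare MAF"},
    "rs4": set(),
}
dbsnp_rs = {f"rs{i}": set() for i in range(1, 11)}
dbsnp_rs["rs1"] = {"Pathogenic"}

# 2 of 4 tmVar RS (50%) versus 1 of 10 dbSNP RS (10%) carry "Pathogenic".
ratio = enrichment_ratio(tmvar_rs, dbsnp_rs, "Pathogenic")
print(f"Pathogenic enrichment: {ratio:.1f}x")
```

A ratio above 1 indicates that the attribute is over-represented among text-mined RS numbers relative to dbSNP as a whole.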

Cited by

Cutting-Edge AI Technologies Meet Precision Medicine to Improve Cancer Care.
Lin PC, Tsai YS, Yeh YM, Shen MR. Biomolecules. 2022 Aug 17;12(8):1133. doi: 10.3390/biom12081133. PMID: 36009026. Free PMC article. Review.
ResidueFinder: extracting individual residue mentions from protein literature.
Becker TE, Jakobsson E. J Biomed Semantics. 2021 Jul 21;12(1):14. doi: 10.1186/s13326-021-00243-3. PMID: 34289903. Free PMC article.
PGxMine: Text mining for curation of PharmGKB.
Lever J, Barbarino JM, Gong L, Huddart R, Sangkuhl K, Whaley R, Whirl-Carrillo M, Woon M, Klein TE, Altman RB. Pac Symp Biocomput. 2020;25:611-622. PMID: 31797632. Free PMC article.
GPDminer: a tool for extracting named entities and analyzing relations in biological literature.
Park YJ, Yang GJ, Sohn CB, Park SJ. BMC Bioinformatics. 2024 Mar 6;25(1):101. doi: 10.1186/s12859-024-05710-z. PMID: 38448845. Free PMC article.
PubMed and beyond: biomedical literature search in the age of artificial intelligence.
Jin Q, Leaman R, Lu Z. EBioMedicine. 2024 Feb;100:104988. doi: 10.1016/j.ebiom.2024.104988. Epub 2024 Feb 1. PMID: 38306900. Free PMC article. Review.
