tmVar 2.0: integrating genomic variant information from literature with dbSNP and ClinVar for precision medicine
- PMID: 28968638
- PMCID: PMC5860583
- DOI: 10.1093/bioinformatics/btx541
Abstract
Motivation: Despite significant efforts in expert curation, the clinical relevance of most of the 154 million dbSNP reference variants (RS) remains unknown. However, a wealth of knowledge about the biological function and disease impact of variants is buried in unstructured literature data. Previous studies have attempted to harvest and unlock such information with text-mining techniques, but they are of limited use because their mutation extraction results are not standardized or integrated with curated data.
Results: We propose an automatic method to extract variant mentions and normalize them to unique identifiers (dbSNP RSIDs). In benchmarking, our method achieves a high F-measure of ∼90% and compares favorably to the state of the art. Next, we applied our approach to the entire PubMed and validated the results by verifying that each extracted variant-gene pair matched the dbSNP annotation based on mapped genomic position, and by analyzing variants curated in ClinVar. We then determined which text-mined variants and genes constituted novel discoveries. Our analysis reveals 41 889 RS numbers (associated with 9151 genes) not found in ClinVar. Moreover, we obtained a rich set worth further review: 12 462 rare variants (MAF ≤ 0.01) in 3849 genes, which are presumed to be deleterious and are not frequently found in the general population. To our knowledge, this is the first large-scale study to analyze and integrate text-mined variant data with curated knowledge in existing databases. Our results suggest that databases can be significantly enriched by text mining and that the combined information can greatly assist human efforts in evaluating and prioritizing variants in genomic research.
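The comparison against ClinVar and the rare-variant filter described in the Results can be illustrated with a minimal sketch. This is not the tmVar 2.0 implementation: the function names, the in-memory sets and dictionary, and the placeholder identifiers (`rs1`, `rs2`, ...) are all assumptions made for illustration. Only the MAF ≤ 0.01 cutoff comes from the abstract.

```python
from typing import Dict, Set

# Cutoff taken from the abstract: rare variants have MAF <= 0.01.
RARE_MAF_THRESHOLD = 0.01


def novel_rsids(text_mined: Set[str], clinvar: Set[str]) -> Set[str]:
    """Return text-mined RSIDs that are absent from ClinVar (set difference)."""
    return text_mined - clinvar


def rare_variants(
    rsids: Set[str],
    maf_by_rsid: Dict[str, float],
    threshold: float = RARE_MAF_THRESHOLD,
) -> Set[str]:
    """Keep RSIDs with a known minor allele frequency at or below the threshold.

    RSIDs without a recorded MAF are excluded rather than assumed rare.
    """
    return {
        rs for rs in rsids
        if rs in maf_by_rsid and maf_by_rsid[rs] <= threshold
    }


# Hypothetical toy data; these identifiers are placeholders, not real variants.
text_mined = {"rs1", "rs2", "rs3", "rs4"}
clinvar = {"rs2"}
maf_by_rsid = {"rs1": 0.005, "rs2": 0.001, "rs3": 0.2}

novel = novel_rsids(text_mined, clinvar)
rare_novel = rare_variants(novel, maf_by_rsid)
```

In this toy example, `rs2` is dropped because it already appears in the ClinVar set, `rs3` is dropped because its frequency exceeds the cutoff, and `rs4` is dropped because no MAF is recorded for it, leaving only `rs1`.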
Availability and implementation: The tmVar 2.0 source code and corpus are freely available at https://www.ncbi.nlm.nih.gov/research/bionlp/Tools/tmvar/.
Contact: zhiyong.lu@nih.gov.
Published by Oxford University Press 2017. This work is written by US Government employees and is in the public domain in the US.