MicroPubl Biol. 2022 Jun 1;2022:10.17912/micropub.biology.000578.
doi: 10.17912/micropub.biology.000578. eCollection 2022.

Accelerated variant curation from scientific literature using biomedical text mining

Rishab Mallick et al. MicroPubl Biol. 2022.

Abstract

Biological databases collect and standardize data through biocuration. Although major model organism databases have automated some curation methods, a large portion of biocuration is still performed manually. To speed up the extraction of the genomic positions of variants, we have developed a hybrid approach that combines regular expressions, Named Entity Recognition based on BERT (Bidirectional Encoder Representations from Transformers), and bag-of-words to extract variant genomic locations from C. elegans papers for WormBase. Tested on text extracted from 100 papers, our model achieves a precision of 82.59% for gene-mutation matches and even recovers some data not discovered during manual curation. Code at: https://github.com/WormBase/genomic-info-from-papers.
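
As a rough illustration of two of the three components named in the abstract (regular expressions and bag-of-words), the sketch below pulls one-letter amino acid substitutions out of a sentence and flags context words. The pattern, the `CONTEXT_WORDS` vocabulary, and the `extract_mentions` helper are hypothetical choices made for this example; they are not taken from the authors' repository, and the BERT-based NER step is omitted entirely.

```python
import re

# Hypothetical pattern: one-letter amino acid substitutions such as "P1465S",
# or a stop codon written as "*". Illustrative only, not the authors' regex set.
AA = "ACDEFGHIKLMNPQRSTVWY"
POINT_MUTATION = re.compile(rf"\b([{AA}])(\d+)([{AA}]|\*)(?![A-Za-z0-9])")

# Hypothetical bag-of-words vocabulary hinting at curation-relevant context.
CONTEXT_WORDS = {"allele", "mutation", "mutant", "substitution", "strain"}


def extract_mentions(sentence):
    """Return (mutation mentions, sorted context words) found in one sentence."""
    mentions = [m.group(0) for m in POINT_MUTATION.finditer(sentence)]
    # Lowercased alphabetic tokens form the bag of words for this sentence.
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return mentions, sorted(tokens & CONTEXT_WORDS)


if __name__ == "__main__":
    example = "The daf-2(e1370) allele carries a P1465S substitution."
    print(extract_mentions(example))
```

A pattern this simple will also match unrelated tokens that happen to look like substitutions, which is one reason a pipeline of this kind would pair it with a learned tagger and a later validation step.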

Figures

Figure 1. Overview of the data extraction process and key results
(A) Using database publication IDs as input, sentences from the corresponding papers are extracted with wbtools or Textpresso and then processed by the hybrid system, which extracts all mutation mentions. Additional information relevant for curation is then extracted using bag-of-words. Gene and mutation matches are verified by checking that the reference base/codon is possible in at least one transcript of the gene. The final matches and additional data (e.g., strain name) are then presented in a structured output for validation by curators before being used to update the database. (B) Metrics for classifying sentences that contain a mutation mention, tested on the IDP4+ corpus.
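
The verification step in panel (A) can be sketched for the protein-level case as follows: a gene-mutation match is kept only if the stated reference residue actually occurs at the stated position in at least one transcript's translation. The `reference_matches` helper and the toy sequences are assumptions made for illustration; the real pipeline's data sources and its handling of nucleotide-level mentions are not shown in this record.

```python
def reference_matches(ref_residue, position, transcripts):
    """Return names of transcripts whose protein has ref_residue at the
    given 1-based position. An empty list means the match fails the check."""
    return [
        name
        for name, protein in transcripts.items()
        if 1 <= position <= len(protein) and protein[position - 1] == ref_residue
    ]


if __name__ == "__main__":
    # Hypothetical translations for two isoforms of a made-up gene.
    toy_transcripts = {"toy-1a": "MKTAYIAK", "toy-1b": "MKSAY"}
    # A mention such as "T3I" claims residue T at position 3.
    print(reference_matches("T", 3, toy_transcripts))
    # A claimed reference residue that fits no isoform is rejected.
    print(reference_matches("G", 3, toy_transcripts))
```

Requiring agreement with only one transcript keeps isoform-specific variants while still filtering out matches whose reference residue fits no known translation of the gene.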
