JAMIA Open. 2025 Jan 7;8(1):ooae129.
doi: 10.1093/jamiaopen/ooae129. eCollection 2025 Feb.

Assessing Artificial Intelligence (AI) Implementation for Assisting Gene Linking (at the National Library of Medicine)

Rezarta Islamaj et al. JAMIA Open.

Abstract

Objectives: The National Library of Medicine (NLM) currently indexes close to a million articles each year from more than 5300 medical and life sciences journals. A significant number of these articles contain critical information about the structure, genetics, and function of genes and proteins in normal and disease states. NLM curators identify these articles and manually link them to the corresponding gene records in the NCBI Gene database. The information is thereby interconnected with all NLM resources and services, which brings considerable value to the life sciences. The NLM aims to provide timely access to all metadata, which requires that article indexing scale to the volume of the published literature. However, although automatic information extraction methods have achieved accurate results in biomedical text mining research, it remains difficult to evaluate them on established pipelines and integrate them into daily workflows.

Materials and methods: Here, we demonstrate how our machine learning model, GNorm2, which achieved state-of-the-art performance in identifying genes and their corresponding species while handling inherent textual ambiguities, could be integrated into the established daily workflow at the NLM and evaluated for its performance in this new environment.

Results: We worked with 8 expert biomedical curators and evaluated the integration on 4 parameters: (1) gene identification accuracy, (2) interannotator agreement with and without GNorm2, (3) potential GNorm2 bias, and (4) indexing consistency and efficiency. We identified key interface changes that significantly helped the curators maximize the benefit of GNorm2 and, based on the expert biocurator survey, further improved the GNorm2 algorithm to cover genes from 135 species, including viral and bacterial genes.
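The first two evaluation parameters can be illustrated with a minimal sketch. The abstract does not say which formulas were used, so this example assumes set-based precision, recall, and F-score for gene identification accuracy, and Jaccard overlap as one possible interannotator agreement measure. The function names and the article and gene identifiers are hypothetical and are not taken from the study.

```python
from typing import Set, Tuple

# An article-gene link as a (PubMed ID, Gene ID) pair.
# All identifiers used below are made-up illustration values.
Link = Tuple[str, str]


def precision_recall_f1(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Score predicted article-gene links against a reference set of links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def jaccard_agreement(curator_a: Set[Link], curator_b: Set[Link]) -> float:
    """Agreement between two curators as the overlap of their link sets.

    Assumption: Jaccard overlap is only one plausible agreement measure;
    the source does not state which measure the authors used.
    """
    union = curator_a | curator_b
    if not union:
        return 1.0  # neither curator created a link, so they trivially agree
    return len(curator_a & curator_b) / len(union)


# Hypothetical links: model output versus curator-approved links.
gold_links = {("100", "7157"), ("100", "672"), ("101", "1956")}
model_links = {("100", "7157"), ("101", "1956"), ("101", "2064")}
precision, recall, f1 = precision_recall_f1(model_links, gold_links)

# Hypothetical links created by two curators for the same article.
agreement = jaccard_agreement(
    {("100", "7157"), ("100", "672")},
    {("100", "7157")},
)
```

Under these assumptions, comparing agreement on article sets curated with GNorm2 suggestions against sets curated without them would give one view of parameter (2).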

Conclusion: GNorm2 is currently being fully integrated into the curators' regular workflow.

Keywords: AI workflow implementation; AI-assisted curation; article indexing; gene identification; gene name entity recognition; gene name normalization.

Conflict of interest statement

None declared.

Figures

Figure 1.
Articles in PubMed grouped by their publication year that contain genes, and those that are linked to a GENE record. Gene linking is a manual expert curation effort that requires time and expertise.
Figure 2.
Three stages of introducing an AI model into an existing workflow.
Figure 3.
AI algorithm regular update process considering user feedback. Step 1 shows the input of new PubMed documents to the GNorm2 algorithm (GNorm2 consists of 4 modules: gene name recognition, species recognition, species assignment, and gene normalization) to produce the annotated set (Step 2). In Step 3, these annotated articles are shown to expert curators who identify any potential issues and corrections (Step 4). The corrected documents become part of the training corpus of GNorm2 (Step 5) and the algorithm is retrained with the new data (Step 6).
Figure 4.
Evaluation setup for the GNorm2 integration. We created 2 copies of the environment to allow the same set of articles to be doubly curated by 2 curators. We collected data on 1600 PubMed articles, of which 800 were processed with GNorm2 results and 800 were not. All articles were collected in groups of 50, and each set of 50 was annotated by 2 curators. For each set, we reviewed the created gene links, and measured interannotator agreement, as well as the time from the first link to the last, which helped to estimate the time to curate the whole set.
Figure 5.
GNorm2 performance on the test dataset as we retrain the algorithm with 100 additional articles at a time. The Y-axis denotes the gene normalization F-score.
Figure 6.
Updated curator interface displaying the genes identified by GNorm2. This interface is very similar to the previous version; however, it includes: (1) on the left panel, the list of suggested MeSH indexing terms and the list of genes that GNorm2 identified in the article, including their GENE concept ID and organism name; (2) on the right panel, highlighting of all gene mentions in the article title and abstract; (3) when a curator hovers the mouse over a highlighted gene name, all the information identified by GNorm2; and (4) on click, the option to create the gene link.
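
The evaluation setup in the Figure 4 caption (1600 articles collected in sets of 50, half processed with GNorm2 results, with curation time estimated from the first to the last created link) can be sketched as follows. This is an illustration only: the helper names, the placeholder article identifiers, the timestamps, and the rule that assigns the first half of the sets to the GNorm2 arm are all assumptions, since the source does not describe how sets were assigned.

```python
from datetime import datetime
from typing import List, Sequence


def make_batches(article_ids: Sequence[str], batch_size: int = 50) -> List[List[str]]:
    """Split the article list into consecutive sets of `batch_size` articles."""
    return [
        list(article_ids[start:start + batch_size])
        for start in range(0, len(article_ids), batch_size)
    ]


def link_span_seconds(link_times: Sequence[datetime]) -> float:
    """Seconds elapsed between the first and the last link created in a set."""
    if len(link_times) < 2:
        return 0.0  # a span needs at least two links
    return (max(link_times) - min(link_times)).total_seconds()


# Placeholder identifiers standing in for the 1600 PubMed articles.
article_ids = [f"article-{n}" for n in range(1600)]
batches = make_batches(article_ids)

# Assumption: the first half of the sets is shown with GNorm2 results.
half = len(batches) // 2
with_gnorm2 = batches[:half]
without_gnorm2 = batches[half:]

# Hypothetical link timestamps for one set curated by one curator.
example_times = [
    datetime(2024, 1, 8, 9, 0, 0),
    datetime(2024, 1, 8, 9, 20, 0),
    datetime(2024, 1, 8, 10, 30, 0),
]
span = link_span_seconds(example_times)
```

Because the span runs only from the first link to the last, it is an estimate of the time needed to curate a set and leaves out any time spent before the first link was created.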
