Database (Oxford). 2020 Jan 1:2020:baaa006.
doi: 10.1093/database/baaa006.

Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase

Valerio Arnaboldi et al. Database (Oxford).

Abstract

Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors while receiving valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to other knowledgebases that would like to engage their user community in assisting with curation. In the five months following the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received have greatly improved.
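The string-searching step mentioned in the abstract can be illustrated with a minimal sketch. This is not the WormBase pipeline: the function name `extract_entities`, the `min_count` threshold, and the toy gene lexicon are all assumptions made for illustration. It matches a lexicon of known entity names against paper text and returns candidates that an author could then confirm or reject in a form.

```python
import re
from collections import Counter


def extract_entities(text, lexicon, min_count=1):
    """Return {name: occurrences} for lexicon names found in text.

    Hypothetical helper for illustration only; the real AFP pipeline
    may use different matching rules and thresholds.
    """
    if not lexicon:
        return {}
    # Try longer names first so "daf-16" is not shadowed by "daf-1".
    names = sorted(lexicon, key=len, reverse=True)
    # Reject matches that are embedded in a longer word or hyphenated token.
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])"
    )
    counts = Counter(m.group(1) for m in pattern.finditer(text))
    # Keep only names mentioned at least min_count times.
    return {name: n for name, n in counts.items() if n >= min_count}


text = (
    "We show that daf-16 acts downstream of daf-2. Loss of daf-16 "
    "suppresses the daf-2 phenotype; lin-12 was unaffected."
)
lexicon = {"daf-16", "daf-2", "daf-1", "lin-12"}
candidates = extract_entities(text, lexicon, min_count=2)
print(sorted(candidates.items()))
```

A threshold such as `min_count` is one simple way to trade recall for precision before showing candidates to authors; whether and how the actual system filters matches is not stated in the abstract.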

Figures

Figure 1
Overview of the AFP system.
Figure 2
AFP feedback form for authors - Overview (genes and species) widget. The form is divided into widgets, listed in the left sidebar, that group the data types into categories to simplify the work for authors. Between the title and the main author feedback area, a component displays the WormBase person name and ID for the identified corresponding author and lets users select a different person through an autocomplete search of the WormBase database. The selected person ID is stored with the curated data in our database when the submission process is completed. Each widget shows a colored panel at the top with specific instructions on how to curate the included data types. An example of the list of extracted genes can be seen in the feedback area at the center of the page.
Figure 3
Feedback form for authors - completed form. When the author clicks ‘Save and continue’ (or ‘Finish and submit’ on the last widget), a pop-up message confirms that the data have been received by WB and the instruction alert turns green. The form returns a ‘Well done!’ message upon the completion of each section and the data are immediately stored in the WB database. In addition, to track progress throughout the author curation process, completed sections are marked on the left menu with a special icon. The authors can modify the submitted data at any time by returning to the form. The figure also shows an example of automatically classified data types, in this case regulatory interactions.
Figure 4
‘My AFP papers’ page. Authors can retrieve links to the AFP feedback form for their papers through the interface on this page.
Figure 5
Curator dashboard. Curators at WB use this page to monitor the AFP system and to compare the data extracted by the pipeline with those submitted by the authors.
Figure 6
Curator dashboard - Statistics page. Curators can collect statistics on the extracted and submitted data through this page.
Figure 7
Curator dashboard - Paper lists. This page of the curator dashboard contains links to the feedback form for authors for all papers processed by the pipeline, divided into different groups depending on their status (i.e. processed but not submitted, full submissions and partial submissions).
Figure 8
Response rate for the old and the new AFP.
Figure 9
Percentage of submissions with at least one gene reported through the old and the new AFP.
Figure 10
Number of genes reported by the authors per submission through the old and the new AFP forms.
