Database (Oxford). 2020 Jan 1:2020:baz152.
doi: 10.1093/database/baz152.

Building a pipeline to solicit expert knowledge from the community to aid gene summary curation

Giulia Antonazzo et al. Database (Oxford).

Abstract

Brief summaries describing the function of each gene's product(s) are of great value to the research community, especially when interpreting genome-wide studies that reveal changes to hundreds of genes. However, manually writing such summaries, even for a single species, is a daunting task; for example, the Drosophila melanogaster genome contains almost 14 000 protein-coding genes. One solution is to use computational methods to generate summaries, but this often fails to capture the key functions or express them eloquently. Here, we describe how we solicited help from the research community to generate manually written summaries of D. melanogaster gene function. Based on the data within the FlyBase database, we developed a computational pipeline to identify researchers who have worked extensively on each gene. We e-mailed these researchers to ask them to draft a brief summary of the main function(s) of the gene's product, which we edited for consistency to produce a 'gene snapshot'. This approach yielded 1800 gene snapshot submissions within a 3-month period. We discuss the general utility of this strategy for other databases that capture data from the research literature. Database URL: https://flybase.org/.
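The abstract does not specify how the pipeline decides who has "worked extensively" on a gene. As a minimal sketch, assuming the simplest plausible criterion (counting how many publications linked to a gene list a given author), the selection step might look like the following. The records, the function name `rank_authors_by_gene` and the `top_n` parameter are all hypothetical illustrations and do not come from FlyBase or from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical (gene, publication ID, author list) records.
# These are invented for illustration and are not FlyBase data.
records = [
    ("geneA", "pub1", ["Author X", "Author Y"]),
    ("geneA", "pub2", ["Author X"]),
    ("geneA", "pub3", ["Author Z", "Author X"]),
    ("geneB", "pub4", ["Author Y"]),
    ("geneB", "pub5", ["Author Y", "Author Z"]),
]


def rank_authors_by_gene(records, top_n=1):
    """Return, for each gene, the top_n authors by publication count.

    Assumption: publication count stands in for "worked extensively";
    the real pipeline may weigh other signals.
    """
    counts = defaultdict(Counter)
    for gene, _pub_id, authors in records:
        # set() guards against an author listed twice on one paper.
        for author in set(authors):
            counts[gene][author] += 1

    ranked = {}
    for gene, counter in counts.items():
        # Sort by descending count, then by name so ties are deterministic.
        ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        ranked[gene] = [author for author, _count in ordered[:top_n]]
    return ranked


candidates = rank_authors_by_gene(records)
print(candidates)
```

Raising `top_n` would yield backup contacts for a gene when the first author does not respond. A real implementation would also need to disambiguate author names and resolve current e-mail addresses, which this sketch leaves out.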

Figures

Figure 1. Author response rates in the different cycles. Pilot cycle: semi-manually selected authors. Second cycle: predicted authors, no gene categorization. Third cycle: predicted authors, with gene categorization.
Figure 2. The relationship between the number of genes for which a given author was asked to provide snapshots and the average fraction of genes for which snapshots were returned (not including authors sent a spreadsheet with many genes). The numbers above the data points indicate the number of authors in each category.
Figure 3. Screenshot of the top of a gene report page showing the Gene Snapshot section.
Figure 4. Overview of pipeline to produce Gene Snapshots.
