Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2005;6 Suppl 1(Suppl 1):S12.
doi: 10.1186/1471-2105-6-S1-S12. Epub 2005 May 24.

Data preparation and interannotator agreement: BioCreAtIvE task 1B

Affiliations

Data preparation and interannotator agreement: BioCreAtIvE task 1B

Marc E Colosimo et al. BMC Bioinformatics. 2005.

Abstract

Background: We prepared and evaluated training and test materials for an assessment of text mining methods in molecular biology. The goal of the assessment was to evaluate the ability of automated systems to generate a list of unique gene identifiers from PubMed abstracts for the three model organisms Fly, Mouse, and Yeast. This paper describes the preparation and evaluation of answer keys for training and testing. These consisted of lists of normalized gene names found in the abstracts, generated by adapting the gene list for the full journal articles found in the model organism databases. For the training dataset, the gene list was pruned automatically to remove gene names not found in the abstract; for the testing dataset, it was further refined by manual annotation by annotators provided with guidelines. A critical step in interpreting the results of an assessment is to evaluate the quality of the data preparation. We did this by careful assessment of interannotator agreement and the use of answer pooling of participant results to improve the quality of the final testing dataset.

Results: Interannotator analysis on a small dataset showed that our gene lists for Fly and Yeast were good (87% and 91% three-way agreement) but the Mouse gene list had many conflicts (mostly omissions), which resulted in errors (69% interannotator agreement). By comparing and pooling answers from the participant systems, we were able to add an additional check on the test data; this allowed us to find additional errors, especially in Mouse. This led to 1% change in the Yeast and Fly "gold standard" answer keys, but to an 8% change in the mouse answer key.

Conclusion: We found that clear annotation guidelines are important, along with careful interannotator experiments, to validate the generated gene lists. Also, abstracts alone are a poor resource for identifying genes in paper, containing only a fraction of genes mentioned in the full text (25% for Fly, 36% for Mouse). We found that there are intrinsic differences between the model organism databases related to the number of synonymous terms and also to curation criteria. Finally, we found that answer pooling was much faster and allowed us to identify more conflicting genes than interannotator analysis.

PubMed Disclaimer

Figures

Figure 1
Figure 1
Sample Abstract and "Noisy" Gene List. Underlined and bold words in the PubMed Abstract are the genes were found in the text. The answer key consists of four columns. Column 1 is the file name; column 2 has the model organism unique database identifiers. Column 3 shows whether the gene was found automatically in the abstract (Y), not found and pruned (N), or added by hand (X). Column 4 shows the final set of genes in the answer key. This answer key shows that two genes were given by the database curators (FBgn0000592, and FBgn0002722); the first one was found in the abstract, the second one was not. The third gene (FBgn0026412) was found by our annotators based on the guidelines.
Figure 2
Figure 2
Changes in Participant's Mouse F-measures. Graph showing the differences between the participant's original F-measure and their final F-measure.

References

    1. Hirschman L, Colosimo M, Morgan A, Yeh A. Overview of BioCreAtIvE Task 1B: Normalized Gene Lists. BMC Bioinformatics. 2005;6:S11. doi: 10.1186/1471-2105-6-S1-S11. - DOI - PMC - PubMed
    1. Jensen RA. Orthologs and paralogs – we need to get it right. Genome Biol. 2001;2:INTERACTIONS1002. doi: 10.1186/gb-2001-2-8-interactions1002. - DOI - PMC - PubMed
    1. Mewes HW, Albermann K, Bahr M, Frishman D, Gleissner A, Hani J, Heumann K, Kleine K, Maierl A, Oliver SG, et al. Overview of the yeast genome. Nature. 1997;387:7–65. - PubMed
    1. The FlyBase database of the Drosophila genome projects and community literature. Nucleic Acids Res. 2003;31:172–175. doi: 10.1093/nar/gkg094. - DOI - PMC - PubMed
    1. The FlyBase Database http://flybase.org/

Publication types

MeSH terms

LinkOut - more resources