BMC Bioinformatics. 2005;6 Suppl 1(Suppl 1):S2.
doi: 10.1186/1471-2105-6-S1-S2. Epub 2005 May 24.

BioCreAtIvE task 1A: gene mention finding evaluation

Alexander Yeh et al. BMC Bioinformatics. 2005.

Abstract

Background: The biological research literature is a major repository of knowledge. As the amount of literature increases, it will get harder to find the information of interest on a particular topic. There has been an increasing amount of work on text mining this literature, but comparing this work is hard because of a lack of standards for making comparisons. To address this, we worked with colleagues at the Protein Design Group, CNB-CSIC, Madrid to develop BioCreAtIvE (Critical Assessment for Information Extraction in Biology), an open common evaluation of systems on a number of biological text mining tasks. We report here on task 1A, which deals with finding mentions of genes and related entities in text. "Finding mentions" is a basic task, which can be used as a building block for other text mining tasks. The task makes use of data and evaluation software provided by the (US) National Center for Biotechnology Information (NCBI).

Results: Fifteen teams took part in task 1A. A number of teams achieved scores over 80% F-measure (balanced precision and recall). The teams that tried to use their task 1A systems to help on other BioCreAtIvE tasks reported mixed results.
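
As an illustration of the "balanced precision and recall" score mentioned above, the following is a minimal sketch of the standard balanced F-measure, F = 2PR / (P + R). It is not the NCBI evaluation software used in the task. The function name and the example counts are hypothetical, and how a predicted mention is judged to match a reference mention is not specified in this abstract, so the sketch starts from counts that are assumed to be already computed.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, balanced F-measure) from mention counts.

    Hypothetical helper for illustration only; the counts are assumed to
    come from some prior comparison of predicted and reference mentions.
    """
    predicted = true_positives + false_positives   # mentions the system proposed
    actual = true_positives + false_negatives      # mentions in the reference data

    # Guard against division by zero when a side is empty.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0

    # Balanced F-measure: harmonic mean of precision and recall.
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Made-up counts, not results from the evaluation.
p, r, f = precision_recall_f(true_positives=60, false_positives=20, false_negatives=40)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a system cannot reach a high F-measure by maximizing precision or recall alone.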

Conclusion: The F-measure results above 80% are good, but they still lag somewhat behind the best scores achieved in some other domains such as newswire, due in part to the complexity and length of gene names compared with person or organization names in newswire.

Figures

Figure 1. Balanced F-scores of the 40+4 submissions.
Figure 2. Precision versus recall of the 40+4 submissions.
Figure 3. Sample phrase with problematic tokenization (red vertical bars give tokenization boundaries).
Figure 4. Percent of names of a given length for BioCreAtIvE task 1A gene names and MUC-6 organization names.
