Review
BMC Bioinformatics. 2010 May 25;11:278.
doi: 10.1186/1471-2105-11-278.

Semantic annotation of morphological descriptions: an overall strategy

Hong Cui. BMC Bioinformatics.

Abstract

Background: Large volumes of morphological descriptions of whole organisms have been created as print or electronic text in a human-readable format. Converting these descriptions into computer-readable formats gives new life to this valuable knowledge on biodiversity. Research in this area started 20 years ago, yet insufficient progress has been made toward an automated system that requires only minimal human intervention and works on descriptions of various plant and animal groups. This paper examines the hindering factors by identifying the mismatches between existing research and the characteristics of morphological descriptions.

Results: This paper reviews the techniques that have been used for automated annotation, reports exploratory results on characteristics of morphological descriptions as a genre, and identifies challenges facing automated annotation systems. Based on these findings, the paper proposes an overall strategy for converting descriptions of various taxon groups with the least human effort.

Conclusions: A combined unsupervised and supervised machine learning strategy is needed to construct domain ontologies and lexicons and ultimately to achieve automated semantic annotation of morphological descriptions. Further, we suggest that each effort to create a new description or annotate an individual description collection should be shared and should contribute to the "biodiversity information commons" for the Semantic Web. This cannot be done without a sound strategy and a close partnership between information scientists and biologists.

Figures

Figure 1
An annotated morphological description. Text enclosed in "<>" is a tag. Bold font represents paragraph-level annotation, bold italic font clause-level annotation, and italic font character-level annotation. The annotation was produced by a system the author created for FNA.
Figure 2
Two regular expression patterns. The first (Soderland, 1999) is for extracting bedroom number and rent from apartment rental ads. The pattern extracts the digit before "BR" as the number of bedrooms ($1) and the number after a "$" as the rent ($2). The pattern produces the correct result for Input 1 but a wrong result for Input 2, as $600 was the price for one room, not four rooms. The pattern will not match or extract anything from "1 large BR $500" or "1 master BR $500." The second (Tang & Heidorn, 2007) extracts leaf blade dimension by looking for a range between the words "blade" and "base."
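The behavior described in this caption can be illustrated with a minimal Python sketch. The figure's exact patterns and inputs are not reproduced in this text, so the two regular expressions below are hypothetical approximations written for Python's `re` module, and the sample ads and the leaf description are invented for illustration only.

```python
import re

# Hypothetical approximation of the first pattern: a single digit directly
# before "BR" is captured as the bedroom count, and the first number after
# a "$" is captured as the rent.
RENTAL = re.compile(r"(\d)\s*BR\b.*?\$\s*(\d+)")

# Hypothetical approximation of the second pattern: a numeric range with a
# unit that appears between the words "blade" and "base".
BLADE = re.compile(
    r"blade\b.*?(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*(mm|cm|m)\b.*?\bbase",
    re.IGNORECASE,
)


def extract_rental(ad):
    """Return (bedrooms, rent) if the rental pattern matches, else None."""
    m = RENTAL.search(ad)
    return (int(m.group(1)), int(m.group(2))) if m else None


def extract_blade_range(description):
    """Return (low, high, unit) for the blade dimension, else None."""
    m = BLADE.search(description)
    return (float(m.group(1)), float(m.group(2)), m.group(3)) if m else None


# Invented sample inputs (not the figure's Input 1 and Input 2).
print(extract_rental("2 BR apartment near campus, $850 per month"))
print(extract_rental("4 BR house, rooms rented individually at $600 each"))
print(extract_rental("1 large BR $500"))
print(extract_blade_range("Leaf blade ovate, 3-7 cm, base cuneate"))
```

The second ad shows the failure mode the caption describes: the pattern pairs 4 bedrooms with $600 even though that price applies to one room. The third ad shows the brittleness: an adjective between the digit and "BR" prevents any match.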
Figure 3
The counts of new domain concepts in Part V of TIP using different sized common word filters.
Figure 4
The counts of new domain concepts in FNA using different sized common word filters.
Figure 5
The counts of new domain concepts in FOC using different sized common word filters.
Figure 6
Parsing trees produced by the Stanford Parser for descriptive sentences. The first two trees contrast the incorrect parsing of a descriptive sentence in the deviated grammar with the correct parsing of a similar sentence in standard English grammar. The remaining trees contrast the incorrect parsing of three typical descriptive clauses in the deviated syntax with the correct parsing obtained when the correct Part of Speech (POS) tags were given to the parser. The nodes closest to the words in the parsing trees are the POS tags.
Figure 7
An overall strategy for automated semantic annotation of morphological descriptions of various taxon groups.

References

    1. Flora of North America Editorial Committee (Eds): Flora of North America. http://www.fna.org/
    2. Tang X, Heidorn PB. Using automatically extracted information in species page retrieval. Proceedings of TDWG. 2007. http://www.tdwg.org/proceedings/article/view/195
    3. Cui H, Macklin J, Yu C. Application of semantic annotation for quality insurance in biosystematics publishing. Proceedings of the Annual Meeting of American Society of Information Science and Technology 2009 (in CD) 2009.
    4. Taylor A. Extracting knowledge from biological descriptions. Proceedings of 2nd International Conference on Building and Sharing Very Large-Scale Knowledge Bases. 1995. pp. 114–119.
    5. Abascal R, Sanchenz J. X-tract: Structure extraction from botanical textual descriptions. Proceeding of the String Processing & Information Retrieval Symposium. 1999. pp. 2–7.