BMC Bioinformatics. 2008 Apr 29;9 Suppl 5(Suppl 5):S5.
doi: 10.1186/1471-2105-9-S5-S5.

Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

Irena Spasić et al. BMC Bioinformatics. 2008.

Abstract

Background: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, and also with data produced by other types of omics studies in the spirit of systems biology; hence the pressing need for vocabularies and ontologies in metabolomics. However, constructing these resources manually is time-consuming and non-trivial.

Results: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms, respectively. A total of 5,699 and 2,612 new terms, respectively, were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections), rather than paper abstracts, are the major source of technology-specific terms.

Conclusions: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.

Figures

Figure 1
The flow of data in a TM approach to CV expansion. The information retrieval module is used to gather a corpus of documents relevant for a given CV from the literature databases. Automatic term recognition is applied against the corpus to extract terms as domain-specific lexical units. Some of the extracted terms not directly related to the CV are filtered out by using the knowledge about typically co-occurring types of terms.
Figure 2
A sub-tree of the MeSH hierarchy. We show part of the MeSH hierarchy relevant for the two CVs (i.e. NMR and GC) considered.
Figure 3
An HTML report summarising CV expansion results
Figure 4
Citation details of the retrieved documents
Figure 5
A full-text document retrieved from PMC
Figure 6
A corpus of “Materials and Methods” sections
Figure 7
A list of automatically extracted terms with links to their concordances
Figure 8
Distribution of evaluation scores for NMR
Figure 9
Distribution of evaluation scores for GC
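
The data flow described in the Figure 1 caption (retrieve a corpus, extract candidate terms, filter out candidates unrelated to the CV) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the keyword-based `retrieve` function, the stopword-free n-gram heuristic standing in for automatic term recognition, the rule that a candidate must share a sentence with a known CV term, and the toy `DOCS`, `QUERY_WORDS` and `SEED_TERMS` values are all assumptions made purely for illustration.

```python
import re
from collections import Counter

# Hypothetical inputs for illustration only; none of this is data from the paper.
SEED_TERMS = {"chemical shift", "pulse sequence"}   # stand-in for a manually compiled CV
QUERY_WORDS = ["nmr", "pulse", "spectra"]            # stand-in for a literature query
STOPWORDS = {"the", "a", "an", "of", "and", "was", "were", "with",
             "in", "to", "for", "at", "by", "using", "on"}
DOCS = [
    "Spectra were acquired with a standard pulse sequence and a spectral width of 12 ppm.",
    "The pulse sequence used a spectral width of 10 ppm. The chemical shift was referenced to TSP.",
    "Samples were stored at room temperature.",
]


def retrieve(documents, query_words):
    """Information retrieval stand-in: keep documents that mention any query word."""
    return [d for d in documents if any(q in d.lower() for q in query_words)]


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? (assumes no decimals or abbreviations)."""
    return [s.strip().lower() for s in re.split(r"[.!?]+", text) if s.strip()]


def candidate_terms(sentence, max_len=3):
    """Yield word n-grams of 2..max_len words that contain no stopwords."""
    words = re.findall(r"[a-z][a-z0-9-]*", sentence)
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                yield " ".join(gram)


def expand_vocabulary(documents, query_words, seed_terms, min_freq=2):
    """Return new candidate terms that are frequent and co-occur with a seed term."""
    counts = Counter()
    near_seed = set()
    for doc in retrieve(documents, query_words):
        for sent in split_sentences(doc):
            has_seed = any(t in sent for t in seed_terms)
            for term in candidate_terms(sent):
                counts[term] += 1
                if has_seed:
                    near_seed.add(term)
    # Filtering step: drop rare candidates, candidates never seen next to a
    # seed term, and terms that are already in the vocabulary.
    return sorted(t for t, c in counts.items()
                  if c >= min_freq and t in near_seed and t not in seed_terms)


new_terms = expand_vocabulary(DOCS, QUERY_WORDS, SEED_TERMS)
print(new_terms)  # ['spectral width']
```

On this toy corpus the third document is discarded at the retrieval step, and "spectral width" is the only candidate that is both frequent enough and found alongside a seed term. A realistic system would replace each stand-in with a stronger component, for example a proper termhood measure and a curated model of which term types co-occur.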
