Front Neuroinform. 2013 Dec 25;7:38. doi: 10.3389/fninf.2013.00038. eCollection 2013.

Virk: an active learning-based system for bootstrapping knowledge base development in the neurosciences

Kyle H Ambert et al.

Abstract

The frequency and volume of newly published scientific literature are quickly making manual maintenance of publicly available databases of primary data unrealistic and costly. Although machine learning (ML) can be useful for developing automated approaches to identifying scientific publications containing information relevant to a database, developing such tools necessitates manually annotating an unrealistic number of documents. One approach to this problem, active learning (AL), builds classification models by iteratively identifying documents that provide the most information to a classifier. Although this approach has been shown to be effective for related problems, it falls short in the context of scientific database curation. We present Virk, an AL system that, while being trained, simultaneously learns a classification model and identifies documents containing information of interest for a knowledge base. Our approach uses a support vector machine (SVM) classifier with input features derived from neuroscience-related publications in the primary literature. Using our approach, we were able to increase the size of the Neuron Registry, a knowledge base of neuron-related information, by 90% in 3 months. Using standard biocuration methods, it would have taken between 1 and 2 years to make the same number of contributions to the Neuron Registry. Here, we describe the system pipeline in detail and evaluate its performance against other approaches to sampling in AL.
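To make the approach concrete, the following Python sketch shows pool-based active learning with a linear SVM and uncertainty sampling, using scikit-learn. The corpus, batch size, and feature extraction are illustrative placeholders and are not the authors' actual Virk pipeline.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical labeled seed set and unlabeled pool; 1 = relevant to the knowledge base.
seed_texts = ["dorsal root ganglion neuron morphology",
              "unrelated clinical trial report"]
seed_labels = np.array([1, 0])
pool_texts = ["pyramidal neuron dendrite reconstruction",
              "hospital billing policy update",
              "retinal ganglion cell electrophysiology"]

# Bag-of-words features (placeholder for the MEDLINE-derived features used in the paper).
vectorizer = TfidfVectorizer()
X_seed = vectorizer.fit_transform(seed_texts)
X_pool = vectorizer.transform(pool_texts)

# Linear SVM classifier, retrained at each active learning iteration.
clf = SVC(kernel="linear")
clf.fit(X_seed, seed_labels)

# Uncertainty sampling: query the pool documents closest to the decision boundary
# (smallest absolute decision-function value) for human annotation in the next round.
margins = np.abs(clf.decision_function(X_pool))
batch_size = 2  # the study annotates 30 documents per iteration; 2 here for brevity
query_idx = np.argsort(margins)[:batch_size]
print("Documents to annotate next:", [pool_texts[i] for i in query_idx])

In a full loop, the newly annotated documents would be added to the training set and the classifier refit before the next batch is selected.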

Keywords: active learning; biocuration; community-curated database; machine learning; neuroinformatics; text-mining.


Figures

Figure 1
Diagrammatic representation of the distribution of documents in our corpus, and workflow diagram for annotation in the active learning experiment. Documents were randomly allocated to the initial training, active learning pool, validation, or test collections. A total of 962 documents were annotated during these experiments: 670 during the active learning procedure, 92 during the random validation experiments, and an additional 200 for the hold-out test collection used to evaluate the Virk system against the random system.
Figure 2
Performance statistics for the active learning system over iterations of document curation. The gray bars show the cumulative number of positively-annotated documents, while the black dotted line indicates the total number of documents annotated at a given iteration (increasing by 30 at each iteration after the first). The solid black line intersected with solid red lines indicates the estimated number of randomly-selected documents (±90% CI) that, at any iteration, would need to be annotated in order to obtain the same number of positive documents identified by Virk by that iteration. After three rounds of annotation, the average number of documents that would have to be read to obtain a given number of positives is statistically significantly greater than the number needing to be annotated with the Virk system.
Figure 3
Relative performance, in terms of AUC, of a system trained using only the uncertain documents at each iteration of active learning (black) versus one trained using all documents annotated up to that point (red). At iteration 1, both systems are trained using the same data (the 100 initially-annotated documents) and thus score the same. After that, the system trained using all the data consistently outperforms the one using only the uncertain data, though the two converge to similar values after 20 iterations.
Figure 4
Performance evaluation comparing the AUC of the Virk (red line) and Random Validation (black line) systems over iterations of active learning. The Random Validation system AUC was averaged over 10 random samplings so that the standard error could be calculated (bars). After six iterations, the Virk system outperforms the 95% confidence interval for the random validation system.
Figure 5
Number of positive-class documents identified over 20 iterations by Virk (red line) and Random Validation (black line, ±95% confidence interval). After 20 iterations, the random validation system had identified only as many positives as the Virk system found after three iterations.
Figure 6
Goodness:work ratio over iterations of active learning. No data exist at the first iteration because no active learning has yet taken place. Between iterations 2 and 7, the goodness:work ratio increases; it then remains approximately level until iteration 15, where it begins to decline.
Figure 7
The top 20 rank-ordered features, in terms of information gain, over iterations of active learning. The color of the text denotes which section of the document's associated MEDLINE record the term came from: title (black), abstract (red), or MeSH (blue). Certain terms, such as "ganglion," are found across many iterations, though their position in the rank-ordered list changes, while others, such as "via," were less informative, appearing only in the first iteration.
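The sketch below shows one way to compute and rank binary term features by information gain, the criterion named in this figure; the toy document-term matrix and function names are hypothetical and are not taken from the Virk implementation.

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a binary label vector.
    p = np.bincount(labels, minlength=2) / len(labels)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(term_presence, labels):
    # Reduction in label entropy from conditioning on a term's presence/absence.
    gain = entropy(labels)
    for value in (0, 1):
        mask = term_presence == value
        if mask.any():
            gain -= mask.mean() * entropy(labels[mask])
    return gain

# Toy document-term matrix (4 documents x 3 terms) and relevance labels (1 = positive).
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])
gains = [information_gain(X[:, j], y) for j in range(X.shape[1])]
print("Terms ranked by information gain:", np.argsort(gains)[::-1], gains)

Here the first two terms split the labels perfectly (gain of 1 bit), while the third is uninformative (gain of 0), mirroring how informative and uninformative terms separate in the rank-ordered list.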
