A hybrid human and machine resource curation pipeline for the Neuroscience Information Framework

A E Bandrowski¹, J Cachat, Y Li, H M Müller, P W Sternberg, P Ciccarese, T Clark, L Marenco, R Wang, V Astakhov, J S Grethe, M E Martone

PMID: 22434839
PMCID: PMC3308161
DOI: 10.1093/database/bas005

A E Bandrowski et al. Database (Oxford). 2012 Mar 20;2012:bas005. doi: 10.1093/database/bas005. Print 2012.

Affiliation

¹ Center for Research in Biological Systems, University of California San Diego, CA, USA. abandrowski@ucsd.edu

Abstract

The breadth of information resources available to researchers on the Internet continues to expand, particularly in light of recently implemented data-sharing policies required by funding agencies. However, the nature of dense, multifaceted neuroscience data and the design of contemporary search engine systems make efficient, reliable and relevant discovery of such information a significant challenge. This challenge is particularly pertinent for online databases, whose dynamic content is 'hidden' from search engines. The Neuroscience Information Framework (NIF; http://www.neuinfo.org) was funded by the NIH Blueprint for Neuroscience Research to address the problem of finding and utilizing neuroscience-relevant resources such as software tools, data sets, experimental animals and antibodies across the Internet. From the outset, NIF sought to provide an accounting of available resources while developing technical solutions for finding, accessing and utilizing them. The curators are therefore tasked with identifying and registering resources, examining data, writing configuration files to index and display data, and keeping the contents current. In the initial phases of the project, all aspects of the registration and curation processes were manual. However, as the number of resources grew, manual curation became impractical. This report describes our experiences and successes with developing automated resource discovery and semi-automated type characterization with text-mining scripts that help the curation team discover, integrate and display new content. We also describe the DISCO framework, a suite of automated web services that significantly reduces the manual curation effort needed to periodically check for resource updates. Lastly, we discuss DOMEO, a semi-automated annotation tool that improves the discovery and curation of resources that are not necessarily website-based (e.g. reagents, software tools). Although the ultimate goal of automation was to reduce the workload of the curators, it has also yielded valuable analytic by-products that address accessibility, use and citation of resources and that can now be shared with resource owners and the larger scientific community.

Database URL: http://neuinfo.org.

Figures

**Figure 1.**
NIF Resource Landscape. Background: each point on the map represents a global location that houses one or more resources registered with the NIF (via NeuroLex). Red points represent NIF Registry entries and blue points represent databases and data sets incorporated into the Data Federation. Foreground: the blue line plots the number of federated data sources over time, and the green line plots the number of records in the NIF Data Federation over the same period (note that these records come only from the blue points and that the scale is logarithmic). The DISCO protocols and automated resource crawling were integrated into NIF system function in November 2009, leading to growth in NIF holdings. In mid-2011, significant enhancements to the DISCO protocols, which allow greater automation of data ingestion, were implemented along with the current resource discovery pipeline (RDP).

**Figure 2.**
A high-level overview of the NIF system. This figure emphasizes where the inputs and outputs of the NIF lie in relation to some of NIF's tools. Red arrows represent human steps, blue arrows represent automated steps and green boxes represent places in the system where community interactions are likely. Data are entered using a suite of tools including NeuroLex (the first step for all data ingestion), DISCO (for deep data registration), LinkOut (for linking data to PubMed and PubMed Central literature), DOMEO (for literature annotation) and the RDP, an automated text-mining resource discovery pipeline that recognizes resources and recommends them to curators for possible inclusion in the NIF Registry. The creation of indices is informed by the ontology, as are the search tools and public web services. Note that all data move through a process in which they are recommended, registered in NeuroLex, included in the NIF Registry index and then made available to DISCO tools for deeper content integration.

**Figure 3.**
The NIF Registration Pipeline. The NIF registration pipeline starts at a wiki page for each resource (i). This step shows an example public wiki page for the ModelDB resource. Anyone can nominate a resource, and the curators then standardize the entry. The resource owner can change the description by clicking the edit button and adding information to the form, and can sign up to watch the page in order to be notified whenever changes are made. When the description is adequate, the curators change the curation status to ‘curated’ and the ‘click here to generate sitemap’ link becomes visible. This link activates the DISCO system, which generates a sitemap file using the text from the stable version of the resource in the NeuroLex wiki (ii). The event tracking system is then activated, generating an email to the resource-provider tracking group in NIF, and instructions prompt the user to download the DISCO interop file (iii) and place it in the root directory of the resource. When this is complete, the DISCO dashboard updates and a new page is generated for the resource (iv). This page allows the curators or the resource owner to regenerate or edit the files that were created, schedule a crawl frequency and add further files that allow deeper interoperation with NIF, such as including data in the Data Federation.
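The registration steps (i) to (iv) in the Figure 3 caption can be read as a simple forward-only state machine. The following is a minimal sketch under that assumption, not the NIF or DISCO implementation: the class, status and method names are all hypothetical and exist only to illustrate the order of the steps and the page-watching notification described in the caption.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    """Hypothetical curation states, one per step in the caption."""
    NOMINATED = "nominated"                        # (i) wiki page exists
    CURATED = "curated"                            # description judged adequate
    SITEMAP_GENERATED = "sitemap generated"        # (ii) sitemap file created
    INTEROP_INSTALLED = "interop file installed"   # (iii)/(iv) dashboard page available


# Allowed forward transitions; the final state has no successor.
NEXT_STATUS = {
    Status.NOMINATED: Status.CURATED,
    Status.CURATED: Status.SITEMAP_GENERATED,
    Status.SITEMAP_GENERATED: Status.INTEROP_INSTALLED,
}


@dataclass
class ResourceEntry:
    """Illustrative stand-in for one resource page in the registry."""
    name: str
    status: Status = Status.NOMINATED
    watchers: list = field(default_factory=list)
    notifications: list = field(default_factory=list)

    def advance(self) -> Status:
        """Move to the next step and record a notice for each page watcher."""
        if self.status not in NEXT_STATUS:
            raise ValueError(f"{self.name} has already completed registration")
        self.status = NEXT_STATUS[self.status]
        for watcher in self.watchers:
            self.notifications.append((watcher, self.status.value))
        return self.status

    def sitemap_link_visible(self) -> bool:
        """The sitemap link is hidden until the entry has been curated."""
        return self.status is not Status.NOMINATED
```

For example, `ResourceEntry("ModelDB", watchers=["owner@example.org"])` starts in the nominated state with the sitemap link hidden, and each call to `advance()` moves it one step further while recording a notice for the watching owner.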
