J Undergrad Neurosci Educ. 2015 Oct 15;14(1):A56-65.
eCollection 2015 Fall.

Undergraduate Biocuration: Developing Tomorrow's Researchers While Mining Today's Data

Cassie S Mitchell et al. J Undergrad Neurosci Educ. 2015.

Abstract

Biocuration is a time-intensive process that involves extraction, transcription, and organization of biological or clinical data from disjointed data sets into a user-friendly database. Curated data is subsequently used primarily for text mining or informatics analysis (bioinformatics, neuroinformatics, health informatics, etc.) and secondarily as a researcher resource. Biocuration is traditionally considered a Ph.D.-level task, but a massive shortage of curators to consolidate the ever-mounting biomedical "big data" opens the possibility of using biocuration to mine today's data while teaching students skill sets they can use in any career. By developing a biocuration assembly line of simplified and compartmentalized tasks, we have enabled biocuration to be performed effectively by a hierarchy of undergraduate students. We summarize the necessary physical resources, the process for establishing a data path, the biocuration workflow, and the undergraduate hierarchy of curation, technical, information technology (IT), quality control, and managerial positions. We detail the undergraduate application and training processes and give detailed job descriptions for each position on the assembly line. We present case studies of neuropathology curation performed entirely by undergraduates, namely the construction of experimental databases of Amyotrophic Lateral Sclerosis (ALS) transgenic mouse models and clinical data from ALS patient records. Our results reveal that undergraduate biocuration is scalable for a group of 8-50+ with relatively minimal required resources. Moreover, with average accuracy rates greater than 98.8%, undergraduate biocurators are as accurate as their professional counterparts. Initial training to full entry-level proficiency takes about five weeks, with a minimum student time commitment of four hours/week.

Keywords: big data; biocuration; bioinformatics; biomedical informatics; data science; database; health informatics; lab management; neuroinformatics; text mining; undergraduate research.

Figures

Figure 1.
Application, training, and job positions. ALS clinical informatics project used as a detailed example.
Figure 2.
Example curation data entry layout from the ALS clinical informatics project. This is one of the four data entry layouts used in our ALS clinical informatics project. The layout shows some of the parametric and non-parametric data recorded during a standard ALS clinic patient visit. Additional separate layouts (not shown) exist for cognitive testing, patient history (medical and family history, onset symptom timeline, diagnostic and genetic testing), and autopsy and pathological reports. If no data was present in the medical record or survey for a particular field, the field is simply left blank. Note that the patient name and MRN fields are shown only to illustrate how data is obtained from the data source; curated data is ultimately de-identified to protect patient privacy.
Figure 3.
Pie chart illustrating curator error types for Full Data Capture (Curation Levels 1–4, see curation section in Positions) for our SOD1 G93A transgenic mouse ALS experimental database, as identified by quality control personnel. On average, trained curators commit errors on less than 1.2% of their entries, with a standard deviation of ± 1%. The pie chart shows the breakdown by error type within this 1.2%. Partial Data Capture (green): curator fails to collect all data from a figure or table. For example, trendline data was recaptured only for G93A and not for G93A + treatment. Ontology mislabeling error (blue): curator assigns a response value entry to the wrong ontological classification. Estimation error (yellow): the captured data point value (typically estimated visually from a graph) differs from the actual value by more than ± 5%. Note that an estimate off by more than 10% is defined as a critical error. Critical error (red): incorrectly entered data that compromises data integrity.
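
The estimation thresholds in the Figure 3 caption (more than ± 5% is an estimation error, more than 10% is a critical error) can be illustrated with a minimal Python sketch. The function name, the "acceptable" label, and the use of relative deviation from the actual value are assumptions made for illustration; the paper does not describe how quality control personnel computed or recorded these checks.

```python
def classify_estimate(captured: float, actual: float) -> str:
    """Classify a captured data point by its relative deviation from the actual value.

    Hypothetical helper: the thresholds follow the Figure 3 caption, but the
    name, labels, and calculation are illustrative assumptions.
    """
    if actual == 0:
        raise ValueError("relative deviation is undefined when the actual value is 0")
    deviation = abs(captured - actual) / abs(actual)
    if deviation > 0.10:
        # Off by more than 10%: the caption defines this as a critical error.
        return "critical error"
    if deviation > 0.05:
        # Off by more than 5% but not more than 10%: an estimation error.
        return "estimation error"
    return "acceptable"


if __name__ == "__main__":
    # Example values are invented for illustration only.
    for captured in (103.0, 107.0, 115.0):
        print(captured, classify_estimate(captured, actual=100.0))
```

The other error types in the caption (partial data capture, ontology mislabeling) depend on reviewer judgment and are not covered by this sketch.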
