Review

Crowdsourcing biomedical research: leveraging communities as innovation engines

Julio Saez-Rodriguez et al. Nat Rev Genet. 2016 Jul 15;17(8):470-486. doi: 10.1038/nrg.2016.69.

Abstract

The generation of large-scale biomedical data is creating unprecedented opportunities for basic and translational science. Typically, the data producers perform the initial analyses, but the most informative methods may well reside with other groups. Crowdsourcing the analysis of complex and massive data has emerged as a framework for finding robust methodologies. When the crowdsourcing takes the form of collaborative scientific competitions, known as Challenges, the validation of the methods is inherently addressed. Challenges also encourage open innovation, create collaborative communities to solve diverse and important biomedical problems, and foster the creation and dissemination of well-curated data repositories.


Competing interests statement

The authors declare no competing interests.

Figures

Figure 1. Challenge platforms and organizations
The most popular researcher-driven Challenge initiatives in the life sciences (left) and the most popular commercial Challenge platforms (right) are shown. Initiatives such as DREAM (Dialogue for Reverse Engineering Assessment and Methods), FlowCAP (Flow Cytometry Critical Assessment of Population Identification Methods), CAGI (Critical Assessment of Genome Interpretation) and sbv-IMPROVER (Systems Biology Verification combined with Industrial Methodology for Process Verification in Research) organize several Challenges per year; only the parent initiative, and not its individual Challenges, is shown. Among the most popular and successful commercial Challenge platforms are: InnoCentive, which crowdsources Challenges in science and technology (social sciences, physics, biology and chemistry); Topcoder, which serves the software developer community; and Kaggle, which administers Challenges for machine-learning experts, addressing predictive analytics problems in a wide range of disciplines. The figure is not comprehensive, but highlights the most consistent and well-established Challenge initiatives. CAFA, Critical Assessment of Functional Annotation; CACAO, Cross-language Access to Catalogues And On-line libraries; CAMDA, Critical Assessment of Massive Data Analysis; CAPRI, Critical Assessment of PRediction of Interaction; CASP, Critical Assessment of protein Structure Prediction; CLARITY, Children’s Leadership Award for the Reliable Interpretation and appropriate Transmission of Your genomic information; RGASP, RNA-seq Genome Annotation Assessment Project; TREC Crowd, Text REtrieval Conference Crowdsourcing Track.
Figure 2. The steps and tasks in the organization of a Challenge
The main scientific steps in developing a Challenge are: determining the scientific question, pre-processing and curating the data, conducting the dry run, scoring and judging, performing the post-Challenge analysis, and reporting the Challenge and writing the paper. Technical considerations include the development and maintenance of the IT infrastructure, which requires registration, the creation of computing accounts, security for cloud-based data hosting, and the development of submission queues, leaderboards and discussion forums. The legal considerations include agreements with the data providers regarding restrictions on data use, and the agreement that participants will abide by the Challenge rules. The social dimension includes the creation of an organizing team to plan, run and analyse the Challenge, as well as to put incentives for participation in place, advertise the Challenge, moderate the discussion forum and lead the post-Challenge activities, such as paper writing and conferences. Comms, communications; IRB, Institutional Review Board.
Figure 3. The wisdom of crowds in theory and in practice
Two case studies are shown: a hypothetical Challenge and the NIEHS–NCATS–UNC DREAM Toxicogenetics Challenge (a collaboration between the US National Institute of Environmental Health Sciences (NIEHS), the US National Center for Advancing Translational Sciences (NCATS) and the University of North Carolina (UNC)). a–d | The hypothetical example shows three of the predictions that will be integrated into an aggregate ranked list. Two sufficient conditions for integration to outperform individual inference methods are: first, each of the inference methods must have better-than-random predictive power (that is, on average, items in the positive set are assigned better (lower) ranks than items in the negative set), and second, the predictions of different inference methods must be statistically independent. Part b shows the probability that a given method places a positive or negative item at a given rank. Positive items are assigned lower ranks on average, yet there is still a considerable probability of assigning a low rank to a negative item. The area under the precision-recall curve (AUPR) of this method is only 0.41; for a random prediction with these parameters, we would expect an AUPR of 0.3. Suppose now that the integrated solution is computed for each item as the average of the ranks assigned to that item by each method. If, for the sake of simplicity, we assume that all methods have the same rank probability distributions and that the ranks are assigned independently for the positive and negative sets, then the central limit theorem establishes that the distribution of the average rank will approach a Gaussian, with its variance shrinking as more methods are integrated. In this way, the probability that a positive item receives a lower rank than the negative items increases (parts c and d), resulting in an AUPR that tends to 1 (perfect prediction) as the number of integrated inference methods increases. e | An equivalent trend is seen in the Toxicogenetics Challenge using a different metric (Pearson correlation). The Pearson correlation is shown for all 24 submitted methods, together with box plots for aggregates of n randomly chosen predictions out of the 24. The median correlation of the aggregates increases as the number of aggregated methods increases. Parts a–d are adapted from REF. , Nature Publishing Group. Part e is adapted from REF. , Nature Publishing Group.
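To make the rank-averaging argument of parts a–d concrete, the following minimal Python sketch simulates the aggregation. This is not code from the Challenge: the set sizes, the score distributions and the simple rank-averaging scheme are illustrative assumptions, chosen only to satisfy the two sufficient conditions named above (better-than-random and mutually independent methods).

import numpy as np

rng = np.random.default_rng(0)
n_pos, n_neg = 30, 70                  # illustrative positive/negative set sizes
n_items = n_pos + n_neg
labels = np.array([1] * n_pos + [0] * n_neg)

def one_method():
    """One weakly predictive method: positives receive stochastically
    lower scores, and therefore better (lower) ranks, than negatives."""
    scores = np.concatenate([
        rng.normal(0.0, 1.0, n_pos),   # positives: lower mean score
        rng.normal(0.8, 1.0, n_neg),   # negatives: higher mean score
    ])
    return scores.argsort().argsort() + 1   # convert scores to ranks 1..n_items

def aupr(avg_rank):
    """Approximate area under the precision-recall curve when items are
    sorted by average rank (lowest rank = strongest positive call)."""
    order = np.argsort(avg_rank)
    tp = np.cumsum(labels[order])
    precision = tp / np.arange(1, n_items + 1)
    recall = tp / n_pos
    return np.trapz(precision, recall)

for m in (1, 3, 10, 50):
    # Aggregate m independent methods by averaging their assigned ranks.
    ranks = np.mean([one_method() for _ in range(m)], axis=0)
    print(f"{m:3d} aggregated methods -> AUPR = {aupr(ranks):.3f}")

Because the methods are independent and each is better than random, the variance of the averaged rank shrinks as more methods are included, so the printed AUPR rises toward 1 as m grows, mirroring the trend in parts c and d.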

References

    1. Stephens ZD, et al. Big data: astronomical or genomical? PLoS Biol. 2015;13:e1002195.
    2. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74.
    3. The Cancer Genome Atlas Research Network et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013;45:1113–1120.
    4. International Cancer Genome Consortium et al. International network of cancer genome projects. Nature. 2010;464:993–998.
    5. Uhlén M, et al. Proteomics. Tissue-based map of the human proteome. Science. 2015;347:1260419.
