Towards a rigorous assessment of systems biology models: the DREAM3 challenges

Robert J Prill et al. PLoS One. 2010 Feb 23;5(2):e9202. doi: 10.1371/journal.pone.0009202.

Erratum in

  • PLoS One. 2010;5(3). doi: 10.1371/annotation/f633213a-dc4f-4bee-b6c5-72d50e7073b8

Abstract

Background: Systems biology has embraced computational modeling in response to the quantitative nature and increasing scale of contemporary data sets. The onslaught of data is accelerating as molecular profiling technology evolves. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) is a community effort to catalyze discussion about the design, application, and assessment of systems biology models through annual reverse-engineering challenges.

Methodology and principal findings: We describe our assessments of the four challenges associated with the third DREAM conference which came to be known as the DREAM3 challenges: signaling cascade identification, signaling response prediction, gene expression prediction, and the DREAM3 in silico network challenge. The challenges, based on anonymized data sets, tested participants in network inference and prediction of measurements. Forty teams submitted 413 predicted networks and measurement test sets. Overall, a handful of best-performer teams were identified, while a majority of teams made predictions that were equivalent to random. Counterintuitively, combining the predictions of multiple teams (including the weaker teams) can in some cases improve predictive power beyond that of any single method.

Conclusions: DREAM provides valuable feedback to practitioners of systems biology modeling. Lessons learned from the predictions of the community provide much-needed context for interpreting claims of efficacy of algorithms described in the scientific literature.


Conflict of interest statement

Competing Interests: The authors have declared that no competing interests exist.

Figures

Figure 1. The objective of the signaling cascade identification challenge was to identify some of the molecular species in this diagram from single-cell flow cytometry measurements.
The upstream binding of a ligand to a receptor and the downstream phosphorylation of a protein are illustrated.
Figure 2. The objective of the signaling response prediction challenge was to predict the concentrations of phosphoproteins and cytokines in response to combinatorial perturbations of environmental cues (stimuli) and of the signaling network (inhibitors).
(a) A compendium of phosphoprotein and cytokine measurements was provided as a training set. (b) Histograms (log scale) of the scoring metric (normalized squared error) for 100,000 random predictions were approximately Gaussian (fitted blue points). Significance of the predictions of the teams (black points) was assessed with respect to the empirical probability densities embodied by these histograms. Scores of the best-performer teams are denoted with arrows.
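The null-model scoring described in this caption can be sketched in a few lines: score many random predictions, fit a Gaussian to the resulting normalized-squared-error distribution, and read off a one-sided p-value for a team's (lower-is-better) score. The function names and toy data below are illustrative, not taken from the paper's supplementary code.

```python
import math
import random

def normalized_squared_error(pred, actual, scale):
    # Sum of squared errors, each normalized by a per-measurement scale.
    return sum(((p - a) / s) ** 2 for p, a, s in zip(pred, actual, scale))

def empirical_p_value(score, null_scores):
    # Fit a Gaussian to the empirical null and return the lower-tail
    # probability P(null <= score); lower error scores are better.
    n = len(null_scores)
    mu = sum(null_scores) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in null_scores) / (n - 1))
    z = (score - mu) / sd
    return 0.5 * math.erfc(-z / math.sqrt(2))
```

The Gaussian fit matters in the far tail: a score better than every sampled random prediction still receives a finite p-value from the fitted density rather than an uninformative zero.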
Figure 3. Overlay of the assignment tables from the seven teams in the signaling cascade identification challenge.
The number of teams making each assignment and the associated p-value are indicated. The p-value expresses the probability of such a concentration of random guesses landing in the same table entry. Highlighted entries are correct. Five teams correctly identified species x1 as the kinase, a significant event for the community even though no team achieved a significant individual performance.
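As a rough illustration of the p-value in this caption: if each team's table entry is treated as an independent uniform guess over the possible assignments, the chance that at least k of n teams coincide is a binomial tail. This simplification ignores the constraints of a real assignment table, so it is only a sketch, and the function name is mine.

```python
from math import comb

def guess_concentration_p(n_teams, k, n_options):
    # Binomial tail: probability that k or more of n_teams independently
    # make the same assignment when each picks uniformly from n_options.
    p = 1.0 / n_options
    return sum(comb(n_teams, j) * p ** j * (1 - p) ** (n_teams - j)
               for j in range(k, n_teams + 1))
```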
Figure 4. The objective of the gene expression prediction challenge was to predict temporal expression of 50 genes that were withheld from a training set consisting of 9285 genes.
(a) Clustered heatmaps of the predicted genes (columns) reveal that two best-performer teams predicted substantially similar gene expression values, though different methods were employed. Results for the 60 minute time-point are shown. (b) The benefits of combining the predictions of multiple teams into a consensus prediction are illustrated by the rank sum prediction (triangles). Some rank sum predictions score higher than the best-performer, depending on the teams that are included. The highest score is achieved by a combination of the predictions of the best four teams.
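The rank-sum consensus in panel (b) can be sketched as follows: convert each team's predicted values to within-team ranks, then sum those ranks per gene to obtain the consensus ordering. The helper names below are illustrative.

```python
def ranks(values):
    # Rank of each value within its list (1 = smallest); ties broken by position.
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def rank_sum_consensus(team_predictions):
    # Sum each item's per-team rank; the rank-sums define the consensus ordering.
    per_team = [ranks(p) for p in team_predictions]
    return [sum(col) for col in zip(*per_team)]
```

Because ranks discard each team's scale and calibration, a weak but directionally informative team can still sharpen the consensus, which is consistent with the observation that including weaker teams sometimes improves the combined prediction.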
Figure 5. The objective of the in silico network inference challenge was to infer networks of various sizes (10, 50, and 100 nodes) from steady-state and time-series “measurements” of simulated gene regulation networks.
Predicted networks were evaluated on the basis of two scoring metrics, (a) area under the ROC curve and (b) area under the precision-recall curve. ROC and precision-recall curves of the five best teams in the 100-node sub-challenge. (a) Dotted diagonal line is the expected value of a random prediction. (b) Note that the best and second-best performers have different precision-recall characteristics. (c) The histogram (log scale) of the AUROC scoring metric for 100,000 random predictions was approximately Gaussian (fitted blue points), whereas the histogram of the AUPR metric was not (inset). Significance of the predictions of the teams (black points) was assessed with respect to the empirical probability densities embodied by these histograms. Scores of the best-performer team are denoted with arrows. All plots are analyses of the gold standard network called InSilico_Size100_Yeast2.
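Both metrics in this figure can be computed directly from a ranked edge list. A minimal sketch, with AUROC computed via the Mann-Whitney statistic and AUPR as average precision; the function names are mine.

```python
def auroc(scores, labels):
    # Probability that a randomly chosen true edge outscores a randomly
    # chosen non-edge (Mann-Whitney U / (n_pos * n_neg)); ties count half.
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def aupr(scores, labels):
    # Average precision: mean of the precision values observed at the
    # rank of each true edge in the descending-score ordering.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp, ap = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i]:
            tp += 1
            ap += tp / k
    return ap / tp  # tp equals the total number of true edges here
```

The sparsity of real regulatory networks is why the two metrics can rank teams differently: AUPR penalizes false positives among the top-ranked edges much more heavily than AUROC does.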
Figure 6. Analysis of the community of teams reveals characteristics of identifiable and unidentifiable network edges.
The number of teams that identify an edge at a specified cutoff is a measure of how easy or difficult an edge is to identify. In this analysis we use a cutoff of 2P (i.e., twice the number of actual positive edges in the gold standard network). (a) Histograms indicate the number of teams that correctly identified the edges of the gold standard network called InSilico_Size100_Ecoli1. The ten worst teams in the 100-node sub-challenge identified about the same number of edges as expected by chance. By contrast, the ten best teams identified more edges than expected by chance, and this sub-community has a markedly different identifiability distribution than random. Still, some edges were not identified by even the ten best teams (see bin corresponding to zero teams). Unidentified edges are characterized by (b) a property of the measurement data and (c) a topological property of the network. (b) Unidentified edges have a lower null-mutant absolute z-score than those that were identified by at least one of the ten best teams. This metric is a measure of the information content of the measurements. (c) Unidentified edges belong to target nodes with a higher in-degree than edges that were identified by at least one of the ten best teams. Circles denote the median and bars denote upper and lower quartiles. Statistics were not computed for bins containing fewer than four edges. (d) The benefits of combining the predictions of multiple teams into a consensus prediction are illustrated by the rank sum prediction (triangles). Though no rank sum prediction scored higher than the best-performer, a consensus of the predictions of the second and third place teams boosted the score of the second place team. Rank sum analysis shown for the 100-node sub-challenge.
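The identifiability count behind panel (a) reduces to a membership test: for each gold-standard edge, count the teams that ranked it within the top 2P of their predicted edge list. A minimal sketch with made-up edge names; the function name is mine.

```python
def edge_identifiability(team_edge_lists, gold_edges, cutoff):
    # For each true edge, count teams that placed it within their top
    # `cutoff` predicted edges (the caption's cutoff is 2 * number of
    # true edges). Each team's list is ordered by descending confidence.
    tops = [set(t[:cutoff]) for t in team_edge_lists]
    return {edge: sum(edge in top for top in tops) for edge in gold_edges}
```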
Figure 7. Community analysis of systematic false positives.
Systematic false positive (FP) edges are the top one percent of edges that were predicted by the most teams to exist, yet are actually absent from the gold standard (i.e., negative). Rare false positive edges are the remaining 99 percent of edges that are absent from the gold standard network. The entries of each two-by-two contingency table sum to the total number of negative edges (i.e., those not present) in the gold standard network. There is a relative concentration of FP errors in the shortcut and co-regulated topologies, as evidenced by the A-to-B ratio. P-values for each contingency table were computed by Fisher's exact test, which expresses the probability that a random partitioning of the data will result in such a contingency table.
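Fisher's exact test on a two-by-two table like those in this figure is a hypergeometric tail sum. A minimal one-sided sketch using only the standard library; the function name is mine, and published analyses typically use a library routine instead.

```python
from math import comb

def fisher_exact_one_sided(a, b, c, d):
    # One-sided Fisher's exact test on the table [[a, b], [c, d]]:
    # probability of a count of `a` or more in the top-left cell when
    # the row and column margins are fixed and the partition is random.
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, col1)
    return sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, min(row1, col1) + 1)) / denom
```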
Figure 8. Survey of in silico network methods.
No clear correlation between methods and scores is apparent, implying that success depends more on the details of implementation than on the choice of general methodology.
