PLoS Comput Biol. 2010 Mar 5;6(3):e1000698.
doi: 10.1371/journal.pcbi.1000698.

Systematic planning of genome-scale experiments in poorly studied species


Yuanfang Guan et al. PLoS Comput Biol. 2010.

Abstract

Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the availability of extensive knowledge in model organisms, the planning of genome-scale experiments in poorly studied species is still based on the intuition of experts or heuristic trials. We propose that computational and systematic approaches can be applied to drive the experiment planning process in poorly studied species based on available data and knowledge in closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes to enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify the accuracy of each dataset in predicting each specific biological process and the overlap in such coverage between different datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organisms to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that less than 10% of the experiments could achieve similar functional coverage to the whole microarray compendium. This estimation was confirmed by performing the recommended experiments in S. bayanus, therefore significantly reducing the labor devoted to characterizing the poorly studied genome. This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments as well as other groups of organisms.
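
The abstract does not spell out how accuracy and overlap are combined into an ordered list. As a minimal sketch only, the Python below assumes a greedy ranking in which each candidate dataset is scored by a weighted accuracy minus a weighted redundancy with the datasets already chosen. The function name, the scoring formula, the weight `alpha` (playing the role of the trade-off factor α mentioned in the Figure 5 caption) and the toy numbers are all assumptions made for illustration, not the authors' published method.

```python
def recommend_experiments(accuracy, overlap, alpha=0.9, k=None):
    """Greedy ordering of datasets (hypothetical scoring rule).

    accuracy: dict mapping dataset name -> summary accuracy (e.g. mean AUC).
    overlap:  dict mapping frozenset({a, b}) -> overlap in [0, 1];
              missing pairs are treated as non-overlapping.
    alpha:    weight on accuracy; (1 - alpha) is the weight on redundancy.
    k:        optional cap on the length of the returned list.
    """
    remaining = set(accuracy)
    ordered = []
    while remaining and (k is None or len(ordered) < k):

        def gain(name):
            # Redundancy = strongest overlap with anything already selected.
            redundancy = max(
                (overlap.get(frozenset((name, s)), 0.0) for s in ordered),
                default=0.0,
            )
            return alpha * accuracy[name] - (1.0 - alpha) * redundancy

        # sorted() makes tie-breaking deterministic (alphabetical).
        best = max(sorted(remaining), key=gain)
        ordered.append(best)
        remaining.remove(best)
    return ordered


# Toy inputs (invented for illustration): two near-duplicate heat-shock
# datasets and one less accurate but independent nitrogen dataset.
toy_accuracy = {"heat": 0.80, "heat_rep": 0.79, "nitrogen": 0.70}
toy_overlap = {frozenset(("heat", "heat_rep")): 0.9}

print(recommend_experiments(toy_accuracy, toy_overlap, alpha=0.5))
print(recommend_experiments(toy_accuracy, toy_overlap, alpha=1.0))
```

With `alpha=1.0` the ranking reduces to sorting by accuracy alone. Lower values push the near-duplicate dataset down the list, which is the behaviour an overlap penalty is meant to produce.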


Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Figure 1. Three-step schematic of the genome-scale experiment planning procedures.
First, the informativeness of each experiment in predicting each Gene Ontology (GO) biological process is quantified by a bootstrap support vector machine (SVM). Genes in the model organism are grouped into ‘Positives’ (those annotated to the GO term under study) and ‘Negatives’ (those not annotated to the GO term). The Area Under the Receiver Operating Characteristic Curve (AUC) of each experiment is estimated by bootstrap SVM, resulting in the GO-experiment matrix. Second, conditional mutual information (CMI) is used to quantify the overlap between each pair of experiments, which results in a symmetric mutual information matrix. Finally, for datasets that contain a large number of arrays, the minimal number needed to achieve satisfactory function prediction results is estimated by a randomized test. The experiment planning system combines these three aspects and recommends a final list of experimental treatments to be carried out in a related poorly studied species.
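
To make the AUC in the first step concrete, here is a small sketch of how an AUC can be computed from per-gene prediction scores and the Positive/Negative grouping, using the standard rank-sum (Mann-Whitney) identity. The function name and the toy scores are assumptions for illustration. In the procedure described above the scores would come from the bootstrap SVM, which is not reproduced here.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and boolean labels.

    Uses the rank-sum identity: AUC is the probability that a randomly
    chosen positive scores higher than a randomly chosen negative, with
    ties counted as one half.
    """
    pairs = sorted(zip(scores, labels), key=lambda p: p[0])
    n_pos = sum(1 for _, is_pos in pairs if is_pos)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")

    rank_sum = 0.0  # sum of (tie-averaged) ranks of the positives
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Items i..j-1 share 1-based ranks i+1..j; use their average.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1])
        i = j

    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example (invented scores): both annotated genes outrank both
# unannotated genes, so the ranking is perfect.
print(auc([0.9, 0.8, 0.3, 0.1], [True, True, False, False]))  # 1.0
```

One such value per (GO term, experiment) pair would fill one cell of the GO-experiment matrix.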
Figure 2. Microarray datasets contain signals for different yet overlapping biological processes.
A. The performance (in AUC) of each of the top 10 datasets recommended by the planning system (shown in recommended order) in predicting different biological processes. B. The performance (in AUC) of the entire S. cerevisiae microarray repository in predicting all GO biological process terms, grouped by hierarchical clustering. Datasets differ greatly in their relative performance for different biological processes. Some biological processes are well covered by a variety of experimental treatments, while the majority are covered by only a small fraction of the datasets.
Figure 3. Conditional mutual information could quickly identify redundant datasets in the S. cerevisiae microarray repository.
A. Overview of the pair-wise mutual information between datasets, with mutual information values grouped by hierarchical clustering. The mutual information between datasets is highly structured: black blocks represent several highly overlapping datasets. B. Examples of mutual information between specific datasets. Dataset pairs generated under the same experimental treatment have very high mutual information.
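
For readers unfamiliar with conditional mutual information, the sketch below estimates I(X; Y | Z) in bits from three equal-length sequences of discrete values using plug-in frequency counts. The captions do not say which variables are compared or what is conditioned on. Treating X and Y as discretized per-gene calls from two datasets and Z as the GO annotation is an assumption made here for illustration, as are the function names.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(x, y, z):
    """Plug-in estimate of I(X; Y | Z) in bits for discrete samples."""
    n = len(x)
    if n == 0 or len(y) != n or len(z) != n:
        raise ValueError("x, y and z must be non-empty and equally long")

    n_xyz = Counter(zip(x, y, z))
    n_xz = Counter(zip(x, z))
    n_yz = Counter(zip(y, z))
    n_z = Counter(z)

    # I(X;Y|Z) = sum p(x,y,z) * log[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ];
    # the 1/n factors inside the logarithm cancel, leaving raw counts.
    total = 0.0
    for (xi, yi, zi), c in n_xyz.items():
        ratio = c * n_z[zi] / (n_xz[(xi, zi)] * n_yz[(yi, zi)])
        total += (c / n) * log2(ratio)
    return total


def overlap_matrix(calls_by_dataset, condition):
    """Pair-wise CMI for every pair of datasets (symmetric by construction)."""
    names = sorted(calls_by_dataset)
    return {
        (a, b): conditional_mutual_information(
            calls_by_dataset[a], calls_by_dataset[b], condition
        )
        for a in names
        for b in names
    }


# Toy calls (invented): "rep1" and "rep2" are identical, "other" is unrelated.
calls = {
    "rep1": [0, 0, 1, 1],
    "rep2": [0, 0, 1, 1],
    "other": [0, 1, 0, 1],
}
annotation = [0, 0, 0, 0]  # a single condition stratum, for simplicity
matrix = overlap_matrix(calls, annotation)
print(matrix[("rep1", "rep2")], matrix[("rep1", "other")])
```

Identical datasets receive the maximum value and unrelated ones a value of zero, which matches the caption's observation that datasets from the same treatment share very high mutual information.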
Figure 4. A small number of arrays in some of the very large-scale experiments are sufficient for function prediction.
The performance (in AUC) of random subsets containing different numbers of arrays from the (A) Brem et al., 2005 dataset and (B) Hughes et al., 2000 dataset. The mean, median and standard deviation were estimated from 25 sub-samplings. C. The performance (in AUC) of different numbers of arrays from the Brem et al. dataset in predicting different biological processes. The performance of a randomly selected subset is defined as its average AUC across the GO functional slim biological processes.
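
The sub-sampling test in this caption can be sketched as follows: draw a random subset of arrays of a given size, score it, repeat 25 times, and summarize the scores. The function names are hypothetical, and `toy_score` is an invented stand-in for the real scoring step (the average AUC across the GO functional slim biological processes), which would require the expression data and a trained classifier.

```python
import random
import statistics


def subsample_performance(arrays, subset_size, score_fn, n_repeats=25, seed=0):
    """Mean, median and standard deviation of score_fn over random subsets.

    arrays:      list of array identifiers in one large dataset.
    subset_size: number of arrays drawn (without replacement) per repeat.
    score_fn:    callable taking a list of arrays and returning a float.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    scores = [score_fn(rng.sample(arrays, subset_size)) for _ in range(n_repeats)]
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
    }


def toy_score(subset):
    # Invented saturating curve: performance rises with the number of
    # distinct "conditions" (array id modulo 20) covered by the subset.
    covered = len({a % 20 for a in subset})
    return 0.5 + 0.4 * (1 - 0.9 ** covered)


all_arrays = list(range(200))
for size in (5, 20, 80):
    print(size, subsample_performance(all_arrays, size, toy_score))
```

Plotting the mean against subset size shows where the curve flattens, which is one way to read off the smallest number of arrays that still gives satisfactory performance.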
Figure 5. Bootstrap cross-validation determines the trade-off between accuracy and redundancy of datasets.
A. A schematic of the bootstrap cross-validation scheme. Using the selected dataset, genes are placed into a high-dimensional space in which a support vector machine separates the positive examples (genes annotated to the GO term) from the negative examples (genes not annotated to it). In each iteration, a set of genes is bootstrapped as the training set, and the remaining genes serve as the test set. The predicted values for the test set are recorded. After 25 iterations, the median of the values predicted for a gene across the iterations in which it fell in the test set is taken as the final prediction value for that gene. This value is later used for performance analysis. B. The performance (in AUC) of the top 10 datasets in predicting the GO functional slim biological processes, for selections made with a range of values of α. A higher trade-off factor (α) places more weight on the accuracy of the datasets, and a lower α places a heavier penalty on the overlap between them. α = 0.9 achieved the best performance in functional annotation.
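
The loop in panel A can be sketched in a few lines. Note the swap: to keep the example dependency-free, a simple nearest-centroid scorer stands in for the support vector machine named in the caption. The function names, the stand-in classifier and the toy expression values are all assumptions for illustration.

```python
import random
import statistics


def bootstrap_oob_predictions(genes, fit, n_iter=25, seed=0):
    """Median held-out prediction per gene over bootstrap iterations.

    genes: list of gene identifiers.
    fit:   callable taking a list of training genes and returning a
           predict(gene) -> float function.
    """
    rng = random.Random(seed)
    held_out = {g: [] for g in genes}
    for _ in range(n_iter):
        # Bootstrap: draw len(genes) genes with replacement for training.
        train = [rng.choice(genes) for _ in genes]
        in_train = set(train)
        predict = fit(train)
        # Genes never drawn in this iteration form the test set.
        for g in genes:
            if g not in in_train:
                held_out[g].append(predict(g))
    # Genes that were never held out get no final prediction.
    return {g: statistics.median(v) for g, v in held_out.items() if v}


def make_centroid_fit(features, labels):
    """Nearest-centroid stand-in for the SVM (not the published classifier)."""

    def fit(train):
        pos = [features[g] for g in train if labels[g]]
        neg = [features[g] for g in train if not labels[g]]
        if not pos or not neg:
            return lambda g: 0.0  # degenerate sample: no information
        c_pos = [sum(col) / len(pos) for col in zip(*pos)]
        c_neg = [sum(col) / len(neg) for col in zip(*neg)]

        def predict(g):
            d_pos = sum((a - b) ** 2 for a, b in zip(features[g], c_pos))
            d_neg = sum((a - b) ** 2 for a, b in zip(features[g], c_neg))
            return d_neg - d_pos  # larger means closer to the positives

        return predict

    return fit


# Toy expression profiles (invented): positives near (1, 1), negatives near (-1, -1).
features = {f"pos{i}": (1.0 + 0.1 * i, 1.0) for i in range(10)}
features.update({f"neg{i}": (-1.0 - 0.1 * i, -1.0) for i in range(10)})
labels = {g: g.startswith("pos") for g in features}

final = bootstrap_oob_predictions(sorted(features), make_centroid_fit(features, labels))
print(sorted(final.items())[:3])
```

Because each gene is scored only in iterations where it was left out of training, the median is an out-of-sample estimate, and the resulting values can be passed to an AUC calculation for the performance analysis the caption mentions.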
Figure 6. Comparative evaluation of the experimental validation in S. bayanus.
Each panel compares the performance (in AUC) between S. bayanus and S. cerevisiae. GO functional slim terms with more than 30 genes annotated to them were included in all panels. Experimental validation in S. bayanus shows that 250 arrays based on the recommendations achieve a level of accuracy similar to that of 2569 arrays in S. cerevisiae. Also shown is the comparison of performance for eight individually matched experiment pairs in S. bayanus and S. cerevisiae.
Figure 7. Recommended experiments can more accurately predict functions than a random selection of the data repository.
A. Comparison of the performance of randomly selected subsets of the entire expression data repository in S. cerevisiae, the recommended datasets, and the recommended experiments carried out in S. bayanus. B. The experiments recommended in the second round in S. bayanus significantly improved the terms that were weakly represented in the first round. Based on the first-round evaluation results in S. bayanus, we re-designed several microarray experiments for the weakly predicted terms. Adding these ∼50 experiments to the compendium improved the predictions for those terms.

Comment in

  • Learning to prioritize.
    Flintoft L. Nat Rev Genet. 2010 May;11(5):315. doi: 10.1038/nrg2789. PMID: 20414989. No abstract available.


