PLoS Comput Biol. 2010 Mar 5;6(3):e1000698.

Systematic planning of genome-scale experiments in poorly studied species

Yuanfang Guan¹, Maitreya Dunham, Amy Caudy, Olga Troyanskaya

Affiliation

¹ Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America.

PMID: 20221257
PMCID: PMC2832676
DOI: 10.1371/journal.pcbi.1000698

Abstract

Genome-scale datasets have been used extensively in model organisms to screen for specific candidates or to predict functions for uncharacterized genes. However, despite the availability of extensive knowledge in model organisms, the planning of genome-scale experiments in poorly studied species is still based on the intuition of experts or heuristic trials. We propose that computational and systematic approaches can be applied to drive the experiment planning process in poorly studied species based on available data and knowledge in closely related model organisms. In this paper, we suggest a computational strategy for recommending genome-scale experiments based on their capability to interrogate diverse biological processes to enable protein function assignment. To this end, we use the data-rich functional genomics compendium of the model organism to quantify the accuracy of each dataset in predicting each specific biological process and the overlap in such coverage between different datasets. Our approach uses an optimized combination of these quantifications to recommend an ordered list of experiments for accurately annotating most proteins in the poorly studied related organisms to most biological processes, as well as a set of experiments that target each specific biological process. The effectiveness of this experiment-planning system is demonstrated for two related yeast species: the model organism Saccharomyces cerevisiae and the comparatively poorly studied Saccharomyces bayanus. Our system recommended a set of S. bayanus experiments based on an S. cerevisiae microarray data compendium. In silico evaluations estimate that fewer than 10% of the experiments could achieve functional coverage similar to that of the whole microarray compendium. This estimate was confirmed by performing the recommended experiments in S. bayanus, thereby significantly reducing the labor devoted to characterizing the poorly studied genome. This experiment-planning framework could readily be adapted to the design of other types of large-scale experiments as well as to other groups of organisms.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

**Figure 1. Three-step schematic of the genome-scale experiment planning procedures.**
First, the informativeness of each experiment in predicting each Gene Ontology (GO) biological process is quantified by bootstrap support vector machine (SVM). Genes in the model organism are grouped into ‘Positives’ (those annotated to the GO term under study) and ‘Negatives’ (those not annotated to the GO term). The Area Under the Receiver Operating Characteristic Curve (AUC) of each experiment is estimated by bootstrap SVM, resulting in the GO-experiment matrix. Second, conditional mutual information (CMI) is used to quantify the overlap between each pair of experiments, which yields a symmetric mutual information matrix. Finally, for datasets that contain a large number of arrays, the minimal number needed to achieve satisfactory function prediction results is estimated by a randomized test. The experiment planning system combines these three aspects and recommends a final list of experimental treatments to be carried out in a related poorly studied species.
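As an illustration of one entry of the GO-experiment matrix described in this caption, the sketch below computes an AUC from per-gene prediction scores and Positive/Negative labels using the rank-sum identity. It is a minimal stand-in, not the authors' code: the function name `auc` and the toy scores are assumptions, and in the paper the scores would come from the bootstrap SVM rather than being supplied by hand.

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    scores: one prediction value per gene (higher = more likely positive).
    labels: 1 for 'Positives' (annotated to the GO term), 0 for 'Negatives'.
    """
    pairs = sorted(zip(scores, labels))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative gene")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores and give each member the average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


# Hypothetical toy input: four genes, the first two annotated to the GO term.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 1, 0, 0]
print(auc(toy_scores, toy_labels))  # -> 1.0 (every positive outranks every negative)
```

Repeating this calculation for every GO term and every experiment would fill a matrix of the kind the caption calls the GO-experiment matrix.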

**Figure 2. Microarray datasets contain signals for different yet overlapping biological processes.**
A. The performance (in AUC) of each of the top 10 datasets (in order) recommended by the planning system in predicting different biological processes. B. The performance (in AUC) in predicting all GO biological process terms across the entire *S. cerevisiae* microarray repository, arranged by hierarchical clustering. Datasets differ markedly in their relative performance across biological processes. Some biological processes are well covered by a variety of experimental treatments, while the majority are covered by only a small fraction of the datasets.

**Figure 3. Conditional mutual information could quickly identify redundant datasets in the *S. cerevisiae* microarray repository.**
A. Overview of the pairwise mutual information between datasets, with mutual information values arranged by hierarchical clustering. The mutual information between datasets is highly structured; black blocks represent groups of highly overlapping datasets. B. Examples of mutual information between specific datasets. Dataset pairs generated under the same experimental treatment have very high mutual information.
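To make the overlap measure concrete, here is a minimal sketch of conditional mutual information, I(X; Y | Z), for discrete variables. The abstract and captions do not say how expression values were discretized or which variable was conditioned on, so the inputs here are assumptions: `x` and `y` stand for per-gene discretized signals from two datasets, and `z` for a hypothetical conditioning label.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(x, y, z):
    """I(X; Y | Z) in bits for three equally long sequences of discrete values."""
    n = len(x)
    if n == 0 or not (n == len(y) == len(z)):
        raise ValueError("x, y and z must be non-empty and of equal length")

    count_xyz = Counter(zip(x, y, z))
    count_xz = Counter(zip(x, z))
    count_yz = Counter(zip(y, z))
    count_z = Counter(z)

    cmi = 0.0
    for (xi, yi, zi), n_xyz in count_xyz.items():
        # p(x,y,z) * log2[ p(z) p(x,y,z) / (p(x,z) p(y,z)) ];
        # the 1/n factors cancel inside the ratio, so raw counts suffice there.
        ratio = (count_z[zi] * n_xyz) / (count_xz[(xi, zi)] * count_yz[(yi, zi)])
        cmi += (n_xyz / n) * log2(ratio)
    return cmi


# Two identical (fully redundant) signals share one full bit of information.
same = conditional_mutual_information([0, 1, 0, 1], [0, 1, 0, 1], [0, 0, 0, 0])
# Two independent signals share none.
indep = conditional_mutual_information([0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 0])
print(same, indep)  # -> 1.0 0.0
```

Because the measure is symmetric in `x` and `y`, computing it for every pair of datasets gives a symmetric matrix, which is consistent with the matrix described in the Figure 1 caption.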

**Figure 4. A small number of arrays in some of the very large-scale experiments are sufficient for function prediction.**
A, B. The performance (in AUC) of random subsets containing different numbers of arrays from the (A) Brem *et al.*, 2005 dataset and (B) Hughes *et al.*, 2000 dataset. The mean, median and standard deviation were estimated from 25 sub-samplings. C. The performance (in AUC) of different numbers of arrays from the Brem *et al.* dataset in predicting different biological processes. The performance of a randomly selected subset is defined as the average AUC over the GO functional slim biological processes.
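The randomized test in this caption can be sketched as repeated random sub-sampling of arrays, followed by summary statistics over the repeats. In the sketch below, `score_fn` is a hypothetical callable that would return the average AUC over GO slim processes for a given subset of arrays; the toy `coverage_score` used here is only a placeholder so the example runs on its own.

```python
import random
import statistics


def subsample_performance(arrays, subset_size, score_fn, n_repeats=25, seed=0):
    """Score `n_repeats` random subsets of `subset_size` arrays.

    arrays: list of array identifiers (or array data) from one large dataset.
    score_fn: hypothetical callable mapping a subset of arrays to a performance value.
    Returns the mean, median and standard deviation across the sub-samplings.
    """
    rng = random.Random(seed)
    scores = [score_fn(rng.sample(arrays, subset_size)) for _ in range(n_repeats)]
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
    }


def coverage_score(subset):
    """Toy placeholder: fraction of 10 hypothetical 'conditions' hit by the subset."""
    return len({array_id % 10 for array_id in subset}) / 10


all_arrays = list(range(100))  # hypothetical array identifiers
for size in (1, 5, 20, 100):
    print(size, subsample_performance(all_arrays, size, coverage_score))
```

Plotting the mean against the subset size would show where adding arrays stops improving the score, which is the kind of saturation the figure title refers to.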

**Figure 5. Bootstrap cross-validation determines the trade-off between accuracy and redundancy of datasets.**
A. A schematic of the bootstrap cross-validation scheme. Using the selected dataset, genes are placed in a high-dimensional space in which a support vector machine separates the positive examples (genes annotated to the GO term) from the negative examples (genes not annotated). In each iteration, a set of genes is bootstrapped as the training set, and the rest serve as the test set. The predicted values of the test set are recorded. After 25 iterations, the median of the values predicted for a gene while it was in the test set is taken as the final prediction value for that gene. This value is later used for performance analysis. B. The top 10 datasets selected under a range of α values differ in their ability (in AUC) to predict the GO functional slim biological processes. A higher trade-off factor (α) places more weight on the accuracy of the datasets, and a lower α places a heavier penalty on the overlap between them. α = 0.9 achieved the best performance in functional annotation.
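The bootstrap scheme in panel A can be sketched as follows: resample genes with replacement to form a training set, score the genes left out of that resample, and take each gene's median held-out score over 25 iterations. This is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, and a simple nearest-centroid scorer stands in for the SVM so that the example needs only the standard library.

```python
import random
import statistics


def centroid_scorer(train_x, train_y, test_x):
    """Stand-in for the SVM: score = sq. distance to negative centroid minus positive."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    pos = centroid([x for x, y in zip(train_x, train_y) if y == 1])
    neg = centroid([x for x, y in zip(train_x, train_y) if y == 0])
    return [dist2(x, neg) - dist2(x, pos) for x in test_x]


def bootstrap_median_predictions(genes, labels, train_and_score, n_iter=25, seed=0):
    """Median held-out prediction per gene over `n_iter` bootstrap resamples."""
    rng = random.Random(seed)
    n = len(genes)
    held_out = [[] for _ in range(n)]
    for _ in range(n_iter):
        train_idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_train = set(train_idx)
        test_idx = [i for i in range(n) if i not in in_train]
        train_y = [labels[i] for i in train_idx]
        if not test_idx or len(set(train_y)) < 2:
            continue  # skip resamples with no test genes or only one class
        scores = train_and_score([genes[i] for i in train_idx], train_y,
                                 [genes[i] for i in test_idx])
        for i, score in zip(test_idx, scores):
            held_out[i].append(score)
    # Genes that were never held out get None instead of a prediction.
    return [statistics.median(v) if v else None for v in held_out]


# Hypothetical toy data: two expression features per gene, two separated classes.
positives = [[1.0, 1.0], [1.5, 1.2], [2.0, 1.0], [1.2, 1.8], [1.0, 2.0], [1.7, 1.4]]
negatives = [[-a, -b] for a, b in positives]
toy_genes = positives + negatives
toy_labels = [1] * len(positives) + [0] * len(negatives)
predictions = bootstrap_median_predictions(toy_genes, toy_labels, centroid_scorer)
print(predictions)
```

The resulting per-gene values are what a performance measure such as AUC would then be computed on.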

**Figure 6. Comparative evaluation of the experimental validation in *S. bayanus*.**
Each panel compares the performance (in AUC) between *S. bayanus* and *S. cerevisiae*. GO functional slim terms with more than 30 genes annotated to them were included in all panels. Experimental validation in *S. bayanus* shows that 250 arrays based on the recommendations achieve a level of accuracy similar to that of 2569 arrays in *S. cerevisiae*. Also shown is the comparison of the performance of eight individually matched experiment pairs in *S. bayanus* and *S. cerevisiae*.

**Figure 7. Recommended experiments can more accurately predict functions than a random selection of the data repository.**
A. Comparison of the performance of randomly selected subsets of the entire expression data repository in *S. cerevisiae*, the recommended datasets, and the recommended experiments carried out in *S. bayanus*. B. Experiments recommended in the second round in *S. bayanus* significantly improved the terms that were weakly represented in the first round. Based on the evaluation results of the first round in *S. bayanus*, we re-designed several microarray experiments for the terms that were weakly predicted in that round. Adding these ∼50 experiments to the compendium improved the predictions for those terms.

Comment in

Learning to prioritize.
Flintoft L. Nat Rev Genet. 2010 May;11(5):315. doi: 10.1038/nrg2789. PMID: 20414989.

Cited by

Computationally Driven Experimental Biology.
Murali TM. Computer (Long Beach Calif). 2012 Mar;45(3):22-23. doi: 10.1109/mc.2012.93. PMID: 24976642.
Comparative gene expression between two yeast species.
Guan Y, Dunham MJ, Troyanskaya OG, Caudy AA. BMC Genomics. 2013 Jan 16;14:33. doi: 10.1186/1471-2164-14-33. PMID: 23324262.
Multiple genetic interaction experiments provide complementary information useful for gene function prediction.
Michaut M, Bader GD. PLoS Comput Biol. 2012;8(6):e1002559. doi: 10.1371/journal.pcbi.1002559. Epub 2012 Jun 21. PMID: 22737063.
Combinatorial Cis-regulation in Saccharomyces Species.
Spivak AT, Stormo GD. G3 (Bethesda). 2016 Jan 15;6(3):653-67. doi: 10.1534/g3.115.024331. PMID: 26772747.
Commitment to a cellular transition precedes genome-wide transcriptional change.
Eser U, Falleur-Fettig M, Johnson A, Skotheim JM. Mol Cell. 2011 Aug 19;43(4):515-27. doi: 10.1016/j.molcel.2011.06.024. PMID: 21855792.

Grants and funding

R01 GM071966/GM/NIGMS NIH HHS/United States
