Graph-based data selection for the construction of genomic prediction models

Steven Maenhout¹, Bernard De Baets, Geert Haesaert

PMID: 20479144
PMCID: PMC2927770
DOI: 10.1534/genetics.110.116426

Genetics. 2010 Aug;185(4):1463-75. doi: 10.1534/genetics.110.116426. Epub 2010 May 17.

Affiliation

¹ Department of Biosciences and Landscape Architecture, University College Ghent, B-9000 Gent, Belgium. steven.maenhout@hogent.be

Abstract

Efficient genomic selection in animals or crops requires the accurate prediction of the agronomic performance of individuals from their high-density molecular marker profiles. Using a training data set that contains the genotypic and phenotypic information of a large number of individuals, each marker or marker allele is associated with an estimated effect on the trait under study. These estimated marker effects are subsequently used for making predictions on individuals for which no phenotypic records are available. As most plant and animal breeding programs are currently still phenotype driven, the continuously expanding collection of phenotypic records can only be used to construct a genomic prediction model if a dense molecular marker fingerprint is available for each phenotyped individual. However, as the genotyping budget is generally limited, the genomic prediction model can only be constructed using a subset of the tested individuals and possibly a genome-covering subset of the molecular markers. In this article, we demonstrate how an optimal selection of individuals can be made with respect to the quality of their available phenotypic data. We also demonstrate how the total number of molecular markers can be reduced while a maximum genome coverage is ensured. The third selection problem we tackle is specific to the construction of a genomic prediction model for a hybrid breeding program where only molecular marker fingerprints of the homozygous parents are available. We show how to identify the set of parental inbred lines of a predefined size that has produced the highest number of progeny. These three selection approaches are put into practice in a simulation study where we demonstrate how the trade-off between sample size and sample quality affects the prediction accuracy of genomic prediction models for hybrid maize.
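The third selection problem, choosing a fixed number of parental inbred lines that together account for as many hybrids as possible, can be read as looking for a dense subgraph on k vertices. The sketch below uses a greedy peeling heuristic (repeatedly dropping the parent with the fewest remaining hybrids). This is one well-known heuristic, not necessarily the procedure used in the article, and the function name `greedy_k_parents` and the toy data are hypothetical.

```python
from collections import defaultdict


def greedy_k_parents(hybrids, k):
    """Greedy peeling heuristic (illustrative, not guaranteed optimal).

    hybrids: list of (parent1, parent2) pairs, one per single-cross hybrid.
    Returns the k retained parents and the hybrids between them.
    """
    # Vertices are inbred lines, edges are hybrids.
    adj = defaultdict(set)
    for p1, p2 in hybrids:
        if p1 == p2:
            continue  # assumption: a hybrid always has two distinct parents
        adj[p1].add(p2)
        adj[p2].add(p1)

    while len(adj) > k:
        # Drop the parent with the fewest remaining hybrids;
        # ties are broken by name so the result is reproducible.
        weakest = min(adj, key=lambda v: (len(adj[v]), v))
        for neighbour in adj.pop(weakest):
            adj[neighbour].discard(weakest)

    parents = set(adj)
    kept = [(a, b) for a, b in hybrids if a in parents and b in parents]
    return parents, kept


# Hypothetical toy breeding pool: five inbred lines, five hybrids.
toy_hybrids = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
parents, kept = greedy_k_parents(toy_hybrids, 3)
print(sorted(parents), len(kept))
```

Because the heuristic only looks at current vertex degrees, it can miss the best subset on larger graphs; it is meant to make the selection criterion concrete rather than to solve it exactly.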

Figures

**Figure 1.**
Graph of the trade-off between the selection size and the selection quality for a sample of the RAGT grain maize breeding pool. For each examined level of CD_min, ranging from 0.0 to 0.97, the dot represents the maximum cardinality selection of individuals for which the minimum precision of a pairwise contrast is at least CD_min.
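One way to make the selection in this caption concrete: connect two individuals whenever the precision of their pairwise contrast is at least CD_min, so that any admissible selection is a clique and the maximum cardinality selection is a maximum clique. The sketch below finds it by exhaustive branch-and-bound, which is only practical for toy inputs. The names `max_selection` and `toy_cd` and all CD values are made up for illustration and do not come from the article.

```python
from itertools import combinations


def max_selection(individuals, cd, cd_min):
    """Largest set whose pairwise contrast precision is >= cd_min.

    cd maps frozenset({a, b}) to the precision of the contrast a vs b.
    Exhaustive search; exponential in the worst case.
    """
    adj = {i: set() for i in individuals}
    for a, b in combinations(individuals, 2):
        if cd[frozenset((a, b))] >= cd_min:
            adj[a].add(b)
            adj[b].add(a)

    best = set()

    def expand(clique, candidates):
        nonlocal best
        if len(clique) > len(best):
            best = set(clique)
        # Prune: even adding every candidate cannot beat the current best.
        if len(clique) + len(candidates) <= len(best):
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v])
            candidates = candidates - {v}

    expand(set(), set(individuals))
    return best


# Hypothetical pairwise CD values for four hybrids.
toy_cd = {
    frozenset(("h1", "h2")): 0.90,
    frozenset(("h1", "h3")): 0.80,
    frozenset(("h2", "h3")): 0.85,
    frozenset(("h1", "h4")): 0.30,
    frozenset(("h2", "h4")): 0.40,
    frozenset(("h3", "h4")): 0.95,
}
names = ["h1", "h2", "h3", "h4"]
for threshold in (0.0, 0.5, 0.92):
    print(threshold, sorted(max_selection(names, toy_cd, threshold)))
```

Raising the threshold shrinks the selection, which is the size versus quality trade-off that the figure plots.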

**Figure 2.**
Graph representation of a sample of the RAGT grain maize breeding pool. The blue vertices represent inbred lines and the gray edges are single-cross hybrids.

**Figure 3.**
Graph of the trade-off between the selection size and the selection quality when only k parental inbred lines are being genotyped. For each examined level of CD_min, ranging from 0.0 to 0.97, the number of genotyped inbred lines k is reduced from 487 to 3. Each dot in the plotted surface represents the maximum cardinality selection of hybrid individuals for which the minimum precision of a pairwise contrast is at least CD_min and the number of parents is exactly k.

**Figure 4.**
Log-scaled degree distribution of the graph created from part of the RAGT R2n grain maize breeding program. In this undirected, unweighted graph, parental inbred lines are represented as vertices and single-cross hybrids as edges. Each dot represents a unique log-scaled vertex degree (horizontal axis) and the log of its frequency in the graph (vertical axis). The red line represents the power-law distribution fitted by maximum likelihood. The threshold value of 6 was determined by minimizing the Kolmogorov–Smirnov statistic as described by Clauset *et al.* (2009).
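As a rough illustration of the fit mentioned in this caption, the sketch below computes vertex degrees from a list of hybrids and estimates the power-law exponent for degrees at or above a given threshold. It uses the closed-form approximation to the discrete maximum likelihood estimator given by Clauset *et al.*, which is reasonable when the threshold is not too small. The threshold is taken as an input here; the Kolmogorov–Smirnov search for it is not shown. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def degree_sequence(hybrids):
    """Degree of each inbred line, assuming every hybrid is listed once."""
    degree = Counter()
    for p1, p2 in hybrids:
        degree[p1] += 1
        degree[p2] += 1
    return list(degree.values())


def powerlaw_alpha(degrees, xmin):
    """Approximate discrete MLE of the power-law exponent for degrees >= xmin.

    alpha ~ 1 + n / sum(ln(d / (xmin - 0.5))) over the n tail degrees.
    """
    tail = [d for d in degrees if d >= xmin]
    if not tail:
        raise ValueError("no degrees at or above xmin")
    return 1.0 + len(tail) / sum(math.log(d / (xmin - 0.5)) for d in tail)


# Hypothetical toy pool: line "T" is crossed with three other lines.
toy_pool = [("T", "A"), ("T", "B"), ("T", "C"), ("A", "B")]
degrees = degree_sequence(toy_pool)
print(sorted(degrees), powerlaw_alpha(degrees, xmin=1))
```

With so few vertices the estimate is meaningless as statistics; the example only shows how the quantities in the caption relate to the hybrid list.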

**Figure 5.**
Accuracy of the genotypic value BLUPs of the hybrids selected using the described graph-based procedures. The three examined heritability levels h² = 0.25, h² = 0.5, and h² = 0.75 are represented by the bottom, middle, and top wireframe surfaces, respectively. Each point on a surface is the squared Pearson correlation between the BLUPs and the actual (simulated) genotypic values of the selected hybrids under the constraints of a minimum required contrast precision CD_min, expressed as a percentile of the sampled CD values, and the number of genotyped inbred lines, averaged over 100 iterations of the simulation routine.
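The accuracy measure named in this caption, the squared Pearson correlation between predicted and simulated genotypic values, can be computed as in the minimal sketch below. The function name `squared_pearson` and the numbers are illustrative only.

```python
def squared_pearson(predicted, actual):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    sxy = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    sxx = sum((p - mean_p) ** 2 for p in predicted)
    syy = sum((a - mean_a) ** 2 for a in actual)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Made-up predicted and simulated genotypic values for four hybrids.
r2 = squared_pearson([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
print(round(r2, 2))  # 0.64
```

Squaring discards the sign of the correlation, so a perfectly inverted prediction would also score 1.0; the measure is therefore only informative alongside a check of the direction of the relationship.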

**Figure 6.**
Average prediction accuracy of ε-SVR and BLP prediction models over 100 iterations of the simulation routine for varying levels of two independent variables: the minimum required contrast precision CD_min, expressed as a percentile of the sampled CD values (ranging from 0 to 0.875), and the number of genotyped inbred lines. The height of each point in the wireframe represents the prediction accuracy obtained by ε-SVR and BLP when training on the optimal selection of hybrids under the constraints imposed by the levels of these two variables. Prediction accuracy is expressed as the average squared Pearson correlation between the simulated and the predicted genotypic values of the hybrids. The interval at the bottom of each wireframe provides the minimum and maximum standard error of the mean. The scales of the vertical axes are comparable only within the same heritability level.

Cited by

Large-scale sequestration of atmospheric carbon via plant roots in natural and agricultural ecosystems: why and how.
Kell DB. Philos Trans R Soc Lond B Biol Sci. 2012 Jun 5;367(1595):1589-97. doi: 10.1098/rstb.2011.0244. PMID: 22527402. Free PMC article. Review.
Across-years prediction of hybrid performance in maize using genomics.
Schrag TA, Schipprack W, Melchinger AE. Theor Appl Genet. 2019 Apr;132(4):933-946. doi: 10.1007/s00122-018-3249-5. Epub 2018 Nov 29. PMID: 30498894.
Training set optimization under population structure in genomic selection.
Isidro J, Jannink JL, Akdemir D, Poland J, Heslot N, Sorrells ME. Theor Appl Genet. 2015 Jan;128(1):145-58. doi: 10.1007/s00122-014-2418-4. Epub 2014 Nov 1. PMID: 25367380. Free PMC article.
Maximizing the reliability of genomic selection by optimizing the calibration set of reference individuals: comparison of methods in two diverse groups of maize inbreds (Zea mays L.).
Rincent R, Laloë D, Nicolas S, Altmann T, Brunel D, Revilla P, Rodríguez VM, Moreno-Gonzalez J, Melchinger A, Bauer E, Schoen CC, Meyer N, Giauffret C, Bauland C, Jamin P, Laborde J, Monod H, Flament P, Charcosset A, Moreau L. Genetics. 2012 Oct;192(2):715-28. doi: 10.1534/genetics.112.141473. Epub 2012 Aug 3. PMID: 22865733. Free PMC article.
Beyond Genomic Prediction: Combining Different Types of omics Data Can Improve Prediction of Hybrid Performance in Maize.
Schrag TA, Westhues M, Schipprack W, Seifert F, Thiemann A, Scholten S, Melchinger AE. Genetics. 2018 Apr;208(4):1373-1385. doi: 10.1534/genetics.117.300374. Epub 2018 Jan 23. PMID: 29363551. Free PMC article.
