Genetics. 2010 Aug;185(4):1463-75.
doi: 10.1534/genetics.110.116426. Epub 2010 May 17.

Graph-based data selection for the construction of genomic prediction models

Steven Maenhout et al. Genetics. 2010 Aug.

Abstract

Efficient genomic selection in animals or crops requires the accurate prediction of the agronomic performance of individuals from their high-density molecular marker profiles. Using a training data set that contains the genotypic and phenotypic information of a large number of individuals, each marker or marker allele is associated with an estimated effect on the trait under study. These estimated marker effects are subsequently used for making predictions on individuals for which no phenotypic records are available. As most plant and animal breeding programs are currently still phenotype driven, the continuously expanding collection of phenotypic records can only be used to construct a genomic prediction model if a dense molecular marker fingerprint is available for each phenotyped individual. However, as the genotyping budget is generally limited, the genomic prediction model can only be constructed using a subset of the tested individuals and possibly a genome-covering subset of the molecular markers. In this article, we demonstrate how an optimal selection of individuals can be made with respect to the quality of their available phenotypic data. We also demonstrate how the total number of molecular markers can be reduced while a maximum genome coverage is ensured. The third selection problem we tackle is specific to the construction of a genomic prediction model for a hybrid breeding program where only molecular marker fingerprints of the homozygous parents are available. We show how to identify the set of parental inbred lines of a predefined size that has produced the highest number of progeny. These three selection approaches are put into practice in a simulation study where we demonstrate how the trade-off between sample size and sample quality affects the prediction accuracy of genomic prediction models for hybrid maize.
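
The third selection problem, choosing a fixed number of parental inbred lines that together account for as many hybrids as possible, can be illustrated on a graph whose vertices are inbred lines and whose edges are single-cross hybrids. The sketch below uses a simple greedy peeling heuristic (repeatedly drop the parent with the fewest hybrids inside the current selection). It is an illustration under stated assumptions, not necessarily the procedure used in the article: the function name `greedy_k_parents` and the input format are hypothetical, and the heuristic does not guarantee an optimal set.

```python
from collections import defaultdict


def greedy_k_parents(hybrids, k):
    """Greedy sketch: keep k parents that retain many hybrids among them.

    hybrids: iterable of (parent1, parent2) pairs (hypothetical input format;
             parent labels are assumed to be comparable, e.g. strings).
    k:       number of parental inbred lines that can be genotyped.
    Returns (selected_parents, retained_hybrids).
    """
    hybrids = list(hybrids)
    # Build an undirected graph: parents are vertices, hybrids are edges.
    adj = defaultdict(set)
    for p1, p2 in hybrids:
        if p1 != p2:  # assumption: ignore selfed entries
            adj[p1].add(p2)
            adj[p2].add(p1)

    # Peel off the parent with the fewest remaining hybrids until k are left.
    while len(adj) > k:
        worst = min(adj, key=lambda v: (len(adj[v]), v))  # ties broken by label
        for neighbour in adj.pop(worst):
            adj[neighbour].discard(worst)

    selected = set(adj)
    # A hybrid is usable only if both of its parents were selected.
    retained = [(a, b) for a, b in hybrids
                if a != b and a in selected and b in selected]
    return selected, retained


if __name__ == "__main__":
    toy = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
    parents, kept = greedy_k_parents(toy, k=3)
    print(sorted(parents), len(kept))
```

Peeling is cheap and easy to reason about, which makes it a convenient baseline; an exact search would be needed to certify that no other set of k parents retains more hybrids.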

Figures

Figure 1.—
Graph of the trade-off between the selection size and the selection quality for a sample of the RAGT grain maize breeding pool. For each examined level of CDmin, ranging from 0.0 to 0.97, the dot represents the maximum cardinality selection of individuals for which the minimum precision of a pairwise contrast is at least CDmin.
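
The selection described in the Figure 1 caption, the largest set of individuals in which every pairwise contrast has a precision of at least CDmin, can be read as a maximum clique problem on a thresholded graph: connect two individuals when their contrast precision reaches CDmin, then look for the largest fully connected group. The sketch below is a minimal illustration of that reading using an exact Bron-Kerbosch search with pivoting, which is only practical for small inputs. The function name `max_cd_selection`, the dictionary input format, and the toy CD values are assumptions made for this example; pairs missing from the dictionary are treated as failing the threshold.

```python
def max_cd_selection(individuals, cd, cd_min):
    """Largest set of individuals whose pairwise CD values are all >= cd_min.

    individuals: iterable of labels.
    cd:          dict mapping (label_a, label_b) -> contrast precision
                 (hypothetical format; each pair listed once, in any order).
    cd_min:      minimum required precision of every pairwise contrast.
    """
    # Threshold graph: an edge means the pair satisfies the CD requirement.
    adj = {v: set() for v in individuals}
    for (a, b), value in cd.items():
        if a != b and value >= cd_min:
            adj[a].add(b)
            adj[b].add(a)

    best = set()

    def expand(current, candidates, excluded):
        nonlocal best
        if not candidates and not excluded:
            # current is a maximal clique; keep it if it is the largest so far.
            if len(current) > len(best):
                best = set(current)
            return
        if len(current) + len(candidates) <= len(best):
            return  # bound: this branch cannot beat the best clique found
        # Pivot on the vertex with most neighbours among the candidates.
        pivot = max(candidates | excluded,
                    key=lambda v: len(adj[v] & candidates))
        for v in list(candidates - adj[pivot]):
            expand(current | {v}, candidates & adj[v], excluded & adj[v])
            candidates.remove(v)
            excluded.add(v)

    expand(set(), set(adj), set())
    return best


if __name__ == "__main__":
    toy_cd = {("A", "B"): 0.90, ("A", "C"): 0.80, ("B", "C"): 0.85,
              ("A", "D"): 0.30, ("B", "D"): 0.95, ("C", "D"): 0.20}
    for threshold in (0.0, 0.8, 0.9):
        chosen = max_cd_selection("ABCD", toy_cd, threshold)
        print(threshold, sorted(chosen))
```

Sweeping the threshold, as in the usage loop, reproduces the kind of size-versus-quality trade-off the caption describes: raising CDmin can only shrink or preserve the size of the largest admissible selection.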
Figure 2.—
Graph representation of a sample of the RAGT grain maize breeding pool. The blue vertices represent inbred lines and the gray edges are single-cross hybrids.
Figure 3.—
Graph of the trade-off between the selection size and the selection quality when only k parental inbred lines are being genotyped. For each examined level of CDmin, ranging from 0.0 to 0.97, the number of genotyped inbred lines k is reduced from 487 to 3. Each dot in the plotted surface represents the maximum cardinality selection of hybrid individuals for which the minimum precision of a pairwise contrast is at least CDmin and the number of parents is exactly k.
Figure 4.—
Log-scaled degree distribution of the graph created from part of the RAGT R2n grain maize breeding program. In this undirected, unweighted graph, parental inbred lines are represented as vertices and single-cross hybrids as edges. Each dot represents a unique log-scaled vertex degree (horizontal axis) and the log of its frequency in the graph (vertical axis). The red line represents the fitted power law distribution by means of likelihood maximization. The threshold value of 6 was determined by minimizing the Kolmogorov–Smirnov statistic as described by Clauset et al. (2009).
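
The quantities plotted in Figure 4 can be sketched in a few lines: count how many hybrids each inbred line took part in (its vertex degree), tabulate how often each degree occurs, and estimate a power-law exponent for the degrees at or above a threshold. The exponent estimate below uses the approximate discrete maximum-likelihood formula given by Clauset et al. (2009), alpha ≈ 1 + n / Σ ln(d_i / (x_min − 0.5)). Whether the article used this approximation or an exact discrete fit is not stated in the caption, so treat that choice, along with the function names and toy data, as assumptions.

```python
import math
from collections import Counter


def vertex_degrees(hybrids):
    """Number of hybrids per parental line, for (parent1, parent2) pairs."""
    degree = Counter()
    for p1, p2 in hybrids:
        degree[p1] += 1
        degree[p2] += 1
    return degree


def degree_frequencies(degree):
    """Map each observed degree to the number of vertices having it."""
    return Counter(degree.values())


def power_law_alpha(degrees, x_min):
    """Approximate discrete MLE of the power-law exponent for degrees >= x_min."""
    tail = [d for d in degrees if d >= x_min]
    if not tail:
        raise ValueError("no degrees at or above x_min")
    return 1.0 + len(tail) / sum(math.log(d / (x_min - 0.5)) for d in tail)


if __name__ == "__main__":
    toy = [("A", "B"), ("A", "C"), ("A", "D")]
    freq = degree_frequencies(vertex_degrees(toy))
    # Points of a log-log degree plot: (log degree, log frequency).
    print(sorted((math.log(d), math.log(f)) for d, f in freq.items()))
    # Hypothetical degree sample with the caption's threshold of 6.
    print(power_law_alpha([3, 6, 6, 12], x_min=6))
```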
Figure 5.—
Accuracy of the genotypic value BLUPs of the hybrids selected using the described graph-based procedures. The three examined heritability levels h² = 0.25, h² = 0.5, and h² = 0.75 are represented by the bottom, middle, and top wireframe surfaces, respectively. Each point on a surface is the squared Pearson correlation between the BLUPs and the actual (simulated) genotypic values of the selected hybrids under the constraints of a minimum required contrast precision CDmin, expressed as a percentile of the sampled CD values, and the number of genotyped inbred lines, averaged over 100 iterations of the simulation routine.
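
The accuracy measure named in this caption, the squared Pearson correlation between predicted and simulated genotypic values, is straightforward to compute. The sketch below is a plain standard-library version for illustration; the function name `squared_pearson` and the toy numbers are hypothetical and are not taken from the article's simulation.

```python
def squared_pearson(predicted, actual):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(predicted)
    if n != len(actual) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    # Sums of cross-products and squares of the centred values.
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov * cov / (var_p * var_a)


if __name__ == "__main__":
    blups = [1.0, 2.0, 3.0, 4.0]       # hypothetical predicted values
    simulated = [1.0, 3.0, 2.0, 4.0]   # hypothetical true genotypic values
    print(squared_pearson(blups, simulated))
```

Averaging this value over repeated simulation runs, as the caption describes, gives one point of the plotted surface for a given combination of CDmin and number of genotyped lines.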
Figure 6.—
Average prediction accuracy of ɛ-SVR and BLP prediction models over 100 iterations of the simulation routine for varying levels of the minimum required contrast precision CDmin, expressed as a percentile of the sampled CD values ranging from 0 to 0.875 and the number of genotyped inbred lines. The height of each point in the wireframe represents the prediction accuracy obtained by ɛ-SVR and BLP when training on the optimal selection of hybrids under the constraints imposed by the levels of the two independent variables. Prediction accuracy is expressed as the average squared Pearson correlation between the simulated and the predicted genotypic values of the hybrids. The interval at the bottom of each wireframe provides the minimum and maximum standard error of the mean. The scales of the vertical axes are comparable only within the same heritability level.

References

    Asahiro, Y., K. Iwama, H. Tamaki and T. Tokuyama, 2000. Greedily finding a dense subgraph. Algorithmica 34: 203–221.
    Battiti, R., and M. Protasi, 2001. Reactive local search for the maximum clique problem. Algorithmica 29: 610–637.
    Bernardo, R., 1994. Prediction of maize single-cross performance using RFLPs and information from related hybrids. Crop Sci. 34: 20–25.
    Bernardo, R., 1995. Genetic models for predicting maize single-cross performance in unbalanced yield trial data. Crop Sci. 35: 141–147.
    Bernardo, R., 1996. Best linear unbiased prediction of the performance of crosses between untested maize inbreds. Crop Sci. 36: 50–56.
