How Population Structure Impacts Genomic Selection Accuracy in Cross-Validation: Implications for Practical Breeding

Christian R Werner¹, R Chris Gaynor¹, Gregor Gorjanc¹, John M Hickey¹, Tobias Kox², Amine Abbadi², Gunhild Leckband³, Rod J Snowdon⁴, Andreas Stahl⁴ ⁵

Affiliations

¹ The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush Research Centre, Midlothian, United Kingdom.
² NPZ Innovation GmbH, Holtsee, Germany.
³ German Seed Alliance GmbH, Hohenlieth, Germany.
⁴ Department of Plant Breeding, IFZ Research Centre for Biosystems, Land Use and Nutrition, Justus Liebig University, Giessen, Germany.
⁵ Julius Kuehn Institute (JKI), Federal Research Centre for Cultivated Plants, Institute for Resistance Research and Stress Tolerance, Quedlinburg, Germany.

PMID: 33391305
PMCID: PMC7772221
DOI: 10.3389/fpls.2020.592977

Christian R Werner et al. Front Plant Sci. 2020 Dec 16:11:592977. doi: 10.3389/fpls.2020.592977. eCollection 2020.

Abstract

Over the last two decades, the application of genomic selection has been extensively studied in various crop species, and it has become a common practice to report prediction accuracies using cross validation. However, genomic prediction accuracies obtained from random cross validation can be strongly inflated due to population or family structure, a characteristic shared by many breeding populations. An understanding of the effect of population and family structure on prediction accuracy is essential for the successful application of genomic selection in plant breeding programs. The objective of this study was to make this effect and its implications for practical breeding programs comprehensible for breeders and scientists with a limited background in quantitative genetics and genomic selection theory. We, therefore, compared genomic prediction accuracies obtained from different random cross validation approaches and within-family prediction in three different prediction scenarios. We used a highly structured population of 940 Brassica napus hybrids coming from 46 testcross families and two subpopulations. Our demonstrations show how genomic prediction accuracies obtained from among-family predictions in random cross validation and within-family predictions capture different measures of prediction accuracy. While among-family prediction accuracy measures prediction accuracy of both the parent average component and the Mendelian sampling term, within-family prediction only measures how accurately the Mendelian sampling term can be predicted. With this paper we aim to foster a critical approach to different measures of genomic prediction accuracy and a careful analysis of values observed in genomic selection experiments and reported in literature.

Keywords: genomic prediction; nested association mapping population; oilseed rape; predictive breeding; structure.
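The distinction the abstract draws between among-family and within-family accuracy can be illustrated with a small simulation. The sketch below is not the authors' analysis. It assumes a toy model in which each genetic value is a family-level parent average plus an individual Mendelian sampling term, and a hypothetical predictor that recovers the parent average exactly but the Mendelian sampling term only with noise. All function names, parameter names, and variance values are illustrative assumptions.

```python
import numpy as np

def simulate_accuracies(n_families=46, family_size=20, sd_parent_average=2.0,
                        sd_mendelian=1.0, sd_error=1.0, seed=1):
    """Return (pooled accuracy, mean within-family accuracy) for a toy population."""
    rng = np.random.default_rng(seed)
    # True genetic value = family-level parent average + individual Mendelian sampling term
    parent_average = rng.normal(0.0, sd_parent_average, size=(n_families, 1))
    mendelian = rng.normal(0.0, sd_mendelian, size=(n_families, family_size))
    true_value = parent_average + mendelian
    # Hypothetical predictor: exact parent average, noisy Mendelian sampling term
    noise = rng.normal(0.0, sd_error, size=mendelian.shape)
    predicted = parent_average + mendelian + noise
    # Among-family (pooled) accuracy: one correlation across all individuals
    pooled = np.corrcoef(true_value.ravel(), predicted.ravel())[0, 1]
    # Within-family accuracy: one correlation per family, then averaged
    within = np.mean([np.corrcoef(true_value[f], predicted[f])[0, 1]
                      for f in range(n_families)])
    return pooled, within

pooled_accuracy, within_accuracy = simulate_accuracies()
print(f"pooled: {pooled_accuracy:.2f}, within-family: {within_accuracy:.2f}")
```

Under these assumptions the pooled correlation is higher than the average within-family correlation, because differences between family means contribute to the pooled value but cancel out inside each family.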

Conflict of interest statement

AA and TK were employed by the company NPZ Innovation GmbH. GL was employed by the company German Seed Alliance GmbH. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

**Figure 1**
Schematic crossing scheme for development of the *Brassica napus* Nested Association Mapping (BnNAM) population based on 46 founder lines and generation of the corresponding test hybrids. Twenty-nine non-adapted lines (including kales, fodder rapes and 19 resynthesized lines) and 17 adapted lines (including a broad set of genetically diverse old European varieties and one resynthesized line) were crossed with the common elite parent “DH5Oase x Nugget” (DH5ON). Based on F2 individuals, 29 families of recombinant inbred lines were generated via three generations of single seed descent (SSD). Seventeen DH families were produced using doubled haploid technology. Subsequently, 940 test hybrids were generated by crossing the elite male-sterile parent “MSL007” with paternal recombinant inbred lines or doubled haploid lines.

**Figure 2**
Boxplots representing the phenotypic variation for **(A)** seed yield, **(B)** flowering time, **(C)** oil concentration, and **(D)** glucosinolate concentration. The three columns on the far left show the average of the total set of 940 test hybrids (orange) and of the subsets of DH testcrosses (white) and SSD testcrosses (blue), respectively. The phenotypic distribution within each of the 46 testcross families is presented in the individual boxplots.

**Figure 3**
Boxplots representing the prediction accuracies observed in prediction scenario 1 using the total set of 940 testcross hybrids. Traits included seed yield (YLD), flowering time (FLT), oil concentration in the seed (OIL) and glucosinolate content in the seed (GSL). Prediction accuracies were calculated using GEBV cross validation (GEBV-CV), genotypic parent average cross validation (GPA-CV) and within-testcross family validation (WFAM). In the two cross validation approaches, the data set was randomly divided into a training population and validation population over 100 iterations. The size of the validation population was set to 20. In the WFAM, all genotypes from one testcross family were used as validation population while the remaining testcross families served as training population.

**Figure 4**
Boxplots representing the prediction accuracies observed in prediction scenario 2 using the DH testcrosses **(A)** and SSD testcrosses **(B)**. Traits included seed yield (YLD), flowering time (FLT), oil concentration in the seed (OIL) and glucosinolate content in the seed (GSL). Prediction accuracies were calculated using GEBV cross validation (GEBV-CV), genotypic parent average cross validation (GPA-CV), phenotypic parent average cross validation (PPA-CV) and within-testcross family validation (WFAM). In the three cross validation approaches, the data set was randomly divided into a training population and validation population over 100 iterations. The size of the validation population was set to 20. In the WFAM, all genotypes from one testcross family were used as validation population while the remaining testcross families served as training population.

**Figure 5**
Correlations between observed and genomic predicted seed yield for two testcross families with very different phenotypic performance **(A)** and very similar phenotypic performance **(B)**. The correlation coefficient was used as a measure for prediction accuracy. Prediction accuracies were calculated as within-family prediction accuracies for each of the two families individually (blue and yellow solid lines) and among-family prediction accuracy for both families simultaneously (gray dashed line). The genomic prediction model was trained using all remaining testcross families.
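The validation schemes described in the captions of Figures 3 and 4 can be sketched as index generators. This is a minimal sketch, not the authors' code. The function names are hypothetical, and the family assignment in the usage lines is a placeholder, because the real family sizes are not given here. Only the validation size of 20, the 100 iterations, and the hold-out of one whole testcross family are taken from the captions.

```python
import numpy as np

def random_cv_splits(n_individuals, validation_size=20, n_iterations=100, seed=1):
    """Yield (train_idx, valid_idx) for repeated random hold-out validation."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iterations):
        permuted = rng.permutation(n_individuals)
        # First `validation_size` shuffled individuals are validated, the rest train the model
        yield np.sort(permuted[validation_size:]), np.sort(permuted[:validation_size])

def within_family_splits(family_labels):
    """Yield (family, train_idx, valid_idx), holding out one whole family at a time."""
    family_labels = np.asarray(family_labels)
    for family in np.unique(family_labels):
        is_valid = family_labels == family
        yield family, np.flatnonzero(~is_valid), np.flatnonzero(is_valid)

# Placeholder assignment: 940 hybrids spread over 46 families (real family sizes differ)
families = np.arange(940) % 46
cv_splits = list(random_cv_splits(len(families)))
wfam_splits = list(within_family_splits(families))
print(len(cv_splits), len(wfam_splits))
```

In the random scheme, relatives of the validation individuals usually remain in the training set. In the within-family scheme, no member of the validated family is used for training, and accuracy is then computed inside that family only.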
