BMC Bioinformatics. 2012 Nov 23;13:312.
doi: 10.1186/1471-2105-13-312.

Core Hunter II: fast core subset selection based on multiple genetic diversity measures using Mixed Replica search

Herman De Beukelaer et al. BMC Bioinformatics.

Abstract

Background: Sampling core subsets from genetic resources while maintaining as much of the original collection's genetic diversity as possible is an important but computationally complex task for gene bank managers. The Core Hunter computer program was developed as a tool to generate such subsets based on multiple genetic measures, including both distance measures and allelic diversity indices. First, we investigate the effect of minimum (instead of the default mean) distance measures on the performance of Core Hunter. Second, we try to gain more insight into the performance of the original Core Hunter search algorithm by comparing it with several other heuristics on realistic datasets of varying size and allelic composition. Finally, we propose a new algorithm (Mixed Replica search) for Core Hunter II with the aim of improving the diversity of the constructed core sets and reducing their generation times.

Results: Our results show that the introduction of minimum distance measures leads to core sets in which all accessions are sufficiently distant from each other, which was not always obtained when optimizing mean distance alone. Comparison of the original Core Hunter algorithm, Replica Exchange Monte Carlo (REMC), with simpler heuristics shows that the simpler algorithms often give very good results with lower runtimes than REMC. However, the simpler algorithms perform slightly worse than REMC at lower sampling intensities, and some heuristics clearly struggle with minimum distance measures. In comparison, the new Mixed Replica search algorithm (MixRep), which uses heterogeneous replicas, was able to sample core sets with equal or higher diversity scores than REMC and the simpler heuristics, often using less computation time than REMC.
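
The difference between mean and minimum distance objectives can be illustrated with a minimal sketch. The one-dimensional positions and the helper names below are hypothetical and are not taken from Core Hunter; they only show that two candidate cores can have the same mean pairwise distance while one of them still contains two nearly identical accessions.

```python
from itertools import combinations

def pairwise_distances(dist, subset):
    """Return all pairwise distances between the accessions in `subset`."""
    return [dist[i][j] for i, j in combinations(subset, 2)]

def mean_distance(dist, subset):
    """Mean pairwise distance within the subset."""
    d = pairwise_distances(dist, subset)
    return sum(d) / len(d)

def min_distance(dist, subset):
    """Smallest pairwise distance within the subset."""
    return min(pairwise_distances(dist, subset))

# Hypothetical toy data: five accessions placed on a line.
positions = [0, 1, 50, 99, 100]
dist = [[abs(a - b) for b in positions] for a in positions]

core_a = [0, 1, 4]  # contains two near-duplicates (positions 0 and 1)
core_b = [0, 2, 4]  # evenly spread

for name, core in (("core_a", core_a), ("core_b", core_b)):
    print(name, mean_distance(dist, core), min_distance(dist, core))
```

Both cores have the same mean distance here, so a mean-only objective cannot distinguish them; the minimum distance separates them clearly.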

Conclusion: The REMC search algorithm used in the original Core Hunter computer program performs well, sometimes leading to slightly better results than some of the simpler methods, although it does not always give the best results. Switching to the new Mixed Replica algorithm can significantly improve overall results and runtimes. Finally, we recommend including minimum distance measures in the objective function when looking for core sets in which all accessions are sufficiently distant from each other. Core Hunter II is freely available as an open source project at http://www.corehunter.org.
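
As a rough illustration of including a minimum distance measure in the objective function, the sketch below combines mean and minimum pairwise distance in a weighted score and optimizes it with a plain random-swap local search. This is a deliberately simple heuristic, not REMC or Mixed Replica search, and the function names, weights, and toy data are assumptions made for this example only.

```python
import random
from itertools import combinations

def mixed_score(dist, subset, w_mean=0.5, w_min=0.5):
    """Weighted sum of mean and minimum pairwise distance (weights are assumed)."""
    d = [dist[i][j] for i, j in combinations(sorted(subset), 2)]
    return w_mean * sum(d) / len(d) + w_min * min(d)

def local_search(dist, core_size, steps=500, seed=0):
    """Random-swap hill climbing; requires 2 <= core_size < len(dist)."""
    rng = random.Random(seed)
    n = len(dist)
    core = set(rng.sample(range(n), core_size))
    best = mixed_score(dist, core)
    for _ in range(steps):
        # Propose swapping one selected accession for one unselected accession.
        out_item = rng.choice(sorted(core))
        in_item = rng.choice(sorted(set(range(n)) - core))
        candidate = (core - {out_item}) | {in_item}
        score = mixed_score(dist, candidate)
        if score > best:  # accept only strict improvements
            core, best = candidate, score
    return sorted(core), best

# Hypothetical toy data: five accessions placed on a line.
positions = [0, 1, 50, 99, 100]
dist = [[abs(a - b) for b in positions] for a in positions]

core, score = local_search(dist, core_size=3)
print(core, score)
```

Because the search accepts only improving swaps, it can get stuck in local optima on realistic data, which is one motivation for replica-based methods such as those compared in the abstract.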

Figures

Figure 1
3D Toy example datasets, optimizing mean versus minimum distance. Core collections sampled from two generated three-dimensional toy example datasets, respectively of size 500 and 1000, the former being completely random, the latter having a very strongly clustered structure. Both datasets contain only one single marker with 3 corresponding alleles. Core selection was performed using the REMC algorithm, optimizing mean (top) and minimum (bottom) MR distances. For the random dataset, the sampling intensity is set to 0.2, while an intensity of 0.05 is used for the larger, clustered set. (a) random dataset, mean Modified Rogers’ distance (sampling intensity = 0.2), (b) clustered dataset, mean Modified Rogers’ distance (sampling intensity = 0.05), (c) random dataset, minimum Modified Rogers’ distance (sampling intensity = 0.2), (d) clustered dataset, minimum Modified Rogers’ distance (sampling intensity = 0.05).
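
For readers unfamiliar with the Modified Rogers' (MR) distance named in the caption, the following is a minimal sketch assuming the commonly cited definition: the square root of the summed squared allele frequency differences, divided by twice the number of loci. The abstract does not state the formula, so this definition and the function name are assumptions.

```python
from math import sqrt

def modified_rogers(p, q):
    """MR distance between two accessions (assumed definition).

    p, q: lists of loci, each locus a list of allele frequencies summing to 1.
    """
    num_loci = len(p)
    squared = sum(
        (a - b) ** 2
        for locus_p, locus_q in zip(p, q)
        for a, b in zip(locus_p, locus_q)
    )
    return sqrt(squared / (2 * num_loci))

# One marker with 3 alleles, as in the toy datasets described in the caption.
print(modified_rogers([[1, 0, 0]], [[0, 1, 0]]))      # no shared alleles
print(modified_rogers([[0.5, 0.5, 0]], [[1, 0, 0]]))  # partially shared
```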
Figure 2
PCA plots and distance histograms of cores sampled from large pea dataset. This figure shows both PCA plots and distance histograms of core collections sampled from the large pea dataset, once obtained by optimizing mean MR alone and once by optimizing the mixed MR objective, which includes both mean and minimum MR distance with equal weight. The sampling intensity was set to 0.2 and cores were constructed using the LR method. (a) optimizing mean Modified Rogers’ distance – core structure, (b) optimizing mixed Modified Rogers’ distance – core structure, (c) optimizing mean Modified Rogers’ distance – pairwise distance distribution, (d) optimizing mixed Modified Rogers’ distance – pairwise distance distribution.
