BMC Bioinformatics. 2012;13 Suppl 19(Suppl 19):S17.
doi: 10.1186/1471-2105-13-S19-S17. Epub 2012 Dec 19.

ARG-based genome-wide analysis of cacao cultivars

Filippo Utro et al. BMC Bioinformatics. 2012.

Abstract

Background: An ancestral recombination graph (ARG) is a topological structure that captures the relationship between extant genomic sequences in terms of genetic events, including recombinations. IRiS is a system that estimates the ARG on sequences of individuals at genomic scales, capturing the relationship between these individuals of the species. Recently, this system was used to estimate the ARG of the recombining X chromosome of a collection of human populations using relatively dense, bi-allelic SNP data.

Results: While the ARG is a natural model for capturing the inter-relationship between copies of a single chromosome across the individuals of a species, it is not immediately apparent how the model can utilize whole-genome (across chromosomes) diploid data. Also, the sheer complexity of an ARG structure presents a challenge to graph visualization techniques. In this paper we examine the ARG reconstruction for (1) genome-wide or multiple chromosomes, (2) multi-allelic and (3) extremely sparse data. To aid in the visualization of the reconstructed ARG, we additionally construct a much simplified topology, a classification tree, suggested by the ARG.

As the test case, we study the problem of extracting the relationship between populations of Theobroma cacao. The chocolate tree is an outcrossing species in the wild, due to self-incompatibility mechanisms at play. Thus a principled approach to understanding the inter-relationships between the different populations must take the shuffling of the genomic segments into account. The polymorphisms in the test data are short tandem repeats (STR) and are multi-allelic (with as many as 30 distinct possible values at a locus). Each is at a genomic location that is bilaterally transmitted, hence the ARG is a natural model for this data. Another characteristic of this plant data set is that while it is genome-wide, across 10 linkage groups or chromosomes, it is very sparse, i.e., only 96 loci from a genome of approximately 400 megabases. The results are visualized both as MDS plots and as classification trees. To evaluate the accuracy of the ARG approach, we compare the results with those available in the literature.

Conclusions: We have extended the ARG model to incorporate genome-wide (ensemble of multiple chromosomes) data in a natural way. We present a simple scheme to implement this in practice. Finally, this is the first time that a plant population data set has been studied by estimating its underlying ARG. We demonstrate an overall precision of 0.92 and an overall recall of 0.93 of the ARG-based classification, with respect to the gold standard. While we have corroborated the classification of the samples with that in the literature, this opens the door to other potential studies that can be made on the ARG.
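As an illustration of how precision and recall of a classification can be computed against a gold standard, here is a minimal Python sketch. The abstract does not say how the overall figures were aggregated, so the macro-averaging used here is an assumption, as are the function names and the toy labels "A", "B" and "C". The harmonic-mean F-measure is the standard textbook definition and is not necessarily the paper's F-index (Eqn 2).

```python
from collections import Counter


def per_class_precision_recall(gold, predicted):
    """Return {label: (precision, recall)} for two parallel label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    gold_count = Counter(gold)        # samples per class in the gold standard
    pred_count = Counter(predicted)   # samples per class in the prediction
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    scores = {}
    for label in set(gold) | set(predicted):
        precision = true_pos[label] / pred_count[label] if pred_count[label] else 0.0
        recall = true_pos[label] / gold_count[label] if gold_count[label] else 0.0
        scores[label] = (precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of per-class precision and recall (an assumed aggregation)."""
    n = len(scores)
    mean_precision = sum(p for p, _ in scores.values()) / n
    mean_recall = sum(r for _, r in scores.values()) / n
    return mean_precision, mean_recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with hypothetical cluster labels, not data from the paper.
gold = ["A", "A", "A", "B", "B", "C"]
predicted = ["A", "A", "B", "B", "B", "C"]
scores = per_class_precision_recall(gold, predicted)
overall_precision, overall_recall = macro_average(scores)
```

In the toy example one sample of class "A" is assigned to "B", which lowers the recall of "A" and the precision of "B" while leaving "C" unaffected.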


Figures

Figure 1
The essential characteristics of the data set. (a) A summary of the distribution of the polymorphic sites over the chromosomes and the missing values. (b) Number of distinct allelic values per polymorphic site (ranging from 3 to 30). The red vertical lines are the boundaries of the chromosomes.
Figure 2
The Ensemble Method. An example with 4 chromosomes, each with an orientation and a distinct color above. (i) The chromosomes are first arranged as a circular ring. (ii) Then they are randomly permuted and randomly flipped. Then a random cut is placed on the ring, shown as a dashed line. (iii) Then the ring is flattened out and is staged as input to the IRiS pipeline.
Figure 3
The work flow for multiple (G') chromosomes. (a) IRiS pipeline: In Phase 1, we use two different parameter settings to obtain an intermediate matrix (called recomatrix) for Phase 2. The second phase constructs the ARG. (b) Some Z' ≤ G' chromosomes are used through the IRiS pipeline and the Z' results are consolidated to obtain the final analysis. (c) The IRiS pipeline is used on an ensemble sequence (see Fig. 2) of the G' chromosomes some N times and the results are consolidated.
Figure 4
Results for the ensemble method on the complete data set. Visualization of the ARG results as a classification tree and an MDS plot: the Ensemble method on all ten chromosomes on the complete data set. (a) A classification tree. (b) The first two components of an MDS of the pairwise distances.
Figure 5
Results for the ensemble method on the subsample data set. Visualization of the ARG results as a classification tree and an MDS plot: the Ensemble method on all ten chromosomes on the subsample data set. (a) A classification tree. Note that [5] presents a classification tree for this subsample data set and the agreement index of this tree is 0.43. (b) The first two components of an MDS of the pairwise distances.
Figure 6
Summary of the results. Application of the two methods (Solo and Ensemble) to a variety of data configurations: the results from each case are compared with the gold standard in [5], and the table summarizes the F-index and the Agreement metric. The 3 longest chromosomes are Chr 1, Chr 3 and Chr 5. The '-' values are either extremely low or too poor for any classification. The F-index (Eqn 2) and the Agreement index (Eqn 4) are each real values between 0.0 and 1.0 inclusive, with the theoretical best at 1.0.
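
The three steps in the Figure 2 caption (arrange the chromosomes on a ring, permute and flip them at random, then cut the ring and flatten it) can be sketched in Python as follows. This is a hypothetical illustration and not the IRiS implementation. The function names, the equal flip probability, and the choice to allow the cut at any locus boundary on the ring (rather than only between chromosomes) are assumptions, since the caption does not specify them. The sketch produces an ordering of loci so that the same arrangement can be applied to every sample.

```python
import random


def ensemble_order(chrom_sizes, rng):
    """Return one ensemble ordering as a list of (chromosome, locus) index pairs.

    chrom_sizes: number of loci on each chromosome.
    rng: a random.Random instance, passed in so runs are reproducible.
    """
    # Step (i): each chromosome is a block of loci in its original orientation.
    chroms = [[(c, i) for i in range(n)] for c, n in enumerate(chrom_sizes)]

    # Step (ii): randomly permute the chromosomes around the ring ...
    rng.shuffle(chroms)
    ring = []
    for loci in chroms:
        # ... and randomly flip each one (assumed probability 0.5).
        if rng.random() < 0.5:
            loci = loci[::-1]
        ring.extend(loci)
    if not ring:
        return []

    # Random cut on the ring (assumed: any locus boundary is allowed).
    cut = rng.randrange(len(ring))

    # Step (iii): flatten the ring into a linear sequence starting at the cut.
    return ring[cut:] + ring[:cut]


def apply_order(sample, order):
    """Rearrange one sample, given as a list of per-chromosome allele lists."""
    return [sample[c][i] for c, i in order]


# Toy usage: 4 chromosomes with hypothetical locus counts and allele values.
rng = random.Random(0)
sizes = [3, 2, 4, 1]
order = ensemble_order(sizes, rng)
sample = [[11, 12, 13], [21, 22], [31, 32, 33, 34], [41]]
flattened = apply_order(sample, order)
```

Calling `ensemble_order` N times with the same `rng` yields N different arrangements, which would correspond to the N runs mentioned in the Figure 3 caption.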

References

    1. Bartley BGD. The genetic diversity of cacao and its utilization. CABI Pub; 2005.
    2. Laurent V, Risterucci AM, Lanaud C. Genetic diversity in cocoa revealed by cDNA probes. TAG Theoretical and Applied Genetics. 1994;88:193–198.
    3. Lerceteau E, Robert T, Pétiard V, Crouzillat D. Evaluation of the extent of genetic variability among Theobroma cacao accessions using RAPD and RFLP markers. TAG Theoretical and Applied Genetics. 1997;95:10–19. doi: 10.1007/s001220050527.
    4. Sereno M, Albuquerque P, Vencovsky R, Figueira A. Genetic Diversity and Natural Population Structure of Cacao (Theobroma cacao L.) from the Brazilian Amazon Evaluated by Microsatellite Markers. Conservation Genetics. 2006;7:13–24. doi: 10.1007/s10592-005-7568-0.
    5. Motamayor JC, Lachenaud P, da Silva e Mota JW, Loor R, Kuhn DN, Brown JS, Schnell RJ. Geographic and Genetic Population Differentiation of the Amazonian Chocolate Tree (Theobroma cacao L). PLoS ONE. 2008;3:e3311. doi: 10.1371/journal.pone.0003311.