BMC Plant Biol. 2025 Aug 9;25(1):1050.
doi: 10.1186/s12870-025-07128-y.

Integrated phenotypic analysis, predictive modeling, and identification of novel trait-associated loci in a diverse Theobroma cacao collection

Insuck Baek et al. BMC Plant Biol.

Abstract

Background: Cacao (Theobroma cacao L.) breeding and improvement rely on understanding germplasm diversity and trait architecture. This study characterized a collection of 173 cacao accessions evaluated in Puerto Rico, examining phenotypic diversity and trait interrelationships and comparing the results with published datasets from Trinidad and Colombia. We also developed machine learning (ML) models for yield prediction and identified yield-associated SNP markers.

Results: The cacao collection showed significant phenotypic variation and strong intra-collection trait correlations. Comparative analyses revealed conserved trait responses across environments, notably linking susceptibility to black pod rot in Puerto Rico with Witches' Broom Disease in Colombia, suggesting a broad-spectrum disease response mechanism. Machine learning models predicted yield effectively and quantified a hierarchy of predictor importance, with 'Total pods', 'Infection rate', and 'Pod weight' being the most influential. By integrating existing SNP data for 28 common accessions, we identified multiple SNPs significantly associated with key horticultural traits, including 'Total pods', 'Infection rate', and 'Yield' (FDR < 0.01). Notably, a single genetic marker on chromosome 5 (TcSNP475), located within a putative zinc finger stress-associated protein gene (Tc05_t008610), was associated with both 'Total pods' and 'Yield', making it a prime target for marker-assisted selection.

Conclusions: This research provides a detailed characterization of a wide germplasm collection, robust yield predictors, and a suite of novel trait-linked genetic markers, offering valuable resources for cacao breeding. These integrated findings will provide a solid foundation for targeted breeding strategies and deeper molecular investigations into the mechanisms underpinning yield and stress resilience in this vital global crop.

Conflict of interest statement

Declarations. Ethics approval and consent to participate: Not applicable. Consent for publication: Not applicable. Competing interests: The authors declare no competing interests.

Figures

Fig. 1
Principal component analysis of 173 cacao accessions based on all measured phenotypic traits. The plot displays the distribution of individual accessions along the first two principal components (PC1 and PC2). The 28 accessions with known genetic group assignments from Bekele et al. [22] are color-coded according to their respective group: AMAZ, IMC (red); CRIOLLO (orange); LCT_EEN, SCA, MO (yellow); NA, Pound (green); PA (blue); and SPEC (purple). The remaining 145 accessions are shown as black dots. Red vectors indicate the mean coordinates for each genetic group
Fig. 2
Pearson’s correlation matrix of horticultural traits from TARS cacao phenotype data. The upper triangle displays a heatmap of Pearson correlation coefficients (r), where red indicates positive correlations and blue indicates negative correlations. The lower triangle shows scatter plots for each pair of traits, with a fitted regression line. Trait names are indicated on the diagonal. Detailed correlation coefficients and p-values are provided in Supplementary Data 1
Fig. 3
Correlation heatmap of genetic cluster membership and phenotypic traits. The heatmap displays Pearson’s correlation coefficients (r) between membership coefficients for 28 common accessions to the K = 7 genetic clusters defined by Bekele et al. [22] and the corresponding phenotypic traits evaluated in Puerto Rico. Red squares indicate positive correlations and blue squares indicate negative correlations, with color intensity corresponding to the magnitude of the correlation coefficient
Fig. 4
Comparison of key horticultural traits among 28 cacao accessions grouped by genetic clusters. Boxplots illustrate the distribution of Dry Seed weight (g), Pod index, Total pods (count), and Yield (kg/tree/year) for accessions assigned to the K = 7 genetic clusters. Other horticultural traits evaluated did not show statistically significant differences among these genetic groups in this subset of accessions. Boxes represent the interquartile range (IQR), the horizontal line within the box indicates the median, and whiskers extend to 1.5 times the IQR. Different letters above the boxes indicate statistically significant differences (p < 0.05) between mean values for the genetic groups, based on ANOVA and post-hoc Student’s t-tests
Fig. 5
Comparative analysis using Pearson’s correlations of traits between the TARS evaluation dataset and the Agrosavia collection dataset for 20 overlapping cacao accessions. Traits from the TARS study are highlighted with a green background on the axes. Pearson correlation coefficients (r) are visualized by color intensity (red for positive, blue for negative). Detailed correlation coefficients and p-values are provided in Supplementary Data 1
Fig. 6
Volcano plots illustrating marker associations with key horticultural traits from the TARS cacao collection. Response screening analysis using genetic markers from the ICGT study as predictors for traits in 28 common accessions. In each plot, the x-axis represents the mean difference associated with each marker, the y-axis represents the Logworth score, and each point represents a single marker. The horizontal dashed line indicates the significance threshold (FDR p < 0.01). Points colored red or blue highlight the markers that surpassed this threshold. (a) Volcano plot for the ‘Total pods’ trait, highlighting three significant markers (the locus TcSNP475, and SNPs TcSNP428 and TcSNP154). (b) Volcano plot for the ‘Infection rate’ trait, highlighting one significant SNP (TcSNP508). (c) Volcano plot for the ‘Yield’ trait, highlighting two significant markers (the locus TcSNP475 and SNP TcSNP483)
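
The Fig. 6 caption describes a significance threshold at FDR p < 0.01 and a Logworth score on the y-axis. As a rough illustration of how such a screen can be computed, the sketch below applies a Benjamini-Hochberg adjustment to a set of per-marker p-values and converts the adjusted values to LogWorth, taken here as -log10 of the p-value. This is a minimal sketch under stated assumptions, not the authors' pipeline: the choice of Benjamini-Hochberg, the function names, the marker names, and the p-values are all hypothetical and do not come from the study.

```python
import math


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def logworth(p):
    """LogWorth, assumed here to be -log10(p); infinite when p is 0."""
    return math.inf if p <= 0 else -math.log10(p)


def significant_markers(marker_p, alpha=0.01):
    """Map each marker passing the FDR threshold to its FDR LogWorth."""
    names = list(marker_p)
    adjusted = benjamini_hochberg([marker_p[n] for n in names])
    return {n: logworth(q) for n, q in zip(names, adjusted) if q < alpha}


if __name__ == "__main__":
    # Hypothetical markers and p-values, for illustration only.
    example = {"marker_A": 0.00001, "marker_B": 0.004,
               "marker_C": 0.03, "marker_D": 0.5}
    for name, score in significant_markers(example).items():
        print(f"{name}: FDR LogWorth = {score:.2f}")
```

Under this convention, a threshold of FDR p < 0.01 corresponds to a horizontal line at LogWorth 2, which is one way to read the dashed line in a volcano plot of this kind.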

References

    1. Kongor JE, Owusu M, Oduro-Yeboah C. Cocoa production in the 2020s: challenges and solutions. CABI Agric Biosci. 2024;5:102.
    2. Argout X, Salse J, Aury J-M, Guiltinan MJ, Droc G, Gouzy J, et al. The genome of Theobroma cacao. Nat Genet. 2011;43:101–8.
    3. Aikpokpodion P. Phenology of flowering in Cacao (Theobroma cacao) and its related species in Nigeria. Afr J Agric Res. 2012;7:3395–402.
    4. Falque M, Lesdalons C, Eskes AB. Comparison of two Cacao (Theobroma Cacao L.) clones for the effect of pollination intensity on fruit set and seed content. Sex Plant Reprod. 1996;9:221–7.
    5. Snoeck D, Koko L, Joffre J, Bastide P, Jagoret P. Cacao nutrition and fertilization. In: Lichtfouse E, editor. Sustainable agriculture reviews: volume 19. Cham: Springer International Publishing; 2016. pp. 155–202.