Empirically-derived synthetic populations to mitigate small sample sizes

Erin E Fowler¹, Anders Berglund², Michael J Schell³, Thomas A Sellers, Steven Eschrich⁴, John Heine⁵

Affiliations

¹ Cancer Epidemiology Department, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States. Electronic address: erin.fowler@moffitt.org.
² Department of Biostatistics and Bioinformatics, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States. Electronic address: anders.berglund@moffitt.org.
³ Department of Biostatistics and Bioinformatics, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States. Electronic address: michael.schell@moffitt.org.
⁴ Department of Biostatistics and Bioinformatics, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States. Electronic address: steven.eschrich@moffitt.org.
⁵ Cancer Epidemiology Department, MCC, Moffitt Cancer Center & Research Institute, 12901 Bruce B. Downs Blvd, Tampa, FL 33612, United States. Electronic address: john.heine@moffitt.org.

PMID: 32173502
PMCID: PMC7839232
DOI: 10.1016/j.jbi.2020.103408

Erin E Fowler et al. J Biomed Inform. 2020 May;105:103408. doi: 10.1016/j.jbi.2020.103408. Epub 2020 Mar 12.

Abstract

Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method for generating synthetic populations (SPs) from limited samples, using matched case-control data (n = 180 pairs) treated as two separate limited samples. SPs were generated with multivariate kernel density estimation (KDE) using unconstrained bandwidth matrices. Four continuous variables and one categorical variable were included for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization based on covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Observed and synthetic samples were compared under the hypothesis that their empirical probability density functions (EPDFs) were similar. EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from Principal Component Analysis (PCA) scores, and from residuals summarized as the distance to the model in X-space (DModX), provided additional comparisons. Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. For both samples, PCA scores and residuals did not deviate significantly from those of their respective synthetic samples. The feasibility of this approach was demonstrated by producing synthetic data at the individual level that were statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger synthetic samples.
To develop this approach into a research tool for data exploration, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while still synthesizing that population accurately will be investigated. Once these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.
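
As a rough illustration of the two core ideas in the abstract (drawing a synthetic sample from a KDE with a full bandwidth matrix, then comparing it with the observed sample via MMD), a minimal Python sketch follows. It is not the authors' implementation. It assumes a Gaussian kernel, continuous variables only, simulated stand-in data, and a rule-of-thumb bandwidth matrix (scaled sample covariance) in place of the DE-optimized matrix; the function names `sample_gaussian_kde` and `mmd2_biased` are hypothetical.

```python
import numpy as np

def sample_gaussian_kde(x, h, n_draws, rng):
    """Draw points from a Gaussian KDE centered on the rows of x.

    h is a full (unconstrained) bandwidth matrix, used as the kernel covariance.
    Sampling = pick an observed individual at random, then add N(0, h) noise.
    """
    centers = x[rng.integers(0, len(x), size=n_draws)]
    noise = rng.multivariate_normal(np.zeros(x.shape[1]), h, size=n_draws)
    return centers + noise

def mmd2_biased(a, b, sigma):
    """Biased estimate of the squared MMD with a Gaussian RBF kernel."""
    def gram(p, q):
        d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return gram(a, a).mean() + gram(b, b).mean() - 2.0 * gram(a, b).mean()

rng = np.random.default_rng(0)

# Simulated stand-in for one observed sample: 180 individuals, 4 continuous
# variables. This is NOT the study data.
observed = rng.multivariate_normal(np.zeros(4), np.eye(4) + 0.3, size=180)
n, d = observed.shape

# Assumption: rule-of-thumb bandwidth matrix (scaled sample covariance),
# standing in for the DE-optimized matrix described in the abstract.
h = np.cov(observed, rowvar=False) * n ** (-2.0 / (d + 4))

synthetic = sample_gaussian_kde(observed, h, n_draws=180, rng=rng)

# Kernel width from the median pairwise distance of the pooled data
# (a common heuristic, assumed here).
pooled = np.vstack([observed, synthetic])
d2 = ((pooled[:, None, :] - pooled[None, :, :]) ** 2).sum(axis=-1)
sigma = np.sqrt(np.median(d2[d2 > 0]))

mmd_synthetic = mmd2_biased(observed, synthetic, sigma)
# A deliberately shifted copy gives a reference for a clearly different sample.
mmd_shifted = mmd2_biased(observed, observed + 3.0, sigma)
print(f"MMD^2 observed vs synthetic: {mmd_synthetic:.4f}")
print(f"MMD^2 observed vs shifted:   {mmd_shifted:.4f}")
```

The sketch only computes the statistic. A full two-sample test would also need a null distribution, for example from permuting the pooled sample, which is omitted here.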

Keywords: Differential evolution; Distance to the model in X-space; Kernel density estimation; Overfitting; Principal component analysis; Synthetic data generation.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

**Figure 1.**
Synthetic Sample Generation Flow Diagram: This shows the steps for generating a synthetic sample. Differential evolution (DE) optimization is a cyclic process that determines the optimal H. The parameter vectors are output by DE (left), and the fitter vector moves to the next DE generation (right).

**Figure 2.**
Differential Evolution Trial Competition: This illustrates the basic competition for one trial within a given generation. There are Np trial competitions per generation. The i-th vector from the current generation, w_ig, competes with a trial vector, u_ig. The trial vector is a mutation of w_ig with attributes derived from the population, w_ig, as shown in Figure 3. A synthetic population (SP) is generated with each vector. Synthetic samples are constructed from each SP; these are used to make the H comparisons.

**Figure 3.**
Trial Vector Construction: The trial vector, u_ig, is constructed with components from w_ig and the mutant vector v_ig. The v_ig construction is shown in Eq. [7]. The vectors u_ig and w_ig compete as shown in Figure 2.
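
The trial construction and competition described in Figures 2 and 3 can be sketched in Python as below. Eq. [7] is not reproduced on this page, so the sketch assumes the classic DE/rand/1/bin scheme (mutant = one population vector plus a scaled difference of two others, binomial crossover, greedy selection), which may differ from the paper's exact form. The cost function `toy_cost` is a hypothetical stand-in that scores a matrix H = L Lᵀ against a fixed target; it does not build synthetic populations or perform the covariance comparisons used in the study.

```python
import numpy as np

def make_trial(pop, i, f, cr, rng):
    """Build trial vector u_i from target w_i and a mutant v_i (assumed DE/rand/1/bin)."""
    n_p, dim = pop.shape
    candidates = [j for j in range(n_p) if j != i]
    r1, r2, r3 = rng.choice(candidates, size=3, replace=False)
    mutant = pop[r1] + f * (pop[r2] - pop[r3])   # v_i
    take = rng.random(dim) < cr                  # components taken from the mutant
    take[rng.integers(dim)] = True               # guarantee at least one mutant component
    return np.where(take, mutant, pop[i])        # u_i

def de_generation(pop, cost, fitness, f, cr, rng):
    """Run the Np trial competitions of one generation (greedy selection)."""
    new_pop, new_cost = pop.copy(), cost.copy()
    for i in range(len(pop)):
        trial = make_trial(pop, i, f, cr, rng)
        c = fitness(trial)
        if c <= cost[i]:                         # the fitter vector moves on
            new_pop[i], new_cost[i] = trial, c
    return new_pop, new_cost

# Hypothetical toy objective: each vector holds the lower-triangular entries
# of L, and H = L @ L.T is scored against a fixed 2x2 target matrix.
target = np.array([[0.5, 0.2], [0.2, 0.3]])

def toy_cost(w):
    l = np.array([[w[0], 0.0], [w[1], w[2]]])
    return np.linalg.norm(l @ l.T - target)

rng = np.random.default_rng(1)
pop = rng.uniform(-1.0, 1.0, size=(20, 3))       # Np = 20 parameter vectors
cost = np.array([toy_cost(w) for w in pop])
initial_cost = cost.copy()

for _ in range(200):                             # 200 generations
    pop, cost = de_generation(pop, cost, toy_cost, f=0.7, cr=0.9, rng=rng)

best = pop[np.argmin(cost)]
print("best cost:", cost.min())
```

Parameterizing H through a triangular factor keeps every candidate symmetric and positive semidefinite without extra constraints; this is a design choice of the sketch, not something stated in the source.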

**Figure 4.**
Control-sample sparsity illustration: This illustration shows multiple projections of the sample’s EPDF onto the mass-PD plane. Mass (kg) is on the vertical axis and PD (breast density measure) is on the horizontal axis. For example, the light-blue points (two individuals) are individuals with age = 58 years and height = 64 inches (sparse projection). Each point represents one individual.

**Figure 5.**
Synthetic control population reconstruction illustration: This shows a two-dimensional slice through the synthetic population in pane-A for women with age = 58 years and height = 64 inches. This corresponds to the sparsest sample (n = 2) in Figure 4 (light-blue). The dashed line marks a profile for women with mass = 64 kg. The corresponding conditional PDF for PD (breast density measure) is shown in pane-B before scaling and normalization were applied.

**Figure 6.**
PCA models for the Observed and Synthetic Samples: The first two principal components derived from the case-sample are shown in pane-A together with the predictions of the synthetic data. There is no visible difference between the observed dataset and the four synthetic datasets. The residuals, DModX, are shown in pane-B as a boxplot with violin density lines. The differences in the residuals between the observed and synthetic data were not significant. Similar results were noted for the control-sample in pane-C and pane-D.
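
The comparison in Figure 6 (fit PCA on the observed sample, project the synthetic data onto it, then inspect scores and DModX) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the data are simulated, the variables are autoscaled, and DModX is computed as a simplified normalized residual standard deviation without the correction factor that some software applies to training observations. The helper names `fit_pca` and `project` are hypothetical.

```python
import numpy as np

def fit_pca(x, n_components):
    """Fit a PCA model on autoscaled data and store the pooled residual SD (s0)."""
    mean, scale = x.mean(axis=0), x.std(axis=0, ddof=1)
    z = (x - mean) / scale
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[:n_components].T               # shape (K, A)
    resid = z - z @ loadings @ loadings.T
    n, k = z.shape
    # Pooled residual SD with degrees of freedom (N - A - 1) * (K - A).
    s0 = np.sqrt((resid ** 2).sum() / ((n - n_components - 1) * (k - n_components)))
    return {"mean": mean, "scale": scale, "loadings": loadings, "s0": s0}

def project(model, x):
    """Return PCA scores and simplified normalized DModX for the rows of x."""
    z = (x - model["mean"]) / model["scale"]
    scores = z @ model["loadings"]
    resid = z - scores @ model["loadings"].T
    k, a = model["loadings"].shape
    dmodx = np.sqrt((resid ** 2).sum(axis=1) / (k - a)) / model["s0"]
    return scores, dmodx

rng = np.random.default_rng(2)
cov = np.eye(4) + 0.3
# Simulated stand-ins (NOT the study data): an "observed" sample and a
# "synthetic" sample drawn from the same distribution.
observed = rng.multivariate_normal(np.zeros(4), cov, size=180)
synthetic = rng.multivariate_normal(np.zeros(4), cov, size=180)

model = fit_pca(observed, n_components=2)        # first two principal components
scores_obs, dmodx_obs = project(model, observed)
scores_syn, dmodx_syn = project(model, synthetic)

print("median DModX observed: ", np.median(dmodx_obs))
print("median DModX synthetic:", np.median(dmodx_syn))
```

Fitting the model on the observed sample only and treating the synthetic data as predictions mirrors the caption; the score and DModX distributions could then be compared with a two-sample statistic of choice.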
