J Biomed Inform. 2020 May;105:103408.
doi: 10.1016/j.jbi.2020.103408. Epub 2020 Mar 12.

Empirically-derived synthetic populations to mitigate small sample sizes

Erin E Fowler et al. J Biomed Inform. 2020 May.

Abstract

Limited sample sizes can lead to spurious modeling findings in biomedical research. The objective of this work is to present a new method to generate synthetic populations (SPs) from limited samples using matched case-control data (n = 180 pairs), considered as two separate limited samples. SPs were generated with multivariate kernel density estimations (KDEs) with unconstrained bandwidth matrices. We included four continuous variables and one categorical variable for each individual. Bandwidth matrices were determined with Differential Evolution (DE) optimization by covariance comparisons. Four synthetic samples (n = 180) were derived from their respective SPs. Similarity between the observed and synthetic samples was assessed by testing whether their empirical probability density functions (EPDFs) were similar. EPDFs were compared with the maximum mean discrepancy (MMD) test statistic based on the Kernel Two-Sample Test. To evaluate similarity within a modeling context, EPDFs derived from the Principal Component Analysis (PCA) scores and residuals were summarized with the distance to the model in X-space (DModX) as additional comparisons. Four SPs were generated from each sample. The probability of selecting a replicate when randomly constructing synthetic samples (n = 180) was infinitesimally small. MMD tests indicated that the observed sample EPDFs were similar to the respective synthetic EPDFs. For the samples, PCA scores and residuals did not deviate significantly when compared with their respective synthetic samples. The feasibility of this approach was demonstrated by producing synthetic data at the individual level, statistically similar to the observed samples. The methodology coupled KDE with DE optimization and deployed novel similarity metrics derived from PCA. This approach could be used to generate larger-sized synthetic samples.
To develop this approach into a research tool for data exploration purposes, additional evaluation with increased dimensionality is required. Moreover, given a fully specified population, the degree to which individuals can be discarded while synthesizing the respective population accurately will be investigated. When these objectives are addressed, comparisons with other techniques such as bootstrapping will be required for a complete evaluation.
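
As a rough illustration of two steps named in the abstract, the sketch below draws a synthetic sample from a Gaussian KDE with a full (unconstrained) bandwidth matrix and then computes a biased estimate of the squared MMD between the observed and synthetic samples. It is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, the toy data, the placeholder bandwidth `H` (a scaled sample covariance rather than a DE-optimized matrix), the kernel width `sigma`, and the function names `sample_kde` and `mmd2_biased` are all hypothetical, and only continuous variables are handled.

```python
import numpy as np

def sample_kde(data, H, n_samples, rng):
    """Draw from a Gaussian KDE with full bandwidth matrix H.

    Sampling a KDE amounts to picking an observed row at random and
    adding zero-mean Gaussian noise with covariance H.
    """
    n, d = data.shape
    centers = data[rng.integers(0, n, size=n_samples)]
    noise = rng.multivariate_normal(np.zeros(d), H, size=n_samples)
    return centers + noise

def mmd2_biased(x, y, sigma):
    """Biased estimate of squared MMD with a Gaussian RBF kernel."""
    def gram(a, b):
        sq = (np.sum(a**2, axis=1)[:, None]
              + np.sum(b**2, axis=1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

rng = np.random.default_rng(0)
observed = rng.normal(size=(180, 4))        # toy stand-in for four continuous variables
H = 0.1 * np.cov(observed, rowvar=False)    # placeholder bandwidth, NOT DE-optimized
synthetic = sample_kde(observed, H, 180, rng)
mmd_value = mmd2_biased(observed, synthetic, sigma=1.0)
```

A small `mmd_value` suggests that the two samples have similar distributions; turning it into a test would additionally require a null distribution (for example by permutation), which this sketch omits.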

Keywords: Differential evolution; Distance to the model in X-space; Kernel density estimation; Overfitting; Principal component analysis; Synthetic data generation.
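
The differential evolution keyword refers to the trial competition outlined in the captions of Figures 2 and 3: each current vector competes with a trial vector built from itself and a mutant vector, and the fitter of the two moves to the next generation. The sketch below assumes the classic DE/rand/1/bin scheme, which may differ from the paper's Eq. [7]; the scale factor `F`, crossover rate `CR`, the function name `de_generation`, and the toy sphere cost are hypothetical. In the paper the vectors would encode bandwidth matrix entries and the cost would come from covariance comparisons.

```python
import numpy as np

def de_generation(pop, cost, rng, F=0.8, CR=0.9):
    """Run one DE/rand/1/bin generation: mutate, cross over, select."""
    n_pop, dim = pop.shape
    new_pop = pop.copy()
    for i in range(n_pop):
        # Pick three distinct population members other than i.
        others = [j for j in range(n_pop) if j != i]
        r1, r2, r3 = rng.choice(others, size=3, replace=False)
        v = pop[r1] + F * (pop[r2] - pop[r3])      # mutant vector
        # Binomial crossover: take each component from v with probability CR,
        # forcing at least one component to come from the mutant.
        mask = rng.random(dim) < CR
        mask[rng.integers(dim)] = True
        u = np.where(mask, v, pop[i])              # trial vector
        # Trial competition: the fitter vector survives.
        if cost(u) <= cost(pop[i]):
            new_pop[i] = u
    return new_pop

def sphere(w):
    """Toy cost function, a stand-in for the covariance comparison."""
    return float(np.sum(w**2))

rng = np.random.default_rng(1)
pop0 = rng.uniform(-5, 5, size=(20, 3))
pop = pop0
for _ in range(50):
    pop = de_generation(pop, sphere, rng)

best_before = min(sphere(w) for w in pop0)
best_after = min(sphere(w) for w in pop)
```

Because a vector is only replaced by a trial that is at least as fit, the best cost in the population cannot increase from one generation to the next.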

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Figure 1.
Synthetic Sample Generation Flow Diagram: This shows the steps for generating a synthetic sample. Differential evolution (DE) optimization is a cyclic process that determines the optimal bandwidth matrix, H. The parameter vectors are output by DE (left) and the fitter vector moves to the next DE generation (right).
Figure 2.
Differential Evolution Trial Competition: This illustrates the basic competition for one trial within a given generation. There are Np trial competitions per generation. The ith vector from the current generation, w_{i,g}, competes with a trial vector, u_{i,g}. The trial vector is a mutation of w_{i,g} with attributes derived from the population, w_{i,g}, as shown in Figure 3. A synthetic population (SP) is generated with each vector. Synthetic samples are constructed from each SP; these are used to make the H comparisons.
Figure 3.
Trial Vector Construction: The trial vector, u_{i,g}, is constructed with components from w_{i,g} and the mutant vector, v_{i,g}. The v_{i,g} construction is shown in Eq. [7]. The vectors u_{i,g} and w_{i,g} compete as shown in Figure 2.
Figure 4.
Control-sample sparsity illustration: This illustration shows multiple projections of the sample’s EPDF onto the mass-PD plane. Mass (kg) is on the vertical axis and PD (breast density measure) is on the horizontal axis. For example, the light-blue points (two individuals) are individuals with age = 58 years and height = 64 inches (sparse projection). Each point represents one individual.
Figure 5.
Synthetic control population reconstruction illustration: This shows a two-dimensional slice through the synthetic population in pane-A for women with age = 58 years and height = 64 inches. This corresponds to the sparsest sample (n = 2) in Figure 4 (light-blue). The dashed line marks a profile for women with mass = 64 kg. The corresponding conditional PDF for PD (breast density measure) is shown in pane-B before scaling and normalization were applied.
Figure 6.
PCA models for the Observed and Synthetic Samples: The first two principal components derived from the case-sample are shown in pane-A together with the predictions of the synthetic data. There is no visible difference between the observed data and the four different synthetic datasets. The residuals, DModX, are shown in pane-B as a boxplot with violin density lines. The differences in the residuals between the observed and synthetic data were not significant. Similar results were noted for the control-sample in pane-C and pane-D.
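
The residual comparison described for Figure 6 can be sketched as follows: fit a PCA model on the observed sample, project both the observed and synthetic samples onto it, and summarize what the model fails to explain for each individual. This is a simplified, unnormalized residual distance in the spirit of DModX, not the exact statistic used in the paper; the function names `pca_fit` and `residual_distance`, the toy data, and the choice of two components are assumptions, and variable scaling is omitted.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA by SVD; return the column means and the loadings (d x k)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def residual_distance(X, mean, loadings):
    """Per-row residual standard deviation after projection onto the model."""
    Xc = X - mean
    scores = Xc @ loadings                 # coordinates in the PCA subspace
    resid = Xc - scores @ loadings.T       # part not explained by the model
    dof = X.shape[1] - loadings.shape[1]   # variables minus components
    return np.sqrt((resid**2).sum(axis=1) / dof)

rng = np.random.default_rng(2)
observed = rng.normal(size=(180, 4))     # toy stand-in for an observed sample
synthetic = rng.normal(size=(180, 4))    # toy stand-in for a synthetic sample

mean, loadings = pca_fit(observed, n_components=2)
d_obs = residual_distance(observed, mean, loadings)
d_syn = residual_distance(synthetic, mean, loadings)
```

Comparing the distributions of `d_obs` and `d_syn` (for example with boxplots) then indicates whether the synthetic individuals fit the observed-sample model about as well as the observed individuals do.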


