Gdaphen, R pipeline to identify the most important qualitative and quantitative predictor variables from phenotypic data

Maria Del Mar Muñiz Moreno¹,², Claire Gavériaux-Ruff³, Yann Herault⁴,⁵

Affiliations

¹ Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC), 1 Rue Laurent Fries, 67404, Illkirch Graffenstaden, France. munizmorenomariadelmar@gmail.com.
² John P. Hussman Institute for Human Genomics, University of Miami, Miller School of Medicine, Miami, FL, 33136, USA. munizmorenomariadelmar@gmail.com.
³ Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC), 1 Rue Laurent Fries, 67404, Illkirch Graffenstaden, France.
⁴ Université de Strasbourg, CNRS UMR7104, INSERM U1258, Institut de Génétique, Biologie Moléculaire Et Cellulaire (IGBMC), 1 Rue Laurent Fries, 67404, Illkirch Graffenstaden, France. herault@igbmc.fr.
⁵ Université de Strasbourg, CNRS, INSERM, CELPHEDIA, PHENOMIN-Institut Clinique de La Souris (ICS), 1 Rue Laurent Fries, 67404, Illkirch Graffenstaden, France. herault@igbmc.fr.

PMID: 36703114
PMCID: PMC9878791
DOI: 10.1186/s12859-022-05111-0

Maria Del Mar Muñiz Moreno et al. BMC Bioinformatics. 2023 Jan 26;24(1):28. doi: 10.1186/s12859-022-05111-0.

Abstract

Background: In individuals or animals suffering from genetic or acquired diseases, it is important to identify which clinical or phenotypic variables can be used to discriminate between disease and non-disease states, responses to treatment, or sexual dimorphism. However, the data often suffer from a low number of samples, a high number of variables, or unbalanced experimental designs. Moreover, several parameters can be recorded in the same test, so correlations between them should be assessed and a more complex statistical framework is needed for the analysis. Packages that provide the individual analysis tools already exist, but they are not available together, which makes choosing and implementing a method difficult for non-statisticians.

Results: We present Gdaphen, a fast joint pipeline that identifies the most important qualitative and quantitative predictor variables for discriminating between genotypes, treatments, or sexes. Gdaphen takes behavioral/clinical data as input, can anonymize genotype-based recordings, and uses Multiple Factor Analysis (MFA) to deal with groups of variables recorded from the same individuals. As optimized input, Gdaphen uses the variables that are not correlated with one another and that show 30% or higher correlation with the MFA-Principal Component Analysis (PCA) dimensions, which increases the discriminative power and the efficiency of the classifiers' predictive models. Gdaphen can determine the strongest variables that predict gene dosage effects using Generalized Linear Model (GLM)-based classifiers, or determine the most discriminative non-linearly distributed variables using a Random Forest (RF) implementation. Moreover, Gdaphen reports the efficacy of each classifier and offers several visualization options that present the results as easily readable plots ready to be included in publications. We demonstrate Gdaphen's capabilities on several datasets and provide easy-to-follow vignettes.
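The 30% correlation filter described above can be illustrated with a minimal sketch. It is written in Python with NumPy rather than R, it substitutes an ordinary PCA on standardized quantitative variables for the MFA that Gdaphen runs, and it does not use the Gdaphen API. The function names `pc_correlations` and `select_variables`, the choice of three components, and the reading of "30% correlation" as an absolute correlation of at least 0.30 with any retained component are assumptions made for illustration only.

```python
import numpy as np


def pc_correlations(X, n_components=3):
    """Correlation of each column of X with the leading principal components.

    Assumption: plain PCA on standardized data stands in for the MFA
    used by Gdaphen; only quantitative variables are handled.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each variable (zero mean, unit sample variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data gives the principal axes (rows of vt).
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    n = Z.shape[0]
    # For standardized data, corr(variable j, component k) = v_jk * s_k / sqrt(n - 1).
    return (vt.T * s)[:, :n_components] / np.sqrt(n - 1)


def select_variables(X, names, threshold=0.30, n_components=3):
    """Keep variables whose absolute correlation with any retained
    component reaches the threshold (hypothetical selection rule)."""
    corr = pc_correlations(X, n_components)
    keep = np.abs(corr).max(axis=1) >= threshold
    return [name for name, flag in zip(names, keep) if flag]
```

Calling `select_variables(X, names)` on a table with individuals in rows and measured parameters in columns returns the names of the retained variables. A fuller sketch would also drop variables that are highly correlated with one another before passing the reduced table to the classifiers.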

Conclusions: Gdaphen makes the analysis of phenotypic data much easier for medical or preclinical behavioral researchers, providing an integrated framework to perform: (1) pre-processing steps such as data imputation or anonymization; (2) a full statistical assessment to identify which variables are the most important discriminators; and (3) state-of-the-art, publication-ready visualizations to support the conclusions of the analyses. Gdaphen is open-source and freely available at https://github.com/munizmom/gdaphen, together with vignettes, function documentation and examples to guide users in their own implementations.

Keywords: Bootstrapping; Clinical data; Discrimination; Generalized linear models; Imputation; Machine learning; Model; Phenotypic data; Prediction; R package; Random forest.

Conflict of interest statement

The authors declare that they have no competing interests.

Figures

**Fig. 1**
Gdaphen workflow. Scheme highlighting the functionalities that Gdaphen helps to implement. Starting from data organized with individuals in rows and measured parameters in columns, Gdaphen comprises three modules. The first module handles data pre-processing, shaping the data into the input format needed for the different analyses. Next, the Analysis module performs the MFA, feature selection and classification strategies. Last, the Visualization module contains the functions that generate publication-ready plots. As shown in the figure, a dot plot displays the classifier results, that is, the importance of each variable to the discrimination of the variable of interest. The MFA results are then shown using different visualizations.

**Fig. 2**
Example *Scn10a*^G1662S Gdaphen analysis. A Variance explained by the top ten PCA components and cumulative variance using the full model containing the 14 variables. B Variance explained by the top ten PCA components and cumulative variance using the sel30% model containing 9 variables. C 3D-PCA plots showing the individuals clustering on the first 3 dimensions, colored based on each genotype and sex combination. The left panel shows all the genotypes and sexes, the middle plot shows only wild-type and heterozygous data, and the right plot shows only wild-type and homozygous data. D Classifier results for the 9 variables included in the sel30% model, showing the most important variables for genotype discrimination after scaling to the top discriminative one. E Classifier results for all the 14 variables included in the full model, showing the most important variables for genotype discrimination after scaling to the top discriminative one. F 2D-PCA plots showing the individuals clustering on the first 3 dimensions, colored based on each genotype and sex combination. G Categorical variables discrimination component map. The panels show the distribution in 2 dimensions of the categorical variables' PCA coordinates calculated in the MFA analysis using the MFAmix function from the PCAmixdata R package [21]. H Squared loadings plot with coordinates calculated in the MFA analysis using the MFAmix function from the PCAmixdata R package [21]. I Cosine similarity distance coordinates drawn in each principal component for the selected 30% variables of the analyses, calculated by the MFAmix function. The arrow length measures the contribution of each variable to the discrimination on each dimension. Arrows that follow similar trajectories (stronger cosine similarity distance) contribute to the discrimination of the data in the same dimensions.

**Fig. 3**
Example *Scn9a*^R185H Gdaphen analysis. A Variance explained by the top ten PCA components and cumulative variance using the full model containing the 14 variables. B Variance explained by the top ten PCA components and cumulative variance using the sel30% model containing 9 variables. C 3D-PCA plots showing the individuals clustering on the first 3 dimensions, colored based on each genotype and sex combination. The left panel shows all the genotypes and sexes, the middle plot shows only wild-type and heterozygous data, and the right plot shows only wild-type and homozygous data. D Classifier results for the 6 variables included in the sel30% model, showing the most important variables for genotype discrimination after scaling to the top discriminative one. E Classifier results for all the 18 variables included in the full model, showing the most important variables for genotype discrimination after scaling to the top discriminative one. F 2D-PCA plots showing the individuals clustering on the first 3 dimensions. In the upper panel, coloring is based on genotype and sex combinations. In the lower panel, coloring is by genotype only. G Categorical variables discrimination component map. The panels show the distribution in 2 dimensions of the categorical variables' PCA coordinates calculated in the MFA analysis using the MFAmix function from the PCAmixdata R package [21]. H Squared loadings plot with coordinates calculated in the MFA analysis using the MFAmix function from the PCAmixdata R package [21]. I Parallel plots showing the non-scaled results for the variables that most influence the discrimination, colored by genotype and sex and showing the mean of each variable per genotype and sex group. The left panel uses the data from all genotypes and sexes, the middle panel shows only heterozygous and wild-type data, and the right panel shows only homozygous and wild-type data.
