Optimizing Rare Disease Gait Classification through Data Balancing and Generative AI: Insights from Hereditary Cerebellar Ataxia

Dante Trabassi et al. Sensors (Basel). 2024 Jun 3;24(11):3613. doi: 10.3390/s24113613.

Abstract

The interpretability of gait analysis studies in people with rare diseases, such as those with primary hereditary cerebellar ataxia (pwCA), is frequently limited by small sample sizes and unbalanced datasets. The purpose of this study was to assess the effectiveness of data balancing and generative artificial intelligence (AI) algorithms in generating synthetic data that reflect the actual gait abnormalities of pwCA. Gait data of 30 pwCA (age: 51.6 ± 12.2 years; 13 females, 17 males) and 100 healthy subjects (age: 57.1 ± 10.4 years; 60 females, 40 males) were collected at the lumbar level with an inertial measurement unit. Subsampling, oversampling, synthetic minority oversampling, generative adversarial networks, and conditional tabular generative adversarial networks (ctGAN) were applied to generate balanced datasets, which were then used as input to a random forest classifier. Consistency and explainability metrics were also calculated to assess the coherence of the generated datasets with the known gait abnormalities of pwCA. ctGAN significantly improved classification performance compared with the original dataset and traditional data augmentation methods. ctGANs are effective methods for balancing tabular datasets from populations with rare diseases, owing to their ability to improve diagnostic models with consistent explainability.

Keywords: cerebellar ataxia; conditional tabular generative adversarial network; data augmentation; data balancing; gait analysis; generative artificial intelligence; generative adversarial network; inertial measurement unit; rare diseases; synthetic minority oversampling technique.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Machine learning and data augmentation strategy. A flowchart illustrating the methodological strategy used to enhance machine learning classification for rare disease detection. It starts with data collection and progresses through preprocessing and noise-based feature selection to ensure data quality. To overcome dataset imbalance, undersampling, bootstrapping, SMOTE, GAN, and ctGAN were used. The LazyPredict package was used for an initial assessment of candidate models. The random forest classifier was then chosen for the classification task, with hyperparameter tuning performed via Bayesian optimization, and its performance was measured using standard classification metrics. Finally, SHAP analysis increased the model’s explainability, ensuring transparency about how features influence the predicted outcomes.
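The sketch below is a minimal, hedged illustration of the screening and tuning steps named in the flowchart, not the authors’ code: LazyPredict for a quick model comparison, then Bayesian tuning of the random forest. The use of scikit-optimize’s BayesSearchCV, the file and column names, the train/test split, the search ranges, and the scoring metric are all assumptions, since the caption names only LazyPredict, the random forest classifier, and Bayesian optimization.

```python
# Minimal sketch of the screening and tuning steps (illustrative settings).
import pandas as pd
from lazypredict.Supervised import LazyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from skopt import BayesSearchCV

gait_df = pd.read_csv("gait_features.csv")            # hypothetical table of IMU gait features
X, y = gait_df.drop(columns=["group"]), gait_df["group"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

# 1) Quick screening of many candidate classifiers
screen = LazyClassifier(verbose=0, ignore_warnings=True)
leaderboard, _ = screen.fit(X_train, X_test, y_train, y_test)

# 2) Bayesian hyperparameter tuning of the chosen random forest
search = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_spaces={"n_estimators": (100, 1000),
                   "max_depth": (2, 20),
                   "min_samples_leaf": (1, 10)},
    n_iter=30, cv=5, scoring="balanced_accuracy", random_state=42)
search.fit(X_train, y_train)
best_rf = search.best_estimator_
```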
Figure 2
SMOTE application in class balancing. Two distinct classes are represented: cerebellar ataxia subjects as the minority class and healthy subjects as the majority class. The majority class comprises a greater number of subjects distributed across the feature space, whereas the minority class is represented by fewer individuals; SMOTE therefore focuses on the minority class, which is underrepresented in the dataset. A random sample, designated xi, is drawn from the minority class and its k-nearest neighbors are evaluated: the diagram depicts the four nearest neighbors of xi within the minority class, linked by dashed lines in the feature space. One of these k-nearest neighbors is then chosen at random, and interpolating between the original sample xi and the selected neighbor yields a new synthetic minority class sample.
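For concreteness, here is a minimal sketch of the interpolation step just described; it is not the authors’ implementation, the placeholder feature matrix is invented, and in practice the imbalanced-learn SMOTE implementation would normally be used.

```python
# Minimal sketch of one SMOTE interpolation step (illustrative).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(30, 14))        # placeholder minority-class (pwCA) feature matrix

def smote_sample(X_min, k=4):
    """Generate one synthetic minority sample by k-NN interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because xi is its own neighbor
    i = rng.integers(len(X_min))                          # random minority sample xi
    _, idx = nn.kneighbors(X_min[i:i + 1])
    neighbor = X_min[rng.choice(idx[0][1:])]              # pick one of the k nearest neighbors
    lam = rng.random()                                    # interpolation factor in [0, 1)
    return X_min[i] + lam * (neighbor - X_min[i])         # point on the segment xi -> neighbor

synthetic = smote_sample(X_min)

# In practice the imbalanced-learn implementation would be used directly:
# from imblearn.over_sampling import SMOTE
# X_bal, y_bal = SMOTE(k_neighbors=4).fit_resample(X, y)
```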
Figure 3
GAN architecture. Starting with the generator, the first layer after the random noise vector is a fully connected (dense) layer with 128 neurons. It is followed by a LeakyReLU activation, which lets some negative values ‘leak’ through and helps avoid dead neurons during training, with an alpha of 0.01 defining the slope of the negative part of the activation. The subsequent batch normalization layer normalizes each batch’s input to keep the mean close to zero and the standard deviation close to one; this stabilizes training and is widely used in GANs. The final layer of the generator contains as many neurons as there are features to be generated and employs a tanh activation to produce the generator’s output, i.e., the synthetic features. In contrast, the discriminator’s first layer contains 64 neurons, and its last layer is a single neuron with a sigmoid activation, because the discriminator must decide whether the data are real (value close to 1) or synthetic (value close to 0). The discriminator is compiled with binary cross-entropy as the loss function and an Adam optimizer with a specified learning rate and a beta parameter that controls the exponential moving-average decay of the gradients.
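A Keras sketch of the layer layout just described is given below, assuming tf.keras; the layer sizes follow the caption, while the noise dimension, number of features, learning rate, beta_1 value, and the discriminator’s hidden activation are not specified in the caption and are illustrative.

```python
# Keras sketch of the generator/discriminator layout described in the caption.
from tensorflow.keras import layers, models, optimizers

def build_generator(noise_dim, n_features):
    return models.Sequential([
        layers.Input(shape=(noise_dim,)),
        layers.Dense(128),                            # fully connected, 128 neurons
        layers.LeakyReLU(alpha=0.01),                 # negative slope 0.01
        layers.BatchNormalization(),                  # keeps batch mean ~0, std ~1
        layers.Dense(n_features, activation="tanh"),  # output: one synthetic feature vector
    ])

def build_discriminator(n_features):
    disc = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64),                             # 64 neurons in the first layer
        layers.LeakyReLU(alpha=0.01),                 # hidden activation (assumed)
        layers.Dense(1, activation="sigmoid"),        # ~1 = real, ~0 = synthetic
    ])
    disc.compile(loss="binary_crossentropy",          # binary cross-entropy loss
                 optimizer=optimizers.Adam(learning_rate=2e-4, beta_1=0.5))
    return disc

generator = build_generator(noise_dim=32, n_features=14)     # sizes are placeholders
discriminator = build_discriminator(n_features=14)
```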
Figure 4
Working process of a ctGAN. The process begins with feeding random noise into the generator; this noise serves as a seed for creating new data samples. From this noise, the generator attempts to produce synthetic data that closely resemble the distribution of the original training data, gradually learning to generate more realistic samples; once the ctGAN is fully trained, its synthetic data should be indistinguishable from real data. The discriminator is responsible for distinguishing between real training data and the synthetic data produced by the generator, and it provides feedback to the generator on the quality of the synthetic data. In addition to the synthetic data, the discriminator receives real training data; exploratory data analysis and feature engineering are used to ensure that the training data are in the proper format and contain the features needed to train the discriminator effectively. A successfully generated synthetic sample is one that the discriminator cannot distinguish from real data. The ctGAN is trained through an adversarial process in which the generator and the discriminator iteratively improve.
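As a hedged example of how such a model could be applied to tabular gait data, the sketch below uses the open-source ctgan package and simply fits it on the minority class; the paper does not state which implementation or settings were used, and the file name, column names, and epoch count are hypothetical.

```python
# One possible way to balance the minority class with a conditional tabular GAN
# (illustrative settings, not the authors' pipeline).
import pandas as pd
from ctgan import CTGAN

gait_df = pd.read_csv("gait_features.csv")          # hypothetical table of IMU gait features
minority = gait_df[gait_df["group"] == "pwCA"]      # real minority-class rows only

model = CTGAN(epochs=300)
model.fit(minority, discrete_columns=["group"])     # learn the pwCA feature distribution

n_needed = (gait_df["group"] == "HS").sum() - len(minority)
synthetic_pwCA = model.sample(n_needed)             # synthetic rows to even out the classes
balanced_df = pd.concat([gait_df, synthetic_pwCA], ignore_index=True)
```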
Figure 5
Correlation heatmap. The heatmap, obtained using the Seaborn library’s relplot function, displays the correlations between the initial features. Each cell in the matrix represents the partial correlation between two variables, identified by the variable names on the x- and y-axes. The color of the cell indicates the direction and strength of the correlation (red for positive, blue for negative), and the size of the circle within the cell represents the magnitude of the correlation coefficient. A threshold of 0.5 was chosen to determine which features should be included in the dataset. HR, harmonic ratio; sLLE, short-term largest Lyapunov exponent; RQArec, %recurrence in recurrence quantification analysis; RQAdet, %determinism in recurrence quantification analysis; CVsteplength, coefficient of variation of step length; AP, ML, V, anterior–posterior, mediolateral, and vertical directions of the acceleration signal, respectively.
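A plot of this kind can be reproduced by passing a long-form correlation table to relplot, with hue encoding the signed correlation and circle size its magnitude. The sketch below is illustrative only (the feature file, plain pairwise correlations, palette, and threshold handling are assumptions), not the authors’ plotting code.

```python
# Illustrative relplot-style correlation map.
import pandas as pd
import seaborn as sns

gait_df = pd.read_csv("gait_features.csv")                  # hypothetical feature table
corr = gait_df.drop(columns=["group"]).corr()               # pairwise correlation matrix

corr_long = (corr.reset_index()
                 .rename(columns={"index": "feature_x"})
                 .melt(id_vars="feature_x", var_name="feature_y", value_name="r"))
corr_long["abs_r"] = corr_long["r"].abs()

sns.relplot(data=corr_long, x="feature_x", y="feature_y",
            hue="r", size="abs_r",
            palette="coolwarm", hue_norm=(-1, 1),
            sizes=(10, 250), height=6)

# Pairs exceeding the 0.5 threshold mentioned in the caption can then be listed:
redundant_pairs = corr_long.query("feature_x != feature_y and abs_r > 0.5")
```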
Figure 6
Feature importance plot. Each bar represents a feature used in the RF model, and the length of the bar indicates how important that feature is for making predictions. Importance is typically calculated from how much each feature reduces the impurity of the splits. The Noise feature, introduced to help decide which features to keep, acts as a baseline: real features that are no more important than the noise feature are unlikely to contribute meaningfully to the model’s predictions and can be removed in the next iteration.
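A minimal sketch of this noise-baseline screening, assuming a Gaussian noise column and default random forest impurity importances (file and column names are hypothetical, not the authors’ code), might look as follows.

```python
# Minimal sketch of noise-baseline feature screening (illustrative settings).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

gait_df = pd.read_csv("gait_features.csv")           # hypothetical feature table
X = gait_df.drop(columns=["group"]).copy()
y = gait_df["group"]

rng = np.random.default_rng(42)
X["Noise"] = rng.normal(size=len(X))                 # purely random baseline feature

rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)

noise_level = importances["Noise"]
kept = importances[importances > noise_level].index.tolist()   # features beating the noise baseline
```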
Figure 7
SHAP value plots. The x-axis shows the SHAP value associated with each feature; a SHAP value indicates the impact of a feature on the model output. Positive values push the prediction towards a more positive outcome, whereas negative values push it towards a more negative outcome. The color denotes the feature value, with red indicating high values and blue indicating low values. For example, if high (red) values of a feature are associated with positive SHAP values, the prediction tends to increase as that feature’s value increases. The two plots (a,b) show how the importance and effects of the features differ between pwCA and HS: some features have a stronger positive or negative impact in one than in the other.
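Reusing the fitted random forest and feature table from the previous sketch, a SHAP beeswarm plot analogous to panels (a,b) can be produced as shown below; the class selection depends on the label encoding and the installed SHAP version, so it is handled defensively, and this is a sketch rather than the authors’ analysis code.

```python
# SHAP analysis of the fitted random forest (rf and X from the sketch above).
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)        # per-class attributions for a classifier

# Older SHAP versions return a list with one array per class; newer versions return
# a single 3-D array, so the class of interest is selected defensively here.
class_idx = list(rf.classes_).index("pwCA")   # position of the pwCA class in the model
vals_pwCA = (shap_values[class_idx] if isinstance(shap_values, list)
             else shap_values[..., class_idx])

shap.summary_plot(vals_pwCA, X)               # beeswarm plot analogous to panel (a)
```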
