Optimizing Rare Disease Gait Classification through Data Balancing and Generative AI: Insights from Hereditary Cerebellar Ataxia

Dante Trabassi et al. Sensors (Basel). 2024 Jun 3;24(11):3613. doi: 10.3390/s24113613.

Abstract

The interpretability of gait analysis studies in people with rare diseases, such as those with primary hereditary cerebellar ataxia (pwCA), is frequently limited by small sample sizes and unbalanced datasets. The purpose of this study was to assess the effectiveness of data balancing and generative artificial intelligence (AI) algorithms in generating synthetic data that reflect the actual gait abnormalities of pwCA. Gait data of 30 pwCA (age: 51.6 ± 12.2 years; 13 females, 17 males) and 100 healthy subjects (age: 57.1 ± 10.4 years; 60 females, 40 males) were collected at the lumbar level with an inertial measurement unit. Subsampling, oversampling, synthetic minority oversampling, generative adversarial networks, and conditional tabular generative adversarial networks (ctGAN) were applied to generate balanced datasets, which were then used as input to a random forest classifier. Consistency and explainability metrics were also calculated to assess the coherence of the generated datasets with the known gait abnormalities of pwCA. ctGAN significantly improved classification performance compared with the original dataset and traditional data augmentation methods. ctGANs are effective methods for balancing tabular datasets from populations with rare diseases, owing to their ability to improve diagnostic models with consistent explainability.

Keywords: cerebellar ataxia; conditional tabular generative adversarial network; data augmentation; data balancing; gait analysis; generative artificial intelligence; generative adversarial network; inertial measurement unit; rare diseases; synthetic minority oversampling technique.

Conflict of interest statement

The authors declare no conflicts of interest.

Figures

Figure 1
Machine learning and data augmentation strategy. A flowchart illustrating the methodological strategy used to enhance machine learning classification for rare disease detection. It starts with data collection and progresses through preprocessing and noise-based feature selection to ensure data quality. To overcome dataset imbalance, undersampling, bootstrapping, SMOTE, GAN, and ctGAN were used. The LazyPredict package was used for an initial assessment of candidate models. The random forest classifier was then chosen for the classification task, with hyperparameter tuning performed via Bayesian optimization, and its performance was measured using standard classification metrics. Finally, SHAP analysis increased the model’s explainability, ensuring transparency about how features influence the predicted outcomes.
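The sketch below is a minimal, hedged illustration of the screening and tuning steps named in the flowchart, not the authors’ code: LazyPredict for a quick model comparison, then Bayesian tuning of the random forest. The use of scikit-optimize’s BayesSearchCV, the file and column names, the train/test split, the search ranges, and the scoring metric are all assumptions, since the caption names only LazyPredict, the random forest classifier, and Bayesian optimization.

```python
# Minimal sketch of the screening and tuning steps (illustrative settings).
import pandas as pd
from lazypredict.Supervised import LazyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from skopt import BayesSearchCV

gait_df = pd.read_csv("gait_features.csv")            # hypothetical table of IMU gait features
X, y = gait_df.drop(columns=["group"]), gait_df["group"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

# 1) Quick screening of many candidate classifiers
screen = LazyClassifier(verbose=0, ignore_warnings=True)
leaderboard, _ = screen.fit(X_train, X_test, y_train, y_test)

# 2) Bayesian hyperparameter tuning of the chosen random forest
search = BayesSearchCV(
    RandomForestClassifier(random_state=42),
    search_spaces={"n_estimators": (100, 1000),
                   "max_depth": (2, 20),
                   "min_samples_leaf": (1, 10)},
    n_iter=30, cv=5, scoring="balanced_accuracy", random_state=42)
search.fit(X_train, y_train)
best_rf = search.best_estimator_
```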
Figure 2
SMOTE application in class balancing. Two distinct classes are represented: cerebellar ataxia subjects as the minority class and healthy subjects as the majority class. The majority class comprises a greater number of subjects distributed across the feature space, whereas the minority class is represented by fewer individuals; SMOTE therefore focuses on the minority class, which is underrepresented in the dataset. A random sample, designated xi, is drawn from the minority class and its k-nearest neighbors are evaluated: the diagram depicts the four nearest neighbors of xi within the minority class, linked by dashed lines in the feature space. One of these k-nearest neighbors is then chosen at random, and interpolating between the original sample xi and the selected neighbor yields a new synthetic minority class sample.
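For concreteness, here is a minimal sketch of the interpolation step just described; it is not the authors’ implementation, the placeholder feature matrix is invented, and in practice the imbalanced-learn SMOTE implementation would normally be used.

```python
# Minimal sketch of one SMOTE interpolation step (illustrative).
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(30, 14))        # placeholder minority-class (pwCA) feature matrix

def smote_sample(X_min, k=4):
    """Generate one synthetic minority sample by k-NN interpolation."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)   # +1 because xi is its own neighbor
    i = rng.integers(len(X_min))                          # random minority sample xi
    _, idx = nn.kneighbors(X_min[i:i + 1])
    neighbor = X_min[rng.choice(idx[0][1:])]              # pick one of the k nearest neighbors
    lam = rng.random()                                    # interpolation factor in [0, 1)
    return X_min[i] + lam * (neighbor - X_min[i])         # point on the segment xi -> neighbor

synthetic = smote_sample(X_min)

# In practice the imbalanced-learn implementation would be used directly:
# from imblearn.over_sampling import SMOTE
# X_bal, y_bal = SMOTE(k_neighbors=4).fit_resample(X, y)
```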
Figure 3
GAN architecture. Starting with the generator, the first layer after the random noise vector is a fully connected (dense) layer with 128 neurons. It is followed by a LeakyReLU activation, which lets some negative values ‘leak’ through and helps avoid dead neurons during training, with an alpha of 0.01 defining the slope of the negative part of the activation. The subsequent batch normalization layer normalizes each batch’s input to keep the mean close to zero and the standard deviation close to one; this stabilizes training and is widely used in GANs. The final layer of the generator contains as many neurons as there are features to be generated and employs a tanh activation to produce the generator’s output, i.e., the synthetic features. In contrast, the discriminator’s first layer contains 64 neurons, and its last layer is a single neuron with a sigmoid activation, because the discriminator must decide whether the data are real (value close to 1) or synthetic (value close to 0). The discriminator is compiled with binary cross-entropy as the loss function and an Adam optimizer with a specified learning rate and a beta parameter that controls the exponential moving-average decay of the gradients.
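A Keras sketch of the layer layout just described is given below, assuming tf.keras; the layer sizes follow the caption, while the noise dimension, number of features, learning rate, beta_1 value, and the discriminator’s hidden activation are not specified in the caption and are illustrative.

```python
# Keras sketch of the generator/discriminator layout described in the caption.
from tensorflow.keras import layers, models, optimizers

def build_generator(noise_dim, n_features):
    return models.Sequential([
        layers.Input(shape=(noise_dim,)),
        layers.Dense(128),                            # fully connected, 128 neurons
        layers.LeakyReLU(alpha=0.01),                 # negative slope 0.01
        layers.BatchNormalization(),                  # keeps batch mean ~0, std ~1
        layers.Dense(n_features, activation="tanh"),  # output: one synthetic feature vector
    ])

def build_discriminator(n_features):
    disc = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64),                             # 64 neurons in the first layer
        layers.LeakyReLU(alpha=0.01),                 # hidden activation (assumed)
        layers.Dense(1, activation="sigmoid"),        # ~1 = real, ~0 = synthetic
    ])
    disc.compile(loss="binary_crossentropy",          # binary cross-entropy loss
                 optimizer=optimizers.Adam(learning_rate=2e-4, beta_1=0.5))
    return disc

generator = build_generator(noise_dim=32, n_features=14)     # sizes are placeholders
discriminator = build_discriminator(n_features=14)
```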
Figure 4
Working process of a ctGAN. The process begins with feeding random noise into the generator; this noise serves as a seed for creating new data samples. From this noise, the generator attempts to produce synthetic data that closely resemble the distribution of the original training data, gradually learning to generate more realistic samples; once the ctGAN is fully trained, its synthetic data should be indistinguishable from real data. The discriminator is responsible for distinguishing between real training data and the synthetic data produced by the generator, and it provides feedback to the generator on the quality of the synthetic data. In addition to the synthetic data, the discriminator receives real training data; exploratory data analysis and feature engineering are used to ensure that the training data are in the proper format and contain the features needed to train the discriminator effectively. A successfully generated synthetic sample is one that the discriminator cannot distinguish from real data. The ctGAN is trained through an adversarial process in which the generator and the discriminator iteratively improve.
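As a hedged example of how such a model could be applied to tabular gait data, the sketch below uses the open-source ctgan package and simply fits it on the minority class; the paper does not state which implementation or settings were used, and the file name, column names, and epoch count are hypothetical.

```python
# One possible way to balance the minority class with a conditional tabular GAN
# (illustrative settings, not the authors' pipeline).
import pandas as pd
from ctgan import CTGAN

gait_df = pd.read_csv("gait_features.csv")          # hypothetical table of IMU gait features
minority = gait_df[gait_df["group"] == "pwCA"]      # real minority-class rows only

model = CTGAN(epochs=300)
model.fit(minority, discrete_columns=["group"])     # learn the pwCA feature distribution

n_needed = (gait_df["group"] == "HS").sum() - len(minority)
synthetic_pwCA = model.sample(n_needed)             # synthetic rows to even out the classes
balanced_df = pd.concat([gait_df, synthetic_pwCA], ignore_index=True)
```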
Figure 5
Correlation heatmap. The heatmap, obtained using the Seaborn library’s relplot function, displays the correlations between the initial features. Each cell in the matrix represents the partial correlation between two variables, identified by the variable names on the x- and y-axes. The color of the cell indicates the direction and strength of the correlation (red for positive, blue for negative), and the size of the circle within the cell represents the magnitude of the correlation coefficient. A threshold of 0.5 was chosen to determine which features should be included in the dataset. HR, harmonic ratio; sLLE, short-term largest Lyapunov exponent; RQArec, %recurrence in recurrence quantification analysis; RQAdet, %determinism in recurrence quantification analysis; CVsteplength, coefficient of variation of step length; AP, ML, V, anterior–posterior, mediolateral, and vertical directions of the acceleration signal, respectively.
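A plot of this kind can be reproduced by passing a long-form correlation table to relplot, with hue encoding the signed correlation and circle size its magnitude. The sketch below is illustrative only (the feature file, plain pairwise correlations, palette, and threshold handling are assumptions), not the authors’ plotting code.

```python
# Illustrative relplot-style correlation map.
import pandas as pd
import seaborn as sns

gait_df = pd.read_csv("gait_features.csv")                  # hypothetical feature table
corr = gait_df.drop(columns=["group"]).corr()               # pairwise correlation matrix

corr_long = (corr.reset_index()
                 .rename(columns={"index": "feature_x"})
                 .melt(id_vars="feature_x", var_name="feature_y", value_name="r"))
corr_long["abs_r"] = corr_long["r"].abs()

sns.relplot(data=corr_long, x="feature_x", y="feature_y",
            hue="r", size="abs_r",
            palette="coolwarm", hue_norm=(-1, 1),
            sizes=(10, 250), height=6)

# Pairs exceeding the 0.5 threshold mentioned in the caption can then be listed:
redundant_pairs = corr_long.query("feature_x != feature_y and abs_r > 0.5")
```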
Figure 6
Feature importance plot. Each bar represents a feature used in the RF model, and the length of the bar indicates how important that feature is for making predictions. Importance is typically calculated from how much each feature reduces the impurity of the splits. The Noise feature, introduced to help decide which features to keep, acts as a baseline: real features that are no more important than the noise feature are unlikely to contribute meaningfully to the model’s predictions and can be removed in the next iteration.
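A minimal sketch of this noise-baseline screening, assuming a Gaussian noise column and default random forest impurity importances (file and column names are hypothetical, not the authors’ code), might look as follows.

```python
# Minimal sketch of noise-baseline feature screening (illustrative settings).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

gait_df = pd.read_csv("gait_features.csv")           # hypothetical feature table
X = gait_df.drop(columns=["group"]).copy()
y = gait_df["group"]

rng = np.random.default_rng(42)
X["Noise"] = rng.normal(size=len(X))                 # purely random baseline feature

rf = RandomForestClassifier(n_estimators=500, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)

noise_level = importances["Noise"]
kept = importances[importances > noise_level].index.tolist()   # features beating the noise baseline
```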
Figure 7
SHAP value plots. The x-axis shows the SHAP value associated with each feature; a SHAP value indicates the impact of a feature on the model output. Positive values push the prediction towards a more positive outcome, whereas negative values push it towards a more negative outcome. The color denotes the feature value, with red indicating high values and blue indicating low values. For example, if high (red) values of a feature are associated with positive SHAP values, the prediction tends to increase as that feature’s value increases. The two plots (a,b) show how the importance and effects of the features differ between pwCA and HS: some features have a stronger positive or negative impact in one than in the other.
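Reusing the fitted random forest and feature table from the previous sketch, a SHAP beeswarm plot analogous to panels (a,b) can be produced as shown below; the class selection depends on the label encoding and the installed SHAP version, so it is handled defensively, and this is a sketch rather than the authors’ analysis code.

```python
# SHAP analysis of the fitted random forest (rf and X from the sketch above).
import shap

explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X)        # per-class attributions for a classifier

# Older SHAP versions return a list with one array per class; newer versions return
# a single 3-D array, so the class of interest is selected defensively here.
class_idx = list(rf.classes_).index("pwCA")   # position of the pwCA class in the model
vals_pwCA = (shap_values[class_idx] if isinstance(shap_values, list)
             else shap_values[..., class_idx])

shap.summary_plot(vals_pwCA, X)               # beeswarm plot analogous to panel (a)
```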
