J Cachexia Sarcopenia Muscle. 2025 Feb;16(1):e13599.
doi: 10.1002/jcsm.13599. Epub 2024 Dec 5.

Feature Engineering for the Prediction of Scoliosis in 5q-Spinal Muscular Atrophy

Tu-Lan Vu-Han et al. J Cachexia Sarcopenia Muscle. 2025 Feb.

Abstract

Background: 5q-Spinal muscular atrophy (SMA) is now among the 5% of rare diseases worldwide that are treatable. As disease-modifying therapies alter disease progression and patient phenotypes, paediatricians and consulting disciplines face new unknowns in their treatment decisions. Conclusions drawn from historical patient data sets are now of limited use, and new approaches are needed to maintain best standard-of-care practices for this exceptional patient group. Here, we present a data-driven machine learning approach to a rare disease data set to predict SMA-associated scoliosis.

Methods: We collected data from 84 genetically confirmed 5q-SMA patients who have received novel SMA therapies. We performed expert domain knowledge-directed feature engineering, correlation and predictive power score (PPS) analyses for feature selection. To test the predictive performance of the selected features, we trained a Random Forest Classifier and evaluated model performance using standard metrics.

Results: The SMA data set consisted of 1304 visits and over 360 variables. We performed feature engineering for variables related to 'interventions', 'devices', 'orthosis', 'ventilation', 'muscle contractures' and 'motor milestones'. Through correlation and PPS analysis paired with expert domain knowledge feature selection, we identified relevant features for scoliosis prediction in SMA that included disease progression markers: Hammersmith Functional Motor Scale Expanded 'HFMSE' (PPS = 0.27) and 6-Minute Walk Test '6MWT' scores (PPS = 0.44), 'age' (PPS = 0.41) and 'weight' (PPS = 0.49), 'contractures' (PPS = 0.17), the use of 'assistive devices' (PPS = 0.39), 'ventilation' (PPS = 0.16) and the presence of 'gastric tubes' (PPS = 0.35) in SMA patients. These features were validated using expert domain knowledge and used to train a Random Forest Classifier with an observed accuracy of 0.82 and an average receiver operating characteristic (ROC) area of 0.87.

Conclusion: The introduction of disease-modifying SMA therapies, followed by the implementation of SMA in newborn screenings, has presented physicians with patients of a kind never seen before. We used feature engineering tools to overcome one of the main challenges when using data-driven approaches in rare disease data sets. Through predictive modelling of this data, we defined disease progression markers, which are easily assessed during patient visits and can help anticipate scoliosis onset. This highlights the importance of progressive features in the drug-induced revolution of this rare disease and further supports the ongoing efforts to update the SMA classification. We advocate for the consistent documentation of relevant progression markers, which will serve as a basis for data-driven models that physicians can use to update their best standard-of-care practices.

Keywords: feature engineering; gene therapy; machine learning; predictive power score; rare disease; spinal muscular atrophy.

Conflict of interest statement

Claudia Weiß is on the honorary advisory board of Novartis, Roche and Biogen and has given an honorary presentation at conferences for Novartis. The other authors declare no conflicts of interest.

Figures

FIGURE 1
Engineered features from variables of the SMAScoliosis data set. Variables were grouped by association, converted to numeric and then binary arrays or aggregated (grouped). The binary array was used to calculate either a sum or a score. Detailed mathematical equations used for feature engineering are described in the Supporting Information.
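The grouping step described in this caption can be illustrated with a minimal sketch. It assumes that each visit is a dictionary of raw entries, that entries are first mapped to 0/1, and that a group is then reduced to either a plain sum or a weighted score. The column names and weights below are hypothetical; the authors' actual equations are in the Supporting Information and are not reproduced here.

```python
from typing import List, Mapping, Sequence


def to_binary(value: object) -> int:
    """Map a raw record entry to 1 (present) or 0 (absent or missing)."""
    if value is None:
        return 0
    if isinstance(value, str):
        return 1 if value.strip().lower() in {"yes", "y", "true", "1"} else 0
    return 1 if value else 0


def binary_array(visit: Mapping[str, object], columns: Sequence[str]) -> List[int]:
    """Convert one group of associated variables into a binary array."""
    return [to_binary(visit.get(col)) for col in columns]


def sum_feature(visit: Mapping[str, object], columns: Sequence[str]) -> int:
    """Engineered feature: number of positive entries in the group."""
    return sum(binary_array(visit, columns))


def weighted_score(visit: Mapping[str, object], weights: Mapping[str, float]) -> float:
    """Engineered feature: weighted sum over the binary array."""
    columns = list(weights)
    bits = binary_array(visit, columns)
    return sum(bit * weights[col] for bit, col in zip(bits, columns))


# Hypothetical column names and weights, chosen only for illustration.
CONTRACTURE_COLUMNS = ["contracture_hip", "contracture_knee", "contracture_ankle"]
VENTILATION_WEIGHTS = {"niv_night": 1.0, "niv_day": 2.0, "tracheostomy": 3.0}

visit = {
    "contracture_hip": "yes",
    "contracture_knee": "no",
    "contracture_ankle": 1,
    "niv_night": True,
    "tracheostomy": None,  # missing entries count as absent in this sketch
}

contractures_sum = sum_feature(visit, CONTRACTURE_COLUMNS)
ventilation_score = weighted_score(visit, VENTILATION_WEIGHTS)
```

Treating missing entries as 0 is a simplification made for this sketch; a real pipeline might keep them as unknown instead.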
FIGURE 2
Flow chart of the data collection for scoliosis labels, using the most reliable scoliosis detection method available. All patients with genetically confirmed 5q‐SMA were included in the SMAScoliosis data set. Labels were collected from anteroposterior spine radiographs, and the scoliosis label was ‘1’ (positive) if a Cobb angle > 10° was measured and ‘0’ (negative) if the Cobb angle was < 10°. Forty‐one patients had available spinal radiographs in the PACS, and the scoliosis was labelled according to the measured Cobb angle. If no PACS spinal radiograph was available, but external orthopaedic treatment was documented in the patient EHR (i.e., external radiological or orthopaedic report), the scoliosis label was derived from the documentation (7 patients). If no spinal radiograph was available, we used chest radiographs (6 patients) if the observation aligned with the clinical documentation (e.g., ‘clinical exam: orthograde spine’). The scoliosis label was derived only if both corresponded. Next, we derived the scoliosis label from clinical examinations in the patient's EHR. If multiple entries over time documented higher‐grade clinical scoliosis or lack thereof, the label was set accordingly. All other cases were labelled ‘unknown’ (NA).
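The label hierarchy in this caption can be read as a fall-through rule, sketched below under several assumptions: all function and parameter names are hypothetical, a Cobb angle of exactly 10° (which the caption leaves undefined) is treated as negative, a disagreeing chest radiograph falls through to the clinical-examination step, and "multiple entries" is taken to mean at least two entries that all agree.

```python
from typing import Optional, Sequence

COBB_THRESHOLD_DEG = 10.0


def label_from_cobb(cobb_angle_deg: float) -> int:
    """1 if the Cobb angle exceeds 10 degrees, else 0 (exactly 10 -> 0 is an assumption)."""
    return 1 if cobb_angle_deg > COBB_THRESHOLD_DEG else 0


def derive_scoliosis_label(
    pacs_cobb_deg: Optional[float] = None,
    external_report_label: Optional[int] = None,
    chest_xray_label: Optional[int] = None,
    clinical_exam_labels: Sequence[int] = (),
) -> Optional[int]:
    """Return 1, 0 or None (unknown), using the most reliable source available."""
    # 1. Spine radiograph in the PACS: label from the measured Cobb angle.
    if pacs_cobb_deg is not None:
        return label_from_cobb(pacs_cobb_deg)
    # 2. External radiological or orthopaedic report documented in the EHR.
    if external_report_label is not None:
        return external_report_label
    # 3. Chest radiograph, accepted only if it matches the clinical documentation.
    if chest_xray_label is not None and clinical_exam_labels:
        if all(entry == chest_xray_label for entry in clinical_exam_labels):
            return chest_xray_label
    # 4. Clinical examinations: multiple entries over time that all agree.
    if len(clinical_exam_labels) >= 2 and len(set(clinical_exam_labels)) == 1:
        return clinical_exam_labels[0]
    # 5. Everything else stays unknown (NA).
    return None
```

For example, `derive_scoliosis_label(pacs_cobb_deg=25.0)` returns 1, while a single isolated clinical entry with no imaging returns `None`.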
FIGURE 3
Predictive power score of engineered features versus their constituents. Bar plots grouped by the engineered features ‘Ventilation’, ‘Contractures’, ‘Assistive Devices’ and ‘Orthosis’. The y‐axis labels give the constituent features, and the x‐axis the predictive power score of the feature to ‘scoliosis_yn’.
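For readers unfamiliar with the predictive power score, one common formulation for a categorical target normalises a model's weighted F1 score against a naive baseline, so that 0 means "no better than the baseline" and 1 means "perfect prediction". The sketch below shows only that normalisation. It assumes a majority-class baseline and takes the model predictions as given; whether the authors' PPS implementation matches this formulation is not stated in the abstract, and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Sequence


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 per class, averaged with weights proportional to class support."""
    n = len(y_true)
    total = 0.0
    for cls, support in Counter(y_true).items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = (2 * tp / denom) if denom else 0.0
        total += (support / n) * f1
    return total


def normalized_pps(model_f1: float, baseline_f1: float) -> float:
    """Rescale so that baseline performance maps to 0 and a perfect score to 1."""
    if baseline_f1 >= 1.0 or model_f1 <= baseline_f1:
        return 0.0
    return (model_f1 - baseline_f1) / (1.0 - baseline_f1)


# Toy labels (not patient data): 5 positives, 3 negatives.
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_model = [1, 1, 1, 1, 0, 0, 0, 1]  # predictions from some single-feature model
majority = Counter(y_true).most_common(1)[0][0]
y_baseline = [majority] * len(y_true)  # always predict the most frequent class

pps = normalized_pps(weighted_f1(y_true, y_model), weighted_f1(y_true, y_baseline))
```

Because the score is computed per feature and per target, it is asymmetric and can pick up non-linear relationships that a correlation coefficient would miss.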
FIGURE 4
Correlation matrix of features from the SMAScoliosis data set: features are listed on the x‐axis and y‐axis. The correlation coefficient r is annotated accordingly. We observe some positive correlation of ‘baseline gastric tube’ (r = 0.78), ‘ventilation score’ (r = 0.28), ‘6MWT score’ (r = 0.39), ‘HFMSE score’ (r = 0.44), ‘RULM score’ (r = 0.32), ‘CHOP motor score’ (r = 0.32), ‘head circumference’ (r = 0.31), ‘weight’ (r = 0.44), ‘age at assessment’ (r = 0.63), ‘height’ (r = 0.64), ‘first symptoms sum’ (r = 0.22) and ‘contractures score’ (r = 0.30) with our target label ‘scoliosis’.
FIGURE 5
Predictive power score matrix of features from the SMAScoliosis data set: the x‐axis lists the features that predict the targets on the y‐axis. Predictive power scores are annotated accordingly. Predictors of our target label ‘scoliosis_yn’ (y‐axis) include ‘age_assess’ (0.41), ‘baseline_gastric_tube’ (0.35), ‘BMI’ (0.23), ‘contractures_score’ (0.17), ‘devices_score’ (0.39), ‘head_circumference’ (0.41), ‘height’ (0.51), ‘hfmse_motor_score’ (0.27), ‘6mwt_score’ (0.44) and ‘weight’ (0.44).
FIGURE 6
Feature ranking by calculated predictive power score (PPS).
FIGURE 7
Receiver operating characteristics (ROC)‐curves of a random forest classifier trained with SMAScoliosis features.
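As a reminder of what the area under a ROC curve measures, the sketch below computes it for a binary label as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties count as one half). The scores are toy values, not model output from the study, and the function names are hypothetical. The paper reports an average ROC area, whereas this sketch handles a single binary curve.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(y_true: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of cases classified correctly at a fixed score threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(1 for y, p in zip(y_true, preds) if y == p) / len(y_true)


# Toy example: three positive and three negative cases with classifier scores.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]

auc = roc_auc(y_true, scores)
acc = accuracy(y_true, scores)
```

The pairwise form is quadratic in the number of cases, which is acceptable for a small rare-disease cohort; larger data sets would normally use a sort-based implementation.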
