Universal feature selection for simultaneous interpretability of multitask datasets
- PMID: 41547940
- PMCID: PMC12896148
- DOI: 10.1186/s13321-025-01096-z
Abstract
Extracting meaningful features from complex, high-dimensional datasets across scientific domains remains challenging. Current methods often struggle with scalability, limiting their applicability to large datasets, or make restrictive assumptions about feature-property relationships, hindering their ability to capture complex interactions. BoUTS, a general and scalable feature selection algorithm, overcomes these limitations by identifying both universal features relevant to all datasets and task-specific features predictive for specific subsets of them. Evaluated on seven diverse chemical regression datasets, BoUTS achieves state-of-the-art feature sparsity while generally maintaining prediction accuracy comparable to specialized methods. Notably, BoUTS's universal features enable domain-specific knowledge transfer between datasets, and we expect these results to be broadly useful for manually guided inverse problems. Beyond its current application, BoUTS holds potential for elucidating data-poor systems by leveraging information from similar data-rich systems.

Scientific Contribution: BoUTS selects nonlinear, universally informative features across multiple datasets. We identify crucial "universal features" across seven real-world chemistry datasets, which enhance cross-dataset interpretability and selection stability. BoUTS is highly scalable and applicable to tabular data from many domains, and our results identify connections between seemingly unrelated chemical domains.
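The distinction between universal and task-specific features can be illustrated with a toy sketch. This is not the BoUTS algorithm, whose mechanics the abstract does not describe. It assumes a much simpler stand-in: each feature is scored per task by its absolute Pearson correlation with the target (a linear measure, unlike the nonlinear selection BoUTS performs), features scoring above a threshold count as relevant, features relevant to every task are called universal, and the remainder are task-specific. The function names, the threshold of 0.5, and the two tiny datasets are all hypothetical.

```python
from statistics import mean


def abs_pearson(xs, ys):
    """Absolute Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def split_features(tasks, threshold=0.5):
    """Split feature indices into universal and task-specific sets.

    tasks maps a task name to (X, y), where X is a list of rows and y is a
    list of targets. The threshold is an illustrative assumption.
    """
    relevant = {}
    for name, (X, y) in tasks.items():
        n_features = len(X[0])
        relevant[name] = {
            j
            for j in range(n_features)
            if abs_pearson([row[j] for row in X], y) >= threshold
        }
    # Universal: relevant to every task. Task-specific: whatever is left.
    universal = set.intersection(*relevant.values())
    task_specific = {name: feats - universal for name, feats in relevant.items()}
    return universal, task_specific


# Hypothetical toy data: feature 0 tracks the target in both tasks,
# feature 1 only in task_a, feature 2 only in task_b.
tasks = {
    "task_a": ([[1, 1, 5], [2, 2, 1], [3, 3, 4], [4, 4, 2]], [1, 2, 3, 4]),
    "task_b": ([[1, 5, 2], [2, 1, 4], [3, 4, 6], [4, 2, 8]], [2, 4, 6, 8]),
}
universal, task_specific = split_features(tasks)
print(universal, task_specific)
```

In this toy setup the two tasks share the same feature columns, which mirrors the tabular, multi-dataset setting the abstract describes. A set intersection is the simplest possible definition of "universal" and is used here only to make the two categories concrete.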
Keywords: Dimensionality reduction; Multi-output; Multi-source; Variable selection.
© 2025. The Author(s).
Conflict of interest statement
Declarations. Competing interests: The authors declare no competing interests.
