Gigascience. 2024 Jan 2;13:giad111.

doi: 10.1093/gigascience/giad111.

Machine Learning Made Easy (MLme): a comprehensive toolkit for machine learning-driven data analysis

Akshay Akshay¹,², Mitali Katoch³, Navid Shekarchizadeh⁴,⁵, Masoud Abedi⁴, Ankush Sharma⁶,⁷, Fiona C Burkhard¹,⁸, Rosalyn M Adam⁹,¹⁰,¹¹, Katia Monastyrskaya¹,⁸, Ali Hashemi Gheinani¹,⁸,⁹,¹⁰,¹¹

Affiliations

¹ Functional Urology Research Group, Department for BioMedical Research DBMR, University of Bern, 3008 Bern, Switzerland.
² Graduate School for Cellular and Biomedical Sciences, University of Bern, 3012 Bern, Switzerland.
³ Institute of Neuropathology, Universitätsklinikum Erlangen, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), 91054 Erlangen, Germany.
⁴ Department of Medical Data Science, Leipzig University Medical Centre, 04107 Leipzig, Germany.
⁵ Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI) Dresden/Leipzig, 04105 Leipzig, Germany.
⁶ KG Jebsen Centre for B-cell Malignancies, Institute for Clinical Medicine, University of Oslo, 0318 Oslo, Norway.
⁷ Department of Cancer Immunology, Institute for Cancer Research, Oslo University Hospital, 0310 Oslo, Norway.
⁸ Department of Urology, Inselspital University Hospital, 3010 Bern, Switzerland.
⁹ Urological Diseases Research Center, Boston Children's Hospital, 02115 Boston, MA, USA.
¹⁰ Department of Surgery, Harvard Medical School, 02115 Boston, MA, USA.
¹¹ Broad Institute of MIT and Harvard, 02142 Cambridge, MA, USA.

PMID: 38206587
PMCID: PMC10783149
DOI: 10.1093/gigascience/giad111

Akshay Akshay et al. Gigascience. 2024.

Abstract

Background: Machine learning (ML) has emerged as a vital asset for researchers to analyze and extract valuable information from complex datasets. However, developing an effective and robust ML pipeline can present a real challenge, demanding considerable time and effort, thereby impeding research progress. Existing tools in this landscape require a profound understanding of ML principles and programming skills. Furthermore, users are required to engage in the comprehensive configuration of their ML pipeline to obtain optimal performance.

Results: To address these challenges, we have developed a novel tool called Machine Learning Made Easy (MLme) that streamlines the use of ML in research, currently focusing on classification problems. By integrating 4 essential functionalities (Data Exploration, AutoML, CustomML, and Visualization), MLme fulfills the diverse requirements of researchers while eliminating the need for extensive coding. To demonstrate the applicability of MLme, we conducted rigorous testing on 6 distinct datasets, each presenting unique characteristics and challenges. Our results consistently showed promising performance across the datasets, reaffirming the versatility and effectiveness of the tool. Additionally, by utilizing MLme's feature selection functionality, we successfully identified significant markers for CD8+ naive (BACH2), CD16+ (CD16), and CD14+ (VCAN) cell populations.

Conclusion: MLme serves as a valuable resource for leveraging ML to facilitate insightful data analysis and enhance research outcomes, while alleviating concerns related to complex coding scripts. The source code and a detailed tutorial for MLme are available at https://github.com/FunctionalUrology/MLme.

Keywords: AutoML; classification problems; data analysis; machine learning; visualization.

Conflict of interest statement

The authors declare they have no competing interests.

Figures

**Figure 1:**
Graphical abstract. The input data for Machine Learning Made Easy (MLme) is a file with samples as rows and features as columns, with sample names in the first column and target classes in the last column. MLme provides various features to enhance usability. The data exploration feature enables users to explore the data and gain initial insights. For advanced users, the custom ML feature allows the creation of custom ML pipelines. Upon execution, MLme generates a compressed zip file containing inputParameter.pkl, script.py, and README.txt. Alternatively, users can opt for the AutoML feature, which applies a default ML pipeline to the input file. Both CustomML and AutoML produce a results.pkl file, which can be further analyzed using the visualization feature.
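The input layout described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a comma-delimited file with a header row and numeric feature values; the delimiter, the header row, the function name `load_samples_table`, and the demo column names are illustrative assumptions, not part of MLme.

```python
import csv
import io


def load_samples_table(handle, delimiter=","):
    """Split a samples-by-features table into names, features, and targets.

    Assumed layout (header row and delimiter are assumptions of this sketch):
    first column = sample name, middle columns = numeric features,
    last column = target class.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    feature_names = header[1:-1]
    sample_names, features, targets = [], [], []
    for row in reader:
        if not row:  # skip blank lines
            continue
        sample_names.append(row[0])
        features.append([float(value) for value in row[1:-1]])
        targets.append(row[-1])
    return sample_names, feature_names, features, targets


# Hypothetical two-sample table used only to exercise the function.
demo = io.StringIO(
    "sample,geneA,geneB,class\n"
    "s1,0.5,1.2,case\n"
    "s2,0.1,3.4,control\n"
)
names, feature_names, X, y = load_samples_table(demo)
```

Keeping sample names and target classes separate from the feature matrix mirrors the caption's description: only the middle columns are treated as model inputs.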

**Figure 2:**
Default ML Pipeline for AutoML. The default ML pipeline can be represented as a flowchart that starts by splitting the input dataset into training and independent test sets, provided the user has activated the test set option. Otherwise, the entire dataset is used for training. In the subsequent step, the training dataset is divided into k bins of equal size through stratified sampling. From these bins, k – 1 are designated as training sets while the remaining bin becomes the test set. In the preprocessing step, low-variance features are removed first, followed by data scaling and resampling. Subsequently, the SelectPercentile univariate feature selection method is applied to select important features, and 5 ML classification algorithms are trained. Model performance is assessed on the test set using 3 different methods, and multiple performance metrics are computed. This entire process is repeated for each unique bin in the k-fold cross-validation (CV) method. The pipeline outputs a zip file comprising the log .txt and the results.pkl files. The user can examine the results by visualizing the contents of the pickle file using MLme.
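The stratified binning step in the Figure 2 caption can be sketched in plain Python. This is an illustration of the general technique only, not MLme's implementation; the function name `stratified_k_folds`, the round-robin assignment, and the fixed seed are assumptions of this sketch. In practice, scikit-learn's `StratifiedKFold` provides the same kind of split. Preprocessing, feature selection, and model training are omitted.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k, seed=0):
    """Yield (train_indices, test_indices) for each of k stratified bins."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Deal each class's shuffled samples round-robin into the k bins,
    # so every bin keeps roughly the overall class proportions.
    bins = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            bins[position % k].append(index)

    # Each bin serves once as the test set; the other k - 1 form the training set.
    for held_out in range(k):
        test = sorted(bins[held_out])
        train = sorted(
            index
            for bin_number, members in enumerate(bins)
            if bin_number != held_out
            for index in members
        )
        yield train, test


# Hypothetical labels: 6 samples of class "a" and 3 of class "b".
labels = ["a"] * 6 + ["b"] * 3
folds = list(stratified_k_folds(labels, k=3))
```

With these demo labels, each of the 3 test bins holds two "a" samples and one "b" sample, and every sample is held out exactly once across the folds.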

**Figure 3:**
Identification of potential markers for CD8⁺ naive, CD16⁺, and CD14⁺ cell populations in the PBMC dataset. (A) Heatmap visualization showing the expression patterns of 50 genes selected by MLme. (B–D) Expression levels of key markers specific to CD8⁺ naive, CD16⁺, and CD14⁺ cell populations, respectively, within each cell type.

Update of

Machine Learning Made Easy (MLme): A Comprehensive Toolkit for Machine Learning-Driven Data Analysis.
Akshay A, Katoch M, Shekarchizadeh N, Abedi M, Sharma A, Burkhard FC, Adam RM, Monastyrskaya K, Gheinani AH. bioRxiv [Preprint]. 2023 Jul 4:2023.07.04.546825. doi: 10.1101/2023.07.04.546825. Update in: Gigascience. 2024 Jan 2;13:giad111. doi: 10.1093/gigascience/giad111. PMID: 37461685.

Cited by

MLcps: machine learning cumulative performance score for classification problems.
Akshay A, Abedi M, Shekarchizadeh N, Burkhard FC, Katoch M, Bigger-Allen A, Adam RM, Monastyrskaya K, Gheinani AH. Gigascience. 2022 Dec 28;12:giad108. doi: 10.1093/gigascience/giad108. Epub 2023 Dec 13. PMID: 38091508.
