Review. ACS Polym Au. 2022 Nov 17;3(2):141–157. doi: 10.1021/acspolymersau.2c00037. eCollection 2023 Apr 12.

A User's Guide to Machine Learning for Polymeric Biomaterials

Travis A Meyer et al. ACS Polym Au.

Abstract

The development of novel biomaterials is a challenging process, complicated by a design space with high dimensionality. Requirements for performance in the complex biological environment lead to difficult a priori rational design choices and time-consuming empirical trial-and-error experimentation. Modern data science practices, especially artificial intelligence (AI)/machine learning (ML), offer the promise to help accelerate the identification and testing of next-generation biomaterials. However, it can be a daunting task for biomaterial scientists unfamiliar with modern ML techniques to begin incorporating these useful tools into their development pipeline. This Perspective lays the foundation for a basic understanding of ML while providing a step-by-step guide to new users on how to begin implementing these techniques. A tutorial Python script has been developed that walks users through the application of an ML pipeline using data from a real biomaterial design challenge based on the group's research. This tutorial provides an opportunity for readers to see and experiment with ML and its syntax in Python. The Google Colab notebook can be easily accessed and copied from the following URL: www.gormleylab.com/MLcolab.


Conflict of interest statement

The authors declare no competing financial interest.

Figures

Figure 1
Schematic of an example Design-Build-Test-Learn paradigm for biomaterials development. Materials are initially designed with certain chemical and physical properties based on either rational design or a comprehensive survey of material features. Materials are then built and tested for desirable characteristics, ideally through high-throughput laboratory automation. These data, alongside design parameters, are then fed into a machine learning (ML) pipeline, where key patterns are extracted to create predictive models. These models can then be used to help design new material generations with targeted functionality. This figure was created with BioRender.
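One iteration of the Design-Build-Test-Learn loop described above can be sketched in a few lines of Python. This is a toy illustration, not the paper's pipeline: the response surface, data, and function names are all invented, and scikit-learn is assumed as the modeling library.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_experiment(designs):
    """Stand-in for the Build/Test steps (a made-up response surface)."""
    return designs[:, 0] * 2 - (designs[:, 1] - 0.5) ** 2

tested = rng.uniform(size=(20, 2))      # designs already built and tested
labels = run_experiment(tested)         # measured performance of each design

# Learn: fit a surrogate model to the tested designs
model = RandomForestRegressor(random_state=0).fit(tested, labels)

# Design: rank a pool of untested candidates by predicted performance
candidates = rng.uniform(size=(100, 2))
predicted = model.predict(candidates)
next_batch = candidates[np.argsort(predicted)[-5:]]  # top 5 for the next round
```

In practice the `run_experiment` step would be replaced by actual high-throughput synthesis and characterization, and the loop would repeat over several generations.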
Figure 2
(A) Two-dimensional data set, split into training (blue) and test (orange) subsets, with the modeled function (dashed line) shown. In the case of model underfitting (left), the modeled function does a poor job capturing the variance of the underlying data, leading to high errors for both subsets. Model overfitting (right) leads to a function that perfectly fits the underlying training set profile, leading to low training set error but high test set error. A properly fit function (middle) appropriately captures the underlying pattern in the data and is equally generalizable to new data in the test set. (B) Bias/variance trade-off showing loss function error for both the training set (blue) and test set (orange) as a function of model complexity. A model with low complexity has high bias and does not accurately capture data patterns, leading to an underfit model. As the complexity increases, loss function error decreases for both training and test data sets. Eventually, a point is reached where the model begins to overfit the training data, leading to a rise in test set error and an increase in model variance. (C) Example of 5-fold cross validation, where the data set is split into five bins containing equal proportions of data and training/validation is completed five times, using different subsets for training (blue) and test (orange) data. Loss can then be compared across all five iterations to get an average loss for the entire data set.
Figure 3
General ML pipeline. The first step involves the conversion of material features and labels into tabular data in which categorical data have been encoded and numerical data have been scaled and normalized. In the second step, the model is trained such that the features are fed into the model and a predicted label is compared to the actual label to generate a loss term. Training iterations focus on modifying model parameters to minimize loss across the training data set. Finally, the model is validated against a test data set, and model hyperparameters can be systematically tuned to minimize test set loss and find the optimal model design.
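The encode-scale-train-validate sequence described in Figure 3 maps naturally onto scikit-learn's `ColumnTransformer` and `Pipeline`. The toy data below (a categorical monomer identity and a numerical molecular weight) are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Invented toy data: one categorical and one numerical material feature
df = pd.DataFrame({
    "monomer": ["A", "B", "A", "C", "B", "C", "A", "B"] * 5,
    "mol_weight": [10, 20, 15, 30, 25, 35, 12, 22] * 5,
})
y = df["mol_weight"] * 0.1 + df["monomer"].map({"A": 1.0, "B": 2.0, "C": 3.0})

pre = ColumnTransformer([
    ("cat", OneHotEncoder(), ["monomer"]),       # encode categorical data
    ("num", StandardScaler(), ["mol_weight"]),   # scale numerical data
])
pipe = Pipeline([("prep", pre),
                 ("model", RandomForestRegressor(random_state=0))])

X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=0)
pipe.fit(X_train, y_train)                 # training step
test_score = pipe.score(X_test, y_test)    # validation: R^2 on held-out data
```

Bundling preprocessing and model into one `Pipeline` object also keeps the hyperparameter-tuning step honest: tools like `GridSearchCV` can then refit the encoder and scaler inside each cross-validation fold rather than leaking test-set statistics.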
Figure 4
(A) K-nearest neighbors (KNN) classifier with unlabeled data point (gray) assigned a label (blue or orange) based on voting by the k = 4 closest labeled data points in feature space. (B) Two-dimensional support vector machine (SVM) classifier where the optimal hyperplane separating classes is chosen by maximizing the size of the margin (gray box) based on support vector data points. (C) Hypothetical decision tree used to classify gene expression profiles based on hydrogel characteristics. Features are used as decision points at each node (blue), and observations are sorted into child nodes based on feature values. Eventually, leaves (orange) are reached that contain no further subdivisions, and labels are assigned based on leaf composition. (D) Artificial neural network consisting of an input layer, two hidden layers, and an output layer. Features (F1–F3) are fed into the input layer, where they are subsequently sent to hidden layers for dot-product summation (Σ) and activation function processing (f). This process repeats until the final predicted label is calculated in the output layer.
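Panel A's voting scheme is easy to reproduce with scikit-learn's `KNeighborsClassifier` using the same k = 4. The two clusters below are invented stand-ins for the blue and orange classes in the figure.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two clusters in 2-D feature space, labeled 0 ("blue") and 1 ("orange")
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [-0.1, 0.1],
              [2.0, 2.0], [2.1, 1.9], [1.9, 2.1], [2.2, 2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# k = 4 as in Figure 4A: the new point's label is the majority vote
# of its four nearest labeled neighbors
knn = KNeighborsClassifier(n_neighbors=4).fit(X, y)
label = knn.predict([[0.05, 0.05]])[0]   # the 4 nearest points are all class 0
```

The other three model families in the figure have equally direct scikit-learn counterparts (`SVC`, `DecisionTreeClassifier`, `MLPClassifier`), all sharing the same `fit`/`predict` interface.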
Figure 5
Model performance is evaluated by plotting the actual REA labels from the data (x-axis) against the predicted REA labels from the model output of the test data set (y-axis). The orange line shows the 1:1 equivalence that would be obtained with a “perfect” model, with values below the line representing an underprediction of the actual REA and values above the line representing an overprediction. Quantitative metrics of model performance (Random Forest R2 score and MAE) are listed above the graph for reference.
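The two metrics reported above the parity plot can be computed directly from their defining formulas. The actual/predicted values below are invented placeholders, not the REA data from the paper.

```python
import numpy as np

# Invented actual vs. predicted labels standing in for the REA values
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination
```

A "perfect" model on the 1:1 line would give `mae = 0` and `r2 = 1`; points falling consistently below or above the line inflate `mae` without necessarily being visible in `r2` alone, which is why the figure reports both.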
Figure 6
Summary plot of SHAP analysis of random forest regression model built here. The relative impact of each feature on model output (SHAP value) is displayed as horizontal position, and the relative magnitude of the feature value for each data point is color-coded (red = high value, blue = low value).
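Reproducing the SHAP summary plot itself requires the third-party `shap` package. As a lighter-weight sketch of the same idea (quantifying each feature's impact on model output), scikit-learn's permutation importance gives a related global ranking; note this is a different technique from SHAP, and the data below are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))        # 3 invented features
y = 5 * X[:, 0] + 0.5 * X[:, 1]       # feature 0 dominates, feature 2 is noise

model = RandomForestRegressor(random_state=0).fit(X, y)

# Shuffle each feature in turn and measure how much the score degrades
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]   # most -> least important
```

Unlike SHAP, permutation importance yields one global score per feature rather than a per-sample attribution, so it cannot show the value-dependent coloring of the summary plot; it is useful mainly as a quick sanity check before running a full SHAP analysis.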
