J Mech Behav Biomed Mater. 2022 Jan:125:104921.
doi: 10.1016/j.jmbbm.2021.104921. Epub 2021 Oct 31.

ColGen: An end-to-end deep learning model to predict thermal stability of de novo collagen sequences

Chi-Hua Yu et al. J Mech Behav Biomed Mater. 2022 Jan.

Abstract

Collagen is the most abundant structural protein in humans, with dozens of sequence variants accounting for over 30% of the protein in an animal body. The fibrillar and hierarchical arrangements of collagen are critical to its mechanical properties, providing high strength and toughness. Because of this ubiquitous role in human tissues, collagen-based biomaterials are commonly used for tissue repair and regeneration, which requires chemical and thermal stability over a range of temperatures during material preparation ex vivo and subsequent use in vivo. Collagen unfolds from a triple helix to a random coil over a temperature interval, and the midpoint of that interval, the melting temperature Tm, is used as a measure of the thermal stability of the molecule. However, finding a robust framework for designing a collagen sequence that yields a specific Tm remains a challenge, including when using conventional molecular dynamics modeling. Here we propose a de novo framework that provides a model which outputs the Tm values of input collagen sequences, using deep learning trained on a large data set of collagen sequences and their corresponding Tm values. With this framework, we can quickly evaluate how mutations and order in the primary sequence affect the stability of collagen triple helices. Specifically, we confirm that mutations of glycine residues, mutations in the middle of a sequence, and short sequence lengths cause the greatest drops in Tm.
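As a rough illustration of the midpoint definition of Tm, the sketch below estimates Tm from a melting curve by linear interpolation to the point where half of the helix remains folded. The function name `estimate_tm`, the use of a folded fraction of 0.5 as the midpoint, and the data points are illustrative assumptions; none of them come from the paper.

```python
from typing import Sequence


def estimate_tm(temps_c: Sequence[float], fraction_folded: Sequence[float]) -> float:
    """Return the temperature at which the folded fraction crosses 0.5.

    Assumes temperatures are increasing and the folded fraction falls
    through 0.5 once (a simple two-state melting curve).
    """
    if len(temps_c) != len(fraction_folded):
        raise ValueError("inputs must have the same length")
    points = list(zip(temps_c, fraction_folded))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 >= 0.5 >= f1 and f0 != f1:
            # Linear interpolation between the two bracketing measurements.
            return t0 + (f0 - 0.5) * (t1 - t0) / (f0 - f1)
    raise ValueError("curve does not cross the 0.5 midpoint")


# Hypothetical melting curve (made-up values, for illustration only).
temps = [20.0, 30.0, 40.0, 50.0]
folded = [1.0, 0.9, 0.1, 0.0]
tm = estimate_tm(temps, folded)  # about 35 degrees C for these values
```

Experimental Tm values are usually obtained by fitting a full transition model to spectroscopic data; the interpolation here only makes the "midpoint" idea concrete.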

Keywords: Collagen; Deep learning; Long short-term memory artificial recurrent neural network; Machine learning; Melting temperature.

Figures

Fig. 1. Distribution of data from the literature, based on experimental results.
a) Overview of the problem studied here: predicting the melting temperature Tm from the sequence of collagen molecules. b) Experimental melting temperatures collected. c) Normalized Tm value distribution. Thermal stability data sets of observed Tm values for Gly-X-Y tripeptide units in triple-helical collagen-like peptides are integrated here to produce an algorithm for predicting global melting temperatures.
Fig. 2. Overview of machine learning model.
a) We design a deep learning network to discover hidden features of collagen sequences by introducing an embedding layer. b) The deep learning model starts with an embedding layer, followed by two 1D convolution layers; all features are then flattened and passed to a fully connected layer that performs regression to determine the Tm value. Figures S1 and S2 provide details of the neural network model.
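The layer order described in this caption (embedding, two 1D convolutions, flatten, fully connected regression) can be sketched as a plain-Python forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the alphabet, embedding size, filter count, kernel width, ReLU activations, fixed input length, and random untrained weights are all hypothetical, and the names `init_params`, `embed`, `conv1d_relu`, and `predict_tm` are invented for illustration.

```python
import random
from typing import List

Matrix = List[List[float]]

# Assumed one-letter alphabet: 20 standard residues plus "O" (hydroxyproline).
ALPHABET = "ACDEFGHIKLMNOPQRSTVWY"


def init_params(seq_len: int, embed_dim: int = 4, filters: int = 3,
                width: int = 3, seed: int = 0) -> dict:
    """Random, untrained weights whose shapes match the layer stack."""
    rng = random.Random(seed)

    def rand(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    # Each valid convolution shortens the sequence by (width - 1) positions.
    flat_len = (seq_len - 2 * (width - 1)) * filters
    return {
        "embedding": rand(len(ALPHABET), embed_dim),
        "conv1": [rand(width, embed_dim) for _ in range(filters)],
        "conv2": [rand(width, filters) for _ in range(filters)],
        "dense_w": rand(1, flat_len)[0],
        "dense_b": 0.0,
    }


def embed(seq: str, table: Matrix) -> Matrix:
    """Look up one vector per residue; result is (length, embed_dim)."""
    return [table[ALPHABET.index(aa)] for aa in seq]


def conv1d_relu(x: Matrix, kernels: List[Matrix]) -> Matrix:
    """Valid 1D convolution along the sequence, followed by ReLU.

    Each kernel is (width, in_channels); output is
    (length - width + 1, number_of_kernels).
    """
    width = len(kernels[0])
    out = []
    for start in range(len(x) - width + 1):
        row = []
        for k in kernels:
            s = sum(x[start + i][c] * k[i][c]
                    for i in range(width) for c in range(len(k[i])))
            row.append(max(0.0, s))
        out.append(row)
    return out


def predict_tm(seq: str, params: dict) -> float:
    """Embedding -> conv -> conv -> flatten -> dense, giving one scalar."""
    h = embed(seq, params["embedding"])
    h = conv1d_relu(h, params["conv1"])
    h = conv1d_relu(h, params["conv2"])
    flat = [v for row in h for v in row]
    if len(flat) != len(params["dense_w"]):
        raise ValueError("sequence length does not match the dense layer")
    return sum(w * v for w, v in zip(params["dense_w"], flat)) + params["dense_b"]


seq = "GPOGPOGPOGPO"
params = init_params(len(seq))
untrained_output = predict_tm(seq, params)  # meaningless until trained
```

Because the weights are random, the output carries no physical meaning; the sketch only shows how a sequence flows through the layers and how the flattened feature length depends on the input length. A trained model would also need padding or truncation to handle sequences of different lengths.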
Fig. 3. Predictive accuracy of ColGen, and training performance.
a) Comparison of the training and test sets, shown with a 95% confidence interval, with R² plotted for training, testing, and generation. b) Training and validation errors over epochs indicate a well-fit model; both errors reach a plateau at around 150 epochs.
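For readers unfamiliar with the R² metric mentioned in this caption, the following sketch computes the coefficient of determination between observed and predicted values. The function name `r_squared` and the numbers are illustrative assumptions and are not results from the paper.

```python
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Made-up Tm values in degrees C, for illustration only.
observed_tm = [30.0, 40.0, 50.0]
predicted_tm = [32.0, 40.0, 48.0]
score = r_squared(observed_tm, predicted_tm)  # 1 - 8/200, about 0.96
```

A value near 1 on held-out data suggests the predictions track the measurements closely, while a large gap between training and test R² would point to overfitting.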
Fig. 4. Characterization of the effects of various types of mutations, predicted by ColGen.
a) Tm values for mutations in the G, P, or O position demonstrate that mutations in the middle of the sequence are the most destabilizing. Mutations in the G position are the most destabilizing to the peptide. Error bars indicate the standard deviation over all amino acids that were mutated. b) Thermal stability as a function of collagen sequence length, where length is the number of repeat units (GOP), demonstrates that there is a critical length beyond which the Tm can no longer be increased significantly. This critical length is consistent with other studies.
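A positional mutation scan of the kind summarized in this caption could be set up along the following lines. The sketch only generates the mutant sequences; it assumes a Gly-Pro-Hyp repeat written as "GPO" and a single replacement residue per scan, and the names `gpo_peptide` and `single_mutants` are hypothetical. Scoring each mutant would require a trained predictor, which is not shown.

```python
from typing import List, Tuple

# Offset of each site within one assumed "GPO" repeat unit.
SITE_OFFSET = {"G": 0, "P": 1, "O": 2}


def gpo_peptide(n_repeats: int) -> str:
    """Model peptide made of n_repeats copies of the assumed GPO unit."""
    return "GPO" * n_repeats


def single_mutants(n_repeats: int, site: str,
                   replacement: str) -> List[Tuple[int, str]]:
    """One mutant per repeat unit, substituting the chosen site.

    Returns (position, sequence) pairs so that predicted stability can
    later be plotted against position along the chain.
    """
    wild_type = gpo_peptide(n_repeats)
    mutants = []
    for repeat in range(n_repeats):
        pos = 3 * repeat + SITE_OFFSET[site]
        mutants.append((pos, wild_type[:pos] + replacement + wild_type[pos + 1:]))
    return mutants


# Glycine-to-alanine scan along a four-repeat peptide.
mutants = single_mutants(4, "G", "A")
# [(0, 'APOGPOGPOGPO'), (3, 'GPOAPOGPOGPO'),
#  (6, 'GPOGPOAPOGPO'), (9, 'GPOGPOGPOAPO')]
```

Repeating the scan for each site and each replacement residue, then averaging the predicted Tm per position, would give the kind of mean and standard deviation that the error bars in panel a describe.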
Fig. 5. Characterization of effect of disorder on Tm, as predicted by the model.
a) Tm values of disorder arranged by G, P, or O position confirm that increasing the number of mutations along the chain decreases the thermal stability of the triple helix. Error bars indicate the standard deviation over all amino acids that were mutated. b) Tm values of disorder in the O position demonstrate that initial mutations to polar, positively charged, and negatively charged amino acids confer the same degree of stability on the molecule. However, as the number of mutations increases, polar amino acids are the least destabilizing to the triple helix, suggesting that they should be used for bacterial expression of collagen, where expression of O is not possible.

