Proc Natl Acad Sci U S A. 2022 Oct 4;119(40):e2209524119.
doi: 10.1073/pnas.2209524119. Epub 2022 Sep 26.

Discovering design principles of collagen molecular stability using a genetic algorithm, deep learning, and experimental validation

Eesha Khare et al. Proc Natl Acad Sci U S A. 2022.

Abstract

Collagen is the most abundant structural protein in humans, providing crucial mechanical properties, including high strength and toughness, in tissues. Collagen-based biomaterials are, therefore, used for tissue repair and regeneration. Utilizing collagen effectively during materials processing ex vivo and subsequent function in vivo requires stability over wide temperature ranges to avoid denaturation and loss of structure, measured as melting temperature (Tm). Although significant research has been conducted on understanding how collagen primary amino acid sequences correspond to Tm values, a robust framework to facilitate the design of collagen sequences with specific Tm remains a challenge. Here, we develop a general model using a genetic algorithm within a deep learning framework to design collagen sequences with specific Tm values. We report 1,000 de novo collagen sequences, and we show that we can efficiently use this model to generate collagen sequences and verify their Tm values using both experimental and computational methods. We find that the model accurately predicts Tm values within a few degrees centigrade. Further, using this model, we conduct a high-throughput study to identify the most frequently occurring collagen triplets that can be directly incorporated into collagen. We further discovered that the number of hydrogen bonds within collagen calculated with molecular dynamics (MD) is directly correlated to the experimental measurement of triple-helical quality. Ultimately, we see this work as a critical step to helping researchers develop collagen sequences with specific Tm values for intended materials manufacturing methods and biomedical applications, realizing a mechanistic materials by design paradigm.

Keywords: collagen; deep learning; generative algorithm; mechanics; thermal stability.

Conflict of interest statement

The authors declare no competing interest.

Figures

Fig. 1.
The hierarchy of collagen helps maintain its structural integrity. (A) The collagen amino acid primary sequence, often in the form of G-X-Y repeat triplets, forms a larger chain. The three chains come together to form a triple helix, characteristic of collagen, which is also known as tropocollagen. The tropocollagen assembles into larger fibril and fiber units. (B) This work focuses on the thermal stability of tropocollagen. Thermal stability is characterized by the Tm value, which is the midpoint temperature of the denaturation process of the triple helix of tropocollagen to a disordered state. Once collagen is not in a triple helix, it no longer contributes to the mechanical stability of the larger fiber.
Fig. 2.
The machine learning–based genetic algorithm used. Three sequences are randomly selected from a randomly generated population based on the dataset collagen sequences. The three sequences undergo tournament mating to identify the two best parents; these parents undergo further cross-over and mutations within their sequences to produce offspring. The resulting children are then evaluated with the NLP (natural language processing) ColGen deep learning Tm predictor, and the child that best matches the desired Tm value (the objective function) is output. If elitism is implemented in the model, the child is overrepresented in the initial population to help preserve its general sequence features. Numbers in the bottom right of the boxes represent the numbers of sequences in each stage.
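The loop described in this caption can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_tm` is a toy stand-in for the ColGen predictor, and the triplet alphabet, population size, mutation rate, number of children, and the way elitism is applied are all hypothetical choices made only so the example runs.

```python
import random

# Hypothetical triplet alphabet (O denotes hydroxyproline); not taken from the paper.
TRIPLETS = ["GPO", "GPA", "GAO", "GEK", "GPK", "GDO"]


def split_triplets(seq):
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def predict_tm(seq):
    """Toy stand-in for the ColGen Tm predictor (assumption, not the real model):
    a 10 degree baseline plus 3 degrees per GPO triplet."""
    return 10.0 + 3.0 * split_triplets(seq).count("GPO")


def fitness(seq, target_tm):
    # Objective: absolute distance from the desired Tm (smaller is better).
    return abs(predict_tm(seq) - target_tm)


def tournament(population, target_tm, rng):
    # Draw three sequences at random and keep the two closest to the target.
    contenders = sorted(rng.sample(population, 3), key=lambda s: fitness(s, target_tm))
    return contenders[0], contenders[1]


def crossover(a, b, rng):
    # Single-point crossover on a triplet boundary, so G-X-Y framing is preserved.
    cut = 3 * rng.randrange(1, len(a) // 3)
    return a[:cut] + b[cut:]


def mutate(seq, rng, rate=0.1):
    # Replace each triplet with a random one with probability `rate`.
    return "".join(rng.choice(TRIPLETS) if rng.random() < rate else t
                   for t in split_triplets(seq))


def evolve(target_tm, n_triplets=10, pop_size=30, generations=50, n_children=10, seed=0):
    rng = random.Random(seed)
    population = ["".join(rng.choice(TRIPLETS) for _ in range(n_triplets))
                  for _ in range(pop_size)]
    best = min(population, key=lambda s: fitness(s, target_tm))
    for _ in range(generations):
        p1, p2 = tournament(population, target_tm, rng)
        children = [mutate(crossover(p1, p2, rng), rng) for _ in range(n_children)]
        child = min(children, key=lambda s: fitness(s, target_tm))
        if fitness(child, target_tm) < fitness(best, target_tm):
            best = child
        # Elitism (one possible reading): over-represent the best child
        # by overwriting a few randomly chosen population members with it.
        for _ in range(3):
            population[rng.randrange(pop_size)] = child
    return best


if __name__ == "__main__":
    seq = evolve(target_tm=22.0)
    print(seq, predict_tm(seq))
```

Crossing over and mutating whole triplets, rather than single residues, keeps glycine at every third position; whether the original algorithm enforces this constraint the same way is not stated in the caption.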
Fig. 3.
CD temperature scan at 222 nm for collagen peptides demonstrating triple-helix structure: (A) 22 °C peptides CP1 and CP3 and (B) 37 °C peptides CP1 and CP3. Scans at 1 °C/min with sampling every 0.1 °C indicate that de novo peptides have Tm values within a couple of degrees of the target Tm.
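As an illustration of how a Tm value can be read off a temperature scan like this one, the sketch below normalizes the signal between its folded and unfolded extremes and interpolates the temperature at which half of the signal is lost. It assumes a clean, monotonically decreasing curve and uses a synthetic sigmoid as data; the function names are hypothetical, and real CD analyses often rely on baseline-corrected two-state fits or first derivatives instead.

```python
import math


def estimate_tm(temps, signal):
    """Return the temperature where the normalized signal crosses 0.5.

    Assumes the folded state gives the highest signal and the unfolded
    state the lowest, with no baseline drift (a simplification).
    """
    hi, lo = max(signal), min(signal)
    if hi == lo:
        raise ValueError("flat signal: no transition to locate")
    frac = [(s - lo) / (hi - lo) for s in signal]
    for i in range(1, len(frac)):
        f0, f1 = frac[i - 1], frac[i]
        if f0 >= 0.5 > f1:
            t0, t1 = temps[i - 1], temps[i]
            # Linear interpolation between the two bracketing samples.
            return t0 + (f0 - 0.5) * (t1 - t0) / (f0 - f1)
    raise ValueError("no folded-to-unfolded crossing found")


def synthetic_melt(tm, width=2.0, start=5.0, stop=70.0, step=0.1):
    """Generate a noise-free sigmoidal melting curve sampled every `step` degrees."""
    n = int(round((stop - start) / step)) + 1
    temps = [start + step * i for i in range(n)]
    signal = [1.0 / (1.0 + math.exp((t - tm) / width)) for t in temps]
    return temps, signal


if __name__ == "__main__":
    temps, signal = synthetic_melt(tm=37.0)
    print(round(estimate_tm(temps, signal), 2))
```

With noisy data the first 0.5 crossing may occur before the true transition, so smoothing or curve fitting would be needed in practice.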
Fig. 4.
CD wavelength scan for collagen peptides demonstrating triple-helix structure: (A) type I collagen as a control at 5 °C and 70 °C and de novo peptides at (B) 5 °C and (C) 70 °C. The inset in B zooms into the wavelength range from 210 to 250 nm for clarity. De novo peptides demonstrate the same characteristic behavior as the type I collagen sequence, indicating that they have a triple-helical structure. Both the type I collagen and de novo peptides denature at 70 °C.
Fig. 5.
Relationship between collagen triple-helix quality and Tm values using experiment and MD simulation. (A) There is an inverse relationship between the RPN and the difference in Tm value between the experimental CD Tm and ColGen machine learning (ML) predicted Tm: ΔTm = (Tm,experiment CD − Tm,ColGenGA)/Tm,experiment CD. This indicates that the ColGen algorithm is able to more robustly predict the thermal stability of higher-quality triple helices. The RPN also follows a direct relationship with Tm value, indicating that more stable triple helices have a higher Tm. (B) MD simulations show that the CPs maintain roughly the expected stability, measured by rmsd of the triple helix as predicted by ColGen. (C) Hydrogen bonding analysis at 50 °C in the MD simulation shows a similar correlation as the RPN in A. Peptides with more hydrogen bonding generally have a lower deviation from ColGen-predicted Tm values compared with experimental Tm values. Further, RPN has a direct relationship with the number of hydrogen bonds in the CP, indicating that a higher-quality triple helix has more hydrogen bonding.
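The deviation metric in panel A can be written as a one-line function. The sketch below assumes the formula reads as the experimental Tm minus the ColGen-predicted Tm, divided by the experimental Tm, and that both values are given in the same temperature unit; the numbers in the usage example are made up for illustration and are not results from the paper.

```python
def relative_tm_deviation(tm_experiment, tm_predicted):
    """Relative deviation: (Tm_experiment - Tm_predicted) / Tm_experiment.

    Both arguments must use the same temperature unit; the unit used in
    the paper for this ratio is not stated in the caption (assumption).
    """
    if tm_experiment == 0:
        raise ValueError("experimental Tm of zero makes the ratio undefined")
    return (tm_experiment - tm_predicted) / tm_experiment


if __name__ == "__main__":
    # Hypothetical values: measured 40.0, predicted 38.0.
    print(relative_tm_deviation(40.0, 38.0))
```

A positive value means the predictor underestimated the measured Tm, and a negative value means it overestimated it.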
Fig. 6.
High-throughput identification of the most frequent sequences in de novo collagen peptides. The co-occurrence matrix of the 1,000 generated de novo collagen sequences for 22 °C (A) and 37 °C (B) when sorted by the most frequent triplets shows which triplets occur together in the same sequence. These most frequent triplets from 22 °C (C) and 37 °C (D) are substituted n times into a (GPO)14 ideal standard peptide, and their destabilizing effect on Tm is evaluated, where ΔTm = Tm(GPO)14 − Tm(sequence).
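The triplet analysis in this caption can be illustrated with a short sketch: count which of the most frequent triplets appear together in the same sequence, and build substituted (GPO)14 peptides for scoring. The three input sequences are toy data, the function names are hypothetical, and the choice to place substitutions at the start of the peptide is an assumption, since the caption does not say where the triplets are inserted.

```python
from collections import Counter
from itertools import combinations


def split_triplets(seq):
    # Split into consecutive G-X-Y triplets, ignoring any trailing partial triplet.
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def cooccurrence(sequences, top_n=5):
    """Return the top_n most frequent triplets and a symmetric matrix whose
    entry [i][j] counts sequences containing both triplet i and triplet j
    (the diagonal counts sequences containing triplet i)."""
    counts = Counter(t for s in sequences for t in split_triplets(s))
    top = [t for t, _ in counts.most_common(top_n)]
    index = {t: i for i, t in enumerate(top)}
    matrix = [[0] * len(top) for _ in top]
    for s in sequences:
        present = sorted(set(split_triplets(s)) & set(top), key=index.get)
        for t in present:
            matrix[index[t]][index[t]] += 1
        for a, b in combinations(present, 2):
            matrix[index[a]][index[b]] += 1
            matrix[index[b]][index[a]] += 1
    return top, matrix


def substitute(triplet, n, length=14, host="GPO"):
    """Build a (host)_length peptide with n host triplets replaced by `triplet`.
    Substitutions are placed at the start of the chain (assumption)."""
    if not 0 <= n <= length:
        raise ValueError("n must be between 0 and length")
    return triplet * n + host * (length - n)


if __name__ == "__main__":
    toy = ["GPOGPAGEK", "GPOGPAGPO", "GPOGEKGEK"]  # toy data, not from the paper
    top, matrix = cooccurrence(toy, top_n=3)
    print(top)
    print(matrix)
    print(substitute("GPA", 2))
```

Each substituted peptide could then be passed to a Tm predictor and compared against the unsubstituted (GPO)14 value to obtain ΔTm as defined in the caption.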
