Adv Sci (Weinh). 2022 Dec;9(34):e2201988.

doi: 10.1002/advs.202201988. Epub 2022 Oct 21.

Machine Learning Guides Peptide Nucleic Acid Flow Synthesis and Sequence Design

Chengxi Li¹ ² ³, Genwei Zhang¹, Somesh Mohapatra⁴, Alex J Callahan¹, Andrei Loas¹, Rafael Gómez-Bombarelli⁴, Bradley L Pentelute¹ ⁵ ⁶ ⁷

Affiliations

¹ Department of Chemistry, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA.
² College of Chemical and Biological Engineering, Zhejiang University, No.866 Yuhangtang Road, Hangzhou, Zhejiang, 310030, P. R. China.
³ ZJU-Hangzhou Global Scientific and Technological Innovation Center, No.733 Jianshe San Road, Xiaoshan District, Hangzhou, Zhejiang, 311200, P. R. China.
⁴ Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA.
⁵ The Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, 500 Main Street, Cambridge, MA, 02142, USA.
⁶ Center for Environmental Health Sciences, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA, 02139, USA.
⁷ Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA, 02142, USA.

PMID: 36270977
PMCID: PMC9731686
DOI: 10.1002/advs.202201988

Chengxi Li et al. Adv Sci (Weinh). 2022 Dec.

Abstract

Peptide nucleic acids (PNAs) are potential antisense therapies for genetic, acquired, and viral diseases. Efficiently selecting candidate PNA sequences for synthesis and evaluation from a genome containing hundreds to thousands of options can be challenging. To facilitate this process, this work leverages machine learning (ML) algorithms and automated synthesis technology to predict PNA synthesis efficiency and guide rational PNA sequence design. The training data are collected from individual fluorenylmethyloxycarbonyl (Fmoc) deprotection reactions performed on a fully automated PNA synthesizer. The optimized ML model achieves 93% prediction accuracy and a Pearson's r of 0.97. The predicted synthesis scores correlate with the experimental high-performance liquid chromatography (HPLC) crude purities (R² = 0.95). Furthermore, the general applicability of ML is demonstrated by designing synthetically accessible antisense PNA sequences from 102 315 predicted candidates targeting exon 44 of the human dystrophin gene, SARS-CoV-2, HIV, as well as selected genes associated with cardiovascular diseases, type II diabetes, and various cancers. Collectively, ML provides an accurate prediction of PNA synthesis quality and serves as a useful computational tool for informing PNA sequence design.

Keywords: automated synthesis; drug design; machine learning; peptide nucleic acid; yield prediction.

Conflict of interest statement

B.L.P. is a co‐founder and/or member of the scientific advisory board of several companies focusing on the development of protein and peptide therapeutics. All other authors declare no competing interests.

Figures

**Figure 1**
Combining machine learning (ML) with automated synthesis technology delivers a design‐build‐test‐learn cycle for PNA sequence design. A Python program‐controlled automated oligonucleotide synthesizer is used to synthesize PNAs, with a real‐time UV–Vis trace monitoring all coupling and deprotection reactions. ML was applied to the integrated peak areas calculated from the deprotection steps in the experimental data. A trained and optimized ML model predicts the synthesis efficiency of arbitrary PNA sequences and therefore enables informed sequence design.

**Figure 2**
Benchmarking 10 ML model architectures for accurate PNA synthesis prediction. a) The input features include 4 PNA monomers, 16 sequence‐coupling combinations, and sequence length. The integral of the Fmoc deprotection peak area is the output response. b) Performance of 10 different ML model architectures on validation and testing datasets, visualized using parity plots. Individual scatter plots show sequences in the validation dataset in blue and sequences in the held‐out testing dataset in orange. Metrics for model performance, unitless/relative root‐mean‐squared‐error (uRMSE), R², and Pearson's correlation, are noted for the validation and testing datasets in the inset textboxes. Titles of the subplots refer to the specific model architectures. c) Test uRMSE values of the 10 ML models; the Ridge model has the lowest value, 0.07. d) Test Pearson values of the 10 ML models; the Ridge model has the highest score, 0.97. For more model performance details, see Tables S1 and S2 (Supporting Information). Abbreviations: SGD, stochastic gradient descent; GP, Gaussian process; SVR, support vector regression; RF, random forest; GB, gradient boosting; kNN, k‐nearest neighbors.
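The feature layout described in panel (a) can be illustrated with a minimal Python sketch. It assumes the four PNA monomers are A, C, G, and T and that the 16 "sequence‐coupling combinations" are the ordered pairs of adjacent monomers. The count-based encoding and the name `encode_pna` are hypothetical choices for illustration and are not taken from the paper.

```python
from itertools import product

MONOMERS = "ACGT"  # assumed monomer alphabet
# The 16 ordered pairs of adjacent monomers: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(MONOMERS, repeat=2)]


def encode_pna(seq):
    """Return 21 features: 4 monomer counts, 16 adjacent-pair counts, length.

    Hypothetical encoding; the paper may represent these features differently.
    """
    seq = seq.upper()
    if not seq or any(ch not in MONOMERS for ch in seq):
        raise ValueError("sequence must be a non-empty string over A, C, G, T")

    monomer_counts = [seq.count(m) for m in MONOMERS]

    # Count overlapping adjacent pairs (str.count would skip overlaps).
    pair_counts = [0] * len(PAIRS)
    for first, second in zip(seq, seq[1:]):
        pair_counts[PAIRS.index(first + second)] += 1

    return monomer_counts + pair_counts + [len(seq)]
```

A vector of this kind could serve as the input to any of the regression models named in the caption, with the deprotection peak area as the target.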

**Figure 3**
Predicted peptide nucleic acid (PNA) synthesis scores agree with experimental validation. a) Six PNA sequences were randomly generated, including three 10‐mers, one 6‐mer, one 14‐mer, and one 18‐mer. ML predicts the synthesis efficiency, denoted as the deprotection peak area of each step, and the traces were found to be consistent with the experimentally recorded UV data. b) The HPLC crude purities of the six randomly generated PNAs show strong correlation (R² = 0.95) with ML‐predicted synthesis scores. c) The crude HPLC traces of three same‐length PNAs were compared to demonstrate the distinguishing capability of the ML model. Integration was applied over the main product peaks, as indicated by LC–MS data (Section S3, Supporting Information).
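The correlation statistics quoted in panel (b) can be sketched in plain Python. The sketch assumes the reported R² refers to a simple least-squares line between predicted score and crude purity, in which case it equals the square of Pearson's r; the paper does not state this here, and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


def r_squared_linear(x, y):
    """R² of the least-squares line y = a + b*x, which equals r squared."""
    return pearson_r(x, y) ** 2
```

Calling `r_squared_linear(predicted_scores, crude_purities)` on two lists of measurements would return a value between 0 and 1; both argument names are placeholders for data that is not reproduced on this page.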

**Figure 4**
ML predicts “high value” antisense PNA sequences for DMD. a) Left, predicted scores for 14 854 18‐mer PNA sequences targeting exon 44 of the human dystrophin gene; right, the crude total ion current (TIC) chromatogram and full‐range mass spectrum of three representative PNA sequences after individual synthesis. b) Yield, HPLC trace, total mass spectrum, and deconvoluted mass of purified easy sequence I. c) Yield, HPLC trace, total mass spectrum, and deconvoluted mass of purified medium sequence II. A pure product of difficult PNA sequence III could not be obtained after purification.

**Figure 5**
ML predicts synthetically accessible antisense PNA sequences for various diseases and cancer targets. Predicted scores for all possible 18‐mer PNA sequences targeting the whole genome of SARS‐CoV‐2 and HIV‐1, or mRNA sequences of ANGPTL3, ANGPTL4, APOB, APOC3, LPA, PCSK9, GCGR, SGLT2, BRAF, EGFR, HER2, KRAS, MDM2, PD‐L1, and VEGF. Top 100 antisense PNA sequences for each target can be found in Section S9 (Supporting Information).
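Enumerating "all possible 18‐mer" candidates for a target can be sketched as a sliding window over the target sequence. The sketch below assumes each candidate is the reverse complement of an 18‐nucleotide window of the target RNA (antiparallel binding, PNA written N to C terminus); the function name `antisense_candidates` is hypothetical, and the paper's actual enumeration and filtering procedure is not described on this page.

```python
# RNA base -> complementary PNA nucleobase (A-T, C-G, G-C, U-A).
COMPLEMENT = str.maketrans("ACGU", "TGCA")


def antisense_candidates(target_rna, length=18):
    """Yield (start, pna) for every window of `length` bases in the target.

    `pna` is the reverse complement of the window, an assumed convention
    for antiparallel antisense binding.
    """
    rna = target_rna.upper().replace("T", "U")  # accept DNA-style input
    if any(ch not in "ACGU" for ch in rna):
        raise ValueError("target must contain only A, C, G, U (or T)")
    for start in range(len(rna) - length + 1):
        window = rna[start:start + length]
        yield start, window.translate(COMPLEMENT)[::-1]
```

Each candidate produced this way could then be scored by a trained synthesis model and ranked, which is the kind of workflow the top-100 lists in the caption suggest.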

Cited by

Advance in peptide-based drug development: delivery platforms, therapeutics and vaccines.
Xiao W, Jiang W, Chen Z, Huang Y, Mao J, Zheng W, Hu Y, Shi J. Signal Transduct Target Ther. 2025 Mar 5;10(1):74. doi: 10.1038/s41392-024-02107-5. PMID: 40038239. Review.
Computer vision as a new paradigm for monitoring of solution and solid phase peptide synthesis.
Yan C, Fyfe C, Minty L, Barrington H, Jamieson C, Reid M. Chem Sci. 2023 Oct 10;14(42):11872-11880. doi: 10.1039/d3sc01383a. eCollection 2023 Nov 1. PMID: 37920332.
