Adv Sci (Weinh). 2022 Dec;9(34):e2201988.
doi: 10.1002/advs.202201988. Epub 2022 Oct 21.

Machine Learning Guides Peptide Nucleic Acid Flow Synthesis and Sequence Design

Chengxi Li et al. Adv Sci (Weinh). 2022 Dec.

Abstract

Peptide nucleic acids (PNAs) are potential antisense therapies for genetic, acquired, and viral diseases. Efficiently selecting candidate PNA sequences for synthesis and evaluation from a genome containing hundreds to thousands of options can be challenging. To facilitate this process, this work leverages machine learning (ML) algorithms and automated synthesis technology to predict PNA synthesis efficiency and guide rational PNA sequence design. The training data are collected from individual fluorenylmethyloxycarbonyl (Fmoc) deprotection reactions performed on a fully automated PNA synthesizer. The optimized ML model achieves 93% prediction accuracy and a Pearson's r of 0.97. The predicted synthesis scores are validated as correlating with the experimental high-performance liquid chromatography (HPLC) crude purities (correlation coefficient R² = 0.95). Furthermore, the general applicability of ML is demonstrated by designing synthetically accessible antisense PNA sequences from 102 315 predicted candidates targeting exon 44 of the human dystrophin gene, SARS-CoV-2, HIV, as well as selected genes associated with cardiovascular diseases, type II diabetes, and various cancers. Collectively, ML provides an accurate prediction of PNA synthesis quality and serves as a useful computational tool for informing PNA sequence design.

Keywords: automated synthesis; drug design; machine learning; peptide nucleic acid; yield prediction.

Conflict of interest statement

B.L.P. is a co‐founder and/or member of the scientific advisory board of several companies focusing on the development of protein and peptide therapeutics. All other authors declare no competing interests.

Figures

Figure 1
Combining machine learning (ML) with automated synthesis technology delivers a design‐build‐test‐learn cycle for PNA sequence design. A Python program‐controlled automated oligonucleotide synthesizer is used to synthesize PNAs, with a real‐time UV–Vis trace monitoring all coupling and deprotection reactions. ML is applied to the integrated peak areas calculated from the deprotection steps in the experimental data. A trained and optimized ML model predicts the synthesis efficiency of arbitrary PNA sequences and therefore enables informed sequence design.
Figure 2
Benchmarking 10 ML model architectures for accurate PNA synthesis prediction. a) The input features include 4 PNA monomers, 16 sequence‐coupling combinations, and sequence length. The integrated Fmoc deprotection peak area is the output response. b) Performance of 10 different ML model architectures on validation and testing datasets, visualized using parity plots. In each scatter plot, blue points are sequences in the validation dataset and orange points are sequences in the held‐out testing dataset. Metrics for model performance (unitless/relative root‐mean‐squared error (uRMSE), R², and Pearson's correlation) are given for the validation and testing datasets in the inset textboxes. Subplot titles refer to the specific model architectures. c) Test uRMSE values of the 10 ML models, of which the Ridge model has the lowest value: 0.07. d) Test Pearson values of the 10 ML models, of which the Ridge model has the highest score: 0.97. For more model performance details, see Tables S1 and S2 (Supporting Information). Abbreviations: SGD, stochastic gradient descent; GP, Gaussian process; SVR, support vector regression; RF, random forest; GB, gradient boosting; kNN, k‐nearest neighbors.
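The feature set in panel a (4 monomers, 16 coupling combinations, and sequence length) can be illustrated with a minimal sketch. The caption does not state how the features are encoded, so the count-based encoding below, the function name `featurize`, and the feature ordering are assumptions for illustration only, not the authors' implementation.

```python
from itertools import product

MONOMERS = "ACGT"
# The 16 ordered pairs of adjacent monomers (assumed meaning of
# "sequence-coupling combinations"; the caption gives no further detail).
PAIRS = ["".join(p) for p in product(MONOMERS, repeat=2)]


def featurize(seq: str) -> list[int]:
    """Return a 21-element vector: 4 monomer counts, 16 pair counts, length.

    Hypothetical encoding: counts are an assumption, not taken from the paper.
    """
    seq = seq.upper()
    if not seq or any(base not in MONOMERS for base in seq):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T")

    monomer_counts = [seq.count(m) for m in MONOMERS]

    # Count adjacent pairs with a sliding window so overlapping pairs
    # (for example the two "AA" pairs in "AAA") are both counted.
    pair_counts = {p: 0 for p in PAIRS}
    for i in range(len(seq) - 1):
        pair_counts[seq[i:i + 2]] += 1

    return monomer_counts + [pair_counts[p] for p in PAIRS] + [len(seq)]


if __name__ == "__main__":
    print(featurize("ACGT"))
```

A vector of this shape could be passed to a linear model such as ridge regression, with the integrated deprotection peak area as the regression target.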
Figure 3
Predicted peptide nucleic acid (PNA) synthesis scores agree with experimental validation. a) Six PNA sequences were randomly generated, including three 10‐mers, one 6‐mer, one 14‐mer, and one 18‐mer. ML predicts the synthesis efficiency, expressed as the deprotection peak area of each step, and the predicted traces were consistent with the experimentally recorded UV data. b) The HPLC crude purities of the six randomly generated PNAs show a strong correlation (R² = 0.95) with the ML‐predicted synthesis scores. c) The crude HPLC traces of three same‐length PNAs were compared to demonstrate the distinguishing capability of the ML model. Integration was applied over the main product peaks, as indicated by LC–MS data (Section S3, Supporting Information).
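The correlation in panel b relates two paired lists of numbers: predicted synthesis scores and measured crude purities. A minimal sketch of that calculation follows. It assumes R² is the squared Pearson coefficient, which holds for a simple linear fit with an intercept; whether the authors computed R² this way is not stated here. The function names are hypothetical.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


def r_squared_linear_fit(x: list[float], y: list[float]) -> float:
    """R² of a least-squares line y = a*x + b, equal to pearson_r squared."""
    return pearson_r(x, y) ** 2


if __name__ == "__main__":
    # Toy numbers for illustration only; these are NOT data from the paper.
    predicted_scores = [1.0, 2.0, 3.0]
    crude_purities = [1.0, 3.0, 2.0]
    print(r_squared_linear_fit(predicted_scores, crude_purities))
```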
Figure 4
ML predicts “high value” antisense PNA sequences for Duchenne muscular dystrophy (DMD). a) Left, predicted scores for 14 854 18‐mer PNA sequences targeting exon 44 of the human dystrophin gene; right, the crude total ion current (TIC) chromatogram and full range mass spectrum of three representative PNA sequences after individual synthesis. b) Yield, HPLC trace, total mass spectrum, and deconvoluted mass of purified easy sequence I. c) Yield, HPLC trace, total mass spectrum, and deconvoluted mass of purified medium sequence II. A pure product of difficult PNA sequence III could not be obtained after purification.
Figure 5
ML predicts synthetically accessible antisense PNA sequences for various disease and cancer targets. Predicted scores for all possible 18‐mer PNA sequences targeting the whole genomes of SARS‐CoV‐2 and HIV‐1, or the mRNA sequences of ANGPTL3, ANGPTL4, APOB, APOC3, LPA, PCSK9, GCGR, SGLT2, BRAF, EGFR, HER2, KRAS, MDM2, PD‐L1, and VEGF. The top 100 antisense PNA sequences for each target can be found in Section S9 (Supporting Information).
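"All possible 18‐mer PNA sequences" for a target can be pictured as a sliding window over the target sequence, with each window converted to its complementary strand. The sketch below assumes a plain reverse complement written with the PNA bases A, C, G, and T. The strand orientation convention, any deduplication, and any filtering used in the paper are not described here, and the function name `antisense_candidates` is hypothetical.

```python
# Watson-Crick complement, accepting RNA (U) or DNA (T) targets and
# emitting PNA bases A, C, G, T.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def antisense_candidates(target: str, length: int = 18) -> list[str]:
    """Return the reverse complement of every window of `length` in `target`.

    A target of n nucleotides yields n - length + 1 candidates, or none
    when the target is shorter than the window.
    """
    target = target.upper()
    unknown = set(target) - set(COMPLEMENT)
    if unknown:
        raise ValueError(f"unsupported characters in target: {sorted(unknown)}")

    candidates = []
    for start in range(len(target) - length + 1):
        window = target[start:start + length]
        candidates.append("".join(COMPLEMENT[base] for base in reversed(window)))
    return candidates


if __name__ == "__main__":
    # Short toy target with a 2-base window, for illustration only.
    print(antisense_candidates("AUGC", length=2))
```

Each candidate produced this way could then be scored by a trained model and ranked, which is the kind of ranking the top 100 lists imply.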
