2025 Aug 21;18(8):1240.
doi: 10.3390/ph18081240.

Smart Formulation: AI-Driven Web Platform for Optimization and Stability Prediction of Compounded Pharmaceuticals Using KNIME

Artur Grigoryan et al. Pharmaceuticals (Basel).

Abstract

Background/Objectives: Smart Formulation is an artificial intelligence-based platform designed to predict the Beyond Use Dates (BUDs) of compounded oral solid dosage forms. The study aims to develop a decision-support tool for pharmacists by integrating molecular, formulation, and environmental parameters to assist in optimizing the stability of extemporaneous preparations.

Methods: A tree ensemble regression model was trained using a curated dataset of 55 experimental BUD values collected from the Stabilis database. Each formulation was encoded with molecular descriptors, excipient composition, packaging type, and storage conditions. The model was implemented using the KNIME platform, allowing the integration of cheminformatics and machine learning workflows. After training, the model was used to predict BUDs for 3166 APIs under various formulation and storage scenarios.

Results: The analysis revealed a significant impact of excipient type, number, and environmental conditions on API stability. APIs with lower LogP values generally exhibited greater stability, particularly when formulated with a single excipient. Excipients such as cellulose, silica, sucrose, and mannitol were associated with improved stability, whereas HPMC and lactose contributed to faster degradation. The use of two excipients instead of one frequently resulted in reduced BUDs, possibly due to moisture redistribution or phase separation effects.

Conclusions: Smart Formulation represents a valuable contribution to computational pharmaceutics, bridging theoretical formulation design with practical compounding needs. The platform offers a scalable, cost-effective alternative to traditional stability testing and is already available for use by healthcare professionals. Its implementation in hospital and community pharmacies may help mitigate drug shortages, support formulation standardization, and improve patient care. Future developments will focus on real-time stability monitoring and adaptive learning for enhanced precision.
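
To make the encoding step in Methods more concrete, the following minimal sketch turns one formulation record into a numeric feature vector. The record fields, the one-hot layout, and the example values are assumptions chosen for illustration; the abstract does not describe the platform's actual KNIME encoding, and only two molecular descriptors are included here for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass

# Excipients and packaging types named in the abstract and figure captions.
EXCIPIENTS = ["cellulose", "HPMC", "lactose", "mannitol", "silica", "sucrose"]
PACKAGING = ["glass", "plastic", "paper"]


@dataclass
class Formulation:
    """Hypothetical record for one compounded preparation."""
    mol_weight: float        # molecular weight in g/mol
    logp: float              # lipophilicity
    excipients: list[str]    # one or two excipient names
    api_content_pct: float   # API content in percent
    packaging: str           # one of PACKAGING
    temperature_c: float     # storage temperature in degrees Celsius


def encode(f: Formulation) -> list[float]:
    """Flatten a formulation into a numeric vector (illustrative layout)."""
    unknown = set(f.excipients) - set(EXCIPIENTS)
    if unknown:
        raise ValueError(f"unknown excipient(s): {sorted(unknown)}")
    if f.packaging not in PACKAGING:
        raise ValueError(f"unknown packaging: {f.packaging}")
    # One flag per excipient and per packaging type (assumed one-hot layout).
    excipient_flags = [1.0 if e in f.excipients else 0.0 for e in EXCIPIENTS]
    packaging_flags = [1.0 if p == f.packaging else 0.0 for p in PACKAGING]
    return [f.mol_weight, f.logp, *excipient_flags,
            f.api_content_pct, *packaging_flags, f.temperature_c]


# Made-up example values, not taken from the study.
example = Formulation(mol_weight=300.0, logp=1.5, excipients=["mannitol"],
                      api_content_pct=10.0, packaging="plastic",
                      temperature_c=25.0)
vector = encode(example)
```

A flat numeric vector of this kind is what a tree-based regressor typically consumes; categorical inputs could equally be passed as nominal columns, which is a design choice this sketch does not settle.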

Keywords: beyond-use date prediction; drug compounding; excipients; machine learning; molecular descriptors; pharmaceutical stability.

Conflict of interest statement

Author Stefan Helfrich was employed by the company KNIME GmbH. Author Fabien Bruno was employed by the company Pharmacie Delpech. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Figures

Figure 1
Relationship between predicted and experimental stability data, as estimated by the Tree Ensemble Regression learner- and predictor-based model. (◯) Excipients (pure or blended with one other excipient, n = 14 data points); (●) APIs (compounded with one or two excipients, n = 11 data points), conditioned in glass, plastic, and paper packaging (cf. 0 Stability Predictor Model panel in Figure 1). API content ranged from 0.2 to 100%, and storage temperatures were 4 °C, 25 °C, and 40 °C. (n = 14) demonstrated consistent predictability across glass, plastic, and paper packaging. A comparative analysis of seven machine learning models was conducted to evaluate predictive performance (Table 1). Among the tested models, Tree Ensemble Regression achieved the highest accuracy (R² = 0.975), with the lowest root mean squared error (RMSE = 18.93) and mean absolute error (MAE = 10.16). Other ensemble and boosting methods performed less consistently, and Gradient Boosted Trees Regression performed poorly (R² = −0.447, RMSE = 137.60). These findings validate the robustness of Smart Formulation in predicting drug stability under diverse formulation and storage conditions. The model effectively captures key interactions between API content, excipients, packaging, and temperature, making it a reliable tool for pharmaceutical stability assessment.
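
For readers unfamiliar with the three metrics quoted in this caption, the sketch below computes R², RMSE, and MAE from paired experimental and predicted values using their standard definitions. The BUD values are synthetic and chosen only for illustration; they are not data from the study. The second prediction set shows how R² becomes negative when a model fits worse than simply predicting the mean, which is how a value such as −0.447 can arise.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE, and MAE for paired observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


# Synthetic BUDs in days (illustrative only, not from the paper).
experimental = [30, 60, 90, 120, 180]
close_predictions = [35, 55, 95, 115, 175]
poor_predictions = [150, 150, 20, 20, 20]

good = regression_metrics(experimental, close_predictions)
bad = regression_metrics(experimental, poor_predictions)
```

Because RMSE and MAE are expressed in the units of the target (here, days), they are easier to relate to a practical BUD tolerance than R² alone.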
Figure 2
Prediction of API (n = 15; MW: 129.17 to 776.87 g·mol⁻¹; LogP: −0.92 to 5.39; initial content: 10%; conditioning: plastic) stability in 6 pure or 15 combinations of blended excipients as a function of storage temperature.

(A): 4 °C storage, API content: 10%. **: p < 0.01 as compared to HPMC and lactose groups; ***: p < 0.0001 as compared to blend groups (ANOVA, post hoc Tukey's all-pairs comparison). A significant linear relationship between predicted stability and LogP of the API is shown in the presence of either (●) cellulose, mannitol, silica, and sucrose pure excipients, or (●) HPMC and mannitol pure excipients. Shapiro–Wilk p-value > 0.05.

(B): 25 °C storage, API content: 10%. *: p < 0.05 and **: p < 0.01 as compared to HPMC and lactose groups; ***: p < 0.0001 as compared to blend groups. A significant linear relationship between predicted stability and LogP of the API is shown in the presence of either (◯) cellulose, silica, and sucrose, (●) mannitol, (●) HPMC, or (●) lactose pure excipients. Shapiro–Wilk p-value > 0.05.

(C): 40 °C storage, API content: 10%. a, **: p < 0.001 as compared to HPMC and lactose groups; b, ***: p < 0.0001 as compared to cellulose, HPMC, lactose, silica, and sucrose blend groups. A significant linear relationship between predicted stability and LogP of the API is shown in the presence of either (◯) cellulose, mannitol, silica, and sucrose pure excipients, or (●) HPMC and mannitol pure excipients. Shapiro–Wilk p-value > 0.05.

(D): 25 °C storage, API content: 90%. **: p < 0.01 and ***: p < 0.0001 as compared to blend groups. A significant linear relationship between predicted stability and LogP of the API is shown in the presence of either (◯) cellulose, silica, and sucrose, (●) mannitol, (●) HPMC, or (●) lactose pure excipients. Shapiro–Wilk p-value > 0.05.

https://www.statskingdom.com/linear-regression-calculator.html (accessed on 25 February 2025).
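
The linear relationship between predicted stability and LogP reported in this caption can be illustrated with an ordinary least-squares fit. The sketch below is a hand-written fit on synthetic numbers, not the online calculator cited in the caption and not the study's data; the LogP and BUD values are assumptions chosen to show a decreasing trend.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson r."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic example: predicted BUD (days) falling as LogP rises.
logp_values = [-1, 0, 1, 2, 3, 4, 5]
predicted_bud = [160, 155, 151, 144, 140, 136, 129]

slope, intercept, r = fit_line(logp_values, predicted_bud)
```

A negative slope with r close to −1 corresponds to the pattern described in the abstract, where APIs with lower LogP tend to have longer predicted BUDs. Checking residual normality (the Shapiro–Wilk test mentioned in the caption) is a separate step not covered here.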
Figure 3
Comparison of BUDs for 3166 APIs (MW: 12.01–1461.43 g·mol⁻¹; LogP: −12.01 to 17.16) formulated in ■ cellulose, ■ HPMC, ■ lactose, ■ mannitol, ■ silica, and ◻ sucrose (API content: 90% to 1%) and then stored at 25 °C in plastic containers. For API contents of 1%, 10%, 50%, and 90%, significant differences were observed between excipients in the 80–170-day range (p < 0.0001). No significant difference was found between excipients in the 80–95-day category (χ² test).
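
The χ² comparison mentioned in this caption tests whether the distribution of formulations across BUD categories depends on the excipient. The sketch below computes the Pearson χ² statistic and its degrees of freedom for a small contingency table. The counts are hypothetical, and the table layout (rows as excipients, columns as BUD categories) is an assumption about how such a test could be set up, not the authors' exact procedure.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    if len(table) < 2 or len(table[0]) < 2:
        raise ValueError("table must have at least 2 rows and 2 columns")
    if any(len(row) != len(table[0]) for row in table):
        raise ValueError("all rows must have the same number of columns")
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column factors.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows are three excipients, columns are
# "BUD inside a given range" and "BUD outside that range".
counts = [
    [30, 70],
    [50, 50],
    [40, 60],
]
statistic, dof = chi_square_statistic(counts)
```

The statistic is then compared against the χ² distribution with `dof` degrees of freedom to obtain a p-value; a statistics package such as SciPy (`scipy.stats.chi2_contingency`) performs both steps in one call.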
Figure 4
Frequency of formulation categories with BUDs of 80–170 days. Radar charts represent the distribution of formulation categories (% on the y-axis) based on their BUD across different excipients (cellulose, HPMC, lactose, mannitol, silica, and sucrose). The formulations are classified according to their API content, ranging from 90% to 1%, and stored at various temperatures (40 °C, 25 °C, and 4 °C). The color-coded categories indicate the frequency of formulations falling within specific BUD ranges. For example, unlicensed preparations formulated with HPMC containing 10% API and stored at 40 °C show the following distribution: 50.5% of formulations have a BUD between 150 and 160 days, 29.5% between 140 and 150 days, 10% between 120 and 140 days, 9% between 160 and 170 days, and 1% between 80 and 120 days. All formulations were packaged in plastic containers.
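
The percentage distribution described in this caption can be reproduced by binning predicted BUDs into ranges and counting. The sketch below uses the five ranges from the caption's HPMC example; the rule for values that fall exactly on a boundary and the list of BUD values are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges (days) taken from the ranges quoted in the caption's example.
EDGES = [80, 120, 140, 150, 160, 170]
LABELS = ["80-120", "120-140", "140-150", "150-160", "160-170"]


def bud_distribution(buds):
    """Percentage of formulations in each BUD range.

    Assumption: a value on an inner edge goes to the upper range,
    and 170 is included in the last range.
    """
    if not buds:
        raise ValueError("need at least one BUD value")
    counts = Counter()
    for bud in buds:
        if not EDGES[0] <= bud <= EDGES[-1]:
            raise ValueError(f"BUD {bud} is outside the 80-170 day range")
        index = min(bisect_right(EDGES, bud) - 1, len(LABELS) - 1)
        counts[LABELS[index]] += 1
    return {label: 100.0 * counts[label] / len(buds) for label in LABELS}


# Made-up predicted BUDs (days) for ten formulations.
sample_buds = [85, 130, 145, 155, 155, 165, 152, 158, 148, 151]
distribution = bud_distribution(sample_buds)

# Percentages quoted in the caption for HPMC, 10% API, 40 °C.
caption_example = {"150-160": 50.5, "140-150": 29.5, "120-140": 10.0,
                   "160-170": 9.0, "80-120": 1.0}
caption_total = sum(caption_example.values())
```

The caption's quoted percentages sum to 100%, which confirms that the five ranges cover all formulations in that example.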
Figure 5
Machine learning workflow used in Smart Formulation to predict the beyond-use date (BUD) of active pharmaceutical ingredients (APIs) in oral solid dosage forms. The approach relies on a Tree Ensemble Regression algorithm, a supervised learning method that captures complex non-linear relationships between molecular properties, formulation parameters, and environmental conditions. The model is trained on a dataset comprising three categories of input descriptors:

1. API descriptors (18 features): molecular properties such as molecular weight (MW), logP (lipophilicity), rotatable bonds (RB), polar surface area (PS), hydrogen bond donors (HBD), and acceptors (HBA), among others.
2. Formulation descriptors (4 features): encoded excipient compositions, including lactose, silica, cellulose, mannitol, sucrose, and hydroxypropyl methylcellulose (HPMC), as well as API content percentage.
3. Conditioning and storage descriptors (5 features): packaging type (glass, plastic, and paper), storage temperature, and classification of storage conditions.

The tree ensemble regression algorithm processes these features to establish correlations between molecular properties, formulation parameters, and environmental conditions, ultimately predicting the BUD in days. Notably, the model identifies an inverse correlation between LogP and BUD, suggesting that higher lipophilicity is associated with reduced stability. This predictive approach enables formulators to estimate stability efficiently, reducing reliance on extensive real-time stability studies.
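
To give a feel for how a tree ensemble regressor works, the sketch below builds a small bagged ensemble of regression trees in plain Python. It is a generic illustration of bootstrap aggregation, not KNIME's Tree Ensemble Learner, and the two-feature training set (LogP and storage temperature) is synthetic, generated from an assumed formula in which BUD falls as LogP and temperature rise.

```python
import random
from statistics import mean


def build_tree(rows, targets, depth=0, max_depth=3, min_size=2):
    """Grow a regression tree; leaves are mean target values."""
    if depth >= max_depth or len(rows) < 2 * min_size or len(set(targets)) == 1:
        return mean(targets)
    best = None  # (sum of squared errors, feature index, threshold)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows})[1:]:
            left = [t for r, t in zip(rows, targets) if r[j] < threshold]
            right = [t for r, t in zip(rows, targets) if r[j] >= threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            left_mean, right_mean = mean(left), mean(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, threshold)
    if best is None:
        return mean(targets)
    _, j, threshold = best
    left_idx = [i for i, r in enumerate(rows) if r[j] < threshold]
    right_idx = [i for i, r in enumerate(rows) if r[j] >= threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [targets[i] for i in left_idx],
                           depth + 1, max_depth, min_size),
        "right": build_tree([rows[i] for i in right_idx],
                            [targets[i] for i in right_idx],
                            depth + 1, max_depth, min_size),
    }


def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf value."""
    while isinstance(node, dict):
        branch = "left" if row[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node


def fit_ensemble(rows, targets, n_trees=25, seed=0):
    """Train each tree on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    n = len(rows)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx]))
    return trees


def predict_ensemble(trees, row):
    """Average the predictions of all trees."""
    return mean(predict_tree(tree, row) for tree in trees)


# Synthetic training set: features are [LogP, temperature in °C].
# The target formula is an assumption used only to generate example data.
train_rows = [[float(lp), float(t)] for lp in range(6) for t in (4, 25, 40)]
train_buds = [160.0 - 5.0 * lp - 0.5 * t for lp, t in train_rows]

ensemble = fit_ensemble(train_rows, train_buds)
bud_low_logp_cold = predict_ensemble(ensemble, [0.0, 4.0])
bud_high_logp_hot = predict_ensemble(ensemble, [5.0, 40.0])
```

Averaging many trees trained on resampled data reduces the variance of any single tree, which is one reason ensembles are often preferred for small datasets such as the 55 experimental BUD values mentioned in the abstract.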
Figure 6
(A) Predictive stability model of API using the Tree Ensemble Regression learner algorithm in KNIME. (B) Automated workflow deployment for API stability prediction in compounded oral solid preparations, considering excipients, API content, packaging, and storage temperature. The VI Output data panel must be filled in at https://www.knime.com/smart-formulation-data-app (accessed on 11 April 2025).
