J Neural Eng. 2024 Jun 3;21(3):036027. doi: 10.1088/1741-2552/ad48bb.

Reinforcement learning for closed-loop regulation of cardiovascular system with vagus nerve stimulation: a computational study

Parisa Sarikhani et al. J Neural Eng.

Abstract

Objective. Vagus nerve stimulation (VNS) is being investigated as a potential therapy for cardiovascular diseases including heart failure, cardiac arrhythmia, and hypertension. The lack of a systematic approach for controlling and tuning the VNS parameters poses a significant challenge. Closed-loop VNS strategies combined with artificial intelligence (AI) approaches offer a framework for systematically learning and adapting the optimal stimulation parameters. In this study, we presented an interactive AI framework using reinforcement learning (RL) for automated, data-driven design of closed-loop VNS control systems in a computational study. Approach. Multiple simulation environments with a standard application programming interface were developed to facilitate the design and evaluation of the automated data-driven closed-loop VNS control systems. These environments simulate the hemodynamic response to multi-location VNS using biophysics-based computational models of healthy and hypertensive rat cardiovascular systems in resting and exercise states. We designed and implemented the RL-based closed-loop VNS control frameworks in the context of controlling the heart rate and the mean arterial pressure for a set-point tracking task. Our experimental design included two approaches: a general policy using deep RL algorithms and a sample-efficient adaptive policy using probabilistic inference for learning and control. Main results. Our simulation results demonstrated the capability of the closed-loop RL-based approaches to learn optimal VNS control policies and to adapt to variations in the target set points and the underlying dynamics of the cardiovascular system. Our findings highlighted the trade-off between sample efficiency and generalizability, providing insights for proper algorithm selection. Finally, we demonstrated that transfer learning improves the sample efficiency of deep RL algorithms, allowing the development of more efficient and personalized closed-loop VNS systems. Significance. We demonstrated the capability of RL-based closed-loop VNS systems. Our approach provided a systematic, adaptable framework for learning control strategies without requiring prior knowledge about the underlying dynamics.
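
As a rough illustration of the environment interface described in the Approach, the sketch below shows how a reduced-order cardiac surrogate could be wrapped with the Gymnasium API for the set-point tracking task. The class name, parameter bounds, baseline values, and the distance-based reward are assumptions made here for illustration; only the state (HR, MAP), the action (stimulation frequency and amplitude at three VNS locations), and the Gymnasium interface follow the paper's description.

    import numpy as np
    import gymnasium as gym
    from gymnasium import spaces

    class CardiacVNSEnv(gym.Env):
        """Hypothetical Gymnasium wrapper around a reduced-order cardiac surrogate."""

        def __init__(self, surrogate_model, hr_target, map_target):
            super().__init__()
            self.surrogate = surrogate_model  # e.g. a trained TCN; any callable (state, action) -> next state
            self.target = np.array([hr_target, map_target], dtype=np.float32)
            # Illustrative bounds only; the real stimulation parameter ranges are model-specific.
            self.action_space = spaces.Box(low=0.0, high=1.0, shape=(6,), dtype=np.float32)
            self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(2,), dtype=np.float32)
            self.state = None

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            self.state = np.array([300.0, 100.0], dtype=np.float32)  # assumed baseline HR (bpm) and MAP (mmHg)
            return self.state.copy(), {}

        def step(self, action):
            # The surrogate predicts (HR_{t+1}, MAP_{t+1}) from the current state and the VNS parameters A_t.
            self.state = np.asarray(self.surrogate(self.state, action), dtype=np.float32)
            reward = -float(np.linalg.norm(self.state - self.target))  # set-point tracking reward (assumed form)
            return self.state.copy(), reward, False, False, {}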

Keywords: closed-loop VNS; intelligent systems; neuromodulation; reinforcement learning.

Figures

Figure 1.
Overview of the architecture of the simulation environments for developing closed-loop VNS systems, demonstrating the interactions of the RL agent with the rat cardiac model using the standard API. The left block represents the reduced-order surrogates of the physiological cardiac models wrapped with the standard Gymnasium API, where the inputs of the model (color-coded as dark blue) are the stimulation frequency and stimulation amplitude across three different locations at time t (At). The outputs of the model (color-coded as green) are the responses of HR and MAP to the VNS parameters. The model estimates the response of the cardiac system (HRt+1, MAPt+1) to the action At taken at time step t given the current state of the system (HRt, MAPt). The right block represents the reinforcement learning agent, which takes action At according to its policy at time step t and observes the next state St+1 and reward Rt+1.
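
A minimal interaction loop matching the agent-environment cycle in figure 1 might look like the sketch below, where env is an instance of a Gymnasium wrapper such as the one sketched after the abstract; the random action merely stands in for the learned policy.

    # Illustrative loop: the agent observes (HR_t, MAP_t), applies VNS parameters A_t,
    # and receives the next state S_{t+1} and reward R_{t+1} from the surrogate.
    obs, _ = env.reset(seed=0)
    for t in range(200):
        action = env.action_space.sample()  # placeholder for the agent's policy
        obs, reward, terminated, truncated, _ = env.step(action)
        if terminated or truncated:
            obs, _ = env.reset()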
Figure 2.
The pipeline used for developing the simulation environments using the Gymnasium API to test and prototype RL algorithms for regulating the cardiovascular system: (a) used the physiological models of the rat cardiac system under multi-location VNS implemented in MATLAB, (b) generated a simulated data set of the response of the cardiac system by varying randomly selected VNS parameters, (c) trained the reduced-order TCN model to predict the response of HR and MAP to the VNS parameters, and (d) used the Gymnasium standard API wrapper over the trained TCN models for easier compatibility with RL algorithms.
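
The reduced-order surrogate in step (c) could, for instance, be a small temporal convolutional network. The PyTorch sketch below is only an assumed architecture mapping a short history of the state and the six VNS parameters to the next HR and MAP; the layer sizes, dilations, and (non-causal) padding are not taken from the paper.

    import torch
    import torch.nn as nn

    class TinyTCN(nn.Module):
        """Assumed reduced-order surrogate: history of [HR, MAP, 6 VNS parameters] -> next [HR, MAP]."""

        def __init__(self, in_channels=8, hidden=32, out_dim=2):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(in_channels, hidden, kernel_size=3, padding=2, dilation=2),
                nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=4, dilation=4),
                nn.ReLU(),
            )
            self.head = nn.Linear(hidden, out_dim)

        def forward(self, x):               # x: (batch, channels, time)
            h = self.net(x)
            return self.head(h[:, :, -1])   # predict from the last time step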
Figure 3.
The overview of designing a general policy. The left panel represents the structure of the simulation environment with the standard Gymnasium API during the training mode. The right panel depicts the simulation environment in the inference mode. The policy network of the RL agents was designed as a simple MLP model, where HRt and MAPt are the current states of the environment. The input vector of the policy network was extended by adding HRtarget and MAPtarget (target set points) to design a general policy. The environment is the reduced-order surrogate of the physiological cardiac models wrapped with the standard Gymnasium API, where the inputs of the model (color-coded as dark blue) are the stimulation frequency and stimulation amplitude across three different locations at time t (At). The outputs of the model (color-coded as green) are the responses of HR and MAP to the VNS parameters.
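
A minimal sketch of this input extension is shown below: the policy network receives the current state concatenated with the target set points, so a single policy can track arbitrary targets. The layer widths and the sigmoid output squashing are assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class GeneralPolicy(nn.Module):
        """Assumed MLP policy over [HR_t, MAP_t, HR_target, MAP_target]."""

        def __init__(self, n_actions=6, hidden=64):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(4, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, n_actions), nn.Sigmoid(),  # stimulation parameters scaled to [0, 1]
            )

        def forward(self, state, target):
            return self.mlp(torch.cat([state, target], dim=-1))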
Figure 4.
The workflow of the adaptive policy using PILCO, illustrating the iterative process in which actions are executed according to the current policy (or randomly selected from the parameter space for the initial query) for N iterations. PILCO collects state transitions and reward values from the environment in response to the actions, augments its dataset, updates the Gaussian process (GP) model of the state transitions, and adjusts the policy parameters based on the augmented data. This process is repeated to improve the adaptive policy over time. The environment is the reduced-order surrogate of the physiological cardiac models wrapped with the standard Gymnasium API, where the inputs of the model (color-coded as dark blue) are the stimulation frequency and stimulation amplitude across three different locations at time t (At). The outputs of the model (color-coded as green) are the responses of HR and MAP to the VNS parameters.
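
The loop can be summarized schematically as below. The GP dynamics fit uses scikit-learn's GaussianProcessRegressor purely as a stand-in, the random action stands in for the current policy, and PILCO's core step of optimizing the policy against long-horizon, uncertainty-propagated predictions is only indicated by a comment; this is not the paper's implementation.

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    X, Y = [], []                                # (state, action) -> next state
    obs, _ = env.reset(seed=0)                   # env: a wrapped cardiac surrogate, as sketched earlier
    for query in range(5):
        for t in range(20):                      # N iterations per query
            action = env.action_space.sample()   # stands in for the current policy
            next_obs, reward, *_ = env.step(action)
            X.append(np.concatenate([obs, action]))
            Y.append(next_obs)
            obs = next_obs
        gp_model = GaussianProcessRegressor().fit(np.array(X), np.array(Y))
        # ...the policy parameters would be updated here using gp_model's long-horizon
        #    predictions with propagated uncertainty (PILCO's core step).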
Figure 5.
Comparison of the HR and MAP values predicted from the previously established biophysical model implemented in MATLAB versus the predictions from the reduced-order TCN model. The blue solid lines represent the HR and MAP values generated from the HC biophysical model implemented in MATLAB. The red dashed lines represent the corresponding values generated with the reduced-order TCN model.
Figure 6.
Reward values of the RL agents during the set-point tracking task in four cardiac environments: (a) normalized training reward values per episode for SAC during the training mode, (b) normalized training reward values per episode for PPO during the training mode, (c) reward values of PILCO for 20 randomly selected set points (mean ± standard deviation). The normalized rewards for the deep RL algorithms in (a) and (b) represent the mean ± standard deviation of the reward, calculated with a moving average with a window length of 50 to better represent the agents' performance over time.
Figure 7.
Performance comparison of the deep RL agents (PPO and SAC) during the inference mode and PILCO in four cardiac environments. The bars represent the average reward calculated over 20 randomly selected set points, and the error bars (shown as black lines) represent the standard deviation of the set-point tracking performance across the 20 randomly selected set points.
Figure 8.
The performance of the deep RL algorithms in the inference mode for the set-point tracking task across four cardiac environments using the (a) PPO and (b) SAC algorithms for 200 iterations. The red solid lines represent the desired set points and the blue lines represent the states of the four cardiac models (HR and MAP). The target set points were changed after 100 iterations, where each iteration corresponds to one cardiac cycle.
Figure 9.
The stimulation parameters used during the inference mode for the set-point tracking task across the four cardiac environments using the (a) PPO and (b) SAC algorithms for 200 iterations. The stimulation parameters were the amplitude and frequency across three VNS locations. The target set points were changed after 100 iterations, where each iteration corresponds to one cardiac cycle.
Figure 10.
The performance of PILCO in the set-point tracking task across the four cardiac environments (a) and the corresponding stimulation parameters (amplitude and frequency) across the three stimulation locations (b). In panel (a), the red lines represent the desired set points and the blue lines represent the states of the four cardiac models (HR and MAP).
Figure 11.
Adaptability of PILCO during the set-point tracking task to variations in the target set point (a)–(c) and to variations in the underlying dynamics of the environment (d)–(f): (a), (d) reward values; (b), (e) state trajectories; and (c), (f) stimulation parameters for 200 iterations, where the changes were applied after 100 iterations.
Figure 12.
Adaptability of the PPO and SAC algorithms to variations in the underlying dynamics of the environment using transfer learning: (a) comparison of the reward values of PPO and SAC with random initialization (RI) and with transfer learning, (b) performance of PPO and SAC in the set-point tracking task with the policy trained using the transfer learning approach.
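
One way to realize this transfer-learning setup, assuming a stable-baselines3-style workflow (the paper's implementation library is not stated here), is to pretrain a policy in a source environment and then continue training it in the target environment instead of starting from a random initialization (RI).

    from stable_baselines3 import PPO

    # source_env and target_env are assumed to be Gymnasium environments,
    # e.g. the healthy-control and hypertensive cardiac surrogates.
    model = PPO("MlpPolicy", source_env, verbose=0)
    model.learn(total_timesteps=50_000)                              # pretrain on the source dynamics

    model.set_env(target_env)                                        # switch to the new dynamics
    model.learn(total_timesteps=10_000, reset_num_timesteps=False)   # fine-tune instead of retraining from scratch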
Figure 13.
The effect of different levels of measurement noise on the performance of the RL agents. Each panel shows the reward values of the RL agents during the set-point tracking task in the HC model with a different level of noise added to the state space: (a) normalized training reward values per episode for SAC during the training mode, (b) normalized training reward values per episode for PPO during the training mode, (c) reward values of PILCO during the experiment. The normalized rewards for the deep RL algorithms in (a) and (b) represent the mean ± standard deviation of the reward, calculated with a moving average with a window length of 50 to better represent the agents' performance over time. The reward values of PILCO show the mean ± standard deviation of the reward over 5 runs.
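
A simple way to inject this kind of measurement noise is an observation wrapper that perturbs the state before the agent sees it. The Gaussian noise model and the wrapper below are assumptions for illustration; the paper's exact noise model is not reproduced here.

    import numpy as np
    import gymnasium as gym

    class NoisyObservation(gym.ObservationWrapper):
        """Adds zero-mean Gaussian noise of a chosen level to the observed (HR, MAP) state."""

        def __init__(self, env, noise_std, seed=None):
            super().__init__(env)
            self.noise_std = noise_std
            self.rng = np.random.default_rng(seed)

        def observation(self, obs):
            noise = self.rng.normal(0.0, self.noise_std, size=obs.shape)
            return (obs + noise).astype(obs.dtype)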
