NPJ Digit Med. 2023 Feb 2;6(1):15. doi: 10.1038/s41746-023-00755-5.

A value-based deep reinforcement learning model with human expertise in optimal treatment of sepsis

XiaoDan Wu et al.

Abstract

Deep Reinforcement Learning (DRL) has increasingly been applied to assist clinicians in the real-time treatment of sepsis. A value function quantifies the performance of policies in such decision-making processes, yet most value-based DRL algorithms cannot estimate the target value function precisely and are not as safe as clinical experts. In this study, we propose a Weighted Dueling Double Deep Q-Network with embedded human Expertise (WD3QNE). A target Q value function with an adaptive dynamic weight is designed to improve estimation accuracy, and human expertise in decision-making is leveraged. In addition, the random forest algorithm is employed for feature selection to improve model interpretability. We test our algorithm against state-of-the-art value function methods in terms of expected return, survival rate, action distribution and external validation. The results demonstrate that WD3QNE obtains the highest survival rate of 97.81% on the MIMIC-III dataset. Our proposed method is capable of providing reliable treatment decisions with embedded clinician expertise.
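
To make the weighted target concrete, here is a minimal sketch in the style of weighted double Q-learning, blending the overestimation-prone max target with the underestimation-prone Double DQN target through a weight w; the exact form of WD3QNE's target, and how its weight adapts dynamically, are our illustrative assumptions rather than the paper's verbatim formula:

    y_t = r_t + \gamma \Big[ w \max_{a'} Q(s_{t+1}, a'; \theta^{-}) + (1 - w)\, Q\big(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta); \theta^{-}\big) \Big]

With w = 1 this reduces to the standard DQN target, and with w = 0 to the Double DQN target, so an adaptive weight lets the estimator trade off the two biases.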


Conflict of interest statement

The authors declare no competing interests.

Figures

Fig. 1. Architecture of the WD3QNE algorithm.
a The dynamic treatment process of the WD3QNE agent for sepsis: a continuous state space and a discrete action space are constructed, and the DRL agent takes actions based on the current state and clinician expertise. b Structure of the WD3QNE algorithm.
Fig. 2. Expected return of different algorithms at each learning epoch.
The value-based DRL algorithms are run for 100 epochs on the validation set with feature selection (37 observation features) and without feature selection (45 observation features); the suffix 37 denotes the 37 observation features selected with the random forest algorithm, and 45 denotes the full set of 45 observation features. Although the DQN algorithm converges quickly at first, it exhibits premature convergence.
Fig. 3. Action distribution for the test set.
a Action distribution of the human clinician policy. b Action distribution of the D3QN policy with 37 observation features. c Action distribution of the WD3QNE policy with 37 observation features. We aggregate all actions selected over all timesteps into the five dose bins for each medication: 0 denotes no drug given, and the action space is discretized into per-drug quartiles. Action counts are the number of times each dose bin was used. The human clinician policy tends to use low doses of vasopressors, the pure AI policy (D3QN) tends to use high doses, and the policy with embedded human expertise (WD3QNE) uses vasopressor doses between the two.
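
To make this binning concrete, the sketch below maps each drug's doses to five bins (bin 0 for no drug, bins 1-4 for quartiles of the nonzero doses) and forms the 25 joint actions. This is our illustration, not the paper's code; the dose values, units and use of pandas qcut are hypothetical.

    import numpy as np
    import pandas as pd

    def discretize_dose(doses):
        """Map raw doses to 5 bins: 0 = no drug, 1-4 = quartiles of the nonzero doses."""
        doses = np.asarray(doses, dtype=float)
        bins = np.zeros(len(doses), dtype=int)
        nonzero = doses > 0
        if nonzero.any():
            # qcut labels the quartiles 0-3; shift to 1-4 so bin 0 stays "no drug"
            bins[nonzero] = pd.qcut(doses[nonzero], 4, labels=False, duplicates="drop") + 1
        return bins

    # Joint action over the two medications: 5 x 5 = 25 discrete actions
    iv_bins = discretize_dose([0.0, 30.0, 120.0, 500.0, 80.0])  # IV fluid doses
    vaso_bins = discretize_dose([0.0, 0.05, 0.2, 0.6, 0.1])     # vasopressor doses
    action_ids = iv_bins * 5 + vaso_bins
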
Fig. 4. Performance results of different binning intervals at each learning epoch.
Patient trajectories are discretized into binning intervals of 1 h, 2 h, 4 h, 6 h and 8 h. a The loss value for different binning intervals during training. b The expected return for different binning intervals during testing.
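
As a minimal sketch of this temporal discretization, assuming irregularly timestamped vitals and mean aggregation within each bin (the paper's aggregation rules may differ), a pandas resample produces one state per interval:

    import pandas as pd

    # Hypothetical irregularly sampled vitals for one ICU stay
    vitals = pd.DataFrame(
        {"heart_rate": [88, 92, 95, 90, 85],
         "mean_arterial_pressure": [65, 63, 70, 72, 68]},
        index=pd.to_datetime(["2023-01-01 00:10", "2023-01-01 01:40",
                              "2023-01-01 03:55", "2023-01-01 06:20",
                              "2023-01-01 07:50"]),
    )

    # One state per 4-hour bin; measurements falling in the same bin are averaged
    states_4h = vitals.resample("4h").mean()
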
Fig. 5. Bellman error as a function of epochs.
Visualization of the Bellman error evolution: WD3QN is shown in red, Dueling DQN in blue and D3QN in orange.
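
For reference, the Bellman error tracked in this figure is, in the usual value-based DRL sense, the mean squared temporal-difference residual; we state it generically for a DQN-style target, with each algorithm substituting its own target (e.g. the weighted one sketched above):

    \mathcal{L}(\theta) = \mathbb{E}\Big[ \big( r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}) - Q(s_t, a_t; \theta) \big)^{2} \Big]
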
Fig. 6. Feature importance score.
We calculate classification accuracy, with death as the label, for different numbers of features; the 37 features (variables) that yield the highest accuracy are displayed. A glossary of the vital signs and lab values is provided in Table 4.
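
A minimal sketch of this kind of random-forest feature ranking with scikit-learn, using synthetic stand-ins for the feature matrix and the death label (the paper's data pipeline and hyperparameters are not reproduced here):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 45))    # stand-in for the 45 observation features
    y = rng.integers(0, 2, size=1000)  # stand-in for the in-hospital death label

    rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

    # Rank features by impurity-based importance and keep the 37 best
    ranking = np.argsort(rf.feature_importances_)[::-1]
    top37_idx = ranking[:37]
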
Fig. 7. The relationship between expected return and survival rate.
a The relationship between expected return and survival rate for 45 observation features. b The relationship between expected return and survival rate for 37 observation features. The relationship is learned from observational data and the actions taken by actual clinicians in the MIMIC-III dataset.
