Multi-Objective Markov Decision Processes for Data-Driven Decision Support

Daniel J Lizotte et al. J Mach Learn Res. 2016;17:211. Epub 2016 Dec 1.
Abstract

We present new methodology based on Multi-Objective Markov Decision Processes for developing sequential decision support systems from data. Our approach uses sequential decision-making data to provide support that is useful to many different decision-makers, each with different, potentially time-varying preferences. To accomplish this, we develop an extension of fitted-Q iteration for multiple objectives that computes policies for all scalarization functions, i.e., preference functions, simultaneously from continuous-state, finite-horizon data. We identify and address several conceptual and computational challenges along the way, and we introduce a new solution concept that is appropriate when different actions have similar expected outcomes. Finally, we demonstrate an application of our method using data from the Clinical Antipsychotic Trials of Intervention Effectiveness and show that our approach offers decision-makers increased choice by identifying a larger class of optimal policies.

Keywords: Markov decision processes; clinical decision support; evidence-based medicine; multi-objective optimization; reinforcement learning.
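
To make the scalarization idea concrete, the following minimal sketch (with made-up Q-values and a hypothetical function name, not the paper's implementation) shows how, for a single state with two basis rewards, one can recover every action that is optimal under some convex scalarization w·Q[1] + (1 − w)·Q[2] by sweeping the preference weight w over [0, 1].

```python
# Minimal illustrative sketch (not the paper's implementation): for a single state
# with two basis rewards, find every action that is optimal under some convex
# scalarization w * Q[1] + (1 - w) * Q[2]. All Q-values here are made up.

def actions_optimal_for_some_preference(q_vectors, n_weights=1001):
    """Indices of actions that maximize w*Q[1] + (1-w)*Q[2] for at least one w in [0, 1]."""
    winners = set()
    for i in range(n_weights):
        w = i / (n_weights - 1)                              # sweep preference weight over [0, 1]
        scores = [w * q1 + (1.0 - w) * q2 for q1, q2 in q_vectors]
        winners.add(max(range(len(scores)), key=scores.__getitem__))
    return sorted(winners)

# Hypothetical two-objective Q-values for four actions in one state.
print(actions_optimal_for_some_preference([(2.0, 8.0), (8.0, 2.0), (6.0, 6.0), (4.0, 4.0)]))
# -> [0, 1, 2]; the action with Q-vector (4.0, 4.0) is never optimal for any preference.
```

The paper's method computes such optimal-action sets for all scalarization functions and all states simultaneously; the finite grid over w here is only for illustration.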


Figures

Figure 1
Comparison of existing approaches to eliminating actions at time T. The problems illustrated here have analogs for t < T, where the picture is more complicated. In this simple example, we suppose the vector-valued expected rewards (QT[1](sT, a), QT[2](sT, a)) are (1, 9), (9, 1), (4.9, 4.9), (4.6, 4.6) for actions a1, a2, a3, a4, respectively. Figure 1(a): Using the method of Lizotte et al. (2010, 2012) based on convex combinations of rewards, actions a3 and a4 would be eliminated, and we would have ΠT(sT) = {a1, a2}. (Any action whose expected rewards fall in the shaded region would be eliminated.) However, we would prefer to at least include a3, since it offers a more “moderate” outcome that may be important to some decision-makers. Figure 1(b): Using the Pareto partial order, only action a4 is eliminated, and we have ΠT(sT) = {a1, a2, a3}. However, we may prefer to include a4 as well, since its performance is very close to that of a3, and a4 may be preferable for reasons we cannot infer from our data, e.g. cost or an allergy to a3.
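
The caption's numbers can be checked directly. The sketch below (illustrative only, not the paper's code) applies the standard Pareto domination test to the four Q-vectors above: only a4 is dominated (by a3), so the Pareto rule keeps {a1, a2, a3}, whereas a sweep over convex combinations of the two rewards, as in the sketch following the abstract, keeps only {a1, a2}, because (4.9, 4.9) lies below the segment joining (1, 9) and (9, 1).

```python
# Illustrative check of the Figure 1 numbers (not the paper's code).
Q = {"a1": (1.0, 9.0), "a2": (9.0, 1.0), "a3": (4.9, 4.9), "a4": (4.6, 4.6)}

def pareto_dominates(u, v):
    """u dominates v: at least as good on every basis reward and strictly better on at least one."""
    return all(ui >= vi for ui, vi in zip(u, v)) and any(ui > vi for ui, vi in zip(u, v))

kept = [a for a in Q if not any(pareto_dominates(Q[b], Q[a]) for b in Q if b != a)]
print(kept)  # ['a1', 'a2', 'a3'] -- only a4 is Pareto-dominated (by a3)
```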
Figure 2
Partial visualization of the members of an example 𝒬T−1. We fix a state sT−1 = (50.1, 48.6) in this example, and we plot QT−1(sT−1, aT−1) for each QT−1 ∈ 𝒬T−1 and for each of the five available actions aT−1 (each action has its own marker symbol in the plot). For example, the markers near the top of the plot, all for one particular action, correspond to the expected returns, one for each QT ∈ 𝒬T, achievable by taking that action at the current time point and then following a particular future policy. This example 𝒬T−1 contains 20 QT−1 functions, each assuming a different πT.
Figure 3
An NDP on a one-dimensional continuous state-space, a consistent policy, and a ϕ-consistent policy.
Figure 4
Comparison of rules for eliminating actions. In this simple example, we suppose the Q-vectors (QT[1](sT, a), QT[2](sT, a)) are (4.9, 4.9), (3, 5.2), (1.8, 5.6), (4.6, 4.6) for a1, a2, a3, a4, respectively, and suppose Δ1 = Δ2 = 0.5. Figure 4(a): Using the Practical Domination rule, action a4 is not eliminated by a3 because it is not much worse according to either basis reward, as judged by Δ1 and Δ2. Action a2 is eliminated because although it is slightly better than a1 according to basis reward 2, it is much worse according to basis reward 1. Similarly, a3 is eliminated by a2. Note the small solid rectangle to the left of a2: points in this region (including a3) are dominated by a2, but not by a1. This illustrates the non-transitivity of the Practical Domination relation, and in turn shows that it is not a partial order. Figure 4(b): Using Strong Practical Domination, which is a partial order, no actions are eliminated, and there are no regions of non-transitivity.
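
The eliminations described in panel (a) can be reproduced under one plausible reading of Practical Domination, assumed here only for illustration and not necessarily matching the paper's exact definition: action a eliminates action b when a is better than b by more than the corresponding Δ on some basis reward, while b is better than a by at most Δ on every basis reward. The sketch below uses the caption's Q-vectors and Δ1 = Δ2 = 0.5.

```python
# Illustrative sketch of an assumed Practical Domination rule (not taken verbatim from the paper).
Q = {"a1": (4.9, 4.9), "a2": (3.0, 5.2), "a3": (1.8, 5.6), "a4": (4.6, 4.6)}
DELTA = (0.5, 0.5)  # (Delta_1, Delta_2)

def practically_dominates(u, v, delta):
    """u is much better than v on some basis reward, and v is never much better than u."""
    much_better_somewhere = any(ui - vi > d for ui, vi, d in zip(u, v, delta))
    never_much_worse = all(vi - ui <= d for ui, vi, d in zip(u, v, delta))
    return much_better_somewhere and never_much_worse

eliminated = sorted({b for b in Q for a in Q
                     if a != b and practically_dominates(Q[a], Q[b], DELTA)})
print(eliminated)  # ['a2', 'a3']: a1 eliminates a2 and a2 eliminates a3, matching panel (a)

# Non-transitivity: a1 does not eliminate a3, because a3 is more than Delta_2
# better than a1 on basis reward 2 (5.6 - 4.9 > 0.5).
print(practically_dominates(Q["a1"], Q["a3"], DELTA))  # False
```

Under this reading a4 survives every pairwise comparison, consistent with panel (a); and a strengthened rule requiring a margin of more than Δ on every basis reward, which is one transitive relation consistent with panel (b), eliminates nothing in this example.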
Figure 5
NDP produced by taking the union over actions recommended by Lizotte et al. (2010, 2012).
Figure 6
NDP produced by Π with Pareto Domination.
Figure 7
CATIE NDP for Phase 1 made using Π; “warning” actions that would have been eliminated by Practical Domination but not by Strong Practical Domination have been removed.
Figure 8
NDP produced by Π with Strong Practical Domination.
