Multi-Objective Markov Decision Processes for Data-Driven Decision Support

Daniel J Lizotte et al. J Mach Learn Res. 2016;17:211. Epub 2016 Dec 1.
Abstract

We present new methodology based on Multi-Objective Markov Decision Processes for developing sequential decision support systems from data. Our approach uses sequential decision-making data to provide support that is useful to many different decision-makers, each with different, potentially time-varying preferences. To accomplish this, we develop an extension of fitted-Q iteration for multiple objectives that computes policies for all scalarization functions, i.e., preference functions, simultaneously from continuous-state, finite-horizon data. We identify and address several conceptual and computational challenges along the way, and we introduce a new solution concept that is appropriate when different actions have similar expected outcomes. Finally, we demonstrate an application of our method using data from the Clinical Antipsychotic Trials of Intervention Effectiveness and show that our approach offers decision-makers increased choice by identifying a larger class of optimal policies.

Keywords: Markov decision processes; clinical decision support; evidence-based medicine; multi-objective optimization; reinforcement learning.
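
To make the scalarization idea concrete, the following minimal sketch (with made-up Q-values and a hypothetical function name, not the paper's implementation) shows how, for a single state with two basis rewards, one can recover every action that is optimal under some convex scalarization w·Q[1] + (1 − w)·Q[2] by sweeping the preference weight w over [0, 1].

```python
# Minimal illustrative sketch (not the paper's implementation): for a single state
# with two basis rewards, find every action that is optimal under some convex
# scalarization w * Q[1] + (1 - w) * Q[2]. All Q-values here are made up.

def actions_optimal_for_some_preference(q_vectors, n_weights=1001):
    """Indices of actions that maximize w*Q[1] + (1-w)*Q[2] for at least one w in [0, 1]."""
    winners = set()
    for i in range(n_weights):
        w = i / (n_weights - 1)                              # sweep preference weight over [0, 1]
        scores = [w * q1 + (1.0 - w) * q2 for q1, q2 in q_vectors]
        winners.add(max(range(len(scores)), key=scores.__getitem__))
    return sorted(winners)

# Hypothetical two-objective Q-values for four actions in one state.
print(actions_optimal_for_some_preference([(2.0, 8.0), (8.0, 2.0), (6.0, 6.0), (4.0, 4.0)]))
# -> [0, 1, 2]; the action with Q-vector (4.0, 4.0) is never optimal for any preference.
```

The paper's method computes such optimal-action sets for all scalarization functions and all states simultaneously; the finite grid over w here is only for illustration.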


Figures

Figure 1
Comparison of existing approaches to eliminating actions at time T. The problems illustrated here have analogs for t < T, where the picture is more complicated. In this simple example, we suppose the vector-valued expected rewards (QT[1](sT, a), QT[2](sT, a)) are (1, 9), (9, 1), (4.9, 4.9), (4.6, 4.6) for actions a1, a2, a3, a4, respectively. Figure 1(a): Using the method of Lizotte et al. (2010, 2012) based on convex combinations of rewards, actions a3 and a4 would be eliminated, and we would have ΠT(sT) = {a1, a2}. (Any action whose expected rewards fall in the shaded region would be eliminated.) However, we would prefer to at least include a3, since it offers a more “moderate” outcome that may be important to some decision-makers. Figure 1(b): Using the Pareto partial order, only action a4 is eliminated, and we have ΠT(sT) = {a1, a2, a3}. However, we may prefer to include a4 as well, since its performance is very close to that of a3, and a4 may be preferable for reasons we cannot infer from our data, e.g. cost or an allergy to a3.
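
The caption's numbers can be checked directly. The sketch below (illustrative only, not the paper's code) applies the standard Pareto domination test to the four Q-vectors above: only a4 is dominated (by a3), so the Pareto rule keeps {a1, a2, a3}, whereas a sweep over convex combinations of the two rewards, as in the sketch following the abstract, keeps only {a1, a2}, because (4.9, 4.9) lies below the segment joining (1, 9) and (9, 1).

```python
# Illustrative check of the Figure 1 numbers (not the paper's code).
Q = {"a1": (1.0, 9.0), "a2": (9.0, 1.0), "a3": (4.9, 4.9), "a4": (4.6, 4.6)}

def pareto_dominates(u, v):
    """u dominates v: at least as good on every basis reward and strictly better on at least one."""
    return all(ui >= vi for ui, vi in zip(u, v)) and any(ui > vi for ui, vi in zip(u, v))

kept = [a for a in Q if not any(pareto_dominates(Q[b], Q[a]) for b in Q if b != a)]
print(kept)  # ['a1', 'a2', 'a3'] -- only a4 is Pareto-dominated (by a3)
```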
Figure 2
Partial visualization of the members of an example 𝒬T−1. We fix a state sT−1 = (50.1, 48.6) in this example, and we plot QT−1(sT−1, aT−1) for each QT−1 ∈ 𝒬T−1 and for each of the five available actions aT−1 (each action has its own marker symbol in the plot). For example, the markers near the top of the plot, all for one particular action, correspond to the expected returns, one for each QT ∈ 𝒬T, achievable by taking that action at the current time point and then following a particular future policy. This example 𝒬T−1 contains 20 QT−1 functions, each assuming a different πT.
Figure 3
An NDP on a one-dimensional continuous state-space, a consistent policy, and a ϕ-consistent policy.
Figure 4
Comparison of rules for eliminating actions. In this simple example, we suppose the Q-vectors (QT[1](sT, a), QT[2](sT, a)) are (4.9, 4.9), (3, 5.2), (1.8, 5.6), (4.6, 4.6) for a1, a2, a3, a4, respectively, and suppose Δ1 = Δ2 = 0.5. Figure 4(a): Using the Practical Domination rule, action a4 is not eliminated by a3 because it is not much worse according to either basis reward, as judged by Δ1 and Δ2. Action a2 is eliminated because although it is slightly better than a1 according to basis reward 2, it is much worse according to basis reward 1. Similarly, a3 is eliminated by a2. Note the small solid rectangle to the left of a2: points in this region (including a3) are dominated by a2, but not by a1. This illustrates the non-transitivity of the Practical Domination relation, and in turn shows that it is not a partial order. Figure 4(b): Using Strong Practical Domination, which is a partial order, no actions are eliminated, and there are no regions of non-transitivity.
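
The eliminations described in panel (a) can be reproduced under one plausible reading of Practical Domination, assumed here only for illustration and not necessarily matching the paper's exact definition: action a eliminates action b when a is better than b by more than the corresponding Δ on some basis reward, while b is better than a by at most Δ on every basis reward. The sketch below uses the caption's Q-vectors and Δ1 = Δ2 = 0.5.

```python
# Illustrative sketch of an assumed Practical Domination rule (not taken verbatim from the paper).
Q = {"a1": (4.9, 4.9), "a2": (3.0, 5.2), "a3": (1.8, 5.6), "a4": (4.6, 4.6)}
DELTA = (0.5, 0.5)  # (Delta_1, Delta_2)

def practically_dominates(u, v, delta):
    """u is much better than v on some basis reward, and v is never much better than u."""
    much_better_somewhere = any(ui - vi > d for ui, vi, d in zip(u, v, delta))
    never_much_worse = all(vi - ui <= d for ui, vi, d in zip(u, v, delta))
    return much_better_somewhere and never_much_worse

eliminated = sorted({b for b in Q for a in Q
                     if a != b and practically_dominates(Q[a], Q[b], DELTA)})
print(eliminated)  # ['a2', 'a3']: a1 eliminates a2 and a2 eliminates a3, matching panel (a)

# Non-transitivity: a1 does not eliminate a3, because a3 is more than Delta_2
# better than a1 on basis reward 2 (5.6 - 4.9 > 0.5).
print(practically_dominates(Q["a1"], Q["a3"], DELTA))  # False
```

Under this reading a4 survives every pairwise comparison, consistent with panel (a); and a strengthened rule requiring a margin of more than Δ on every basis reward, which is one transitive relation consistent with panel (b), eliminates nothing in this example.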
Figure 5
NDP produced by taking the union over actions recommended by Lizotte et al. (2010, 2012).
Figure 6
NDP produced by Π with Pareto Domination.
Figure 7
CATIE NDP for Phase 1 made using Π; “warning” actions that would have been eliminated by Practical Domination but not by Strong Practical Domination have been removed.
Figure 8
NDP produced by Π with Strong Practical Domination.
