Sampling Based Approaches for Minimizing Regret in Uncertain Markov Decision Processes (MDPs)

2017, Vol. 59, pp. 229-264
Author(s): Asrar Ahmed, Pradeep Varakantham, Meghna Lowalekar, Yossiri Adulyasak, Patrick Jaillet

Markov Decision Processes (MDPs) are an effective model for representing decision problems in the presence of transition uncertainty and reward tradeoffs. However, because the transition and reward functions of an MDP are difficult to specify exactly, researchers have proposed uncertain MDP models and robustness objectives for solving them. Most approaches to computing robust policies have focused on maximin policies, which maximize the value in the worst case among all realisations of the uncertainty. Given the overly conservative nature of maximin policies, recent work has proposed minimax regret as an ideal alternative to the maximin objective for robust optimization. However, existing algorithms for minimax regret are restricted to models with uncertainty over rewards only, and they are limited in their scalability. We therefore provide a general model of uncertain MDPs that considers uncertainty over both transition and reward functions, and that allows the uncertainty to be dependent across different states and decision epochs. We provide a mixed integer linear program (MILP) formulation for minimizing regret given a set of samples of the transition and reward functions of the uncertain MDP. In addition, we introduce two myopic variants of regret, Cumulative Expected Myopic Regret (CEMR) and One Step Regret (OSR), that can be optimized in a scalable manner: we give a dynamic programming algorithm for CEMR and a policy iteration algorithm for OSR. Finally, to demonstrate the effectiveness of our approaches, we provide comparisons on two benchmark problems from the literature. We observe that optimizing the myopic variants of regret, OSR and CEMR, yields better policies than directly optimizing regret.
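
As a concrete illustration of the sample-based regret objective, the following sketch evaluates the worst-case regret of a fixed policy across sampled realizations of the transition and reward functions, then selects a minimax-regret policy by brute force. All names, the MDP layout, and the exhaustive enumeration are illustrative assumptions; the paper itself uses a MILP formulation rather than enumeration.

```python
import itertools
import numpy as np

# Sketch of the sample-based regret objective (not the paper's MILP):
# each sample is one realization (P, R) of the uncertain transition and
# reward functions, with P of shape (S, A, S) and R of shape (S, A).

def optimal_value(P, R, gamma=0.95, iters=500):
    """Optimal value of one sampled MDP via standard value iteration."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * P @ V).max(axis=1)
    return V

def policy_value(pi, P, R, gamma=0.95, iters=500):
    """Value of a deterministic policy pi (one action index per state)."""
    s = np.arange(P.shape[0])
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = R[s, pi] + gamma * P[s, pi] @ V
    return V

def max_regret(pi, samples, s0=0):
    """Worst-case regret of pi across the sampled (P, R) realizations."""
    return max(optimal_value(P, R)[s0] - policy_value(pi, P, R)[s0]
               for P, R in samples)

def minimax_regret_policy(samples, n_states, n_actions):
    """Brute-force minimax regret; feasible only for tiny MDPs."""
    candidates = itertools.product(range(n_actions), repeat=n_states)
    return min((np.array(pi) for pi in candidates),
               key=lambda pi: max_regret(pi, samples))
```

Here `samples` would be a list of `(P, R)` pairs drawn from the uncertainty distribution; the enumeration over deterministic policies is exponential in the number of states, which is precisely why a scalable formulation is needed.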

2017, Vol. 26 (03), pp. 1760014
Author(s): Paul Weng, Olivier Spanjaard

Markov decision processes (MDPs) have become one of the standard models for decision-theoretic planning problems under uncertainty. In their standard form, rewards are assumed to be numerical additive scalars. In this paper, we propose a generalization of this model that allows rewards to be functional: the value of a history is computed recursively by composing the reward functions. We show that several variants of MDPs presented in the literature can be instantiated in this setting. We then identify sufficient conditions on these reward functions for dynamic programming to be valid. We also discuss the infinite horizon case and the case where a maximum operator does not exist. To show the potential of our framework, we conclude the paper by presenting several illustrative examples.
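
The recursive value computation can be pictured with a small sketch of finite-horizon backward induction in which each state-action pair carries a reward function composed with the value-to-go. The wiring below (applying the function to the expected value-to-go) and the monotonicity remark are assumptions based on the abstract, not the paper's exact formulation.

```python
# Sketch: backward induction with functional rewards. Each (state, action)
# pair carries a function f(s, a, v) composed with the value-to-go v
# rather than added to it. With f(s, a, v) = r[s][a] + gamma * v this
# recovers the standard additive model; other choices (e.g. min, for
# bottleneck objectives) instantiate other MDP variants. Validity of the
# recursion requires, roughly, that f be monotone in v.

def functional_backward_induction(states, actions, P, f, horizon):
    """P[s][a] maps next states to probabilities; f composes the reward
    function of (s, a) with the expected value-to-go."""
    V = {s: 0.0 for s in states}                       # terminal values
    for _ in range(horizon):
        V = {s: max(f(s, a, sum(p * V[t] for t, p in P[s][a].items()))
                    for a in actions)
             for s in states}
    return V
```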


Author(s): Shun Zhang, Edmund H. Durfee, Satinder Singh

As an autonomous agent achieves a goal on behalf of its human user, its actions may have side effects that change features of the environment in ways that negatively surprise the user. An agent that can be trusted to operate safely should therefore only change features the user has explicitly permitted it to change. We formalize this problem and develop a planning algorithm that avoids potentially negative side effects given what the agent knows about (un)changeable features. Further, we formulate a provably minimax-regret querying strategy for the agent to selectively ask the user about features it has not explicitly been told about. We show empirically how much faster it is than a more exhaustive approach, and how much better its queries are than those found by the best known heuristic.
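
The querying objective can be illustrated with a deliberately brute-force sketch: choose the set of features to ask about that minimizes worst-case regret over the possible true sets of changeable features. The regret formalization and the `best_plan_value` stub below are hypothetical stand-ins; the paper's provably minimax-regret strategy is more involved than this enumeration.

```python
import itertools

# Brute-force illustration of the query-selection objective.
# `best_plan_value(changeable)` is a hypothetical stub standing in for a
# safe planner: the value of the best plan that only modifies features
# known to be changeable.

def worst_case_regret(query, unknown, known, best_plan_value):
    """Regret of asking about `query`: the agent learns only which queried
    features are changeable and plans with that partial knowledge, while
    the benchmark plans with the full truth. Worst case over truths."""
    regrets = []
    for r in range(len(unknown) + 1):
        for truth in itertools.combinations(unknown, r):
            learned = known | (set(truth) & set(query))
            full = known | set(truth)
            regrets.append(best_plan_value(full) - best_plan_value(learned))
    return max(regrets)

def best_query(unknown, known, k, best_plan_value):
    """Choose k features to ask about, minimizing worst-case regret."""
    return min(itertools.combinations(unknown, k),
               key=lambda q: worst_case_regret(q, unknown, known,
                                               best_plan_value))
```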


2001, Vol. 14, pp. 29-51
Author(s): N. L. Zhang, W. Zhang

Partially observable Markov decision processes (POMDPs) have recently become popular among many AI researchers because they serve as a natural model for planning under uncertainty. Value iteration is a well-known algorithm for finding optimal policies for POMDPs. It typically takes a large number of iterations to converge. This paper proposes a method for accelerating the convergence of value iteration. The method has been evaluated on an array of benchmark problems and was found to be very effective: It enabled value iteration to converge after only a few iterations on all the test problems.
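
For orientation, a minimal sketch of plain value iteration with a Bellman-residual stopping rule on a fully observable MDP is shown below. POMDP value iteration instead maintains sets of value vectors over the belief simplex, and the acceleration method the paper proposes is not reproduced here.

```python
import numpy as np

# Plain value iteration on a fully observable MDP, for orientation only.
# P: (S, A, S) transition probabilities, R: (S, A) rewards.

def value_iteration(P, R, gamma=0.95, eps=1e-6):
    V = np.zeros(P.shape[0])
    while True:
        V_new = (R + gamma * P @ V).max(axis=1)
        # Standard contraction bound: stopping when the Bellman residual
        # falls below eps*(1-gamma)/(2*gamma) guarantees the returned
        # values are within eps of the optimal value function.
        if np.max(np.abs(V_new - V)) < eps * (1 - gamma) / (2 * gamma):
            return V_new
        V = V_new
```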


1983, Vol. 20 (04), pp. 835-842
Author(s): David Assaf

The paper presents sufficient conditions for certain functions to be convex. Functions of this type often appear in Markov decision processes, where their maximum is the solution of the problem. Since a convex function attains its maximum at an extreme point, these conditions may greatly simplify a problem; in some cases a full solution may be obtained once the reduction is made. Some illustrative examples are discussed.
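
The fact being exploited can be stated for reference (a standard result in convex analysis, not specific to the paper):

```latex
% A continuous convex function on a compact convex set attains its
% maximum at an extreme point (Bauer's maximum principle):
\[
  f \text{ continuous and convex on a compact convex } C \subseteq \mathbb{R}^n
  \quad\Longrightarrow\quad
  \max_{x \in C} f(x) \;=\; \max_{x \in \operatorname{ext}(C)} f(x).
\]
% Hence, if an MDP's objective is convex in a parameter ranging over a
% polytope, the search reduces to the polytope's finitely many vertices.
```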

