A Fast-Pivoting Algorithm for Whittle’s Restless Bandit Index

Mathematics, 2020, Vol. 8(12), p. 2226
Author(s): José Niño-Mora

The Whittle index for restless bandits (two-action semi-Markov decision processes) provides an intuitively appealing optimal policy for controlling a single generic project that can be active (engaged) or passive (rested) at each decision epoch, and which can change state while passive. It further provides a practical heuristic priority-index policy for the computationally intractable multi-armed restless bandit problem, which has been widely applied over the last three decades in multifarious settings, yet mostly restricted to project models with a one-dimensional state. This is due in part to the difficulty of establishing indexability (existence of the index) and of computing the index for projects with large state spaces. This paper draws on the author’s prior results on sufficient indexability conditions and an adaptive-greedy algorithmic scheme for restless bandits to obtain a new fast-pivoting algorithm that computes the n Whittle index values of an n-state restless bandit by performing, after an initialization stage, n steps that entail (2/3)n^3 + O(n^2) arithmetic operations. The algorithm also draws on the parametric simplex method: it elucidates the pattern of parametric simplex tableaux, which makes it possible to exploit special structure to substantially simplify the simplex pivoting steps and reduce their complexity. A numerical study demonstrates substantial runtime speed-ups versus alternative algorithms.
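The fast-pivoting algorithm itself rests on parametric simplex tableaux and attains the (2/3)n^3 + O(n^2) operation count. As a much simpler (and far slower) illustration of the quantity being computed, the Whittle index of each state of an indexable project can be approximated by bisection on the passivity subsidy, solving the subsidy-parametrized dynamic program at each trial value. The sketch below is an assumption-laden stand-in, not the paper's algorithm: the function name `whittle_indices`, the discounted criterion, the subsidy bracket, and value iteration as the inner solver are all illustrative choices, and the bisection is valid only under indexability.

```python
import numpy as np

def whittle_indices(r_active, P_active, P_passive, beta=0.95,
                    lam_lo=-10.0, lam_hi=10.0, tol=1e-4):
    """Approximate Whittle indices of an indexable discounted restless
    bandit by bisection on the passivity subsidy lam (illustrative only).
    Passive action earns the subsidy lam; active earns r_active[s]."""
    n = len(r_active)

    def action_values(lam):
        # Solve the lam-parametrized dynamic program by value iteration.
        V = np.zeros(n)
        for _ in range(2000):
            Qa = r_active + beta * P_active @ V   # value of being active
            Qp = lam + beta * P_passive @ V       # value of being passive
            Vn = np.maximum(Qa, Qp)
            if np.max(np.abs(Vn - V)) < 1e-10:
                break
            V = Vn
        return Qa, Qp

    idx = np.empty(n)
    for s in range(n):
        lo, hi = lam_lo, lam_hi
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            Qa, Qp = action_values(mid)
            if Qa[s] > Qp[s]:      # active still preferred: subsidy too small
                lo = mid
            else:
                hi = mid
        idx[s] = 0.5 * (lo + hi)   # indifference subsidy = Whittle index
    return idx
```

In the degenerate sanity check where both actions freeze the state, the index of state s reduces to its active reward r(s), which makes the sketch easy to verify by hand.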

2015, Vol. 30(1), pp. 1-23
Author(s): Sofía S. Villar

Motivated by a class of Partially Observable Markov Decision Processes with application to surveillance systems, in which a set of imperfectly observed state processes is to be inferred from a subset of available observations through a Bayesian approach, we formulate and analyze a special family of multi-armed restless bandit problems. We consider the problem of finding an optimal policy for observing the processes that maximizes the total expected net reward over an infinite time horizon subject to resource availability. From the Lagrangian relaxation of the original problem, an index policy can be derived, as long as the existence of the Whittle index is ensured. We demonstrate that this class of reinitializing bandits, in which a project's state deteriorates while active and resets to its initial state when made passive before completion, possesses the structural property of indexability, and we further show how to compute the index in closed form. In general, the Whittle index rule for restless bandit problems does not achieve optimality. However, we show that the proposed Whittle index rule is optimal for the problem under study in the case of stochastically heterogeneous arms under the expected total criterion, and that it is recovered by a simple tractable rule referred to as the 1-limited Round Robin rule. Moreover, we illustrate the significant suboptimality of another widely used heuristic, the myopic index rule, by computing its suboptimality gap in closed form. We present numerical studies which illustrate, for more general instances, the performance advantages of the Whittle index rule over other simple heuristics.
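Once the indices are available in closed form, both rules discussed above are simple to state: the Whittle index rule activates the arms with the highest current index, and the 1-limited Round Robin rule serves one arm per epoch in cyclic order. A minimal sketch, assuming a generic top-m activation constraint and hypothetical function names (not the paper's notation):

```python
from itertools import cycle

def whittle_rule(indices, m):
    """indices[k] = current Whittle index of arm k; activate the top m."""
    ranked = sorted(range(len(indices)), key=lambda k: indices[k], reverse=True)
    return sorted(ranked[:m])          # arm ids to make active this epoch

def round_robin_1(n_arms):
    """1-limited Round Robin: serve exactly one arm per epoch, cyclically."""
    return cycle(range(n_arms))
```

The coincidence result in the abstract says that, for the class studied, the schedule produced by `whittle_rule` with m = 1 reduces to the cyclic order produced by `round_robin_1`.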


Mathematics, 2020, Vol. 9(1), p. 52
Author(s): José Niño-Mora

We consider the multi-armed bandit problem with penalties for switching, including setup delays and costs, extending the author's earlier results for the special case with no switching delays. A priority index for projects with setup delays that partly characterizes optimal policies was introduced by Asawa and Teneketzis in 1996, yet without a means of computing it. We present a fast two-stage index-computing method: the first stage computes the continuation index (which applies when the project is already set up) together with certain auxiliary quantities at cubic arithmetic-operation complexity in the number of project states; the second stage then computes the switching index (which applies when the project is not set up) at quadratic complexity. The approach rests on new methodological advances in restless bandit indexation, introduced and deployed herein and motivated by the limitations of previous results, and exploits the fact that the aforementioned index is the Whittle index of the project in its restless reformulation. A numerical study demonstrates substantial runtime speed-ups of the new two-stage index algorithm versus a general one-stage Whittle index algorithm. The study further gives evidence that, in a multi-project setting, the index policy is consistently nearly optimal.


2002, Vol. 34(4), pp. 754-774
Author(s): K. D. Glazebrook, J. Niño-Mora, P. S. Ansell

The paper concerns a class of discounted restless bandit problems which possess an indexability property. Conservation laws yield an expression for the reward suboptimality of a general policy. These results are utilised to study the closeness to optimality of an index policy for a special class of simple and natural dual speed restless bandits for which indexability is guaranteed. The strong performance of the index policy is confirmed by a computational study.


Author(s): Jie Bai, Andreas Fügener, Jochen Gönsch, Jens O. Brunner, Manfred Blobner

The intensive care unit (ICU) is one of the most crucial and expensive resources in a health care system. While high fixed costs usually lead to tight capacities, shortages have severe consequences. Thus, various challenging issues arise: When should an ICU admit or reject arriving patients in general? Should ICUs always be able to admit critical patients, or rather focus on high utilization? On an operational level, both admission control of arriving patients and demand-driven early discharge of currently residing patients are decision variables and should be considered simultaneously. This paper discusses the trade-off between medical and monetary goals when managing intensive care units by modeling the problem as a Markov decision process. An intuitive, myopic rule mimicking decision-making in practice is applied as a benchmark. In a numerical study based on real-world data, we demonstrate that the medical results deteriorate dramatically when focusing on monetary goals only, and vice versa. Using our model, we illustrate the trade-off along an efficiency frontier that accounts for all combinations of medical and monetary goals. Starting from a solution that optimizes monetary costs alone, a significant reduction of expected mortality can be achieved at little additional monetary cost.
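As a stylized illustration of the weighted trade-off (a toy, not the paper's model, which is calibrated on real-world data), consider an admission-control MDP whose state is the number of occupied beds and whose single-period cost mixes a monetary component (bed-days) and a mortality component (rejected arrivals) with a weight w; sweeping w between 0 and 1 traces an efficiency frontier of the kind described above. All parameters and the dynamics below are illustrative assumptions.

```python
import math

def solve_icu(C=4, p=0.7, q=0.3, bed_cost=1.0, mort_pen=5.0,
              w=0.5, beta=0.95, iters=2000):
    """Toy ICU admission MDP: state s = occupied beds. Each epoch every
    occupied bed empties w.p. q and a patient arrives w.p. p; the action
    is whether to admit an arrival. Minimizes discounted cost
    w * (monetary bed-days) + (1 - w) * (mortality from rejections)."""
    def pmf(s, k):                     # P(k of s occupied beds discharge)
        return math.comb(s, k) * q**k * (1 - q)**(s - k)

    V = [0.0] * (C + 1)
    policy = ['reject'] * (C + 1)
    for _ in range(iters):
        Vn = [0.0] * (C + 1)
        for s in range(C + 1):
            ev_stay = sum(pmf(s, k) * V[s - k] for k in range(s + 1))
            hold = w * bed_cost * s                       # monetary cost
            reject = hold + p * (1 - w) * mort_pen + beta * ev_stay
            best, act = reject, 'reject'
            if s < C:                                     # a bed is free
                ev_admit = sum(pmf(s, k) * V[s - k + 1] for k in range(s + 1))
                admit = hold + beta * (p * ev_admit + (1 - p) * ev_stay)
                if admit < best:
                    best, act = admit, 'admit'
            Vn[s], policy[s] = best, act
        V = Vn
    return V, policy
```

In this toy model the extremes behave as the abstract describes: at w = 1 (monetary only) the optimal policy rejects every arrival, while at w = 0 (mortality only) it admits whenever a bed is free; intermediate weights trace the frontier between them.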


Author(s): Carlos Diuk, Michael Littman

Reinforcement learning (RL) deals with the problem of an agent that has to learn how to behave to maximize its utility through its interactions with an environment (Sutton & Barto, 1998; Kaelbling, Littman & Moore, 1996). Reinforcement learning problems are usually formalized as Markov Decision Processes (MDPs), which consist of a finite set of states and a finite number of possible actions that the agent can perform. At any given point in time, the agent is in a certain state and picks an action. It can then observe the new state this action leads to, and receives a reward signal. The goal of the agent is to maximize its long-term reward. In this standard formalization, no particular structure or relationship between states is assumed. However, learning in environments with extremely large state spaces is infeasible without some form of generalization. Exploiting the underlying structure of a problem can aid generalization, and has long been recognized as an important aspect of representing sequential decision tasks (Boutilier et al., 1999). Hierarchical Reinforcement Learning is the subfield of RL that deals with the discovery and/or exploitation of this underlying structure. Two main ideas come into play in hierarchical RL. The first is to break a task into a hierarchy of smaller subtasks, each of which can be learned faster and more easily than the whole problem. Subtasks can also be performed multiple times in the course of achieving the larger task, reusing accumulated knowledge and skills. The second is to use state abstraction within subtasks: not every subtask needs to be concerned with every aspect of the state space, so some states can be abstracted away and treated as the same for the purposes of a given subtask.
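The agent-environment loop described above can be sketched with tabular Q-learning, one standard RL algorithm for exactly this finite-MDP setting. The `step(s, a) -> (s', r, done)` interface and the hyperparameters are illustrative assumptions, not part of the formalism itself:

```python
import random

def q_learning(n_states, n_actions, step, episodes=2000,
               alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning on an episodic MDP exposed via step(s, a)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False                 # every episode starts in state 0
        while not done:
            # epsilon-greedy action selection
            a = (random.randrange(n_actions) if random.random() < eps
                 else max(range(n_actions), key=lambda a: Q[s][a]))
            s2, r, done = step(s, a)
            # one-step temporal-difference update toward the bootstrapped target
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```

On a two-step chain where moving right twice earns a terminal reward, the learned Q-values order the actions correctly at every state, which is all the greedy policy needs.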


1997, Vol. 29(1), pp. 114-137
Author(s): Linn I. Sennott

This paper studies the expected average cost control problem for discrete-time Markov decision processes with denumerably infinite state spaces. A sequence of finite state space truncations is defined such that the average costs and average optimal policies in the sequence converge to the optimal average cost and an optimal policy in the original process. The theory is illustrated with several examples from the control of discrete-time queueing systems. Numerical results are discussed.
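The truncation idea can be illustrated on an uncontrolled special case: a discrete-time single-server queue with holding cost equal to the queue length, whose average cost is estimated by relative value iteration on an N-state truncation; as N grows, the estimates stabilize, mirroring the convergence result above. The queue parameters and the blocked-arrivals rule at the truncation boundary are illustrative assumptions.

```python
def avg_cost_truncated(N, p=0.3, q=0.5, max_iters=100000, tol=1e-12):
    """Average holding cost of a stable discrete-time single-server queue
    (arrival prob p, service prob q, p < q), truncated at queue length N
    with arrivals blocked when full, via relative value iteration."""
    h = [0.0] * (N + 1)   # relative values (bias), normalized at state 0
    g = 0.0
    for _ in range(max_iters):
        Th = [0.0] * (N + 1)
        for s in range(N + 1):
            up = p if s == 0 else (0.0 if s == N else p * (1 - q))
            down = 0.0 if s == 0 else (q if s == N else q * (1 - p))
            stay = 1.0 - up - down
            nxt = (up * h[min(s + 1, N)] + down * h[max(s - 1, 0)]
                   + stay * h[s])
            Th[s] = s + nxt            # one-period cost s plus future bias
        g = Th[0]                      # current average-cost estimate
        h_new = [x - g for x in Th]    # renormalize relative values
        if max(abs(a - b) for a, b in zip(h_new, h)) < tol:
            break
        h = h_new
    return g
```

Comparing estimates across truncation levels (e.g. N = 20 versus N = 40) shows they agree to well within the tail probability of the untruncated queue, which is the behavior the convergence theory leads one to expect.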

