Sample mean based index policies with O(log n) regret for the multi-armed bandit problem

1995 ◽  
Vol 27 (4) ◽  
pp. 1054-1078 ◽  
Author(s):  
Rajeev Agrawal

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins obtained an O(log n) lower bound on the regret with a constant that depends on the Kullback–Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieved the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are also applicable much more generally. They achieve an O(log n) regret with a constant that is also based on the Kullback–Leibler number. This constant turns out to be optimal for one-parameter exponential families; however, in general it is derived from the optimal one via a 'contraction' principle. Our results rely entirely on a few key lemmas from the theory of large deviations.
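As an illustration of the idea (not the paper's exact policy), an index rule that depends on each arm only through its sample mean can be sketched in the now-familiar UCB style; the particular index form, function names, and arm parameters below are illustrative assumptions:

```python
import math
import random

def sample_mean_index_policy(pull, n_arms, horizon):
    """Play each arm once, then always pull the arm maximizing the index
    sample_mean + sqrt(2*log(t)/plays) -- an upper-confidence index that
    uses the rewards of each arm only through their sample mean."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running sample mean
    return counts

# Usage: two Bernoulli arms with (hypothetical) success probabilities 0.3 and 0.7;
# over time the policy should concentrate its pulls on the better arm.
random.seed(0)
probs = [0.3, 0.7]
counts = sample_mean_index_policy(
    lambda a: 1.0 if random.random() < probs[a] else 0.0,
    n_arms=2, horizon=2000)
```

With this index the suboptimal arm is pulled only O(log n) times in expectation, which is how the logarithmic regret arises.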


Mathematics ◽  
2021 ◽  
Vol 9 (13) ◽  
pp. 1568
Author(s):  
Shaul K. Bar-Lev

Let F = {Fθ : θ ∈ Θ ⊂ ℝ} be a family of probability distributions indexed by a parameter θ, and let X1, ⋯, Xn be i.i.d. r.v.'s with L(X1) = Fθ ∈ F. Then F is said to be reproducible if, for all θ ∈ Θ and n ∈ ℕ, there exist a sequence (αn)n≥1 and a mapping gn : Θ → Θ, θ ⟼ gn(θ), such that L(αn(X1 + ⋯ + Xn)) = Fgn(θ) ∈ F. In this paper, we prove that a natural exponential family F is reproducible iff it possesses a variance function which is a power function of its mean. Such a result generalizes that of Bar-Lev and Enis (1986, The Annals of Statistics), who proved a similar but partial statement under the assumption that F is steep and under rather restrictive constraints on the forms of αn and gn(θ). We show that such restrictions are not required. In addition, we examine various aspects of reproducibility, both theoretically and practically, and discuss the relationship between reproducibility, convolution and infinite divisibility. We suggest new avenues for characterizing other classes of families of distributions with respect to their reproducibility and convolution properties.
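A concrete instance (my own example, not taken from the paper): the Poisson family is a natural exponential family whose variance function V(m) = m is a power of the mean, and it is reproducible with αn = 1 and gn(θ) = nθ, since a sum of n i.i.d. Poisson(θ) variables is Poisson(nθ). A short numerical check of this convolution identity:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def convolve(p, q, size):
    """PMF of the sum of two independent nonnegative integer r.v.'s,
    exact for all k < size (larger indices never contribute)."""
    return [sum(p[i] * q[k - i] for i in range(k + 1)) for k in range(size)]

theta, n, size = 0.7, 3, 25
pmf = [poisson_pmf(k, theta) for k in range(size)]

# n-fold convolution of Poisson(theta) with itself
s = pmf
for _ in range(n - 1):
    s = convolve(s, pmf, size)

# compare against Poisson(n * theta), i.e. the family member F_{g_n(theta)}
target = [poisson_pmf(k, n * theta) for k in range(size)]
max_err = max(abs(a - b) for a, b in zip(s, target))
```

The maximum discrepancy is at floating-point noise level, consistent with the sum staying inside the family with the reparametrization gn(θ) = nθ.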


1995 ◽  
Vol 32 (1) ◽  
pp. 168-182 ◽  
Author(s):  
K. D. Glazebrook ◽  
S. Greatrix

Nash (1980) demonstrated that index policies are optimal for a class of generalised bandit problems. A transform of the index concerned has many of the attributes of the Gittins index. The transformed index is positive-valued, with maximal values yielding optimal actions. It may be characterised as the value of a restart problem and is hence computable via dynamic programming methodologies. The transformed index can also be used in procedures for policy evaluation.
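The restart characterisation can be sketched numerically. In the Katehakis–Veinott formulation, the Gittins index of state i equals (1 − β) times the value at i of a problem in which, at every step, one either continues from the current state or restarts from i. The two-state chain, rewards, and discount factor below are illustrative assumptions:

```python
def restart_value(i, R, P, beta, iters=2000):
    """Value iteration for the restart-in-i problem: at each state choose
    either to continue from there or to restart from state i.
    The Gittins index of state i is then (1 - beta) * V(i)."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        # one-step continuation value from each state
        cont = [R[x] + beta * sum(P[x][y] * V[y] for y in range(n))
                for x in range(n)]
        # at every state, the better of continuing or restarting at i
        V = [max(cont[x], cont[i]) for x in range(n)]
    return V

# toy chain: state 0 pays reward 1, state 1 pays nothing
R = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.3, 0.7]]
beta = 0.9
idx = [(1 - beta) * restart_value(i, R, P, beta)[i] for i in range(2)]
```

Since the index is a solution of a discounted dynamic program, any standard method (value iteration as here, policy iteration, or linear programming) computes it; the rewarding state receives the larger index, as expected.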


2016 ◽  
Vol 4 ◽  
Author(s):  
TIM AUSTIN

Sofic entropy is an invariant for probability-preserving actions of sofic groups. It was introduced a few years ago by Lewis Bowen, and shown to extend the classical Kolmogorov–Sinai entropy from the setting of amenable groups. Some parts of Kolmogorov–Sinai entropy theory generalize to sofic entropy, but in other respects this new invariant behaves less regularly. This paper explores conditions under which sofic entropy is additive for Cartesian products of systems. It is always subadditive, but the reverse inequality can fail. We define a new entropy notion in terms of probability distributions on the spaces of good models of an action. Using this, we prove a general lower bound for the sofic entropy of a Cartesian product in terms of separate quantities for the two factor systems involved. We also prove that this lower bound is optimal in a certain sense, and use it to derive some sufficient conditions for the strict additivity of sofic entropy itself. Various other properties of this new entropy notion are also developed.


2008 ◽  
Vol 40 (2) ◽  
pp. 377-400 ◽  
Author(s):  
Savas Dayanik ◽  
Warren Powell ◽  
Kazutoshi Yamazaki

A multiarmed bandit problem is studied when the arms are not always available. The arms are first assumed to be intermittently available with some state/action-dependent probabilities. It is proven that no index policy can attain the maximum expected total discounted reward in every instance of that problem. The Whittle index policy is derived, and its properties are studied. Then it is assumed that the arms may break down, but repair is an option at some cost, and the new Whittle index policy is derived. Both problems are indexable. The proposed index policies cannot be dominated by any other index policy over all multiarmed bandit problems considered here. Whittle indices are evaluated for Bernoulli arms with unknown success probabilities.


