Sample mean based index policies with O(log n) regret for the multi-armed bandit problem

1995 ◽  
Vol 27 (4) ◽  
pp. 1054-1078 ◽  
Author(s):  
Rajeev Agrawal

We consider a non-Bayesian infinite horizon version of the multi-armed bandit problem with the objective of designing simple policies whose regret increases slowly with time. In their seminal work on this problem, Lai and Robbins obtained an O(log n) lower bound on the regret with a constant that depends on the Kullback–Leibler number. They also constructed policies for some specific families of probability distributions (including exponential families) that achieved the lower bound. In this paper we construct index policies that depend on the rewards from each arm only through their sample mean. These policies are computationally much simpler and are also applicable much more generally. They achieve an O(log n) regret with a constant that is also based on the Kullback–Leibler number. This constant turns out to be optimal for one-parameter exponential families; however, in general it is derived from the optimal one via a 'contraction' principle. Our results rely entirely on a few key lemmas from the theory of large deviations.
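As an illustration of the idea (not the paper's exact policy), an index rule that depends on each arm only through its sample mean can be sketched in the now-familiar UCB style; the particular index form, function names, and arm parameters below are illustrative assumptions:

```python
import math
import random

def sample_mean_index_policy(pull, n_arms, horizon):
    """Play each arm once, then always pull the arm maximizing the index
    sample_mean + sqrt(2*log(t)/plays) -- an upper-confidence index that
    uses the rewards of each arm only through their sample mean."""
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(horizon):
        if t < n_arms:
            arm = t  # initialization: pull every arm once
        else:
            arm = max(range(n_arms),
                      key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
        reward = pull(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running sample mean
    return counts

# Usage: two Bernoulli arms with (hypothetical) success probabilities 0.3 and 0.7;
# over time the policy should concentrate its pulls on the better arm.
random.seed(0)
probs = [0.3, 0.7]
counts = sample_mean_index_policy(
    lambda a: 1.0 if random.random() < probs[a] else 0.0,
    n_arms=2, horizon=2000)
```

With this index the suboptimal arm is pulled only O(log n) times in expectation, which is how the logarithmic regret arises.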


Mathematics ◽  
2021 ◽  
Vol 9 (13) ◽  
pp. 1568
Author(s):  
Shaul K. Bar-Lev

Let F = {Fθ : θ ∈ Θ ⊂ ℝ} be a family of probability distributions indexed by a parameter θ, and let X1, ⋯, Xn be i.i.d. r.v.'s with L(X1) = Fθ ∈ F. Then F is said to be reproducible if, for all θ ∈ Θ and n ∈ ℕ, there exist a sequence (αn)n≥1 and a mapping gn : Θ → Θ, θ ⟼ gn(θ), such that L(αn(X1 + ⋯ + Xn)) = Fgn(θ) ∈ F. In this paper, we prove that a natural exponential family F is reproducible iff it possesses a variance function which is a power function of its mean. Such a result generalizes that of Bar-Lev and Enis (1986, The Annals of Statistics), who proved a similar but partial statement under the assumption that F is steep and under rather restrictive constraints on the forms of αn and gn(θ). We show that such restrictions are not required. In addition, we examine various aspects of reproducibility, both theoretically and practically, and discuss the relationship between reproducibility, convolution and infinite divisibility. We suggest new avenues for characterizing other classes of families of distributions with respect to their reproducibility and convolution properties.
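A concrete instance (my own example, not taken from the paper): the Poisson family is a natural exponential family whose variance function V(m) = m is a power of the mean, and it is reproducible with αn = 1 and gn(θ) = nθ, since a sum of n i.i.d. Poisson(θ) variables is Poisson(nθ). A short numerical check of this convolution identity:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def convolve(p, q, size):
    """PMF of the sum of two independent nonnegative integer r.v.'s,
    exact for all k < size (larger indices never contribute)."""
    return [sum(p[i] * q[k - i] for i in range(k + 1)) for k in range(size)]

theta, n, size = 0.7, 3, 25
pmf = [poisson_pmf(k, theta) for k in range(size)]

# n-fold convolution of Poisson(theta) with itself
s = pmf
for _ in range(n - 1):
    s = convolve(s, pmf, size)

# compare against Poisson(n * theta), i.e. the family member F_{g_n(theta)}
target = [poisson_pmf(k, n * theta) for k in range(size)]
max_err = max(abs(a - b) for a, b in zip(s, target))
```

The maximum discrepancy is at floating-point noise level, consistent with the sum staying inside the family with the reparametrization gn(θ) = nθ.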


1995 ◽  
Vol 32 (1) ◽  
pp. 168-182 ◽  
Author(s):  
K. D. Glazebrook ◽  
S. Greatrix

Nash (1980) demonstrated that index policies are optimal for a class of generalised bandit problems. A transform of the index concerned has many of the attributes of the Gittins index. The transformed index is positive-valued, with maximal values yielding optimal actions. It may be characterised as the value of a restart problem and is hence computable via dynamic programming methodologies. The transformed index can also be used in procedures for policy evaluation.
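The restart characterisation can be sketched numerically. In the Katehakis–Veinott formulation, the Gittins index of state i equals (1 − β) times the value at i of a problem in which, at every step, one either continues from the current state or restarts from i. The two-state chain, rewards, and discount factor below are illustrative assumptions:

```python
def restart_value(i, R, P, beta, iters=2000):
    """Value iteration for the restart-in-i problem: at each state choose
    either to continue from there or to restart from state i.
    The Gittins index of state i is then (1 - beta) * V(i)."""
    n = len(R)
    V = [0.0] * n
    for _ in range(iters):
        # one-step continuation value from each state
        cont = [R[x] + beta * sum(P[x][y] * V[y] for y in range(n))
                for x in range(n)]
        # at every state, the better of continuing or restarting at i
        V = [max(cont[x], cont[i]) for x in range(n)]
    return V

# toy chain: state 0 pays reward 1, state 1 pays nothing
R = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.3, 0.7]]
beta = 0.9
idx = [(1 - beta) * restart_value(i, R, P, beta)[i] for i in range(2)]
```

Since the index is a solution of a discounted dynamic program, any standard method (value iteration as here, policy iteration, or linear programming) computes it; the rewarding state receives the larger index, as expected.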


2016 ◽  
Vol 4 ◽  
Author(s):  
TIM AUSTIN

Sofic entropy is an invariant for probability-preserving actions of sofic groups. It was introduced a few years ago by Lewis Bowen, and shown to extend the classical Kolmogorov–Sinai entropy from the setting of amenable groups. Some parts of Kolmogorov–Sinai entropy theory generalize to sofic entropy, but in other respects this new invariant behaves less regularly. This paper explores conditions under which sofic entropy is additive for Cartesian products of systems. It is always subadditive, but the reverse inequality can fail. We define a new entropy notion in terms of probability distributions on the spaces of good models of an action. Using this, we prove a general lower bound for the sofic entropy of a Cartesian product in terms of separate quantities for the two factor systems involved. We also prove that this lower bound is optimal in a certain sense, and use it to derive some sufficient conditions for the strict additivity of sofic entropy itself. Various other properties of this new entropy notion are also developed.


2008 ◽  
Vol 40 (2) ◽  
pp. 377-400 ◽  
Author(s):  
Savas Dayanik ◽  
Warren Powell ◽  
Kazutoshi Yamazaki

A multiarmed bandit problem is studied when the arms are not always available. The arms are first assumed to be intermittently available with some state/action-dependent probabilities. It is proven that no index policy can attain the maximum expected total discounted reward in every instance of that problem. The Whittle index policy is derived, and its properties are studied. Then it is assumed that the arms may break down, but repair is an option at some cost, and the new Whittle index policy is derived. Both problems are indexable. The proposed index policies cannot be dominated by any other index policy over all multiarmed bandit problems considered here. Whittle indices are evaluated for Bernoulli arms with unknown success probabilities.


