Regret Bounds for Reinforcement Learning via Markov Chain Concentration
We give a simple optimistic algorithm for which it is easy to derive regret bounds of O(sqrt(t_mix SAT)) after T steps in uniformly ergodic Markov decision processes with S states, A actions, and mixing-time parameter t_mix. These are the first regret bounds in the general, non-episodic setting with an optimal dependence on all given parameters; they could only be improved by replacing t_mix with an alternative mixing-time parameter.
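The paper's algorithm itself is not reproduced here, but the optimism principle it builds on can be illustrated with a minimal sketch. The following is a hedged, UCRL2-style extended value iteration: the empirical transition distribution of each state-action pair is widened by an L1 confidence radius, and the optimistic model shifts that slack toward the currently best-valued state. The confidence-radius formula and all function names are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def optimistic_value_iteration(counts, visits, rewards, delta=0.1, iters=50):
    """Illustrative UCRL2-style extended value iteration (a sketch, not the
    paper's exact algorithm).  For each (s, a), the empirical transition
    distribution is widened by an L1 confidence radius, and the optimistic
    model moves probability mass toward the currently best-valued state."""
    S, A = visits.shape
    n = np.maximum(visits, 1)
    p_hat = counts / n[:, :, None]                     # empirical transitions
    # Simplified L1 deviation bound; the exact constants are an assumption.
    radius = np.sqrt(2.0 * np.log(S * A / delta) / n)
    v = np.zeros(S)
    for _ in range(iters):
        q = np.zeros((S, A))
        order = np.argsort(-v)                         # states, best first
        best = order[0]
        for s in range(S):
            for a in range(A):
                p = p_hat[s, a].copy()
                added = min(radius[s, a] / 2.0, 1.0 - p[best])
                p[best] += added                       # optimism: boost best state
                for s2 in order[::-1]:                 # pay from worst states first
                    if s2 == best:
                        continue
                    take = min(p[s2], added)
                    p[s2] -= take
                    added -= take
                    if added <= 1e-12:
                        break
                q[s, a] = rewards[s, a] + p @ v
        v = q.max(axis=1)
        v -= v.min()                                   # keep iterates bounded
    return v, q.argmax(axis=1)

# Toy usage on a random 3-state, 2-action MDP with hypothetical counts.
rng = np.random.default_rng(0)
S, A = 3, 2
counts = rng.integers(1, 10, size=(S, A, S)).astype(float)
visits = counts.sum(axis=2)
rewards = rng.random((S, A))
v, policy = optimistic_value_iteration(counts, visits, rewards)
```

The inner loop is the standard closed-form maximization over the L1 confidence ball: because the mass added to the best state never exceeds the mass available at the other states, the perturbed vector remains a valid probability distribution.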