discounted criterion
Recently Published Documents

TOTAL DOCUMENTS: 19 (five years: 2)
H-INDEX: 5 (five years: 0)

Author(s):  
Xiaoteng Ma ◽  
Xiaohang Tang ◽  
Li Xia ◽  
Jun Yang ◽  
Qianchuan Zhao

Most reinforcement learning algorithms optimize the discounted criterion, which helps accelerate convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks, such as finance-related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. First, we develop a unified trust region theory covering both the discounted and the average criterion; under the average criterion, a novel performance bound within the trust region is derived via Perturbation Analysis (PA) theory. Second, we propose a practical algorithm named Average Policy Optimization (APO), which improves value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, ours is the first work to study the trust region approach under the average criterion, and it extends the reinforcement learning framework beyond the discounted criterion. Finally, experiments are conducted on the MuJoCo continuous-control benchmark. On most tasks, APO outperforms discounted PPO, which demonstrates the effectiveness of our approach.
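
The contrast between the two criteria is easy to make concrete. The sketch below (our illustration, not the authors' APO implementation; the chain, rewards, and discount factor are made up) evaluates a fixed policy on a small finite MDP under both the discounted and the long-run average criterion:

```python
# A minimal sketch (our illustration, not the authors' APO code): evaluate a
# fixed policy on a small, made-up finite MDP under both criteria.
import numpy as np

# Hypothetical 3-state chain induced by some fixed policy.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.2],
              [0.3, 0.0, 0.7]])  # P[s, s'] = transition probability
r = np.array([1.0, 0.0, 2.0])    # expected one-step reward in each state

# Discounted criterion: solve (I - gamma P) v = r for the value vector v.
gamma = 0.99
v = np.linalg.solve(np.eye(3) - gamma * P, r)

# Long-run average criterion: rho = mu @ r, where mu is the stationary
# distribution (left eigenvector of P for eigenvalue 1, normalized).
w, V = np.linalg.eig(P.T)
mu = np.real(V[:, np.argmax(np.real(w))])
mu = mu / mu.sum()
rho = mu @ r

print("discounted values :", v)
print("average reward    :", rho)
# For unichain models, (1 - gamma) * v -> rho as gamma -> 1, which is the
# bridge between the two criteria that a unified theory can exploit.
print("(1 - gamma) * v   :", (1 - gamma) * v)
```

The final print illustrates the standard link between the criteria: for unichain models, (1 - gamma) times the discounted value approaches the average reward as gamma tends to 1.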


2021 ◽  
Vol 229 ◽  
pp. 01047
Author(s):  
Abdellatif Semmouri ◽  
Mostafa Jourhmane ◽  
Bahaa Eddine Elbaghazaoui

In this paper, we consider constrained optimization of discrete-time Markov Decision Processes (MDPs) with finite state and action spaces, which accumulate both a reward and costs at each decision epoch. We study the problem of finding a policy that maximizes the expected total discounted reward subject to the constraints that the expected total discounted costs do not exceed given values. To compute an optimal or a nearly optimal stationary policy, we investigate a decomposition of the state space into strongly communicating classes. The discounted criterion has applications in many areas, such as forest management, energy-consumption management, finance, communication systems (mobile networks), and artificial intelligence.
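
One standard computational route for this kind of constrained problem (our illustration; the paper's contribution is the decomposition method, not this LP) is the classical occupation-measure linear program for discounted MDPs, sketched here with made-up data:

```python
# A hedged sketch (our illustration; the paper's contribution is the
# decomposition method, not this LP): the classical occupation-measure
# linear program for a constrained discounted MDP, with made-up data.
import numpy as np
from scipy.optimize import linprog

nS, nA, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition kernel
r = rng.uniform(0.0, 1.0, size=(nS, nA))       # reward r(s, a)
c = rng.uniform(0.0, 1.0, size=(nS, nA))       # cost c(s, a)
mu0 = np.full(nS, 1.0 / nS)                    # initial state distribution
d = 5.0                                        # budget on total discounted cost

# Variables: x(s, a) = expected discounted number of visits to (s, a),
# flattened to a vector of length nS * nA.
# Flow constraints: sum_a x(s', a) - gamma * sum_{s, a} P[s, a, s'] x(s, a)
#                   = mu0(s')   for every state s'.
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = float(s == sp) - gamma * P[s, a, sp]
b_eq = mu0

# Objective: maximize discounted reward; constraint: expected total
# discounted cost must not exceed the budget d.
res = linprog(-r.reshape(-1),
              A_ub=c.reshape(1, -1), b_ub=[d],
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
x = res.x.reshape(nS, nA)

# Recover a (possibly randomized) stationary policy: pi(a|s) proportional to
# x(s, a); the row sums are positive because mu0 is positive everywhere.
pi = x / x.sum(axis=1, keepdims=True)
print("optimal discounted reward:", -res.fun)
print("policy:\n", pi)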


2002 ◽  
Vol 39 (2) ◽  
pp. 233-250 ◽  
Author(s):  
Xianping Guo ◽  
Weiping Zhu

In this paper, we consider denumerable-state continuous-time Markov decision processes with (possibly unbounded) transition and reward rates and a general action space under the discounted criterion. We provide a set of conditions weaker than those previously known and prove the existence of optimal stationary policies within the class of all (possibly randomized) Markov policies. The results are illustrated by birth-and-death processes with controlled immigration, for which our conditions are satisfied whereas the earlier conditions fail to hold.
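
For reference, in this continuous-time setting the discounted criterion replaces the discounted sum by a discounted integral; in standard notation (our rendering, with discount rate alpha > 0):

```latex
% Discounted criterion for a continuous-time MDP under a policy \pi,
% with discount rate \alpha > 0 (standard notation; our rendering).
V^{\pi}(s) = \mathbb{E}^{\pi}_{s}\!\left[\int_{0}^{\infty}
  e^{-\alpha t}\, r(x_{t}, a_{t})\, dt\right],
\qquad
V^{*}(s) = \sup_{\pi} V^{\pi}(s).
```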


2001 ◽  
Vol 15 (4) ◽  
pp. 557-564 ◽  
Author(s):  
Rolando Cavazos-Cadena ◽  
Raúl Montes-de-Oca

This article concerns Markov decision chains with finite state and action spaces, in which a control policy is graded via the expected total-reward criterion associated with a nonnegative reward function. Within this framework, a classical theorem guarantees the existence of an optimal stationary policy whenever the optimal value function is finite, a result obtained via a limit process using the discounted criterion. The objective of this article is to present an alternative approach, based entirely on the properties of the expected total-reward index, to establish such an existence result.
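
The limit process the abstract alludes to is the classical one (our rendering): with nonnegative rewards, the discounted optimal values increase to the total-reward optimal value as the discount factor tends to one from below.

```latex
% The classical limit process (our rendering): with nonnegative rewards,
% the discounted optimal values increase to the total-reward optimal value
% as the discount factor \alpha tends to 1 from below.
V_{\alpha}(s) = \sup_{\pi}\,
  \mathbb{E}^{\pi}_{s}\!\left[\sum_{t=0}^{\infty} \alpha^{t}\, r(x_{t}, a_{t})\right],
\qquad
V(s) = \lim_{\alpha \uparrow 1} V_{\alpha}(s).
```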

