scholarly journals Reducing Entropy Overestimation in Soft Actor Critic Using Dual Policy Network

2021 ◽  
Vol 2021 ◽  
pp. 1-13
Author(s):  
Hamid Ali ◽  
Hammad Majeed ◽  
Imran Usman ◽  
Khaled A. Almejalli

In reinforcement learning (RL), an agent learns an environment through hit and trail. This behavior allows the agent to learn in complex and difficult environments. In RL, the agent normally learns the given environment by exploring or exploiting. Most of the algorithms suffer from under exploration in the latter stage of the episodes. Recently, an off-policy algorithm called soft actor critic (SAC) is proposed that overcomes this problem by maximizing entropy as it learns the environment. In it, the agent tries to maximize entropy along with the expected discounted rewards. In SAC, the agent tries to be as random as possible while moving towards the maximum reward. This randomness allows the agent to explore the environment and stops it from getting stuck into local optima. We believe that maximizing the entropy causes the overestimation of entropy term which results in slow policy learning. This is because of the drastic change in action distribution whenever agent revisits the similar states. To overcome this problem, we propose a dual policy optimization framework, in which two independent policies are trained. Both the policies try to maximize entropy by choosing actions against the minimum entropy to reduce the overestimation. The use of two policies result in better and faster convergence. We demonstrate our approach on different well known continuous control simulated environments. Results show that our proposed technique achieves better results against state of the art SAC algorithm and learns better policies.

Symmetry ◽  
2019 ◽  
Vol 11 (7) ◽  
pp. 938
Author(s):  
Syed Rameez Naqvi ◽  
Ali Roman ◽  
Tallha Akram ◽  
Majed M. Alhaisoni ◽  
Muhammad Naeem ◽  
...  

Pipelines, in Reduced Instruction Set Computer (RISC) microprocessors, are expected to provide increased throughputs in most cases. However, there are a few instructions, and therefore entire assembly language codes, that execute faster and hazard-free without pipelines. It is usual for the compilers to generate codes from high level description that are more suitable for the underlying hardware to maintain symmetry with respect to performance; this, however, is not always guaranteed. Therefore, instead of trying to optimize the description to suit the processor design, we try to determine the more suitable processor variant for the given code during compile time, and dynamically reconfigure the system accordingly. In doing so, however, we first need to classify each code according to its suitability to a different processor variant. The latter, in turn, gives us confidence in performance symmetry against various types of codes—this is the primary contribution of the proposed work. We first develop mathematical performance models of three conventional microprocessor designs, and propose a symmetry-improving nonlinear optimization method to achieve code-to-design mapping. Our analysis is based on four different architectures and 324,000 different assembly language codes, each with between 10 and 1000 instructions with different percentages of commonly seen instruction types. Our results suggest that in the sub-micron era, where execution time of each instruction is merely in a few nanoseconds, codes accumulating as low as 5% (or above) hazard causing instructions execute more swiftly on processors without pipelines.


1996 ◽  
Vol 30 (4) ◽  
pp. 1033-1056 ◽  
Author(s):  
Chang Jui-Te

Effective combat performance depends on the following: First, there must be a sound command structure capable of making rational decisions. Second, there must be efficient means of communication to transmit decisions through the chain of command and to give the commanders continuous control over their units. There must also be sufficient transportation to allow the units to execute their mission in a timely way. Third, there must be adequate quality and quantity of weapons and supplies commensurable with the given military mission. Fourth, there must be high-quality soldiers at all levels able to perform their duties competently. Finally, the entire military effort must be guided by clear and coherent strategic thinking.


Author(s):  
K. A. A. Mustafa ◽  
N. Botteghi ◽  
B. Sirmacek ◽  
M. Poel ◽  
S. Stramigioli

<p><strong>Abstract.</strong> We introduce a new autonomous path planning algorithm for mobile robots for reaching target locations in an unknown environment where the robot relies on its on-board sensors. In particular, we describe the design and evaluation of a deep reinforcement learning motion planner with continuous linear and angular velocities to navigate to a desired target location based on deep deterministic policy gradient (DDPG). Additionally, the algorithm is enhanced by making use of the available knowledge of the environment provided by a grid-based SLAM with Rao-Blackwellized particle filter algorithm in order to shape the reward function in an attempt to improve the convergence rate, escape local optima and reduce the number of collisions with the obstacles. A comparison is made between a reward function shaped based on the map provided by the SLAM algorithm and a reward function when no knowledge of the map is available. Results show that the required learning time has been decreased in terms of number of episodes required to converge, which is 560 episodes compared to 1450 episodes in the standard RL algorithm, after adopting the proposed approach and the number of obstacle collision is reduced as well with a success ratio of 83% compared to 56% in the standard RL algorithm. The results are validated in a simulated experiment on a skid-steering mobile robot.</p>


Entropy ◽  
2021 ◽  
Vol 23 (12) ◽  
pp. 1702
Author(s):  
Haibo Sun ◽  
Feng Zhu ◽  
Yanzi Kong ◽  
Jianyu Wang ◽  
Pengfei Zhao

Active object recognition (AOR) aims at collecting additional information to improve recognition performance by purposefully adjusting the viewpoint of an agent. How to determine the next best viewpoint of the agent, i.e., viewpoint planning (VP), is a research focus. Most existing VP methods perform viewpoint exploration in the discrete viewpoint space, which have to sample viewpoint space and may bring in significant quantization error. To address this challenge, a continuous VP approach for AOR based on reinforcement learning is proposed. Specifically, we use two separate neural networks to model the VP policy as a parameterized Gaussian distribution and resort the proximal policy optimization framework to learn the policy. Furthermore, an adaptive entropy regularization based dynamic exploration scheme is presented to automatically adjust the viewpoint exploration ability in the learning process. To the end, experimental results on the public dataset GERMS well demonstrate the superiority of our proposed VP method.


Author(s):  
Yinlam Chow ◽  
Brandon Cui ◽  
Moonkyung Ryu ◽  
Mohammad Ghavamzadeh

Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL. However, designing such algorithms is often challenging because the bias in simulated data may overshadow the ease of data generation. A potential solution to this challenge is to jointly learn and improve model and policy using a universal objective function. In this paper, we leverage the connection between RL and probabilistic inference, and formulate such an objective function as a variational lower-bound of a log-likelihood. This allows us to use expectation maximization (EM) and iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), followed by improving the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and show how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion. Our experiments on a number of continuous control tasks show that our model-based (E-step) algorithm, called variational model-based policy optimization (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and show its sample efficiency and performance.


2020 ◽  
Vol 10 (16) ◽  
pp. 5722 ◽  
Author(s):  
Duy Quang Tran ◽  
Sang-Hoon Bae

Advanced deep reinforcement learning shows promise as an approach to addressing continuous control tasks, especially in mixed-autonomy traffic. In this study, we present a deep reinforcement-learning-based model that considers the effectiveness of leading autonomous vehicles in mixed-autonomy traffic at a non-signalized intersection. This model integrates the Flow framework, the simulation of urban mobility simulator, and a reinforcement learning library. We also propose a set of proximal policy optimization hyperparameters to obtain reliable simulation performance. First, the leading autonomous vehicles at the non-signalized intersection are considered with varying autonomous vehicle penetration rates that range from 10% to 100% in 10% increments. Second, the proximal policy optimization hyperparameters are input into the multiple perceptron algorithm for the leading autonomous vehicle experiment. Finally, the superiority of the proposed model is evaluated using all human-driven vehicle and leading human-driven vehicle experiments. We demonstrate that full-autonomy traffic can improve the average speed and delay time by 1.38 times and 2.55 times, respectively, compared with all human-driven vehicle experiments. Our proposed method generates more positive effects when the autonomous vehicle penetration rate increases. Additionally, the leading autonomous vehicle experiment can be used to dissipate the stop-and-go waves at a non-signalized intersection.


2020 ◽  
Vol 34 (04) ◽  
pp. 6770-6777
Author(s):  
Chuheng Zhang ◽  
Yuanqi Li ◽  
Jian Li

It is known that existing policy gradient methods (such as vanilla policy gradient, PPO, A2C) may suffer from overly large gradients when the current policy is close to deterministic, leading to an unstable training process. We show that such instability can happen even in a very simple environment. To address this issue, we propose a new method, called target distribution learning (TDL), for policy improvement in reinforcement learning. TDL alternates between proposing a target distribution and training the policy network to approach the target distribution. TDL is more effective in constraining the KL divergence between updated policies, and hence leads to more stable policy improvements over iterations. Our experiments show that TDL algorithms perform comparably to (or better than) state-of-the-art algorithms for most continuous control tasks in the MuJoCo environment while being more stable in training.


Author(s):  
Peixi Peng ◽  
Junliang Xing ◽  
Lili Cao ◽  
Lisen Mu ◽  
Chang Huang

The task of real-time combat game is to coordinate multiple units to defeat their enemies controlled by the given opponent in a real-time combat scenario. It is difficult to design a high-level Artificial Intelligence (AI) program for such a task due to its extremely large state-action space and real-time requirements. This paper formulates this task as a collective decentralized partially observable Markov decision process, and designs a Deep Decentralized Policy Network (DDPN) to model the polices. To train DDPN effectively, a novel two-stage learning algorithm is proposed which combines imitation learning from opponent and reinforcement learning by no-regret dynamics. Extensive experimental results on various combat scenarios indicate that proposed method can defeat different opponent models and significantly outperforms many state-of-the-art approaches.


Author(s):  
Jiajin Li ◽  
Baoxiang Wang ◽  
Shengyu Zhang

Policy optimization on high-dimensional continuous control tasks exhibits its difficulty caused by the large variance of the policy gradient estimators. We present the action subspace dependent gradient (ASDG) estimator which incorporates the Rao-Blackwell theorem (RB) and Control Variates (CV) into a unified framework to reduce the variance. To invoke RB, our proposed algorithm (POSA) learns the underlying factorization structure among the action space based on the second-order advantage information. POSA captures the quadratic information explicitly and efficiently by utilizing the wide \& deep architecture. Empirical studies show that our proposed approach demonstrates the performance improvements on high-dimensional synthetic settings and OpenAI Gym's MuJoCo continuous control tasks.


Author(s):  
Xiaoteng Ma ◽  
Xiaohang Tang ◽  
Li Xia ◽  
Jun Yang ◽  
Qianchuan Zhao

Most of reinforcement learning algorithms optimize the discounted criterion which is beneficial to accelerate the convergence and reduce the variance of estimates. Although the discounted criterion is appropriate for certain tasks such as financial related problems, many engineering problems treat future rewards equally and prefer a long-run average criterion. In this paper, we study the reinforcement learning problem with the long-run average criterion. Firstly, we develop a unified trust region theory with discounted and average criteria. With the average criterion, a novel performance bound within the trust region is derived with the Perturbation Analysis (PA) theory. Secondly, we propose a practical algorithm named Average Policy Optimization (APO), which improves the value estimation with a novel technique named Average Value Constraint. To the best of our knowledge, our work is the first one to study the trust region approach with the average criterion and it complements the framework of reinforcement learning beyond the discounted criterion. Finally, experiments are conducted in the continuous control environment MuJoCo. In most tasks, APO performs better than the discounted PPO, which demonstrates the effectiveness of our approach.


Sign in / Sign up

Export Citation Format

Share Document