Grid approximations of MDPs with continuous state/action spaces

In this paper, we consider the important problem of safe exploration in reinforcement learning. While reinforcement learning is well-suited to domains with complex transition dynamics and high-dimensional state-action spaces, an additional challenge is posed by the need for safe and efficient exploration. Traditional exploration techniques are not particularly useful for solving dangerous tasks, where the trial-and-error process may select actions whose execution in some states damages the learning system (or any other system). Consequently, when an agent begins interacting with a dangerous, high-dimensional state-action space, an important question arises: how to avoid (or at least minimize) the damage caused by exploring the state-action space. We introduce the PI-SRL algorithm, which safely improves suboptimal albeit robust behaviors for continuous state and action control tasks and efficiently learns from the experience gained from the environment. We evaluate the proposed method in four complex tasks: automatic car parking, pole-balancing, helicopter hovering, and business management.

Compositional RL Agents That Follow Language Commands in Temporal Logic

Frontiers in Robotics and AI ◽

10.3389/frobt.2021.689550 ◽

2021 ◽

Vol 8 ◽

Author(s):

Yen-Ling Kuo ◽

Boris Katz ◽

Andrei Barbu

Keyword(s):

Neural Networks ◽

Temporal Logic ◽

Recurrent Neural Networks ◽

Prior Work ◽

State Action ◽

Compositional Structure ◽

Learning Agent ◽

Continuous State ◽

Compositional Structures ◽

Action Spaces

We demonstrate how a reinforcement learning agent can use compositional recurrent neural networks to learn to carry out commands specified in linear temporal logic (LTL). Our approach takes as input an LTL formula, structures a deep network according to the parse of the formula, and determines satisfying actions. This compositional structure of the network enables zero-shot generalization to significantly more complex unseen formulas. We demonstrate this ability in multiple problem domains with both discrete and continuous state-action spaces. In a symbolic domain, the agent finds a sequence of letters that satisfy a specification. In a Minecraft-like environment, the agent finds a sequence of actions that conform to a formula. In the Fetch environment, the robot finds a sequence of arm configurations that move blocks on a table to fulfill the commands. While most prior work can learn to execute one formula reliably, we develop a novel form of multi-task learning for RL agents that allows them to learn from a diverse set of tasks and generalize to a new set of diverse tasks without any additional training. The compositional structures presented here are not specific to LTL, thus opening the path to RL agents that perform zero-shot generalization in other compositional domains.
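To make concrete what it means for a sequence of letters to satisfy an LTL specification in the symbolic domain, the following sketch checks a formula against a finite trace. The tuple encoding of formulas, the function name `holds`, and the finite-trace semantics are assumptions chosen for illustration; the paper itself structures a neural network according to the formula's parse rather than checking formulas symbolically.

```python
# Formulas are nested tuples (an assumed encoding for this sketch):
#   ("atom", "a"), ("not", f), ("and", f, g), ("or", f, g),
#   ("next", f), ("eventually", f), ("always", f), ("until", f, g)

def holds(formula, trace, i=0):
    """Return True if `formula` holds at position i of the finite `trace`."""
    op = formula[0]
    if op == "atom":
        return i < len(trace) and trace[i] == formula[1]
    if op == "not":
        return not holds(formula[1], trace, i)
    if op == "and":
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == "or":
        return holds(formula[1], trace, i) or holds(formula[2], trace, i)
    if op == "next":
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == "eventually":
        return any(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "always":
        return all(holds(formula[1], trace, j) for j in range(i, len(trace)))
    if op == "until":
        # g must hold at some j >= i, with f holding at every step before j.
        for j in range(i, len(trace)):
            if holds(formula[2], trace, j):
                return True
            if not holds(formula[1], trace, j):
                return False
        return False
    raise ValueError(f"unknown operator: {op}")

# "Eventually a, and never b" (a hypothetical specification)
spec = ("and", ("eventually", ("atom", "a")), ("always", ("not", ("atom", "b"))))
```

Under these assumptions, `holds(spec, "cca")` is true, while `holds(spec, "cba")` is false because the letter `b` appears in the trace. The recursive structure of `holds` mirrors the parse tree of the formula, which is the same structure a compositional network would follow.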

Collision-free path planning for welding manipulator via hybrid algorithm of deep reinforcement learning and inverse kinematics

Complex & Intelligent Systems ◽

10.1007/s40747-021-00366-1 ◽

2021 ◽

Author(s):

Jie Zhong ◽

Tao Wang ◽

Lianglun Cheng

Keyword(s):

Reinforcement Learning ◽

Path Planning ◽

Free Path ◽

Inverse Kinematics ◽

Multiple Dimensions ◽

Continuous State ◽

Planning Algorithm ◽

Convergence Performance ◽

Path Planner ◽

Action Spaces

In actual welding scenarios, an effective path planner is needed to find a collision-free path in the configuration space for a welding manipulator surrounded by obstacles. However, the sampling-based planner, a state-of-the-art method, satisfies only probabilistic completeness, and its computational complexity is sensitive to the state dimension. In this paper, we propose a path planner for welding manipulators based on deep reinforcement learning for solving path planning problems in high-dimensional continuous state and action spaces. Compared with the sampling-based method, it is more robust and less sensitive to the state dimension. In detail, to improve the learning efficiency, we introduce an inverse kinematics module to provide prior knowledge, and we design a gain module to avoid locally optimal policies; both are integrated into the training algorithm. To evaluate our proposed planning algorithm in multiple dimensions, we conducted multiple sets of path planning experiments for welding manipulators. The results show that our method not only improves the convergence performance but is also superior in terms of optimality and robustness of planning compared with most other planning algorithms.

Action selection in continuous state and action spaces by cooperation and competition of extended Kohonen maps

Proceedings of the second international joint conference on Autonomous agents and multiagent systems - AAMAS '03 ◽

10.1145/860575.860793 ◽

2003 ◽

Author(s):

Kian Hsiang Low ◽

Wee Kheng Leow ◽

Marcelo H. Ang

Keyword(s):

Action Selection ◽

Kohonen Maps ◽

Cooperation And Competition ◽

Continuous State ◽

Action Spaces

A Deep Reinforcement Learning-Based MPPT Control for PV Systems under Partial Shading Condition

Sensors ◽

10.3390/s20113039 ◽

2020 ◽

Vol 20 (11) ◽

pp. 3039

Author(s):

Bao Chau Phan ◽

Ying-Chih Lai ◽

Chin E. Lin

Keyword(s):

Reinforcement Learning ◽

Maximum Power ◽

Maximum Power Point ◽

Partial Shading ◽

Discrete State ◽

Efficient Operation ◽

Pv Systems ◽

Continuous State ◽

Power Point ◽

Action Spaces

Renewable energy systems have been widely considered as a response to global environmental protection concerns. The photovoltaic (PV) system converts solar power into electricity and significantly reduces the consumption of fossil fuels and the resulting environmental pollution. Besides the introduction of new solar-cell materials to improve energy conversion efficiency, maximum power point tracking (MPPT) algorithms have been developed to ensure the efficient operation of PV systems at the maximum power point (MPP) under various weather conditions. The integration of reinforcement learning and deep learning, named deep reinforcement learning (DRL), is proposed in this paper as a future tool for optimization control problems. Following the success of DRL in several fields, the deep Q network (DQN) and deep deterministic policy gradient (DDPG) are proposed to harvest the MPP in PV systems, especially under a partial shading condition (PSC). Unlike the reinforcement learning (RL)-based method, which operates only with discrete state and action spaces, the methods adopted in this paper handle continuous state spaces. In this study, DQN solves the problem with discrete action spaces, while DDPG handles continuous action spaces. The proposed methods are simulated in MATLAB/Simulink for feasibility analysis. Further tests under various input conditions, with comparisons to the classical perturb and observe (P&O) MPPT method, are carried out for validation. Based on the simulation results in this study, the performance of the proposed methods is outstanding and efficient, showing their potential for further applications.
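For readers unfamiliar with the classical P&O baseline named in the abstract, the following is a minimal sketch of its hill-climbing idea. The power curve `pv_power`, the step size, and the function names are hypothetical choices for illustration only; the curve is a toy parabola, not a PV cell model, and nothing here reproduces the paper's MATLAB/Simulink setup.

```python
def pv_power(v):
    """Toy single-peak power curve (hypothetical); peak of 200 at v = 30."""
    return max(0.0, 200.0 - 0.5 * (v - 30.0) ** 2)

def perturb_and_observe(power_fn, v0, step=0.5, iterations=200):
    """Perturb the operating voltage each iteration and reverse direction
    whenever the last perturbation reduced the measured power."""
    v = v0
    p_prev = power_fn(v)
    direction = 1.0
    for _ in range(iterations):
        v += direction * step
        p = power_fn(v)
        if p < p_prev:
            direction = -direction  # power dropped: reverse the perturbation
        p_prev = p
    return v

v_final = perturb_and_observe(pv_power, v0=20.0)
```

On this single-peak curve the operating point ends up oscillating within one step of the peak. On a curve with several peaks, as can occur under partial shading, this kind of hill climbing may settle on a local peak, which is one motivation for the learning-based methods the abstract describes.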

Reinforcement Distribution in Continuous State Action Space Fuzzy Q–Learning: A Novel Approach

Fuzzy Logic and Applications - Lecture Notes in Computer Science ◽

10.1007/11676935_5 ◽

2006 ◽

pp. 40-45

Author(s):

Andrea Bonarini ◽

Francesco Montrone ◽

Marcello Restelli

Keyword(s):

Action Space ◽

Q Learning ◽

State Action ◽

Novel Approach ◽

Continuous State ◽

Reinforcement Distribution

Fitted Natural Actor-Critic: A New Algorithm for Continuous State-Action MDPs

Machine Learning and Knowledge Discovery in Databases - Lecture Notes in Computer Science ◽

10.1007/978-3-540-87481-2_5 ◽

2008 ◽

pp. 66-81 ◽

Cited By ~ 9

Author(s):

Francisco S. Melo ◽

Manuel Lopes

Keyword(s):

State Action ◽

Continuous State

Online Tuning of a PID Controller with a Fuzzy Reinforcement Learning MAS for Flow Rate Control of a Desalination Unit

Electronics ◽

10.3390/electronics8020231 ◽

2019 ◽

Vol 8 (2) ◽

pp. 231 ◽

Cited By ~ 2

Author(s):

Panagiotis Kofinas ◽

Anastasios I. Dounis

Keyword(s):

Reinforcement Learning ◽

Flow Rate ◽

Pid Controller ◽

Hybrid Control ◽

Q Learning ◽

State Action ◽

Continuous State ◽

Multi Agent ◽

Flow Rate Control ◽

Online Tuning

This paper proposes a hybrid Ziegler-Nichols (Z-N) fuzzy reinforcement learning multi-agent system (MAS) approach for online tuning of a proportional-integral-derivative (PID) controller in order to control the flow rate of a desalination unit. The PID gains are set by the Z-N method and are then adapted online through the fuzzy Q-learning MAS. Fuzzy Q-learning is introduced in each agent in order to cope with the continuous state-action space. The global state of the MAS is defined by the value of the error and the derivative of the error. The MAS consists of three agents, and the output signal of each agent defines the percentage change of each gain. Each gain can be increased or reduced by 0% to 100% of its initial value. The simulation results highlight the performance of the suggested hybrid control strategy through comparison with the conventional PID controller tuned by Z-N.
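The two-stage idea in the abstract (initial gains from the Z-N rule, then a bounded percentage change per gain) can be sketched as follows. This is an illustration under stated assumptions: the classic closed-loop Z-N PID rule is used, the names `PIDGains`, `ziegler_nichols_pid`, `adapt_gains`, and `pid_step` are hypothetical, and the `changes` tuple stands in for the three agents' outputs, which the paper obtains from fuzzy Q-learning (not shown here).

```python
from dataclasses import dataclass

@dataclass
class PIDGains:
    kp: float
    ki: float
    kd: float

def ziegler_nichols_pid(ku, tu):
    """Classic Z-N PID rule from the ultimate gain ku and ultimate period tu."""
    kp = 0.6 * ku
    ki = 1.2 * ku / tu      # kp / (tu / 2)
    kd = 0.075 * ku * tu    # kp * (tu / 8)
    return PIDGains(kp, ki, kd)

def _clip(change):
    """Limit a fractional change to [-1, 1], i.e. at most 100% either way."""
    return max(-1.0, min(1.0, change))

def adapt_gains(initial, changes):
    """Scale each initial gain by (1 + change); `changes` is a placeholder
    for the per-agent outputs (one each for kp, ki, kd)."""
    return PIDGains(
        initial.kp * (1.0 + _clip(changes[0])),
        initial.ki * (1.0 + _clip(changes[1])),
        initial.kd * (1.0 + _clip(changes[2])),
    )

def pid_step(gains, error, prev_error, integral, dt):
    """One discrete PID update; returns (control_signal, new_integral)."""
    integral += error * dt
    derivative = (error - prev_error) / dt
    u = gains.kp * error + gains.ki * integral + gains.kd * derivative
    return u, integral

base = ziegler_nichols_pid(ku=10.0, tu=2.0)
tuned = adapt_gains(base, changes=(0.2, -0.5, 1.5))  # 1.5 is clipped to 1.0
```

Clipping the change to [-1, 1] is one reading of the statement that each gain may be increased or reduced by 0% to 100% of its initial value; it keeps every adapted gain between zero and twice its Z-N starting point.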

Learning Quadcopter Maneuvers with Concurrent Methods of Policy Optimization

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2017.p0639 ◽

2017 ◽

Vol 21 (4) ◽

pp. 639-649

Author(s):

Pei-Hua Huang ◽

Osamu Hasegawa

Keyword(s):

Trust Region ◽

Control Policy ◽

Asynchronous Learning ◽

State Action ◽

Learning Framework ◽

Learning Speed ◽

Continuous State ◽

Continuous Actions ◽

Policy Optimization ◽

Robotic Application

This study presents an aerial robotic application of deep reinforcement learning that applies an asynchronous learning framework and trust region policy optimization to a simulated quad-rotor helicopter (quadcopter) environment. In particular, we optimized a control policy asynchronously through interaction with concurrent instances of the environment. The control system was benchmarked and extended with examples to tackle continuous state-action tasks for the quadcopter: hovering control and balancing an inverted pole. Performing these maneuvers required continuous actions for sensitive control of small acceleration changes of the quadcopter, thereby maximizing the scalar reward of the defined tasks. The simulation results demonstrated an enhancement of the learning speed and reliability for the tasks.
