Efficient Reinforcement Learning Using Recursive Least-Squares Methods

2002 ◽  
Vol 16 ◽  
pp. 259-292 ◽  
Author(s):  
X. Xu ◽  
H. He ◽  
D. Hu

The recursive least-squares (RLS) algorithm is one of the best-known algorithms in adaptive filtering, system identification, and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, and two new reinforcement learning algorithms using linear value-function approximators are proposed and analyzed: RLS-TD(λ) and Fast-AHC (Fast Adaptive Heuristic Critic). RLS-TD(λ) can be viewed as the extension of RLS-TD(0) from λ=0 to general λ in the interval [0,1], making it a multi-step temporal-difference (TD) learning algorithm based on RLS methods. Convergence with probability one, and the limit of convergence, of RLS-TD(λ) are proved for ergodic Markov chains. Compared with the existing LS-TD(λ) algorithm, RLS-TD(λ) has computational advantages and is better suited to online learning. The effectiveness of RLS-TD(λ) is analyzed and verified in learning-prediction experiments on Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD(λ) algorithm in the critic network of the adaptive heuristic critic (AHC) method. Unlike the conventional AHC algorithm, Fast-AHC uses RLS methods to improve the learning-prediction efficiency of the critic. Learning-control experiments on the cart-pole balancing and acrobot swing-up problems compare the data efficiency of Fast-AHC with that of conventional AHC. The results show that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD(λ). Furthermore, the experiments demonstrate that different initial values of the variance matrix in RLS-TD(λ) are required to obtain good performance not only in learning prediction but also in learning control. The experimental results are analyzed in light of existing theoretical work on the transient phase of forgetting-factor RLS methods.
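For concreteness, the per-step form of the RLS-TD(λ) update for a linear value function V(s) = φ(s)ᵀθ can be sketched as follows. This is a minimal illustration of the recursion described above; the variable names, parameter values, and the initialization P₀ = δI are assumptions for the example, not the paper's exact code.

```python
import numpy as np

def rls_td_lambda_step(theta, P, z, phi_t, phi_next, reward,
                       gamma=0.95, lam=0.7, mu=1.0):
    """One RLS-TD(lambda) step for a linear value function V(s) = phi(s)^T theta.

    theta : (n,) parameter vector        P  : (n, n) variance matrix
    z     : (n,) eligibility trace       mu : forgetting factor (1 = none)
    """
    z = gamma * lam * z + phi_t               # accumulate eligibility trace
    d = phi_t - gamma * phi_next              # temporal-difference regressor
    k = P @ z / (mu + d @ P @ z)              # RLS gain
    theta = theta + k * (reward - d @ theta)  # least-squares TD update
    P = (P - np.outer(k, d @ P)) / mu         # variance-matrix update
    return theta, P, z

# Illustrative initialization; per the experiments above, the scale delta of
# P0 = delta * I must be chosen with care.
n, delta = 8, 100.0
theta, P, z = np.zeros(n), delta * np.eye(n), np.zeros(n)
```

The scale δ in P₀ = δI is exactly the initial variance-matrix setting whose influence the experiments investigate.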

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Chunyuan Zhang ◽  
Qi Song ◽  
Zeng Meng

The deep Q-network (DQN) is one of the most successful reinforcement learning algorithms, but it suffers from drawbacks such as slow convergence and instability. In contrast, traditional reinforcement learning algorithms with linear function approximation usually converge faster and are more stable, although they easily suffer from the curse of dimensionality. Many improvements to DQN have been made in recent years, but they seldom exploit the advantages of traditional algorithms. In this paper, we propose a novel Q-learning algorithm with linear function approximation, called minibatch recursive least-squares Q-learning (MRLS-Q). Unlike the traditional Q-learning algorithm with linear function approximation, the learning mechanism and model structure of MRLS-Q resemble those of a DQN with only one input layer and one linear output layer. It uses experience replay and minibatch training, and takes the agent's states rather than its state-action pairs as inputs. As a result, it can be used alone for low-dimensional problems and can also be seamlessly integrated into a DQN as its last layer for high-dimensional problems. In addition, MRLS-Q uses our proposed average-RLS optimization technique, so that it achieves better convergence performance whether used alone or integrated with a DQN. Finally, we demonstrate the effectiveness of MRLS-Q on the CartPole problem and four Atari games and experimentally investigate the influence of its hyperparameters.
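The paper's exact average-RLS rule is not reproduced here, but a hedged sketch of a minibatch RLS update for a linear output layer Q(s, a) = (Ws)[a], trained on experience-replay targets, illustrates the general mechanism; all names and the shared inverse-correlation matrix P are illustrative assumptions.

```python
import numpy as np

def minibatch_rls_q_update(W, P, states, actions, targets, mu=1.0):
    """Block-RLS update of a linear Q layer, Q(s, a) = (W @ s)[a].

    W : (n_actions, n_feat) weights      P : (n_feat, n_feat) shared matrix
    states  : (B, n_feat) replayed state features
    targets : (B,) Q-learning targets r + gamma * max_a' Q(s', a')
    """
    B = states.shape[0]
    # Woodbury-style block update of P over the whole minibatch.
    S = states @ P @ states.T + mu * np.eye(B)
    K = P @ states.T @ np.linalg.inv(S)            # (n_feat, B) gain
    P = (P - K @ states @ P) / mu
    # Correct only the weight row of each sampled action.
    errors = targets - np.einsum('bf,bf->b', W[actions], states)
    for i in range(B):
        W[actions[i]] += K[:, i] * errors[i]
    return W, P
```

Used as a DQN's last layer, `states` would be the penultimate-layer activations rather than raw observations.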


2020 ◽  
Vol 9 (2) ◽  
pp. e188922128
Author(s):  
Fábio Nogueira da Silva ◽  
João Viana Fonseca Neto

A heuristic is presented for tuning, and analyzing the convergence of, a reinforcement learning algorithm for output-feedback control that uses only input/output data generated by a model. To enable the convergence analysis, the parameters of the algorithms used for data generation must be adjusted and the control problem solved iteratively. The proposed heuristic adjusts the data-generator parameters, creating surfaces that assist in analyzing the convergence and robustness of the online optimal-control methodology. The algorithm tested is the discrete linear quadratic regulator (DLQR) with output feedback, based on reinforcement learning via temporal-difference learning in a policy-iteration scheme that determines the optimal policy from input/output data alone. Within the policy-iteration algorithm, recursive least squares (RLS) is used to estimate online the parameters associated with the output-feedback DLQR. After applying the proposed tuning heuristic, the influence of the parameters can be seen clearly, and the convergence analysis is facilitated.
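As an illustration of the RLS-based policy-evaluation step, the Q-function can be parameterized as a quadratic form Q(z) = zᵀHz in a data vector z that stacks a window of past inputs and outputs (output feedback), with its parameters estimated by RLS from the Bellman equation Q(z_t) = cost_t + Q(z_{t+1}). The sketch below is a generic rendering under these assumptions, not the authors' exact implementation.

```python
import numpy as np

def quad_features(z):
    """Upper-triangular basis of the quadratic form Q(z) = z^T H z."""
    i, j = np.triu_indices(len(z))
    scale = np.where(i == j, 1.0, 2.0)      # fold symmetric off-diagonals
    return scale * np.outer(z, z)[i, j]

def rls_policy_evaluation_step(theta, P, z_t, z_next, cost, mu=1.0):
    """One RLS step on the Bellman equation Q(z_t) = cost + Q(z_{t+1})."""
    phi = quad_features(z_t) - quad_features(z_next)
    k = P @ phi / (mu + phi @ P @ phi)
    theta = theta + k * (cost - phi @ theta)
    P = (P - np.outer(k, phi @ P)) / mu
    return theta, P
```

After policy evaluation converges, the policy-improvement step recovers the feedback gain from the blocks of H; the tuning heuristic's surfaces are swept over parameters such as mu and the excitation of the data generator.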


2009 ◽  
Vol 3 (6) ◽  
pp. 671-680 ◽  
Author(s):  
Tetsuya Morizono ◽  
Yoji Yamada ◽  
Masatake Higashi ◽  
...

Controlling the “feel” of operating a power-assist robot is important for improving robot operability, user satisfaction, and task-performance efficiency. Autonomous adjustment of this “feel” is considered for robots under impedance control, and reinforcement learning of the adjustment is discussed for tasks that include repetitive positioning. Experimental results demonstrate that an operational “feel” pattern appropriate for positioning at a goal is developed through the adjustment. The adjustment, which initially assumes a single fixed goal, is then extended to cases with multiple goals, where one goal is assumed to be chosen by the user in real time. To adjust the operational “feel” to individual goals, an algorithm infers the chosen goal. Experiments yield the same result as in the single-fixed-goal case, but they also suggest that the design must be improved so that the adjustment learning algorithm takes the accuracy of goal inference into account.
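The “feel” being adjusted can be pictured as the virtual spring-damper behaviour of the impedance controller. The sketch below is a generic illustration only (the parameter values, goal labels, and selection scheme are hypothetical): the learning algorithm adjusts the stiffness K and damping D per goal, and the inference module selects the active goal online.

```python
def assist_force(x, x_dot, goal, K, D):
    """Virtual spring-damper 'feel' of the impedance-controlled assist."""
    return K * (goal - x) - D * x_dot

# Hypothetical per-goal impedance parameters: tuned by the learning
# algorithm and selected online by the goal-inference module.
params = {"goal_a": (120.0, 25.0), "goal_b": (80.0, 15.0)}  # (K, D)
K, D = params["goal_a"]
force = assist_force(x=0.10, x_dot=0.02, goal=0.30, K=K, D=D)
```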


2010 ◽  
Vol 2 (2) ◽  
pp. 173-182 ◽  
Author(s):  
Lauri Anttila ◽  
Peter Händel ◽  
Olli Mylläri ◽  
Mikko Valkama

The main implementation impairments degrading the performance of direct-conversion radio transmitters are in-phase/quadrature (I/Q) mismatch, local-oscillator (LO) leakage, and power-amplifier (PA) nonlinear distortion. In this article, we propose a recursive least-squares-based learning algorithm for joint digital predistortion (PD) of frequency-dependent PA and I/Q modulator impairments. The predistorter is composed of a parallel connection of two parallel Hammerstein (PH) predistorters and an LO-leakage compensator, yielding a predistorter that as a whole is fully linear in its parameters. In the parameter-estimation stage, a feedback signal from the transmitter radio-frequency (RF) stage back to the digital parts is deployed, combined with the indirect learning architecture and recursive least-squares training. The proposed structure is among the first techniques to explicitly consider the joint estimation and mitigation of frequency-dependent PA and I/Q modulator impairments. Extensive simulation and measurement analyses verify the operation and efficiency of the proposed PD technique. Overall, the obtained results demonstrate linearization and I/Q modulator calibration performance clearly exceeding that of current state-of-the-art reference techniques.
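A simplified, memoryless rendering of such a linear-in-parameters predistorter basis (a PH branch, a conjugate branch for I/Q mismatch, and a constant LO-leakage term), together with a standard complex RLS fit, is sketched below. The article's actual predistorter adds FIR memory per branch, so this is an illustrative reduction, not the authors' implementation.

```python
import numpy as np

def pd_basis(x, order=3):
    """Linear-in-parameters basis: PH branch, conjugate PH branch (I/Q
    mismatch), and a constant LO-leakage term. Memoryless for brevity."""
    cols = [x * np.abs(x) ** (2 * p) for p in range(order)]
    cols += [np.conj(x) * np.abs(x) ** (2 * p) for p in range(order)]
    cols.append(np.ones_like(x))
    return np.stack(cols, axis=-1)

def rls_fit(Phi, d, lam=0.99, delta=100.0):
    """Exponentially weighted complex RLS fit of d ~ Phi @ theta."""
    n = Phi.shape[1]
    theta = np.zeros(n, dtype=complex)
    P = delta * np.eye(n, dtype=complex)
    for phi, y in zip(Phi, d):
        k = P @ phi.conj() / (lam + phi @ P @ phi.conj())
        theta = theta + k * (y - phi @ theta)
        P = (P - np.outer(k, phi @ P)) / lam
    return theta

# Indirect learning: fit a postdistorter from the gain-normalized PA output
# back to the PA input, then copy it in front of the PA as the predistorter.
```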


2015 ◽  
Vol 12 (03) ◽  
pp. 1550028 ◽  
Author(s):  
Rok Vuga ◽  
Bojan Nemec ◽  
Aleš Ude

In this paper, we propose an integrated policy learning framework that fuses iterative learning control (ILC) and reinforcement learning. Integration is accomplished at the exploration level of the reinforcement learning algorithm. The proposed algorithm combines fast convergence properties of iterative learning control and robustness of reinforcement learning. This way, the advantages of both approaches are retained while overcoming their respective limitations. The proposed approach was verified in simulation and in real robot experiments on three challenging motion optimization problems.
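One plausible reading of integration "at the exploration level" is that exploration is centred on the ILC-corrected trajectory while the RL side performs reward-weighted averaging over the explored rollouts. The sketch below illustrates this under assumed names and a simple Gaussian exploration model; it is not the authors' exact algorithm.

```python
import numpy as np

def propose_rollouts(u, e, n_rollouts, L=0.5, sigma=0.05, seed=0):
    """Exploration centred on the ILC-corrected command u + L * e."""
    rng = np.random.default_rng(seed)
    u_ilc = u + L * e                      # classical ILC learning update
    return [u_ilc + sigma * rng.standard_normal(len(u))
            for _ in range(n_rollouts)]

def reward_weighted_update(rollouts, returns):
    """RL side: reward-weighted averaging over the explored rollouts."""
    returns = np.asarray(returns, dtype=float)
    w = np.exp(returns - returns.max())    # softmax weights, numerically safe
    w /= w.sum()
    return sum(wi * ri for wi, ri in zip(w, rollouts))
```

The ILC term supplies the fast, model-free error correction, while the reward weighting keeps the update robust when the tracking error is a poor proxy for the true task cost.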


Kybernetes ◽  
2013 ◽  
Vol 42 (2) ◽  
pp. 268-281 ◽  
Author(s):  
Moêz Soltani ◽  
Abdelkader Chaari

Purpose – The purpose of this paper is to present a new methodology for identifying the parameters of local linear Takagi‐Sugeno fuzzy models using weighted recursive least squares (WRLS). WRLS is sensitive to initialization, which can lead to non-convergence. To overcome this problem, Euclidean particle swarm optimization (EPSO) is employed to optimize the initial states of WRLS. Finally, validation results, together with a comparative study, are given to demonstrate the effectiveness and accuracy of the proposed algorithm. Validation results involving simulations of numerical examples and a liquid-level process demonstrate the practicality of the algorithm.
Design/methodology/approach – A new method for nonlinear system modelling is proposed, in which EPSO is employed to optimize the initial states of the WRLS algorithm in the two phases of the learning algorithm.
Findings – The results obtained using this approach are comparable with other modelling approaches reported in the literature. The proposed algorithm is able to handle various types of modelling problems with high accuracy.
Originality/value – A new method is employed to optimize the initial states of the WRLS algorithm in the two phases of the learning algorithm.
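A single weighted-RLS step, with the Takagi-Sugeno rule's normalized membership as the weight, takes the following generic form; the EPSO part (searching over the initial parameter vector and the scale of the initial covariance) is indicated only in a comment, and all names and values are illustrative.

```python
import numpy as np

def wrls_step(theta, P, phi, y, w, lam=1.0):
    """One weighted-RLS step; w is the normalized membership weight of the
    Takagi-Sugeno rule whose local linear model is being identified."""
    k = P @ phi * w / (lam + w * (phi @ P @ phi))
    theta = theta + k * (y - phi @ theta)
    P = (P - np.outer(k, phi @ P)) / lam
    return theta, P

# EPSO (per the paper) searches over the initial states, e.g. theta0 and the
# scale delta of P0 = delta * I, scoring each particle by modelling error;
# the swarm loop itself is omitted here. Values below are illustrative.
theta0, delta = np.zeros(4), 1e3
```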


2013 ◽  
Vol 18 (5) ◽  
pp. 959-984 ◽  
Author(s):  
Atanas Christev ◽  
Sergey Slobodyan

If private-sector agents update their beliefs with a learning algorithm other than recursive least squares (RLS), expectational stability or learnability of rational expectations equilibria (REE) is not guaranteed. Monetary policy under commitment, with a determinate and E-stable REE, may not imply robust learning stability of such equilibria if the RLS speed of convergence is slow. In this paper, we propose a refinement of E-stability conditions that allows us to select equilibria that are more robust to the specification of the learning algorithm within the class of RLS, stochastic gradient (SG), and generalized SG (GSG) algorithms. E-stable equilibria characterized by a faster speed of convergence under RLS learning are learnable with SG or GSG algorithms as well.
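The learning algorithms in the RLS/SG/GSG class referred to here follow standard recursions from the adaptive-learning literature; the sketch below uses a decreasing gain 1/t for RLS and a constant gain for SG, with names chosen for illustration.

```python
import numpy as np

def rls_beliefs(phi, R, x, y, t):
    """RLS belief update with decreasing gain 1/t (t = 1, 2, ...):
    R tracks the second moments of the regressors x, phi the beliefs."""
    R = R + (1.0 / t) * (np.outer(x, x) - R)
    phi = phi + (1.0 / t) * np.linalg.solve(R, x) * (y - x @ phi)
    return phi, R

def sg_beliefs(phi, x, y, gamma=0.01):
    """SG belief update with constant gain; a GSG variant premultiplies
    the correction by a fixed positive-definite matrix."""
    return phi + gamma * x * (y - x @ phi)
```

The two recursions differ only in how the correction is scaled, which is precisely why the speed of convergence under RLS governs whether SG and GSG learners also converge to the same equilibrium.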

