Efficient Reinforcement Learning Using Recursive Least-Squares Methods

2002 ◽  
Vol 16 ◽  
pp. 259-292 ◽  
Author(s):  
X. Xu ◽  
H. He ◽  
D. Hu

The recursive least-squares (RLS) algorithm is one of the best-known algorithms in adaptive filtering, system identification, and adaptive control. Its popularity is mainly due to its fast convergence speed, which is considered optimal in practice. In this paper, RLS methods are used to solve reinforcement learning problems, and two new reinforcement learning algorithms using linear value-function approximators are proposed and analyzed: RLS-TD(λ) and Fast-AHC (Fast Adaptive Heuristic Critic). RLS-TD(λ) can be viewed as the extension of RLS-TD(0) from λ=0 to general λ in the interval [0,1], making it a multi-step temporal-difference (TD) learning algorithm based on RLS methods. Convergence with probability one, and the limit of convergence, of RLS-TD(λ) are proved for ergodic Markov chains. Compared with the existing LS-TD(λ) algorithm, RLS-TD(λ) has computational advantages and is better suited to online learning. The effectiveness of RLS-TD(λ) is analyzed and verified in learning-prediction experiments on Markov chains with a wide range of parameter settings. The Fast-AHC algorithm is derived by applying the proposed RLS-TD(λ) algorithm in the critic network of the adaptive heuristic critic (AHC) method. Unlike the conventional AHC algorithm, Fast-AHC uses RLS methods to improve the learning-prediction efficiency of the critic. Learning-control experiments on the cart-pole balancing and acrobot swing-up problems compare the data efficiency of Fast-AHC with that of conventional AHC. The results show that the data efficiency of learning control can also be improved by using RLS methods in the learning-prediction process of the critic. The performance of Fast-AHC is also compared with that of the AHC method using LS-TD(λ). Furthermore, the experiments demonstrate that different initial values of the variance matrix in RLS-TD(λ) are required to obtain good performance not only in learning prediction but also in learning control. The experimental results are analyzed in light of existing theoretical work on the transient phase of forgetting-factor RLS methods.
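For concreteness, the per-step form of the RLS-TD(λ) update for a linear value function V(s) = φ(s)ᵀθ can be sketched as follows. This is a minimal illustration of the recursion described above; the variable names, parameter values, and the initialization P₀ = δI are assumptions for the example, not the paper's exact code.

```python
import numpy as np

def rls_td_lambda_step(theta, P, z, phi_t, phi_next, reward,
                       gamma=0.95, lam=0.7, mu=1.0):
    """One RLS-TD(lambda) step for a linear value function V(s) = phi(s)^T theta.

    theta : (n,) parameter vector        P  : (n, n) variance matrix
    z     : (n,) eligibility trace       mu : forgetting factor (1 = none)
    """
    z = gamma * lam * z + phi_t               # accumulate eligibility trace
    d = phi_t - gamma * phi_next              # temporal-difference regressor
    k = P @ z / (mu + d @ P @ z)              # RLS gain
    theta = theta + k * (reward - d @ theta)  # least-squares TD update
    P = (P - np.outer(k, d @ P)) / mu         # variance-matrix update
    return theta, P, z

# Illustrative initialization; per the experiments above, the scale delta of
# P0 = delta * I must be chosen with care.
n, delta = 8, 100.0
theta, P, z = np.zeros(n), delta * np.eye(n), np.zeros(n)
```

The scale δ in P₀ = δI is exactly the initial variance-matrix setting whose influence the experiments investigate.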

2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Chunyuan Zhang ◽  
Qi Song ◽  
Zeng Meng

The deep Q-network (DQN) is one of the most successful reinforcement learning algorithms, but it suffers from drawbacks such as slow convergence and instability. In contrast, traditional reinforcement learning algorithms with linear function approximation usually converge faster and are more stable, although they easily suffer from the curse of dimensionality. Many improvements to DQN have been made in recent years, but they seldom exploit the advantages of traditional algorithms. In this paper, we propose a novel Q-learning algorithm with linear function approximation, called minibatch recursive least-squares Q-learning (MRLS-Q). Unlike the traditional Q-learning algorithm with linear function approximation, the learning mechanism and model structure of MRLS-Q resemble those of a DQN with only one input layer and one linear output layer. It uses experience replay and minibatch training, and takes the agent's states rather than its state-action pairs as inputs. As a result, it can be used alone for low-dimensional problems and can also be seamlessly integrated into a DQN as its last layer for high-dimensional problems. In addition, MRLS-Q uses our proposed average-RLS optimization technique, so that it achieves better convergence performance whether used alone or integrated with a DQN. Finally, we demonstrate the effectiveness of MRLS-Q on the CartPole problem and four Atari games and experimentally investigate the influence of its hyperparameters.
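The paper's exact average-RLS rule is not reproduced here, but a hedged sketch of a minibatch RLS update for a linear output layer Q(s, a) = (Ws)[a], trained on experience-replay targets, illustrates the general mechanism; all names and the shared inverse-correlation matrix P are illustrative assumptions.

```python
import numpy as np

def minibatch_rls_q_update(W, P, states, actions, targets, mu=1.0):
    """Block-RLS update of a linear Q layer, Q(s, a) = (W @ s)[a].

    W : (n_actions, n_feat) weights      P : (n_feat, n_feat) shared matrix
    states  : (B, n_feat) replayed state features
    targets : (B,) Q-learning targets r + gamma * max_a' Q(s', a')
    """
    B = states.shape[0]
    # Woodbury-style block update of P over the whole minibatch.
    S = states @ P @ states.T + mu * np.eye(B)
    K = P @ states.T @ np.linalg.inv(S)            # (n_feat, B) gain
    P = (P - K @ states @ P) / mu
    # Correct only the weight row of each sampled action.
    errors = targets - np.einsum('bf,bf->b', W[actions], states)
    for i in range(B):
        W[actions[i]] += K[:, i] * errors[i]
    return W, P
```

Used as a DQN's last layer, `states` would be the penultimate-layer activations rather than raw observations.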


2020 ◽  
Vol 9 (2) ◽  
pp. e188922128
Author(s):  
Fábio Nogueira da Silva ◽  
João Viana Fonseca Neto

A heuristic is presented for tuning, and analyzing the convergence of, a reinforcement learning algorithm for output-feedback control that uses only input/output data generated by a model. To enable the convergence analysis, the parameters of the algorithms used for data generation must be adjusted and the control problem solved iteratively. The proposed heuristic adjusts the data-generator parameters, creating surfaces that assist in analyzing the convergence and robustness of the online optimal-control methodology. The algorithm tested is the discrete linear quadratic regulator (DLQR) with output feedback, based on reinforcement learning via temporal-difference learning in a policy-iteration scheme that determines the optimal policy from input/output data alone. Within the policy-iteration algorithm, recursive least squares (RLS) is used to estimate online the parameters associated with the output-feedback DLQR. After applying the proposed tuning heuristic, the influence of the parameters can be seen clearly, and the convergence analysis is facilitated.
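As an illustration of the RLS-based policy-evaluation step, the Q-function can be parameterized as a quadratic form Q(z) = zᵀHz in a data vector z that stacks a window of past inputs and outputs (output feedback), with its parameters estimated by RLS from the Bellman equation Q(z_t) = cost_t + Q(z_{t+1}). The sketch below is a generic rendering under these assumptions, not the authors' exact implementation.

```python
import numpy as np

def quad_features(z):
    """Upper-triangular basis of the quadratic form Q(z) = z^T H z."""
    i, j = np.triu_indices(len(z))
    scale = np.where(i == j, 1.0, 2.0)      # fold symmetric off-diagonals
    return scale * np.outer(z, z)[i, j]

def rls_policy_evaluation_step(theta, P, z_t, z_next, cost, mu=1.0):
    """One RLS step on the Bellman equation Q(z_t) = cost + Q(z_{t+1})."""
    phi = quad_features(z_t) - quad_features(z_next)
    k = P @ phi / (mu + phi @ P @ phi)
    theta = theta + k * (cost - phi @ theta)
    P = (P - np.outer(k, phi @ P)) / mu
    return theta, P
```

After policy evaluation converges, the policy-improvement step recovers the feedback gain from the blocks of H; the tuning heuristic's surfaces are swept over parameters such as mu and the excitation of the data generator.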


2009 ◽  
Vol 3 (6) ◽  
pp. 671-680 ◽  
Author(s):  
Tetsuya Morizono ◽  
Yoji Yamada ◽  
Masatake Higashi ◽  
...

Controlling the “feel” of operating a power-assist robot is important for improving robot operability, user satisfaction, and task-performance efficiency. Autonomous adjustment of this “feel” is considered for robots under impedance control, and reinforcement learning of the adjustment is discussed for tasks that include repetitive positioning. Experimental results demonstrate that an operational “feel” pattern appropriate for positioning at a goal is developed through the adjustment. The adjustment, which initially assumes a single fixed goal, is then extended to cases with multiple goals, where one goal is assumed to be chosen by the user in real time. To adjust the operational “feel” to individual goals, an algorithm infers the chosen goal. Experiments yield the same result as in the single-fixed-goal case, but they also suggest that the design must be improved so that the adjustment learning algorithm takes the accuracy of goal inference into account.
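The “feel” being adjusted can be pictured as the virtual spring-damper behaviour of the impedance controller. The sketch below is a generic illustration only (the parameter values, goal labels, and selection scheme are hypothetical): the learning algorithm adjusts the stiffness K and damping D per goal, and the inference module selects the active goal online.

```python
def assist_force(x, x_dot, goal, K, D):
    """Virtual spring-damper 'feel' of the impedance-controlled assist."""
    return K * (goal - x) - D * x_dot

# Hypothetical per-goal impedance parameters: tuned by the learning
# algorithm and selected online by the goal-inference module.
params = {"goal_a": (120.0, 25.0), "goal_b": (80.0, 15.0)}  # (K, D)
K, D = params["goal_a"]
force = assist_force(x=0.10, x_dot=0.02, goal=0.30, K=K, D=D)
```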


2010 ◽  
Vol 2 (2) ◽  
pp. 173-182 ◽  
Author(s):  
Lauri Anttila ◽  
Peter Händel ◽  
Olli Mylläri ◽  
Mikko Valkama

The main implementation impairments degrading the performance of direct-conversion radio transmitters are in-phase/quadrature (I/Q) mismatch, local-oscillator (LO) leakage, and power-amplifier (PA) nonlinear distortion. In this article, we propose a recursive least-squares-based learning algorithm for joint digital predistortion (PD) of frequency-dependent PA and I/Q modulator impairments. The predistorter is composed of a parallel connection of two parallel Hammerstein (PH) predistorters and an LO-leakage compensator, yielding a predistorter that as a whole is fully linear in its parameters. In the parameter-estimation stage, a feedback signal from the transmitter radio-frequency (RF) stage back to the digital parts is deployed, combined with the indirect learning architecture and recursive least-squares training. The proposed structure is among the first techniques to explicitly consider the joint estimation and mitigation of frequency-dependent PA and I/Q modulator impairments. Extensive simulation and measurement analyses verify the operation and efficiency of the proposed PD technique. Overall, the obtained results demonstrate linearization and I/Q modulator calibration performance clearly exceeding that of current state-of-the-art reference techniques.
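A simplified, memoryless rendering of such a linear-in-parameters predistorter basis (a PH branch, a conjugate branch for I/Q mismatch, and a constant LO-leakage term), together with a standard complex RLS fit, is sketched below. The article's actual predistorter adds FIR memory per branch, so this is an illustrative reduction, not the authors' implementation.

```python
import numpy as np

def pd_basis(x, order=3):
    """Linear-in-parameters basis: PH branch, conjugate PH branch (I/Q
    mismatch), and a constant LO-leakage term. Memoryless for brevity."""
    cols = [x * np.abs(x) ** (2 * p) for p in range(order)]
    cols += [np.conj(x) * np.abs(x) ** (2 * p) for p in range(order)]
    cols.append(np.ones_like(x))
    return np.stack(cols, axis=-1)

def rls_fit(Phi, d, lam=0.99, delta=100.0):
    """Exponentially weighted complex RLS fit of d ~ Phi @ theta."""
    n = Phi.shape[1]
    theta = np.zeros(n, dtype=complex)
    P = delta * np.eye(n, dtype=complex)
    for phi, y in zip(Phi, d):
        k = P @ phi.conj() / (lam + phi @ P @ phi.conj())
        theta = theta + k * (y - phi @ theta)
        P = (P - np.outer(k, phi @ P)) / lam
    return theta

# Indirect learning: fit a postdistorter from the gain-normalized PA output
# back to the PA input, then copy it in front of the PA as the predistorter.
```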


2015 ◽  
Vol 12 (03) ◽  
pp. 1550028 ◽  
Author(s):  
Rok Vuga ◽  
Bojan Nemec ◽  
Aleš Ude

In this paper, we propose an integrated policy learning framework that fuses iterative learning control (ILC) and reinforcement learning. Integration is accomplished at the exploration level of the reinforcement learning algorithm. The proposed algorithm combines fast convergence properties of iterative learning control and robustness of reinforcement learning. This way, the advantages of both approaches are retained while overcoming their respective limitations. The proposed approach was verified in simulation and in real robot experiments on three challenging motion optimization problems.
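One plausible reading of integration "at the exploration level" is that exploration is centred on the ILC-corrected trajectory while the RL side performs reward-weighted averaging over the explored rollouts. The sketch below illustrates this under assumed names and a simple Gaussian exploration model; it is not the authors' exact algorithm.

```python
import numpy as np

def propose_rollouts(u, e, n_rollouts, L=0.5, sigma=0.05, seed=0):
    """Exploration centred on the ILC-corrected command u + L * e."""
    rng = np.random.default_rng(seed)
    u_ilc = u + L * e                      # classical ILC learning update
    return [u_ilc + sigma * rng.standard_normal(len(u))
            for _ in range(n_rollouts)]

def reward_weighted_update(rollouts, returns):
    """RL side: reward-weighted averaging over the explored rollouts."""
    returns = np.asarray(returns, dtype=float)
    w = np.exp(returns - returns.max())    # softmax weights, numerically safe
    w /= w.sum()
    return sum(wi * ri for wi, ri in zip(w, rollouts))
```

The ILC term supplies the fast, model-free error correction, while the reward weighting keeps the update robust when the tracking error is a poor proxy for the true task cost.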


Kybernetes ◽  
2013 ◽  
Vol 42 (2) ◽  
pp. 268-281 ◽  
Author(s):  
Moêz Soltani ◽  
Abdelkader Chaari

Purpose – The purpose of this paper is to present a new methodology for identifying the parameters of local linear Takagi‐Sugeno fuzzy models using weighted recursive least squares (WRLS). WRLS is sensitive to initialization, which can lead to non-convergence. To overcome this problem, Euclidean particle swarm optimization (EPSO) is employed to optimize the initial states of WRLS. Finally, validation results, together with a comparative study, are given to demonstrate the effectiveness and accuracy of the proposed algorithm. Validation results involving simulations of numerical examples and a liquid-level process demonstrate the practicality of the algorithm.
Design/methodology/approach – A new method for nonlinear system modelling is proposed, in which EPSO is employed to optimize the initial states of the WRLS algorithm in the two phases of the learning algorithm.
Findings – The results obtained using this approach are comparable with other modelling approaches reported in the literature. The proposed algorithm is able to handle various types of modelling problems with high accuracy.
Originality/value – A new method is employed to optimize the initial states of the WRLS algorithm in the two phases of the learning algorithm.
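A single weighted-RLS step, with the Takagi-Sugeno rule's normalized membership as the weight, takes the following generic form; the EPSO part (searching over the initial parameter vector and the scale of the initial covariance) is indicated only in a comment, and all names and values are illustrative.

```python
import numpy as np

def wrls_step(theta, P, phi, y, w, lam=1.0):
    """One weighted-RLS step; w is the normalized membership weight of the
    Takagi-Sugeno rule whose local linear model is being identified."""
    k = P @ phi * w / (lam + w * (phi @ P @ phi))
    theta = theta + k * (y - phi @ theta)
    P = (P - np.outer(k, phi @ P)) / lam
    return theta, P

# EPSO (per the paper) searches over the initial states, e.g. theta0 and the
# scale delta of P0 = delta * I, scoring each particle by modelling error;
# the swarm loop itself is omitted here. Values below are illustrative.
theta0, delta = np.zeros(4), 1e3
```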


2013 ◽  
Vol 18 (5) ◽  
pp. 959-984 ◽  
Author(s):  
Atanas Christev ◽  
Sergey Slobodyan

If private-sector agents update their beliefs with a learning algorithm other than recursive least squares (RLS), expectational stability or learnability of rational expectations equilibria (REE) is not guaranteed. Monetary policy under commitment, with a determinate and E-stable REE, may not imply robust learning stability of such equilibria if the RLS speed of convergence is slow. In this paper, we propose a refinement of E-stability conditions that allows us to select equilibria that are more robust to the specification of the learning algorithm within the class of RLS, stochastic gradient (SG), and generalized SG (GSG) algorithms. E-stable equilibria characterized by a faster speed of convergence under RLS learning are learnable with SG or GSG algorithms as well.
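The learning algorithms in the RLS/SG/GSG class referred to here follow standard recursions from the adaptive-learning literature; the sketch below uses a decreasing gain 1/t for RLS and a constant gain for SG, with names chosen for illustration.

```python
import numpy as np

def rls_beliefs(phi, R, x, y, t):
    """RLS belief update with decreasing gain 1/t (t = 1, 2, ...):
    R tracks the second moments of the regressors x, phi the beliefs."""
    R = R + (1.0 / t) * (np.outer(x, x) - R)
    phi = phi + (1.0 / t) * np.linalg.solve(R, x) * (y - x @ phi)
    return phi, R

def sg_beliefs(phi, x, y, gamma=0.01):
    """SG belief update with constant gain; a GSG variant premultiplies
    the correction by a fixed positive-definite matrix."""
    return phi + gamma * x * (y - x @ phi)
```

The two recursions differ only in how the correction is scaled, which is precisely why the speed of convergence under RLS governs whether SG and GSG learners also converge to the same equilibrium.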

