Anticipatory Classifier System with Average Reward Criterion in Discretized Multi-Step Environments

2021 ◽  
Vol 11 (3) ◽  
pp. 1098
Author(s):  
Norbert Kozłowski ◽  
Olgierd Unold

Initially, Anticipatory Classifier Systems (ACS) were designed to address both single-step and multistep decision problems. In the latter case, the objective was to maximize the total discounted reward, usually with algorithms based on Q-learning. Studies on other Learning Classifier Systems (LCS) revealed many real-world sequential decision problems where the preferred objective is to maximize the average of successive rewards. This paper proposes a corresponding modification to the learning component that allows such problems to be addressed. The modified system is called AACS2 (Averaged ACS2) and is tested on three multistep benchmark problems.
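
The difference between the two objectives mentioned above can be sketched in a few lines of Python. This is a generic tabular illustration, not the AACS2 update from the paper: the discounted step is standard Q-learning, and the average-reward step follows the well-known R-learning scheme. The function names, the `rho_rate` parameter, and the one-state demo environment are assumptions made for the example.

```python
from collections import defaultdict


def discounted_update(q, s, a, r, s_next, actions, beta=0.1, gamma=0.95):
    """One Q-learning step: the target discounts the best next value by gamma."""
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += beta * (target - q[(s, a)])


def average_reward_update(q, rho, s, a, r, s_next, actions,
                          beta=0.1, rho_rate=0.01):
    """One R-learning-style step (hypothetical helper, not the AACS2 rule).

    Instead of discounting, the estimated average reward rho is subtracted
    from the immediate reward. Returns the updated rho.
    """
    best_here = max(q[(s, b)] for b in actions)
    best_next = max(q[(s_next, b)] for b in actions)
    was_greedy = q[(s, a)] == best_here  # checked before the value changes
    q[(s, a)] += beta * (r - rho + best_next - q[(s, a)])
    if was_greedy:
        # rho is adjusted only after greedy actions, as in R-learning.
        rho += rho_rate * (r - rho + best_next - best_here)
    return rho


def demo(steps=2000):
    """Toy one-state, one-action loop that pays reward 1.0 on every step."""
    q = defaultdict(float)
    rho = 0.0
    for _ in range(steps):
        rho = average_reward_update(q, rho, "s0", "go", 1.0, "s0", ["go"])
    return rho


if __name__ == "__main__":
    print(demo())  # the estimate approaches the true average reward of 1.0
```

In the toy loop every step pays the same reward, so the average-reward estimate `rho` converges to that reward, while a discounted value would instead converge to 1 / (1 - gamma).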

2015 ◽  
Vol 52 (2) ◽  
pp. 419-440
Author(s):  
Rolando Cavazos-Cadena ◽  
Raúl Montes-De-Oca ◽  
Karel Sladký

This paper concerns discrete-time Markov decision chains with denumerable state and compact action sets. Besides standard continuity requirements, the main assumption on the model is that it admits a Lyapunov function ℓ. In this context, the average reward criterion is analyzed from the sample-path point of view. The main conclusion is that if the expected average reward associated with ℓ² is finite under any policy, then a stationary policy obtained from the optimality equation in the standard way is sample-path average optimal in a strong sense.
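
For orientation, one common way to write the objects named in this abstract is given below. The notation is an assumption made for illustration and is not taken from the paper: X_t and A_t are the state and action at time t, R is the reward function, p_{xy}(a) the transition law, g the optimal average reward, h a relative value function, and f* a stationary policy whose actions attain the supremum in the optimality equation.

```latex
% Assumed notation; the paper's own definitions may differ in detail.
\begin{align*}
  % Average-reward optimality equation
  g + h(x) &= \sup_{a \in A(x)} \Big[ R(x,a) + \sum_{y} p_{xy}(a)\, h(y) \Big], \\
  % Sample-path optimality of the stationary policy f^*
  \lim_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} R(X_t, A_t) &= g
      \quad P_x^{f^*}\text{-almost surely}, \\
  % No policy does better along almost every trajectory
  \limsup_{n \to \infty} \frac{1}{n} \sum_{t=0}^{n-1} R(X_t, A_t) &\le g
      \quad P_x^{\pi}\text{-almost surely, for every policy } \pi .
\end{align*}
```

The point of the sample-path view is that the comparison holds along almost every realized trajectory, not only in expectation.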


Author(s):  
Atsushi Wada ◽  
Keiki Takadama

Learning Classifier Systems (LCSs) are rule-based adaptive systems that have both Reinforcement Learning (RL) and rule-discovery mechanisms for effective and practical online learning. With the aim of establishing a common theoretical basis between LCSs and RL algorithms so that each field can share the other's findings, a detailed analysis was performed to compare the learning processes of these two approaches. Building on our previous work deriving an equivalence between the Zeroth-level Classifier System (ZCS) and Q-learning with Function Approximation (FA), this paper extends the analysis to the effect of actually applying the conditions required for this equivalence. Comparative experiments revealed two implications: (1) ZCS's original parameter, the deduction rate, plays a role in stabilizing the action selection, but (2) from the RL perspective, such a process inhibits the ability to accurately estimate values for the entire state-action space, thus limiting the performance of ZCS in problems requiring accurate value estimation.


Author(s):  
Atsushi Wada ◽  
Keiki Takadama

Learning Classifier Systems (LCSs) are rule-based adaptive systems that have both Reinforcement Learning (RL) and rule-discovery mechanisms for effective and practical online learning. The reinforcement process of XCS, one of the current mainstream LCSs, is analyzed from an RL perspective. Comparing XCS's update method with gradient-descent-based parameter updates in RL reveals differences in the following elements: (1) residual term, (2) gradient term, and (3) payoff definition. All possible combinations of the variants of each element are implemented and tested on multi-step benchmark problems. The results show that a few specific combinations work effectively with XCS's accuracy-based rule-discovery process, while the pure gradient-descent-based update showed the worst performance.


1999 ◽  
Vol 30 (7-8) ◽  
pp. 7-20
Author(s):  
M. Kurano ◽  
M. Yasuda ◽  
J.-I. Nakagami ◽  
Y. Yoshida

1996 ◽  
Vol 28 (4) ◽  
pp. 1123-1144
Author(s):  
K. D. Glazebrook

A single machine is available to process a collection of jobs J, each of which evolves stochastically under processing. Jobs incur costs while awaiting the machine at a state-dependent rate, and processing must respect a set of precedence constraints Γ. Index policies are optimal in a variety of scenarios. The indices concerned are characterised as values of restart problems with the average reward criterion. This characterisation yields a range of efficient approaches to their computation. Index-based suboptimality bounds are derived for general processing policies. These bounds enable us to develop sensitivity analyses and to evaluate scheduling heuristics.

