scholarly journals Reward-predictive representations generalize across tasks in reinforcement learning

2019 ◽  
Author(s):  
Lucas Lehnert ◽  
Michael L. Littman ◽  
Michael J. Frank

AbstractIn computer science, reinforcement learning is a powerful framework with which artificial agents can learn to maximize their performance for any given Markov decision process (MDP). Advances over the last decade, in combination with deep neural networks, have enjoyed performance advantages over humans in many difficult task settings. However, such frameworks perform far less favorably when evaluated in their ability to generalize or transfer representations across different tasks. Existing algorithms that facilitate transfer typically are limited to cases in which the transition function or the optimal policy is portable to new contexts, but achieving “deep transfer” characteristic of human behavior has been elusive. Such transfer typically requires discovery of abstractions that permit analogical reuse of previously learned representations to superficially distinct tasks. Here, we demonstrate that abstractions that minimize error in predictions of reward outcomes generalize across tasks with different transition and reward functions. Such reward-predictive representations compress the state space of a task into a lower dimensional representation by combining states that are equivalent in terms of both the transition and reward functions. Because only state equivalences are considered, the resulting state representation is not tied to the transition and reward functions themselves and thus generalizes across tasks with different reward and transition functions. These results contrast with those using abstractions that myopically maximize reward in any given MDP and motivate further experiments in humans and animals to investigate if neural and cognitive systems involved in state representation perform abstractions that facilitate such equivalence relations.Author summaryHumans are capable of transferring abstract knowledge from one task to another. For example, in a right-hand-drive country, a driver has to use the right arm to operate the shifter. A driver who learned how to drive in a right-hand-drive country can adapt to operating a left-hand-drive car and use the other arm for shifting instead of re-learning how to drive. Despite the fact that both tasks require different coordination of motor skills, both tasks are the same in an abstract sense: In both tasks, a car is operated and there is the same progression from 1st to 2nd gear and so on. We study distinct algorithms by which a reinforcement learning agent can discover state representations that encode knowledge about a particular task, and evaluate how well they can generalize. Through a sequence of simulation results, we show that state abstractions that minimize errors in prediction about future reward outcomes generalize across tasks, even those that superficially differ in both the goals (rewards) and the transitions from one state to the next. This work motivates biological studies to determine if distinct circuits are adapted to maximize reward vs. to discover useful state representations.

2017 ◽  
Author(s):  
Nicholas Franklin ◽  
Michael J. Frank

AbstractHumans are remarkably adept at generalizing knowledge between experiences in a way that can be difficult for computers. Often, this entails generalizing constituent pieces of experiences that do not fully overlap, but nonetheless share useful similarities with, previously acquired knowledge. However, it is often unclear how knowledge gained in one context should generalize to another. Previous computational models and data suggest that rather than learning about each individual context, humans build latent abstract structures and learn to link these structures to arbitrary contexts, facilitating generalization. In these models, task structures that are more popular across contexts are more likely to be revisited in new contexts. However, these models can only re-use policies as a whole and are unable to transfer knowledge about the transition structure of the environment even if only the goal has changed (or vice-versa). This contrasts with ecological settings, where some aspects of task structure, such as the transition function, will be shared between context separately from other aspects, such as the reward function. Here, we develop a novel non-parametric Bayesian agent that forms independent latent clusters for transition and reward functions, affording separable transfer of their constituent parts across contexts. We show that the relative performance of this agent compared to an agent that jointly clusters reward and transition functions depends environmental task statistics: the mutual information between transition and reward functions and the stochasticity of the observations. We formalize our analysis through an information theoretic account of the priors, and propose a meta learning agent that dynamically arbitrates between strategies across task domains to optimize a statistical tradeoff.Author summaryA musician may learn to generalize behaviors across instruments for different purposes, for example, reusing hand motions used when playing classical on the flute to play jazz on the saxophone. Conversely, she may learn to play a single song across many instruments that require completely distinct physical motions, but nonetheless transfer knowledge between them. This degree of compositionality is often absent from computational frameworks of learning, forcing agents either to generalize entire learned policies or to learn new policies from scratch. Here, we propose a solution to this problem that allows an agent to generalize components of a policy independently and compare it to an agent that generalizes components as a whole. We show that the degree to which one form of generalization is favored over the other is dependent on the features of task domain, with independent generalization of task components favored in environments with weak relationships between components or high degrees of noise and joint generalization of task components favored when there is a clear, discoverable relationship between task components. Furthermore, we show that the overall meta structure of the environment can be learned and leveraged by an agent that dynamically arbitrates between these forms of structure learning.


Author(s):  
Alessandro Ronca ◽  
Giuseppe De Giacomo

Recently regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and it reasonably captures the difficulty of a regular decision process.


Author(s):  
N. Botteghi ◽  
R. Schulte ◽  
B. Sirmacek ◽  
M. Poel ◽  
C. Brune

Abstract. Autonomously exploring and mapping is one of the open challenges of robotics and artificial intelligence. Especially when the environments are unknown, choosing the optimal navigation directive is not straightforward. In this paper, we propose a reinforcement learning framework for navigating, exploring, and mapping unknown environments. The reinforcement learning agent is in charge of selecting the commands for steering the mobile robot, while a SLAM algorithm estimates the robot pose and maps the environments. The agent, to select optimal actions, is trained to be curious about the world. This concept translates into the introduction of a curiosity-driven reward function that encourages the agent to steer the mobile robot towards unknown and unseen areas of the world and the map. We test our approach in explorations challenges in different indoor environments. The agent trained with the proposed reward function outperforms the agents trained with reward functions commonly used in the literature for solving such tasks.


Biomimetics ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 13
Author(s):  
Adam Bignold ◽  
Francisco Cruz ◽  
Richard Dazeley ◽  
Peter Vamplew ◽  
Cameron Foale

Interactive reinforcement learning methods utilise an external information source to evaluate decisions and accelerate learning. Previous work has shown that human advice could significantly improve learning agents’ performance. When evaluating reinforcement learning algorithms, it is common to repeat experiments as parameters are altered or to gain a sufficient sample size. In this regard, to require human interaction every time an experiment is restarted is undesirable, particularly when the expense in doing so can be considerable. Additionally, reusing the same people for the experiment introduces bias, as they will learn the behaviour of the agent and the dynamics of the environment. This paper presents a methodology for evaluating interactive reinforcement learning agents by employing simulated users. Simulated users allow human knowledge, bias, and interaction to be simulated. The use of simulated users allows the development and testing of reinforcement learning agents, and can provide indicative results of agent performance under defined human constraints. While simulated users are no replacement for actual humans, they do offer an affordable and fast alternative for evaluative assisted agents. We introduce a method for performing a preliminary evaluation utilising simulated users to show how performance changes depending on the type of user assisting the agent. Moreover, we describe how human interaction may be simulated, and present an experiment illustrating the applicability of simulating users in evaluating agent performance when assisted by different types of trainers. Experimental results show that the use of this methodology allows for greater insight into the performance of interactive reinforcement learning agents when advised by different users. The use of simulated users with varying characteristics allows for evaluation of the impact of those characteristics on the behaviour of the learning agent.


2021 ◽  
Vol 2 (1) ◽  
pp. 1-25
Author(s):  
Yongsen Ma ◽  
Sheheryar Arshad ◽  
Swetha Muniraju ◽  
Eric Torkildson ◽  
Enrico Rantala ◽  
...  

In recent years, Channel State Information (CSI) measured by WiFi is widely used for human activity recognition. In this article, we propose a deep learning design for location- and person-independent activity recognition with WiFi. The proposed design consists of three Deep Neural Networks (DNNs): a 2D Convolutional Neural Network (CNN) as the recognition algorithm, a 1D CNN as the state machine, and a reinforcement learning agent for neural architecture search. The recognition algorithm learns location- and person-independent features from different perspectives of CSI data. The state machine learns temporal dependency information from history classification results. The reinforcement learning agent optimizes the neural architecture of the recognition algorithm using a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM). The proposed design is evaluated in a lab environment with different WiFi device locations, antenna orientations, sitting/standing/walking locations/orientations, and multiple persons. The proposed design has 97% average accuracy when testing devices and persons are not seen during training. The proposed design is also evaluated by two public datasets with accuracy of 80% and 83%. The proposed design needs very little human efforts for ground truth labeling, feature engineering, signal processing, and tuning of learning parameters and hyperparameters.


Algorithms ◽  
2021 ◽  
Vol 14 (1) ◽  
pp. 26
Author(s):  
Yiran Xue ◽  
Rui Wu ◽  
Jiafeng Liu ◽  
Xianglong Tang

Existing crowd evacuation guidance systems require the manual design of models and input parameters, incurring a significant workload and a potential for errors. This paper proposed an end-to-end intelligent evacuation guidance method based on deep reinforcement learning, and designed an interactive simulation environment based on the social force model. The agent could automatically learn a scene model and path planning strategy with only scene images as input, and directly output dynamic signage information. Aiming to solve the “dimension disaster” phenomenon of the deep Q network (DQN) algorithm in crowd evacuation, this paper proposed a combined action-space DQN (CA-DQN) algorithm that grouped Q network output layer nodes according to action dimensions, which significantly reduced the network complexity and improved system practicality in complex scenes. In this paper, the evacuation guidance system is defined as a reinforcement learning agent and implemented by the CA-DQN method, which provides a novel approach for the evacuation guidance problem. The experiments demonstrate that the proposed method is superior to the static guidance method, and on par with the manually designed model method.


2021 ◽  
Vol 7 (1) ◽  
Author(s):  
Pankaj Rajak ◽  
Aravind Krishnamoorthy ◽  
Ankit Mishra ◽  
Rajiv Kalia ◽  
Aiichiro Nakano ◽  
...  

AbstractPredictive materials synthesis is the primary bottleneck in realizing functional and quantum materials. Strategies for synthesis of promising materials are currently identified by time-consuming trial and error and there are no known predictive schemes to design synthesis parameters for materials. We use offline reinforcement learning (RL) to predict optimal synthesis schedules, i.e., a time-sequence of reaction conditions like temperatures and concentrations, for the synthesis of semiconducting monolayer MoS2 using chemical vapor deposition. The RL agent, trained on 10,000 computational synthesis simulations, learned threshold temperatures and chemical potentials for onset of chemical reactions and predicted previously unknown synthesis schedules that produce well-sulfidized crystalline, phase-pure MoS2. The model can be extended to multi-task objectives such as predicting profiles for synthesis of complex structures including multi-phase heterostructures and can predict long-time behavior of reacting systems, far beyond the domain of molecular dynamics simulations, making these predictions directly relevant to experimental synthesis.


Symmetry ◽  
2020 ◽  
Vol 12 (4) ◽  
pp. 631
Author(s):  
Chunyang Hu

In this paper, deep reinforcement learning (DRL) and knowledge transfer are used to achieve the effective control of the learning agent for the confrontation in the multi-agent systems. Firstly, a multi-agent Deep Deterministic Policy Gradient (DDPG) algorithm with parameter sharing is proposed to achieve confrontation decision-making of multi-agent. In the process of training, the information of other agents is introduced to the critic network to improve the strategy of confrontation. The parameter sharing mechanism can reduce the loss of experience storage. In the DDPG algorithm, we use four neural networks to generate real-time action and Q-value function respectively and use a momentum mechanism to optimize the training process to accelerate the convergence rate for the neural network. Secondly, this paper introduces an auxiliary controller using a policy-based reinforcement learning (RL) method to achieve the assistant decision-making for the game agent. In addition, an effective reward function is used to help agents balance losses of enemies and our side. Furthermore, this paper also uses the knowledge transfer method to extend the learning model to more complex scenes and improve the generalization of the proposed confrontation model. Two confrontation decision-making experiments are designed to verify the effectiveness of the proposed method. In a small-scale task scenario, the trained agent can successfully learn to fight with the competitors and achieve a good winning rate. For large-scale confrontation scenarios, the knowledge transfer method can gradually improve the decision-making level of the learning agent.


Author(s):  
Nancy Fulda ◽  
Daniel Ricks ◽  
Ben Murdoch ◽  
David Wingate

Autonomous agents must often detect affordances: the set of behaviors enabled by a situation. Affordance extraction is particularly helpful in domains with large action spaces, allowing the agent to prune its search space by avoiding futile behaviors. This paper presents a method for affordance extraction via word embeddings trained on a tagged Wikipedia corpus. The resulting word vectors are treated as a common knowledge database which can be queried using linear algebra. We apply this method to a reinforcement learning agent in a text-only environment and show that affordance-based action selection improves performance in most cases. Our method increases the computational complexity of each learning step but significantly reduces the total number of steps needed. In addition, the agent's action selections begin to resemble those a human would choose.


Sign in / Sign up

Export Citation Format

Share Document