Using Inverse Reinforcement Learning with Real Trajectories to Get More Trustworthy Pedestrian Simulations

Francisco Martinez-Gil; Miguel Lozano; Ignacio García-Fernández; Pau Romero; Dolors Serra; Rafael Sebastián

doi:10.3390/math8091479


Mathematics ◽

10.3390/math8091479 ◽

2020 ◽

Vol 8 (9) ◽

pp. 1479

Author(s):

Francisco Martinez-Gil ◽

Miguel Lozano ◽

Ignacio García-Fernández ◽

Pau Romero ◽

Dolors Serra ◽

...

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Machine Learning Techniques ◽

Inverse Reinforcement Learning ◽

The Real ◽

Q Learning ◽

Learning Framework ◽

Entropy Principle ◽

Real Behavior ◽

Function Approximator

Reinforcement learning is one of the most promising machine learning techniques for obtaining intelligent behaviors for embodied agents in simulations. The output of the classic Temporal Difference family of Reinforcement Learning algorithms takes the form of a value function expressed as a numeric table or a function approximator. The learned behavior is then derived using a greedy policy with respect to this value function. Nevertheless, sometimes the learned policy does not meet expectations, and the task of authoring is difficult and unsafe because modifying one value or parameter in the learned value function has unpredictable consequences in the space of the policies it represents. This invalidates direct manipulation of the learned value function as a method to modify the derived behaviors. In this paper, we propose the use of Inverse Reinforcement Learning to incorporate real behavior traces in the learning process to shape the learned behaviors, thus increasing their trustworthiness (in terms of conformance to reality). To do so, we adapt the Inverse Reinforcement Learning framework to the navigation problem domain. Specifically, we use Soft Q-learning, an algorithm based on the maximum causal entropy principle, with MARL-Ped (a Reinforcement Learning-based pedestrian simulator) to include information from trajectories of real pedestrians in the process of learning how to navigate inside a virtual 3D space that represents the real environment. A comparison with the behaviors learned using a classic Reinforcement Learning algorithm (Sarsa(λ)) shows that the Inverse Reinforcement Learning behaviors fit the real trajectories significantly better.
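The contrast between a greedy policy and the maximum-entropy policy behind Soft Q-learning can be illustrated with a minimal sketch. It uses the generic maximum-entropy formulation, not the MARL-Ped implementation, and the temperature `alpha`, the function names, and the sample Q-values are all assumptions made for illustration.

```python
import numpy as np

def soft_value(q_values, alpha=1.0):
    """Soft state value: alpha * log(sum_a exp(Q(s, a) / alpha)).

    The maximum is subtracted before exponentiating for numerical stability.
    `alpha` is a hypothetical temperature parameter.
    """
    q = np.asarray(q_values, dtype=float)
    m = q.max()
    return m + alpha * np.log(np.sum(np.exp((q - m) / alpha)))

def soft_policy(q_values, alpha=1.0):
    """Stochastic policy pi(a | s) = exp((Q(s, a) - V_soft(s)) / alpha)."""
    q = np.asarray(q_values, dtype=float)
    return np.exp((q - soft_value(q, alpha)) / alpha)

def greedy_action(q_values):
    """Greedy policy used by classic Temporal Difference methods."""
    return int(np.argmax(q_values))

# Illustrative Q-values for one state with three actions.
q = [1.0, 2.0, 0.5]
probs = soft_policy(q)       # every action keeps non-zero probability
best = greedy_action(q)      # only the highest-valued action is chosen
```

As `alpha` shrinks, the soft value approaches the maximum Q-value and the soft policy approaches the greedy one.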


A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms

Neural Computation ◽

10.1162/089976699300016070 ◽

1999 ◽

Vol 11 (8) ◽

pp. 2017-2060 ◽

Cited By ~ 70

Author(s):

Csaba Szepesvári ◽

Michael L. Littman

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Learning Algorithm ◽

Learning Algorithms ◽

Sequential Decision ◽

Q Learning ◽

Markov Games ◽

Optimal Behavior ◽

Risk Sensitive ◽

Optimal Value

Reinforcement learning is the problem of generating optimal behavior in a sequential decision-making environment given the opportunity of interacting with it. Many algorithms for solving reinforcement-learning problems work by computing improved estimates of the optimal value function. We extend prior analyses of reinforcement-learning algorithms and present a powerful new theorem that can provide a unified analysis of such value-function-based reinforcement-learning algorithms. The usefulness of the theorem lies in how it allows the convergence of a complex asynchronous reinforcement-learning algorithm to be proved by verifying that a simpler synchronous algorithm converges. We illustrate the application of the theorem by analyzing the convergence of Q-learning, model-based reinforcement learning, Q-learning with multistate updates, Q-learning for Markov games, and risk-sensitive reinforcement learning.


Q-Learning based Routing Protocol to Enhance Network Lifetime in WSNs

International journal of Computer Networks & Communications ◽

10.5121/ijcnc.2021.13204 ◽

2021 ◽

Vol 13 (2) ◽

pp. 57-80

Author(s):

Arunita Kundaliya ◽

D.K. Lobiyal

Keyword(s):

Reinforcement Learning ◽

Network Lifetime ◽

Residual Energy ◽

Efficient Solutions ◽

Machine Learning Techniques ◽

Q Learning ◽

Learning Techniques ◽

Aodv Protocol ◽

Optimal Action ◽

Additional Memory

In resource-constrained Wireless Sensor Networks (WSNs), extending network lifetime has been one of the most challenging issues for researchers. Researchers have been exploiting machine learning techniques, in particular reinforcement learning, to achieve efficient solutions in the WSN domain. The objective of this paper is to apply Q-learning, a reinforcement learning technique, to enhance the lifetime of the network by developing distributed routing protocols. Q-learning is an attractive choice for routing because of its low computational requirements and low additional memory demands. To enable the agent running at each node to take an optimal action, the approach considers the node's residual energy, hop length to the sink, and transmission power. Two of these parameters, residual energy and hop length, are used to calculate the Q-value, which in turn is used to decide the optimal next hop for routing. The performance of the proposed protocols is evaluated through NS3 simulations and compared with the AODV protocol in terms of network lifetime, throughput, and end-to-end delay.
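The idea of choosing a next hop from Q-values built on residual energy and hop length can be sketched as follows. The abstract does not give the reward formula, so the `reward` function, its weights, the learning rate, and the discount factor here are hypothetical choices for illustration only, not the protocol from the paper.

```python
def reward(residual_energy, hops_to_sink, w_energy=0.5, w_hops=0.5):
    """Hypothetical reward: prefers neighbours with more residual energy
    (given as a fraction in [0, 1]) and fewer hops to the sink."""
    return w_energy * residual_energy + w_hops / (1 + hops_to_sink)

def q_update(q, node, neighbor, r, neighbor_best_q, lr=0.5, gamma=0.9):
    """Standard Q-learning update for the (node, neighbor) link."""
    old = q.get((node, neighbor), 0.0)
    q[(node, neighbor)] = old + lr * (r + gamma * neighbor_best_q - old)
    return q[(node, neighbor)]

def next_hop(q, node, neighbors):
    """Pick the neighbour with the highest learned Q-value."""
    return max(neighbors, key=lambda n: q.get((node, n), 0.0))

# Node A learns about two neighbours: B (high energy, close to the sink)
# and C (low energy, farther away).
q_table = {}
q_update(q_table, "A", "B", reward(0.9, 1), neighbor_best_q=0.0)
q_update(q_table, "A", "C", reward(0.2, 3), neighbor_best_q=0.0)
chosen = next_hop(q_table, "A", ["B", "C"])
```

Each node only stores one Q-value per neighbour, which is consistent with the small memory footprint the abstract mentions.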


Upper Bounds on the Performance of Discretisation in Reinforcement Learning

South African Computer Journal ◽

10.18489/sacj.v0i57.284 ◽

2015 ◽

Author(s):

Michael Robin Mitchley

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Value Function Approximation ◽

Learning Framework ◽

A Value ◽

Continuous State Space ◽

Policy Representation ◽

Continuous State ◽

Tile Coding ◽

Policy Mapping

Reinforcement learning is a machine learning framework whereby an agent learns to perform a task by maximising its total reward received for selecting actions in each state. The policy mapping states to actions that the agent learns is either represented explicitly, or implicitly through a value function. It is common in reinforcement learning to discretise a continuous state space using tile coding or binary features. We prove an upper bound on the performance of discretisation for direct policy representation or value function approximation.
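As a rough illustration of the discretisation mentioned above, the following sketch tile-codes a single continuous state variable. The number of tilings, the tiles per tiling, and the uniform offset scheme are assumptions chosen for illustration; the abstract does not specify a particular configuration.

```python
import math

def tile_indices(x, low, high, tiles_per_tiling=8, num_tilings=4):
    """Return one active tile index per tiling for a scalar x in [low, high].

    Each tiling is shifted by a fraction of a tile width, so it needs one
    extra tile to cover the whole range.
    """
    width = (high - low) / tiles_per_tiling
    active = []
    for t in range(num_tilings):
        offset = t * width / num_tilings
        idx = int(math.floor((x - low + offset) / width))
        idx = min(max(idx, 0), tiles_per_tiling)  # clamp to valid tiles
        # Give each tiling its own block of indices in the feature vector.
        active.append(t * (tiles_per_tiling + 1) + idx)
    return active

# Nearby states share most of their active tiles; distant states share none.
a = tile_indices(0.50, 0.0, 1.0)
b = tile_indices(0.54, 0.0, 1.0)
c = tile_indices(0.90, 0.0, 1.0)
```

The resulting indices identify the binary features that are switched on for a state, which is what a linear value function approximator would consume.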


Reinforcement Learning under Threats

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33019939 ◽

2019 ◽

Vol 33 ◽

pp. 9939-9940 ◽

Cited By ~ 1

Author(s):

Victor Gallego ◽

Roi Naveiro ◽

David Rios Insua

Keyword(s):

Reinforcement Learning ◽

Single Agent ◽

Potential Threat ◽

Q Learning ◽

Learning Framework ◽

Opponent Modeling ◽

Theoretical Approaches ◽

New Learning ◽

Markov Decision ◽

Multi Agent

In several reinforcement learning (RL) scenarios, mainly in security settings, there may be adversaries trying to interfere with the reward-generating process. However, when such non-stationary environments are considered, Q-learning leads to suboptimal results (Busoniu, Babuska, and De Schutter 2010). Previous game-theoretical approaches to this problem have focused on modeling the whole multi-agent system as a game. Instead, we shall face the problem of prescribing decisions to a single agent (the supported decision maker, DM) against a potential threat model (the adversary). We augment the MDP to account for this threat, introducing Threatened Markov Decision Processes (TMDPs). Furthermore, we propose a level-k thinking scheme resulting in a new learning framework to deal with TMDPs. We empirically test our framework, showing the benefits of opponent modeling.


Reinforcement Learning-Enabled UAV Itinerary Planning for Remote Sensing Applications in Smart Farming

Telecom ◽

10.3390/telecom2030017 ◽

2021 ◽

Vol 2 (3) ◽

pp. 255-270

Author(s):

Saeid Pourroostaei Ardakani ◽

Ali Cheshmehzangi

Keyword(s):

Remote Sensing ◽

Reinforcement Learning ◽

Data Collection ◽

Cost Effective ◽

Environmental Data ◽

Machine Learning Techniques ◽

Q Learning ◽

Sensing Applications ◽

Learning Technique ◽

Target Locations

UAV path planning for remote sensing aims to find the best-fitted routes to complete a data collection mission. UAVs plan the routes and move through them to remotely collect environmental data from particular target zones by using sensory devices such as cameras. Route planning may utilize machine learning techniques to autonomously find or select cost-effective and best-fitted routes and achieve optimized results, including minimized data collection delay, reduced UAV power consumption, decreased traversed flight distance, and a maximized number of collected data samples. This paper utilizes a reinforcement learning technique (location- and energy-aware Q-learning) to plan UAV routes for remote sensing in smart farms. Through this, the UAV avoids moving heuristically or blindly throughout a farm; instead, it uses the exploration–exploitation trade-off to explore the farm and find the shortest and most cost-effective paths to target locations with interesting data samples to collect. According to the simulation results, utilizing the Q-learning technique increases data collection robustness and reduces UAV resource consumption (e.g., power), traversed paths, and remote sensing latency as compared to two well-known benchmarks, IEMF and TBID, especially if the target locations are dense and crowded in a farm.


Gamma-Nets: Generalizing Value Estimation over Timescale

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6027 ◽

2020 ◽

Vol 34 (04) ◽

pp. 5717-5725

Author(s):

Craig Sherstan ◽

Shibhansh Dohare ◽

James MacGlashan ◽

Johannes Günther ◽

Patrick M. Pilarski

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

A Priori ◽

Predictive Ability ◽

Representation Learning ◽

Robot Arm ◽

Function Estimation ◽

Temporal Abstraction ◽

Long Time ◽

Function Approximator

Temporal abstraction is a key requirement for agents making decisions over long time horizons, a fundamental challenge in reinforcement learning. There are many reasons why value estimates at multiple timescales might be useful; recent work has shown that value estimates at different timescales can be the basis for creating more advanced discounting functions and for driving representation learning. Further, predictions at many different timescales serve to broaden an agent's model of its environment. One predictive approach of interest within an online learning setting is general value functions (GVFs), which represent models of an agent's world as a collection of predictive questions, each defined by a policy, a signal to be predicted, and a prediction timescale. In this paper we present Γ-nets, a method for generalizing value function estimation over timescale, allowing a given GVF to be trained and queried for arbitrary timescales so as to greatly increase the predictive ability and scalability of a GVF-based model. The key to our approach is to use timescale as one of the value estimator's inputs. As a result, the prediction target for any timescale is available at every timestep and we are free to train on any number of timescales. We first provide two demonstrations by 1) predicting a square wave and 2) predicting sensorimotor signals on a robot arm using a linear function approximator. Next, we empirically evaluate Γ-nets in the deep reinforcement learning setting using policy evaluation on a set of Atari video games. Our results show that Γ-nets can be effective for predicting arbitrary timescales, with only a small cost in accuracy as compared to learning estimators for fixed timescales. Γ-nets provide a method for accurately and compactly making predictions at many timescales without requiring a priori knowledge of the task, making them a valuable contribution to ongoing work on model-based planning, representation learning, and lifelong learning algorithms.


A Hybrid Algorithm in Reinforcement Learning for Crowd Simulation

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.f9187.038620 ◽

2020 ◽

Vol 8 (6) ◽

pp. 5251-5255

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Crowd Simulation ◽

Value Functions ◽

Q Learning ◽

Efficient Measurement ◽

Multi Agent ◽

Hybrid Agent ◽

Multiple Value ◽

Transportation Applications

Exploiting the efficiency and stability of Dynamic Crowd, the paper proposes a hybrid crowd simulation algorithm that runs with multiple agents and focuses mainly on identifying the crowd to simulate. An efficient measurement for both static and dynamic crowd simulation is applied in tracking and transportation applications. The proposed Hybrid Agent Reinforcement Learning (HARL) algorithm combines the off-policy value function of Q-learning with the on-policy value function of SARSA, and is used for a dynamic crowd evacuation scenario. HARL uses multiple value functions and combines the policy value functions derived from the multiple agents to improve performance. In addition, the efficiency of the HARL algorithm is demonstrated across varied crowd sizes. Two kinds of applications, tracking applications and transportation monitoring applications, are used to simulate the crowd sizes.
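The difference between the off-policy Q-learning value function and the on-policy SARSA value function comes down to the bootstrap target each one uses. The sketch below shows only these two standard targets; how HARL combines them is not described in the abstract, so no combination rule is shown. The function names, discount factor, and sample values are assumptions for illustration.

```python
def q_learning_target(reward, next_q_values, gamma=0.9):
    """Off-policy target: bootstraps from the best next action,
    regardless of which action the agent actually takes next."""
    return reward + gamma * max(next_q_values)

def sarsa_target(reward, next_q_values, next_action, gamma=0.9):
    """On-policy target: bootstraps from the action the agent
    actually selected in the next state."""
    return reward + gamma * next_q_values[next_action]

# Illustrative values: the agent explores and picks action 2,
# although action 1 has the highest estimated value.
next_q = [0.0, 2.0, 1.0]
off_policy = q_learning_target(1.0, next_q)
on_policy = sarsa_target(1.0, next_q, next_action=2)
```

The two targets coincide whenever the next action is the greedy one, and differ only on exploratory steps.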


SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards

The International Journal of Robotics Research ◽

10.1177/0278364918784350 ◽

2018 ◽

Vol 38 (2-3) ◽

pp. 126-145 ◽

Cited By ~ 9

Author(s):

Sanjay Krishnan ◽

Animesh Garg ◽

Richard Liaw ◽

Brijen Thananjeyan ◽

Lauren Miller ◽

...

Keyword(s):

Reinforcement Learning ◽

Learning Algorithm ◽

Search Algorithm ◽

Inverse Reinforcement Learning ◽

Parallel Parking ◽

Q Learning ◽

Physical Experiments ◽

Long Time ◽

Reward Functions ◽

Behavioral Cloning

We present sequential windowed inverse reinforcement learning (SWIRL), a policy search algorithm that is a hybrid of exploration and demonstration paradigms for robot learning. We apply unsupervised learning to a small number of initial expert demonstrations to structure future autonomous exploration. SWIRL approximates a long time horizon task as a sequence of local reward functions and subtask transition conditions. Over this approximation, SWIRL applies Q-learning to compute a policy that maximizes rewards. Experiments suggest that SWIRL requires significantly fewer rollouts than pure reinforcement learning and fewer expert demonstrations than behavioral cloning to learn a policy. We evaluate SWIRL in two simulated control tasks, parallel parking and a two-link pendulum. On the parallel parking task, SWIRL achieves the maximum reward on the task with 85% fewer rollouts than Q-learning, and one-eighth of the demonstrations needed by behavioral cloning. We also consider physical experiments on surgical tensioning and cutting deformable sheets using a da Vinci surgical robot. On the deformable tensioning task, SWIRL achieves a 36% relative improvement in reward compared with a baseline of behavioral cloning with segmentation.


Boosting Offline Reinforcement Learning with Residual Generative Modeling

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/492 ◽

2021 ◽

Author(s):

Hua Wei ◽

Deheng Ye ◽

Zhao Liu ◽

Hao Wu ◽

Bo Yuan ◽

...

Keyword(s):

Reinforcement Learning ◽

Value Function ◽

Approximation Error ◽

The State ◽

Training Data ◽

Action Function ◽

Q Learning ◽

State Action ◽

Generative Modeling ◽

Benchmark Datasets

Offline reinforcement learning (RL) tries to learn a near-optimal policy from recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model to reduce policy approximation error for offline RL. We show that our method can learn more accurate policy approximations in different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks under the multiplayer online battle arena (MOBA) game, Honor of Kings.


Regularising neural networks for future trajectory prediction via inverse reinforcement learning framework

IET Computer Vision ◽

10.1049/iet-cvi.2019.0546 ◽

2020 ◽

Vol 14 (5) ◽

pp. 192-200

Author(s):

Dooseop Choi ◽

Kyoungwook Min ◽

Jeongdan Choi

Keyword(s):

Neural Networks ◽

Reinforcement Learning ◽

Trajectory Prediction ◽

Inverse Reinforcement Learning ◽

Learning Framework
