An Overview of Inverse Reinforcement Learning Techniques

Intelligent Environments 2021 - Ambient Intelligence and Smart Environments ◽

10.3233/aise210097 ◽

2021 ◽

Author(s):

Syed Ihtesham Hussain Shah ◽

Giuseppe De Pietro

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Decision Process ◽

Autonomous Agents ◽

Theoretical Background ◽

Inverse Reinforcement Learning ◽

Reward Function ◽

Learning Techniques ◽

Markov Decision ◽

Potential Use

In decision-making problems reward function plays an important role in finding the best policy. Reinforcement Learning (RL) provides a solution for decision-making problems under uncertainty in an Intelligent Environment (IE). However, it is difficult to specify the reward function for RL agents in large and complex problems. To counter these problems an extension of RL problem named Inverse Reinforcement Learning (IRL) is introduced, where reward function is learned from expert demonstrations. IRL is appealing for its potential use to build autonomous agents, capable of modeling others, deprived of compromising in performance of the task. This approach of learning by demonstrations relies on the framework of Markov Decision Process (MDP). This article elaborates original IRL algorithms along with their close variants to mitigate challenges. The purpose of this paper is to highlight an overview and theoretical background of IRL in the field of Machine Learning (ML) and Artificial Intelligence (AI). We presented a brief comparison between different variants of IRL in this article.

Download Full-text

Inverse reinforcement learning in contextual MDPs

Machine Learning ◽

10.1007/s10994-021-05984-x ◽

2021 ◽

Author(s):

Stav Belogolovsky ◽

Philip Korsunsky ◽

Shie Mannor ◽

Chen Tessler ◽

Tom Zahavy

Keyword(s):

Reinforcement Learning ◽

Optimization Problem ◽

Decision Processes ◽

Inverse Reinforcement Learning ◽

Convex Optimization Problem ◽

Reward Function ◽

Dynamic Treatment Regime ◽

Markov Decision ◽

Dynamic Treatment ◽

Recorded Data

AbstractWe consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping, such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differential convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function which explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.

Download Full-text

Optimal Policies for Quantum Markov Decision Processes

International Journal of Automation and Computing ◽

10.1007/s11633-021-1278-z ◽

2021 ◽

Author(s):

Ming-Sheng Ying ◽

Yuan Feng ◽

Sheng-Gang Ying

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Quantum Systems ◽

Sequential Decision Making ◽

Mathematical Framework ◽

Sequential Decision ◽

Learning Techniques ◽

Optimal Policies ◽

Markov Decision ◽

Programming Algorithms

AbstractMarkov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random. In particular, it serves as a mathematical framework for reinforcement learning. This paper introduces an extension of MDP, namely quantum MDP (qMDP), that can serve as a mathematical model of decision making about quantum systems. We develop dynamic programming algorithms for policy evaluation and finding optimal policies for qMDPs in the case of finite-horizon. The results obtained in this paper provide some useful mathematical tools for reinforcement learning techniques applied to the quantum world.

Download Full-text

Cloud Load Balancing and Reinforcement Learning

Advances in Business Information Systems and Analytics - Cloud Computing Technologies for Green Enterprises ◽

10.4018/978-1-5225-3038-1.ch011 ◽

2018 ◽

pp. 266-291

Author(s):

Abdelghafour Harraz ◽

Mostapha Zbakh

Keyword(s):

Artificial Intelligence ◽

Reinforcement Learning ◽

Load Balancing ◽

Decision Process ◽

Cloud System ◽

Human Intervention ◽

Q Learning ◽

State Action ◽

Learning Techniques ◽

Markov Decision

Artificial Intelligence allows to create engines that are able to explore, learn environments and therefore create policies that permit to control them in real time with no human intervention. It can be applied, through its Reinforcement Learning techniques component, using frameworks such as temporal differences, State-Action-Reward-State-Action (SARSA), Q Learning to name a few, to systems that are be perceived as a Markov Decision Process, this opens door in front of applying Reinforcement Learning to Cloud Load Balancing to be able to dispatch load dynamically to a given Cloud System. The authors will describe different techniques that can used to implement a Reinforcement Learning based engine in a cloud system.

Download Full-text

Efficient PAC Reinforcement Learning in Regular Decision Processes

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2021/279 ◽

2021 ◽

Author(s):

Alessandro Ronca ◽

Giuseppe De Giacomo

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Polynomial Time ◽

Optimal Policy ◽

Decision Process ◽

Transition Function ◽

Decision Processes ◽

Reward Function ◽

Markov Decision ◽

Reward Functions

Recently regular decision processes have been proposed as a well-behaved form of non-Markov decision process. Regular decision processes are characterised by a transition function and a reward function that depend on the whole history, though regularly (as in regular languages). In practice both the transition and the reward functions can be seen as finite transducers. We study reinforcement learning in regular decision processes. Our main contribution is to show that a near-optimal policy can be PAC-learned in polynomial time in a set of parameters that describe the underlying decision process. We argue that the identified set of parameters is minimal and it reasonably captures the difficulty of a regular decision process.

Download Full-text

FLOW SHOP SCHEDULING WITH REINFORCEMENT LEARNING

Asia Pacific Journal of Operational Research ◽

10.1142/s0217595913500140 ◽

2013 ◽

Vol 30 (05) ◽

pp. 1350014 ◽

Cited By ~ 2

Author(s):

ZHICONG ZHANG ◽

WEIPING WANG ◽

SHOUYAN ZHONG ◽

KAISHUN HU

Keyword(s):

Reinforcement Learning ◽

Markov Decision Process ◽

Decision Process ◽

Large Scale ◽

Flow Shop ◽

Flow Shop Scheduling ◽

Scheduling Problems ◽

Shop Scheduling ◽

Reward Function ◽

Markov Decision

Reinforcement learning (RL) is a state or action value based machine learning method which solves large-scale multi-stage decision problems such as Markov Decision Process (MDP) and Semi-Markov Decision Process (SMDP) problems. We minimize the makespan of flow shop scheduling problems with an RL algorithm. We convert flow shop scheduling problems into SMDPs by constructing elaborate state features, actions and the reward function. Minimizing the accumulated reward is equivalent to minimizing the schedule objective function. We apply on-line TD(λ) algorithm with linear gradient-descent function approximation to solve the SMDPs. To examine the performance of the proposed RL algorithm, computational experiments are conducted on benchmarking problems in comparison with other scheduling methods. The experimental results support the efficiency of the proposed algorithm and illustrate that the RL approach is a promising computational approach for flow shop scheduling problems worthy of further investigation.

Download Full-text

Object Affordance Driven Inverse Reinforcement Learning Through Conceptual Abstraction and Advice

Paladyn Journal of Behavioral Robotics ◽

10.1515/pjbr-2018-0021 ◽

2018 ◽

Vol 9 (1) ◽

pp. 277-294 ◽

Cited By ~ 1

Author(s):

Rupam Bhattacharyya ◽

Shyamanta M. Hazarika

Keyword(s):

Reinforcement Learning ◽

High Dimensional ◽

Inverse Reinforcement Learning ◽

Intent Recognition ◽

Reward Function ◽

Object Affordances ◽

Learning Agent ◽

Markov Decision ◽

Observed Behaviour ◽

Object Affordance

Abstract Within human Intent Recognition (IR), a popular approach to learning from demonstration is Inverse Reinforcement Learning (IRL). IRL extracts an unknown reward function from samples of observed behaviour. Traditional IRL systems require large datasets to recover the underlying reward function. Object affordances have been used for IR. Existing literature on recognizing intents through object affordances fall short of utilizing its true potential. In this paper, we seek to develop an IRL system which drives human intent recognition along with the capability to handle high dimensional demonstrations exploiting the capability of object affordances. An architecture for recognizing human intent is presented which consists of an extended Maximum Likelihood Inverse Reinforcement Learning agent. Inclusion of Symbolic Conceptual Abstraction Engine (SCAE) along with an advisor allows the agent to work on Conceptually Abstracted Markov Decision Process. The agent recovers object affordance based reward function from high dimensional demonstrations. This function drives a Human Intent Recognizer through identification of probable intents. Performance of the resulting system on the standard CAD-120 dataset shows encouraging result.

Download Full-text

Reinforcement Learning with a Corrupted Reward Channel

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/656 ◽

2017 ◽

Cited By ~ 9

Author(s):

Tom Everitt ◽

Victoria Krakovna ◽

Laurent Orseau ◽

Shane Legg

Keyword(s):

Reinforcement Learning ◽

Real World ◽

Decision Problem ◽

Inverse Reinforcement Learning ◽

Markov Decision Problem ◽

Software Bugs ◽

Reward Function ◽

Learning Agent ◽

Markov Decision ◽

Simplifying Assumptions

No real-world reward function is perfect. Sensory errors and software bugs may result in agents getting higher (or lower) rewards than they should. For example, a reinforcement learning agent may prefer states where a sensory error gives it the maximum reward, but where the true reward is actually small. We formalise this problem as a generalised Markov Decision Problem called Corrupt Reward MDP. Traditional RL methods fare poorly in CRMDPs, even under strong simplifying assumptions and when trying to compensate for the possibly corrupt rewards. Two ways around the problem are investigated. First, by giving the agent richer data, such as in inverse reinforcement learning and semi-supervised reinforcement learning, reward corruption stemming from systematic sensory errors may sometimes be completely managed. Second, by using randomisation to blunt the agent's optimisation, reward corruption can be partially managed under some assumptions.

Download Full-text

Research on autonomous collision avoidance of merchant ship based on inverse reinforcement learning

International Journal of Advanced Robotic Systems ◽

10.1177/1729881420969081 ◽

2020 ◽

Vol 17 (6) ◽

pp. 172988142096908

Author(s):

Mao Zheng ◽

Shuo Xie ◽

Xiumin Chu ◽

Tianquan Zhu ◽

Guohao Tian

Keyword(s):

Reinforcement Learning ◽

Collision Avoidance ◽

Decision Process ◽

Process Model ◽

Cross Entropy ◽

Inverse Reinforcement Learning ◽

Merchant Ship ◽

Finite State ◽

Markov Decision ◽

Ship Collision

To learn the optimal collision avoidance policy of merchant ships controlled by human experts, a finite-state Markov decision process model for ship collision avoidance is proposed based on the analysis of collision avoidance mechanism, and an inverse reinforcement learning (IRL) method based on cross entropy and projection is proposed to obtain the optimal policy from expert’s demonstrations. Collision avoidance simulations in different ship encounters are conducted and the results show that the policy obtained by the proposed IRL has a good inversion effect on two kinds of human experts, which indicate that the proposed method can effectively learn the policy of human experts for ship collision avoidance.

Download Full-text

The Unreasonable Effectiveness of Inverse Reinforcement Learning in Advancing Cancer Research

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i01.5380 ◽

2020 ◽

Vol 34 (01) ◽

pp. 437-445

Author(s):

John Kalantari ◽

Heidi Nelson ◽

Nicholas Chia

Keyword(s):

Reinforcement Learning ◽

Cancer Progression ◽

Optimal Policy ◽

Decision Process ◽

Cancer Biology ◽

Optimal Decision ◽

Inverse Reinforcement Learning ◽

Evolutionary Trajectory ◽

Cancer Data ◽

Reward Function

The “No Free Lunch” theorem states that for any algorithm, elevated performance over one class of problems is offset by its performance over another. Stated differently, no algorithm works for everything. Instead, designing effective algorithms often means exploiting prior knowledge of data relationships specific to a given problem. This “unreasonable efficacy” is especially desirable for complex and seemingly intractable problems in the natural sciences. One such area that is rife with the need for better algorithms is cancer biology—a field where relatively few insights are being generated from relatively large amounts of data. In part, this is due to the inability of mere statistics to reflect cancer as a genetic evolutionary process—one that involves cells actively mutating in order to navigate host barriers, outcompete neighboring cells, and expand spatially.Our work is built upon the central proposition that the Markov Decision Process (MDP) can better represent the process by which cancer arises and progresses. More specifically, by encoding a cancer cell's complex behavior as a MDP, we seek to model the series of genetic changes, or evolutionary trajectory, that leads to cancer as an optimal decision process. We posit that using an Inverse Reinforcement Learning (IRL) approach will enable us to reverse engineer an optimal policy and reward function based on a set of “expert demonstrations” extracted from the DNA of patient tumors. The inferred reward function and optimal policy can subsequently be used to extrapolate the evolutionary trajectory of any tumor. Here, we introduce a Bayesian nonparametric IRL model (PUR-IRL) where the number of reward functions is a priori unbounded in order to account for uncertainty in cancer data, i.e., the existence of latent trajectories and non-uniform sampling. We show that PUR-IRL is “unreasonably effective” in gaining interpretable and intuitive insights about cancer progression from high-dimensional genome data.

Download Full-text

Car-following method based on inverse reinforcement learning for autonomous vehicle decision-making

International Journal of Advanced Robotic Systems ◽

10.1177/1729881418817162 ◽

2018 ◽

Vol 15 (6) ◽

pp. 172988141881716 ◽

Cited By ~ 10

Author(s):

Hongbo Gao ◽

Guanya Shi ◽

Guotao Xie ◽

Bo Cheng

Keyword(s):

Decision Making ◽

Reinforcement Learning ◽

Optimization Problems ◽

Learning Algorithm ◽

Autonomous Vehicle ◽

Sequential Decision ◽

Inverse Reinforcement Learning ◽

Car Following ◽

Reward Function ◽

Automatic Driving

There are still some problems need to be solved though there are a lot of achievements in the fields of automatic driving. One of those problems is the difficulty of designing a car-following decision-making system for complex traffic conditions. In recent years, reinforcement learning shows the potential in solving sequential decision optimization problems. In this article, we establish the reward function R of each driver data based on the inverse reinforcement learning algorithm, and r visualization is carried out, and then driving characteristics and following strategies are analyzed. At last, we show the efficiency of the proposed method by simulation in a highway environment.

Download Full-text