The actions of others act as a pseudo-reward to drive imitation in the context of social reinforcement learning

PLoS Biology ◽  
2020 ◽  
Vol 18 (12) ◽  
pp. e3001028
Author(s):  
Anis Najar ◽  
Emmanuelle Bonnet ◽  
Bahador Bahrami ◽  
Stefano Palminteri

While there is no doubt that social signals affect human reinforcement learning, there is still no consensus about how this process is computationally implemented. To address this issue, we compared three psychologically plausible hypotheses about the algorithmic implementation of imitation in reinforcement learning. The first hypothesis, decision biasing (DB), postulates that imitation consists in transiently biasing the learner’s action selection without affecting their value function. According to the second hypothesis, model-based imitation (MB), the learner infers the demonstrator’s value function through inverse reinforcement learning and uses it to bias action selection. Finally, according to the third hypothesis, value shaping (VS), the demonstrator’s actions directly affect the learner’s value function. We tested these three hypotheses in two experiments (N = 24 and N = 44) featuring a new variant of a social reinforcement learning task. We show through model comparison and model simulation that VS provides the best explanation of the learners’ behavior. These results were replicated in a third, independent experiment featuring a larger cohort and a different design (N = 302). In our experiments, we also manipulated the quality of the demonstrators’ choices and found that learners were able to adapt their imitation rate, so that only skilled demonstrators were imitated. We proposed and tested an efficient meta-learning process to account for this effect, where imitation is regulated by the agreement between the learner and the demonstrator. In sum, our findings provide new insights and perspectives on the computational mechanisms underlying adaptive imitation in human reinforcement learning.
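To make the contrast between the competing hypotheses concrete, here is a minimal sketch of how value shaping and decision biasing could differ as update rules in a two-armed bandit. All function names and parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta):
    """Softmax choice probabilities with inverse temperature beta."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

alpha, beta, kappa = 0.3, 5.0, 0.5   # illustrative values, not fitted parameters
Q = np.zeros(2)                      # learner's action values for a 2-armed bandit

def value_shaping_update(Q, demo_action, kappa):
    """VS: the demonstrator's choice acts as a pseudo-reward written into Q."""
    Q = Q.copy()
    Q[demo_action] += kappa * (1.0 - Q[demo_action])
    return Q

def decision_biasing_policy(Q, demo_action, beta, kappa):
    """DB: Q stays untouched; the imitation bonus enters only at choice time."""
    bias = np.zeros_like(Q)
    bias[demo_action] = kappa
    return softmax(Q + bias, beta)

# One trial under VS: observe the demonstrator, update values, choose, learn.
demo_action = 1
Q = value_shaping_update(Q, demo_action, kappa)
choice = rng.choice(2, p=softmax(Q, beta))
reward = float(rng.random() < 0.7)            # toy reward probability
Q[choice] += alpha * (reward - Q[choice])     # standard reward-based update
```

The key difference is where the demonstrator's influence lives: under VS it is written into the value function, so it persists and interacts with subsequent reward learning, whereas under DB it is applied only transiently at choice time.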

2019 ◽  
Author(s):  
Anis Najar ◽  
Emmanuelle Bonnet ◽  
Bahador Bahrami ◽  
Stefano Palminteri

While there is no doubt that social signals affect human reinforcement learning, there is still no consensus about their exact computational implementation. To address this issue, we compared three hypotheses about the algorithmic implementation of imitation in human reinforcement learning. The first hypothesis, decision biasing, postulates that imitation consists in transiently biasing the learner’s action selection without affecting her value function. According to the second hypothesis, model-based imitation, the learner infers the demonstrator’s value function through inverse reinforcement learning and uses it for action selection. Finally, according to the third hypothesis, value shaping, the demonstrator’s actions directly affect the learner’s value function. We tested these three psychologically plausible hypotheses in two separate experiments (N = 24 and N = 44) featuring a new variant of a social reinforcement learning task, where we manipulated the quantity and the quality of the demonstrator’s choices. We show through model comparison that value shaping is favored, which provides a new perspective on how imitation is integrated into human reinforcement learning.


2020 ◽  
Vol 30 (6) ◽  
pp. 3573-3589 ◽  
Author(s):  
Rick A Adams ◽  
Michael Moutoussis ◽  
Matthew M Nour ◽  
Tarik Dahoun ◽  
Declan Lewis ◽  
...  

Choosing actions that result in advantageous outcomes is a fundamental function of nervous systems. All computational decision-making models contain a mechanism that controls the variability of (or confidence in) action selection, but its neural implementation is unclear—especially in humans. We investigated this mechanism using two influential decision-making frameworks: active inference (AI) and reinforcement learning (RL). In AI, the precision (inverse variance) of beliefs about policies controls action selection variability—similar to decision ‘noise’ parameters in RL—and is thought to be encoded by striatal dopamine signaling. We tested this hypothesis by administering a ‘go/no-go’ task to 75 healthy participants, and measuring striatal dopamine 2/3 receptor (D2/3R) availability in a subset (n = 25) using [11C]-(+)-PHNO positron emission tomography. In behavioral model comparison, RL performed best across the whole group but AI performed best in participants performing above chance levels. Limbic striatal D2/3R availability had linear relationships with AI policy precision (P = 0.029) as well as with RL irreducible decision ‘noise’ (P = 0.020), and this relationship with D2/3R availability was confirmed with a ‘decision stochasticity’ factor that aggregated across both models (P = 0.0006). These findings are consistent with occupancy of inhibitory striatal D2/3Rs decreasing the variability of action selection in humans.
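As a rough illustration of how single parameters can control action-selection variability, the sketch below combines a softmax with a precision (inverse temperature) parameter and an irreducible "decision noise" lapse term that mixes in uniform random responding. The function and parameter names are assumptions for illustration, not the study's fitted models.

```python
import numpy as np

def choice_probabilities(values, precision, lapse):
    """Softmax with precision (inverse temperature) plus an irreducible
    'decision noise' lapse that mixes in uniform responding."""
    v = precision * (values - np.max(values))
    p = np.exp(v) / np.exp(v).sum()
    return (1.0 - lapse) * p + lapse / len(values)

vals = np.array([0.2, 0.8])   # toy no-go vs. go values
print(choice_probabilities(vals, precision=8.0, lapse=0.0))  # near-deterministic
print(choice_probabilities(vals, precision=8.0, lapse=0.3))  # noisier selection
```

Higher precision concentrates choice on the best option, while a larger lapse flattens the distribution regardless of the values, which is why the two parameters can be aggregated into a single "decision stochasticity" factor.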


2020 ◽  
Vol 34 (10) ◽  
pp. 13965-13966
Author(s):  
Yuchen Xiao ◽  
Joshua Hoffman ◽  
Tian Xia ◽  
Christopher Amato

We consider the challenges of learning multi-agent/robot macro-action-based deep Q-nets, including how to properly update each macro-action value and how to accurately maintain macro-action-observation trajectories. We address these challenges by first proposing two fundamental frameworks for learning a macro-action-value function and a joint macro-action-value function. Furthermore, we present two new approaches to learning decentralized macro-action-based policies, which involve a new double Q-update rule that facilitates the learning of decentralized Q-nets by using a centralized Q-net for action selection. Our approaches are evaluated both in simulation and on real robots.
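The double Q-update idea can be caricatured in tabular, single-state form: a centralized joint-value table selects the greedy joint macro-action, and each agent's decentralized table evaluates its own component of that action. This is a loose sketch under those assumptions, not the paper's deep Q-net implementation; all names are hypothetical.

```python
import numpy as np

n_actions = 3
Q_cen = np.zeros((n_actions, n_actions))        # centralized joint macro-action values
Q_dec = [np.zeros(n_actions) for _ in range(2)] # one decentralized table per agent

def double_q_update(Q_dec, Q_cen, reward, alpha=0.1, gamma=0.95):
    """Centralized selection, decentralized evaluation: the joint table picks
    the greedy joint macro-action, and each agent bootstraps on its own
    component of that joint action."""
    joint = np.unravel_index(np.argmax(Q_cen), Q_cen.shape)
    for i, Q in enumerate(Q_dec):
        target = reward + gamma * Q[joint[i]]
        Q[joint[i]] += alpha * (target - Q[joint[i]])
    return Q_dec

Q_dec = double_q_update(Q_dec, Q_cen, reward=1.0)
```

Decoupling selection from evaluation in this way is the classic double-Q remedy for overestimation bias; here it additionally lets decentralized learners benefit from centralized information during training.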


2016 ◽  
Vol 28 (4) ◽  
pp. 371-381 ◽  
Author(s):  
Chao Lu ◽  
Jie Huang ◽  
Jianwei Gong

Reinforcement Learning (RL) has been proposed to deal with ramp control problems under dynamic traffic conditions; however, there is a lack of sufficient research on the behaviour and impacts of different learning parameters. This paper describes a ramp control agent based on the RL mechanism and thoroughly analyses the influence of three learning parameters, namely the learning rate, discount rate, and action selection parameter, on algorithm performance. Two indices for learning speed and convergence stability were used to measure algorithm performance, based on which a series of simulation-based experiments were designed and conducted using a macroscopic traffic flow model. Simulation results showed that, compared with the discount rate, the learning rate and action selection parameter had more remarkable impacts on algorithm performance. Based on this analysis, some suggestions about how to select suitable parameter values to achieve superior performance are provided.
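For readers unfamiliar with the three parameters, a minimal tabular Q-learning loop with epsilon-greedy selection shows where each one enters. The setup below is a toy stand-in, not the paper's macroscopic traffic model, and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 3           # toy discretized ramp-metering problem
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1 # learning rate, discount rate, selection parameter

def select_action(s):
    """Epsilon-greedy: the action selection parameter trades exploration
    against exploitation."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """alpha sets the learning speed; gamma weights future traffic outcomes."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

a = select_action(0)
q_update(0, a, r=1.0, s_next=1)
```

The paper's finding maps naturally onto this structure: alpha and epsilon directly shape how fast and how stably Q converges, while gamma mostly rescales the targets.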


2021 ◽  
Author(s):  
Haitao Song ◽  
Guihong Fan ◽  
Shi Zhao ◽  
Huichen Li ◽  
Qihua Huang ◽  
...  

By February 2021, the overall impact of the COVID-19 pandemic in India had been relatively mild in terms of total reported cases and deaths. Surprisingly, the second wave beginning in early April became devastating and attracted worldwide attention. On April 30, 2021, India became the first country to report over 400,000 daily new cases. Multiple factors drove the rapid growth of the epidemic in India and caused a large number of deaths within a very short period: a new variant with increased transmissibility, a lack of nationwide preparation, and health and safety precautions that were poorly implemented or enforced during festivals, sporting events, and state/local elections. Moreover, India's cases and deaths are vastly underreported due to poor infrastructure and low testing rates. In this paper, we use COVID-19 mortality data from India and a mathematical model to calculate the effective reproduction number and to model the wave pattern in India. We propose a new approach to forecasting the epidemic size and peak timing, with the aim of informing mitigation in India. Our model simulation matched the reported deaths accurately and is reasonably close to the results of serological studies. We forecast that the infection attack rate (IAR) could reach 43% by June 13, 2021 under the current trend, which means 532,629 reported deaths with a 95% CI (513,194, 552,445), i.e., double the current total deaths. Our approach is readily applicable in other countries and with other types of data (e.g., excess deaths).
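The paper fits a mechanistic model to mortality data; as a generic illustration of how an effective reproduction number can be read off an incidence series, here is a crude renewal-equation estimate, R_t = I_t / Σ_s w_s I_(t−s). The series and generation-interval weights below are made-up toy numbers, not the paper's data or method.

```python
import numpy as np

def effective_R(incidence, gen_interval):
    """Crude renewal-equation estimate R_t = I_t / sum_s w_s * I_{t-s},
    where w is the discretized generation-interval distribution."""
    w = gen_interval / gen_interval.sum()
    R = np.full(len(incidence), np.nan)
    for t in range(len(w), len(incidence)):
        denom = np.dot(w, incidence[t - len(w):t][::-1])  # most recent day first
        if denom > 0:
            R[t] = incidence[t] / denom
    return R

cases = np.array([10, 12, 15, 20, 26, 35, 46, 60, 80, 105, 140, 185], float)
w = np.array([0.05, 0.10, 0.20, 0.25, 0.20, 0.15, 0.05])  # toy 7-day interval
print(np.round(effective_R(cases, w), 2))
```

Working from deaths rather than cases, as the paper does, sidesteps some of the underreporting problem at the cost of a reporting delay that the full model must account for.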


Author(s):  
Daxue Liu ◽  
Jun Wu ◽  
Xin Xu

Multi-agent reinforcement learning (MARL) provides a useful and flexible framework for multi-agent coordination in uncertain dynamic environments. However, the generalization ability and scalability of algorithms to large problem sizes, already problematic in single-agent RL, is an even more formidable obstacle in MARL applications. In this paper, a new MARL method based on ordinal action selection and approximate policy iteration, called OAPI (Ordinal Approximate Policy Iteration), is presented to address the scalability issue of MARL algorithms in common-interest Markov games. In OAPI, an ordinal action selection and learning strategy is integrated with distributed approximate policy iteration not only to simplify the policy space and eliminate conflicts in multi-agent coordination, but also to approximate near-optimal policies for Markov games with large state spaces. Based on the policy space simplified by ordinal action selection, the OAPI algorithm implements distributed approximate policy iteration using online least-squares policy iteration (LSPI). This results in multi-agent coordination with good convergence properties and reduced computational complexity. The simulation results of a coordinated multi-robot navigation task illustrate the feasibility and effectiveness of the proposed approach.
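The abstract does not spell out the ordinal selection rule, so the following is only one plausible reading: actions are ranked by estimated value and chosen by rank rather than by value magnitude, which is what shrinks the policy space. Everything here is a hypothetical sketch, not the OAPI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ordinal_action_selection(q_values, rank_probs):
    """Actions are ranked by estimated value and selected by rank (mostly the
    best, occasionally a runner-up), so the policy depends only on the
    ordering of the values, not their magnitudes."""
    order = np.argsort(q_values)[::-1]          # best action first
    rank = rng.choice(len(rank_probs), p=rank_probs)
    return int(order[min(rank, len(order) - 1)])

q = np.array([0.1, 0.7, 0.4])
a = ordinal_action_selection(q, rank_probs=np.array([0.8, 0.15, 0.05]))
```

Because only the ordering matters, such a policy is insensitive to approximation error in the value magnitudes, which may be why it combines well with approximate policy iteration schemes like LSPI.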

