The actions of others act as a pseudo-reward to drive imitation in the context of social reinforcement learning

PLoS Biology ◽  
2020 ◽  
Vol 18 (12) ◽  
pp. e3001028
Author(s):  
Anis Najar ◽  
Emmanuelle Bonnet ◽  
Bahador Bahrami ◽  
Stefano Palminteri

While there is no doubt that social signals affect human reinforcement learning, there is still no consensus about how this process is computationally implemented. To address this issue, we compared three psychologically plausible hypotheses about the algorithmic implementation of imitation in reinforcement learning. The first hypothesis, decision biasing (DB), postulates that imitation consists in transiently biasing the learner’s action selection without affecting their value function. According to the second hypothesis, model-based imitation (MB), the learner infers the demonstrator’s value function through inverse reinforcement learning and uses it to bias action selection. Finally, according to the third hypothesis, value shaping (VS), the demonstrator’s actions directly affect the learner’s value function. We tested these three hypotheses in two experiments (N = 24 and N = 44) featuring a new variant of a social reinforcement learning task. We show through model comparison and model simulation that VS provides the best explanation of the learners’ behavior. These results were replicated in a third, independent experiment featuring a larger cohort and a different design (N = 302). In our experiments, we also manipulated the quality of the demonstrators’ choices and found that learners were able to adapt their imitation rate, so that only skilled demonstrators were imitated. We proposed and tested an efficient meta-learning process to account for this effect, where imitation is regulated by the agreement between the learner and the demonstrator. In sum, our findings provide new insights and perspectives on the computational mechanisms underlying adaptive imitation in human reinforcement learning.
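To make the contrast between the competing hypotheses concrete, here is a minimal sketch of how value shaping and decision biasing could differ as update rules in a two-armed bandit. All function names and parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta):
    """Softmax choice probabilities with inverse temperature beta."""
    e = np.exp(beta * (q - q.max()))
    return e / e.sum()

alpha, beta, kappa = 0.3, 5.0, 0.5   # illustrative values, not fitted parameters
Q = np.zeros(2)                      # learner's action values for a 2-armed bandit

def value_shaping_update(Q, demo_action, kappa):
    """VS: the demonstrator's choice acts as a pseudo-reward written into Q."""
    Q = Q.copy()
    Q[demo_action] += kappa * (1.0 - Q[demo_action])
    return Q

def decision_biasing_policy(Q, demo_action, beta, kappa):
    """DB: Q stays untouched; the imitation bonus enters only at choice time."""
    bias = np.zeros_like(Q)
    bias[demo_action] = kappa
    return softmax(Q + bias, beta)

# One trial under VS: observe the demonstrator, update values, choose, learn.
demo_action = 1
Q = value_shaping_update(Q, demo_action, kappa)
choice = rng.choice(2, p=softmax(Q, beta))
reward = float(rng.random() < 0.7)            # toy reward probability
Q[choice] += alpha * (reward - Q[choice])     # standard reward-based update
```

The key difference is where the demonstrator's influence lives: under VS it is written into the value function, so it persists and interacts with subsequent reward learning, whereas under DB it is applied only transiently at choice time.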

2019 ◽  
Author(s):  
Anis Najar ◽  
Emmanuelle Bonnet ◽  
Bahador Bahrami ◽  
Stefano Palminteri

While there is no doubt that social signals affect human reinforcement learning, there is still no consensus about their exact computational implementation. To address this issue, we compared three hypotheses about the algorithmic implementation of imitation in human reinforcement learning. The first hypothesis, decision biasing, postulates that imitation consists in transiently biasing the learner’s action selection without affecting her value function. According to the second hypothesis, model-based imitation, the learner infers the demonstrator’s value function through inverse reinforcement learning and uses it for action selection. Finally, according to the third hypothesis, value shaping, the demonstrator’s actions directly affect the learner’s value function. We tested these three psychologically plausible hypotheses in two separate experiments (N = 24 and N = 44) featuring a new variant of a social reinforcement learning task, where we manipulated the quantity and the quality of the demonstrator’s choices. We show through model comparison that value shaping is favored, which provides a new perspective on how imitation is integrated into human reinforcement learning.


2020 ◽  
Vol 30 (6) ◽  
pp. 3573-3589 ◽  
Author(s):  
Rick A Adams ◽  
Michael Moutoussis ◽  
Matthew M Nour ◽  
Tarik Dahoun ◽  
Declan Lewis ◽  
...  

Choosing actions that result in advantageous outcomes is a fundamental function of nervous systems. All computational decision-making models contain a mechanism that controls the variability of (or confidence in) action selection, but its neural implementation is unclear—especially in humans. We investigated this mechanism using two influential decision-making frameworks: active inference (AI) and reinforcement learning (RL). In AI, the precision (inverse variance) of beliefs about policies controls action selection variability—similar to decision ‘noise’ parameters in RL—and is thought to be encoded by striatal dopamine signaling. We tested this hypothesis by administering a ‘go/no-go’ task to 75 healthy participants, and measuring striatal dopamine 2/3 receptor (D2/3R) availability in a subset (n = 25) using [11C]-(+)-PHNO positron emission tomography. In behavioral model comparison, RL performed best across the whole group but AI performed best in participants performing above chance levels. Limbic striatal D2/3R availability had linear relationships with AI policy precision (P = 0.029) as well as with RL irreducible decision ‘noise’ (P = 0.020), and this relationship with D2/3R availability was confirmed with a ‘decision stochasticity’ factor that aggregated across both models (P = 0.0006). These findings are consistent with occupancy of inhibitory striatal D2/3Rs decreasing the variability of action selection in humans.
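As a rough illustration of how single parameters can control action-selection variability, the sketch below combines a softmax with a precision (inverse temperature) parameter and an irreducible "decision noise" lapse term that mixes in uniform random responding. The function and parameter names are assumptions for illustration, not the study's fitted models.

```python
import numpy as np

def choice_probabilities(values, precision, lapse):
    """Softmax with precision (inverse temperature) plus an irreducible
    'decision noise' lapse that mixes in uniform responding."""
    v = precision * (values - np.max(values))
    p = np.exp(v) / np.exp(v).sum()
    return (1.0 - lapse) * p + lapse / len(values)

vals = np.array([0.2, 0.8])   # toy no-go vs. go values
print(choice_probabilities(vals, precision=8.0, lapse=0.0))  # near-deterministic
print(choice_probabilities(vals, precision=8.0, lapse=0.3))  # noisier selection
```

Higher precision concentrates choice on the best option, while a larger lapse flattens the distribution regardless of the values, which is why the two parameters can be aggregated into a single "decision stochasticity" factor.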


2020 ◽  
Vol 34 (10) ◽  
pp. 13965-13966
Author(s):  
Yuchen Xiao ◽  
Joshua Hoffman ◽  
Tian Xia ◽  
Christopher Amato

We consider the challenges of learning multi-agent/robot macro-action-based deep Q-nets, including how to properly update each macro-action value and how to accurately maintain macro-action-observation trajectories. We address these challenges by first proposing two fundamental frameworks for learning a macro-action-value function and a joint macro-action-value function. Furthermore, we present two new approaches to learning decentralized macro-action-based policies, which involve a new double Q-update rule that facilitates the learning of decentralized Q-nets by using a centralized Q-net for action selection. Our approaches are evaluated both in simulation and on real robots.
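The double Q-update idea can be caricatured in tabular, single-state form: a centralized joint-value table selects the greedy joint macro-action, and each agent's decentralized table evaluates its own component of that action. This is a loose sketch under those assumptions, not the paper's deep Q-net implementation; all names are hypothetical.

```python
import numpy as np

n_actions = 3
Q_cen = np.zeros((n_actions, n_actions))        # centralized joint macro-action values
Q_dec = [np.zeros(n_actions) for _ in range(2)] # one decentralized table per agent

def double_q_update(Q_dec, Q_cen, reward, alpha=0.1, gamma=0.95):
    """Centralized selection, decentralized evaluation: the joint table picks
    the greedy joint macro-action, and each agent bootstraps on its own
    component of that joint action."""
    joint = np.unravel_index(np.argmax(Q_cen), Q_cen.shape)
    for i, Q in enumerate(Q_dec):
        target = reward + gamma * Q[joint[i]]
        Q[joint[i]] += alpha * (target - Q[joint[i]])
    return Q_dec

Q_dec = double_q_update(Q_dec, Q_cen, reward=1.0)
```

Decoupling selection from evaluation in this way is the classic double-Q remedy for overestimation bias; here it additionally lets decentralized learners benefit from centralized information during training.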


2016 ◽  
Vol 28 (4) ◽  
pp. 371-381 ◽  
Author(s):  
Chao Lu ◽  
Jie Huang ◽  
Jianwei Gong

Reinforcement Learning (RL) has been proposed to deal with ramp control problems under dynamic traffic conditions; however, there is a lack of sufficient research on the behaviour and impacts of different learning parameters. This paper describes a ramp control agent based on the RL mechanism and thoroughly analyses the influence of three learning parameters, namely the learning rate, discount rate, and action selection parameter, on algorithm performance. Two indices for learning speed and convergence stability were used to measure algorithm performance, based on which a series of simulation-based experiments were designed and conducted using a macroscopic traffic flow model. Simulation results showed that, compared with the discount rate, the learning rate and action selection parameter had more remarkable impacts on algorithm performance. Based on this analysis, some suggestions about how to select suitable parameter values to achieve superior performance are provided.
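For readers unfamiliar with the three parameters, a minimal tabular Q-learning loop with epsilon-greedy selection shows where each one enters. The setup below is a toy stand-in, not the paper's macroscopic traffic model, and all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 3           # toy discretized ramp-metering problem
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1 # learning rate, discount rate, selection parameter

def select_action(s):
    """Epsilon-greedy: the action selection parameter trades exploration
    against exploitation."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_update(s, a, r, s_next):
    """alpha sets the learning speed; gamma weights future traffic outcomes."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

a = select_action(0)
q_update(0, a, r=1.0, s_next=1)
```

The paper's finding maps naturally onto this structure: alpha and epsilon directly shape how fast and how stably Q converges, while gamma mostly rescales the targets.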


2021 ◽  
Author(s):  
Haitao Song ◽  
Guihong Fan ◽  
Shi Zhao ◽  
Huichen Li ◽  
Qihua Huang ◽  
...  

By February 2021, the overall impact of the COVID-19 pandemic in India had been relatively mild in terms of total reported cases and deaths. Surprisingly, the second wave beginning in early April became devastating and attracted worldwide attention. On April 30, 2021, India became the first country to report over 400,000 daily new cases. Multiple factors drove the rapid growth of the epidemic in India and caused a large number of deaths within a very short period: a new variant with increased transmissibility, a lack of nationwide preparation, and health and safety precautions that were poorly implemented or enforced during festivals, sporting events, and state/local elections. Moreover, India's cases and deaths are vastly underreported due to poor infrastructure and low testing rates. In this paper, we use COVID-19 mortality data from India and a mathematical model to calculate the effective reproduction number and to model the wave pattern in India. We propose a new approach to forecasting the epidemic size and peak timing, with the aim of informing mitigation in India. Our model simulation matched the reported deaths accurately and is reasonably close to the results of serological studies. We forecast that the infection attack rate (IAR) could reach 43% by June 13, 2021 under the current trend, which means 532,629 reported deaths with a 95% CI (513,194, 552,445), i.e., double the current total deaths. Our approach is readily applicable in other countries and with other types of data (e.g., excess deaths).
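The paper fits a mechanistic model to mortality data; as a generic illustration of how an effective reproduction number can be read off an incidence series, here is a crude renewal-equation estimate, R_t = I_t / Σ_s w_s I_(t−s). The series and generation-interval weights below are made-up toy numbers, not the paper's data or method.

```python
import numpy as np

def effective_R(incidence, gen_interval):
    """Crude renewal-equation estimate R_t = I_t / sum_s w_s * I_{t-s},
    where w is the discretized generation-interval distribution."""
    w = gen_interval / gen_interval.sum()
    R = np.full(len(incidence), np.nan)
    for t in range(len(w), len(incidence)):
        denom = np.dot(w, incidence[t - len(w):t][::-1])  # most recent day first
        if denom > 0:
            R[t] = incidence[t] / denom
    return R

cases = np.array([10, 12, 15, 20, 26, 35, 46, 60, 80, 105, 140, 185], float)
w = np.array([0.05, 0.10, 0.20, 0.25, 0.20, 0.15, 0.05])  # toy 7-day interval
print(np.round(effective_R(cases, w), 2))
```

Working from deaths rather than cases, as the paper does, sidesteps some of the underreporting problem at the cost of a reporting delay that the full model must account for.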


Author(s):  
Daxue Liu ◽  
Jun Wu ◽  
Xin Xu

Multi-agent reinforcement learning (MARL) provides a useful and flexible framework for multi-agent coordination in uncertain dynamic environments. However, the generalization ability and scalability of algorithms to large problem sizes, already problematic in single-agent RL, is an even more formidable obstacle in MARL applications. In this paper, a new MARL method based on ordinal action selection and approximate policy iteration, called OAPI (Ordinal Approximate Policy Iteration), is presented to address the scalability issue of MARL algorithms in common-interest Markov games. In OAPI, an ordinal action selection and learning strategy is integrated with distributed approximate policy iteration not only to simplify the policy space and eliminate conflicts in multi-agent coordination, but also to approximate near-optimal policies for Markov games with large state spaces. Based on the policy space simplified by ordinal action selection, the OAPI algorithm implements distributed approximate policy iteration using online least-squares policy iteration (LSPI). This results in multi-agent coordination with good convergence properties and reduced computational complexity. The simulation results of a coordinated multi-robot navigation task illustrate the feasibility and effectiveness of the proposed approach.
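The abstract does not spell out the ordinal selection rule, so the following is only one plausible reading: actions are ranked by estimated value and chosen by rank rather than by value magnitude, which is what shrinks the policy space. Everything here is a hypothetical sketch, not the OAPI implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def ordinal_action_selection(q_values, rank_probs):
    """Actions are ranked by estimated value and selected by rank (mostly the
    best, occasionally a runner-up), so the policy depends only on the
    ordering of the values, not their magnitudes."""
    order = np.argsort(q_values)[::-1]          # best action first
    rank = rng.choice(len(rank_probs), p=rank_probs)
    return int(order[min(rank, len(order) - 1)])

q = np.array([0.1, 0.7, 0.4])
a = ordinal_action_selection(q, rank_probs=np.array([0.8, 0.15, 0.05]))
```

Because only the ordering matters, such a policy is insensitive to approximation error in the value magnitudes, which may be why it combines well with approximate policy iteration schemes like LSPI.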

