Dopamine regulates the exploration-exploitation trade-off in rats

2018
Author(s):
François Cinotti,
Virginie Fresno,
Nassim Aklil,
Etienne Coutureau,
Benoît Girard,
...  

Abstract: In a volatile environment where rewards are uncertain, successful performance requires a delicate balance between exploitation of the best option and exploration of alternative choices. It has theoretically been proposed that dopamine controls this exploration-exploitation trade-off; specifically, the higher the level of tonic dopamine, the more exploitation is favored. We demonstrate here that there is a formal relationship between the rescaling of positive dopamine reward prediction errors and the exploration-exploitation trade-off in simple non-stationary multi-armed bandit tasks. We further show in rats performing such a task that systemically antagonizing dopamine receptors greatly increases the number of random choices without affecting learning capacities. Simulations and comparison of a set of different computational models (an extended Q-learning model, a directed exploration model, and a meta-learning model) fitted to each individual confirm that, independently of the model, decreasing dopaminergic activity does not affect the learning rate but is equivalent to an increase in the exploration rate. This study shows that dopamine could adapt the exploration-exploitation trade-off in decision making when facing changing environmental contingencies.
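As an illustration of this equivalence, below is a minimal Python sketch, ours rather than the authors' code, of a softmax Q-learner on a non-stationary two-armed bandit in which only positive RPEs are rescaled by a factor kappa; treating dopamine antagonism, hypothetically, as kappa < 1 leaves the learning rate untouched but produces more random choices.

```python
# A minimal sketch (our illustration, not the authors' model code) of how
# rescaling positive RPEs by kappa mimics a change in exploration rate.
import numpy as np

rng = np.random.default_rng(0)

def softmax(q, beta):
    """Choice probabilities under inverse temperature beta."""
    z = beta * (q - q.max())              # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def run_bandit(kappa, beta=5.0, alpha=0.1, n_trials=2000):
    """Q-learning on a two-armed Bernoulli bandit with a mid-task reversal.

    Only positive RPEs are scaled by kappa; kappa < 1 is a hypothetical
    stand-in for reduced dopaminergic signaling.
    """
    p_reward = np.array([0.8, 0.2])
    q = np.zeros(2)
    n_best = 0
    for t in range(n_trials):
        if t == n_trials // 2:
            p_reward = p_reward[::-1]     # environmental reversal
        a = rng.choice(2, p=softmax(q, beta))
        n_best += int(a == np.argmax(p_reward))
        r = float(rng.random() < p_reward[a])
        rpe = r - q[a]
        q[a] += alpha * (kappa * rpe if rpe > 0 else rpe)
    return n_best / n_trials

# Shrinking positive RPEs flattens Q-value contrasts, which under the
# softmax acts like lowering beta: fewer best-arm (exploitative) choices.
for kappa in (1.0, 0.5):
    print(f"kappa={kappa}: fraction of best-arm choices = {run_bandit(kappa):.2f}")
```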

Author(s):  
Mitsuo Kawato,
Aurelio Cortese

Abstract: In several papers published in Biological Cybernetics in the 1980s and 1990s, Kawato and colleagues proposed computational models explaining how internal models are acquired in the cerebellum. These models were later supported by neurophysiological experiments using monkeys and neuroimaging experiments involving humans. These early studies influenced neuroscience from basic sensory-motor control to higher cognitive functions. One of the most perplexing enigmas related to internal models is to understand the neural mechanisms that enable animals to learn large-dimensional problems with so few trials. Consciousness and metacognition, the ability to monitor one's own thoughts, may be part of the solution to this enigma. Based on literature reviews of the past 20 years, here we propose a computational neuroscience model of metacognition. The model comprises a modular hierarchical reinforcement-learning architecture of parallel and layered generative-inverse model pairs. In the prefrontal cortex, a distributed executive network called the "cognitive reality monitoring network" (CRMN) orchestrates conscious involvement of generative-inverse model pairs in perception and action. Based on mismatches between computations by generative and inverse models, as well as reward prediction errors, the CRMN computes a "responsibility signal" that gates the selection and learning of pairs in perception, action, and reinforcement learning. A high responsibility signal is given to pairs that best capture the external world, that are competent in movements (small mismatch), and that are capable of reinforcement learning (small reward prediction error). The CRMN selects pairs with higher responsibility signals as objects of metacognition, and consciousness is determined by the entropy of responsibility signals across all pairs. This model could lead to new-generation AI, which exhibits metacognition, consciousness, dimension reduction, selection of modules and corresponding representations, and learning from small samples. It may also lead to the development of a new scientific paradigm that enables the causal study of consciousness by combining the CRMN and decoded neurofeedback.
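The gating step can be made concrete with a small sketch: a softmax over per-pair error terms yields responsibility signals, and their entropy serves as the consciousness index. The error weighting and the softmax form below are our assumptions; the abstract specifies the CRMN only conceptually.

```python
# A hypothetical sketch of the responsibility computation: per-pair
# generative-inverse mismatch and reward prediction error are combined
# into a total error, a softmax over negative error yields responsibility
# signals, and their Shannon entropy indexes consciousness. Weights and
# the softmax form are our assumptions, not the paper's specification.
import numpy as np

def responsibilities(mismatch, rpe, w_mismatch=1.0, w_rpe=1.0, beta=2.0):
    """Small total error -> high responsibility for that model pair."""
    error = w_mismatch * np.asarray(mismatch) + w_rpe * np.abs(np.asarray(rpe))
    z = -beta * (error - error.min())     # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy (nats) of the responsibility distribution."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Three module pairs; pair 0 best captures the current context.
lam = responsibilities(mismatch=[0.1, 0.9, 1.2], rpe=[0.05, 0.4, 0.7])
print("responsibilities:", np.round(lam, 3))
print("entropy (low when one pair dominates):", round(entropy(lam), 3))
```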


2017
Author(s):
George Velentzas,
Costas Tzafestas,
Mehdi Khamassi

Abstract: Fast adaptation to changes in the environment requires both natural and artificial agents to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e., choosing the action that subjectively appears best at a given moment) relative to exploratory choices (i.e., testing other actions that currently appear worse but may turn out promising later). The problem of finding an efficient exploration-exploitation trade-off has been well studied in both the Machine Learning and Computational Neuroscience fields. Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the former have proven that principles such as exploring actions that have not been tested for a long time can lead to near-optimal performance with bounded regret. In parallel, research in the latter has investigated solutions such as progressively increasing exploitation in response to improvements in performance, transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when performing those actions. In this work, we first bridge some of these methods from the two research fields by rewriting their decision processes in a common formalism. We then present numerical simulations of a hybrid algorithm combining bio-inspired meta-learning, Kalman filtering, and exploration bonuses, compared to several state-of-the-art alternatives on a set of non-stationary stochastic multi-armed bandit tasks. While we find that different methods are appropriate in different scenarios, the hybrid algorithm displays a good combination of advantages from the different methods and outperforms them in the studied scenarios.
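To make the hybrid's ingredients concrete, here is a compact sketch, under our own parameter assumptions, combining per-arm Kalman filters for non-stationary value tracking, an exploration bonus proportional to posterior uncertainty, and a meta-learned inverse temperature that increases exploitation as average performance improves.

```python
# A sketch (parameters and reward model are our assumptions) of the three
# hybrid ingredients: Kalman-filter value tracking, uncertainty bonuses,
# and a meta-learned inverse temperature tied to average performance.
import numpy as np

rng = np.random.default_rng(1)

class HybridBandit:
    def __init__(self, n_arms, q_drift=0.01, r_obs=0.1, bonus=1.0):
        self.mu = np.zeros(n_arms)       # posterior mean value per arm
        self.var = np.ones(n_arms)       # posterior variance per arm
        self.q_drift, self.r_obs, self.bonus = q_drift, r_obs, bonus
        self.avg_r, self.beta = 0.0, 1.0 # running reward and meta-learned temperature

    def choose(self):
        score = self.mu + self.bonus * np.sqrt(self.var)   # uncertainty bonus
        p = np.exp(self.beta * (score - score.max()))
        return int(rng.choice(len(self.mu), p=p / p.sum()))

    def update(self, arm, reward):
        self.var += self.q_drift                           # values drift each trial
        k = self.var[arm] / (self.var[arm] + self.r_obs)   # Kalman gain
        self.mu[arm] += k * (reward - self.mu[arm])
        self.var[arm] *= 1 - k
        self.avg_r += 0.05 * (reward - self.avg_r)         # meta-learning:
        self.beta = max(0.1, 5.0 * self.avg_r)             # exploit more when doing well

bandit, means = HybridBandit(3), np.array([0.2, 0.8, 0.5])
for t in range(1000):
    if t == 500:
        means = means[::-1]                                # non-stationarity
    arm = bandit.choose()
    bandit.update(arm, float(rng.random() < means[arm]))
print("posterior means after reversal:", np.round(bandit.mu, 2))
```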


2019
Author(s):
David J. Ottenheimer,
Bilal A. Bari,
Elissa Sutlief,
Kurt M. Fraser,
Tabitha H. Kim,
...  

Abstract: Learning from past interactions with the environment is critical for adaptive behavior. Within the framework of reinforcement learning, the nervous system builds expectations about future reward by computing reward prediction errors (RPEs), the difference between actual and predicted rewards. Correlates of RPEs have been observed in the midbrain dopamine system, which is thought to locally compute this important variable in service of learning. However, the extent to which RPE signals may be computed upstream of the dopamine system is largely unknown. Here, we quantify history-based RPE signals in the ventral pallidum (VP), an input region to the midbrain dopamine system implicated in reward-seeking behavior. We trained rats to associate cues with future delivery of reward and fit computational models to predict individual neuron firing rates at the time of reward delivery. We found that a subset of VP neurons encoded RPEs, and did so more robustly than the nucleus accumbens, an input to the VP. VP RPEs predicted trial-by-trial task engagement, and optogenetic inhibition of the VP reduced subsequent task-related reward seeking. Consistent with reinforcement learning, the activity of VP RPE cells adapted when rewards were delivered in blocks. We further found that history- and cue-based RPEs were largely separate across the VP neural population. The presence of behaviorally instructive RPE signals in the VP suggests a pivotal role for this region in value-based computations.
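The history-based RPE fitting can be sketched as follows; the Rescorla-Wagner value update, the learning rate, and the linear firing-rate model are illustrative assumptions rather than the authors' exact specification.

```python
# A schematic sketch of history-based RPE regression on firing rates;
# alpha and the linear link are illustrative assumptions.
import numpy as np

def history_rpes(rewards, alpha=0.3):
    """RPE_t = r_t - V_t, with V updated from reward history (Rescorla-Wagner)."""
    v, rpes = 0.0, []
    for r in rewards:
        rpe = r - v
        rpes.append(rpe)
        v += alpha * rpe
    return np.array(rpes)

rng = np.random.default_rng(2)
rewards = (rng.random(500) < 0.6).astype(float)
rpes = history_rpes(rewards)

# A simulated neuron whose rate scales with RPE; the regression slope
# then serves as a per-neuron index of RPE coding strength.
rates = 5.0 + 2.0 * rpes + rng.normal(0.0, 0.5, size=rpes.size)
slope, intercept = np.polyfit(rpes, rates, 1)
print(f"fitted RPE slope: {slope:.2f} (simulated ground truth: 2.0)")
```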


2021
Author(s):
Lucrezia Liuzzi,
Katharine K Chang,
Hanna Keren,
Dipta Saha,
Charles Zheng,
...  

Despite the frequency of mood disorders in the population, our understanding of neuronal markers of mood remains elusive, which stalls the development of targeted, brain-based treatments for these problems. Computational models can help identify the parameters likely to affect self-reported mood during mood induction tasks. Here we test whether our previously proposed computational model of the dynamics of self-reported mood during monetary gambling can be used to identify trial-by-trial variations in neuronal activity. To this end, we shifted mood in healthy (N=24) and depressed (N=30) adolescents by delivering individually tailored reward prediction errors while recording magnetoencephalography (MEG) data. Following a pre-registered analysis, we hypothesized that expectation (defined by previous reward outcomes) would be predictive of beta-gamma oscillatory power (25-40 Hz), a frequency band shown to modulate with reward feedback. We also hypothesized that model variables would predict trial variations in the evoked response to the presentation of gambling options and in source-localized responses to reward feedback. Through our multilevel statistical analysis, we found confirmatory evidence that beta-gamma power is positively related to reward expectation during mood shifts, with possible localized sources in the posterior cingulate cortex. We also confirmed reward prediction error to be predictive of trial-level variations in the response of the paracentral lobule, and expectation to have an effect on the cerebellum after presentation of gambling options. To our knowledge, this is the first study to relate fluctuations in mood on a minute time-scale to variations in neural oscillations with non-invasive electrophysiology.
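For orientation, the sketch below shows a trial-level mood model of the general form used in this literature, mood as a baseline plus exponentially decaying traces of expectations and RPEs; the weights and decay are our assumptions, and this is not the authors' exact pre-registered model.

```python
# An illustrative trial-level mood model (general form only; not the
# authors' pre-registered specification). Weights and decay are assumed.
import numpy as np

def mood_series(expectations, rpes, m0=0.0, w_e=0.4, w_rpe=0.6, gamma=0.7):
    """Mood per trial from decaying traces of expectations and RPEs."""
    moods, e_trace, r_trace = [], 0.0, 0.0
    for e, d in zip(expectations, rpes):
        e_trace = gamma * e_trace + e    # lingering influence of expectations
        r_trace = gamma * r_trace + d    # lingering influence of RPEs
        moods.append(m0 + w_e * e_trace + w_rpe * r_trace)
    return np.array(moods)

rng = np.random.default_rng(4)
expectations = rng.normal(0.5, 0.2, 50)  # expectation = f(previous outcomes)
rpes = rng.normal(0.0, 1.0, 50)          # tailored RPEs shift mood up or down
print("first five mood samples:", np.round(mood_series(expectations, rpes)[:5], 2))
```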


2020
Author(s):
Kate Ergo,
Luna De Vilder,
Esther De Loof,
Tom Verguts

Recent years have witnessed a steady increase in the number of studies investigating the role of reward prediction errors (RPEs) in declarative learning. Specifically, in several experimental paradigms RPEs drive declarative learning, with larger and more positive RPEs enhancing it. However, it is unknown whether this RPE must derive from the participant's own response, or whether instead any RPE is sufficient to produce the learning effect. To test this, we generated RPEs in a single experimental paradigm that combined an agency and a non-agency condition. We observed no interaction between RPE and agency, suggesting that any RPE (irrespective of its source) can drive declarative learning. This result holds implications for declarative learning theory.
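The reported pattern can be summarized in a toy model: recall probability rises with signed RPE, and a zero RPE-by-agency interaction term encodes the null interaction; all coefficients below are illustrative assumptions.

```python
# A toy sketch of the reported pattern: P(recall) grows with signed RPE,
# and b_inter = 0 encodes the observed absence of an RPE-by-agency
# interaction. All coefficients are illustrative assumptions.
import numpy as np

def recall_probability(rpe, agency, b0=0.0, b_rpe=1.5, b_agency=0.0, b_inter=0.0):
    """Logistic link from RPE (and agency) to recall probability."""
    x = b0 + b_rpe * rpe + b_agency * agency + b_inter * rpe * agency
    return 1.0 / (1.0 + np.exp(-x))

for agency in (0, 1):                     # non-agency vs. agency condition
    probs = [recall_probability(rpe, agency) for rpe in (-0.5, 0.0, 0.5)]
    print(f"agency={agency}:", np.round(probs, 3))  # same curve in both conditions
```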


Mathematics
2021
Vol 9 (16)
pp. 1839
Author(s):
Broderick Crawford,
Ricardo Soto,
José Lemus-Romani,
Marcelo Becerra-Rozas,
José M. Lanza-Gutiérrez,
...  

One of the central issues that must be resolved for a metaheuristic optimization process to work well is the dilemma of balancing exploration and exploitation. Metaheuristics (MH) that achieve this balance can be called balanced MH. A Q-Learning (QL) integration framework was previously proposed for selecting the metaheuristic operators conducive to this balance, in particular for selecting binarization schemes when a continuous metaheuristic solves binary combinatorial problems. In this work, the use of this framework is extended to other recent metaheuristics, demonstrating that integrating QL into operator selection improves the exploration-exploitation balance. Specifically, the Whale Optimization Algorithm and the Sine-Cosine Algorithm are tested by solving the Set Covering Problem, showing statistical improvements in this balance and in the quality of the solutions.
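A stateless tabular sketch of such a QL operator selector appears below; the operator names, the binary reward, and the stateless simplification are our illustrative assumptions, not the framework's exact design.

```python
# A schematic, stateless Q-learning selector over binarization operators;
# operator names ("V1", ..., "S2") and the binary reward are illustrative.
import random

class OperatorSelector:
    """Tabular Q-learning agent that picks a binarization scheme each iteration."""
    def __init__(self, operators, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.operators = list(operators)
        self.q = {op: 0.0 for op in self.operators}
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select(self):
        if random.random() < self.epsilon:            # occasional exploration
            return random.choice(self.operators)
        return max(self.operators, key=self.q.get)    # otherwise exploit

    def update(self, op, reward):
        # Single-state simplification: the "next state" value is the best Q.
        best_next = max(self.q.values())
        self.q[op] += self.alpha * (reward + self.gamma * best_next - self.q[op])

selector = OperatorSelector(["V1", "V2", "S1", "S2"])  # hypothetical scheme names
op = selector.select()
selector.update(op, reward=1.0)  # e.g., 1.0 if the scheme improved the incumbent
```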


Sensors
2021
Vol 21 (6)
pp. 1960
Author(s):
Azade Fotouhi,
Ming Ding,
Mahbub Hassan

In this paper, we address the application of flying Drone Base Stations (DBSs) to improve network performance. Given the high degrees of freedom of a DBS, it can change its position and adapt its trajectory according to the users' movements and the target environment. A two-hop communication model between an end-user and a macrocell through a DBS is studied in this work. We propose Q-learning and Deep Q-learning based solutions to optimize the drone's trajectory. Simulation results show that, by employing our proposed models, the drone can fly autonomously and adapt its mobility to the users' movements. Additionally, the Deep Q-learning model outperforms the Q-learning model and can be applied in more complex environments.
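As a toy illustration of the tabular variant, the sketch below trains a Q-learning policy that steers a DBS toward a user on a discrete grid; the grid state space, movement model, and distance-based reward are our assumptions, not the paper's system model.

```python
# A toy grid-world Q-learning sketch for DBS positioning; the state
# space, actions, and distance-based reward are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
GRID = 5
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)]   # N, S, W, E, hover
q = np.zeros((GRID, GRID, len(ACTIONS)))               # Q-table over positions
user = (4, 4)                                          # user location (static here)

def reward(pos):
    """Proxy for link quality: higher when the drone is closer to the user."""
    return -abs(pos[0] - user[0]) - abs(pos[1] - user[1])

pos, alpha, gamma, eps = (0, 0), 0.2, 0.95, 0.1
for _ in range(5000):
    a = int(rng.integers(len(ACTIONS))) if rng.random() < eps else int(np.argmax(q[pos]))
    nxt = (min(GRID - 1, max(0, pos[0] + ACTIONS[a][0])),
           min(GRID - 1, max(0, pos[1] + ACTIONS[a][1])))
    q[pos][a] += alpha * (reward(nxt) + gamma * q[nxt].max() - q[pos][a])
    pos = nxt
print("drone position after training (user at (4, 4)):", pos)
```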


2021
Author(s):
Joseph Heffner,
Jae-Young Son,
Oriel FeldmanHall

People make decisions based on deviations from expected outcomes, known as prediction errors. Past work has focused on reward prediction errors, largely ignoring violations of expected emotional experiences, i.e., emotion prediction errors. We leverage a new method to measure real-time fluctuations in emotion as people decide to punish or forgive others. Across four studies (N=1,016), we reveal that emotion and reward prediction errors make distinguishable contributions to choice, such that emotion prediction errors exert the strongest impact during decision-making. We additionally find that a choice to punish or forgive can be decoded in less than a second from an evolving emotional response, suggesting that emotions swiftly influence choice. Finally, individuals reporting significant levels of depression exhibit selective impairments in using emotion, but not reward, prediction errors. Evidence that emotion prediction errors potently guide social behavior challenges standard decision-making models that have focused solely on reward.
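The headline result can be summarized with a simple choice sketch: punishment probability as a logistic function of emotion and reward prediction errors, with the emotion coefficient dominating; the functional form and coefficients below are our illustrative assumptions.

```python
# A schematic choice model consistent with the reported pattern: the
# emotion prediction error (EPE) coefficient dominates the reward
# prediction error (RPE) coefficient. Values are illustrative assumptions.
import numpy as np

def p_punish(epe, rpe, b0=0.0, b_epe=2.0, b_rpe=0.5):
    """Feeling worse than expected (negative EPE) pushes toward punishment."""
    x = b0 - b_epe * epe - b_rpe * rpe
    return 1.0 / (1.0 + np.exp(-x))

print(round(p_punish(epe=-0.8, rpe=-0.2), 3))  # strongly negative EPE: likely punish
print(round(p_punish(epe=+0.5, rpe=-0.2), 3))  # positive EPE: likely forgive
```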

