Integral reinforcement learning for zero-sum two-player games

Author(s):  
Yuntao Han ◽  
Qibin Zhou ◽  
Fuqing Duan

Abstract: The digital curling game is a two-player zero-sum extensive game in a continuous action space. Several challenging problems remain open, such as strategy uncertainty, searching the large game tree, and the reliance on large amounts of supervised data. In this work, we combine NFSP and KR-UCT for digital curling games, where NFSP uses two adversarial learning networks and can automatically produce supervised data, and KR-UCT can be used for searching the large game tree in a continuous action space. We propose two reward mechanisms to make reinforcement learning converge quickly. Experimental results validate the proposed method and show that the strategy model can reach the Nash equilibrium.
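A minimal sketch of the kernel-regression UCB idea behind KR-UCT, under the assumption of a 1-D continuous action space; the function names and the Gaussian kernel bandwidth are illustrative, not taken from the paper. Each candidate action's value estimate and visit count are smoothed over nearby actions, so information gathered for one shot generalizes to similar shots:

```python
import math

def gaussian_kernel(a, b, bandwidth=0.1):
    """Similarity between two continuous actions (1-D for simplicity)."""
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))

def kr_ucb_scores(actions, values, visits, c=1.0):
    """Kernel-regression UCB scores: values and visit counts of each
    candidate are smoothed over neighboring actions before the usual
    exploration bonus is added."""
    total = sum(visits)
    scores = []
    for a in actions:
        w = [gaussian_kernel(a, b) for b in actions]
        wn = sum(wi * ni for wi, ni in zip(w, visits))  # smoothed visit count
        wv = sum(wi * ni * vi for wi, ni, vi in zip(w, visits, values)) / max(wn, 1e-9)
        scores.append(wv + c * math.sqrt(math.log(total + 1) / (wn + 1e-9)))
    return scores
```

With a narrow kernel, well-separated actions keep their own estimates; as the bandwidth grows, nearby actions increasingly share statistics, which is what makes UCT-style search feasible in a continuous action space.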


Automatica ◽  
2020 ◽  
Vol 112 ◽  
pp. 108672 ◽  
Author(s):  
Adedapo Odekunle ◽  
Weinan Gao ◽  
Masoud Davari ◽  
Zhong-Ping Jiang

Games ◽  
2020 ◽  
Vol 11 (3) ◽  
pp. 25
Author(s):  
Vincent Srihaput ◽  
Kaylee Craplewe ◽  
Benjamin James Dyson

Predictability is a hallmark of poor-quality decision-making during competition. One source of predictability is the strong association between current outcome and future action, as dictated by the reinforcement learning principles of win–stay and lose–shift. We tested the idea that predictability could be reduced during competition by weakening the associations between outcome and action. To do this, participants completed a competitive zero-sum game in which the opponent from the current trial was either replayed (opponent repeat), thereby strengthening the association, or replaced (opponent change) by a different competitor, thereby weakening the association. We observed that win–stay behavior was reduced during opponent change trials, but lose–shift behavior remained reliably predictable. Consistent with the group data, the number of individuals who exhibited predictable behavior following wins decreased for opponent change relative to opponent repeat trials. Our data show that future actions are more under internal control following positive relative to negative outcomes, and that externally breaking the bonds between outcome and action via opponent association also allows us to become less prone to exploitation.
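The win–stay/lose–shift rule described above can be sketched in a few lines; this is an illustrative toy (binary actions, matching-pennies style), not the authors' experimental task. It shows why a pure win–stay/lose–shift player is perfectly predictable to an opponent who knows the rule:

```python
def wsls_action(prev_action, prev_outcome):
    """Win-stay/lose-shift over binary actions 0/1: repeat the previous
    action after a win, switch to the alternative after a loss."""
    if prev_outcome == "win":
        return prev_action  # stay
    return 1 - prev_action  # shift

def predictability(history):
    """Fraction of trials on which the WSLS rule correctly predicts the
    player's next action, given a list of (action, outcome) pairs."""
    hits = 0
    for (a, o), (a_next, _) in zip(history, history[1:]):
        if wsls_action(a, o) == a_next:
            hits += 1
    return hits / max(len(history) - 1, 1)
```

A player whose choices follow the rule exactly scores a predictability of 1.0; weakening the outcome–action association (as in the opponent-change condition) would push this measure down toward chance.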


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Rui Zhang ◽  
Hui Xia ◽  
Chao Liu ◽  
Ruo-bing Jiang ◽  
Xiang-guo Cheng

The Internet of Things enables the leap from traditional industry to intelligent industry. However, it leaves edge devices more vulnerable to attackers while they process perceptual data in real time. To address this problem, we use a zero-sum game to model the interactions between attackers and edge devices and propose an anti-attack scheme based on deep reinforcement learning. First, we use the kNN-DTW algorithm to find samples similar to the current sample and apply the weighted moving mean method to calculate the mean and variance of those samples. Second, to solve the overestimation problem, we develop an optimal strategy algorithm that finds the optimal strategy for the edge devices. Experimental results show that the new scheme improves the payoff of attacked edge devices and decreases the payoff of attackers, thus forcing the attackers to give up the attack.
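A minimal sketch of the kNN-DTW step the abstract describes: dynamic time warping as the distance between time series, with a k-nearest-neighbor lookup on top. The function names are illustrative, not from the paper:

```python
def dtw_distance(s, t):
    """Dynamic time warping distance between two numeric sequences,
    computed with the standard O(len(s) * len(t)) dynamic program."""
    n, m = len(s), len(t)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m]

def knn_dtw(query, samples, k=1):
    """Return the k stored samples closest to the query under DTW."""
    return sorted(samples, key=lambda s: dtw_distance(query, s))[:k]
```

Unlike Euclidean distance, DTW tolerates local time shifts, which is why it is a common choice for matching sensor traces from edge devices.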


2020 ◽  
Author(s):  
Vincent Srihaput ◽  
Kaylee Craplewe ◽  
Ben Dyson

Predictability is a hallmark of poor-quality decision-making during competition. One source of predictability is the strong association between current outcome and future action, as dictated by the reinforcement learning principles of win-stay and lose-shift. We tested the idea that predictability could be reduced during competition by weakening the associations between outcome and action. To do this, participants completed a competitive zero-sum game in which the opponent from the current trial was either replayed (opponent repeat), thereby strengthening the association, or replaced (opponent change) by a different competitor, thereby weakening the association. We observed that win-stay behaviour was reduced during opponent change trials but lose-shift behaviour remained reliably predictable. Consistent with the group data, the number of individuals who exhibited predictable behaviour following wins decreased for opponent change relative to opponent repeat trials. Our data show that future actions are more under internal control following positive relative to negative outcomes, and that externally breaking the bonds between outcome and action via opponent association also allows us to become less prone to exploitation.


2009 ◽  
Vol 72 (7-9) ◽  
pp. 1494-1507
Author(s):  
Benoît Frénay ◽  
Marco Saerens
