A UAV Maneuver Decision-Making Algorithm for Autonomous Airdrop Based on Deep Reinforcement Learning

Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2233 ◽  
Author(s):  
Ke Li ◽  
Kun Zhang ◽  
Zhenchong Zhang ◽  
Zekun Liu ◽  
Shuai Hua ◽  
...  

How to operate an unmanned aerial vehicle (UAV) safely and efficiently in an interactive environment is a challenging problem. A large amount of research has been devoted to improving the intelligence of a UAV while it performs a mission, and finding an optimal maneuver decision-making policy has become one of the key issues in enabling UAV autonomy. In this paper, we propose a maneuver decision-making algorithm based on deep reinforcement learning, which generates efficient maneuvers for a UAV agent to execute an airdrop mission autonomously in an interactive environment. In particular, the training set of the learning algorithm is constructed with Prioritized Experience Replay, which accelerates the convergence of the decision network during training. Extensive experimental results show that the algorithm finds a desirable and effective maneuver decision-making policy.
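As a minimal sketch of the replay mechanism the abstract names, the following proportional prioritized buffer assigns each transition a priority from its |TD-error|, as in the standard Prioritized Experience Replay formulation; the paper's exact prioritization scheme is not given in the abstract, and the hyperparameter values here are illustrative.

```python
import random

class PrioritizedReplayBuffer:
    """Proportional prioritized replay: P(i) proportional to (|TD-error_i| + eps)^alpha."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.eps = eps              # keeps every transition sampleable
        self.data, self.priorities = [], []
        self.pos = 0

    def add(self, transition, td_error):
        p = (abs(td_error) + self.eps) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:                        # overwrite oldest entry (ring buffer)
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sample transition indices in proportion to their priority.
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return idxs, [self.data[i] for i in idxs]
```

High-TD-error transitions are replayed more often, which is what speeds up decision-network convergence relative to uniform sampling.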

Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 737
Author(s):  
Fengjie Sun ◽  
Xianchang Wang ◽  
Rui Zhang

An Unmanned Aerial Vehicle (UAV) can greatly reduce the manpower required for agricultural plant protection tasks such as watering, sowing, and pesticide spraying. It is essential to develop a Decision-making Support System (DSS) that helps a UAV choose the correct action in each state according to a policy. In an unknown environment, hand-crafting rules to guide the UAV's action selection is not applicable, and obtaining the optimal policy through reinforcement learning is a feasible alternative. However, experiments show that existing reinforcement learning algorithms cannot obtain the optimal policy for a UAV in the agricultural plant protection environment. In this work we propose an improved Q-learning algorithm based on similar state matching, and we prove theoretically that, in this environment, a UAV following the policy learned by our algorithm chooses the optimal action with higher probability than one following the policy learned by the classic Q-learning algorithm. The proposed algorithm is implemented and tested on evenly distributed datasets built from real UAV parameters and real farm information, and its performance is discussed in detail. Experimental results show that the proposed algorithm efficiently learns the optimal policy for UAVs in the agricultural plant protection environment.
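A hedged sketch of the similar-state-matching idea: when the agent reaches a state it has never updated, it falls back on the Q-values of the most similar previously visited state. The similarity metric (Euclidean distance over state feature tuples) and all names here are illustrative assumptions; the abstract does not specify the paper's matching rule.

```python
import math
import random
from collections import defaultdict

def nearest_visited(state, visited):
    # Euclidean distance over state feature tuples (assumed representation).
    return min(visited, key=lambda s: math.dist(s, state))

def choose_action(Q, state, actions, visited, epsilon=0.1):
    if random.random() < epsilon or not visited:
        return random.choice(actions)
    # Unseen states borrow values from the closest visited state.
    key = state if state in visited else nearest_visited(state, visited)
    return max(actions, key=lambda a: Q[(key, a)])

def q_update(Q, visited, s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    visited.add(s)

Q = defaultdict(float)   # tabular action values, default 0.0
visited = set()          # states with at least one update
```

The fallback lets learned values generalize across nearby field positions instead of acting randomly in every unseen state.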


Author(s):  
Xingxing Liang ◽  
Li Chen ◽  
Yanghe Feng ◽  
Zhong Liu ◽  
Yang Ma ◽  
...  

Reinforcement learning, as an effective method for solving complex sequential decision-making problems, plays an important role in areas such as intelligent decision-making and behavioral cognition. It is well known that the experience replay mechanism contributes to the development of deep reinforcement learning by reusing past samples and thereby improving sample efficiency. However, the existing prioritized experience replay mechanism distorts the sample distribution in the replay buffer, because specific transitions are assigned higher sampling frequencies, and it cannot be applied to actor-critic and other on-policy reinforcement learning algorithms. To address this, we propose an adaptive factor based on TD-error, which further increases sample utilization by giving larger attention weights to samples with larger TD-errors, and which can be embedded flexibly into the original Deep Q-Network (DQN) and Advantage Actor-Critic (A2C) algorithms to improve their performance. We then evaluate the proposed architecture on CartPole-V1 and on six Atari game environments. Under both fixed-temperature and annealed-temperature conditions, the improved algorithms outperform the vanilla DQN and the original A2C in cumulative reward and in the rate at which rewards climb.
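One way to realize such an adaptive factor, sketched below under our own assumptions: weight each sample's loss within the batch by a temperature-controlled softmax over |TD-error|, which reweights rather than resamples and therefore also fits on-policy batches. The abstract mentions fixed and annealed temperatures; the exact weighting formula is ours to illustrate.

```python
import torch

def td_attention_weights(td_errors: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Larger |TD-error| -> larger weight; tau -> infinity recovers uniform weights.
    w = torch.softmax(td_errors.abs() / tau, dim=0)
    return w * len(td_errors)   # rescale so weights average to 1.0

def weighted_td_loss(q_pred: torch.Tensor, q_target: torch.Tensor, tau: float = 1.0):
    td = (q_target - q_pred).detach()        # weights carry no gradient
    w = td_attention_weights(td, tau)
    return (w * (q_target - q_pred) ** 2).mean()
```

Because every sample in the batch still contributes (only its weight changes), the batch distribution itself is not altered the way priority resampling alters it.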


2019 ◽  
Vol 103 (1) ◽  
pp. 003685041987902 ◽  
Author(s):  
Ronglei Xie ◽  
Zhijun Meng ◽  
Yaoming Zhou ◽  
Yunpeng Ma ◽  
Zhe Wu

Existing reinforcement learning algorithms have difficulty converging on the three-dimensional path-planning problem of an unmanned aerial vehicle because of its excessive state space. To solve this problem, this article proposes a reinforcement learning algorithm that combines a heuristic function with an experience replay mechanism based on the maximum average reward. Knowledge of track performance is introduced to construct the heuristic function, which guides the unmanned aerial vehicle's action selection and reduces useless exploration. The experience replay mechanism based on the maximum average reward increases the utilization of excellent samples and the convergence speed of the algorithm. Simulation results show that the proposed three-dimensional path-planning algorithm has good learning efficiency, and that its convergence speed and training performance are significantly improved.
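An illustrative sketch of the two ideas the abstract names: a heuristic term added to Q-values that biases action selection toward actions reducing distance to the goal waypoint, and replay sampling that prefers episodes with the highest average reward. The distance-based heuristic, the `step_model` hook, and the transition layout are assumptions, not the paper's definitions.

```python
import math
import random

def heuristic_bonus(pos, next_pos, goal, weight=0.5):
    # Favor actions whose predicted next position is closer to the goal.
    return weight * (math.dist(pos, goal) - math.dist(next_pos, goal))

def select_action(Q, state, pos, actions, goal, step_model):
    # step_model(pos, action) -> predicted next position, standing in for the
    # track-performance knowledge the abstract mentions (assumed available).
    return max(actions, key=lambda a: Q[(state, a)]
               + heuristic_bonus(pos, step_model(pos, a), goal))

def sample_best_episodes(episodes, batch_size, top_k=10):
    # Keep the top_k episodes by mean reward, then sample transitions uniformly.
    best = sorted(episodes,
                  key=lambda ep: sum(t["reward"] for t in ep) / len(ep),
                  reverse=True)[:top_k]
    pool = [t for ep in best for t in ep]
    return random.sample(pool, min(batch_size, len(pool)))
```

The heuristic prunes exploration of headings that move away from the goal, while the replay rule concentrates updates on the most successful trajectories.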


Author(s):  
Bailin Song ◽  
Hua Xu ◽  
Lei Jiang ◽  
Ning Rao

To solve the problem of intelligent anti-jamming decision-making in battlefield communication, this paper designs an intelligent communication anti-jamming decision-making method based on deep reinforcement learning. By introducing experience replay and a PHC-based dynamic epsilon mechanism into the DQN framework, we propose a dynamic epsilon-DQN decision-making method. The algorithm selects the value of epsilon according to the state of the decision network, improving the convergence speed and the decision success rate. During decision-making, the jamming signals on all communication frequencies are detected and fed into the decision algorithm as jamming-discrimination information, so that jamming can be avoided effectively even without prior information about the jammer. Experimental results show that the proposed method adapts to various communication models and makes decisions quickly, and that its average success rate after convergence exceeds 95%, a substantial advantage over existing decision-making methods.
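A hedged sketch of a "dynamic epsilon" schedule in the spirit the abstract describes: epsilon is adjusted in small PHC-style increments from the decision network's recent success rate, rather than annealed on a fixed clock. The thresholds, step sizes, and window length here are illustrative assumptions.

```python
from collections import deque

class DynamicEpsilon:
    """Adapt exploration rate from the recent decision success rate."""

    def __init__(self, eps=0.5, lo=0.01, hi=0.5, step=0.01, window=100):
        self.eps, self.lo, self.hi, self.step = eps, lo, hi, step
        self.results = deque(maxlen=window)   # 1 = decision succeeded, 0 = jammed

    def update(self, success: bool) -> float:
        self.results.append(1 if success else 0)
        rate = sum(self.results) / len(self.results)
        # Explore more while decisions keep failing; exploit once they work.
        self.eps += self.step if rate < 0.9 else -self.step
        self.eps = min(self.hi, max(self.lo, self.eps))
        return self.eps
```

Tying epsilon to observed outcomes lets exploration shrink automatically as the decision network matures, which is consistent with the faster convergence the abstract reports.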


2021 ◽  
Vol 54 (3-4) ◽  
pp. 417-428
Author(s):  
Yanyan Dai ◽  
KiDong Lee ◽  
SukGyu Lee

Rotary inverted pendulum systems are a standard benchmark model in nonlinear control. Without a deep understanding of control theory, it is difficult to stabilize a rotary inverted pendulum platform using classic control-engineering models. This paper therefore controls the platform by training and testing a reinforcement learning algorithm instead of applying classic control theory. Although reinforcement learning (RL) has achieved many recent successes, there is little research on quickly testing high-frequency RL algorithms on real hardware. In this paper, we propose a real-time hardware-in-the-loop (HIL) control system for training and testing a deep reinforcement learning algorithm, from simulation through to real hardware implementation. The agent is implemented with the Double Deep Q-Network (DDQN) with prioritized experience replay, without requiring a deep understanding of classical control engineering. For the real experiment, to swing up the rotary inverted pendulum and keep the pendulum moving smoothly, we define 21 actions for swing-up and balancing. Compared with the Deep Q-Network (DQN), the DDQN with prioritized experience replay removes the overestimation of Q-values and decreases the training time. Finally, we present experimental results comparing classic control theory with different reinforcement learning algorithms.
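For reference, a minimal sketch of the Double DQN target computation this abstract relies on: the online network selects the next action, and the slow-moving target network evaluates it, which removes the max-operator overestimation of plain DQN. Network shapes are illustrative (e.g. a 21-way discrete action head, as in the paper's action set).

```python
import torch

def ddqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    """reward, done: float tensors of shape (B,); next_state: (B, obs_dim)."""
    with torch.no_grad():
        # Action selection by the online network...
        next_a = online_net(next_state).argmax(dim=1, keepdim=True)
        # ...but value estimation by the target network (decoupled evaluation).
        next_q = target_net(next_state).gather(1, next_a).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```

Plain DQN would take `target_net(next_state).max(dim=1)` instead, coupling selection and evaluation and inflating the Q estimates.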


Electronics ◽  
2018 ◽  
Vol 7 (11) ◽  
pp. 279 ◽  
Author(s):  
Xianbing Zhang ◽  
Guoqing Liu ◽  
Chaojie Yang ◽  
Jiang Wu

With the development of information technology, the degree of intelligence in air combat is increasing, and the demand for automated intelligent decision-making systems is growing. Based on the characteristics of over-the-horizon air combat, this paper constructs an over-the-horizon air combat training environment, comprising aircraft modeling, air combat scene design, enemy aircraft strategy design, and reward and punishment signal design. To improve the efficiency with which the reinforcement learning algorithm explores the strategy space, this paper proposes a heuristic Q-Network method that integrates expert experience, using the expert experience as a heuristic signal to guide the search process; heuristic exploration and random exploration are combined. The heuristic Q-Network method is then applied to the over-the-horizon air combat maneuver decision problem, training the neural network model in the over-the-horizon air combat training environment. Through continuous interaction with the environment, the air combat maneuver strategy is learned autonomously. Simulation experiments verify the efficiency of the heuristic Q-Network method and the effectiveness of the resulting air combat maneuver strategy.
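A sketch of the combined exploration the abstract describes: with some probability follow an expert-experience heuristic, with some probability act randomly, and otherwise act greedily on the Q-network. The `expert_policy` hook and the mixing probabilities are illustrative assumptions.

```python
import random
import torch

def explore_action(q_net, state, n_actions, expert_policy,
                   epsilon=0.1, eta=0.2):
    """state: (1, obs_dim) tensor; expert_policy(state) -> action index."""
    r = random.random()
    if r < epsilon:
        return random.randrange(n_actions)          # random exploration
    if r < epsilon + eta:
        return expert_policy(state)                 # heuristic signal from expert experience
    with torch.no_grad():
        return int(q_net(state).argmax(dim=1).item())  # greedy on learned values
```

The expert branch steers early training toward plausible maneuvers, while the random branch preserves coverage of the strategy space.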


Author(s):  
Zhenhai Gao ◽  
Xiangtong Yan ◽  
Fei Gao ◽  
Lei He

Decision-making is one of the key parts of research on longitudinal autonomous driving, and accounting for the behavior of human drivers when designing autonomous driving decision-making strategies is a current research hotspot. In longitudinal autonomous driving, traditional rule-based decision-making strategies are difficult to apply to complex scenarios. Current methods based on reinforcement learning and deep reinforcement learning construct reward functions around safety, comfort, and economy, yet the resulting decision strategies still differ considerably from those of human drivers. Addressing these problems, this paper uses driver behavior data to design the reward function of a deep reinforcement learning algorithm, fitted with a BP neural network, and establishes two driver-like longitudinal autonomous driving decision-making models using the DQN and DDPG deep reinforcement learning algorithms. Simulation experiments compare the decisions of the two models with the driver's curve. The results show that both algorithms achieve driver-like decision-making, that the DDPG algorithm is more consistent with human driver behavior than the DQN algorithm, and that the DDPG algorithm performs better overall.
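A hedged sketch of the reward-shaping idea: a small feedforward (BP) network is fit to human-driver longitudinal data and its output is then used as the RL reward. The input features (gap, relative speed, ego acceleration), network size, and training targets are assumptions for illustration; the abstract does not specify them.

```python
import torch
import torch.nn as nn

# Small BP (feedforward) network mapping driver-state features to a reward score.
reward_net = nn.Sequential(
    nn.Linear(3, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

def fit_reward_net(features, driver_scores, epochs=200, lr=1e-3):
    """features: (N, 3) tensor; driver_scores: (N, 1) targets from driver data."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(reward_net(features), driver_scores)
        opt.zero_grad()
        loss.backward()
        opt.step()

def shaped_reward(state_features):
    # Used in place of a hand-designed safety/comfort/economy reward.
    with torch.no_grad():
        return float(reward_net(state_features))
```

Fitting the reward to observed driver behavior is what pulls the learned DQN/DDPG policies toward human-like longitudinal decisions.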


Author(s):  
Enmin Zhao ◽  
Shihong Deng ◽  
Yifan Zang ◽  
Yongxin Kang ◽  
Kai Li ◽  
...  

Experience replay plays a crucial role in Reinforcement Learning (RL), enabling the agent to remember and reuse past experience. Most previous methods sample experience transitions using simple heuristics, such as uniform sampling or prioritizing the good transitions. Since humans learn from both good and bad experiences, more sophisticated experience replay algorithms need to be developed. Inspired by potential energy in physics, this work introduces the artificial potential field into experience replay and develops Potentialized Experience Replay (PotER), a new and effective sampling algorithm for RL in hard-exploration tasks with sparse rewards. PotER defines a potential energy function for each state in the replay buffer and helps the agent learn from both good and bad experiences using intrinsic state supervision. PotER can be combined with different RL algorithms as well as with the self-imitation learning algorithm. Experimental analyses and comparisons on multiple challenging hard-exploration environments verify its effectiveness and efficiency.
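An illustrative sketch of the potential-field intuition behind PotER: each stored state is assigned a potential energy (here, a standard artificial-potential-field form with goal attraction plus hazard repulsion), and sampling deliberately mixes low-potential "good" and high-potential "bad" experiences. The energy function, transition layout, and mixing rule are our assumptions; the paper's exact definitions are not given in the abstract.

```python
import math
import random

def potential(pos, goal, hazard, k_att=1.0, k_rep=1.0):
    # Classic artificial potential field: quadratic attraction to the goal,
    # inverse-distance repulsion from a hazard region.
    attract = 0.5 * k_att * math.dist(pos, goal) ** 2
    repel = k_rep / (math.dist(pos, hazard) + 1e-6)
    return attract + repel

def sample_mixed(buffer, goal, hazard, batch_size):
    # buffer: list of transitions carrying a "pos" entry (assumed layout).
    ranked = sorted(buffer, key=lambda t: potential(t["pos"], goal, hazard))
    half = batch_size // 2
    picks = ranked[:half] + ranked[-half:]   # both good and bad experiences
    random.shuffle(picks)
    return picks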


Author(s):  
Zhen Yu ◽  
Yimin Feng ◽  
Lijun Liu

In reinforcement learning tasks, formulating the reward function is a very important step, yet in a large number of systems the reward function is not easy to formulate. The training effect of the network is sensitive to the reward function, and different reward functions yield different results. For a class of systems that satisfy specific conditions, this work improves the traditional reinforcement learning method: a state quantity function is designed to replace the reward function, which is more efficient than a traditional reward function. At the same time, a predictive network link is designed so that the network can learn the value of general states from special states. The overall network structure is built on the Deep Deterministic Policy Gradient (DDPG) algorithm. Finally, the algorithm is applied successfully in the FrozenLake environment and achieves good performance. The experiments prove the effectiveness of the algorithm and realize reward-free reinforcement learning for a class of systems.
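A minimal sketch of the substitution the abstract describes: instead of an environment reward, a hand-designed state quantity function scores how "good" a state is, and that score drives the DDPG-style TD target. The quadratic form below (closeness of the state to a desired set-point) and the critic/actor signatures are assumptions for illustration.

```python
import torch

def state_quantity(state: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Higher (less negative) when the state is near the desired operating point.
    return -torch.sum((state - target) ** 2, dim=-1)

def critic_target(next_state, target_point, done,
                  target_critic, target_actor, gamma=0.99):
    """done: float tensor (B,); target_critic(s, a) -> (B, 1) Q estimate."""
    with torch.no_grad():
        r = state_quantity(next_state, target_point)   # replaces the env reward
        q_next = target_critic(next_state, target_actor(next_state)).squeeze(-1)
        return r + gamma * (1.0 - done) * q_next
```

Because the score is computed from the state alone, no external reward signal is needed, which is the "rewardless" property the abstract claims for this class of systems.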

