A UAV Maneuver Decision-Making Algorithm for Autonomous Airdrop Based on Deep Reinforcement Learning

Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 2233 ◽  
Author(s):  
Ke Li ◽  
Kun Zhang ◽  
Zhenchong Zhang ◽  
Zekun Liu ◽  
Shuai Hua ◽  
...  

How to operate an unmanned aerial vehicle (UAV) safely and efficiently in an interactive environment is a challenging problem. A large amount of research has been devoted to improving the intelligence of a UAV while it performs a mission, and finding an optimal maneuver decision-making policy has become one of the key issues in enabling UAV autonomy. In this paper, we propose a maneuver decision-making algorithm based on deep reinforcement learning, which generates efficient maneuvers for a UAV agent to execute an airdrop mission autonomously in an interactive environment. In particular, the training set of the learning algorithm is constructed with Prioritized Experience Replay, which accelerates the convergence of the decision network during training. Extensive experimental results show that the algorithm finds a desirable and effective maneuver decision-making policy.
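As a minimal sketch of the replay mechanism the abstract names, the following proportional prioritized buffer assigns each transition a priority from its |TD-error|, as in the standard Prioritized Experience Replay formulation; the paper's exact prioritization scheme is not given in the abstract, and the hyperparameter values here are illustrative.

```python
import random

class PrioritizedReplayBuffer:
    """Proportional prioritized replay: P(i) proportional to (|TD-error_i| + eps)^alpha."""

    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity = capacity
        self.alpha = alpha          # how strongly priorities skew sampling
        self.eps = eps              # keeps every transition sampleable
        self.data, self.priorities = [], []
        self.pos = 0

    def add(self, transition, td_error):
        p = (abs(td_error) + self.eps) ** self.alpha
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:                        # overwrite oldest entry (ring buffer)
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sample transition indices in proportion to their priority.
        total = sum(self.priorities)
        probs = [p / total for p in self.priorities]
        idxs = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return idxs, [self.data[i] for i in idxs]
```

High-TD-error transitions are replayed more often, which is what speeds up decision-network convergence relative to uniform sampling.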

Entropy ◽  
2021 ◽  
Vol 23 (6) ◽  
pp. 737
Author(s):  
Fengjie Sun ◽  
Xianchang Wang ◽  
Rui Zhang

An Unmanned Aerial Vehicle (UAV) can greatly reduce the manpower required for agricultural plant protection tasks such as watering, sowing, and pesticide spraying. It is essential to develop a Decision-making Support System (DSS) that helps a UAV choose the correct action in each state according to a policy. In an unknown environment, hand-crafting rules to guide the UAV's action selection is not applicable, and obtaining the optimal policy through reinforcement learning is a feasible alternative. However, experiments show that existing reinforcement learning algorithms cannot obtain the optimal policy for a UAV in the agricultural plant protection environment. In this work we propose an improved Q-learning algorithm based on similar state matching, and we prove theoretically that, in this environment, a UAV following the policy learned by our algorithm chooses the optimal action with higher probability than one following the policy learned by the classic Q-learning algorithm. The proposed algorithm is implemented and tested on evenly distributed datasets built from real UAV parameters and real farm information, and its performance is discussed in detail. Experimental results show that the proposed algorithm efficiently learns the optimal policy for UAVs in the agricultural plant protection environment.
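A hedged sketch of the similar-state-matching idea: when the agent reaches a state it has never updated, it falls back on the Q-values of the most similar previously visited state. The similarity metric (Euclidean distance over state feature tuples) and all names here are illustrative assumptions; the abstract does not specify the paper's matching rule.

```python
import math
import random
from collections import defaultdict

def nearest_visited(state, visited):
    # Euclidean distance over state feature tuples (assumed representation).
    return min(visited, key=lambda s: math.dist(s, state))

def choose_action(Q, state, actions, visited, epsilon=0.1):
    if random.random() < epsilon or not visited:
        return random.choice(actions)
    # Unseen states borrow values from the closest visited state.
    key = state if state in visited else nearest_visited(state, visited)
    return max(actions, key=lambda a: Q[(key, a)])

def q_update(Q, visited, s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    visited.add(s)

Q = defaultdict(float)   # tabular action values, default 0.0
visited = set()          # states with at least one update
```

The fallback lets learned values generalize across nearby field positions instead of acting randomly in every unseen state.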


Author(s):  
Xingxing Liang ◽  
Li Chen ◽  
Yanghe Feng ◽  
Zhong Liu ◽  
Yang Ma ◽  
...  

Reinforcement learning, as an effective method for solving complex sequential decision-making problems, plays an important role in areas such as intelligent decision-making and behavioral cognition. It is well known that the experience replay mechanism contributes to the development of deep reinforcement learning by reusing past samples and thereby improving sample efficiency. However, the existing prioritized experience replay mechanism distorts the sample distribution in the replay buffer, because specific transitions are assigned higher sampling frequencies, and it cannot be applied to actor-critic and other on-policy reinforcement learning algorithms. To address this, we propose an adaptive factor based on TD-error, which further increases sample utilization by giving larger attention weights to samples with larger TD-errors, and which can be embedded flexibly into the original Deep Q-Network (DQN) and Advantage Actor-Critic (A2C) algorithms to improve their performance. We then evaluate the proposed architecture on CartPole-V1 and on six Atari game environments. Under both fixed-temperature and annealed-temperature conditions, the improved algorithms outperform the vanilla DQN and the original A2C in cumulative reward and in the rate at which rewards climb.
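One way to realize such an adaptive factor, sketched below under our own assumptions: weight each sample's loss within the batch by a temperature-controlled softmax over |TD-error|, which reweights rather than resamples and therefore also fits on-policy batches. The abstract mentions fixed and annealed temperatures; the exact weighting formula is ours to illustrate.

```python
import torch

def td_attention_weights(td_errors: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    # Larger |TD-error| -> larger weight; tau -> infinity recovers uniform weights.
    w = torch.softmax(td_errors.abs() / tau, dim=0)
    return w * len(td_errors)   # rescale so weights average to 1.0

def weighted_td_loss(q_pred: torch.Tensor, q_target: torch.Tensor, tau: float = 1.0):
    td = (q_target - q_pred).detach()        # weights carry no gradient
    w = td_attention_weights(td, tau)
    return (w * (q_target - q_pred) ** 2).mean()
```

Because every sample in the batch still contributes (only its weight changes), the batch distribution itself is not altered the way priority resampling alters it.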


2019 ◽  
Vol 103 (1) ◽  
pp. 003685041987902 ◽  
Author(s):  
Ronglei Xie ◽  
Zhijun Meng ◽  
Yaoming Zhou ◽  
Yunpeng Ma ◽  
Zhe Wu

Existing reinforcement learning algorithms have difficulty converging on the three-dimensional path-planning problem of an unmanned aerial vehicle because of its excessive state space. To solve this problem, this article proposes a reinforcement learning algorithm that combines a heuristic function with an experience replay mechanism based on the maximum average reward. Knowledge of track performance is introduced to construct the heuristic function, which guides the unmanned aerial vehicle's action selection and reduces useless exploration. The experience replay mechanism based on the maximum average reward increases the utilization of excellent samples and the convergence speed of the algorithm. Simulation results show that the proposed three-dimensional path-planning algorithm has good learning efficiency, and that its convergence speed and training performance are significantly improved.
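An illustrative sketch of the two ideas the abstract names: a heuristic term added to Q-values that biases action selection toward actions reducing distance to the goal waypoint, and replay sampling that prefers episodes with the highest average reward. The distance-based heuristic, the `step_model` hook, and the transition layout are assumptions, not the paper's definitions.

```python
import math
import random

def heuristic_bonus(pos, next_pos, goal, weight=0.5):
    # Favor actions whose predicted next position is closer to the goal.
    return weight * (math.dist(pos, goal) - math.dist(next_pos, goal))

def select_action(Q, state, pos, actions, goal, step_model):
    # step_model(pos, action) -> predicted next position, standing in for the
    # track-performance knowledge the abstract mentions (assumed available).
    return max(actions, key=lambda a: Q[(state, a)]
               + heuristic_bonus(pos, step_model(pos, a), goal))

def sample_best_episodes(episodes, batch_size, top_k=10):
    # Keep the top_k episodes by mean reward, then sample transitions uniformly.
    best = sorted(episodes,
                  key=lambda ep: sum(t["reward"] for t in ep) / len(ep),
                  reverse=True)[:top_k]
    pool = [t for ep in best for t in ep]
    return random.sample(pool, min(batch_size, len(pool)))
```

The heuristic prunes exploration of headings that move away from the goal, while the replay rule concentrates updates on the most successful trajectories.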


Author(s):  
Bailin Song ◽  
Hua Xu ◽  
Lei Jiang ◽  
Ning Rao

To solve the problem of intelligent anti-jamming decision-making in battlefield communication, this paper designs an intelligent communication anti-jamming decision-making method based on deep reinforcement learning. By introducing experience replay and a PHC-based dynamic epsilon mechanism into the DQN framework, we propose a dynamic epsilon-DQN decision-making method. The algorithm selects the value of epsilon according to the state of the decision network, improving the convergence speed and the decision success rate. During decision-making, the jamming signals on all communication frequencies are detected and fed into the decision algorithm as jamming-discrimination information, so that jamming can be avoided effectively even without prior information about the jammer. Experimental results show that the proposed method adapts to various communication models and makes decisions quickly, and that its average success rate after convergence exceeds 95%, a substantial advantage over existing decision-making methods.
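A hedged sketch of a "dynamic epsilon" schedule in the spirit the abstract describes: epsilon is adjusted in small PHC-style increments from the decision network's recent success rate, rather than annealed on a fixed clock. The thresholds, step sizes, and window length here are illustrative assumptions.

```python
from collections import deque

class DynamicEpsilon:
    """Adapt exploration rate from the recent decision success rate."""

    def __init__(self, eps=0.5, lo=0.01, hi=0.5, step=0.01, window=100):
        self.eps, self.lo, self.hi, self.step = eps, lo, hi, step
        self.results = deque(maxlen=window)   # 1 = decision succeeded, 0 = jammed

    def update(self, success: bool) -> float:
        self.results.append(1 if success else 0)
        rate = sum(self.results) / len(self.results)
        # Explore more while decisions keep failing; exploit once they work.
        self.eps += self.step if rate < 0.9 else -self.step
        self.eps = min(self.hi, max(self.lo, self.eps))
        return self.eps
```

Tying epsilon to observed outcomes lets exploration shrink automatically as the decision network matures, which is consistent with the faster convergence the abstract reports.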


2021 ◽  
Vol 54 (3-4) ◽  
pp. 417-428
Author(s):  
Yanyan Dai ◽  
KiDong Lee ◽  
SukGyu Lee

Rotary inverted pendulum systems are a standard benchmark model in nonlinear control. Without a deep understanding of control theory, it is difficult to stabilize a rotary inverted pendulum platform using classic control-engineering models. This paper therefore controls the platform by training and testing a reinforcement learning algorithm instead of applying classic control theory. Although reinforcement learning (RL) has achieved many recent successes, there is little research on quickly testing high-frequency RL algorithms on real hardware. In this paper, we propose a real-time hardware-in-the-loop (HIL) control system for training and testing a deep reinforcement learning algorithm, from simulation through to real hardware implementation. The agent is implemented with the Double Deep Q-Network (DDQN) with prioritized experience replay, without requiring a deep understanding of classical control engineering. For the real experiment, to swing up the rotary inverted pendulum and keep the pendulum moving smoothly, we define 21 actions for swing-up and balancing. Compared with the Deep Q-Network (DQN), the DDQN with prioritized experience replay removes the overestimation of Q-values and decreases the training time. Finally, we present experimental results comparing classic control theory with different reinforcement learning algorithms.
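For reference, a minimal sketch of the Double DQN target computation this abstract relies on: the online network selects the next action, and the slow-moving target network evaluates it, which removes the max-operator overestimation of plain DQN. Network shapes are illustrative (e.g. a 21-way discrete action head, as in the paper's action set).

```python
import torch

def ddqn_target(reward, next_state, done, online_net, target_net, gamma=0.99):
    """reward, done: float tensors of shape (B,); next_state: (B, obs_dim)."""
    with torch.no_grad():
        # Action selection by the online network...
        next_a = online_net(next_state).argmax(dim=1, keepdim=True)
        # ...but value estimation by the target network (decoupled evaluation).
        next_q = target_net(next_state).gather(1, next_a).squeeze(1)
        return reward + gamma * (1.0 - done) * next_q
```

Plain DQN would take `target_net(next_state).max(dim=1)` instead, coupling selection and evaluation and inflating the Q estimates.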


Electronics ◽  
2018 ◽  
Vol 7 (11) ◽  
pp. 279 ◽  
Author(s):  
Xianbing Zhang ◽  
Guoqing Liu ◽  
Chaojie Yang ◽  
Jiang Wu

With the development of information technology, the degree of intelligence in air combat is increasing, and the demand for automated intelligent decision-making systems is growing. Based on the characteristics of over-the-horizon air combat, this paper constructs an over-the-horizon air combat training environment, comprising aircraft modeling, air combat scene design, enemy aircraft strategy design, and reward and punishment signal design. To improve the efficiency with which the reinforcement learning algorithm explores the strategy space, this paper proposes a heuristic Q-Network method that integrates expert experience, using the expert experience as a heuristic signal to guide the search process; heuristic exploration and random exploration are combined. The heuristic Q-Network method is then applied to the over-the-horizon air combat maneuver decision problem, training the neural network model in the over-the-horizon air combat training environment. Through continuous interaction with the environment, the air combat maneuver strategy is learned autonomously. Simulation experiments verify the efficiency of the heuristic Q-Network method and the effectiveness of the resulting air combat maneuver strategy.
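A sketch of the combined exploration the abstract describes: with some probability follow an expert-experience heuristic, with some probability act randomly, and otherwise act greedily on the Q-network. The `expert_policy` hook and the mixing probabilities are illustrative assumptions.

```python
import random
import torch

def explore_action(q_net, state, n_actions, expert_policy,
                   epsilon=0.1, eta=0.2):
    """state: (1, obs_dim) tensor; expert_policy(state) -> action index."""
    r = random.random()
    if r < epsilon:
        return random.randrange(n_actions)          # random exploration
    if r < epsilon + eta:
        return expert_policy(state)                 # heuristic signal from expert experience
    with torch.no_grad():
        return int(q_net(state).argmax(dim=1).item())  # greedy on learned values
```

The expert branch steers early training toward plausible maneuvers, while the random branch preserves coverage of the strategy space.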


Author(s):  
Zhenhai Gao ◽  
Xiangtong Yan ◽  
Fei Gao ◽  
Lei He

Decision-making is one of the key parts of research on longitudinal autonomous driving, and accounting for the behavior of human drivers when designing autonomous driving decision-making strategies is a current research hotspot. In longitudinal autonomous driving, traditional rule-based decision-making strategies are difficult to apply to complex scenarios. Current methods based on reinforcement learning and deep reinforcement learning construct reward functions around safety, comfort, and economy, yet the resulting decision strategies still differ considerably from those of human drivers. Addressing these problems, this paper uses driver behavior data to design the reward function of a deep reinforcement learning algorithm, fitted with a BP neural network, and establishes two driver-like longitudinal autonomous driving decision-making models using the DQN and DDPG deep reinforcement learning algorithms. Simulation experiments compare the decisions of the two models with the driver's curve. The results show that both algorithms achieve driver-like decision-making, that the DDPG algorithm is more consistent with human driver behavior than the DQN algorithm, and that the DDPG algorithm performs better overall.
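A hedged sketch of the reward-shaping idea: a small feedforward (BP) network is fit to human-driver longitudinal data and its output is then used as the RL reward. The input features (gap, relative speed, ego acceleration), network size, and training targets are assumptions for illustration; the abstract does not specify them.

```python
import torch
import torch.nn as nn

# Small BP (feedforward) network mapping driver-state features to a reward score.
reward_net = nn.Sequential(
    nn.Linear(3, 32), nn.Tanh(),
    nn.Linear(32, 1),
)

def fit_reward_net(features, driver_scores, epochs=200, lr=1e-3):
    """features: (N, 3) tensor; driver_scores: (N, 1) targets from driver data."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(reward_net(features), driver_scores)
        opt.zero_grad()
        loss.backward()
        opt.step()

def shaped_reward(state_features):
    # Used in place of a hand-designed safety/comfort/economy reward.
    with torch.no_grad():
        return float(reward_net(state_features))
```

Fitting the reward to observed driver behavior is what pulls the learned DQN/DDPG policies toward human-like longitudinal decisions.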


Author(s):  
Enmin Zhao ◽  
Shihong Deng ◽  
Yifan Zang ◽  
Yongxin Kang ◽  
Kai Li ◽  
...  

Experience replay plays a crucial role in Reinforcement Learning (RL), enabling the agent to remember and reuse past experience. Most previous methods sample experience transitions using simple heuristics, such as uniform sampling or prioritizing the good transitions. Since humans learn from both good and bad experiences, more sophisticated experience replay algorithms need to be developed. Inspired by potential energy in physics, this work introduces the artificial potential field into experience replay and develops Potentialized Experience Replay (PotER), a new and effective sampling algorithm for RL in hard-exploration tasks with sparse rewards. PotER defines a potential energy function for each state in the replay buffer and helps the agent learn from both good and bad experiences using intrinsic state supervision. PotER can be combined with different RL algorithms as well as with the self-imitation learning algorithm. Experimental analyses and comparisons on multiple challenging hard-exploration environments verify its effectiveness and efficiency.
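An illustrative sketch of the potential-field intuition behind PotER: each stored state is assigned a potential energy (here, a standard artificial-potential-field form with goal attraction plus hazard repulsion), and sampling deliberately mixes low-potential "good" and high-potential "bad" experiences. The energy function, transition layout, and mixing rule are our assumptions; the paper's exact definitions are not given in the abstract.

```python
import math
import random

def potential(pos, goal, hazard, k_att=1.0, k_rep=1.0):
    # Classic artificial potential field: quadratic attraction to the goal,
    # inverse-distance repulsion from a hazard region.
    attract = 0.5 * k_att * math.dist(pos, goal) ** 2
    repel = k_rep / (math.dist(pos, hazard) + 1e-6)
    return attract + repel

def sample_mixed(buffer, goal, hazard, batch_size):
    # buffer: list of transitions carrying a "pos" entry (assumed layout).
    ranked = sorted(buffer, key=lambda t: potential(t["pos"], goal, hazard))
    half = batch_size // 2
    picks = ranked[:half] + ranked[-half:]   # both good and bad experiences
    random.shuffle(picks)
    return picks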


Author(s):  
Zhen Yu ◽  
Yimin Feng ◽  
Lijun Liu

In reinforcement learning tasks, formulating the reward function is a very important step, yet in a large number of systems the reward function is not easy to formulate. The training effect of the network is sensitive to the reward function, and different reward functions yield different results. For a class of systems that satisfy specific conditions, this work improves the traditional reinforcement learning method: a state quantity function is designed to replace the reward function, which is more efficient than a traditional reward function. At the same time, a predictive network link is designed so that the network can learn the value of general states from special states. The overall network structure is built on the Deep Deterministic Policy Gradient (DDPG) algorithm. Finally, the algorithm is applied successfully in the FrozenLake environment and achieves good performance. The experiments prove the effectiveness of the algorithm and realize reward-free reinforcement learning for a class of systems.
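A minimal sketch of the substitution the abstract describes: instead of an environment reward, a hand-designed state quantity function scores how "good" a state is, and that score drives the DDPG-style TD target. The quadratic form below (closeness of the state to a desired set-point) and the critic/actor signatures are assumptions for illustration.

```python
import torch

def state_quantity(state: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Higher (less negative) when the state is near the desired operating point.
    return -torch.sum((state - target) ** 2, dim=-1)

def critic_target(next_state, target_point, done,
                  target_critic, target_actor, gamma=0.99):
    """done: float tensor (B,); target_critic(s, a) -> (B, 1) Q estimate."""
    with torch.no_grad():
        r = state_quantity(next_state, target_point)   # replaces the env reward
        q_next = target_critic(next_state, target_actor(next_state)).squeeze(-1)
        return r + gamma * (1.0 - done) * q_next
```

Because the score is computed from the state alone, no external reward signal is needed, which is the "rewardless" property the abstract claims for this class of systems.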

