Train Scheduling with Deep Q-Network: A Feasibility Test

2020 ◽  
Vol 10 (23) ◽  
pp. 8367
Author(s):  
Intaek Gong ◽  
Sukmun Oh ◽  
Yunhong Min

We consider a train scheduling problem in which both local and express trains are to be scheduled. In this type of train scheduling problem, the key decision is determining the overtaking stations at which express trains overtake their preceding local trains. This problem has been successfully modeled via mixed integer programming (MIP) models. One obvious limitation of MIP-based approaches is the lack of freedom in the choice of objective and constraint functions. In this paper, as an alternative, we propose an approach based on reinforcement learning. We first decompose the problem into subproblems in which a single express train and its preceding local trains are considered. We then formulate the subproblem as a Markov decision process (MDP). Instead of solving each instance of the MDP separately, we train a deep neural network, called a deep Q-network (DQN), which approximates the Q-value function of any instance of the MDP. The learned DQN can then be used to make decisions by choosing the action with the maximum Q-value. The advantage of the proposed method is its ability to incorporate arbitrarily complex objective and/or constraint functions. We demonstrate the performance of the proposed method through numerical experiments.
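
A minimal sketch of the greedy decision step this abstract describes, using a small PyTorch Q-network. The state encoding (features of the express train and its preceding local trains) and the action set (choose a candidate overtaking station, or none) are hypothetical placeholders, not the authors' exact formulation or code.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Small feed-forward Q-network: state features -> one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Hypothetical instance: the state summarizes the express train and its
# preceding local trains (positions, headways); each action corresponds to
# "overtake at candidate station k" (or "do not overtake yet").
state_dim, n_actions = 8, 5
q_net = DQN(state_dim, n_actions)

state = torch.randn(1, state_dim)          # placeholder state features
with torch.no_grad():
    q_values = q_net(state)                # Q(s, a) for every action
    action = int(q_values.argmax(dim=1))   # greedy decision: max Q-value
print("chosen overtaking action:", action)
```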

2014 ◽  
Vol 46 (01) ◽  
pp. 121-138 ◽  
Author(s):  
Ulrich Rieder ◽  
Marc Wittlinger

We consider an investment problem where observing and trading are only possible at random times. In addition, we introduce drawdown constraints which require that the investor's wealth does not fall below a fixed percentage of its running maximum. The financial market consists of a riskless bond and a stock driven by a Lévy process. Moreover, a general utility function is assumed. In this setting we solve the investment problem using a related limsup Markov decision process. We show that the value function can be characterized as the unique fixed point of the Bellman equation and verify the existence of an optimal stationary policy. Under some mild assumptions the value function can be approximated by the value function of a contracting Markov decision process. We are able to use Howard's policy improvement algorithm for computing the value function as well as an optimal policy. These results are illustrated in a numerical example.
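
For intuition, a generic sketch of Howard's policy improvement (policy evaluation followed by greedy improvement) on a small finite, discounted MDP. The transition and reward arrays below are synthetic; they do not model the drawdown-constrained Lévy market or the contracting MDP of the paper.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95):
    """Howard's policy improvement on a finite discounted MDP.

    P: (S, A, S) transition probabilities, r: (S, A) rewards.
    Returns the optimal value function and a deterministic policy.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(S), policy]          # (S, S)
        r_pi = r[np.arange(S), policy]          # (S,)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated value.
        q = r + gamma * P @ v                   # (S, A) action values
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy

# Tiny synthetic instance (2 states, 2 actions).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))     # valid transition kernel
r = rng.random((2, 2))
v_opt, pi_opt = policy_iteration(P, r)
```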


2011 ◽  
Vol 5 (1) ◽  
pp. 49-56
Author(s):  
Waldemar Kaczmarczyk

We consider mixed-integer linear programming (MIP) models of production planning problems known as small bucket lot-sizing and scheduling problems. We present an application of a class of valid inequalities to the case with lost demand (stock-out) costs. Results of numerical experiments for the Proportional Lot-sizing and Scheduling Problem (PLSP) confirm the benefits of such an extended model formulation.
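
As a hedged illustration of the lost-demand idea only (not the paper's PLSP model, its setup-carryover structure, or its valid inequalities), here is a toy single-item lot-sizing MIP with a stock-out cost, written with the PuLP modeling library; all data are made up.

```python
import pulp

# Toy data (hypothetical): 4 periods, one product.
T = range(4)
demand   = [3, 5, 2, 6]
capacity = 6
setup_cost, holding_cost, lost_cost = 10.0, 1.0, 20.0

model = pulp.LpProblem("lot_sizing_with_lost_demand", pulp.LpMinimize)

x = pulp.LpVariable.dicts("produce", T, lowBound=0)   # production quantity
s = pulp.LpVariable.dicts("stock", T, lowBound=0)     # end-of-period inventory
u = pulp.LpVariable.dicts("lost", T, lowBound=0)      # lost (unserved) demand
y = pulp.LpVariable.dicts("setup", T, cat="Binary")   # setup indicator

model += pulp.lpSum(setup_cost * y[t] + holding_cost * s[t] + lost_cost * u[t]
                    for t in T)

for t in T:
    prev = s[t - 1] if t > 0 else 0
    model += prev + x[t] + u[t] == demand[t] + s[t]   # balance with lost demand
    model += x[t] <= capacity * y[t]                  # produce only if set up

model.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: (x[t].value(), u[t].value()) for t in T})
```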


2020 ◽  
Author(s):  
Dileep Kalathil ◽  
Vivek S. Borkar ◽  
Rahul Jain

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a nonasymptotic sample complexity bound and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared with stochastic approximation-based algorithms.
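
A rough sketch of the empirical Q-value iteration idea under a generative-model assumption: each update replaces the unknown expectation over next states in the Bellman operator with an empirical average over simulated samples. The MDP below is synthetic and the code is only an illustration of the scheme, not the authors' implementation.

```python
import numpy as np

def empirical_q_iteration(sample_next_state, cost, S, A, gamma=0.9,
                          n_samples=50, n_iters=200, seed=0):
    """Empirical Q-value iteration with a simulator (generative model).

    sample_next_state(s, a, rng) returns one sampled next state;
    cost[s, a] is the one-step cost.  The expectation in the Bellman
    operator is replaced by an empirical average over n_samples draws.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        Q_new = np.empty_like(Q)
        for s in range(S):
            for a in range(A):
                nxt = [sample_next_state(s, a, rng) for _ in range(n_samples)]
                Q_new[s, a] = cost[s, a] + gamma * np.mean([Q[s2].min() for s2 in nxt])
        Q = Q_new
    return Q

# Tiny synthetic discounted-cost MDP with a sampling oracle.
S, A = 3, 2
rng0 = np.random.default_rng(1)
P = rng0.dirichlet(np.ones(S), size=(S, A))
cost = rng0.random((S, A))
Q_hat = empirical_q_iteration(
    lambda s, a, rng: rng.choice(S, p=P[s, a]), cost, S, A)
```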


2020 ◽  
Vol 40 (1) ◽  
pp. 117-137
Author(s):  
R. Israel Ortega-Gutiérrez ◽  
H. Cruz-Suárez

This paper addresses a class of sequential optimization problems known as Markov decision processes. These processes are considered on Euclidean state and action spaces with the total expected discounted cost as the objective function. The main goal of the paper is to provide conditions that guarantee an adequate Moreau-Yosida regularization for Markov decision processes (named the original process). In this way, a new Markov decision process is established that conforms to the Markov control model of the original process except for the cost function, which is induced via the Moreau-Yosida regularization. Compared to the original process, this new discounted Markov decision process has richer properties, such as differentiability and strict convexity of its optimal value function and uniqueness of the optimal policy; moreover, the optimal value function and the optimal policy of both processes coincide. To complement the theory presented, an example is provided.
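
For reference, the standard Moreau-Yosida envelope that this regularization is based on, written here for a cost function c on a Euclidean space with parameter λ > 0 (the precise way it is applied to the Markov control model is the paper's):

```latex
c_{\lambda}(x) \;=\; \inf_{y \in \mathbb{R}^{n}}
  \left\{ c(y) + \frac{1}{2\lambda}\,\lVert x - y \rVert^{2} \right\},
  \qquad \lambda > 0 .
```

When c is proper, convex, and lower semicontinuous, c_λ is convex and differentiable with a (1/λ)-Lipschitz gradient, which is the smoothing effect behind the richer properties of the regularized process.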


Author(s):  
Bingxin Yao ◽  
Bin Wu ◽  
Siyun Wu ◽  
Yin Ji ◽  
Danggui Chen ◽  
...  

In this paper, an offloading algorithm based on a Markov Decision Process (MDP) is proposed to solve the multi-objective offloading decision problem in a Mobile Edge Computing (MEC) system. The distinguishing feature of the algorithm is that an MDP is used to make the offloading decisions. The number of tasks in the task queue, the number of accessible edge clouds, and the signal-to-noise ratio (SNR) of the wireless channel are taken into account in the state space of the MDP model. The offloading delay and energy consumption are used to define the value function of the MDP model, i.e., the objective function. To maximize the value function, the value iteration algorithm is used to obtain the optimal offloading policy. According to this policy, tasks of mobile terminals (MTs) are offloaded to an edge cloud or the central cloud, or executed locally. Simulation results show that the proposed algorithm effectively reduces the offloading delay and energy consumption.
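
A minimal value-iteration sketch in the spirit of the described model. Here the state (queue length, number of reachable edge clouds, SNR level) is collapsed to a single index, and the per-action delay/energy costs, the trade-off weight, and the transition kernel are synthetic placeholders rather than the paper's parameters.

```python
import numpy as np

# Actions: 0 = execute locally, 1 = offload to edge cloud, 2 = offload to central cloud.
S, A, gamma = 12, 3, 0.9          # S = flattened (queue, #edges, SNR) states (toy size)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # synthetic transition kernel
delay  = rng.random((S, A))                      # placeholder delay cost
energy = rng.random((S, A))                      # placeholder energy cost
w = 0.5                                          # delay/energy trade-off weight
reward = -(w * delay + (1 - w) * energy)         # value function combines both objectives

V = np.zeros(S)
for _ in range(500):                             # value iteration
    Q = reward + gamma * P @ V                   # (S, A) action values
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)                        # 0/1/2 offloading decision per state
```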


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 955
Author(s):  
Xiaoling Mo ◽  
Daoyun Xu ◽  
Zufeng Fu

In a general Markov decision process system, only a single agent's learning evolution is considered. However, restricting attention to one agent is limiting in many problems, and more and more applications involve multiple agents. Multi-agent settings fall into two broad types: cooperative environments and game (competitive) environments. Therefore, this paper introduces a Cooperation Markov Decision Process (CMDP) system with two agents, which is suitable for the learning evolution of cooperative decisions between two agents. It is further shown that the value function in the CMDP system also converges, and that the limit is independent of the choice of the initial value function. This paper presents an algorithm for finding the optimal strategy pair (πk0, πk1) in the CMDP system, whose fundamental task is to find an optimal strategy pair and form the evolutionary system CMDP(πk0, πk1). Finally, an example is given to support the theoretical results.
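
One way to picture the search for a strategy pair on a finite cooperative MDP: both agents share a single value function, the joint action is improved greedily, and the resulting joint policy is split into a pair (π0, π1). This is a generic illustration on synthetic data, not the paper's exact CMDP algorithm.

```python
import numpy as np

def cooperative_value_iteration(P, r, gamma=0.9, n_iters=500):
    """Joint value iteration for two cooperating agents with a shared reward.

    P: (S, A0, A1, S) joint transition kernel, r: (S, A0, A1) shared reward.
    Returns V and the strategy pair (pi0, pi1) as per-state action indices.
    """
    S, A0, A1, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = r + gamma * np.tensordot(P, V, axes=([3], [0]))  # (S, A0, A1)
        V = Q.reshape(S, -1).max(axis=1)
    joint = Q.reshape(S, -1).argmax(axis=1)       # best joint action per state
    pi0, pi1 = np.unravel_index(joint, (A0, A1))  # split into a strategy pair
    return V, pi0, pi1

rng = np.random.default_rng(0)
S, A0, A1 = 4, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A0, A1))
r = rng.random((S, A0, A1))
V, pi0, pi1 = cooperative_value_iteration(P, r)
```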


1993 ◽  
Vol 7 (3) ◽  
pp. 369-385 ◽  
Author(s):  
Kyle Siegrist

We consider N sites (N ≤ ∞), each of which may be either occupied or unoccupied. Time is discrete, and at each time unit a set of occupied sites may attempt to capture a previously unoccupied site. The attempt will be successful with a probability that depends on the number of sites making the attempt, in which case the new site will also be occupied. A benefit is gained when new sites are occupied, but capture attempts are costly. The problem of optimal occupation is formulated as a Markov decision process in which the admissible actions are occupation strategies and the cost is a function of the strategy and the number of occupied sites. A partial order on the state-action pairs is used to obtain a comparison result for stationary policies and qualitative results concerning monotonicity of the value function for the n-stage problem (n ≤ ∞). The optimal policies are partially characterized when the cost depends on the action only through the total number of occupation attempts made.
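
A toy backward-induction sketch of the finite-horizon occupation problem, taking the state to be the number of occupied sites and the action the number of occupied sites making a capture attempt. The success probability, benefit, and per-attempt cost below are illustrative choices, not the paper's model.

```python
import numpy as np

N, horizon = 10, 8                       # N sites, n-stage problem (toy sizes)
benefit, attempt_cost = 5.0, 0.4

def p_success(k):
    """Probability that an attempt by k occupied sites succeeds (illustrative)."""
    return 1.0 - 0.5 ** k if k > 0 else 0.0

# V[t, m] = optimal expected reward-to-go with m occupied sites and t stages left.
V = np.zeros((horizon + 1, N + 1))
policy = np.zeros((horizon + 1, N + 1), dtype=int)
for t in range(1, horizon + 1):
    for m in range(1, N + 1):
        best_val, best_k = V[t - 1, m], 0          # k = 0: make no attempt
        if m < N:
            for k in range(1, m + 1):              # k occupied sites attempt a capture
                p = p_success(k)
                val = (-attempt_cost * k
                       + p * (benefit + V[t - 1, m + 1])
                       + (1 - p) * V[t - 1, m])
                if val > best_val:
                    best_val, best_k = val, k
        V[t, m], policy[t, m] = best_val, best_k

print(policy[horizon, 1:])   # optimal number of attempting sites per occupied count
```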


Author(s):  
Sultan Javed Majeed ◽  
Marcus Hutter

Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques; it enables an agent to learn the optimal action-value function, i.e., the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains which may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge even in the case of infinitely many internal states.
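
For reference, the standard tabular Q-learning update the convergence analysis refers to (in the paper's non-MDP setting the agent conditions on observation-derived states rather than true Markov states; the toy table below is only a usage example).

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One standard Q-learning update: Q(s,a) <- Q(s,a) + alpha * TD-error."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage on a 5-state, 2-action table with a made-up transition.
Q = np.zeros((5, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=3)
```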

