Train Scheduling with Deep Q-Network: A Feasibility Test

2020 ◽  
Vol 10 (23) ◽  
pp. 8367
Author(s):  
Intaek Gong ◽  
Sukmun Oh ◽  
Yunhong Min

We consider a train scheduling problem in which both local and express trains are to be scheduled. In this type of train scheduling problem, the key decision is determining the overtaking stations at which express trains overtake their preceding local trains. This problem has been successfully modeled via mixed integer programming (MIP) models. One obvious limitation of MIP-based approaches is the lack of freedom in the choice of objective and constraint functions. In this paper, as an alternative, we propose an approach based on reinforcement learning. We first decompose the problem into subproblems in which a single express train and its preceding local trains are considered. We then formulate the subproblem as a Markov decision process (MDP). Instead of solving each instance of the MDP separately, we train a deep neural network, called a deep Q-network (DQN), which approximates the Q-value function of any instance of the MDP. The learned DQN can then be used to make decisions by choosing the action with the maximum Q-value. The advantage of the proposed method is its ability to incorporate arbitrarily complex objective and/or constraint functions. We demonstrate the performance of the proposed method through numerical experiments.
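
A minimal sketch of the greedy decision step this abstract describes, using a small PyTorch Q-network. The state encoding (features of the express train and its preceding local trains) and the action set (choose a candidate overtaking station, or none) are hypothetical placeholders, not the authors' exact formulation or code.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Small feed-forward Q-network: state features -> one Q-value per action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Hypothetical instance: the state summarizes the express train and its
# preceding local trains (positions, headways); each action corresponds to
# "overtake at candidate station k" (or "do not overtake yet").
state_dim, n_actions = 8, 5
q_net = DQN(state_dim, n_actions)

state = torch.randn(1, state_dim)          # placeholder state features
with torch.no_grad():
    q_values = q_net(state)                # Q(s, a) for every action
    action = int(q_values.argmax(dim=1))   # greedy decision: max Q-value
print("chosen overtaking action:", action)
```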

2014 ◽  
Vol 46 (01) ◽  
pp. 121-138 ◽  
Author(s):  
Ulrich Rieder ◽  
Marc Wittlinger

We consider an investment problem where observing and trading are only possible at random times. In addition, we introduce drawdown constraints which require that the investor's wealth does not fall below a fixed percentage of its running maximum. The financial market consists of a riskless bond and a stock driven by a Lévy process. Moreover, a general utility function is assumed. In this setting we solve the investment problem using a related limsup Markov decision process. We show that the value function can be characterized as the unique fixed point of the Bellman equation and verify the existence of an optimal stationary policy. Under some mild assumptions the value function can be approximated by the value function of a contracting Markov decision process. We are able to use Howard's policy improvement algorithm for computing the value function as well as an optimal policy. These results are illustrated in a numerical example.
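
For intuition, a generic sketch of Howard's policy improvement (policy evaluation followed by greedy improvement) on a small finite, discounted MDP. The transition and reward arrays below are synthetic; they do not model the drawdown-constrained Lévy market or the contracting MDP of the paper.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95):
    """Howard's policy improvement on a finite discounted MDP.

    P: (S, A, S) transition probabilities, r: (S, A) rewards.
    Returns the optimal value function and a deterministic policy.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) v = r_pi exactly.
        P_pi = P[np.arange(S), policy]          # (S, S)
        r_pi = r[np.arange(S), policy]          # (S,)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to the evaluated value.
        q = r + gamma * P @ v                   # (S, A) action values
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return v, policy
        policy = new_policy

# Tiny synthetic instance (2 states, 2 actions).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))     # valid transition kernel
r = rng.random((2, 2))
v_opt, pi_opt = policy_iteration(P, r)
```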


2011 ◽  
Vol 5 (1) ◽  
pp. 49-56
Author(s):  
Waldemar Kaczmarczyk

We consider mixed-integer linear programming (MIP) models of production planning problems known as small bucket lot-sizing and scheduling problems. We present an application of a class of valid inequalities to the case with lost demand (stock-out) costs. Results of numerical experiments for the Proportional Lot-sizing and Scheduling Problem (PLSP) confirm the benefits of such an extended model formulation.
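
As a hedged illustration of the lost-demand idea only (not the paper's PLSP model, its setup-carryover structure, or its valid inequalities), here is a toy single-item lot-sizing MIP with a stock-out cost, written with the PuLP modeling library; all data are made up.

```python
import pulp

# Toy data (hypothetical): 4 periods, one product.
T = range(4)
demand   = [3, 5, 2, 6]
capacity = 6
setup_cost, holding_cost, lost_cost = 10.0, 1.0, 20.0

model = pulp.LpProblem("lot_sizing_with_lost_demand", pulp.LpMinimize)

x = pulp.LpVariable.dicts("produce", T, lowBound=0)   # production quantity
s = pulp.LpVariable.dicts("stock", T, lowBound=0)     # end-of-period inventory
u = pulp.LpVariable.dicts("lost", T, lowBound=0)      # lost (unserved) demand
y = pulp.LpVariable.dicts("setup", T, cat="Binary")   # setup indicator

model += pulp.lpSum(setup_cost * y[t] + holding_cost * s[t] + lost_cost * u[t]
                    for t in T)

for t in T:
    prev = s[t - 1] if t > 0 else 0
    model += prev + x[t] + u[t] == demand[t] + s[t]   # balance with lost demand
    model += x[t] <= capacity * y[t]                  # produce only if set up

model.solve(pulp.PULP_CBC_CMD(msg=False))
print({t: (x[t].value(), u[t].value()) for t in T})
```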


2020 ◽  
Author(s):  
Dileep Kalathil ◽  
Vivek S. Borkar ◽  
Rahul Jain

We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a nonasymptotic sample complexity bound and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared with stochastic approximation-based algorithms.
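
A rough sketch of the empirical Q-value iteration idea under a generative-model assumption: each update replaces the unknown expectation over next states in the Bellman operator with an empirical average over simulated samples. The MDP below is synthetic and the code is only an illustration of the scheme, not the authors' implementation.

```python
import numpy as np

def empirical_q_iteration(sample_next_state, cost, S, A, gamma=0.9,
                          n_samples=50, n_iters=200, seed=0):
    """Empirical Q-value iteration with a simulator (generative model).

    sample_next_state(s, a, rng) returns one sampled next state;
    cost[s, a] is the one-step cost.  The expectation in the Bellman
    operator is replaced by an empirical average over n_samples draws.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        Q_new = np.empty_like(Q)
        for s in range(S):
            for a in range(A):
                nxt = [sample_next_state(s, a, rng) for _ in range(n_samples)]
                Q_new[s, a] = cost[s, a] + gamma * np.mean([Q[s2].min() for s2 in nxt])
        Q = Q_new
    return Q

# Tiny synthetic discounted-cost MDP with a sampling oracle.
S, A = 3, 2
rng0 = np.random.default_rng(1)
P = rng0.dirichlet(np.ones(S), size=(S, A))
cost = rng0.random((S, A))
Q_hat = empirical_q_iteration(
    lambda s, a, rng: rng.choice(S, p=P[s, a]), cost, S, A)
```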


2020 ◽  
Vol 40 (1) ◽  
pp. 117-137
Author(s):  
R. Israel Ortega-Gutiérrez ◽  
H. Cruz-Suárez

This paper addresses a class of sequential optimization problems known as Markov decision processes. These processes are considered on Euclidean state and action spaces with the total expected discounted cost as the objective function. The main goal of the paper is to provide conditions that guarantee an adequate Moreau-Yosida regularization for Markov decision processes (named the original process). In this way, a new Markov decision process is established that conforms to the Markov control model of the original process except for the cost function, which is induced via the Moreau-Yosida regularization. Compared to the original process, this new discounted Markov decision process has richer properties, such as differentiability and strict convexity of its optimal value function and uniqueness of the optimal policy; moreover, the optimal value function and the optimal policy of both processes coincide. To complement the theory presented, an example is provided.
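
For reference, the standard Moreau-Yosida envelope that this regularization is based on, written here for a cost function c on a Euclidean space with parameter λ > 0 (the precise way it is applied to the Markov control model is the paper's):

```latex
c_{\lambda}(x) \;=\; \inf_{y \in \mathbb{R}^{n}}
  \left\{ c(y) + \frac{1}{2\lambda}\,\lVert x - y \rVert^{2} \right\},
  \qquad \lambda > 0 .
```

When c is proper, convex, and lower semicontinuous, c_λ is convex and differentiable with a (1/λ)-Lipschitz gradient, which is the smoothing effect behind the richer properties of the regularized process.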


Author(s):  
Bingxin Yao ◽  
Bin Wu ◽  
Siyun Wu ◽  
Yin Ji ◽  
Danggui Chen ◽  
...  

In this paper, an offloading algorithm based on a Markov Decision Process (MDP) is proposed to solve the multi-objective offloading decision problem in a Mobile Edge Computing (MEC) system. The distinguishing feature of the algorithm is that an MDP is used to make the offloading decisions. The number of tasks in the task queue, the number of accessible edge clouds, and the signal-to-noise ratio (SNR) of the wireless channel are taken into account in the state space of the MDP model. The offloading delay and energy consumption are used to define the value function of the MDP model, i.e., the objective function. To maximize the value function, the value iteration algorithm is used to obtain the optimal offloading policy. According to this policy, tasks of mobile terminals (MTs) are offloaded to an edge cloud or the central cloud, or executed locally. Simulation results show that the proposed algorithm effectively reduces the offloading delay and energy consumption.
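
A minimal value-iteration sketch in the spirit of the described model. Here the state (queue length, number of reachable edge clouds, SNR level) is collapsed to a single index, and the per-action delay/energy costs, the trade-off weight, and the transition kernel are synthetic placeholders rather than the paper's parameters.

```python
import numpy as np

# Actions: 0 = execute locally, 1 = offload to edge cloud, 2 = offload to central cloud.
S, A, gamma = 12, 3, 0.9          # S = flattened (queue, #edges, SNR) states (toy size)
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))      # synthetic transition kernel
delay  = rng.random((S, A))                      # placeholder delay cost
energy = rng.random((S, A))                      # placeholder energy cost
w = 0.5                                          # delay/energy trade-off weight
reward = -(w * delay + (1 - w) * energy)         # value function combines both objectives

V = np.zeros(S)
for _ in range(500):                             # value iteration
    Q = reward + gamma * P @ V                   # (S, A) action values
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)                        # 0/1/2 offloading decision per state
```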


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 955
Author(s):  
Xiaoling Mo ◽  
Daoyun Xu ◽  
Zufeng Fu

In a general Markov decision process system, only a single agent's learning evolution is considered. However, restricting attention to one agent is limiting in many problems, and more and more applications involve multiple agents. Multi-agent settings fall into two broad types: cooperative environments and game (competitive) environments. Therefore, this paper introduces a Cooperation Markov Decision Process (CMDP) system with two agents, which is suitable for the learning evolution of cooperative decisions between two agents. It is further shown that the value function in the CMDP system also converges, and that the limit is independent of the choice of the initial value function. This paper presents an algorithm for finding the optimal strategy pair (πk0, πk1) in the CMDP system, whose fundamental task is to find an optimal strategy pair and form the evolutionary system CMDP(πk0, πk1). Finally, an example is given to support the theoretical results.
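
One way to picture the search for a strategy pair on a finite cooperative MDP: both agents share a single value function, the joint action is improved greedily, and the resulting joint policy is split into a pair (π0, π1). This is a generic illustration on synthetic data, not the paper's exact CMDP algorithm.

```python
import numpy as np

def cooperative_value_iteration(P, r, gamma=0.9, n_iters=500):
    """Joint value iteration for two cooperating agents with a shared reward.

    P: (S, A0, A1, S) joint transition kernel, r: (S, A0, A1) shared reward.
    Returns V and the strategy pair (pi0, pi1) as per-state action indices.
    """
    S, A0, A1, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = r + gamma * np.tensordot(P, V, axes=([3], [0]))  # (S, A0, A1)
        V = Q.reshape(S, -1).max(axis=1)
    joint = Q.reshape(S, -1).argmax(axis=1)       # best joint action per state
    pi0, pi1 = np.unravel_index(joint, (A0, A1))  # split into a strategy pair
    return V, pi0, pi1

rng = np.random.default_rng(0)
S, A0, A1 = 4, 2, 3
P = rng.dirichlet(np.ones(S), size=(S, A0, A1))
r = rng.random((S, A0, A1))
V, pi0, pi1 = cooperative_value_iteration(P, r)
```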


1993 ◽  
Vol 7 (3) ◽  
pp. 369-385 ◽  
Author(s):  
Kyle Siegrist

We consider N sites (N ≤ ∞), each of which may be either occupied or unoccupied. Time is discrete, and at each time unit a set of occupied sites may attempt to capture a previously unoccupied site. The attempt will be successful with a probability that depends on the number of sites making the attempt, in which case the new site will also be occupied. A benefit is gained when new sites are occupied, but capture attempts are costly. The problem of optimal occupation is formulated as a Markov decision process in which the admissible actions are occupation strategies and the cost is a function of the strategy and the number of occupied sites. A partial order on the state-action pairs is used to obtain a comparison result for stationary policies and qualitative results concerning monotonicity of the value function for the n-stage problem (n ≤ ∞). The optimal policies are partially characterized when the cost depends on the action only through the total number of occupation attempts made.
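
A toy backward-induction sketch of the finite-horizon occupation problem, taking the state to be the number of occupied sites and the action the number of occupied sites making a capture attempt. The success probability, benefit, and per-attempt cost below are illustrative choices, not the paper's model.

```python
import numpy as np

N, horizon = 10, 8                       # N sites, n-stage problem (toy sizes)
benefit, attempt_cost = 5.0, 0.4

def p_success(k):
    """Probability that an attempt by k occupied sites succeeds (illustrative)."""
    return 1.0 - 0.5 ** k if k > 0 else 0.0

# V[t, m] = optimal expected reward-to-go with m occupied sites and t stages left.
V = np.zeros((horizon + 1, N + 1))
policy = np.zeros((horizon + 1, N + 1), dtype=int)
for t in range(1, horizon + 1):
    for m in range(1, N + 1):
        best_val, best_k = V[t - 1, m], 0          # k = 0: make no attempt
        if m < N:
            for k in range(1, m + 1):              # k occupied sites attempt a capture
                p = p_success(k)
                val = (-attempt_cost * k
                       + p * (benefit + V[t - 1, m + 1])
                       + (1 - p) * V[t - 1, m])
                if val > best_val:
                    best_val, best_k = val, k
        V[t, m], policy[t, m] = best_val, best_k

print(policy[horizon, 1:])   # optimal number of attempting sites per occupied count
```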


Author(s):  
Sultan Javed Majeed ◽  
Marcus Hutter

Temporal-difference (TD) learning is an attractive, computationally efficient framework for model-free reinforcement learning. Q-learning is one of the most widely used TD learning techniques; it enables an agent to learn the optimal action-value function, i.e., the Q-value function. Despite its widespread use, Q-learning has only been proven to converge on Markov Decision Processes (MDPs) and Q-uniform abstractions of finite-state MDPs. On the other hand, most real-world problems are inherently non-Markovian: the full true state of the environment is not revealed by recent observations. In this paper, we investigate the behavior of Q-learning when applied to non-MDP and non-ergodic domains which may have infinitely many underlying states. We prove that the convergence guarantee of Q-learning can be extended to a class of such non-MDP problems, in particular, to some non-stationary domains. We show that state-uniformity of the optimal Q-value function is a necessary and sufficient condition for Q-learning to converge even in the case of infinitely many internal states.
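
For reference, the standard tabular Q-learning update the convergence analysis refers to (in the paper's non-MDP setting the agent conditions on observation-derived states rather than true Markov states; the toy table below is only a usage example).

```python
import numpy as np

def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One standard Q-learning update: Q(s,a) <- Q(s,a) + alpha * TD-error."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy usage on a 5-state, 2-action table with a made-up transition.
Q = np.zeros((5, 2))
Q = q_learning_step(Q, s=0, a=1, r=1.0, s_next=3)
```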

