Strong 0-discount optimal policies in a Markov decision process with a Borel state space

1995 ◽  
Vol 42 (1) ◽  
pp. 93-108 ◽  
Author(s):  
A. A. Yushkevich


Author(s):  
Hyungseok Song ◽  
Hyeryung Jang ◽  
Hai H. Tran ◽  
Se-eun Yoon ◽  
Kyunghwan Son ◽  
...  

We consider the Markov Decision Process (MDP) of selecting a subset of items at each step, termed the Select-MDP (S-MDP). The large state and action spaces of S-MDPs make them intractable for typical reinforcement learning (RL) algorithms, especially when the number of items is huge. In this paper, we present a deep RL algorithm that addresses this issue by adopting the following key ideas. First, we convert the original S-MDP into an Iterative Select-MDP (IS-MDP), which is equivalent to the S-MDP in terms of optimal actions. The IS-MDP decomposes the joint action of selecting K items simultaneously into K iterative selections, reducing the number of actions at the expense of an exponential increase in the number of states. Second, we overcome this state-space explosion by exploiting a special symmetry in IS-MDPs with novel weight-shared Q-networks, which provably maintain sufficient expressive power. Various experiments demonstrate that our approach works well even when the item space is large and that it scales to environments with item spaces different from those used in training.
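As an illustration of the second idea, the following is a minimal PyTorch sketch of a weight-shared Q-network reused across the K iterative selections; the architecture, layer sizes, and pooling scheme are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SharedItemQNet(nn.Module):
    """One Q-network scores every remaining item; the same weights are
    reused at each of the K selection steps, and permutation symmetry is
    respected by encoding items identically and mean-pooling a context.
    (Sketch only; layer sizes and pooling are illustrative assumptions.)"""

    def __init__(self, item_dim: int, hidden: int = 64):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(item_dim, hidden), nn.ReLU())
        self.q_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, items: torch.Tensor, selected: torch.Tensor):
        # items: (N, item_dim); selected: (N,) with 1.0 for already-chosen items
        h = self.encode(items)                          # (N, hidden)
        ctx = (h * selected.unsqueeze(-1)).mean(dim=0)  # permutation-invariant pool
        q = self.q_head(torch.cat([h, ctx.expand(h.shape[0], -1)], dim=-1))
        return q.squeeze(-1).masked_fill(selected.bool(), float("-inf"))

def select_k(net: SharedItemQNet, items: torch.Tensor, k: int):
    """Decompose the joint selection of k items into k iterative argmax picks."""
    selected = torch.zeros(items.shape[0])
    for _ in range(k):
        with torch.no_grad():
            selected[net(items, selected).argmax()] = 1.0
    return selected.nonzero().flatten().tolist()
```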


2010 ◽  
Vol 190 (1) ◽  
pp. 289-309 ◽  
Author(s):  
Lars Relund Nielsen ◽  
Erik Jørgensen ◽  
Søren Højsgaard

2019 ◽  
Vol 11 (7) ◽  
pp. 2060
Author(s):  
Yu Wu ◽  
Bo Zeng ◽  
Siming Huang

In this paper, a home service problem is studied in which a capacitated vehicle collects customers' parcels in one pick-up tour. We consider a situation where customers who have scheduled their services in advance may call to cancel their appointments, and customers without appointments must also be visited if they request service, as long as capacity allows. To handle these changes as they occur over the tour, a dynamic strategy is needed to guide the vehicle to visit customers efficiently. Aiming to minimize the vehicle's total expected travel distance, we model this problem as a multi-dimensional Markov Decision Process (MDP) with a finite but exponentially large state space. We solve this MDP exactly via dynamic programming, whose computational complexity is exponential. To keep this complexity from growing further, we develop a fast method for looking up the record of an already-examined state. Although such lookup schemes generally waste a large amount of memory, by exploiting critical structural properties of the state space we obtain an O(1) lookup method without any wasted memory. Computational experiments demonstrate the effectiveness of our model and the developed solution method. For larger instances, two well-performing heuristics are proposed.
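The following sketch illustrates the generic idea behind an O(1) lookup of an already-examined state's record: encode each state as a perfect integer index into a flat table, so every lookup is one array access. It uses a deterministic Held-Karp-style pickup tour as a stand-in; the paper's stochastic cancellation model and its memory-saving structural properties are not reproduced here.

```python
import math

def solve(n: int, dist):
    """dist[i][j]: travel distance between locations 0..n-1; location 0 is the depot."""
    NOT_SET = math.inf
    # Perfect index: state = visited_mask * n + current_location, so checking
    # an already-examined state's record is a single array access.
    value = [NOT_SET] * ((1 << n) * n)

    def V(mask: int, cur: int) -> float:
        idx = mask * n + cur
        if value[idx] != NOT_SET:          # O(1) lookup, no hashing
            return value[idx]
        if mask == (1 << n) - 1:           # all customers visited:
            best = dist[cur][0]            # return to the depot
        else:
            best = min(dist[cur][j] + V(mask | (1 << j), j)
                       for j in range(n) if not mask & (1 << j))
        value[idx] = best
        return best

    return V(1, 0)                         # start at the depot
```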


Author(s):  
Takeshi Tateyama ◽  
Seiichi Kawata ◽  
Yoshiki Shimomura ◽  
...  

The k-certainty exploration method, an efficient reinforcement learning algorithm, cannot be applied directly to environments whose state space is continuous, because the continuous state space must first be converted into a discrete one. Our purpose is to construct discrete semi-Markov decision process (SMDP) models of such environments by using growing cell structures to autonomously partition the continuous state space and then using the k-certainty exploration method to construct the SMDP models. A multiagent k-certainty exploration method is then used to improve exploration efficiency. Mobile robot simulations demonstrate our proposal's usefulness and efficiency.
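A minimal sketch of the k-certainty bookkeeping on an already-discretized state space follows: every state-action pair must be tried at least k times before its empirical transition statistics are trusted. The uniform-grid discretize function is a stand-in assumption for the adaptive growing-cell-structure partitioning used in the paper.

```python
import random
from collections import defaultdict

K = 3  # certainty threshold: a pair (s, a) is "k-certain" after K visits

count = defaultdict(int)                       # visit counts of (state, action)
model = defaultdict(lambda: defaultdict(int))  # (s, a) -> next-state counts

def discretize(x: float, cell: float = 0.1) -> int:
    """Uniform-grid stand-in for growing-cell-structure partitioning."""
    return int(x // cell)

def choose_action(s: int, actions) -> int:
    # Prefer actions that are not yet k-certain in state s.
    uncertain = [a for a in actions if count[(s, a)] < K]
    return random.choice(uncertain if uncertain else list(actions))

def record(s: int, a: int, s_next: int) -> None:
    count[(s, a)] += 1
    model[(s, a)][s_next] += 1  # empirical transition statistics for the SMDP model
```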


1993 ◽  
Vol 7 (3) ◽  
pp. 369-385 ◽  
Author(s):  
Kyle Siegrist

We consider N sites (N ≤ ∞), each of which may be either occupied or unoccupied. Time is discrete, and at each time unit a set of occupied sites may attempt to capture a previously unoccupied site. The attempt will be successful with a probability that depends on the number of sites making the attempt, in which case the new site will also be occupied. A benefit is gained when new sites are occupied, but capture attempts are costly. The problem of optimal occupation is formulated as a Markov decision process in which the admissible actions are occupation strategies and the cost is a function of the strategy and the number of occupied sites. A partial order on the state-action pairs is used to obtain a comparison result for stationary policies and qualitative results concerning monotonicity of the value function for the n-stage problem (n ≤ ∞). The optimal policies are partially characterized when the cost depends on the action only through the total number of occupation attempts made.
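The following is a minimal sketch of n-stage value iteration for a simplified finite-N version of this model, where the state is the number of occupied sites and the action is the number of occupied sites attempting a capture; the success probability, benefit, and cost values below are illustrative assumptions, not the paper's specification.

```python
N, STAGES = 10, 20          # number of sites and planning horizon
b, c = 1.0, 0.3             # benefit per newly occupied site, cost per attempt

def p(t: int) -> float:
    """Illustrative capture probability, increasing in the number of attackers."""
    return 0.0 if t == 0 else t / (t + 1.0)

V = [0.0] * (N + 1)         # terminal values: V_0(m) = 0
for _ in range(STAGES):     # backward induction over the n-stage problem
    V_next = [0.0] * (N + 1)
    for m in range(N + 1):  # m = number of currently occupied sites
        best = V[m]         # t = 0: make no capture attempt
        if 0 < m < N:       # need at least one attacker and one unoccupied target
            for t in range(1, m + 1):
                best = max(best, p(t) * (b + V[m + 1])
                                 + (1 - p(t)) * V[m] - c * t)
        V_next[m] = best
    V = V_next
print(V)                    # expected net benefit from each starting state
```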

