Ride-Hailing Order Dispatching at DiDi via Reinforcement Learning

2020 ◽  
Vol 50 (5) ◽  
pp. 272-286
Author(s):  
Zhiwei (Tony) Qin ◽  
Xiaocheng Tang ◽  
Yan Jiao ◽  
Fan Zhang ◽  
Zhe Xu ◽  
...  

Order dispatching is instrumental to the marketplace engine of a large-scale ride-hailing platform such as DiDi, which continuously matches passenger trip requests to drivers at a scale of tens of millions per day. Because of the dynamic and stochastic nature of supply and demand in this context, the ride-hailing order-dispatching problem is challenging to solve optimally. Added to the complexity are considerations of system response time, reliability, and multiple objectives. In this paper, we describe how our approach to this optimization problem has evolved from a combinatorial optimization approach to one that encompasses a semi-Markov decision-process model and deep reinforcement learning. We discuss the practical considerations of our solution development and its real-world impact on the business.
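As a rough illustration of the combinatorial-optimization stage mentioned above, driver–order matching can be posed as a bipartite assignment problem. The sketch below brute-forces a tiny instance; the `dispatch` helper and score matrix are invented for illustration, and a production dispatcher would use a dedicated matching solver rather than enumeration:

```python
from itertools import permutations

def dispatch(score):
    """Brute-force optimal driver-order matching for a tiny score matrix.

    score[d][o] is the estimated value of assigning driver d to order o.
    Returns, for each driver, the order index that maximizes total score.
    Enumeration is O(n!) and only viable for toy instances.
    """
    n = len(score)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(score[d][perm[d]] for d in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm)

# Three drivers, three trip requests (scores are made up).
scores = [[3.0, 1.0, 2.0],
          [1.0, 0.5, 4.0],
          [2.0, 3.0, 1.0]]
assignment = dispatch(scores)  # driver d serves order assignment[d]
```

The reinforcement-learning refinement described in the paper would replace the static scores with learned long-term value estimates.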

2021 ◽  
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
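The subgradients computed by the scheme above would plug into the standard subgradient method for non-differentiable convex objectives. The sketch below shows that generic scheme on a toy one-dimensional objective; it is not the paper's algorithm, and the objective is an assumption for illustration:

```python
def subgradient_method(f, subgrad, x0, steps=500):
    """Plain subgradient method for a convex, possibly non-differentiable
    objective, with a diminishing step size. The best iterate is tracked
    because subgradient steps are not monotone in f."""
    x = list(x0)
    best_x, best_f = list(x0), f(x0)
    for k in range(steps):
        g = subgrad(x)
        t = 0.1 / (k + 1) ** 0.5  # diminishing step size
        x = [xi - t * gi for xi, gi in zip(x, g)]
        fx = f(x)
        if fx < best_f:
            best_f, best_x = fx, list(x)
    return best_x, best_f

# Toy objective: f(x) = |x - 3|, non-differentiable at its minimizer.
f = lambda x: abs(x[0] - 3.0)
subgrad = lambda x: [1.0 if x[0] > 3.0 else -1.0]
x_best, f_best = subgradient_method(f, subgrad, [0.0])
```

With the diminishing step size, the best iterate approaches the minimizer even though the objective has a kink there.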


Author(s):  
Ruiyang Song ◽  
Kuang Xu

We propose and analyze a temporal concatenation heuristic for solving large-scale finite-horizon Markov decision processes (MDPs), which divides the MDP into smaller sub-problems along the time horizon and generates an overall solution by simply concatenating the optimal solutions from these sub-problems. As a “black box” architecture, temporal concatenation works with a wide range of existing MDP algorithms. Our main results characterize the regret of temporal concatenation compared to the optimal solution. We provide upper bounds for general MDP instances, as well as a family of MDP instances in which the upper bounds are shown to be tight. Together, our results demonstrate temporal concatenation's potential for substantial speed-up at the expense of some performance degradation.
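The heuristic is easy to sketch for a tiny finite-horizon MDP. The backward-induction solver and the toy two-state MDP below are invented for illustration, not the paper's setup:

```python
def solve_finite_horizon(P, R, H):
    """Backward induction for a finite-horizon MDP.
    P[a][s][t]: transition probability s -> t under action a.
    R[a][s]: immediate reward for action a in state s.
    Returns (policy, V0): per-step greedy policies and the value at t = 0."""
    n_states, n_actions = len(R[0]), len(R)
    V = [0.0] * n_states
    policy = []
    for _ in range(H):
        new_v, pi = [], []
        for s in range(n_states):
            q = [R[a][s] + sum(P[a][s][t] * V[t] for t in range(n_states))
                 for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            pi.append(best)
            new_v.append(q[best])
        V = new_v
        policy.insert(0, pi)  # the last-computed policy is for the earliest step
    return policy, V

def temporal_concatenation(P, R, H, pieces):
    """Split horizon H into equal segments, solve each sub-MDP independently,
    and concatenate the per-step policies. With time-homogeneous P and R the
    sub-problems coincide, which keeps this toy example short."""
    seg = H // pieces
    full_policy = []
    for _ in range(pieces):
        pi, _ = solve_finite_horizon(P, R, seg)
        full_policy.extend(pi)
    return full_policy

# Two states; action 0 stays, action 1 switches states.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
# Staying in state 1 pays 1; switching out of state 0 pays 0.5.
R = [[0.0, 1.0],
     [0.5, 0.0]]
full_policy = temporal_concatenation(P, R, H=4, pieces=2)
```

The regret analyzed in the paper comes from the segments ignoring the value accumulated beyond their own sub-horizon; in this toy instance the concatenated policy happens to coincide with the optimal one.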


2016 ◽  
Vol 19 (02) ◽  
pp. 239-252 ◽  
Author(s):  
Morteza Haghighat Sefat ◽  
Khafiz M. Muradov ◽  
Ahmed H. Elsheikh ◽  
David R. Davies

Summary The popularity of intelligent wells (I-wells), which provide layer-by-layer monitoring and control of production and injection, is growing. However, the number of available techniques for optimal control of I-wells is limited (Sarma et al. 2006; Alghareeb et al. 2009; Almeida et al. 2010; Grebenkin and Davies 2012). Currently, most I-wells equipped with interval control valves (ICVs) are operated to enhance current production and to resolve problems associated with breakthrough of the unfavorable phase. This reactive strategy is unlikely to deliver the long-term optimum production. On the other hand, the proactive-control strategy for I-wells, with its ambition to provide optimum control over the well's entire production life, has the potential to maximize cumulative oil production. This strategy, however, results in a high-dimensional, nonlinear, and constrained optimization problem. This study provides guidelines on selecting a suitable proactive optimization approach, by use of state-of-the-art stochastic gradient-approximation algorithms. A suitable optimization approach increases the practicality of proactive optimization for real field models under uncertain operational and subsurface conditions. We evaluate the simultaneous-perturbation stochastic approximation (SPSA) method (Spall 1992) and the ensemble-based optimization (EnOpt) method (Chen et al. 2009). In addition, we present a new derivation of EnOpt by use of the concept of directional derivatives. The numerical results show that both the SPSA and EnOpt methods can provide a fast solution to a large-scale, multiple-I-well proactive optimization problem. A criterion for tuning the algorithms is proposed, and the performance of both methods is compared for several test cases. The gradient-estimation methodology is shown to affect the application area of each algorithm.
SPSA provides a rough estimate of the gradient and performs better in search environments characterized by several local optima, especially with a large ensemble size. EnOpt was found to provide a smoother estimate of the gradient, resulting in an algorithm that is more robust to the choice of tuning parameters and performs better with a small ensemble size. Moreover, the final optimum operation obtained by EnOpt is smoother. Finally, the obtained criteria are used to perform proactive optimization of ICVs in a real field.
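The SPSA estimator (Spall 1992) referenced above is simple to sketch: all coordinates are perturbed simultaneously with random ±1 signs, so a full gradient estimate costs only two objective evaluations regardless of dimension. The toy quadratic below is an assumption for illustration, not a reservoir model:

```python
import random

def spsa_gradient(f, x, c=0.05):
    """One simultaneous-perturbation gradient estimate (Spall 1992).
    All coordinates are perturbed at once with random +/-1 signs, so the
    estimate costs two evaluations of f regardless of dimension."""
    delta = [random.choice((-1.0, 1.0)) for _ in x]
    up = [xi + c * di for xi, di in zip(x, delta)]
    down = [xi - c * di for xi, di in zip(x, delta)]
    diff = (f(up) - f(down)) / (2.0 * c)
    return [diff / di for di in delta]

# Drive a toy quadratic toward its minimum with SPSA-based descent.
random.seed(0)
f = lambda x: sum(xi * xi for xi in x)
x = [2.0, -1.5, 0.5]
for _ in range(200):
    g = spsa_gradient(f, x)
    x = [xi - 0.05 * gi for xi, gi in zip(x, g)]
final_loss = f(x)  # near zero
```

The estimate is noisy step-to-step but unbiased in expectation, which is the "rough gradient" behavior the abstract contrasts with EnOpt's smoother ensemble average.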


Energies ◽  
2020 ◽  
Vol 13 (8) ◽  
pp. 1959
Author(s):  
Delaram Azari ◽  
Shahab Shariat Torbaghan ◽  
Hans Cappon ◽  
Karel J. Keesman ◽  
Madeleine Gibescu ◽  
...  

The large-scale integration of intermittent distributed energy resources has led to increased uncertainty in the planning and operation of distribution networks. The optimal flexibility dispatch is a recently introduced, power-flow-based method that a distribution system operator can use to effectively determine the amount of flexibility it needs to procure from the controllable resources available on the demand side. However, the drawback of this method is that the optimal flexibility dispatch is inexact because of the relaxation error inherent in the second-order cone formulation. In this paper, we propose a novel bi-level optimization problem, where the upper-level problem seeks to minimize the relaxation error and the lower level solves the earlier introduced convex second-order cone optimal flexibility dispatch (SOC-OFD) problem. To make the problem tractable, we introduce an innovative reformulation to recast the bi-level problem as a non-linear, single-level optimization problem, which results in no loss of accuracy. We subsequently investigate the sensitivity of the optimal flexibility schedules and the locational flexibility prices with respect to uncertainty in load forecasts and flexibility ranges of the demand-response providers, which are input parameters to the problem. The sensitivity analysis is performed based on the perturbed Karush–Kuhn–Tucker (KKT) conditions. We investigate the feasibility and scalability of the proposed method in three case studies of standardized 9-bus, 30-bus, and 300-bus test systems. Simulation results in terms of local flexibility prices are interpreted in economic terms and show the effectiveness of the proposed approach.


2010 ◽  
Vol 44-47 ◽  
pp. 3611-3615 ◽  
Author(s):  
Zhi Cong Zhang ◽  
Kai Shun Hu ◽  
Hui Yu Huang ◽  
Shuai Li ◽  
Shao Yong Zhao

Reinforcement learning (RL) is a state- or action-value-based machine learning method that approximately solves large-scale Markov Decision Processes (MDPs) or Semi-Markov Decision Processes (SMDPs). A multi-step RL algorithm called Sarsa(λ, k) is proposed, which is a compromise between Sarsa and Sarsa(λ): it is equivalent to Sarsa if k is 1 and equivalent to Sarsa(λ) if k is infinite. Sarsa(λ, k) adjusts its performance through the setting of k. Two forms of Sarsa(λ, k), the forward view and the backward view, are constructed and proved equivalent in off-line updating.
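One plausible reading of the backward view, sketched in tabular form: eligibility traces are dropped once they are k steps old, so k = 1 degenerates to one-step Sarsa while unbounded k recovers Sarsa(λ). The environment interface and toy chain task are assumptions for illustration, not the paper's formulation:

```python
import random

def sarsa_lambda_k(env_step, n_states, n_actions, episodes, k,
                   alpha=0.1, gamma=0.95, lam=0.9, eps=0.1):
    """Backward-view Sarsa with eligibility traces truncated after k steps.
    k = 1 degenerates to one-step Sarsa; unbounded k recovers Sarsa(lambda).
    env_step(s, a) must return (reward, next_state, done)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s, a = 0, eps_greedy(0)
        traces = {}  # (state, action) -> [eligibility, age in steps]
        done = False
        while not done:
            r, s2, done = env_step(s, a)
            a2 = eps_greedy(s2)
            delta = r + (0.0 if done else gamma * Q[s2][a2]) - Q[s][a]
            traces[(s, a)] = [1.0, 0]
            for (ts, ta), (e, age) in list(traces.items()):
                Q[ts][ta] += alpha * delta * e
                if age + 1 >= k:
                    del traces[(ts, ta)]  # trace expires after k steps
                else:
                    traces[(ts, ta)] = [e * gamma * lam, age + 1]
            s, a = s2, a2
    return Q

# Toy 3-state chain: action 1 moves right, action 0 moves left;
# reaching state 2 ends the episode with reward 1.
def chain_step(s, a):
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    done = s2 == 2
    return (1.0 if done else 0.0), s2, done

random.seed(1)
Q = sarsa_lambda_k(chain_step, n_states=3, n_actions=2, episodes=300, k=3)
```

After training, moving right is valued above moving left in every state, as expected for the chain task.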


2021 ◽  
Author(s):  
Ibrahim Elgendy ◽  
Ammar Muthanna ◽  
Mohammad Hammoudeh ◽  
Hadil Ahmed Shaiba ◽  
Devrim Unal ◽  
...  

The Internet of Things (IoT) is permeating our daily lives, providing data-collection tools and important measurements to inform our decisions. IoT devices continually generate massive amounts of data and exchange essential messages over networks for further analysis. The promise of low communication latency, enhanced security, and efficient bandwidth utilization is driving the shift from Mobile Cloud Computing (MCC) toward Mobile Edge Computing (MEC). In this study, we propose an advanced deep-reinforcement resource-allocation and security-aware data-offloading model that considers the computation and radio resources of industrial IoT devices to guarantee that resources shared between multiple users are utilized efficiently. The model is formulated as an optimization problem with the goal of decreasing energy consumption and computation delay. This type of problem is NP-hard due to the curse-of-dimensionality challenge; thus, a deep-learning optimization approach is presented to find an optimal solution. Additionally, an AES-based cryptographic approach is implemented as a security layer to satisfy data-security requirements. Experimental evaluation results show that the proposed model can reduce offloading overhead by up to 13.2% and 64.7% in comparison with full offloading and local execution, respectively, while scaling well to large numbers of devices.


2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states, i.e., states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
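The risk criterion described above, the probability of ever entering an error state, can be expressed as a cumulative 0/1 return and estimated by Monte Carlo rollouts. The interface and toy task below are assumptions for illustration, not the feed-tank application:

```python
import random

def estimate_value_and_risk(policy, step, start, episodes=2000):
    """Monte Carlo estimates of a policy's (undiscounted) value and risk,
    where risk is the probability of ever entering an error state, i.e. a
    cumulative 0/1 return. step(s, a) returns
    (reward, next_state, done, is_error)."""
    total_return, error_hits = 0.0, 0
    for _ in range(episodes):
        s, ep_return, entered_error, done = start, 0.0, False, False
        while not done:
            r, s, done, is_error = step(s, policy[s])
            ep_return += r
            entered_error = entered_error or is_error
        total_return += ep_return
        error_hits += entered_error
    return total_return / episodes, error_hits / episodes

# Toy task: action 1 pays 1 but enters an error state with probability 0.3;
# action 0 safely pays 0.5. The risky policy below should show risk ~0.3.
def step(s, a):
    if a == 0:
        return 0.5, 0, True, False
    if random.random() < 0.3:
        return 0.0, 0, True, True
    return 1.0, 0, True, False

random.seed(0)
value, risk = estimate_value_and_risk(policy=[1], step=step, start=0)
```

The paper's algorithm would then weight these two criteria and adapt the weight until the risk estimate falls below the user-specified threshold.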


2004 ◽  
Vol 31 (3-4) ◽  
pp. 361-394 ◽  
Author(s):  
M. Papadrakakis ◽  
N.D. Lagaros ◽  
V. Plevris

In engineering problems, randomness and uncertainty are inherent, and the scatter of structural parameters from their nominal ideal values is unavoidable. In Reliability-Based Design Optimization (RBDO) and Robust Design Optimization (RDO), the uncertainties play a dominant role in the formulation of the structural optimization problem. In an RBDO problem, additional non-deterministic constraint functions are considered, while an RDO formulation leads to designs with a state of robustness, so that their performance is least sensitive to the variability of the uncertain variables. In the first part of this study, a metamodel-assisted RBDO methodology is examined for large-scale structural systems. In the second part, an RDO structural problem is considered. The task of robust design optimization of structures is formulated as a multi-criteria optimization problem, in which the design variables of the optimization problem, together with other design parameters such as the modulus of elasticity and the yield stress, are considered random variables with mean values equal to their nominal values.


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Wenkai Li ◽  
Chenyang Wang ◽  
Ding Li ◽  
Bin Hu ◽  
Xiaofei Wang ◽  
...  

Edge caching is a promising method to deal with the traffic explosion problem in future networks. To satisfy user requests, contents can be proactively cached locally in proximity to users (e.g., at base stations or user devices). Recently, some learning-based edge-caching optimizations have been discussed. However, most previous studies explore the influence of a dynamically and constantly expanding action and caching space, leading to impracticality and low efficiency. In this paper, we study the edge-caching optimization problem by utilizing the Double Deep Q-network (Double DQN) learning framework to maximize the hit rate of user requests. First, we obtain a Device-to-Device (D2D) sharing model by considering both online and offline factors, and we formulate the optimization problem, which is proved to be NP-hard. The edge-caching replacement problem is then modeled as a Markov decision process (MDP). Finally, an edge-caching strategy based on Double DQN is proposed. Experimental results based on large-scale real-world traces show the effectiveness of the proposed framework.
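The Double DQN update referenced above decouples action selection (online network) from action evaluation (target network). A minimal sketch of that bootstrap target, with plain Q-tables standing in for the networks (the tables and values are made up):

```python
def double_dqn_target(q_online, q_target, reward, next_state, done, gamma=0.99):
    """Double DQN bootstrap target: the online network selects the best next
    action, the target network evaluates it. Vanilla DQN instead takes the
    max over the target network, which biases value estimates upward."""
    if done:
        return reward
    n_actions = len(q_online[next_state])
    a_star = max(range(n_actions), key=lambda a: q_online[next_state][a])
    return reward + gamma * q_target[next_state][a_star]

# Plain Q-tables stand in for the two networks (values are made up).
q_online = [[0.2, 0.9]]   # online net prefers action 1 in state 0
q_target = [[0.5, 0.1]]   # target net scores that action lower
target = double_dqn_target(q_online, q_target, reward=1.0, next_state=0, done=False)
# 1.0 + 0.99 * q_target[0][1] = 1.099, versus 1.495 for the vanilla DQN max.
```

In the caching setting, states would encode cache contents and request statistics, and actions would be replacement decisions; this sketch only shows the target computation that distinguishes Double DQN from DQN.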

