Ride-Hailing Order Dispatching at DiDi via Reinforcement Learning

2020 ◽  
Vol 50 (5) ◽  
pp. 272-286
Author(s):  
Zhiwei (Tony) Qin ◽  
Xiaocheng Tang ◽  
Yan Jiao ◽  
Fan Zhang ◽  
Zhe Xu ◽  
...  

Order dispatching is instrumental to the marketplace engine of a large-scale ride-hailing platform such as DiDi, which continuously matches passenger trip requests to drivers at a scale of tens of millions per day. Because of the dynamic and stochastic nature of supply and demand in this context, the ride-hailing order-dispatching problem is challenging to solve optimally. Added to the complexity are considerations of system response time, reliability, and multiple objectives. In this paper, we describe how our approach to this optimization problem has evolved from a combinatorial optimization approach to one that encompasses a semi-Markov decision-process model and deep reinforcement learning. We discuss the practical considerations of our solution development and its real-world impact on the business.
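As a rough illustration of the combinatorial-optimization stage mentioned above, driver–order matching can be posed as a bipartite assignment problem. The sketch below brute-forces a tiny instance; the `dispatch` helper and score matrix are invented for illustration, and a production dispatcher would use a dedicated matching solver rather than enumeration:

```python
from itertools import permutations

def dispatch(score):
    """Brute-force optimal driver-order matching for a tiny score matrix.

    score[d][o] is the estimated value of assigning driver d to order o.
    Returns, for each driver, the order index that maximizes total score.
    Enumeration is O(n!) and only viable for toy instances.
    """
    n = len(score)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(score[d][perm[d]] for d in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(best_perm)

# Three drivers, three trip requests (scores are made up).
scores = [[3.0, 1.0, 2.0],
          [1.0, 0.5, 4.0],
          [2.0, 3.0, 1.0]]
assignment = dispatch(scores)  # driver d serves order assignment[d]
```

The reinforcement-learning refinement described in the paper would replace the static scores with learned long-term value estimates.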

2021 ◽  
Author(s):  
Stav Belogolovsky ◽  
Philip Korsunsky ◽  
Shie Mannor ◽  
Chen Tessler ◽  
Tom Zahavy

We consider the task of Inverse Reinforcement Learning in Contextual Markov Decision Processes (MDPs). In this setting, contexts, which define the reward and transition kernel, are sampled from a distribution. In addition, although the reward is a function of the context, it is not provided to the agent. Instead, the agent observes demonstrations from an optimal policy. The goal is to learn the reward mapping such that the agent will act optimally even when encountering previously unseen contexts, also known as zero-shot transfer. We formulate this problem as a non-differentiable convex optimization problem and propose a novel algorithm to compute its subgradients. Based on this scheme, we analyze several methods both theoretically, where we compare the sample complexity and scalability, and empirically. Most importantly, we show both theoretically and empirically that our algorithms perform zero-shot transfer (generalize to new and unseen contexts). Specifically, we present empirical experiments in a dynamic treatment regime, where the goal is to learn a reward function that explains the behavior of expert physicians based on recorded data of them treating patients diagnosed with sepsis.
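The subgradients computed by the scheme above would plug into the standard subgradient method for non-differentiable convex objectives. The sketch below shows that generic scheme on a toy one-dimensional objective; it is not the paper's algorithm, and the objective is an assumption for illustration:

```python
def subgradient_method(f, subgrad, x0, steps=500):
    """Plain subgradient method for a convex, possibly non-differentiable
    objective, with a diminishing step size. The best iterate is tracked
    because subgradient steps are not monotone in f."""
    x = list(x0)
    best_x, best_f = list(x0), f(x0)
    for k in range(steps):
        g = subgrad(x)
        t = 0.1 / (k + 1) ** 0.5  # diminishing step size
        x = [xi - t * gi for xi, gi in zip(x, g)]
        fx = f(x)
        if fx < best_f:
            best_f, best_x = fx, list(x)
    return best_x, best_f

# Toy objective: f(x) = |x - 3|, non-differentiable at its minimizer.
f = lambda x: abs(x[0] - 3.0)
subgrad = lambda x: [1.0 if x[0] > 3.0 else -1.0]
x_best, f_best = subgradient_method(f, subgrad, [0.0])
```

With the diminishing step size, the best iterate approaches the minimizer even though the objective has a kink there.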


Author(s):  
Ruiyang Song ◽  
Kuang Xu

We propose and analyze a temporal concatenation heuristic for solving large-scale finite-horizon Markov decision processes (MDPs), which divides the MDP into smaller sub-problems along the time horizon and generates an overall solution by simply concatenating the optimal solutions from these sub-problems. As a “black box” architecture, temporal concatenation works with a wide range of existing MDP algorithms. Our main results characterize the regret of temporal concatenation compared to the optimal solution. We provide upper bounds for general MDP instances, as well as a family of MDP instances in which the upper bounds are shown to be tight. Together, our results demonstrate temporal concatenation's potential for substantial speed-up at the expense of some performance degradation.
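The heuristic is easy to sketch for a tiny finite-horizon MDP. The backward-induction solver and the toy two-state MDP below are invented for illustration, not the paper's setup:

```python
def solve_finite_horizon(P, R, H):
    """Backward induction for a finite-horizon MDP.
    P[a][s][t]: transition probability s -> t under action a.
    R[a][s]: immediate reward for action a in state s.
    Returns (policy, V0): per-step greedy policies and the value at t = 0."""
    n_states, n_actions = len(R[0]), len(R)
    V = [0.0] * n_states
    policy = []
    for _ in range(H):
        new_v, pi = [], []
        for s in range(n_states):
            q = [R[a][s] + sum(P[a][s][t] * V[t] for t in range(n_states))
                 for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            pi.append(best)
            new_v.append(q[best])
        V = new_v
        policy.insert(0, pi)  # the last-computed policy is for the earliest step
    return policy, V

def temporal_concatenation(P, R, H, pieces):
    """Split horizon H into equal segments, solve each sub-MDP independently,
    and concatenate the per-step policies. With time-homogeneous P and R the
    sub-problems coincide, which keeps this toy example short."""
    seg = H // pieces
    full_policy = []
    for _ in range(pieces):
        pi, _ = solve_finite_horizon(P, R, seg)
        full_policy.extend(pi)
    return full_policy

# Two states; action 0 stays, action 1 switches states.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
# Staying in state 1 pays 1; switching out of state 0 pays 0.5.
R = [[0.0, 1.0],
     [0.5, 0.0]]
full_policy = temporal_concatenation(P, R, H=4, pieces=2)
```

The regret analyzed in the paper comes from the segments ignoring the value accumulated beyond their own sub-horizon; in this toy instance the concatenated policy happens to coincide with the optimal one.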


2016 ◽  
Vol 19 (02) ◽  
pp. 239-252 ◽  
Author(s):  
Morteza Haghighat Sefat ◽  
Khafiz M. Muradov ◽  
Ahmed H. Elsheikh ◽  
David R. Davies

Summary The popularity of intelligent wells (I-wells), which provide layer-by-layer monitoring and control of production and injection, is growing. However, the number of available techniques for optimal control of I-wells is limited (Sarma et al. 2006; Alghareeb et al. 2009; Almeida et al. 2010; Grebenkin and Davies 2012). Currently, most I-wells equipped with interval control valves (ICVs) are operated to enhance current production and to resolve problems associated with breakthrough of the unfavorable phase. This reactive strategy is unlikely to deliver the long-term optimum production. On the other hand, the proactive-control strategy for I-wells, with its ambition to provide optimum control over the well's entire production life, has the potential to maximize cumulative oil production. This strategy, however, results in a high-dimensional, nonlinear, and constrained optimization problem. This study provides guidelines on selecting a suitable proactive optimization approach, by use of state-of-the-art stochastic gradient-approximation algorithms. A suitable optimization approach increases the practicality of proactive optimization for real field models under uncertain operational and subsurface conditions. We evaluate the simultaneous-perturbation stochastic approximation (SPSA) method (Spall 1992) and the ensemble-based optimization (EnOpt) method (Chen et al. 2009). In addition, we present a new derivation of EnOpt by use of the concept of directional derivatives. The numerical results show that both the SPSA and EnOpt methods can provide a fast solution to a large-scale, multiple-I-well proactive optimization problem. A criterion for tuning the algorithms is proposed, and the performance of both methods is compared for several test cases. The gradient-estimation methodology is shown to affect the application area of each algorithm.
SPSA provides a rough estimate of the gradient and performs better in search environments characterized by several local optima, especially with a large ensemble size. EnOpt was found to provide a smoother estimate of the gradient, resulting in an algorithm that is more robust to the choice of tuning parameters and performs better with a small ensemble size. Moreover, the final optimum operation obtained by EnOpt is smoother. Finally, the obtained criteria are used to perform proactive optimization of ICVs in a real field.
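The SPSA estimator (Spall 1992) referenced above is simple to sketch: all coordinates are perturbed simultaneously with random ±1 signs, so a full gradient estimate costs only two objective evaluations regardless of dimension. The toy quadratic below is an assumption for illustration, not a reservoir model:

```python
import random

def spsa_gradient(f, x, c=0.05):
    """One simultaneous-perturbation gradient estimate (Spall 1992).
    All coordinates are perturbed at once with random +/-1 signs, so the
    estimate costs two evaluations of f regardless of dimension."""
    delta = [random.choice((-1.0, 1.0)) for _ in x]
    up = [xi + c * di for xi, di in zip(x, delta)]
    down = [xi - c * di for xi, di in zip(x, delta)]
    diff = (f(up) - f(down)) / (2.0 * c)
    return [diff / di for di in delta]

# Drive a toy quadratic toward its minimum with SPSA-based descent.
random.seed(0)
f = lambda x: sum(xi * xi for xi in x)
x = [2.0, -1.5, 0.5]
for _ in range(200):
    g = spsa_gradient(f, x)
    x = [xi - 0.05 * gi for xi, gi in zip(x, g)]
final_loss = f(x)  # near zero
```

The estimate is noisy step-to-step but unbiased in expectation, which is the "rough gradient" behavior the abstract contrasts with EnOpt's smoother ensemble average.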


Energies ◽  
2020 ◽  
Vol 13 (8) ◽  
pp. 1959
Author(s):  
Delaram Azari ◽  
Shahab Shariat Torbaghan ◽  
Hans Cappon ◽  
Karel J. Keesman ◽  
Madeleine Gibescu ◽  
...  

The large-scale integration of intermittent distributed energy resources has led to increased uncertainty in the planning and operation of distribution networks. The optimal flexibility dispatch is a recently introduced, power-flow-based method that a distribution system operator can use to effectively determine the amount of flexibility it needs to procure from the controllable resources available on the demand side. However, the drawback of this method is that the optimal flexibility dispatch is inexact because of the relaxation error inherent in the second-order cone formulation. In this paper, we propose a novel bi-level optimization problem, where the upper-level problem seeks to minimize the relaxation error and the lower level solves the earlier introduced convex second-order cone optimal flexibility dispatch (SOC-OFD) problem. To make the problem tractable, we introduce an innovative reformulation to recast the bi-level problem as a non-linear, single-level optimization problem, which results in no loss of accuracy. We subsequently investigate the sensitivity of the optimal flexibility schedules and the locational flexibility prices with respect to uncertainty in load forecasts and flexibility ranges of the demand-response providers, which are input parameters to the problem. The sensitivity analysis is performed based on the perturbed Karush–Kuhn–Tucker (KKT) conditions. We investigate the feasibility and scalability of the proposed method in three case studies of standardized 9-bus, 30-bus, and 300-bus test systems. Simulation results in terms of local flexibility prices are interpreted in economic terms and show the effectiveness of the proposed approach.


2010 ◽  
Vol 44-47 ◽  
pp. 3611-3615 ◽  
Author(s):  
Zhi Cong Zhang ◽  
Kai Shun Hu ◽  
Hui Yu Huang ◽  
Shuai Li ◽  
Shao Yong Zhao

Reinforcement learning (RL) is a state- or action-value-based machine learning method that approximately solves large-scale Markov Decision Processes (MDPs) or Semi-Markov Decision Processes (SMDPs). A multi-step RL algorithm called Sarsa(λ, k) is proposed, which is a compromise between Sarsa and Sarsa(λ): it is equivalent to Sarsa if k is 1 and equivalent to Sarsa(λ) if k is infinite. Sarsa(λ, k) adjusts its performance through the setting of k. Two forms of Sarsa(λ, k), the forward view and the backward view, are constructed and proved equivalent in off-line updating.
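One plausible reading of the backward view, sketched in tabular form: eligibility traces are dropped once they are k steps old, so k = 1 degenerates to one-step Sarsa while unbounded k recovers Sarsa(λ). The environment interface and toy chain task are assumptions for illustration, not the paper's formulation:

```python
import random

def sarsa_lambda_k(env_step, n_states, n_actions, episodes, k,
                   alpha=0.1, gamma=0.95, lam=0.9, eps=0.1):
    """Backward-view Sarsa with eligibility traces truncated after k steps.
    k = 1 degenerates to one-step Sarsa; unbounded k recovers Sarsa(lambda).
    env_step(s, a) must return (reward, next_state, done)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]

    def eps_greedy(s):
        if random.random() < eps:
            return random.randrange(n_actions)
        return max(range(n_actions), key=lambda a: Q[s][a])

    for _ in range(episodes):
        s, a = 0, eps_greedy(0)
        traces = {}  # (state, action) -> [eligibility, age in steps]
        done = False
        while not done:
            r, s2, done = env_step(s, a)
            a2 = eps_greedy(s2)
            delta = r + (0.0 if done else gamma * Q[s2][a2]) - Q[s][a]
            traces[(s, a)] = [1.0, 0]
            for (ts, ta), (e, age) in list(traces.items()):
                Q[ts][ta] += alpha * delta * e
                if age + 1 >= k:
                    del traces[(ts, ta)]  # trace expires after k steps
                else:
                    traces[(ts, ta)] = [e * gamma * lam, age + 1]
            s, a = s2, a2
    return Q

# Toy 3-state chain: action 1 moves right, action 0 moves left;
# reaching state 2 ends the episode with reward 1.
def chain_step(s, a):
    s2 = s + 1 if a == 1 else max(s - 1, 0)
    done = s2 == 2
    return (1.0 if done else 0.0), s2, done

random.seed(1)
Q = sarsa_lambda_k(chain_step, n_states=3, n_actions=2, episodes=300, k=3)
```

After training, moving right is valued above moving left in every state, as expected for the chain task.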


2021 ◽  
Author(s):  
Ibrahim Elgendy ◽  
Ammar Muthanna ◽  
Mohammad Hammoudeh ◽  
Hadil Ahmed Shaiba ◽  
Devrim Unal ◽  
...  

The Internet of Things (IoT) is permeating our daily lives, providing data-collection tools and important measurements to inform our decisions. IoT devices continually generate massive amounts of data and exchange essential messages over networks for further analysis. The promise of low communication latency, enhanced security, and efficient bandwidth utilization is driving the shift from Mobile Cloud Computing (MCC) toward Mobile Edge Computing (MEC). In this study, we propose an advanced deep-reinforcement resource-allocation and security-aware data-offloading model that considers the computation and radio resources of industrial IoT devices to guarantee that resources shared between multiple users are utilized efficiently. The model is formulated as an optimization problem with the goal of decreasing energy consumption and computation delay. This type of problem is NP-hard due to the curse-of-dimensionality challenge; thus, a deep-learning optimization approach is presented to find an optimal solution. Additionally, an AES-based cryptographic approach is implemented as a security layer to satisfy data-security requirements. Experimental evaluation results show that the proposed model can reduce offloading overhead by up to 13.2% and 64.7% in comparison with full offloading and local execution, respectively, while scaling well to large numbers of devices.


2005 ◽  
Vol 24 ◽  
pp. 81-108 ◽  
Author(s):  
P. Geibel ◽  
F. Wysotzki

In this paper, we consider Markov Decision Processes (MDPs) with error states, i.e., states that are undesirable or dangerous to enter. We define the risk with respect to a policy as the probability of entering such a state when the policy is pursued. We consider the problem of finding good policies whose risk is smaller than some user-specified threshold, and formalize it as a constrained MDP with two criteria. The first criterion corresponds to the value function originally given. We show that the risk can be formulated as a second criterion function based on a cumulative return, whose definition is independent of the original value function. We present a model-free, heuristic reinforcement learning algorithm that aims at finding good deterministic policies. It is based on weighting the original value function and the risk. The weight parameter is adapted in order to find a feasible solution for the constrained problem that has a good performance with respect to the value function. The algorithm was successfully applied to the control of a feed tank with stochastic inflows that lies upstream of a distillation column. This control task was originally formulated as an optimal control problem with chance constraints, and it was solved under certain assumptions on the model to obtain an optimal solution. The power of our learning algorithm is that it can be used even when some of these restrictive assumptions are relaxed.
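The risk criterion described above, the probability of ever entering an error state, can be expressed as a cumulative 0/1 return and estimated by Monte Carlo rollouts. The interface and toy task below are assumptions for illustration, not the feed-tank application:

```python
import random

def estimate_value_and_risk(policy, step, start, episodes=2000):
    """Monte Carlo estimates of a policy's (undiscounted) value and risk,
    where risk is the probability of ever entering an error state, i.e. a
    cumulative 0/1 return. step(s, a) returns
    (reward, next_state, done, is_error)."""
    total_return, error_hits = 0.0, 0
    for _ in range(episodes):
        s, ep_return, entered_error, done = start, 0.0, False, False
        while not done:
            r, s, done, is_error = step(s, policy[s])
            ep_return += r
            entered_error = entered_error or is_error
        total_return += ep_return
        error_hits += entered_error
    return total_return / episodes, error_hits / episodes

# Toy task: action 1 pays 1 but enters an error state with probability 0.3;
# action 0 safely pays 0.5. The risky policy below should show risk ~0.3.
def step(s, a):
    if a == 0:
        return 0.5, 0, True, False
    if random.random() < 0.3:
        return 0.0, 0, True, True
    return 1.0, 0, True, False

random.seed(0)
value, risk = estimate_value_and_risk(policy=[1], step=step, start=0)
```

The paper's algorithm would then weight these two criteria and adapt the weight until the risk estimate falls below the user-specified threshold.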


2004 ◽  
Vol 31 (3-4) ◽  
pp. 361-394 ◽  
Author(s):  
M. Papadrakakis ◽  
N.D. Lagaros ◽  
V. Plevris

In engineering problems, randomness and uncertainty are inherent, and the scatter of structural parameters from their nominal ideal values is unavoidable. In Reliability-Based Design Optimization (RBDO) and Robust Design Optimization (RDO), the uncertainties play a dominant role in the formulation of the structural optimization problem. In an RBDO problem, additional non-deterministic constraint functions are considered, while an RDO formulation leads to designs with a state of robustness, so that their performance is least sensitive to the variability of the uncertain variables. In the first part of this study, a metamodel-assisted RBDO methodology is examined for large-scale structural systems. In the second part, an RDO structural problem is considered. The task of robust design optimization of structures is formulated as a multi-criteria optimization problem, in which the design variables of the optimization problem, together with other design parameters such as the modulus of elasticity and the yield stress, are considered random variables with mean values equal to their nominal values.


2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Wenkai Li ◽  
Chenyang Wang ◽  
Ding Li ◽  
Bin Hu ◽  
Xiaofei Wang ◽  
...  

Edge caching is a promising method to deal with the traffic explosion problem in future networks. To satisfy user requests, contents can be proactively cached locally in proximity to users (e.g., at base stations or user devices). Recently, some learning-based edge-caching optimizations have been discussed. However, most previous studies explore the influence of a dynamically and constantly expanding action and caching space, leading to impracticality and low efficiency. In this paper, we study the edge-caching optimization problem by utilizing the Double Deep Q-network (Double DQN) learning framework to maximize the hit rate of user requests. First, we obtain a Device-to-Device (D2D) sharing model by considering both online and offline factors, and we formulate the optimization problem, which is proved to be NP-hard. The edge-caching replacement problem is then modeled as a Markov decision process (MDP). Finally, an edge-caching strategy based on Double DQN is proposed. Experimental results based on large-scale real-world traces show the effectiveness of the proposed framework.
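The Double DQN update referenced above decouples action selection (online network) from action evaluation (target network). A minimal sketch of that bootstrap target, with plain Q-tables standing in for the networks (the tables and values are made up):

```python
def double_dqn_target(q_online, q_target, reward, next_state, done, gamma=0.99):
    """Double DQN bootstrap target: the online network selects the best next
    action, the target network evaluates it. Vanilla DQN instead takes the
    max over the target network, which biases value estimates upward."""
    if done:
        return reward
    n_actions = len(q_online[next_state])
    a_star = max(range(n_actions), key=lambda a: q_online[next_state][a])
    return reward + gamma * q_target[next_state][a_star]

# Plain Q-tables stand in for the two networks (values are made up).
q_online = [[0.2, 0.9]]   # online net prefers action 1 in state 0
q_target = [[0.5, 0.1]]   # target net scores that action lower
target = double_dqn_target(q_online, q_target, reward=1.0, next_state=0, done=False)
# 1.0 + 0.99 * q_target[0][1] = 1.099, versus 1.495 for the vanilla DQN max.
```

In the caching setting, states would encode cache contents and request statistics, and actions would be replacement decisions; this sketch only shows the target computation that distinguishes Double DQN from DQN.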

