Simulator Pre-Screening of Underprepared Drivers Prior to Licensing On-Road Examination: Clustering of Virtual Driving Test Time Series Data (Preprint)

2019 ◽  
Author(s):  
David Grethlein ◽  
Flaura Koplin Winston ◽  
Elizabeth Walshe ◽  
Sean Tanner ◽  
Venk Kandadai ◽  
...  

BACKGROUND A large Midwestern state commissioned a virtual driving test (VDT) to assess driving skills preparedness before the on-road examination (ORE). Since July 2017, a pilot deployment of the VDT in state licensing centers (VDT pilot) has collected both VDT and ORE data from new license applicants with the aim of creating a scoring algorithm that could predict those who were underprepared. OBJECTIVE Leveraging data collected from the VDT pilot, this study aimed to develop and conduct an initial evaluation of a novel machine learning (ML)–based classifier using limited domain knowledge and minimal feature engineering to reliably predict applicant pass/fail on the ORE. Such methods, if proven useful, could be applicable to the classification of other time series data collected within medical and other settings. METHODS We analyzed an initial dataset that comprised 4308 drivers who completed both the VDT and the ORE, in which 1096 (25.4%) drivers went on to fail the ORE. We studied 2 different approaches to constructing feature sets to use as input to ML algorithms: the standard method of reducing the time series data to a set of manually defined variables that summarize driving behavior and a novel approach using time series clustering. We then fed these representations into different ML algorithms to compare their ability to predict a driver’s ORE outcome (pass/fail). RESULTS The new method using time series clustering performed similarly compared with the standard method in terms of overall accuracy for predicting pass or fail outcome (76.1% vs 76.2%) and area under the curve (0.656 vs 0.682). However, the time series clustering slightly outperformed the standard method in differentially predicting failure on the ORE. The novel clustering method yielded a risk ratio for failure of 3.07 (95% CI 2.75-3.43), whereas the standard variables method yielded a risk ratio for failure of 2.68 (95% CI 2.41-2.99). 
In addition, the time series clustering method with logistic regression produced the lowest ratio of false alarms (those who were predicted to fail but went on to pass the ORE; 27.2%). CONCLUSIONS Our results provide initial evidence that the clustering method is useful for feature construction in classification tasks involving time series data when resources are limited to create multiple, domain-relevant variables.
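The clustering-based feature construction the abstract describes can be sketched in outline. This is an illustration of the general technique, not the authors' exact pipeline: pool fixed-length subsequences of a driving signal across drivers, cluster them with k-means, and represent each driver by a normalized histogram of cluster memberships fed to a classifier. All window sizes, cluster counts, and the toy data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def windows(series, width=20, step=10):
    """Slice one driver's signal into fixed-length subsequences."""
    return [series[i:i + width] for i in range(0, len(series) - width + 1, step)]

# Toy data: 100 drivers, each a 200-sample trace, with pass/fail labels.
drivers = [rng.normal(size=200).cumsum() for _ in range(100)]
labels = rng.integers(0, 2, size=100)

# 1) Pool subsequences from all drivers and cluster them.
pool = np.array([w for s in drivers for w in windows(s)])
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(pool)

# 2) Represent each driver as a normalized histogram of cluster memberships.
def featurize(series):
    counts = np.bincount(km.predict(np.array(windows(series))), minlength=8)
    return counts / counts.sum()

X = np.array([featurize(s) for s in drivers])

# 3) Feed the cluster-membership features to a classifier (pass/fail).
clf = LogisticRegression().fit(X, labels)
print(X.shape)  # (100, 8): one 8-bin histogram per driver
```

The appeal of this construction, as the abstract notes, is that the feature set requires no manually defined, domain-specific summary variables.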

10.2196/13995 ◽  
2020 ◽  
Vol 22 (6) ◽  
pp. e13995


2020 ◽  
Vol 7 (1) ◽  
Author(s):  
Ari Wibisono ◽  
Petrus Mursanto ◽  
Jihan Adibah ◽  
Wendy D. W. T. Bayu ◽  
May Iffah Rizki ◽  
...  

Abstract Real-time information mining of a big dataset consisting of time series data is a very challenging task. For this purpose, we propose using the mean distance and the standard deviation to enhance the accuracy of the existing fast incremental model tree with drift detection (FIMT-DD) algorithm. The standard FIMT-DD algorithm uses the Hoeffding bound as its splitting criterion. We propose additionally using the mean distance and the standard deviation, which split the tree more accurately than the standard method. We verify our proposed method on the large Traffic Demand dataset, which consists of 4,000,000 instances; TenneT's big wind power plant dataset, which consists of 435,268 instances; and a road weather dataset, which consists of 30,000,000 instances. The results show that our proposed FIMT-DD algorithm improves accuracy compared with the standard method and the Chernoff bound approach. The measured errors demonstrate that our approach yields a lower Mean Absolute Percentage Error (MAPE) at every stage of learning, by approximately 2.49% compared with the Chernoff bound method and 19.65% compared with the standard method.
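The Hoeffding bound that standard FIMT-DD uses as its splitting criterion has a simple closed form: with probability 1 − δ, the observed mean of n samples of a variable with range R lies within ε = √(R² ln(1/δ) / 2n) of the true mean. A minimal sketch of how it gates a split decision (function names and default parameters are illustrative, not the paper's implementation):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the observed mean of n samples of a
    variable with range `value_range` is within epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_gain, second_gain, value_range=1.0, delta=1e-7, n=200):
    """Split a leaf when the best candidate split beats the runner-up by more
    than the bound, i.e. the ranking is unlikely to change with more data."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)

# A clear winner triggers a split; a near-tie waits for more instances.
print(should_split(0.9, 0.3))  # True
print(should_split(0.5, 0.4))  # False
```

The paper's contribution, per the abstract, is to supplement this criterion with the mean distance and standard deviation of the arriving data; that extension is not sketched here.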


2014 ◽  
Vol 2014 ◽  
pp. 1-19 ◽  
Author(s):  
Seyedjamal Zolhavarieh ◽  
Saeed Aghabozorgi ◽  
Ying Wah Teh

Clustering of subsequence time series remains an open issue in time series clustering. Subsequence time series clustering is used in different fields, such as e-commerce, outlier detection, speech recognition, biological systems, DNA recognition, and text mining. One of its useful application areas is pattern recognition, which operates on sequences of time series data. This paper reviews definitions and background related to subsequence time series clustering. The literature is categorized into three periods: preproof, interproof, and postproof. Various state-of-the-art approaches to subsequence time series clustering are discussed under each of these categories. The strengths and weaknesses of the employed methods are evaluated as potential issues for future studies.
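Subsequence time series clustering, as surveyed above, starts by sliding a fixed-width window over a longer series to extract the subsequences that are then clustered. A minimal sketch of that extraction step (window width and step are illustrative parameters):

```python
import numpy as np

def subsequences(series, width, step=1):
    """All fixed-width subsequences of a 1-D series via a sliding window."""
    series = np.asarray(series)
    starts = range(0, len(series) - width + 1, step)
    return np.array([series[i:i + width] for i in starts])

# Width-3 windows over a length-5 series yield three overlapping subsequences:
subs = subsequences([1, 2, 3, 4, 5], width=3)
print(subs.shape)  # (3, 3): [1 2 3], [2 3 4], [3 4 5]
```

With step=1 adjacent windows overlap heavily, which is precisely what makes the downstream clustering problem subtle and, per the review, still open.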


2011 ◽  
Vol 38 (9) ◽  
pp. 11891-11900 ◽  
Author(s):  
Xiaohang Zhang ◽  
Jiaqi Liu ◽  
Yu Du ◽  
Tingjie Lv

Author(s):  
Pēteris Grabusts ◽  
Arkady Borisov

Clustering Methodology for Time Series Mining

A time series is a sequence of real-valued data representing measurements of a variable at time intervals. Time series analysis is a well-established task; in recent years, however, research has explored using clustering for the purposes of time series analysis. The main motivation for representing a time series in the form of clusters is to better capture the main characteristics of the data. The central goal of this paper was to investigate a clustering methodology for time series data mining, to explore time series similarity measures, and to use them in analyzing time series clustering results. More sophisticated similarity measures include the Longest Common Subsequence (LCSS) method. Two tasks were completed in this paper. The first was to define time series similarity measures; it was established that the LCSS method detects time series similarity better than the Euclidean distance. The second was to explore the classical k-means clustering algorithm for time series clustering. The experiment led to the conclusion that the results of time series clustering with the k-means algorithm correspond to those obtained with the LCSS method, and thus the clustering results for the specific time series are adequate.
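The LCSS measure mentioned above counts the longest run of approximately matching points shared by two series. A minimal dynamic-programming sketch, with the matching tolerance eps as an illustrative parameter (the paper's exact parameterization is not given in this summary):

```python
def lcss(a, b, eps=0.5):
    """Longest Common Subsequence similarity for real-valued series: two
    points match when they differ by at most eps. Returns the match count
    normalized by the shorter series' length (1.0 = fully similar)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n] / min(m, n)

# Two of three points match within eps, so similarity is 2/3.
print(lcss([1.0, 2.0, 3.0], [1.1, 2.2, 9.0]))  # 0.666...
```

Unlike the Euclidean distance, LCSS tolerates gaps, outliers, and series of unequal length, which is consistent with the paper's finding that it detects time series similarity better.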


Kybernetes ◽  
2019 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Hossein Abbasimehr ◽  
Mostafa Shabani

Purpose The purpose of this paper is to propose a new methodology that handles the issue of the dynamic behavior of customers over time. Design/methodology/approach A new methodology is presented based on time series clustering to extract dominant behavioral patterns of customers over time. This methodology is implemented using bank customers' transaction data, which are in the form of time series. The data comprise the recency (R), frequency (F), and monetary (M) attributes of businesses that use the bank's point-of-sale (POS) devices, and were obtained from the bank's data analysis department. Findings After an empirical study on the acquired transaction data of 2,531 business customers using the bank's POS devices, the dominant trends of behavior were discovered with the proposed methodology and analyzed from a marketing viewpoint. Based on the analysis of the monetary attribute, customers were divided into four main segments: high-value growing customers, middle-value growing customers, customers prone to churn, and churners. For each resulting group of customers with a distinctive trend, effective and practical marketing recommendations were devised to improve the bank's relationship with that group. The prone-to-churn segment contains most of the customers; therefore, the bank should offer attractive promotions to retain this segment. Practical implications The discovered trends of customer behavior and proposed marketing recommendations can help banks devise segment-specific marketing strategies, as they illustrate the dynamic behavior of customers over time. The obtained trends are visualized so that they can be easily interpreted and used by banks. This paper contributes to the literature on customer relationship management (CRM), as the proposed methodology can be applied to different businesses to reveal trends in customer behavior. 
Originality/value In the current business condition, customer behavior is changing continually over time and customers are churning due to the reduced switching costs. Therefore, choosing an effective customer segmentation methodology which can consider the dynamic behaviors of customers is essential for every business. This paper proposes a new methodology to capture customer dynamic behavior using time series clustering on time-ordered data. This is an improvement over previous studies, in which static segmentation approaches have often been adopted. To the best of the authors’ knowledge, this is the first study that combines the recency, frequency, and monetary model and time series clustering to reveal trends in customer behavior.
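The combination of RFM attributes with time series clustering can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' exact procedure: build, for each customer, a monthly series of the monetary attribute, z-normalize it so clustering groups series by trend shape rather than scale, and cluster with k-means; each centroid then represents a dominant behavioral trend (e.g. growing vs declining). The data here are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Toy data: for each customer, a 12-month series of the monetary (M)
# attribute, i.e. total transaction value per month, with an upward or
# downward trend plus noise.
n_customers, n_months = 200, 12
trend = np.linspace(0.0, 1.0, n_months)
monetary = np.vstack([
    rng.normal(scale=0.1, size=n_months) + rng.choice([1, -1]) * trend
    for _ in range(n_customers)
])

# Z-normalize each series so clustering compares trend shapes, not amounts.
z = (monetary - monetary.mean(axis=1, keepdims=True)) \
    / monetary.std(axis=1, keepdims=True)

# Cluster the normalized series; each centroid is a dominant trend.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(z)
segments = km.labels_
print(np.bincount(segments))  # customers per behavioral segment
```

Plain k-means with Euclidean distance on z-normalized series is a simplifying assumption; the abstract does not specify the paper's distance measure or clustering algorithm.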

