Simulation Study on the Electricity Data Streams Time Series Clustering

Energies ◽  
2020 ◽  
Vol 13 (4) ◽  
pp. 924 ◽  
Author(s):  
Krzysztof Gajowniczek ◽  
Marcin Bator ◽  
Tomasz Ząbkowski ◽  
Arkadiusz Orłowski ◽  
Chu Kiong Loo

Thanks to the rapid development of wireless sensor networks and network traffic monitoring, the data stream is gradually becoming one of the most popular data generating processes. A data stream differs from traditional static data. Cluster analysis is an important technology for data mining, which is why many researchers pay attention to grouping streaming data. The literature contains many data stream clustering techniques; unfortunately, very few of them try to solve the problem of clustering data streams coming from multiple sources. In this article, we present an algorithm with a tree structure for grouping data streams (in the form of time series) that have similar properties and behaviors. We evaluated our algorithm on real multivariate data streams generated by smart meter sensors, namely the Irish Commission for Energy Regulation data set. Several measures were used to analyze the characteristics of the tree-like clustering structure (the computer science perspective), along with measures that are important from a business standpoint. The proposed method was able to cluster the flows of data and identified the customers with similar behavior during the analyzed period.

2020 ◽  
Vol 8 (4) ◽  
pp. 63-73
Author(s):  
Sikha Bagui ◽  
Katie Jin

This survey enumerates and analyzes existing methods for data stream processing and the challenges facing streaming data. The challenges addressed are preprocessing of streaming data, detecting and dealing with concept drift in streaming data, data reduction in the face of data streams, approximate queries, and blocking operations in streaming data.


2021 ◽  
Author(s):  
Christian Nordahl ◽  
Veselka Boeva ◽  
Håkan Grahn ◽  
Marie Persson Netz

Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5829 ◽  
Author(s):  
Jen-Wei Huang ◽  
Meng-Xun Zhong ◽  
Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on density-based local outlier factor (LOF) algorithms do not consider variations in data that change over time. For example, a new cluster of data points may appear in the data stream over time. Therefore, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means for estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. The results of experiments demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
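For readers unfamiliar with the LOF score that the abstract above builds on, the following is a minimal sketch of the classic static LOF computation on one-dimensional points. It is not TADILOF and not the authors' code: the function name `lof_scores`, the 1-D distance, and the simplified tie handling (exactly `k` neighbours per point) are assumptions made for illustration. It also assumes all points are distinct and that `k` is smaller than the number of points.

```python
from typing import List, Sequence


def lof_scores(points: Sequence[float], k: int) -> List[float]:
    """Classic (static) local outlier factor for distinct 1-D points.

    Illustrative only: ties at the k-distance are ignored, so every
    point gets exactly k neighbours.
    """
    n = len(points)
    neighbours: List[List[int]] = []
    k_dist: List[float] = []
    for i in range(n):
        # Indices of all other points, ordered by distance to point i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: abs(points[i] - points[j]),
        )
        neighbours.append(order[:k])
        k_dist.append(abs(points[i] - points[order[k - 1]]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd: List[float] = []
    for i in range(n):
        reach = [
            max(k_dist[j], abs(points[i] - points[j])) for j in neighbours[i]
        ]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


scores = lof_scores([1.0, 1.1, 1.2, 1.3, 5.0], k=2)
```

Scores close to 1 indicate a point whose density matches that of its neighbours; the isolated point at 5.0 receives a much larger score than the four clustered points.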


2021 ◽  
Vol 15 (02) ◽  
pp. 33-41
Author(s):  
Wendy Osborn

In this paper, the problem of query processing in spatial data streams is explored, with a focus on the spatial join operation. Although the spatial join has been utilized in many proposed centralized and distributed query processing strategies, for its application to spatial data streams the spatial join operation has received very little attention. One identified limitation with existing strategies is that a bounded region of space (i.e., spatial extent) from which the spatial objects are generated needs to be known in advance. However, this information may not be available. Therefore, two strategies for spatial data stream join processing are proposed where the spatial extent of the spatial object stream is not required to be known in advance. Both strategies estimate the common region that is shared by two or more spatial data streams in order to process the spatial join. An evaluation of both strategies includes a comparison with a recently proposed approach in which the spatial extent of the data set is known. Experimental results show that one of the strategies performs very well at estimating the common region of space using only incoming objects on the spatial data streams. Other limitations of this work are also identified.
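One simple way to picture estimating a common region from incoming objects alone is to keep a running bounding box per stream and intersect the boxes. The sketch below is only an illustration of that idea under stated assumptions; it is not either of the two strategies proposed in the paper. The names `Box`, `extend`, and `common_region`, and the use of axis-aligned rectangles, are hypothetical.

```python
from typing import Optional, Tuple

# Axis-aligned rectangle: (min_x, min_y, max_x, max_y). Hypothetical type.
Box = Tuple[float, float, float, float]


def extend(extent: Optional[Box], obj: Box) -> Box:
    """Grow a stream's running extent so that it covers a new object."""
    if extent is None:
        return obj
    return (
        min(extent[0], obj[0]),
        min(extent[1], obj[1]),
        max(extent[2], obj[2]),
        max(extent[3], obj[3]),
    )


def common_region(a: Box, b: Box) -> Optional[Box]:
    """Intersection of two extents, or None when they do not overlap."""
    box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box if box[0] <= box[2] and box[1] <= box[3] else None


extent_a: Optional[Box] = None
for obj in [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 4.0, 4.0)]:
    extent_a = extend(extent_a, obj)

extent_b = extend(None, (3.0, 3.0, 6.0, 6.0))
shared = common_region(extent_a, extent_b)
```

Only objects falling inside `shared` would need to be considered as join candidates in this simplified picture.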


2021 ◽  
Author(s):  
Murugaraj Odiathevar

Anomaly detection is an important aspect of many application domains. It refers to the problem of finding patterns in data that do not conform to expected behaviour. Hence, understanding expected behaviour well is fundamental to performing effective anomaly detection. However, data profiles constantly evolve in certain domains such as computer networks. In other domains such as traffic monitoring and healthcare, data are distributed and are either too large or there are privacy concerns in transmitting them to a central location. These situations make it challenging to obtain an accurate understanding of non-anomalous profiles. Changing profiles undermine existing anomaly detection models and make them less effective. Training a robust model with data from multiple sources is also challenging. Moreover, in real-world scenarios, it is not apparent how an anomaly detection model can be built to address the problem. This thesis focuses on building a robust anomaly detection system where data profiles evolve and/or are distributed. It proposes a novel Online Offline Framework to separate existing expected behaviour, new possible expected behaviour, and anomalies in streaming data. It also addresses the distributed scenario using a theoretically sound, fully Bayesian approach. These methods improve the performance of anomaly detection systems and work well with biased and uneven data partitions. The proposed methods are validated using real-world data in three different domains. This thesis identifies the implementation difficulties in these domains and produces three novel methodologies to address each of the core anomaly detection problems.


Author(s):  
Taegong Kim ◽  
Cheong Hee Park

Anomaly pattern detection in a data stream aims to detect a time point where outliers begin to occur abnormally. Recently, a method for anomaly pattern detection has been proposed based on binary classification for outliers and statistical tests in the data stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, since the anomaly pattern detection method is based on binary classification for outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method in a data stream that transforms real-valued outlier scores into multiple binary-valued data streams. The proposed method is tested on artificial and real data sets using three outlier detection methods: Isolation Forest (IF), Autoencoder-based outlier detection, and Local Outlier Factor (LOF). The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
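The idea of turning one stream of real-valued outlier scores into several binary-valued streams can be sketched as plain thresholding. This is a minimal illustration under assumptions, not the transformation defined in the paper: the function name `binarize_scores`, the choice of fixed thresholds, and the convention that a higher score means more outlying are all hypothetical.

```python
from typing import Dict, List, Sequence


def binarize_scores(
    scores: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, List[int]]:
    """Produce one binary stream per threshold.

    A label of 1 means the score exceeded that threshold (treated as an
    outlier); 0 means it did not (treated as normal).
    """
    return {t: [1 if s > t else 0 for s in scores] for t in thresholds}


streams = binarize_scores([0.1, 0.4, 0.9, 0.7], thresholds=[0.3, 0.6])
```

Each resulting binary stream could then be fed to a statistical test on label frequencies, which is the kind of input the earlier binary-label method expects.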


2013 ◽  
Vol 10 (5) ◽  
pp. 1580-1586
Author(s):  
V.sidda Reddy ◽  
Dr T.V. Rao ◽  
Dr A. Govardhan

Data stream mining algorithms operate under constraints on the space used and the time taken, which arise from the streaming property. The relaxation in these constraints is inversely proportional to the streaming speed of the data. Because caching and mining streaming data are sensitive to these constraints, this paper devises a scalable, memory-efficient caching and frequent itemset mining model. The proposed model is an incremental approach that builds single-level multi-node trees, called bushes, from each window of the streaming data; henceforth we refer to the proposed algorithm as Tree (bush) based Incremental Frequent Itemset Mining (TIFIM) over data streams.
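To make the notion of frequent itemset mining over one window of a stream concrete, here is a minimal brute-force sketch. It does not implement TIFIM or its bush structure; the function name `frequent_itemsets`, the `max_size` cap, and the absolute `min_support` count are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, Sequence, Tuple


def frequent_itemsets(
    window: Iterable[Sequence[str]], min_support: int, max_size: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Count itemsets of up to max_size items within a single window and
    keep those that occur in at least min_support transactions."""
    counts: Counter = Counter()
    for transaction in window:
        items = sorted(set(transaction))  # de-duplicate, canonical order
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: c for itemset, c in counts.items() if c >= min_support}


window = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
result = frequent_itemsets(window, min_support=2)
```

An incremental method would update such counts as windows slide rather than recounting from scratch, which is where a compact per-window structure becomes useful.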


Real-time data is generated everywhere, every day. Most of it comes from IoT sensors, GPS positions, web transactions, and social media updates. Real-time data is typically generated in a continuous fashion; such data are called data streams. Data streams are transient, and there is very little time to process each item in the stream. Performing analytics on rapidly flowing, high-velocity data is a great challenge. Another issue is the percentage of incoming data that is considered for analytics: the higher the percentage, the greater the accuracy. Considering these two issues, the proposed work aims to find a better solution by gaining insight into real-time streaming data with minimum response time and greater accuracy. This paper combines two technologies, TensorFlow and Apache Kafka. Kafka is used to handle the real-time streaming data, while TensorFlow supports analytics with deep learning algorithms. Training and testing are done on RideAustin, a public connected-vehicle data set from Uber. The experimental results on RideAustin show the predicted failure under each type of vehicle parameter. The comparative analysis showed a 16% improvement over the traditional machine learning algorithm.


Author(s):  
Mona Mohamed ◽  
Sahar Ghanem ◽  
Magdy Nagi

Privacy-preserving data publishing has been studied widely on static data. However, many recent applications generate data streams that are real-time, unbounded, rapidly changing, and distributed in nature. Recently, a few works have addressed k-anonymity and l-diversity for data streams. Their model implies that if the stream is distributed, it is collected at a central site for anonymization. In this paper, we propose a novel distributed model where distributed streams are first anonymized by distributed (collecting) sites before merging and releasing. Our approach extends Continuously Anonymizing STreaming data via adaptive cLustEring (CASTLE) [4], a cluster-based approach that provides both k-anonymity and l-diversity for centralized data streams. The main idea is for each site to construct its local clustering model and exchange this local view with other sites to globally construct approximately the same clustering view. The approach is heuristic in the sense that not every update to the local view is sent; instead, triggering events are selected for exchanging cluster information. Extensive experiments on a real data set are performed to study the introduced Information Loss (IL) under different settings. First, the impact of the different parameters on IL is quantified. Then k-anonymity and l-diversity are compared in terms of messaging cost and IL. Finally, the effectiveness of the proposed distributed model is studied by comparing the introduced IL to the IL of the centralized model (as a lower bound) and to a distributed model with no communication (as an upper bound). The experimental results show that the main contributing factor to IL is the number of attributes in the quasi-identifier (50%-75%), while the number of sites contributes about 1%, which demonstrates the scalability of the proposed approach. In addition, providing l-diversity is shown to introduce about a 25% increase in IL compared to k-anonymity. Moreover, a 35% reduction in IL is achieved at a messaging cost (in bytes) of about 0.3% of the data set size.
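The two privacy criteria compared above can be stated as simple checks on groups of records that share the same quasi-identifier values. The sketch below checks k-anonymity (every group has at least k records) and distinct l-diversity (every group has at least l distinct sensitive values) on a static batch. It is not CASTLE or the proposed distributed model; the function name, the record layout, and the attribute names in the example are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def check_k_anonymity_l_diversity(
    records: Sequence[Dict[str, str]],
    quasi_identifier: Sequence[str],
    sensitive: str,
    k: int,
    l: int,
) -> Tuple[bool, bool]:
    """Return (k_anonymous, l_diverse) for already-generalized records."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for record in records:
        key = tuple(record[attr] for attr in quasi_identifier)
        groups[key].append(record[sensitive])
    # k-anonymity: each quasi-identifier group holds at least k records.
    k_ok = all(len(values) >= k for values in groups.values())
    # Distinct l-diversity: each group holds at least l distinct
    # sensitive values.
    l_ok = all(len(set(values)) >= l for values in groups.values())
    return k_ok, l_ok


records = [
    {"age": "30-39", "zip": "123**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "123**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "456**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "456**", "diagnosis": "flu"},
]
k_ok, l_ok = check_k_anonymity_l_diversity(
    records, ["age", "zip"], "diagnosis", k=2, l=2
)
```

Here both groups contain two records, so 2-anonymity holds, but the second group has only one distinct diagnosis, so 2-diversity fails. This illustrates why l-diversity is the stricter requirement and can cost additional information loss.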


Author(s):  
Joshua Plasse ◽  
Henrique Hoeltgebaum ◽  
Niall M. Adams

Sequentially detecting multiple changepoints in a data stream is a challenging task. Difficulties relate to both computational and statistical aspects, and in the latter, specifying control parameters is a particular problem. Choosing control parameters typically relies on unrealistic assumptions, such as the distributions generating the data, and their parameters, being known. This is implausible in the streaming paradigm, where several changepoints will exist. Further, current literature is mostly concerned with streams of continuous-valued observations, and focuses on detecting a single changepoint. There is a dearth of literature dedicated to detecting multiple changepoints in transition matrices, which arise from a sequence of discrete states. This paper makes the following contributions: a complete framework is developed for adaptively and sequentially estimating a Markov transition matrix in the streaming data setting. A change detection method is then developed, using a novel moment matching technique, which can effectively monitor for multiple changepoints in a transition matrix. This adaptive detection and estimation procedure for transition matrices, referred to as ADEPT-M, is compared to several change detectors on synthetic data streams, and is implemented on two real-world data streams: one consisting of over nine million HTTP web requests, and the other being a well-studied electricity market data set.
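Sequential estimation of a transition matrix from a stream of discrete states can be sketched with exponentially discounted transition counts. This is a minimal illustration with a fixed forgetting factor, which is an assumption; it is not ADEPT-M, whose adaptive estimation and moment-matching detector are not reproduced here. The function name and the uniform fallback for states never left are also hypothetical.

```python
from typing import List, Sequence


def estimate_transition_matrix(
    states: Sequence[int], n_states: int, forgetting: float = 0.95
) -> List[List[float]]:
    """Estimate a Markov transition matrix from a state sequence,
    down-weighting older transitions by a fixed forgetting factor."""
    counts = [[0.0] * n_states for _ in range(n_states)]
    for prev, curr in zip(states, states[1:]):
        # Discount past evidence for the row of the state being left,
        # then add the newly observed transition.
        counts[prev] = [forgetting * c for c in counts[prev]]
        counts[prev][curr] += 1.0

    matrix: List[List[float]] = []
    for row in counts:
        total = sum(row)
        if total == 0.0:
            # No transitions observed from this state: fall back to uniform.
            matrix.append([1.0 / n_states] * n_states)
        else:
            matrix.append([c / total for c in row])
    return matrix


P = estimate_transition_matrix([0, 1, 0, 1, 0, 1], n_states=2)
```

A forgetting factor below 1 lets the estimate track a matrix that changes over time, at the cost of higher variance; choosing that factor without knowing the data-generating process is the kind of control-parameter problem the abstract describes.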

