EMM-CLODS: An Effective Microcluster and Minimal Pruning CLustering-Based Technique for Detecting Outliers in Data Streams

Complexity, 2021, Vol 2021, pp. 1-20
Author(s): Mohamed Jaward Bah, Hongzhi Wang, Li-Hui Zhao, Ji Zhang, Jie Xiao

Detecting outliers in data streams is a challenging problem because, in a data stream scenario, scanning the data multiple times is infeasible and the incoming data keep evolving. Over the years, clustering-based methods have been a common approach to outlier detection, but these methods have inherent challenges and drawbacks. These include effectively clustering sparse data points (which depends on the quality of the clustering method), dealing with continuous, fast-incoming data streams, high memory and time consumption, and limited outlier detection accuracy. This paper proposes an effective clustering-based approach to detect outliers in evolving data streams. We propose a new method called Effective Microcluster and Minimal pruning CLustering-based method for Outlier detection in Data Streams (EMM-CLODS). It is a clustering-based outlier detection approach that detects outliers in evolving data streams by first applying a microclustering technique to cluster dense data points and then effectively handling objects within a sliding window according to the relevance of their status to their respective neighbors or position. Our experimental studies on both synthetic and real-world datasets show that the technique performs well, with minimal memory and time consumption compared with the baseline algorithms, making it a very promising technique for outlier detection in data streams.
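
To make the sliding-window setting concrete, the following is a minimal sketch of a distance-threshold check over a count-based window. It is not the EMM-CLODS algorithm: it builds no microclusters and performs no pruning, and the function name `window_outliers` and the parameters `window_size`, `radius`, and `min_neighbors` are assumptions introduced only for illustration. Each point is judged once, on arrival, against the points currently in the window.

```python
import math
from collections import deque


def window_outliers(stream, window_size, radius, min_neighbors):
    """Yield (point, is_outlier) using only the points in the current window.

    Illustrative assumption: a point is flagged when fewer than
    `min_neighbors` other window points lie within `radius` of it.
    """
    window = deque(maxlen=window_size)  # oldest point is evicted automatically
    for point in stream:
        window.append(point)
        # Count points within the radius, then subtract 1 for the point itself.
        close = sum(1 for other in window if math.dist(point, other) <= radius)
        yield point, (close - 1) < min_neighbors
```

A fuller method would also re-evaluate older points as the window slides, since a point's neighbors can expire.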

Sensors, 2020, Vol 20 (20), pp. 5829
Author(s): Jen-Wei Huang, Meng-Xun Zhong, Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on density-based local outlier factor (LOF) algorithms do not consider variations in data that change over time. For example, a new cluster of data points may appear in the data stream over time. To overcome this issue, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF). In addition, we have developed a means of estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. Experimental results demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
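
For readers unfamiliar with the LOF score that this line of work builds on, here is a brute-force sketch of the standard static LOF computation on a small batch. It is not TADILOF and includes neither the time-aware component nor the "approximate LOF" estimate. The function name `lof_scores` is an assumption, and the sketch assumes the batch holds no more than `k` exact duplicates of any point, so no density becomes infinite.

```python
import math


def lof_scores(points, k):
    """Brute-force local outlier factor for a small static batch."""
    n = len(points)
    # For each point, the sorted (distance, index) pairs to every other point.
    neigh = [
        sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        for i in range(n)
    ]
    # k-distance, and the k-neighbourhood including ties at the k-distance.
    k_dist = [neigh[i][k - 1][0] for i in range(n)]
    knn = [[(d, j) for d, j in neigh[i] if d <= k_dist[i]] for i in range(n)]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in knn[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbours' densities to the point's own density.
    return [
        sum(lrd[j] for _, j in knn[i]) / (len(knn[i]) * lrd[i]) for i in range(n)
    ]
```

Scores near 1 indicate a point about as dense as its neighbours, while scores well above 1 indicate a point in a comparatively sparse region.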


2021
Author(s): Christian Nordahl, Veselka Boeva, Håkan Grahn, Marie Persson Netz

Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements cannot be met in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities in clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.


2021
Author(s): Saihua Cai, Jinfu Chen, Haibo Chen, Chi Zhang, Qian Li, ...

Existing association-based outlier detection approaches were proposed to seek potential outliers in the huge full set of uncertain data streams (UDS), but they cannot effectively process the small-scale UDS that satisfy preset constraints and are therefore time-consuming. To solve this problem, this paper proposes a novel minimal rare pattern-based outlier detection approach, namely Constrained Minimal Rare Pattern-based Outlier Detection (CMRP-OD), to discover outliers from small sets of UDS that satisfy the user-preset succinct or convertible monotonic constraints. First, the two concepts of ‘maximal probability’ and ‘support cap’ are proposed to compress the scale of extensible patterns, and a matrix is designed to store the information of each valid pattern so as to reduce the number of scans of the UDS, thus decreasing time consumption. Second, more factors that can influence the determination of outliers are considered in the design of the deviation indices, thus increasing detection accuracy. Extensive experiments show that, compared with state-of-the-art approaches, CMRP-OD improves detection accuracy by at least 10% and reduces time cost by almost half.


2020, Vol 11 (2), pp. 19-36
Author(s): Umesh Kokate, Arviand V. Deshpande, Parikshit N. Mahalle

The evolution of data in the data stream environment generates patterns at different time instances. Cluster formation changes over time because of changes in the behaviour and membership of clusters. Data stream clustering (DSC) allows us to investigate these changes in group behaviour. Changes in the behaviour of group members over time lead to the formation of new clusters and may make old clusters extinct. These extinct clusters may also recur over time. The problem is to identify and record these change patterns in evolving data streams. The knowledge obtained from these change patterns is then used for trend analysis over evolving data streams. To address this flexible clustering requirement, a density-based clustering method is proposed to dynamically cluster evolving data streams. The decay factor identifies the formation of new clusters and the diminishing of older clusters on arrival of data points. This indicates trends in evolving data streams.
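
The role of a decay factor can be sketched as follows. The abstract does not state the decay function, so this example assumes an exponential fading weight of the form 2^(-rate * elapsed time), a form commonly used in density-based stream clustering. The class `DecayedCluster`, the function `classify`, and both thresholds are hypothetical names chosen for illustration, not the authors' implementation.

```python
class DecayedCluster:
    """Hypothetical cluster summary whose weight fades between arrivals."""

    def __init__(self, timestamp, decay_rate):
        self.weight = 1.0              # one point at creation time
        self.last_update = timestamp
        self.decay_rate = decay_rate   # assumed exponential fading rate

    def weight_at(self, now):
        # The weight halves every 1 / decay_rate time units without new points.
        return self.weight * 2.0 ** (-self.decay_rate * (now - self.last_update))

    def absorb(self, now):
        # Fade the stored weight to the current time, then count the new point.
        self.weight = self.weight_at(now) + 1.0
        self.last_update = now


def classify(cluster, now, dense_threshold, fade_threshold):
    """Label a cluster by its decayed weight (thresholds are illustrative)."""
    w = cluster.weight_at(now)
    if w >= dense_threshold:
        return "active"
    if w < fade_threshold:
        return "faded"
    return "candidate"
```

Under this assumption, a cluster that keeps receiving points stays above the dense threshold, while one that stops receiving points drifts toward the faded state and can later recover if points return.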


2015, Vol 2015, pp. 1-19
Author(s): Junchang Xin, Zhiqiong Wang, Mei Bai, Guoren Wang

Reverse skyline queries have been used in many real-world applications such as business planning, market analysis, and environmental monitoring. In this paper, we investigated how to efficiently evaluate continuous reverse skyline queries over sliding windows. We first theoretically analyzed the inherent properties of reverse skyline on data streams and proposed a novel pruning technique to reduce the number of data points preserved for processing continuous reverse skyline queries. Then, an efficient approach, called Semidominance Based Reverse Skyline (SDRS), was proposed to process continuous reverse skyline queries. Moreover, an extension was also proposed to handle n-of-N and (n1, n2)-of-N reverse skyline queries. Our extensive experimental studies have demonstrated the efficiency as well as effectiveness of the proposed approach with various experimental settings.
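
As background, the reverse skyline of a query q is commonly defined as the set of points p for which q belongs to the dynamic skyline of p, that is, no other point is at least as close to p as q in every dimension and strictly closer in at least one. The brute-force sketch below evaluates that definition on a static set. It is not SDRS, applies no pruning, and ignores the sliding window; the function names are assumptions made for illustration.

```python
def dominates_wrt(r, q, p):
    """True if r dynamically dominates q with respect to p."""
    dr = [abs(a - b) for a, b in zip(r, p)]  # per-dimension distance of r to p
    dq = [abs(a - b) for a, b in zip(q, p)]  # per-dimension distance of q to p
    no_worse = all(x <= y for x, y in zip(dr, dq))
    strictly_better = any(x < y for x, y in zip(dr, dq))
    return no_worse and strictly_better


def reverse_skyline(points, q):
    """Points p whose dynamic skyline contains the query q (brute force)."""
    result = []
    for i, p in enumerate(points):
        others = (r for j, r in enumerate(points) if j != i)
        # p qualifies when no other point dominates q as seen from p.
        if not any(dominates_wrt(r, q, p) for r in others):
            result.append(p)
    return result
```

This check costs quadratic time in the number of points, which is the kind of cost that pruning techniques for continuous queries aim to avoid.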


Author(s): Taegong Kim, Cheong Hee Park

Anomaly pattern detection in a data stream aims to detect the time point at which outliers begin to occur abnormally. Recently, a method for anomaly pattern detection was proposed based on binary classification of outliers and statistical tests on the resulting data stream of binary labels (normal or outlier). It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, since the anomaly pattern detection method is based on binary classification of outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method for data streams that transforms real-valued outlier scores into multiple binary-valued data streams. The proposed method is tested on artificial and real data sets using three outlier detection methods: Isolation Forest (IF), autoencoder-based outlier detection, and local outlier factor (LOF). The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
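
The transformation from real-valued scores to several binary streams can be illustrated with a simple thresholding sketch. The abstract does not say how the binary streams are derived, so cutting the scores at a set of fixed thresholds is an assumption, as are the function name `binarize_scores` and the threshold values in the usage note. The statistical test applied to each binary stream is not shown.

```python
def binarize_scores(scores, thresholds):
    """Map one stream of real-valued outlier scores to one 0/1 stream per threshold.

    Assumed rule: a score strictly above the threshold is labelled 1 (outlier).
    """
    return {t: [1 if s > t else 0 for s in scores] for t in thresholds}
```

For example, `binarize_scores([0.1, 0.4, 0.9, 0.6], [0.3, 0.5, 0.8])` produces three label streams of increasing strictness, each of which could then be monitored separately for a change in its outlier rate.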


Author(s): Nguyen Thanh Tam, Matthias Weidlich, Duong Chi Thang, Hongzhi Yin, Nguyen Quoc Viet Hung

Today's social platforms, such as Twitter and Facebook, continuously generate massive volumes of data. The resulting data streams exceed any reasonable limit for permanent storage, especially since data is often redundant, overlapping, sparse, and generally of low value. This calls for means to retain only a small fraction of the data in an online manner. In this paper, we propose techniques to effectively decide which data to retain, such that the induced loss of information, the regret of neglecting certain data, is minimized. These techniques not only enable efficient processing of massive streaming data but are also adaptive and address the dynamic nature of social media. Experiments on large-scale real-world datasets illustrate the feasibility of our approach in terms of both runtime and information quality.


2021
Author(s): Mebarka Allaoui, Mohammed Lamine Kherfi, Abdelhakim Cheriet, Abdelhamid Bouchachia

In this paper, we introduce a novel algorithm that unifies manifold embedding and clustering (UEC) and efficiently predicts clustering assignments of high-dimensional data points in a new embedding space. The algorithm is based on a bi-objective optimisation problem combining embedding and clustering loss functions. This formulation makes it possible to simultaneously preserve the original structure of the data in the embedding space and produce better clustering assignments. Experimental results on a number of real-world datasets show that UEC is competitive with state-of-the-art clustering methods.


2018, Vol 8 (8), pp. 1248
Author(s): Haiqing Yao, Xiuwen Fu, Yongsheng Yang, Octavian Postolache

Outlier detection has attracted wide attention for its broad applications, such as fault diagnosis and intrusion detection, among which outlier analysis in data streams with high uncertainty and infinity is more challenging. Recent major work on outlier detection has focused on the principles of the local outlier factor, and there are few studies on incremental updating strategies, which are vital to outlier detection in data streams. In this paper, a novel incremental local outlier detection approach is introduced to dynamically evaluate local outliers in the data stream. An extended local neighborhood consisting of k nearest neighbors, reverse nearest neighbors, and shared nearest neighbors is estimated for each data point. The theoretical analysis of algorithm complexity for the insertion of new data and deletion of old data in the composite neighborhood shows that the amount of affected data in the incremental calculation is finite. Finally, experiments performed on both synthetic and real datasets verify its scalability and outlier detection accuracy. All results show that the proposed approach has comparable performance with state-of-the-art k nearest neighbor-based methods.
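
The composite neighborhood described above can be sketched on a small static batch. The k nearest and reverse nearest neighbor sets follow their usual definitions. The shared nearest neighbor set is defined here, as an assumption, as the other points whose k nearest neighbor set overlaps that of the point in question, which may differ from the paper's exact definition. The function name `composite_neighborhoods` is hypothetical, and the incremental insertion and deletion updates are not shown.

```python
import math


def composite_neighborhoods(points, k):
    """Return (composite, knn, rnn, snn) as lists of index sets, one per point."""
    n = len(points)
    knn = []
    for i in range(n):
        # Indices of the other points, ordered by distance to point i.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        knn.append(set(order[:k]))
    # Reverse neighbours: points that list i among their own k nearest.
    rnn = [{j for j in range(n) if i in knn[j]} for i in range(n)]
    # Shared neighbours (assumed definition): points whose kNN set overlaps i's.
    snn = [{j for j in range(n) if j != i and knn[i] & knn[j]} for i in range(n)]
    composite = [knn[i] | rnn[i] | snn[i] for i in range(n)]
    return composite, knn, rnn, snn
```

Because every set is recomputed from scratch, this sketch only illustrates what the neighborhood contains, not how it can be maintained cheaply as data arrive and expire.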

