EMM-CLODS: An Effective Microcluster and Minimal Pruning CLustering-Based Technique for Detecting Outliers in Data Streams

Complexity ◽

10.1155/2021/9178461 ◽

2021 ◽

Vol 2021 ◽

pp. 1-20

Author(s):

Mohamed Jaward Bah ◽

Hongzhi Wang ◽

Li-Hui Zhao ◽

Ji Zhang ◽

Jie Xiao

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Experimental Studies ◽

Streaming Data ◽

Detection Accuracy ◽

Clustering Methods ◽

Time Consumption ◽

Data Points ◽

Evolving Data ◽

Real World Datasets

Detecting outliers in data streams is a challenging problem because, in a data stream scenario, scanning the data multiple times is infeasible and the incoming streaming data keep evolving. Over the years, a common approach to outlier detection has been to use clustering-based methods, but these methods have inherent challenges and drawbacks. These include the difficulty of effectively clustering sparse data points (which depends on the quality of the clustering method), dealing with continuous, fast-incoming data streams, high memory and time consumption, and a lack of high outlier detection accuracy. This paper proposes an effective clustering-based approach to detecting outliers in evolving data streams: a new method called Effective Microcluster and Minimal pruning CLustering-based method for Outlier detection in Data Streams (EMM-CLODS). It detects outliers in evolving data streams by first applying a microclustering technique to cluster dense data points and effectively handle objects within a sliding window according to the relevance of their status to their respective neighbors or position. The analysis from our experimental studies on both synthetic and real-world datasets shows that the technique performs well with minimal memory and time consumption when compared to the other baseline algorithms, making it a very promising technique for outlier detection problems in data streams.
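
The general idea of grouping dense points into microclusters inside a sliding window can be sketched as follows. This is only an illustration under stated assumptions, not the EMM-CLODS algorithm: the class name `MicroClusterWindow`, the parameters `window`, `radius` and `min_pts`, and the rule "a point in a microcluster with fewer than `min_pts` live members is reported as an outlier" are all hypothetical, and the paper's pruning rules are not reproduced.

```python
import math
from collections import deque


class MicroClusterWindow:
    """Toy count-based sliding window with fixed-radius microclusters (hypothetical)."""

    def __init__(self, window, radius, min_pts):
        self.window = window      # number of most recent points kept
        self.radius = radius      # a point joins a microcluster if within this distance of its center
        self.min_pts = min_pts    # microclusters with fewer live members are treated as sparse
        self.centers = {}         # cluster id -> center (the first point that opened the cluster)
        self.counts = {}          # cluster id -> number of live members
        self.live = deque()       # (point, cluster id) in arrival order
        self._next_id = 0

    def insert(self, p):
        # Expire the oldest point once the window is full.
        if len(self.live) == self.window:
            _, old = self.live.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old], self.centers[old]
        # Join the first microcluster whose center is close enough, else open a new one.
        cid = next((c for c, ctr in self.centers.items()
                    if math.dist(p, ctr) <= self.radius), None)
        if cid is None:
            cid = self._next_id
            self._next_id += 1
            self.centers[cid] = p
            self.counts[cid] = 0
        self.counts[cid] += 1
        self.live.append((p, cid))

    def outliers(self):
        # Points whose microcluster is currently too sparse.
        return [p for p, cid in self.live if self.counts[cid] < self.min_pts]
```

Each insertion touches only the microcluster centers rather than every point in the window, which is the usual motivation for microclustering in this setting.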


TADILOF: Time Aware Density-Based Incremental Local Outlier Detection in Data Streams

Sensors ◽

10.3390/s20205829 ◽

2020 ◽

Vol 20 (20) ◽

pp. 5829 ◽

Cited By ~ 1

Author(s):

Jen-Wei Huang ◽

Meng-Xun Zhong ◽

Bijay Prasad Jaysawal

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

State Of The Art ◽

Streaming Data ◽

Current State ◽

Data Points ◽

Local Outlier ◽

Time Aware ◽

Over Time

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent advances in outlier detection based on density-based local outlier factor (LOF) algorithms do not consider variations in data that change over time. For example, a new cluster of data points may appear in the data stream over time. To overcome this issue, we present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF). In addition, we have developed a means of estimating the LOF score, termed "approximate LOF," based on historical information following the removal of outdated data. Experimental results demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar performance in terms of execution time. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
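
For orientation, the classic static LOF score that such methods build on can be computed as in the sketch below. It is the textbook batch definition, not TADILOF and not its "approximate LOF"; the function names are illustrative, and the code assumes the points are distinct enough that no reachability sum is zero.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding i itself."""
    ranked = sorted((math.dist(points[i], p), j)
                    for j, p in enumerate(points) if j != i)
    return [j for _, j in ranked[:k]]


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor."""
    return math.dist(points[i], points[knn(points, i, k)[-1]])


def lrd(points, i, k):
    """Local reachability density: inverse of the mean reachability distance."""
    neigh = knn(points, i, k)
    reach = [max(k_distance(points, j, k), math.dist(points[i], points[j]))
             for j in neigh]
    return len(neigh) / sum(reach)


def lof(points, i, k):
    """Local outlier factor: mean neighbor density divided by own density."""
    neigh = knn(points, i, k)
    return sum(lrd(points, j, k) for j in neigh) / (len(neigh) * lrd(points, i, k))
```

A score close to 1 means the point is about as dense as its neighbors; a score well above 1 marks a point in a much sparser region. Recomputing this from scratch for every arrival is quadratic in the window size, which is why incremental and approximate variants matter for streams.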


Statistical hierarchical clustering algorithm for outlier detection in evolving data streams

Machine Learning ◽

10.1007/s10994-020-05905-4 ◽

2020 ◽

Author(s):

Dalibor Krleža ◽

Boris Vrdoljak ◽

Mario Brčić

Keyword(s):

Outlier Detection ◽

Hierarchical Clustering ◽

Data Streams ◽

Clustering Algorithm ◽

Hierarchical Clustering Algorithm ◽

Evolving Data


EvolveCluster: an evolutionary clustering algorithm for streaming data

Evolving Systems ◽

10.1007/s12530-021-09408-y ◽

2021 ◽

Author(s):

Christian Nordahl ◽

Veselka Boeva ◽

Håkan Grahn ◽

Marie Persson Netz

Keyword(s):

Data Streams ◽

Data Stream ◽

Clustering Algorithm ◽

Clustering Algorithms ◽

Streaming Data ◽

Evolutionary Clustering ◽

Stream Clustering ◽

The Past ◽

Data Stream Clustering ◽

Evolving Data

Data has become an integral part of our society in the past years, arriving faster and in larger quantities than before. Traditional clustering algorithms rely on the availability of entire datasets to model them correctly and efficiently. Such requirements are not possible in the data stream clustering scenario, where data arrives and needs to be analyzed continuously. This paper proposes a novel evolutionary clustering algorithm, entitled EvolveCluster, capable of modeling evolving data streams. We compare EvolveCluster against two other evolutionary clustering algorithms, PivotBiCluster and Split-Merge Evolutionary Clustering, by conducting experiments on three different datasets. Furthermore, we perform additional experiments on EvolveCluster to further evaluate its capabilities on clustering evolving data streams. Our results show that EvolveCluster manages to capture evolving data stream behaviors and adapts accordingly.


Minimal Rare Pattern-Based Outlier Detection Approach For Uncertain Data Streams Under Monotonic Constraints

The Computer Journal ◽

10.1093/comjnl/bxab139 ◽

2021 ◽

Author(s):

Saihua Cai ◽

Jinfu Chen ◽

Haibo Chen ◽

Chi Zhang ◽

Qian Li ◽

...

Keyword(s):

Outlier Detection ◽

Data Streams ◽

State Of The Art ◽

Uncertain Data ◽

Small Scale ◽

Detection Accuracy ◽

Detection Approach ◽

The Matrix ◽

Uncertain Data Streams

Existing association-based outlier detection approaches were proposed to seek potential outliers in the huge full set of uncertain data streams (UDS), but could not effectively process the small scale of UDS that satisfies preset constraints; thus, they were time consuming. To solve this problem, this paper proposes a novel minimal rare pattern-based outlier detection approach, namely Constrained Minimal Rare Pattern-based Outlier Detection (CMRP-OD), to discover outliers from small sets of UDS that satisfy the user-preset succinct or convertible monotonic constraints. First, two concepts of ‘maximal probability’ and ‘support cap’ are proposed to compress the scale of extensible patterns, and then a matrix is designed to store the information of each valid pattern to reduce the number of scans of UDS, thus decreasing the time consumption. Second, more factors that can influence the determination of outliers are considered in the design of deviation indices, thus increasing the detection accuracy. Extensive experiments show that, compared with the state-of-the-art approaches, the CMRP-OD approach improves detection accuracy by at least 10%, and its time cost is reduced by almost half.


Density-Based Clustering Method for Trends Analysis Using Evolving Data Stream

International Journal of Synthetic Emotions ◽

10.4018/ijse.2020070102 ◽

2020 ◽

Vol 11 (2) ◽

pp. 19-36

Author(s):

Umesh Kokate ◽

Arviand V. Deshpande ◽

Parikshit N. Mahalle

Keyword(s):

Data Streams ◽

Data Stream ◽

Cluster Formation ◽

Clustering Method ◽

Density Based Clustering ◽

Trends Analysis ◽

Data Points ◽

Data Stream Clustering ◽

Evolving Data ◽

Over Time

Evolution of data in the data stream environment generates patterns at different time instances. The cluster formation changes with respect to time because of the behaviour and members of clusters. Data stream clustering (DSC) allows us to investigate the changes of the group behaviour. These changes in the behaviour of the group members over time lead to the formation of new clusters and may make old clusters extinct. These extinct old clusters may also recur over time. The problem is to identify and record these change patterns of evolving data streams. The knowledge obtained from these change patterns is then used for trends analysis over evolving data streams. To address this flexible clustering requirement, a density-based clustering method is proposed to dynamically cluster evolving data streams. The decay factor identifies the formation of new clusters and the diminishing of older clusters on arrival of data points. This indicates trends in evolving data streams.
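
How a decay factor lets old clusters fade while new ones grow can be sketched as below. The abstract does not state the decay function, so the exponential fading 2^(-lam * dt) used here is an assumption borrowed from common density-based stream clustering practice, and the class name `DecayedCluster` and parameter `lam` are hypothetical.

```python
class DecayedCluster:
    """Cluster weight that fades exponentially between updates (assumed decay model)."""

    def __init__(self, lam):
        self.lam = lam            # decay rate: larger values forget old points faster
        self.weight = 0.0         # decayed count of points, valid at time last_update
        self.last_update = 0.0

    def _fade(self, now):
        # Scale the stored weight down by 2^(-lam * elapsed time).
        self.weight *= 2.0 ** (-self.lam * (now - self.last_update))
        self.last_update = now

    def add_point(self, now):
        # Fade the existing weight to the current time, then count the new point fully.
        self._fade(now)
        self.weight += 1.0

    def weight_at(self, now):
        # Read the weight at a later time without modifying the cluster.
        return self.weight * 2.0 ** (-self.lam * (now - self.last_update))
```

A cluster that stops receiving points sees its weight halve every 1/lam time units, so comparing `weight_at(now)` against a threshold is one simple way to decide that a cluster has diminished, while a rising weight signals an emerging cluster.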


Reverse Skyline Computation over Sliding Windows

Mathematical Problems in Engineering ◽

10.1155/2015/649271 ◽

2015 ◽

Vol 2015 ◽

pp. 1-19 ◽

Cited By ~ 1

Author(s):

Junchang Xin ◽

Zhiqiong Wang ◽

Mei Bai ◽

Guoren Wang

Keyword(s):

Real World ◽

Data Streams ◽

Experimental Studies ◽

Market Analysis ◽

Skyline Queries ◽

Sliding Windows ◽

Real World Applications ◽

Data Points ◽

Pruning Technique ◽

Reverse Skyline

Reverse skyline queries have been used in many real-world applications such as business planning, market analysis, and environmental monitoring. In this paper, we investigated how to efficiently evaluate continuous reverse skyline queries over sliding windows. We first theoretically analyzed the inherent properties of reverse skyline on data streams and proposed a novel pruning technique to reduce the number of data points preserved for processing continuous reverse skyline queries. Then, an efficient approach, called Semidominance Based Reverse Skyline (SDRS), was proposed to process continuous reverse skyline queries. Moreover, an extension was also proposed to handle n-of-N and (n1, n2)-of-N reverse skyline queries. Our extensive experimental studies have demonstrated the efficiency as well as effectiveness of the proposed approach with various experimental settings.


Anomaly Pattern Detection in Streaming Data Based on the Transformation to Multiple Binary-Valued Data Streams

Journal of Artificial Intelligence and Soft Computing Research ◽

10.2478/jaiscr-2022-0002 ◽

2021 ◽

Vol 12 (1) ◽

pp. 19-27

Author(s):

Taegong Kim ◽

Cheong Hee Park

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

Detection Method ◽

Binary Classification ◽

Streaming Data ◽

Pattern Detection ◽

Detection Methods ◽

Anomaly Pattern ◽

Isolation Forest

Anomaly pattern detection in a data stream aims to detect a time point where outliers begin to occur abnormally. Recently, a method for anomaly pattern detection has been proposed based on binary classification for outliers and statistical tests in the data stream of binary labels of normal or outlier. It showed that an anomaly pattern can be detected accurately even when outlier detection performance is relatively low. However, since the anomaly pattern detection method is based on binary classification for outliers, most well-known outlier detection methods, which output real-valued outlier scores, cannot be used directly. In this paper, we propose an anomaly pattern detection method in a data stream using the transformation of real-valued outlier scores into multiple binary-valued data streams. The proposed anomaly pattern detection method is tested on artificial and real data sets using three outlier detection methods: Isolation Forest (IF), Autoencoder-based outlier detection, and Local Outlier Factor (LOF). The experimental results show that anomaly pattern detection using Isolation Forest gives the best performance.
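
One simple way to turn a stream of real-valued outlier scores into several binary-valued streams is to apply a set of thresholds, as sketched below. The abstract does not describe the actual transformation, so thresholding is an assumption made purely for illustration, and the function name `to_binary_streams` and its parameters are hypothetical.

```python
def to_binary_streams(scores, thresholds):
    """Return one 0/1 stream per threshold: 1 where the score exceeds that threshold."""
    return {t: [1 if s > t else 0 for s in scores] for t in thresholds}


# Example: three scores turned into two binary streams.
streams = to_binary_streams([0.1, 0.5, 0.9], thresholds=[0.3, 0.7])
```

Each resulting stream can then be monitored with a statistical test on its rate of ones, so the detector no longer depends on choosing a single cut-off for the outlier score.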


Retaining Data from Streams of Social Platforms with Minimal Regret

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/397 ◽

2017 ◽

Author(s):

Nguyen Thanh Tam ◽

Matthias Weidlich ◽

Duong Chi Thang ◽

Hongzhi Yin ◽

Nguyen Quoc Viet Hung

Keyword(s):

Social Media ◽

Data Streams ◽

Large Scale ◽

Information Quality ◽

Streaming Data ◽

Dynamic Nature ◽

Permanent Storage ◽

Reasonable Limit ◽

Efficient Processing ◽

Real World Datasets

Today's social platforms, such as Twitter and Facebook, continuously generate massive volumes of data. The resulting data streams exceed any reasonable limit for permanent storage, especially since data is often redundant, overlapping, sparse, and generally of low value. This calls for means to retain solely a small fraction of the data in an online manner. In this paper, we propose techniques to effectively decide which data to retain, such that the induced loss of information, the regret of neglecting certain data, is minimized. These techniques not only enable efficient processing of massive streaming data, but are also adaptive and address the dynamic nature of social media. Experiments on large-scale real-world datasets illustrate the feasibility of our approach in terms of both runtime and information quality.


Unified Embedding and Clustering

10.36227/techrxiv.16926754 ◽

2021 ◽

Author(s):

Mebarka Allaoui ◽

Mohammed Lamine Kherfi ◽

Abdelhakim Cheriet ◽

Abdelhamid Bouchachia

Keyword(s):

Loss Functions ◽

High Dimensional ◽

Original Structure ◽

Original Formulation ◽

Clustering Methods ◽

Manifold Embedding ◽

Data Points ◽

Real World Datasets ◽

State Of Art ◽

Novel Algorithm

In this paper, we introduce a novel algorithm that unifies manifold embedding and clustering (UEC) and efficiently predicts clustering assignments of high-dimensional data points in a new embedding space. The algorithm is based on a bi-objective optimisation problem combining embedding and clustering loss functions. This original formulation makes it possible to simultaneously preserve the original structure of the data in the embedding space and produce better clustering assignments. The experimental results on a number of real-world datasets show that UEC is competitive with state-of-the-art clustering methods.


An Incremental Local Outlier Detection Method in the Data Stream

Applied Sciences ◽

10.3390/app8081248 ◽

2018 ◽

Vol 8 (8) ◽

pp. 1248 ◽

Cited By ~ 4

Author(s):

Haiqing Yao ◽

Xiuwen Fu ◽

Yongsheng Yang ◽

Octavian Postolache

Keyword(s):

Outlier Detection ◽

Data Streams ◽

Data Stream ◽

Nearest Neighbor ◽

Nearest Neighbors ◽

Detection Accuracy ◽

K Nearest Neighbor ◽

Major Work ◽

Wide Range ◽

Local Outlier

Outlier detection has attracted wide attention for its broad applications, such as fault diagnosis and intrusion detection, among which outlier analysis in data streams with high uncertainty and infinity is more challenging. Recent major work on outlier detection has focused on the principles of the local outlier factor, and there are few studies on incremental updating strategies, which are vital to outlier detection in data streams. In this paper, a novel incremental local outlier detection approach is introduced to dynamically evaluate the local outlier in the data stream. An extended local neighborhood consisting of k nearest neighbors, reverse nearest neighbors and shared nearest neighbors is estimated for each data point. The theoretical analysis of algorithm complexity for the insertion of new data and deletion of old data in the composite neighborhood shows that the amount of affected data in the incremental calculation is finite. Finally, experiments performed on both synthetic and real datasets verify its scalability and outlier detection accuracy. All results show that the proposed approach has comparable performance with state-of-the-art k nearest neighbor-based methods.
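
The composite neighborhood described above can be illustrated with a small batch computation. This is a sketch, not the paper's incremental method: reverse nearest neighbors are taken as the points that list the query among their own k nearest neighbors, and shared nearest neighbors are taken as the points whose k-nearest-neighbor set overlaps the query's. The second definition is an assumption, since the abstract does not spell it out, and the function names are hypothetical.

```python
import math


def knn_sets(points, k):
    """Map each index to the set of indices of its k nearest neighbors."""
    result = {}
    for i, p in enumerate(points):
        ranked = sorted((math.dist(p, q), j)
                        for j, q in enumerate(points) if j != i)
        result[i] = {j for _, j in ranked[:k]}
    return result


def extended_neighborhood(points, i, k):
    """Union of k nearest, reverse nearest and shared nearest neighbors of points[i]."""
    nn = knn_sets(points, k)
    # Reverse neighbors: points that count i among their own k nearest neighbors.
    reverse = {j for j, s in nn.items() if i in s}
    # Shared neighbors (assumed definition): points with at least one k-NN in common with i.
    shared = {j for j, s in nn.items() if j != i and s & nn[i]}
    return nn[i] | reverse | shared
```

Because an insertion or deletion only changes the neighbor sets of points near the affected location, an incremental version needs to update only those points, which is the property the abstract's complexity argument relies on.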
