Designing a Streaming Algorithm for Outlier Detection in Data Mining—An Incremental Approach

Sensors ◽  
2020 ◽  
Vol 20 (5) ◽  
pp. 1261 ◽  
Author(s):  
Kangqing Yu ◽  
Wei Shi ◽  
Nicola Santoro

Designing algorithms that detect outliers over streaming data has become an important task in many common applications, arising in areas such as fraud detection, network analysis, and environmental monitoring. Because real-time data may arrive as streams rather than batches, properties such as concept drift, temporal context, transiency, and uncertainty need to be considered. In addition, data processing needs to be incremental, scalable, and workable within limited memory. These constraints pose major challenges to the accuracy of existing outlier detection algorithms when they are implemented incrementally, especially in a streaming environment. To address these problems, we first propose C_KDE_WR, which uses a sliding window and a kernel function to process streaming data online, and we report results demonstrating its high throughput on real-time streaming data when implemented in a CUDA framework on a Graphics Processing Unit (GPU). We also present another algorithm, C_LOF, based on the popular and effective Local Outlier Factor (LOF) outlier detection algorithm, which unfortunately works only on batched data. Using a novel incremental approach that compensates for the high complexity of LOF, we show how to implement it in a streaming context and obtain results in a timely manner. Like C_KDE_WR, C_LOF employs a sliding window and a statistical summary to help make decisions based on the data in the current window, and it addresses the same streaming-data challenges as C_KDE_WR. In addition, we report a comparative evaluation of the accuracy of C_KDE_WR against the state-of-the-art SOD_GPU using the Precision, Recall, and F-score metrics. Furthermore, a t-test is performed to demonstrate the significance of the improvement.
We further report the testing results of C_LOF under different parameter settings and plot ROC and PR curves, with their area under the curve (AUC) and Average Precision (AP) values calculated respectively. Experimental results show that C_LOF can overcome the masquerading problem, which often arises in outlier detection on streaming data. We provide a complexity analysis and report experimental results on the accuracy of both C_KDE_WR and C_LOF in order to evaluate their effectiveness as well as their efficiency.
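The combination of a sliding window and a kernel function described in this abstract can be illustrated with a minimal sketch. This is not the authors' C_KDE_WR, whose details are not given here; it is a toy one-dimensional Gaussian kernel density scorer, and the class name `SlidingWindowKDE` and the parameters `window_size`, `bandwidth`, `threshold`, and `min_points` are assumptions made for illustration.

```python
import math
from collections import deque


class SlidingWindowKDE:
    """Toy 1-D sliding-window Gaussian KDE outlier scorer (illustrative only)."""

    def __init__(self, window_size=50, bandwidth=1.0, threshold=0.01):
        # A bounded deque drops the oldest point automatically,
        # so memory use stays fixed as the stream grows.
        self.window = deque(maxlen=window_size)
        self.bandwidth = bandwidth
        self.threshold = threshold

    def density(self, x):
        """Estimate the density at x from the points currently in the window."""
        if not self.window:
            return 0.0
        h = self.bandwidth
        norm = 1.0 / (len(self.window) * h * math.sqrt(2.0 * math.pi))
        return norm * sum(math.exp(-0.5 * ((x - p) / h) ** 2) for p in self.window)

    def update(self, x, min_points=5):
        """Score x against the current window, then add it to the window.

        Returns True when the estimated density falls below the threshold.
        """
        is_outlier = len(self.window) >= min_points and self.density(x) < self.threshold
        self.window.append(x)
        return is_outlier
```

Each arriving point is scored against the window contents before being added, so the decision depends only on recent data, and old points expire without any explicit bookkeeping.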

Author(s):  
Prasanna Lakshmi Kompalli

Data arriving continuously from different sources is referred to as a data stream. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these data streams in real time, which has created many new challenges for researchers. The main features of this type of data are that it flows fast, that it is large, continuous, and growing, and that its characteristics may change over time, which is termed concept drift. This chapter addresses the problems in mining data streams with concept drift. Isolating the relevant literature on this topic can be a grueling task for researchers and practitioners. This chapter tries to provide a solution by bringing together the techniques used for data stream mining with concept drift.


Author(s):  
Suresh P. ◽  
Keerthika P. ◽  
Sathiyamoorthi V. ◽  
Logeswaran K. ◽  
Manjula Devi R. ◽  
...  

Cloud computing and big data analytics are key parts of smart city development that can create reliable, secure, healthier, and more informed communities while producing tremendous amounts of data for the public and private sectors. Since the various sectors of smart cities generate enormous amounts of streaming data from sensors and other devices, storing and analyzing this huge volume of real-time data typically requires significant computing capacity. Most smart city solutions use a combination of core technologies such as computing, storage, databases, and data warehouses, and advanced technologies such as analytics on big data, real-time streaming data, artificial intelligence, machine learning, and the internet of things (IoT). This chapter presents a theoretical and experimental perspective on smart city services such as smart healthcare, water management, education, transportation and traffic management, and the smart grid that are offered using big data management and cloud-based analytics services.


2016 ◽  
Vol 7 (3) ◽  
pp. 38-55
Author(s):  
Srinivasa K.G. ◽  
Ganesh Hegde ◽  
Kushagra Mishra ◽  
Mohammad Nabeel Siddiqui ◽  
Abhishek Kumar ◽  
...  

With the advancement of portable devices and sensors, there has been a need to build a universal framework that can serve as a nodal point to aggregate data from different kinds of devices and sensors. We propose a unified framework that provides a robust set of guidelines for sensors of varying complexity connected to a common set of System-on-Chip (SoC) platforms. These guidelines will help to monitor, control, and visualize real-time data coming from the different types of sensors connected to these SoCs. We have defined a set of APIs that help the sensors register with the server. These APIs will be the standard with which the sensors comply while streaming data when connected to the client platforms.


2019 ◽  
Vol 15 (12) ◽  
pp. 155014771989454
Author(s):  
Hao Luo ◽  
Kexin Sun ◽  
Junlu Wang ◽  
Chengfeng Liu ◽  
Linlin Ding ◽  
...  

With the development of streaming data processing technology, real-time event monitoring and querying has become a hot issue in this field. In this article, an investigation based on coal mine disaster events is carried out, and a new anti-aliasing model for abnormal events is proposed, together with a multistage identification method. Coal mine micro-seismic signals are of great importance in investigating the vibration characteristics, attenuation law, and assessment of coal mine disasters. However, owing to factors such as geological structure and energy loss, the micro-seismic signals of the same kind of disaster may drift during transmission in the time domain, appearing weakened or enhanced, which affects the accuracy of identifying abnormal events (the coal mine disaster events). The current mine disaster event monitoring method is a lagged identification, which is based on monitoring a series of sensors with a 10-s-long data waveform as the monitoring unit. The identification method proposed in this article first takes advantage of the dynamic time warping algorithm, which is widely applied in audio recognition, to build an anti-aliasing model that identifies whether the perceived data are a disaster signal based on how closely they match the template waveform of historical disaster data. Second, since the real-time monitoring data are continuous streaming data, the start point of the disaster waveform must be identified before the disaster signal itself. Therefore, this article proposes a strategy based on a variable sliding window to align the two waveforms, locating the start point of the perceived disaster wave and the template wave by gradually sliding the perception window, which can guarantee the accuracy of the matching.
Finally, this article proposes a multistage identification mechanism based on the sliding window matching strategy and the characteristics of coal mine disaster waveforms. It adjusts the early warning level according to how much of the disaster signal has been identified, raising the level gradually with each successful match of a 1/N portion of the template, and uses the piecewise aggregate approximation method to optimize the calculation. Experimental results show that the proposed method is more accurate and can be used in real time.
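The two standard building blocks named in this abstract, dynamic time warping (DTW) and piecewise aggregate approximation (PAA), can be sketched in their textbook forms. This is not the authors' anti-aliasing model or multistage mechanism; the function names `dtw_distance` and `paa`, the absolute-difference point cost, and the segment boundaries are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences.

    Uses absolute difference as the point cost and allows the usual three
    moves (match, insertion, deletion) with no warping-window constraint.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each of `segments` chunks."""
    n = len(series)
    if segments <= 0 or segments > n:
        raise ValueError("segments must be between 1 and len(series)")
    reduced = []
    for k in range(segments):
        # Integer boundaries keep every chunk non-empty when segments <= n.
        start = k * n // segments
        end = (k + 1) * n // segments
        chunk = series[start:end]
        reduced.append(sum(chunk) / len(chunk))
    return reduced
```

Because the DTW table has one cell per pair of samples, shortening both waveforms with `paa` before calling `dtw_distance` reduces the work roughly quadratically, which is one plausible reading of how PAA can optimize the calculation.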


2020 ◽  
Vol 12 (23) ◽  
pp. 10175
Author(s):  
Fatima Abdullah ◽  
Limei Peng ◽  
Byungchul Tak

The volume of streaming sensor data from various environmental sensors continues to increase rapidly due to wider deployments of IoT devices at much greater scales than ever before. This, in turn, causes a massive increase in fog and cloud network traffic, which leads to heavily delayed network operations. In streaming data analytics, the ability to obtain real-time data insight is crucial to computational sustainability for many IoT-enabled applications such as environmental monitoring, pollution and climate surveillance, traffic control, and even e-commerce. However, such network delays prevent us from achieving high-quality real-time analytics of environmental information. To address this challenge, we propose the Fog Sampling Node Selector (Fossel) technique, which can significantly reduce IoT network and processing delays by algorithmically selecting an optimal subset of fog nodes to perform the sensor data sampling. In addition, our technique executes a simple type of query within the fog nodes to further reduce network delays by processing the data near the data-producing devices. Our extensive evaluations show that the Fossel technique outperforms the state of the art in terms of latency reduction as well as bandwidth consumption, network usage, and energy consumption.


2019 ◽  
Vol 6 (1) ◽  
pp. 157-163 ◽  
Author(s):  
Jie Lu ◽  
Anjin Liu ◽  
Yiliao Song ◽  
Guangquan Zhang

Data-driven decision-making (D³M) is often confronted by the problem of uncertainty or unknown dynamics in streaming data. To provide accurate real-time decision solutions, systems have to promptly address changes in the data distribution of streaming data, a phenomenon known as concept drift. Past data patterns may not be relevant to new data when a data stream experiences significant drift, so continuing to use models based on past data will lead to poor predictions and poor decision outcomes. This position paper discusses the basic framework and prevailing techniques in streaming big data and concept drift for D³M. The study first establishes a technical framework for real-time D³M under concept drift and details the characteristics of high-volume streaming data. The main methodologies and approaches for detecting concept drift and supporting D³M are highlighted and presented. Lastly, further research directions, related methods, and procedures for using streaming data to support decision-making in concept drift environments are identified. We hope the observations in this paper help researchers and professionals to better understand the fundamentals and research directions of D³M in streamed big data environments.
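The idea of detecting a change in data distribution can be illustrated with a deliberately simple sketch. It is not a method from this paper: it compares the mean of a recent window against a frozen reference window and flags drift when the gap exceeds `k` reference standard deviations. The class name `WindowDriftDetector` and the parameters `window_size` and `k` are assumptions made for illustration.

```python
from collections import deque
from statistics import mean, pstdev


class WindowDriftDetector:
    """Toy mean-shift drift detector using a reference and a recent window."""

    def __init__(self, window_size=30, k=3.0):
        self.window_size = window_size
        self.k = k
        self.reference = deque(maxlen=window_size)  # frozen once full
        self.recent = deque(maxlen=window_size)     # slides with the stream

    def update(self, x):
        """Consume one value; return True when drift is flagged."""
        if len(self.reference) < self.window_size:
            self.reference.append(x)
            return False
        self.recent.append(x)
        if len(self.recent) < self.window_size:
            return False
        ref_mean = mean(self.reference)
        # Guard against a zero spread in a constant reference window.
        ref_std = pstdev(self.reference) or 1e-9
        drifted = abs(mean(self.recent) - ref_mean) > self.k * ref_std
        if drifted:
            # Adopt the recent data as the new reference and start over,
            # mirroring the point that models built on past data go stale.
            self.reference = deque(self.recent, maxlen=self.window_size)
            self.recent.clear()
        return drifted
```

A flagged drift is the moment at which a downstream model would be retrained or replaced; this sketch only signals that moment and leaves the response to the caller.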



