On Frequency Estimation and Detection of Heavy Hitters in Data Streams

2020 ◽  
Vol 12 (9) ◽  
pp. 158
Author(s):  
Federica Ventruto ◽  
Marco Pulimeno ◽  
Massimo Cafaro ◽  
Italo Epicoco

A stream can be thought of as a very large, possibly infinite, set of data that arrives sequentially and cannot be stored in full. Since the memory available to the algorithm is limited, the stream is instead scanned upon arrival and summarized through a succinct data structure that maintains only the information of interest. Two of the main tasks in data stream processing are frequency estimation and heavy hitter detection. Frequency estimation requires estimating the frequency of each item, that is, the number of times (or the total weight with which) it appears in the stream, while heavy hitter detection requires reporting all items whose frequency exceeds a fixed threshold. In this work we design and analyze ACMSS, an algorithm for frequency estimation and heavy hitter detection, and compare it against the state-of-the-art ASketch algorithm. We show that, given the same budgeted amount of memory, our algorithm outperforms ASketch with regard to accuracy for the task of frequency estimation. Furthermore, we show that, under the assumptions stated by its authors, ASketch may fail to report all of the heavy hitters, whereas ACMSS provides the full list of heavy hitters with high probability.
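To make the setting concrete, the sketch below is a minimal Count-Min sketch, the classic succinct structure for stream frequency estimation. It illustrates the general technique only, not the ACMSS data structure itself, and the width/depth parameters are arbitrary choices.

```python
import random

class CountMinSketch:
    """Succinct frequency estimator: estimates never undercount an item."""

    def __init__(self, width=1024, depth=5, seed=42):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]
        self.p = 2**31 - 1  # prime modulus for universal hashing
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(depth)]

    def _index(self, row, item):
        a, b = self.hashes[row]
        return ((a * hash(item) + b) % self.p) % self.width

    def update(self, item, weight=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += weight

    def estimate(self, item):
        # The row minimum keeps the overestimate small with high probability.
        return min(self.table[r][self._index(r, item)]
                   for r in range(self.depth))
```

Note that a sketch alone estimates frequencies but cannot enumerate the heavy hitters; practical detectors pair it with a small candidate-tracking structure (e.g., a min-heap of the currently heaviest items) that is checked against the threshold.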

Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5829 ◽  
Author(s):  
Jen-Wei Huang ◽  
Meng-Xun Zhong ◽  
Bijay Prasad Jaysawal

Outlier detection in data streams is crucial to successful data mining. However, this task is made increasingly difficult by the enormous growth in the quantity of data generated by the expansion of the Internet of Things (IoT). Recent outlier detection methods based on the density-based local outlier factor (LOF) do not consider variations in data that occur over time; for example, a new cluster of data points may appear in the stream. We therefore present a novel algorithm for streaming data, referred to as time-aware density-based incremental local outlier detection (TADILOF), to overcome this issue. In addition, we have developed a means of estimating the LOF score, termed "approximate LOF," based on historical information retained after outdated data have been removed. Experimental results demonstrate that TADILOF outperforms current state-of-the-art methods in terms of AUC while achieving similar execution times. Moreover, we present an application of the proposed scheme to the development of an air-quality monitoring system.
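For orientation, the following is a naive window-based LOF baseline of the kind TADILOF improves upon: each arrival is re-scored against a sliding window, which is simple but recomputes neighborhoods from scratch. The window size, neighbor count, and threshold below are illustrative, not the paper's settings.

```python
from collections import deque

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def sliding_window_lof(stream, window=500, k=20, threshold=1.5):
    """Yield (point, is_outlier) by re-scoring each arrival against a window."""
    buf = deque(maxlen=window)
    for x in stream:
        buf.append(x)
        if len(buf) <= k:          # need more than k points for k-NN
            yield x, False
            continue
        lof = LocalOutlierFactor(n_neighbors=k)
        lof.fit(np.asarray(buf))
        # Score of the newest point; sklearn reports the *negative* LOF.
        score = -lof.negative_outlier_factor_[-1]
        yield x, score > threshold
```

As the abstract notes, the paper's contribution is avoiding this full recomputation: LOF scores are updated incrementally, and "approximate LOF" fills in for points whose neighbors have expired from the window.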


2020 ◽  
Vol 34 (04) ◽  
pp. 3858-3865
Author(s):  
Huijie Feng ◽  
Chunpeng Wu ◽  
Guoyang Chen ◽  
Weifeng Zhang ◽  
Yang Ning

Recently, smoothing deep neural network classifiers via isotropic Gaussian perturbation has been shown to be an effective and scalable way to provide state-of-the-art probabilistic robustness guarantees against ℓ2-norm-bounded adversarial perturbations. However, how to train a good base classifier that is both accurate and robust when smoothed has not been fully investigated. In this work, we derive a new regularized risk in which the regularizer adaptively encourages the accuracy and robustness of the smoothed counterpart while the base classifier is being trained. It is computationally efficient and can be implemented in parallel with other empirical defense methods. We discuss how to implement it under both standard (non-adversarial) and adversarial training schemes. We also design a new certification algorithm that leverages the regularization effect to provide a tighter robustness lower bound that holds with high probability. Extensive experiments demonstrate the effectiveness of the proposed training and certification approaches on the CIFAR-10 and ImageNet datasets.
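As background, the smoothed classifier in question predicts the class the base classifier outputs most often under Gaussian noise. A minimal Monte-Carlo sketch follows; base_classifier, sigma, and the sample count are placeholders, and real certification additionally requires a confidence interval on the top-class probability.

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, n=1000,
                     num_classes=10, rng=None):
    """Monte-Carlo estimate of g(x) = argmax_c P(f(x + noise) = c),
    where noise ~ N(0, sigma^2 I) is isotropic Gaussian."""
    rng = rng or np.random.default_rng(0)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        counts[base_classifier(noisy)] += 1   # f returns a class index
    return int(counts.argmax()), counts
```

In the standard analysis, a lower confidence bound p̲ on the top-class probability yields a certified ℓ2 radius of σ·Φ⁻¹(p̲); the paper's certification algorithm tightens this bound by exploiting the regularization effect.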


2020 ◽  
Vol 2 (1) ◽  
pp. 26-37
Author(s):  
Dr. Pasumponpandian

The rapid progress of the Internet of Things (IoT), together with advances in technology and processing capabilities, has paved the way for decentralized systems that rely on cloud services. Even so, transferring all of the information sensed by IoT devices to the cloud remains problematic: certain applications gather huge streams of data yet expect timely responses with minimized delay, low computing energy, and enhanced reliability. This has led to the development of a middle layer between the cloud and the IoT, termed the edge layer, which brings cloud services down to the user's edge. This paper analyzes data stream processing in the edge layer, taking into account the complexities of computing IoT data streams there, and presents real-time edge analytics that examine IoT data streams to offer data-driven insight for a parking system in smart cities.
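The paper's pipeline is not spelled out in this abstract, but the pattern it describes, summarizing streams at the edge and forwarding only compact results to the cloud, can be sketched as follows. The event format, lot identifiers, and reporting interval are hypothetical.

```python
from collections import defaultdict

def edge_aggregate(readings, report_every=60):
    """Summarize raw sensor events at the edge; emit compact per-lot summaries."""
    occupancy = defaultdict(int)        # lot_id -> currently occupied spots
    last_report = 0
    for t, lot_id, delta in readings:   # delta: +1 car enters, -1 car leaves
        occupancy[lot_id] += delta
        if t - last_report >= report_every:
            yield t, dict(occupancy)    # only the summary crosses to the cloud
            last_report = t
```

The design choice is the essence of edge analytics: raw events stay local, so bandwidth, latency, and energy costs scale with the number of summaries rather than the number of sensor readings.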


Author(s):  
Yixin Nie ◽  
Yicheng Wang ◽  
Mohit Bansal

Success in natural language inference (NLI) should require a model to understand both lexical and compositional semantics. However, through adversarial evaluation, we find that several state-of-the-art models with diverse architectures over-rely on the former and fail to use the latter. Further, this compositionality unawareness is not reflected in standard evaluation on current datasets. We show that removing RNNs from existing models, or shuffling input words during training, does not induce a large performance loss despite the explicit removal of compositional information. Therefore, we propose a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone (i.e., on which a bag-of-words model gives high probability to one wrong label), hence revealing the models' actual compositionality awareness. We show that this setup not only highlights the limited compositional ability of current NLI models, but also differentiates model performance based on design, e.g., separating shallow bag-of-words models from deeper, linguistically grounded tree-based models. Our evaluation setup is an important analysis tool: it complements existing adversarial and linguistically driven diagnostic evaluations, and exposes opportunities for future work on evaluating models' compositional understanding.
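A rough sketch of the filtering idea: keep only examples that a pure bag-of-words model gets confidently wrong, so that solving them demands more than lexical cues. The in-sample fit and the confidence threshold here are simplifications; the paper presumably scores held-out examples.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def hard_for_bow(premises, hypotheses, labels, confidence=0.8):
    """Return indices of examples a bag-of-words model gets confidently wrong."""
    # Concatenate premise and hypothesis; word order is deliberately ignored.
    texts = [p + " " + h for p, h in zip(premises, hypotheses)]
    X = CountVectorizer().fit_transform(texts)
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    probs = clf.predict_proba(X)
    preds = clf.classes_[probs.argmax(axis=1)]
    confs = probs.max(axis=1)
    return [i for i, (p, c, y) in enumerate(zip(preds, confs, labels))
            if p != y and c >= confidence]
```

Examples surviving this filter form the compositionality-sensitive test set: a model that matches the BoW baseline on them is, by construction, not relying on lexical features alone.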


Author(s):  
Prasanna Lakshmi Kompalli

Data arriving continuously from different sources are referred to as data streams. Data stream mining is an online learning technique in which each data point must be processed as it arrives and discarded once processing is complete. Advances in technology have made it possible to monitor these streams in real time, which has created many new challenges for researchers. The main features of this type of data are that it is fast flowing, large in volume, continuous and growing in nature, and that its characteristics may change over time, a phenomenon termed concept drift. This chapter addresses the problems of mining data streams with concept drift. Because the relevant literature is scattered, isolating the right references is a grueling task for researchers and practitioners; this chapter aims to provide a solution by amalgamating the techniques used for data stream mining with concept drift.
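To ground the notion of concept drift, one widely used detector in this literature is the Drift Detection Method (DDM) of Gama et al., which monitors a model's online error rate. A minimal sketch follows; the warning/drift multipliers are the commonly used defaults.

```python
class DDM:
    """Drift Detection Method: flag drift when the error rate rises
    significantly above its historical minimum."""

    def __init__(self, warn=2.0, drift=3.0):
        self.warn, self.drift = warn, drift
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 1.0                      # running error rate
        self.s = 0.0                      # its standard deviation
        self.p_min = self.s_min = float("inf")

    def update(self, error):              # error: 1 if misclassified, else 0
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = (self.p * (1 - self.p) / self.n) ** 0.5
        if self.n < 30:                   # wait for a stable estimate
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s > self.p_min + self.drift * self.s_min:
            self.reset()                  # caller should retrain on recent data
            return "drift"
        if self.p + self.s > self.p_min + self.warn * self.s_min:
            return "warning"
        return "stable"
```

A learner wraps this around its prediction loop: on "warning" it starts buffering recent examples, and on "drift" it rebuilds the model from that buffer.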


Author(s):  
Markus Endres ◽  
Lena Rudenko

A skyline query retrieves all objects in a dataset that are not dominated by other objects according to some given criteria. The many existing skyline algorithms can be classified into generic, index-based, and lattice-based algorithms. This chapter takes a tour through lattice-based skyline algorithms: it summarizes the basic concepts and properties, presents high-performance parallel approaches, shows how to overcome the low-cardinality restriction of lattice structures, and finally presents an application to data streams for real-time skyline computation. Experimental results on synthetic and real datasets show that lattice-based algorithms outperform state-of-the-art skyline techniques and, additionally, have linear runtime complexity.
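For reference, here are the dominance test and a generic block-nested-loop skyline, representing the baseline family rather than the lattice-based approach the chapter advocates; all criteria are assumed to be minimized.

```python
def dominates(a, b):
    """a dominates b if a is at least as good in every dimension
    and strictly better in at least one ('better' = smaller here)."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def skyline(points):
    """Generic block-nested-loop skyline: keep points nothing dominates."""
    result = []
    for p in points:
        if any(dominates(q, p) for q in result):
            continue                                          # p is dominated
        result = [q for q in result if not dominates(p, q)]   # evict the dominated
        result.append(p)
    return result
```

The pairwise comparisons make this quadratic in the worst case; lattice-based algorithms instead exploit low-cardinality attribute domains to enumerate the dominance lattice directly, which is what yields the linear runtime reported above.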


Author(s):  
Rodrigo Salvador Monteiro ◽  
Geraldo Zimbrão ◽  
Holger Schwarz ◽  
Bernhard Mitschang ◽  
Jano Moreira de Souza

Calendar-based pattern mining aims at identifying patterns on specific calendar partitions, for example every Monday, every first working day of each month, or every holiday. Providing flexible mining capabilities for calendar-based partitions is especially challenging in a data stream scenario: the calendar partitions of interest are not known a priori, and at each point in time only a subset of the detailed data is available. The authors show how a data warehouse approach can be applied to this problem. The data warehouse, which keeps track of frequent itemsets holding on different partitions of the original stream, has low storage requirements, yet it allows complete and precise sets of patterns to be derived. Furthermore, the authors demonstrate the effectiveness of their approach through a series of experiments.
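To illustrate what a calendar partition is computationally, here are two hypothetical partition predicates; a stream element's timestamp is routed to every partition whose predicate it satisfies. The names and the holiday handling are assumptions for illustration, not the authors' schema.

```python
from datetime import date, timedelta

def first_working_day(d, holidays=frozenset()):
    """True if d is the first Mon-Fri, non-holiday day of its month."""
    probe = date(d.year, d.month, 1)
    while probe.weekday() >= 5 or probe in holidays:  # skip weekends/holidays
        probe += timedelta(days=1)
    return d == probe

# Each arriving transaction's itemsets are counted under every matching partition:
partitions = {
    "every_monday": lambda d: d.weekday() == 0,
    "first_working_day": first_working_day,
}
```

Because predicates like these can be evaluated for any timestamp after the fact, the warehouse can answer queries about partitions that were never anticipated when the stream was summarized.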


2020 ◽  
Vol 34 (04) ◽  
pp. 6251-6258
Author(s):  
Qian-Wei Wang ◽  
Liang Yang ◽  
Yu-Feng Li

Weak-label learning deals with the problem where each training example is associated with multiple ground-truth labels simultaneously, but the labels are only partially provided. This circumstance is frequently encountered when the number of classes is very large or when there is large ambiguity between class labels, and it significantly degrades the performance of multi-label learning. In this paper, we propose LCForest, the first tree-ensemble-based deep learning method for weak-label learning. Rather than formulating the problem as a regularized framework, we employ the recently proposed cascade forest structure, which processes information layer by layer, and endow it with the ability to exploit weak-label data through a concise and highly efficient label complement structure. Specifically, in each layer, the label vector of each instance in the testing fold is modified with the predictions of random forests trained on the corresponding training fold. Since the ground-truth label matrix is inaccessible, we cannot estimate performance via cross-validation directly; in order to control the growth of the cascade forest, we therefore adopt label frequency estimation and a complement-flag mechanism. Experiments show that the proposed LCForest method compares favorably against existing state-of-the-art multi-label and weak-label learning methods.
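A loose sketch of one such layer follows: out-of-fold random-forest scores add confident positive labels to a partially observed label matrix. The fold count, the per-label forests, and the 0.9 threshold are illustrative; the authors' complement-flag and label-frequency-estimation mechanisms are not reproduced here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

def complement_labels(X, Y, n_estimators=100):
    """One cascade layer: out-of-fold random-forest predictions complement
    a partially observed 0/1 label matrix Y of shape (n_samples, n_labels)."""
    Y_new = Y.copy()
    folds = KFold(n_splits=3, shuffle=True, random_state=0)
    for train, test in folds.split(X):
        for j in range(Y.shape[1]):                 # one forest per label
            rf = RandomForestClassifier(n_estimators=n_estimators,
                                        random_state=0)
            rf.fit(X[train], Y[train, j])
            if len(rf.classes_) < 2:
                continue                            # label absent in this fold
            scores = rf.predict_proba(X[test])[:, 1]
            # Add only confident positives; never remove observed labels.
            Y_new[test, j] = np.maximum(Y[test, j], scores > 0.9)
    return Y_new
```

The complemented matrix then serves as the target for the next layer; in the paper, an estimated label frequency caps how many positives may be added, which is what keeps the cascade from growing unchecked.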

