Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

Journal of Artificial Intelligence Research ◽ 10.1613/jair.453 ◽ 1998 ◽ Vol 8 ◽ pp. 67-91

Cited By ~ 93

Author(s): A. Moore ◽ M. S. Lee

Keyword(s): Machine Learning ◽ Data Structures ◽ Rule Learning ◽ Worst Case ◽ Sufficient Statistics ◽ Frequent Sets ◽ Efficient Machine ◽ Real World Datasets ◽ Selection Algorithms ◽ New Algorithms

This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets.
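
The counting idea in this abstract can be illustrated with a much-simplified sketch. It is not the authors' ADtree: it covers only points (a) and (c) (no nodes for zero counts, and small nodes kept as unexpanded row lists) and omits point (b), the deduction of counts from other counts. The names `ADNode`, `build`, `count`, and `leaf_threshold` are hypothetical.

```python
from collections import defaultdict


class ADNode:
    """Caches the number of rows matching the conjunction on the path to this node."""

    def __init__(self, data, rows, start, n_attrs, leaf_threshold):
        self.count = len(rows)
        self.rows = None       # set only for unexpanded "leaf list" nodes
        self.children = None   # attr index -> {value -> ADNode}
        if self.count <= leaf_threshold:
            # (c) Do not expand small nodes; keep the matching row indices instead.
            self.rows = rows
            return
        self.children = {}
        for attr in range(start, n_attrs):
            groups = defaultdict(list)
            for r in rows:
                groups[data[r][attr]].append(r)
            # (a) Only values that actually occur get a node, so zero counts cost nothing.
            self.children[attr] = {
                value: ADNode(data, idx, attr + 1, n_attrs, leaf_threshold)
                for value, idx in groups.items()
            }


def build(data, leaf_threshold=2):
    """Build the cache over a list of equal-length tuples of categorical values."""
    n_attrs = len(data[0]) if data else 0
    return ADNode(data, list(range(len(data))), 0, n_attrs, leaf_threshold)


def count(node, data, query):
    """Count rows matching a conjunctive query.

    query is a list of (attr, value) pairs sorted by increasing attr index.
    """
    if not query:
        return node.count
    if node.rows is not None:
        # Leaf list: fall back to scanning the few rows stored here.
        return sum(all(data[r][a] == v for a, v in query) for r in node.rows)
    (attr, value), rest = query[0], query[1:]
    child = node.children[attr].get(value)
    return 0 if child is None else count(child, data, rest)
```

Once the tree is built, a query such as `count(root, data, [(0, 0), (2, 1)])` walks at most one node per queried attribute, so its cost does not grow with the number of records except inside leaf lists.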

Efficient machine learning for attack detection

it - Information Technology ◽ 10.1515/itit-2020-0015 ◽ 2020 ◽ Vol 62 (5-6) ◽ pp. 279-286

Author(s): Christian Wressnegger

Keyword(s): Machine Learning ◽ Computer Security ◽ Data Structures ◽ Linear Time ◽ Data Representation ◽ Attack Detection ◽ Probabilistic Data ◽ Efficient Machine ◽ The Right ◽ Over Time

Detecting and fending off attacks on computer systems is an enduring problem in computer security. In light of a plethora of different threats and the growing automation used by attackers, we are in urgent need of more advanced methods for attack detection. Manually crafting detection rules is by no means feasible at scale, and automatically generated signatures often lack context, such that they fall short in detecting slight variations of known threats. In the thesis “Efficient Machine Learning for Attack Detection” [35], we address the necessity of advanced attack detection. For the effective application of machine learning in this domain, periodic retraining over time is crucial. We show that with the right data representation, efficient algorithms for mining substring statistics, and implementations based on probabilistic data structures, training the underlying model for establishing a higher degree of automation for defenses can be achieved in linear time.
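
The abstract does not say which probabilistic data structure or substring representation is used. As one possible reading, the sketch below assumes byte n-grams stored in a Bloom filter, so that training touches each n-gram once and therefore runs in time linear in the input size. `NgramBloomFilter`, `train`, `unseen_fraction`, and all parameter values are hypothetical.

```python
import hashlib


class NgramBloomFilter:
    """Approximate set of byte n-grams: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive several hash functions from BLAKE2b by varying the salt.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def ngrams(data: bytes, n: int):
    """Yield every contiguous n-byte substring of data."""
    return (data[i:i + n] for i in range(len(data) - n + 1))


def train(samples, n=3):
    """One pass over the training samples: linear in the total number of bytes."""
    bf = NgramBloomFilter()
    for sample in samples:
        for gram in ngrams(sample, n):
            bf.add(gram)
    return bf


def unseen_fraction(bf, data: bytes, n=3):
    """Anomaly score: share of n-grams in data that were not seen during training."""
    grams = list(ngrams(data, n))
    if not grams:
        return 0.0
    return sum(gram not in bf for gram in grams) / len(grams)
```

Because the filter never grows, retraining on fresh data costs the same per byte regardless of how many distinct n-grams have been seen, which is the kind of property periodic retraining would need.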

Anomaly Detection in Market Data Structures Via Machine Learning Algorithms

SSRN Electronic Journal ◽ 10.2139/ssrn.3516028 ◽ 2020

Author(s): Dirk Röder ◽ Henning Mueller

Keyword(s): Machine Learning ◽ Anomaly Detection ◽ Data Structures ◽ Learning Algorithms ◽ Machine Learning Algorithms ◽ Market Data

An efficient machine learning model for malicious activities recognition in water‐based industrial internet of things

Security and Privacy ◽ 10.1002/spy2.154 ◽ 2021

Author(s): Gamal E. I. Selim ◽ Ezz El‐Din Hemdan ◽ Ahmed M. Shehata ◽ Nawal A. El‐Fishawy

Keyword(s): Machine Learning ◽ Internet Of Things ◽ Learning Model ◽ Industrial Internet Of Things ◽ Industrial Internet ◽ Machine Learning Model ◽ Water Based ◽ Efficient Machine

Analyzing the Interplay Between Random Shuffling and Storage Devices for Efficient Machine Learning

2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) ◽ 10.1109/ispass51385.2021.00050 ◽ 2021

Author(s): Zhi-Lin Ke ◽ Hsiang-Yun Cheng ◽ Chia-Lin Yang ◽ Han-Wei Huang

Keyword(s): Machine Learning ◽ Storage Devices ◽ Efficient Machine ◽ And Storage

An Efficient Machine Learning Framework for Stress Prediction via Sensor Integrated Keyboard Data

IEEE Access ◽ 10.1109/access.2021.3094334 ◽ 2021 ◽ pp. 1-1

Author(s): P.B. Pankajavalli ◽ G.S. Karthick ◽ R. Sakthivel

Keyword(s): Machine Learning ◽ Learning Framework ◽ Stress Prediction ◽ Efficient Machine

Machine Learning Methods Applied to the Prediction of Pseudo-nitzschia spp. Blooms in the Galician Rias Baixas (NW Spain)

ISPRS International Journal of Geo-Information ◽ 10.3390/ijgi10040199 ◽ 2021 ◽ Vol 10 (4) ◽ pp. 199

Author(s): Francisco M. Bellas Aláez ◽ Jesus M. Torres Palenzuela ◽ Evangelos Spyrakos ◽ Luis González Vilas

Keyword(s): Machine Learning ◽ Performance Metrics ◽ Prediction Models ◽ Support Vector ◽ False Alarms ◽ Learning Approaches ◽ Learning Methods ◽ Machine Learning Methods ◽ Rías Baixas ◽ New Algorithms

This work presents new prediction models based on recent developments in machine learning methods, such as Random Forest (RF) and AdaBoost, and compares them with more classical approaches, i.e., support vector machines (SVMs) and neural networks (NNs). The models predict Pseudo-nitzschia spp. blooms in the Galician Rias Baixas. This work builds on a previous study by the authors (doi.org/10.1016/j.pocean.2014.03.003) but uses an extended database (from 2002 to 2012) and new algorithms. Our results show that RF and AdaBoost provide better prediction results compared to SVMs and NNs, as they show improved performance metrics and a better balance between sensitivity and specificity. Classical machine learning approaches show higher sensitivities, but at a cost of lower specificity and higher percentages of false alarms (lower precision). These results seem to indicate a greater adaptation of new algorithms (RF and AdaBoost) to unbalanced datasets. Our models could be operationally implemented to establish a short-term prediction system.
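
The trade-off described here rests on three standard confusion-matrix ratios. The sketch below shows how they are computed. The function name `classification_metrics` is hypothetical, and the counts in the usage note are made up for illustration, not taken from the paper.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity, and precision from confusion-matrix counts."""

    def ratio(num, den):
        # An undefined ratio (empty denominator) is reported as NaN rather than raising.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of real blooms that were predicted
        "specificity": ratio(tn, tn + fp),  # share of non-bloom cases correctly rejected
        "precision": ratio(tp, tp + fp),    # share of predicted blooms that were real
    }
```

With hypothetical counts `tp=8, fp=12, tn=80, fn=2`, sensitivity is 0.8 while precision is only 0.4. On an unbalanced dataset a model can catch most positive cases and still raise more false alarms than true ones.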

MicroRNA expression classification for pediatric multiple sclerosis identification

Journal of Ambient Intelligence and Humanized Computing ◽ 10.1007/s12652-021-03091-2 ◽ 2021

Author(s): Gabriella Casalino ◽ Giovanna Castellano ◽ Arianna Consiglio ◽ Nicoletta Nuzziello ◽ Gennaro Vessio

Keyword(s): Machine Learning ◽ Multiple Sclerosis ◽ Expression Profiles ◽ Healthy Children ◽ Multifactorial Diseases ◽ Hyperactivity Disorder ◽ Pediatric Multiple Sclerosis ◽ Mirna Expression Profiles ◽ Selection Algorithms ◽ Expression Classification

MicroRNAs (miRNAs) are a set of short non-coding RNAs that play significant regulatory roles in cells. The study of miRNA data produced by Next-Generation Sequencing techniques can be a valuable aid in the analysis of multifactorial diseases, such as Multiple Sclerosis (MS). Although extensive studies have been conducted on young adults affected by MS, very little work has been done to investigate the pathogenic mechanisms in pediatric patients, and none from a machine learning perspective. In this work, we report the experimental results of a classification study aimed at evaluating the effectiveness of machine learning methods in automatically distinguishing pediatric MS patients from healthy children, based on their miRNA expression profiles. Additionally, since Attention Deficit Hyperactivity Disorder (ADHD) shares some cognitive impairments with pediatric MS, we also included patients affected by ADHD in our study. Encouraging results were obtained with an artificial neural network model based on a set of features automatically selected by feature selection algorithms. The results show that models developed on automatically selected features outperform models based on a set of features selected by human experts. Developing an automatic predictive model can support clinicians in early MS diagnosis and provide new insights that can help find novel molecular pathways involved in MS.

Competitive Caching with Machine Learned Advice

Journal of the ACM ◽ 10.1145/3447579 ◽ 2021 ◽ Vol 68 (4) ◽ pp. 1-25

Author(s): Thodoris Lykouris ◽ Sergei Vassilvitskii

Keyword(s): Online Algorithms ◽ Empirical Evaluation ◽ Optimal Solution ◽ Poor Performance ◽ Machine Learning Algorithms ◽ Average Error ◽ Generalization Error ◽ Worst Case ◽ Future Events ◽ Real World Datasets

Traditional online algorithms encapsulate decision making under uncertainty, and give ways to hedge against all possible future events, while guaranteeing a nearly optimal solution, as compared to an offline optimum. On the other hand, machine learning algorithms are in the business of extrapolating patterns found in the data to predict the future, and usually come with strong guarantees on the expected generalization error. In this work, we develop a framework for augmenting online algorithms with a machine learned predictor to achieve competitive ratios that provably improve upon unconditional worst-case lower bounds when the predictor has low error. Our approach treats the predictor as a complete black box and is not dependent on its inner workings or the exact distribution of its errors. We apply this framework to the traditional caching problem: creating an eviction strategy for a cache of size k. We demonstrate that naively following the oracle’s recommendations may lead to very poor performance, even when the average error is quite low. Instead, we show how to modify the Marker algorithm to take into account the predictions and prove that this combined approach achieves a competitive ratio that both (i) decreases as the predictor’s error decreases and (ii) is always capped by O(log k), which can be achieved without any assistance from the predictor. We complement our results with an empirical evaluation of our algorithm on real-world datasets and show that it performs well empirically even when using simple off-the-shelf predictions.
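
For orientation, the sketch below shows the classical randomized Marker algorithm that the abstract takes as its starting point. It is the unmodified baseline, not the authors' prediction-augmented variant, whose details the abstract does not give. The function name `marker_cache` and its parameters are hypothetical.

```python
import random


def marker_cache(requests, k, rng=None):
    """Simulate the classical randomized Marker algorithm; return the number of misses.

    requests: iterable of hashable, mutually comparable page ids
    k: cache capacity (must be at least 1)
    """
    if k < 1:
        raise ValueError("cache size k must be at least 1")
    rng = rng or random.Random(0)
    cache = set()
    marked = set()
    misses = 0
    for page in requests:
        if page not in cache:
            misses += 1
            if len(cache) >= k:
                unmarked = cache - marked
                if not unmarked:
                    # Every cached page is marked: a new phase starts, all marks are cleared.
                    marked.clear()
                    unmarked = set(cache)
                # Evict an unmarked page chosen uniformly at random.
                victim = rng.choice(sorted(unmarked))
                cache.remove(victim)
            cache.add(page)
        # A requested page is marked, whether it was a hit or a miss.
        marked.add(page)
    return misses
```

A prediction-aware variant would replace the uniformly random choice of `victim` with a choice informed by the predictor. How to do that while keeping the O(log k) cap is the subject of the paper and is not attempted here.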

Cocrystal Prediction Using Machine Learning Models and Descriptors

Applied Sciences ◽ 10.3390/app11031323 ◽ 2021 ◽ Vol 11 (3) ◽ pp. 1323

Author(s): Medard Edmund Mswahili ◽ Min-Jeong Lee ◽ Gati Lother Martin ◽ Junghyun Kim ◽ Paul Kim ◽ ...

Keyword(s): Machine Learning ◽ Academic Research ◽ Pharmaceutical Research ◽ Machine Learning Techniques ◽ Learning Models ◽ Pharmaceutical Ingredients ◽ Learning Techniques ◽ Comparable Performance ◽ Selection Algorithms ◽ Machine Learning Models

Cocrystals are of much interest in industrial applications as well as academic research, and screening suitable coformers for active pharmaceutical ingredients is the most crucial and challenging step in cocrystal development. Recently, machine learning techniques have been attracting researchers in many fields, including pharmaceutical research areas such as quantitative structure-activity/property relationships. In this paper, we develop machine learning models to predict cocrystal formation. We extract descriptor values from the simplified molecular-input line-entry system (SMILES) representations of compounds and compare the machine learning models through experiments on our collected data of 1476 instances. We found that an artificial neural network shows great potential, as it has the best accuracy, sensitivity, and F1 score. We also found that the model achieved comparable performance with about half of the descriptors, chosen by feature selection algorithms. We believe that this will contribute to faster and more accurate cocrystal development.

Predicting breast cancer survivability based on machine learning and features selection algorithms: a comparative study

Journal of Ambient Intelligence and Humanized Computing ◽ 10.1007/s12652-020-02590-y ◽ 2020

Author(s): Sahar A. El_Rahman

Keyword(s): Breast Cancer ◽ Machine Learning ◽ Comparative Study ◽ Features Selection ◽ Selection Algorithms
