Efficient Distributed Preprocessing Model for Machine Learning-Based Anomaly Detection over Large-Scale Cybersecurity Datasets

2020 ◽  
Vol 10 (10) ◽  
pp. 3430
Author(s):  
Xavier Larriva-Novo ◽  
Mario Vega-Barbas ◽  
Víctor A. Villagrá ◽  
Diego Rivera ◽  
Manuel Álvarez-Campana ◽  
...  

New computational and technological paradigms that currently guide developments in the information society, e.g., the Internet of Things, pervasive technology, and ubiquitous computing (Ubicomp), favor the appearance of new intrusion vectors that can directly affect people’s daily lives. This, together with advances in the techniques and methods used to develop new cyber-attacks, exponentially increases the number of cyber threats affecting the information society. Consequently, the development and improvement of technology that assists cybersecurity experts in preventing and detecting attacks has become a fundamental pillar of the field of cybersecurity. In particular, intrusion detection systems are now an essential tool in the provision of services through the internet. However, these systems have certain limitations, e.g., false positives and limited real-time analytics, which require their operation to be supervised. It is therefore necessary to offer architectures and systems that favor an efficient analysis of the data handled by these tools. To this end, this paper presents a new data-preprocessing model based on a novel distributed computing architecture focused on large-scale datasets such as UGR’16. In addition, the paper analyzes the use of machine learning techniques to improve the response and efficiency of the proposed preprocessing model. The developed solution achieves good results in terms of computational performance. Finally, the proposal shows the suitability of decision tree algorithms, compared with a multilayer perceptron neural network, for training a machine learning model on a large dataset.
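The decision-tree-versus-MLP comparison described above can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic stand-in data, not the paper's actual pipeline or the UGR'16 dataset; all dataset parameters and model hyperparameters are assumptions for the sketch.

```python
import time

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a large labeled traffic dataset (attack vs. normal).
X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

results = {}
for name, model in [
    ("decision tree", DecisionTreeClassifier(max_depth=10, random_state=0)),
    ("MLP", MLPClassifier(hidden_layer_sizes=(64,), max_iter=100, random_state=0)),
]:
    t0 = time.perf_counter()
    model.fit(X_tr, y_tr)  # the tree typically fits much faster at this scale
    elapsed = time.perf_counter() - t0
    results[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: accuracy={results[name]:.3f}, fit time={elapsed:.1f}s")
```

Timing the `fit` call alongside held-out accuracy mirrors the paper's framing: with large datasets, training cost matters as much as raw accuracy.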

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
H. P. Menke ◽  
J. Maes ◽  
S. Geiger

Abstract: The permeability of a pore structure is typically described by stochastic representations of its geometrical attributes (e.g. pore-size distribution, porosity, coordination number). Database-driven numerical solvers for large model domains can only accurately predict large-scale flow behavior when they incorporate upscaled descriptions of that structure. The upscaling is particularly challenging for rocks with multimodal porosity structures such as carbonates, where several different types of structures (e.g. micro-porosity, cavities, fractures) interact. It is the connectivity both within and between these fundamentally different structures that ultimately controls the porosity–permeability relationship at larger length scales. Recent advances in machine learning techniques, combined with both numerical modelling and informed structural analysis, have allowed us to probe the relationship between structure and permeability much more deeply. We have used this integrated approach to tackle the challenge of upscaling multimodal and multiscale porous media. We present a novel method for upscaling multimodal porosity–permeability relationships using machine learning based multivariate structural regression. A micro-CT image of Estaillades limestone was divided into small 60³ and 120³ sub-volumes and permeability was computed using the Darcy–Brinkman–Stokes (DBS) model. The microporosity–porosity–permeability relationship from Menke et al. (Earth Arxiv, https://doi.org/10.31223/osf.io/ubg6p, 2019) was used to assign permeability values to the cells containing microporosity. Structural attributes (porosity, phase connectivity, volume fraction, etc.) of each sub-volume were extracted using image analysis tools and then regressed against the solved DBS permeability using an Extra-Trees regression model to derive an upscaled porosity–permeability relationship.
Ten test cases of 360³ voxels were then modeled using Darcy-scale flow with this machine-learning-predicted upscaled porosity–permeability relationship and benchmarked against full DBS simulations, a numerically upscaled Darcy flow model, and a Kozeny–Carman model. All numerical simulations were performed using GeoChemFoam, our in-house open-source pore-scale simulator based on OpenFOAM. We found good agreement between the full DBS simulations and both the numerical and machine learning upscaled models, with the machine learning model being 80 times less computationally expensive. The Kozeny–Carman model was a poor predictor of upscaled permeability in all cases.
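The Extra-Trees regression step described above can be sketched in a few lines. The feature names, the synthetic porosity–permeability relationship, and the sample count below are all illustrative placeholders, not the paper's actual sub-volume data.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500  # number of sub-volumes (placeholder)

# Illustrative structural attributes extracted per sub-volume.
porosity = rng.uniform(0.05, 0.35, n)
connectivity = rng.uniform(0.0, 1.0, n)
micro_fraction = rng.uniform(0.0, 0.5, n)

# Synthetic log-permeability, loosely increasing with porosity/connectivity,
# standing in for the solved DBS permeability of each sub-volume.
log_k = 2.0 * porosity + 1.5 * connectivity + rng.normal(0, 0.05, n)

X = np.column_stack([porosity, connectivity, micro_fraction])
X_tr, X_te, y_tr, y_te = train_test_split(X, log_k, random_state=0)

# Regress structural attributes against permeability with Extra-Trees.
model = ExtraTreesRegressor(n_estimators=200, random_state=0)
model.fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)
print(f"held-out R^2: {r2:.3f}")
```

The fitted model then plays the role of the upscaled porosity–permeability relationship: Darcy-scale cells are assigned a permeability from their structural attributes rather than from a full pore-scale simulation.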


2021 ◽  
Vol ahead-of-print (ahead-of-print) ◽  
Author(s):  
Lam Hoang Viet Le ◽  
Toan Luu Duc Huynh ◽  
Bryan S. Weber ◽  
Bao Khac Quoc Nguyen

Purpose: This paper aims to identify the disproportionate impacts of the COVID-19 pandemic on labor markets.
Design/methodology/approach: The authors conduct a large-scale survey of 16,000 firms from 82 industries in Ho Chi Minh City, Vietnam, and analyze the data set using different machine-learning methods.
Findings: First, job loss and labor reduction in state-owned enterprises have been significantly larger than in other types of organizations. Second, employees of foreign direct investment enterprises suffer a significantly lower labor income than those of other groups. Third, the adverse effects of the COVID-19 pandemic on the labor market are heterogeneous across industries and geographies. Finally, firms with high revenue in 2019 are more likely to adopt preventive measures, including the reduction of labor forces. The authors also find a significant correlation between firms' revenue and labor reduction, as both traditional econometrics and machine-learning techniques suggest.
Originality/value: This study has two main policy implications. First, although government support through taxes has been provided, the authors highlight evidence that there may be additional benefit from targeting firms with characteristics associated with layoffs or other negative labor responses. Second, the authors provide information showing which firm characteristics are associated with particular labor market responses such as layoffs, which may help target stimulus packages. Although the COVID-19 pandemic affects most industries and occupations, heterogeneous firm responses suggest several varieties of targeted policies: targeting firms that are likely to reduce labor forces or firms likely to face reduced revenue. In this paper, the authors outline several industries and firm characteristics that appear to be more directly reducing employee counts or having negative labor responses, which may lead to more cost-effective stimulus.
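One way a machine-learning method can surface firm characteristics associated with labor reduction, in the spirit of the analysis above, is through tree-based feature importances. Every variable name, coefficient, and distribution below is a hypothetical illustration; none is taken from the paper's survey data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2_000  # hypothetical number of surveyed firms

# Illustrative firm characteristics.
revenue_2019 = rng.lognormal(mean=10, sigma=1, size=n)
is_state_owned = rng.integers(0, 2, n)
is_fdi = rng.integers(0, 2, n)  # pure noise in this synthetic setup

# Synthetic outcome: higher-revenue and state-owned firms reduce labor more.
logit = 0.5 * np.log(revenue_2019) - 5 + 0.8 * is_state_owned
reduced_labor = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([np.log(revenue_2019), is_state_owned, is_fdi])
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, reduced_labor)

# Importances highlight which characteristics drive the predicted response.
for name, imp in zip(["log_revenue", "state_owned", "fdi"],
                     clf.feature_importances_):
    print(f"{name}: importance={imp:.3f}")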


2020 ◽  
Author(s):  
Young Min Park ◽  
Byung-Joo Lee

Abstract
Background: This study analyzed the prognostic significance of nodal factors, including the number of metastatic lymph nodes (LNs) and the lymph node ratio (LNR), in patients with papillary thyroid cancer (PTC), and attempted to construct a disease-recurrence prediction model using machine learning techniques.
Methods: We retrospectively analyzed clinico-pathologic data from 1040 patients diagnosed with papillary thyroid cancer between 2003 and 2009.
Results: We analyzed clinico-pathologic factors related to recurrence through logistic regression analysis. Among the factors included, only sex and tumor size were significantly correlated with disease recurrence. Parameters such as age, sex, tumor size, tumor multiplicity, ETE, ENE, pT, pN, ipsilateral central LN metastasis, contralateral central LN metastasis, number of metastatic LNs, and LNR were input for the construction of a machine learning prediction model. The performance of five machine learning models for recurrence prediction was compared based on accuracy. The decision tree model showed the best accuracy at 95%, and both the LightGBM and stacking models showed 93% accuracy.
Conclusions: We confirmed that all machine learning prediction models showed an accuracy of 90% or more for predicting disease recurrence in PTC. Large-scale multicenter clinical studies should be performed to improve the performance of our prediction models and verify their clinical effectiveness.
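A model comparison of the kind reported above can be sketched with scikit-learn. The data here are synthetic placeholders for the clinico-pathologic features, and gradient boosting stands in for LightGBM; hyperparameters are assumptions, so the accuracies are not the paper's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder for clinico-pathologic features (age, tumor size, LNR, ...),
# with a recurrence rate of roughly 10% (the minority class).
X, y = make_classification(n_samples=1_000, n_features=12,
                           weights=[0.9, 0.1], random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "boosting": GradientBoostingClassifier(random_state=0),
    "stacking": StackingClassifier(
        estimators=[("dt", DecisionTreeClassifier(max_depth=5, random_state=0)),
                    ("gb", GradientBoostingClassifier(random_state=0))],
        final_estimator=LogisticRegression()),
}

# Compare models by mean cross-validated accuracy, as in the study.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: mean CV accuracy = {acc:.3f}")
```

With a 90/10 class split, raw accuracy has a high baseline, which is one reason recurrence-prediction studies often report additional metrics alongside it.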


2021 ◽  
Author(s):  
Aurore Lafond ◽  
Maurice Ringer ◽  
Florian Le Blay ◽  
Jiaxu Liu ◽  
Ekaterina Millan ◽  
...  

Abstract: Abnormal surface pressure is typically the first indicator of a number of problematic events, including kicks, losses, washouts and stuck pipe. These events account for 60–70% of all drilling-related nonproductive time, so their early and accurate detection has the potential to save the industry billions of dollars. Detecting these events today requires an expert user watching multiple curves, which can be costly and subject to human error. The solution presented in this paper aims to augment traditional models with new machine learning techniques that detect these events automatically and help monitor the drilling well. Today’s real-time monitoring systems employ complex physical models to estimate surface standpipe pressure while drilling. These require many inputs and are difficult to calibrate. Machine learning is an alternative method of predicting pump pressure, but on its own it needs significant labelled training data, which is often lacking in the drilling world. The new system combines these approaches: a machine learning framework enables automated learning while the physical models compensate for any gaps in the training data. The system uses only standard surface measurements, is fully automated, and is continuously retrained while drilling to ensure the most accurate pressure prediction. In addition, a stochastic (Bayesian) machine learning technique is used, which yields not only a prediction of the pressure but also the uncertainty and confidence of that prediction. Last, the new system includes a data quality control workflow. It discards periods of low data quality from the pressure anomaly detection and enables smarter real-time event analysis. The new system has been tested on historical wells using a new test and validation framework.
The framework runs the system automatically on large volumes of both historical and simulated data, enabling the results to be cross-referenced with observations. In this paper, we show the results of the automated test framework as well as the capabilities of the new system in two case studies, one on land and one offshore. Moreover, large-scale statistics highlight the reliability and efficiency of this new detection workflow. The new system builds on the industry trend toward better capturing and utilizing digital data to optimize drilling.


2022 ◽  
pp. 220-249
Author(s):  
Md Ariful Haque ◽  
Sachin Shetty

Financial sectors are lucrative cyber-attack targets because of the immediate financial gain they offer. As a result, financial institutions face challenges in developing systems that can automatically identify security breaches and separate fraudulent transactions from legitimate ones. Today, organizations widely use machine learning techniques to identify fraudulent behavior in customers' transactions. However, applying machine learning techniques is often challenging because financial institutions' confidentiality policies prevent the sharing of customer transaction data. This chapter discusses some crucial challenges of handling cybersecurity and fraud in the financial industry and building machine learning-based models to address those challenges. The authors utilize an open-source e-commerce transaction dataset to illustrate the forensic process by creating a machine learning model to classify fraudulent transactions. Overall, the chapter focuses on how machine learning models can help detect and prevent fraudulent activities in the financial sector in the age of cybersecurity.
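A fraud-classification step of the kind the chapter describes can be sketched on an imbalanced synthetic dataset. The chapter's actual dataset, features, and model choice are not reproduced here; the imbalance ratio and classifier below are assumptions for the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# ~2% fraudulent transactions, mirroring the typical class imbalance.
X, y = make_classification(n_samples=10_000, n_features=15,
                           weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" counteracts the imbalance during training.
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)

acc = clf.score(X_te, y_te)
# Per-class precision/recall matters more than accuracy when fraud is rare.
print(classification_report(y_te, clf.predict(X_te),
                            target_names=["legit", "fraud"]))
```

Reporting per-class precision and recall, rather than accuracy alone, is the standard safeguard here: a model that labels everything "legit" would already score 98% accuracy on this split.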


2019 ◽  
Vol 9 (23) ◽  
pp. 5003 ◽  
Author(s):  
Francesco Zola ◽  
Jan Lukas Bruse ◽  
Maria Eguimendia ◽  
Mikel Galar ◽  
Raul Orduna Urrutia

The Bitcoin network is not only vulnerable to cyber-attacks but also currently the most frequently used cryptocurrency for concealing illicit activities. Typically, Bitcoin activity is monitored by reducing the anonymity of its entities using machine learning-based techniques that consider the whole blockchain. This entails two issues: first, it increases the complexity of the analysis, requiring greater effort; second, it may hide network micro-dynamics important for detecting short-term changes in entity behavioral patterns. The aim of this paper is to address both issues by performing a “temporal dissection” of the Bitcoin blockchain, i.e., dividing it into smaller temporal batches to achieve entity classification. The idea is that a machine learning model trained on a certain time interval (batch) should achieve good classification performance when tested on another batch if entity behavioral patterns are similar. We apply cascading machine learning principles (a type of ensemble learning applying stacking techniques), introducing a “k-fold cross-testing” concept across batches of varying size. Results show that the blockchain batch size used for entity classification could be reduced for certain classes (Exchange, Gambling, and eWallet), as classification rates did not vary significantly with batch size, suggesting that behavioral patterns did not change significantly over time. Mixer and Market class detection, however, can be negatively affected. A deeper analysis of Mining Pool behavior showed that models trained on recent data perform better than models trained on older data, suggesting that “typical” Mining Pool behavior may be better represented by recent data. This work provides a first step towards uncovering entity behavioral changes via temporal dissection of blockchain data.
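The cross-testing idea above, training on one temporal batch and evaluating on every other, can be sketched as a score matrix. The batches below are synthetic and deliberately drawn from different distributions to mimic behavioral drift; this is an illustration of the concept, not the paper's cascading-ensemble pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Four synthetic "temporal batches"; different seeds yield different
# feature-label relationships, standing in for drifting entity behavior.
batches = [make_classification(n_samples=500, n_features=10, random_state=s)
           for s in range(4)]

scores = np.zeros((4, 4))
for i, (X_tr, y_tr) in enumerate(batches):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_tr, y_tr)
    for j, (X_te, y_te) in enumerate(batches):
        scores[i, j] = accuracy_score(y_te, clf.predict(X_te))

# Diagonal entries are in-sample (train and test on the same batch) and so
# run high; low off-diagonal scores signal changed behavioral patterns.
print(np.round(scores, 2))
```

Stable off-diagonal scores would indicate that a class's behavior generalizes across time, the paper's criterion for safely shrinking the batch size used for classification.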

