EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms

Mathematics ◽

10.3390/math8060900 ◽

2020 ◽

Vol 8 (6) ◽

pp. 900

Author(s):

Sašo Karakatič

Keyword(s):

Machine Learning ◽

Heuristic Algorithms ◽

Optimization Algorithms ◽

Imbalanced Data ◽

Application Programming Interface ◽

Data Preprocessing ◽

Supervised Machine Learning ◽

Data Sampling ◽

Imbalanced Data Sets ◽

Under Sampling

The quality of machine learning models can suffer when inappropriate data is used, a problem that is especially prevalent in high-dimensional and imbalanced data sets. Data preparation and preprocessing can mitigate some of these problems and thus result in better models. The use of meta-heuristic and nature-inspired methods for data preprocessing has become common, but these approaches are still not readily available to practitioners through a simple and extendable application programming interface (API). This paper presents EvoPreprocess, an open-source Python framework that preprocesses data using evolutionary and nature-inspired optimization algorithms. The main problems addressed by the framework are data sampling (simultaneous over- and under-sampling of data instances), feature selection and data weighting for supervised machine learning problems. The EvoPreprocess framework provides a simple, object-oriented and parallelized API for the preprocessing tasks and can be used with the scikit-learn and imbalanced-learn Python machine learning libraries. The framework uses well-known self-adaptive nature-inspired meta-heuristic algorithms and can easily be extended with custom optimization and evaluation strategies. The paper presents the architecture of the framework, its use, experimental results and a comparison with other common preprocessing approaches.
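
To make the idea of feature selection by a nature-inspired optimizer concrete, here is a minimal sketch using a plain genetic algorithm over boolean feature masks. It does not use or reproduce the EvoPreprocess API: the function `evolve_feature_mask`, its parameters, and the toy fitness function are all hypothetical, and in practice the fitness would typically be a cross-validated model score.

```python
import random


def evolve_feature_mask(n_features, fitness, pop_size=20, generations=30,
                        mutation_rate=0.1, seed=0):
    """Search for a boolean feature mask that maximizes fitness(mask).

    Hypothetical helper for illustration; a plain genetic algorithm,
    not the self-adaptive algorithms mentioned in the abstract.
    """
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        offspring = [best[:]]  # elitism: carry the best mask over unchanged
        while len(offspring) < pop_size:
            # Binary tournament selection of two parents.
            p1 = max(rng.sample(population, 2), key=fitness)
            p2 = max(rng.sample(population, 2), key=fitness)
            # Uniform crossover followed by bit-flip mutation.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            child = [(not g) if rng.random() < mutation_rate else g
                     for g in child]
            offspring.append(child)
        population = offspring
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history


def toy_fitness(mask):
    # Assumed stand-in for a model score: features 0 and 2 are "informative",
    # and every selected feature costs a small penalty.
    informative = {0, 2}
    gain = sum(1.0 for i, keep in enumerate(mask) if keep and i in informative)
    return gain - 0.1 * sum(mask)


best_mask, history = evolve_feature_mask(6, toy_fitness)
```

Because the best mask is always copied into the next generation, the best score recorded in `history` never decreases. The same loop could in principle optimize instance weights or a sampling mask by changing the encoding and the fitness function.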

Detection of fraudulent credit card transactions: A comparative analysis of data sampling and classification techniques

Journal of Physics Conference Series ◽

10.1088/1742-6596/2161/1/012072 ◽

2022 ◽

Vol 2161 (1) ◽

pp. 012072

Author(s):

Konduri Praveen Mahesh ◽

Shaik Ashar Afrouz ◽

Anu Shaju Areeckal

Keyword(s):

Machine Learning ◽

Credit Card ◽

Research Problem ◽

Machine Learning Algorithms ◽

Support Vector ◽

Unbalanced Data ◽

Learning Approaches ◽

Data Sampling ◽

Sampled Data ◽

Under Sampling

Every year, a growing amount of money is lost to fraudulent credit card transactions. Recently, attention has turned to using machine learning algorithms to identify fraudulent transactions. The ratio of fraud cases to non-fraud transactions is very low, which produces skewed, or unbalanced, data and poses a challenge for training machine learning models. Public datasets for this research problem are scarce; the dataset used in this work was obtained from Kaggle. In this paper, we explore different sampling techniques, such as under-sampling, the Synthetic Minority Oversampling Technique (SMOTE) and SMOTE-Tomek, to handle the unbalanced data. Classification models, such as k-Nearest Neighbour (KNN), logistic regression, random forest and Support Vector Machine (SVM), are trained on the sampled data to detect fraudulent credit card transactions. The performance of the various machine learning approaches is evaluated in terms of precision, recall and F1-score. The classification results obtained are promising and can be used for credit card fraud detection.
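
As an illustration of why precision, recall and F1-score are used instead of plain accuracy on skewed data, the following sketch computes the three metrics from label lists. The helper name and the ten example labels are made up for illustration and are not taken from the paper or its Kaggle dataset.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 0, 0, 1, 0, 1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case accuracy is 0.8, while precision, recall and F1 on the fraud class are each 2/3, which shows how the majority class can inflate accuracy.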

Fluids and lithofacies prediction based on integration of well-log data and seismic inversion: a machine learning approach

Geophysics ◽

10.1190/geo2020-0521.1 ◽

2021 ◽

pp. 1-67

Author(s):

Luanxiao Zhao ◽

Caifeng Zou ◽

Yuanyuan Chen ◽

Wenlong Shen ◽

Yirong Wang ◽

...

Keyword(s):

Machine Learning ◽

Seismic Data ◽

Domain Knowledge ◽

Reservoir Characterization ◽

Model Building ◽

Imbalanced Data ◽

Seismic Inversion ◽

Well Log ◽

Spatial Constraints ◽

Imbalanced Data Sets

Seismic prediction of fluid and lithofacies distributions is of great interest to reservoir characterization, geological model building, and flow unit delineation. Inferring fluids and lithofacies from seismic data under the framework of machine learning is commonly subject to issues of limited features, imbalanced data sets, and spatial constraints. As a consequence, an XGBoost based workflow, which takes feature engineering, data balancing, and spatial constraints into account, is proposed to predict the fluid and lithofacies distribution by integrating well-log and seismic data. The constructed feature set based on simple mathematical operations and domain knowledge outperforms the benchmark group consisting of conventional elastic attributes of P-impedance and Vp/Vs ratio. A radial basis function characterizing the weights of training samples according to the distances from the available wells to the target region is developed to impose spatial constraints on the model training process, significantly improving the prediction accuracy and reliability of gas sandstone. The strategy combining the synthetic minority oversampling technique (SMOTE) and spatial constraints further increases the F1 score of gas sandstone and also benefits the overall prediction performance of all the facies. The application of the combined strategy on prestack seismic inversion results generates a more geologically reasonable spatial distribution of fluids, thus verifying the robustness and effectiveness of the proposed workflow.
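
The abstract does not give the exact form of the radial basis function, so the following is only a sketch under an assumed Gaussian kernel: training samples from wells closer to the target region receive larger weights. The function names, the well coordinates and the length scale are hypothetical.

```python
import math


def rbf_weight(distance, length_scale):
    """Gaussian radial basis function (assumed form): 1 at distance 0."""
    return math.exp(-(distance / length_scale) ** 2)


def well_weights(well_xy, target_xy, length_scale):
    """Map each well name to a weight based on its distance to the target."""
    return {
        name: rbf_weight(math.dist(xy, target_xy), length_scale)
        for name, xy in well_xy.items()
    }


# Hypothetical well locations (arbitrary units) and a target at the origin.
wells = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (30.0, 40.0)}
weights = well_weights(wells, target_xy=(0.0, 0.0), length_scale=5.0)
```

Such weights could then be supplied as per-sample weights when fitting the classifier, so that samples from distant wells influence training less.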

Machine learning from imbalanced data sets for astronomical object classification

2011 International Conference of Soft Computing and Pattern Recognition (SoCPaR) ◽

10.1109/socpar.2011.6089283 ◽

2011 ◽

Cited By ~ 1

Author(s):

Jorge de la Calleja ◽

Antonio Benitez ◽

Ma. Auxilio Medina ◽

Olac Fuentes

Keyword(s):

Machine Learning ◽

Imbalanced Data ◽

Object Classification ◽

Data Sets ◽

Imbalanced Data Sets ◽

Astronomical Object

Predicting Depression from Smartphone Behavioral Markers Using Machine Learning Methods, Hyper-parameter Optimization, and Feature Importance Analysis: An Exploratory Study (Preprint)

10.2196/preprints.26540 ◽

2020 ◽

Author(s):

Kennedy Opoku Asare ◽

Yannik Terhorst ◽

Julio Vega ◽

Ella Peltonen ◽

Eemil Lagerspetz ◽

...

Keyword(s):

Machine Learning ◽

Age Distribution ◽

Area Under The Curve ◽

Imbalanced Data ◽

Positive Association ◽

Assessment Methods ◽

Supervised Machine Learning ◽

Significant Positive Association ◽

Depression Assessment ◽

Importance Analysis

BACKGROUND: Depression is a prevalent mental health challenge. Current depression assessment methods using self-reported and clinician-administered questionnaires have limitations. Instrumenting smartphones to passively and continuously collect moment-by-moment datasets that quantify human behaviours has the potential to augment current depression assessment methods for early diagnosis and for scalable, longitudinal monitoring of depression. OBJECTIVE: The objective of this study is to investigate the feasibility of predicting depression with human behaviours quantified from smartphone datasets, and to identify behaviours that can influence depression. METHODS: Smartphone datasets and self-reported eight-item Patient Health Questionnaire (PHQ-8) depression assessments were collected from 629 participants in an exploratory longitudinal study over an average of 22.1 days (SD=17.90, min=8, max=86). We quantified 22 regularity, entropy, and standard deviation behavioural markers from the smartphone usage data. We explored the linear relationship between the behavioural features and depression using correlation and bivariate linear mixed models (LMM). We leveraged 5 supervised machine learning (ML) algorithms with hyperparameter optimization, nested cross-validation, and imbalanced data handling to predict depression. Finally, with the Permutation Importance method, we identified influential behavioural markers in predicting depression. RESULTS: Of the 629 participants from at least 56 countries, 10.96% were female, 86.80% male, and 2.22% non-binary. By age, 11.61% were between 18–24 years, 32.43% 25–34, 24.80% 35–44, 26.39% 45–64, and 4.77% 65 years and over. Of the 1374 PHQ-8 assessments, 83.19% were non-depressed and 16.81% were depressed, based on the PHQ-8 cut-off. A significant positive Pearson’s correlation was found between screen status normalised entropy and depression (r=0.14, P<.001). The LMM demonstrated an intra-class correlation of 0.7584 and a significant positive association between screen status normalised entropy and depression (beta=.48, P=0.03). The best ML algorithms obtained precision (85.55%–92.50%), recall (92.19%–94.38%), F1 (88.73%–93.41%), area under the receiver operating characteristic curve (AUC; 94.68%–98.83%), Cohen’s kappa (86.61%–92.21%), and accuracy (96.44%–97.97%). Including age group and gender as predictors improved the ML performance. Screen and Internet connectivity features were the most influential in predicting depression. CONCLUSIONS: Our findings demonstrate that behavioural markers indicative of depression can be unobtrusively identified from smartphone sensors’ data. Traditional assessment of depression can be augmented with behavioural markers from smartphones for depression diagnosis and monitoring.
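
Cohen's kappa is reported alongside accuracy because it corrects for agreement expected by chance, which matters when one class dominates. The sketch below computes it as (p_o - p_e) / (1 - p_e) from label lists; the helper name and the example labels are illustrative assumptions, not data from the study.

```python
from collections import Counter


def cohens_kappa(y_true, y_pred):
    """Chance-corrected agreement between two label sequences."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Agreement expected if labels were assigned independently
    # with the same marginal frequencies.
    expected = sum(true_counts[c] * pred_counts[c]
                   for c in true_counts) / (n * n)
    if expected == 1.0:  # both sequences contain a single identical class
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up imbalanced labels: 8 non-depressed (0), 2 depressed (1).
y_true = [0] * 8 + [1] * 2
always_majority = [0] * 10  # a classifier that never predicts class 1

kappa_majority = cohens_kappa(y_true, always_majority)
kappa_perfect = cohens_kappa(y_true, y_true)
```

Here the always-majority predictor reaches 80% accuracy yet scores a kappa of 0, while perfect predictions score 1.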

Optimized machine learning algorithm for intrusion detection

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v24.i1.pp590-599 ◽

2021 ◽

Vol 24 (1) ◽

pp. 590

Author(s):

Royida A. Ibrahem Alhayali ◽

Mohammad Aljanabi ◽

Ahmed Hussein Ali ◽

Mostafa Abdulghfoor Mohammed ◽

Tole Sutikno

Keyword(s):

Machine Learning ◽

Intrusion Detection ◽

Learning Algorithm ◽

Detection System ◽

Optimization Algorithms ◽

Feature Subset Selection ◽

Supervised Machine Learning ◽

Support Vector ◽

Feature Subset ◽

Training Time

Intrusion detection is mainly achieved by using optimization algorithms. The need for optimization algorithms in intrusion detection arises from the increasing number of features in audit data, as well as from the shortcomings of human-based smart intrusion detection systems (IDS) in terms of prolonged training time and classification accuracy. This article presents an improved intrusion detection technique for binary classification. The proposal combines different optimizers, including the Rao optimization algorithm, extreme learning machine (ELM), support vector machine (SVM), and logistic regression (LR) (for feature selection and weighting), as well as a hybrid Rao-SVM algorithm with supervised machine learning (ML) techniques for feature subset selection (FSS). Selecting the smallest number of features without sacrificing FSS accuracy was treated as a multi-objective optimization problem. The algorithm-specific, parameter-less concept of the proposed Rao-SVM was also explored in this study. The KDDCup 99 and CICIDS 2017 datasets were used as the intrusion datasets for the experiments, where significant improvements were noted with the new Rao-SVM compared to the other algorithms. Rao-SVM produced better results than many existing works, reaching 100% accuracy on the KDDCup 99 dataset and 97% on the CICIDS dataset.

A Digital Oilfield Comprehensive Study: Automated Intelligent Production Network Optimization

10.2118/205735-ms ◽

2021 ◽

Author(s):

Aulia Ahmad Naufal ◽

Sabrina Metra

Keyword(s):

Machine Learning ◽

Network Model ◽

Application Programming Interface ◽

Production Optimization ◽

Supervised Machine Learning ◽

Production Data ◽

Production Network ◽

Daily Data ◽

Individual Branch ◽

Set Up

Production optimization at the network level has proven to be an effective way to maximize the production potential of a field with low capital. As it stands, however, it is a heavy process to start, with several challenges such as data quality issues and the tedious, repetitive work needed to deploy and reuse a complete network model. Leveraging a flow assurance simulator, a Python Application Programming Interface (API) toolkit, open-source machine learning packages in Python, and a commercial visualization dashboard, this paper proposes a series of workflows that simplify model deployment and set up an automatic advisory system, providing insight as a means of justifying an engineer’s day-to-day engineering decisions. Three steps were prepared to achieve a field-level automated optimization system. The first is the creation of a digital twin of the well and network model; a scalable Python script was written to eliminate potential data errors, reduce the time consumed, and merge the various parts of the model into one. The second is an automated calibration workflow, created because performance issues also arise in individual branch calibration matching; a combination of technologies was used to automate daily data acquisition and model updates from the production database and to run a supervised machine learning model that continuously calibrates the network model. The last is a customizable optimization workflow based on field KPIs, whose results are derived from a daily optimization run. The results are available in a personalized network surveillance dashboard that engineers can use to make rapid decisions. In the first and second steps, the time consumed was reduced from 30 minutes/well to 10 minutes/well in the bulk well modelling workflow and from 2 hours to 10 minutes for the network model merge, assuming 100 wells in one network. This also greatly reduces data integrity and consistency issues, as it eliminates the wearisome input process. In the last step, the model was successfully updated with the latest production data, and the well IPRs’ liquid PI, reservoir pressure, and holdup factor were predicted by ML with more than 90% accuracy. For result delivery, the surveillance dashboard is populated daily with the network production data, flowing parameters, and operation recommendations. It is estimated that more than 90% of the time is saved in moving from manual individual runs to digital comprehensive optimization.

Uncertainty Based Under-Sampling for Learning Naive Bayes Classifiers Under Imbalanced Data Sets

IEEE Access ◽

10.1109/access.2019.2961784 ◽

2020 ◽

Vol 8 ◽

pp. 2122-2133 ◽

Cited By ~ 4

Author(s):

Christos K. Aridas ◽

Stamatis Karlos ◽

Vasileios G. Kanas ◽

Nikos Fazakis ◽

Sotiris B. Kotsiantis

Keyword(s):

Naive Bayes ◽

Imbalanced Data ◽

Naïve Bayes ◽

Data Sets ◽

Imbalanced Data Sets ◽

Under Sampling

Application of parallel distributed genetics-based machine learning to imbalanced data sets

2012 IEEE International Conference on Fuzzy Systems ◽

10.1109/fuzz-ieee.2012.6251192 ◽

2012 ◽

Cited By ~ 3

Author(s):

Yusuke Nojima ◽

Shingo Mihara ◽

Hisao Ishibuchi

Keyword(s):

Machine Learning ◽

Imbalanced Data ◽

Data Sets ◽

Imbalanced Data Sets

Hybrid Balancing Technique Using GRSOM and Bootstrap Algorithms for Classifiers with Imbalanced Data

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.931-932.1375 ◽

2014 ◽

Vol 931-932 ◽

pp. 1375-1381

Author(s):

Sirorat Pattanapairoj ◽

Danaipong Chetchotsak ◽

Banchar Arnonkijpanich

Keyword(s):

Machine Learning ◽

Imbalanced Data ◽

Computation Time ◽

Original Data ◽

Data Sets ◽

Bootstrap Sampling ◽

Under Sampling ◽

Hybrid Data ◽

Time Required ◽

Sampling Approach

To deal with imbalanced data, this paper proposes a hybrid data balancing technique that incorporates both over- and under-sampling approaches. The technique determines how much the minority data should be grown as well as how much the majority data should be reduced. In this manner, noise introduced into the data by excessive over-sampling can be avoided. In addition, the proposed technique helps to determine the appropriate size of the balanced data, which reduces the computation time required to construct classifiers. The technique over-samples the minority data using the GRSOM method and then under-samples the majority data using the bootstrap sampling approach. GRSOM is used in this study because it grows new samples in a non-linear fashion and preserves the original data structure. The performance of the proposed method is tested using four data sets from the UCI Machine Learning Repository. Once the data sets are balanced, a committee of classifiers is constructed from the balanced data. The experimental results reveal that the proposed data balancing method provides the best performance.
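
The two-sided balancing idea can be sketched as follows. This is not the paper's method: GRSOM is replaced here by simple random duplication of minority rows, and the function name, the target size and the example data are assumptions made for illustration. The sketch assumes len(minority) <= target_size <= len(majority).

```python
import random


def hybrid_balance(majority, minority, target_size, seed=0):
    """Bring both classes to target_size rows.

    Majority: bootstrap under-sampling (draw target_size rows with
    replacement). Minority: keep every row and top up with random
    duplicates (a stand-in for GRSOM, which is not implemented here).
    """
    rng = random.Random(seed)
    majority_sample = rng.choices(majority, k=target_size)
    extra = rng.choices(minority, k=max(0, target_size - len(minority)))
    minority_sample = list(minority) + extra
    return majority_sample, minority_sample


# Hypothetical data: row identifiers stand in for feature vectors.
majority_rows = list(range(100))       # 100 majority instances
minority_rows = list(range(100, 110))  # 10 minority instances

maj_bal, min_bal = hybrid_balance(majority_rows, minority_rows, target_size=30)
```

Choosing a target size between the two class sizes limits how many synthetic or duplicated minority rows are needed, which is the motivation the abstract gives for combining both directions.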

Predicting Breast Cancer via Supervised Machine Learning Methods on Class Imbalanced Data

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2020.0110808 ◽

2020 ◽

Vol 11 (8) ◽

Author(s):

Keerthana Rajendran ◽

Manoj Jayabalan ◽

Vinesh Thiruchelvam

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Imbalanced Data ◽

Supervised Machine Learning ◽

Learning Methods ◽

Machine Learning Methods
