Decision Trees and Random Forests: Machine Learning Techniques to Classify Rare Events

Simon Hegelich

doi:10.18278/epa.2.1.7

Ecological Interactions and the Netflix Problem

10.1101/089771 ◽

2016 ◽

Cited By ~ 1

Author(s):

Philippe Desjardins-Proulx ◽

Idaline Laigle ◽

Timothée Poisot ◽

Dominique Gravel

Keyword(s):

Machine Learning ◽

Supervised Learning ◽

Random Forests ◽

Species Interactions ◽

Similarity Measures ◽

Theoretical Models ◽

Machine Learning Techniques ◽

Nearest Neighbour ◽

Ecological Interactions ◽

Learning Techniques

Species interactions are a key component of ecosystems but we generally have an incomplete picture of who-eats-who in a given community. Different techniques have been devised to predict species interactions using theoretical models or abundances. Here, we explore the K nearest neighbour approach, with a special emphasis on recommendation, along with other machine learning techniques. Recommenders are algorithms developed for companies like Netflix to predict if a customer would like a product given the preferences of similar customers. These machine learning techniques are well-suited to study binary ecological interactions since they focus on positive-only data. We also explore how the K nearest neighbour approach can be used with both positive and negative information, in which case the goal of the algorithm is to fill missing entries from a matrix (imputation). By removing a prey from a predator, we find that recommenders can guess the missing prey around 50% of the time on the first try, with up to 881 possibilities. Traits do not significantly improve the results for the K nearest neighbour, although a simple test with a supervised learning approach (random forests) shows we can predict interactions with high accuracy using only three traits per species. This result shows that binary interactions can be predicted without regard to the ecological community given only three variables: body mass and two variables for the species’ phylogeny. These techniques are complementary, as recommenders can predict interactions in the absence of traits, using only information about other species’ interactions, while supervised learning algorithms such as random forests base their predictions on traits only but do not exploit other species’ interactions. Further work should focus on developing custom similarity measures specialized to ecology to improve the KNN algorithms and using richer data to capture indirect relationships between species.
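The recommendation idea in this abstract (remove one prey from a predator, then ask the K most similar predators what is missing) can be illustrated with a minimal sketch. The abstract does not state which similarity measure or voting rule the authors used, so the Tanimoto (Jaccard) similarity, the simple vote count, and the toy food web below are all assumptions made for illustration only.

```python
from collections import Counter

def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(diets: dict, predator: str, k: int = 2) -> list:
    """Rank prey absent from `predator`'s diet by votes from its k most similar predators."""
    target = diets[predator]
    # Sort the other predators by similarity to the target, most similar first.
    neighbours = sorted(
        (p for p in diets if p != predator),
        key=lambda p: tanimoto(target, diets[p]),
        reverse=True,
    )[:k]
    votes = Counter()
    for p in neighbours:
        for prey in diets[p] - target:
            votes[prey] += 1
    # Most votes first; ties broken alphabetically so the output is deterministic.
    return sorted(votes, key=lambda prey: (-votes[prey], prey))

# Hypothetical toy food web; "vole" has been removed from the fox's diet.
diets = {
    "fox": {"rabbit", "mouse"},
    "hawk": {"rabbit", "mouse", "vole", "sparrow"},
    "owl": {"mouse", "vole", "shrew"},
    "heron": {"frog", "fish"},
}
print(recommend(diets, "fox", k=2))
```

Here the first recommendation is the held-out prey, which corresponds to a success "on the first try" in the abstract's evaluation.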


Machine Learning Techniques Applied to Profile Mobile Banking Users in India

International Journal of Information Systems in the Service Sector ◽

10.4018/jisss.2013010105 ◽

2013 ◽

Vol 5 (1) ◽

pp. 82-92 ◽

Cited By ~ 8

Author(s):

M. Carr ◽

V. Ravi ◽

G. Sridharan Reddy ◽

D. Veranna

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Decision Tree ◽

Decision Trees ◽

Multilayer Perceptron ◽

Machine Learning Techniques ◽

Mobile Banking ◽

Classification Rules ◽

Learning Techniques ◽

Potential Customers

This paper profiles mobile banking users using machine learning techniques, viz. Decision Tree, Logistic Regression, Multilayer Perceptron, and SVM, to test a research model with fourteen independent variables and a dependent variable (adoption). A survey was conducted and the results were analysed using these techniques. Using Decision Trees, the profile of the mobile banking adopter was identified. Comparing different machine learning techniques, it was found that Decision Trees outperformed Logistic Regression, Multilayer Perceptron, and SVM. Out of all the techniques, Decision Tree is recommended for profiling studies because, apart from obtaining highly accurate results, it also yields ‘if–then’ classification rules. The classification rules provided here can be used to target potential customers to adopt mobile banking by offering them appropriate incentives.


Development and Optimization of VGF-GaAs Crystal Growth Process Using Data Mining and Machine Learning Techniques

Crystals ◽

10.3390/cryst11101218 ◽

2021 ◽

Vol 11 (10) ◽

pp. 1218

Author(s):

Natasha Dropka ◽

Klaus Böttcher ◽

Martin Holena

Keyword(s):

Machine Learning ◽

Data Mining ◽

Crystal Growth ◽

Decision Trees ◽

Growth Process ◽

Training Data ◽

Machine Learning Techniques ◽

Interface Position ◽

Crystal Growth Process ◽

Learning Techniques

The aim of this study was to assess the ability of the various data mining and supervised machine learning techniques: correlation analysis, k-means clustering, principal component analysis and decision trees (regression and classification), to derive, optimize and understand the factors influencing VGF-GaAs growth. Training data were generated by Computational Fluid Dynamics (CFD) simulations and consisted of 130 datasets with 6 inputs (growth rate and power of 5 heaters) and 5 outputs (interface position and deflection, and temperatures at various positions in GaAs). Data mining results confirmed a good dispersion of the training data without the feasibility of a dimensionality reduction. Data clustering was observed in relation to the position of the crystallization front relative to the side heaters. Based on the statistical performance criteria and training results, decision trees identified the most decisive inputs and their ranges for a favorable interface shape and to keep GaAs temperature beyond limits for heavy arsenic evaporation. Decision trees are a recommendable machine learning technique with short training times and acceptable predictive accuracy based on small volume of CFD training data, capable of providing guidelines for understanding the crystal growth process, which is a prerequisite for the growth of low-cost, high-quality bulk crystals.


Recognition of Soybean Diseases Using Machine Learning Techniques Based on Segmentation of Images Captured By UAVs

10.5753/wvc.2020.13476 ◽

2020 ◽

Author(s):

Gercina Da Silva ◽

Alessandro Ferreira ◽

Denilson Guilherme ◽

José Fernando Grigolli ◽

Vanessa Weber ◽

...

Keyword(s):

Machine Learning ◽

Computer Vision ◽

Computer Program ◽

Random Forests ◽

Machine Learning Techniques ◽

Target Spot ◽

Learning Techniques ◽

Image Dataset ◽

Segmentation Of Images ◽

Soybean Diseases

Soybean is an important product for the Brazilian economy; however, its productive income can be limited by factors such as diseases that are generally difficult to control. Thus, this article aims to use a computer program to recognize diseases in images obtained by a UAV in a soybean plantation. The program is based on computer vision and machine learning, using the SLIC algorithm to segment the images into superpixels. To achieve the objective, after the segmentation of the images, an image dataset was created with the following classes: mildew, target spot, Asian rust, soil, straw and healthy leaves, totaling 22,140 images. Diagrammatic scales were used to assess disease severity. The disease recognition computer program explored four supervised learning techniques: SVM, J48, Random Forest and KNN. The techniques that obtained the best performance were SVM and Random Forest, taking into account the results obtained with all the evaluation metrics used. It was found that the program is efficient at differentiating the classes of diseases treated in this article.


Detection of Loss Zones while Drilling Using Different Machine Learning Techniques

Journal of Energy Resources Technology ◽

10.1115/1.4051553 ◽

2021 ◽

pp. 1-29

Author(s):

Ahmed Alsaihati ◽

Mahmoud Abughaban ◽

Salaheldin Elkatatny ◽

Abdulazeez Abdulraheem

Keyword(s):

Machine Learning ◽

Support Vector Machines ◽

Random Forests ◽

Nearest Neighbors ◽

Machine Learning Techniques ◽

Support Vector ◽

K Nearest Neighbors ◽

Learning Techniques ◽

Vector Machines ◽

Testing Set

Fluid loss into formations is a common operational issue that is frequently encountered when drilling across naturally or induced fractured formations. This could pose significant operational risks, such as well-control, stuck pipe, and wellbore instability, which, in turn, lead to an increase of well time and cost. This research aims to use and evaluate different machine learning techniques, namely support vector machines, random forests, and K-nearest neighbors, in detecting loss circulation occurrences while drilling using solely drilling surface parameters. Actual field data of seven wells, which had suffered partial or severe loss circulation, were used to build predictive models, while Well-8 was used to compare the performance of the developed models. Recall, precision, and F1-score measures were used to evaluate the ability of the developed models to detect loss circulation occurrences. The results showed that the K-nearest neighbors classifier achieved a high F1-score of 0.912 in detecting loss circulation occurrence in the testing set, while random forests was the second-best classifier with almost the same F1-score of 0.910. The support vector machines achieved an F1-score of 0.83 in predicting the loss circulation occurrence in the testing set. The K-nearest neighbors outperformed the other models in detecting the loss circulation occurrences in Well-8 with an F1-score of 0.80. The main contribution of this research as compared to previous studies is that it identifies loss events based on real-time measurements of the active pit volume.
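For readers unfamiliar with the metrics named in this abstract, the sketch below shows how precision, recall, and F1-score are computed from binary labels. The label vectors are invented for illustration and are not data from the study; label 1 is assumed to mean "loss circulation event".

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```

Because F1 ignores true negatives, it is a common choice when the positive class (here, loss events) is comparatively rare.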


Validation of machine learning techniques: decision trees and finite training set

Journal of Electronic Imaging ◽

10.1117/1.482630 ◽

1998 ◽

Vol 7 (1) ◽

pp. 94 ◽

Cited By ~ 1

Author(s):

Geoffrey A. W. West

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Machine Learning Techniques ◽

Training Set ◽

Learning Techniques


Capacity Control in Indoor Spaces Using Machine Learning Techniques Together with BLE Technology

Journal of Sensor and Actuator Networks ◽

10.3390/jsan10020035 ◽

2021 ◽

Vol 10 (2) ◽

pp. 35

Author(s):

M. Encarnación Beato Gutiérrez ◽

Montserrat Mateos Sánchez ◽

Roberto Berjón Gallinas ◽

Ana M. Fermoso García

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Prediction Models ◽

Research Work ◽

Machine Learning Techniques ◽

Capacity Control ◽

The People ◽

Learning Techniques ◽

Enclosed Space ◽

Indoor Spaces

Capacity control in indoor spaces is currently critical because of the pandemic. In this work, we propose a new solution using machine learning techniques with BLE technology. This study presents a real experiment in a university environment in which we study three different prediction models using machine learning techniques: logistic regression, decision trees and artificial neural networks. In conclusion, the study shows that machine learning techniques, in particular decision trees, together with BLE technology, provide a solution to the problem. The contribution of this research work is to show that the prediction model obtained is capable of detecting when the COVID capacity of an enclosed space is exceeded. In addition, it ensures that no false negatives are produced, i.e., all the people inside the laboratory will be correctly counted.


Using machine learning techniques and different color spaces for the classification of Cape gooseberry (Physalis peruviana L.) fruits according to ripeness level

10.7287/peerj.preprints.26691 ◽

2019 ◽

Author(s):

Wilson Castro ◽

Jimy Oblitas ◽

Miguel De-la-Torre ◽

Carlos Cotrina ◽

Karen Bazán ◽

...

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Color Space ◽

Machine Learning Techniques ◽

Support Vector ◽

Color Spaces ◽

Learning Techniques ◽

Different Color ◽

Cape Gooseberry

The classification of fresh fruits according to their ripeness is typically a subjective and tedious task; consequently, there is growing interest in the use of non-contact techniques such as those based on computer vision and machine learning. In this paper, we propose the use of non-intrusive techniques for the classification of Cape gooseberry fruits. The proposal is based on the use of machine learning techniques combined with different color spaces. Given the success of techniques such as artificial neural networks, support vector machines, decision trees, and K-nearest neighbors in addressing classification problems, we decided to use these approaches in this research work. A sample of 926 Cape gooseberry fruits was obtained, and fruits were classified manually according to their level of ripeness into seven different classes. Images of each fruit were acquired in the RGB format through a system developed for this purpose. These images were preprocessed, filtered and segmented until the fruits were identified. For each piece of fruit, the median color parameter values in the RGB space were obtained, and these results were subsequently transformed into the HSV and L*a*b* color spaces. The values of each piece of fruit in the three color spaces and their corresponding degrees of ripeness were arranged for use in the creation, testing, and comparison of the developed classification models. The classification of gooseberry fruits by ripening level was found to be sensitive to both the color space used and the classification technique, e.g., the models based on decision trees are the most accurate, and the models based on the L*a*b* color space obtain the best mean accuracy. However, the model that best classifies the Cape gooseberry fruits based on ripeness level is that resulting from the combination of the SVM technique and the RGB color space.
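The feature-extraction step described above (median RGB values per fruit, then conversion to another color space) might look roughly like the following sketch. It uses only Python's standard library, covers the HSV conversion only (the L*a*b* conversion needs a reference white point and is omitted), and the pixel values are hypothetical rather than taken from the study.

```python
import colorsys
from statistics import median

def median_rgb(pixels):
    """Channel-wise median of a list of (R, G, B) tuples with values in 0..255."""
    reds, greens, blues = zip(*pixels)
    return median(reds), median(greens), median(blues)

def rgb255_to_hsv(rgb):
    """Convert an (R, G, B) tuple in 0..255 to (H, S, V), each in 0..1."""
    r, g, b = (channel / 255.0 for channel in rgb)
    return colorsys.rgb_to_hsv(r, g, b)

# Hypothetical pixels from the segmented region of one fruit.
fruit_pixels = [(255, 128, 0), (250, 130, 10), (240, 120, 5)]
features_rgb = median_rgb(fruit_pixels)
features_hsv = rgb255_to_hsv(features_rgb)
print(features_rgb, features_hsv)
```

Each fruit would then contribute one feature vector per color space, paired with its manually assigned ripeness class, to whichever classifier is being compared.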


Prediction of Mean Wave Overtopping Discharge Using Gradient Boosting Decision Trees

Water ◽

10.3390/w12061703 ◽

2020 ◽

Vol 12 (6) ◽

pp. 1703 ◽

Cited By ~ 3

Author(s):

Joost P. den Bieman ◽

Josefine M. Wilms ◽

Henk F. P. van den Boogaard ◽

Marcel R. A. van Gent

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Numerical Models ◽

Input Parameter ◽

Design Criterion ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Wave Overtopping ◽

Learning Techniques ◽

Machine Learning Model

Wave overtopping is an important design criterion for coastal structures such as dikes, breakwaters and promenades. Hence, the prediction of the expected wave overtopping discharge is an important research topic. Existing prediction tools consist of empirical overtopping formulae, machine learning techniques like neural networks, and numerical models. In this paper, an innovative machine learning method—gradient boosting decision trees—is applied to the prediction of mean wave overtopping discharges. This new machine learning model is trained using the CLASH wave overtopping database. Optimizations to its performance are realized by using feature engineering and hyperparameter tuning. The model is shown to outperform an existing neural network model by reducing the error on the prediction of the CLASH database by a factor of 2.8. The model predictions follow physically realistic trends for variations of important features, and behave regularly in regions of the input parameter space with little or no data coverage.
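As a rough illustration of the gradient boosting idea mentioned in this abstract, the sketch below fits a sequence of one-split regression trees (stumps) to the residuals of the running prediction, which for squared-error loss equal the negative gradient. It is a toy, single-feature version written for this listing; the function names, the learning rate, and the data are assumptions, and it is not the model or the library used in the paper.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimises squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def fit_gbdt(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on the residuals of the current prediction (squared-error loss)."""
    base = sum(ys) / len(ys)          # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [p + learning_rate * (left if x <= threshold else right)
                 for x, p in zip(xs, preds)]
    return base, stumps, learning_rate

def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, stumps, learning_rate = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )

# Hypothetical 1-D regression data (y = x^2).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [x * x for x in xs]
model = fit_gbdt(xs, ys)
print([round(predict(model, x), 2) for x in xs])
```

The feature engineering and hyperparameter tuning mentioned in the abstract would correspond here to choosing the input features, `n_rounds`, and `learning_rate`, as well as the tree depth in a full implementation.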


Using machine learning techniques and different color spaces for the classification of Cape gooseberry (Physalis peruviana L.) fruits according to ripeness level

10.7287/peerj.preprints.26691v2 ◽

2019 ◽

Cited By ~ 1

Author(s):

Wilson Castro ◽

Jimy Oblitas ◽

Miguel De-la-Torre ◽

Carlos Cotrina ◽

Karen Bazán ◽

...

Keyword(s):

Machine Learning ◽

Decision Trees ◽

Color Space ◽

Machine Learning Techniques ◽

Support Vector ◽

Color Spaces ◽

Learning Techniques ◽

Different Color ◽

Cape Gooseberry

The classification of fresh fruits according to their ripeness is typically a subjective and tedious task; consequently, there is growing interest in the use of non-contact techniques such as those based on computer vision and machine learning. In this paper, we propose the use of non-intrusive techniques for the classification of Cape gooseberry fruits. The proposal is based on the use of machine learning techniques combined with different color spaces. Given the success of techniques such as artificial neural networks, support vector machines, decision trees, and K-nearest neighbors in addressing classification problems, we decided to use these approaches in this research work. A sample of 926 Cape gooseberry fruits was obtained, and fruits were classified manually according to their level of ripeness into seven different classes. Images of each fruit were acquired in the RGB format through a system developed for this purpose. These images were preprocessed, filtered and segmented until the fruits were identified. For each piece of fruit, the median color parameter values in the RGB space were obtained, and these results were subsequently transformed into the HSV and L*a*b* color spaces. The values of each piece of fruit in the three color spaces and their corresponding degrees of ripeness were arranged for use in the creation, testing, and comparison of the developed classification models. The classification of gooseberry fruits by ripening level was found to be sensitive to both the color space used and the classification technique, e.g., the models based on decision trees are the most accurate, and the models based on the L*a*b* color space obtain the best mean accuracy. However, the model that best classifies the Cape gooseberry fruits based on ripeness level is that resulting from the combination of the SVM technique and the RGB color space.
