Impact of subsampling and tree depth on random forests

2018 ◽  
Vol 22 ◽  
pp. 96-128 ◽  
Author(s):  
Roxane Duroux ◽  
Erwan Scornet

Random forests are ensemble learning methods introduced by Breiman [Mach. Learn. 45 (2001) 5–32] that operate by averaging several decision trees built on a randomly selected subspace of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman’s forests are not yet fully understood, neither is the tuning of the corresponding parameters. In this paper, we study the influence of two parameters, namely the subsampling rate and the tree depth, on Breiman’s forests performance. More precisely, we prove that quantile forests (a specific type of random forests) based on subsampling and quantile forests whose tree construction is terminated early have similar performances, as long as their respective parameters (subsampling rate and tree depth) are well chosen. Moreover, experiments show that a proper tuning of these parameters leads in most cases to an improvement of Breiman’s original forests in terms of mean squared error.
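The two parameters studied here map directly onto standard random-forest hyperparameters. A minimal sketch, assuming scikit-learn (the data and parameter values below are illustrative, not from the paper), comparing a subsampled forest of deep trees against a full-sample forest of shallow trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=500)

# The paper's two mechanisms: subsampling (max_samples) vs. early
# termination of tree construction (max_depth).
subsampled = RandomForestRegressor(n_estimators=100, max_samples=0.3, random_state=0)
shallow = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=0)

subsampled.fit(X[:400], y[:400])
shallow.fit(X[:400], y[:400])

mse_sub = mean_squared_error(y[400:], subsampled.predict(X[400:]))
mse_shallow = mean_squared_error(y[400:], shallow.predict(X[400:]))
print(mse_sub, mse_shallow)
```

With well-chosen values for the two parameters, the two forests tend to reach comparable mean squared errors, which is the behaviour the paper proves for quantile forests.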

2021 ◽  
Vol 19 (1) ◽  
pp. 2-20
Author(s):  
Piyush Kant Rai ◽  
Alka Singh ◽  
Muhammad Qasim

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented. The calibrated weights based on different distance functions are also derived. A simulation study has been carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to confirm the supremacy of the proposed method.
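For the chi-square distance, the calibrated weights have a closed form: the design weights are adjusted linearly so that the weighted totals of the auxiliary variables match their known population totals. A minimal sketch with NumPy, using hypothetical design weights and population totals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Two auxiliary variables with known population totals (hypothetical values).
x = rng.uniform(1, 3, size=(n, 2))
d = np.full(n, 10.0)                    # design weights
X_total = np.array([4100.0, 3900.0])    # known totals of the auxiliaries

# Chi-square distance calibration: w_i = d_i * (1 + x_i' lam), with lam
# chosen so that the calibrated totals reproduce X_total exactly.
T = (d[:, None] * x).T @ x
lam = np.linalg.solve(T, X_total - d @ x)
w = d * (1 + x @ lam)

print(w @ x)  # matches X_total
```

Other distance functions (e.g. the raking distance) lead to different weight adjustments, generally requiring an iterative solution rather than this closed form.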


2021 ◽  
Vol 13 (22) ◽  
pp. 4675
Author(s):  
William Yamada ◽  
Wei Zhao ◽  
Matthew Digman

An automatic method of obtaining geographic coordinates of bales using monovision un-crewed aerial vehicle imagery was developed utilizing a data set of 300 images with a 20-megapixel resolution containing a total of 783 labeled bales of corn stover and soybean stubble. The relative performance of image processing with Otsu’s segmentation, you only look once version three (YOLOv3), and region-based convolutional neural networks was assessed. As a result, the best option in terms of accuracy and speed was determined to be YOLOv3, with 80% precision, 99% recall, 89% F1 score, 97% mean average precision, and a 0.38 s inference time. Next, the impact of using lower-cost cameras was evaluated by reducing image quality to one megapixel. The lower-resolution images resulted in decreased performance, with 79% precision, 97% recall, 88% F1 score, 96% mean average precision, and 0.40 s inference time. Finally, the output of the YOLOv3 trained model, density-based spatial clustering, photogrammetry, and map projection were utilized to predict the geocoordinates of the bales with a root mean squared error of 2.41 m.
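The clustering step can be illustrated in isolation: repeated detections of the same bale from overlapping images are merged with density-based clustering, and each cluster's centroid becomes the bale's predicted geocoordinate. A minimal sketch assuming scikit-learn's DBSCAN, with synthetic coordinates standing in for the photogrammetry output:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical ground coordinates (metres) of three bales, each detected
# four times across overlapping images with a little positional noise.
rng = np.random.default_rng(2)
bales = np.array([[0.0, 0.0], [50.0, 10.0], [30.0, 80.0]])
detections = np.vstack([b + rng.normal(scale=1.0, size=(4, 2)) for b in bales])

# Merge repeated detections of the same bale, then take each cluster's
# centroid as the bale's predicted geocoordinate.
labels = DBSCAN(eps=6.0, min_samples=2).fit_predict(detections)
centroids = np.array([detections[labels == k].mean(axis=0)
                      for k in sorted(set(labels) - {-1})])
print(len(centroids))
```

The `eps` radius here is a hypothetical value; in practice it would be chosen from the expected positioning error (on the order of the reported 2.41 m RMSE).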


10.2196/27386 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e27386
Author(s):  
Qingyu Chen ◽  
Alex Rankine ◽  
Yifan Peng ◽  
Elaheh Aghaarabi ◽  
Zhiyong Lu

Background Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. Objective Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). 
BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. Conclusions Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.


2019 ◽  
Vol 23 (1) ◽  
pp. 125-142
Author(s):  
Helle Hein ◽  
Ljubov Jaanuska

In this paper, the Haar wavelet discrete transform, artificial neural networks (ANNs), and random forests (RFs) are applied to predict the location and severity of a crack in an Euler–Bernoulli cantilever subjected to transverse free vibration. An extensive investigation into two data collection sets and machine learning methods showed that the depth of a crack is more difficult to predict than its location. The data set of eight natural frequency parameters produces more accurate predictions of the crack depth, while the data set of eight Haar wavelet coefficients produces more precise predictions of the crack location. Furthermore, the analysis of the results showed that an ensemble of 50 ANNs trained by the Bayesian regularization and Levenberg–Marquardt algorithms slightly outperforms the RF.


Computers ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 59 ◽  
Author(s):  
Ayyaz-Ul-Haq Qureshi ◽  
Hadi Larijani ◽  
Nhamoinesu Mtetwa ◽  
Abbas Javed ◽  
Jawad Ahmad

The exponential growth of internet communications and increasing dependency of users upon software-based systems for most essential, everyday applications has raised the importance of network security. As attacks are on the rise, cybersecurity should be considered as a prime concern while developing new networks. In the past, numerous solutions have been proposed for intrusion detection; however, many of them are computationally expensive and require high memory resources. In this paper, we propose a new intrusion detection system using a random neural network and an artificial bee colony algorithm (RNN-ABC). The model is trained and tested with the benchmark NSL-KDD data set. Accuracy and other metrics, such as the sensitivity and specificity of the proposed RNN-ABC, are compared with the traditional gradient descent algorithm-based RNN. While the overall accuracy remains at 95.02%, the performance is also estimated in terms of mean of the mean squared error (MMSE), standard deviation of MSE (SDMSE), best mean squared error (BMSE), and worst mean squared error (WMSE) parameters, which further confirms the superiority of the proposed scheme over the traditional methods.
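The reported metrics derive from the confusion matrix of the detector on the test set. A minimal sketch with hypothetical counts chosen to be roughly consistent with the reported 95.02% accuracy (the counts themselves are not from the paper):

```python
# Hypothetical confusion-matrix counts for a binary intrusion detector,
# with "attack" as the positive class.
tp, fn, tn, fp = 4700, 300, 4802, 198

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # rate of attacks correctly detected
specificity = tn / (tn + fp)   # rate of normal traffic correctly passed
print(accuracy, sensitivity, specificity)
```

The MSE-based measures (MMSE, SDMSE, BMSE, WMSE) are summary statistics of the training error across repeated runs, which is how the stochastic ABC search is compared fairly against gradient descent.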


2015 ◽  
Author(s):  
Jeffrey W Hollister ◽  
W. Bryan Milstead ◽  
Betty J. Kreakie

Productivity of lentic ecosystems is well studied, and it is widely accepted that as nutrient inputs increase, productivity increases and lakes transition from lower trophic states (e.g. oligotrophic) to higher trophic states (e.g. eutrophic). These broad trophic state classifications are good predictors of ecosystem condition, services, and disservices (e.g. recreation, aesthetics, and harmful algal blooms). While the relationship between nutrients and trophic state provides reliable predictions, it requires in situ water quality data in order to parameterize the model. This limits the application of these models to lakes with existing and, more importantly, available water quality data. To address this, we take advantage of a large national lakes water quality database (i.e. the National Lakes Assessment), land use/land cover data, lake morphometry data, and other universally available data, and apply data mining approaches to predict trophic state. Using these data and random forests, we first model chlorophyll a, then classify the resultant predictions into trophic states. The full model estimates chlorophyll a with both in situ and universally available data. The mean squared error and adjusted R2 of this model were 0.09 and 0.8, respectively. The second model uses universally available GIS data only. The mean squared error was 0.22 and the adjusted R2 was 0.48. The accuracies of the trophic state classifications derived from the chlorophyll a predictions were 69% for the full model and 49% for the “GIS only” model. Random forests extend the usefulness of the class predictions by providing prediction probabilities for each lake. This allows us to make trophic state predictions and also indicate the level of uncertainty around those predictions. For the full model, these predicted class probabilities ranged from 0.42 to 1. For the GIS only model, they ranged from 0.33 to 0.96.
It is our conclusion that in situ data are required for better predictions, yet GIS and universally available data provide trophic state predictions, with estimated uncertainty, that still have the potential for a broad array of applications. The source code and data for this manuscript are available from https://github.com/USEPA/LakeTrophicModelling.
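The per-lake prediction probabilities come directly from the forest's vote fractions. A minimal sketch assuming scikit-learn, with synthetic stand-ins for the GIS predictors and trophic-state labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Hypothetical GIS-style predictors and binary trophic-state labels
# (0 = lower trophic state, 1 = higher) for 300 lakes.
X = rng.uniform(size=(300, 4))
y = (X[:, 0] + 0.2 * rng.normal(size=300) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:250], y[:250])
proba = clf.predict_proba(X[250:])

# The winning class's probability doubles as a per-lake certainty measure,
# analogous to the 0.33-1.0 ranges reported above.
confidence = proba.max(axis=1)
print(confidence.min(), confidence.max())
```

Lakes whose winning-class probability sits near the 1/k chance level (0.5 here with two classes) are the ones whose trophic state assignment should be treated with caution.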


2020 ◽  
Author(s):  
Japheth E. Gado ◽  
Gregg T. Beckham ◽  
Christina M. Payne

ABSTRACTAccurate prediction of the optimal catalytic temperature (Topt) of enzymes is vital in biotechnology, as enzymes with high Topt values are desired for enhanced reaction rates. Recently, a machine-learning method (TOME) for predicting Topt was developed. TOME was trained on a normally-distributed dataset with a median Topt of 37°C and less than five percent of Topt values above 85°C, limiting the method’s predictive capabilities for thermostable enzymes. Due to the distribution of the training data, the mean squared error on Topt values greater than 85°C is nearly an order of magnitude higher than the error on values between 30 and 50°C. In this study, we apply ensemble learning and resampling strategies that tackle the data imbalance to significantly decrease the error on high Topt values (>85°C) by 60% and increase the overall R2 value from 0.527 to 0.632. The revised method, TOMER, and the resampling strategies applied in this work are freely available to other researchers as a Python package on GitHub.
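The core difficulty is an imbalanced regression target: high-Topt examples are too rare to influence the fit. One simple resampling strategy of the kind applied here is to oversample the rare high-temperature examples; the sketch below (not TOMER's exact procedure, and with synthetic data) shows the idea:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical Topt-like targets: most values near 37 C, few above 85 C.
y = np.concatenate([rng.normal(37, 10, 950), rng.normal(90, 5, 50)])
X = y[:, None] + rng.normal(size=(1000, 1))

# Naive oversampling: duplicate each rare high-Topt example 9 extra times
# so a downstream regressor sees the tail far more often.
rare = y > 85
idx = rng.choice(np.flatnonzero(rare), size=rare.sum() * 9, replace=True)
X_bal = np.vstack([X, X[idx]])
y_bal = np.concatenate([y, y[idx]])

print((y_bal > 85).mean())  # rare fraction rises from ~4% to ~30%
```

In practice such duplication is combined with an ensemble over differently resampled training sets, which is the combination the paper reports reduces the high-Topt error by 60%.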


Rainfall prediction plays a significant role in agriculture, so accurate rainfall prediction is essential for the sound financial development of our nation. In this paper, we present a linear regression method to predict the yearly rainfall in different states of India. Linear regression is fitted to the data set, and the resulting coefficients are used to predict the yearly rainfall from the corresponding parameter values, so an estimate of the rainfall for given values and places can be established easily. We demonstrate how to predict the yearly rainfall in all the states from 1901 to 2015 using simple multiple linear regression. We then train the model using a train_test_split and analyze various performance measures such as mean squared error, root mean squared error, and R^2, and we visualize the data using scatter plots, box plots, and plots of expected versus predicted values.
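The described pipeline is a standard scikit-learn workflow. A minimal sketch with a synthetic stand-in for the rainfall table (monthly totals as predictors, annual rainfall as the target; the data are illustrative, not the 1901-2015 records):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
# 115 hypothetical years of 12 monthly rainfall totals (mm).
months = rng.uniform(0, 300, size=(115, 12))
annual = months.sum(axis=1) + rng.normal(scale=20, size=115)

X_train, X_test, y_train, y_test = train_test_split(
    months, annual, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print(rmse, r2)
```

The same three measures reported in the abstract (MSE, RMSE, R^2) fall out of `mean_squared_error` and `r2_score` on the held-out split.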


2009 ◽  
Vol 5 (4) ◽  
pp. 58-76
Author(s):  
Zoran Bosnic ◽  
Igor Kononenko

In machine learning, reliability estimates for individual predictions provide more information about the individual prediction error than the average accuracy of the predictive model (e.g. relative mean squared error). Such reliability estimates may represent decisive information in risk-sensitive applications of machine learning (e.g. medicine, engineering, and business), where they enable users to distinguish between more and less reliable predictions. In the authors’ previous work, they proposed eight reliability estimates for individual examples in regression and evaluated their performance. The results showed that the performance of each estimate strongly varies depending on the domain and regression model properties. In this paper, they empirically analyze the dependence of the reliability estimates’ performance on data set and model properties. They present results showing that the reliability estimates perform better when used with more accurate regression models, in domains with a greater number of examples, and in domains with less noisy data.
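One classical reliability estimate of this kind is the spread of an ensemble's individual predictions for a single query point: wide disagreement signals a less reliable prediction. A minimal sketch assuming scikit-learn (this is one illustrative estimate, not one of the paper's eight specifically):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(600, 1))
noise_scale = np.where(X[:, 0] < 0, 0.05, 1.0)   # clean half / noisy half
y = X[:, 0] + noise_scale * rng.normal(size=600)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Per-prediction reliability estimate: the standard deviation of the
# individual trees' predictions at each query point.
x_new = np.array([[-1.0], [1.0]])
per_tree = np.stack([t.predict(x_new) for t in forest.estimators_])
spread = per_tree.std(axis=0)
print(spread)  # larger for the query in the noisy half of the domain
```

This matches the paper's finding qualitatively: the estimate is most informative when the model is accurate and the domain's noise is localized, so disagreement tracks genuine local difficulty.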


2016 ◽  
Vol 78 (12-3) ◽  
Author(s):  
Saadi Ahmad Kamaruddin ◽  
Nor Azura Md Ghani ◽  
Norazan Mohamed Ramli

Neurocomputing has been adopted in the time-series forecasting arena, but outliers, which commonly occur in time-series data, can be harmful to network training, since such networks automatically discover patterns without prior assumptions and without loss of generality. In theory, the most common training approach for backpropagation relies on minimizing the ordinary least squares (OLS) criterion or, more specifically, the mean squared error (MSE). However, this approach is not robust when outliers exist in the training data, and it leads to inaccurate forecasts of future values. Therefore, in this paper, we present a new algorithm that applies the firefly algorithm to the least median of squares estimator (FFA-LMedS) for Backpropagation neural network nonlinear autoregressive (BPNN-NAR) and Backpropagation neural network nonlinear autoregressive moving average (BPNN-NARMA) models, to reduce the impact of outliers in time-series data. The proposed enhanced models are compared with existing enhanced models using M-estimators, Iterative LMedS (ILMedS) and Particle Swarm Optimization on LMedS (PSO-LMedS) on the basis of root mean squared error (RMSE) values, which is the main highlight of this paper. Real industrial monthly data of the Malaysian Aggregate cost indices from January 1980 to December 2012 (base year 1980=100), with different degrees of outlier contamination, are used in this research. We find that the enhanced BPNN-NARMA models using M-estimators, ILMedS and FFA-LMedS performed very well, with RMSE values close to zero. It is expected that the findings will assist the respective authorities involved in Malaysian construction projects in overcoming cost overruns.
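The robustness of the LMedS criterion is easy to see in isolation: it scores a candidate fit by the median squared residual, so a minority of gross outliers cannot dominate it the way they dominate the mean. A minimal sketch with NumPy, using a crude random search over a toy linear model in place of the firefly optimiser (the criterion is the same; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + 0.2 * rng.normal(size=60)
y[:6] += 25.0            # inject 10% gross outliers

def lmeds(params):
    a, b = params
    return np.median((y - (a * x + b)) ** 2)   # least MEDIAN of squares

# Random search over (slope, intercept); the paper instead drives this
# criterion with the firefly algorithm inside network training.
candidates = rng.uniform([-5.0, -5.0], [5.0, 5.0], size=(5000, 2))
best = min(candidates, key=lmeds)
print(best)   # close to the true (2, 1) despite the outliers
```

An OLS fit on the same data would be pulled visibly toward the contaminated points, which is the failure mode the FFA-LMedS training is designed to avoid.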

