Impact of subsampling and tree depth on random forests

2018 ◽  
Vol 22 ◽  
pp. 96-128 ◽  
Author(s):  
Roxane Duroux ◽  
Erwan Scornet

Random forests are ensemble learning methods introduced by Breiman [Mach. Learn. 45 (2001) 5–32] that operate by averaging several decision trees built on a randomly selected subspace of the data set. Despite their widespread use in practice, the respective roles of the different mechanisms at work in Breiman’s forests are not yet fully understood, neither is the tuning of the corresponding parameters. In this paper, we study the influence of two parameters, namely the subsampling rate and the tree depth, on Breiman’s forests performance. More precisely, we prove that quantile forests (a specific type of random forests) based on subsampling and quantile forests whose tree construction is terminated early have similar performances, as long as their respective parameters (subsampling rate and tree depth) are well chosen. Moreover, experiments show that a proper tuning of these parameters leads in most cases to an improvement of Breiman’s original forests in terms of mean squared error.
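The two parameters studied here map directly onto standard random-forest hyperparameters. A minimal sketch, assuming scikit-learn (the data and parameter values below are illustrative, not from the paper), comparing a subsampled forest of deep trees against a full-sample forest of shallow trees:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 5))
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.normal(size=500)

# The paper's two mechanisms: subsampling (max_samples) vs. early
# termination of tree construction (max_depth).
subsampled = RandomForestRegressor(n_estimators=100, max_samples=0.3, random_state=0)
shallow = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=0)

subsampled.fit(X[:400], y[:400])
shallow.fit(X[:400], y[:400])

mse_sub = mean_squared_error(y[400:], subsampled.predict(X[400:]))
mse_shallow = mean_squared_error(y[400:], shallow.predict(X[400:]))
print(mse_sub, mse_shallow)
```

With well-chosen values for the two parameters, the two forests tend to reach comparable mean squared errors, which is the behaviour the paper proves for quantile forests.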

2021 ◽  
Vol 19 (1) ◽  
pp. 2-20
Author(s):  
Piyush Kant Rai ◽  
Alka Singh ◽  
Muhammad Qasim

This article introduces calibration estimators under different distance measures based on two auxiliary variables in stratified sampling. The theory of the calibration estimator is presented. The calibrated weights based on different distance functions are also derived. A simulation study has been carried out to judge the performance of the proposed estimators based on the minimum relative root mean squared error criterion. A real-life data set is also used to confirm the supremacy of the proposed method.
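For the chi-square distance, the calibrated weights have a closed form: the design weights are adjusted linearly so that the weighted totals of the auxiliary variables match their known population totals. A minimal sketch with NumPy, using hypothetical design weights and population totals:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Two auxiliary variables with known population totals (hypothetical values).
x = rng.uniform(1, 3, size=(n, 2))
d = np.full(n, 10.0)                    # design weights
X_total = np.array([4100.0, 3900.0])    # known totals of the auxiliaries

# Chi-square distance calibration: w_i = d_i * (1 + x_i' lam), with lam
# chosen so that the calibrated totals reproduce X_total exactly.
T = (d[:, None] * x).T @ x
lam = np.linalg.solve(T, X_total - d @ x)
w = d * (1 + x @ lam)

print(w @ x)  # matches X_total
```

Other distance functions (e.g. the raking distance) lead to different weight adjustments, generally requiring an iterative solution rather than this closed form.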


2021 ◽  
Vol 13 (22) ◽  
pp. 4675
Author(s):  
William Yamada ◽  
Wei Zhao ◽  
Matthew Digman

An automatic method of obtaining geographic coordinates of bales using monovision un-crewed aerial vehicle imagery was developed utilizing a data set of 300 images with a 20-megapixel resolution containing a total of 783 labeled bales of corn stover and soybean stubble. The relative performance of image processing with Otsu’s segmentation, you only look once version three (YOLOv3), and region-based convolutional neural networks was assessed. As a result, the best option in terms of accuracy and speed was determined to be YOLOv3, with 80% precision, 99% recall, 89% F1 score, 97% mean average precision, and a 0.38 s inference time. Next, the impact of using lower-cost cameras was evaluated by reducing image quality to one megapixel. The lower-resolution images resulted in decreased performance, with 79% precision, 97% recall, 88% F1 score, 96% mean average precision, and 0.40 s inference time. Finally, the output of the YOLOv3 trained model, density-based spatial clustering, photogrammetry, and map projection were utilized to predict the geocoordinates of the bales with a root mean squared error of 2.41 m.
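The clustering step can be illustrated in isolation: repeated detections of the same bale from overlapping images are merged with density-based clustering, and each cluster's centroid becomes the bale's predicted geocoordinate. A minimal sketch assuming scikit-learn's DBSCAN, with synthetic coordinates standing in for the photogrammetry output:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical ground coordinates (metres) of three bales, each detected
# four times across overlapping images with a little positional noise.
rng = np.random.default_rng(2)
bales = np.array([[0.0, 0.0], [50.0, 10.0], [30.0, 80.0]])
detections = np.vstack([b + rng.normal(scale=1.0, size=(4, 2)) for b in bales])

# Merge repeated detections of the same bale, then take each cluster's
# centroid as the bale's predicted geocoordinate.
labels = DBSCAN(eps=6.0, min_samples=2).fit_predict(detections)
centroids = np.array([detections[labels == k].mean(axis=0)
                      for k in sorted(set(labels) - {-1})])
print(len(centroids))
```

The `eps` radius here is a hypothetical value; in practice it would be chosen from the expected positioning error (on the order of the reported 2.41 m RMSE).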


10.2196/27386 ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. e27386
Author(s):  
Qingyu Chen ◽  
Alex Rankine ◽  
Yifan Peng ◽  
Elaheh Aghaarabi ◽  
Zhiyong Lu

Background Semantic textual similarity (STS) measures the degree of relatedness between sentence pairs. The Open Health Natural Language Processing (OHNLP) Consortium released an expertly annotated STS data set and called for the National Natural Language Processing Clinical Challenges. This work describes our entry, an ensemble model that leverages a range of deep learning (DL) models. Our team from the National Library of Medicine obtained a Pearson correlation of 0.8967 in an official test set during 2019 National Natural Language Processing Clinical Challenges/Open Health Natural Language Processing shared task and achieved a second rank. Objective Although our models strongly correlate with manual annotations, annotator-level correlation was only moderate (weighted Cohen κ=0.60). We are cautious of the potential use of DL models in production systems and argue that it is more critical to evaluate the models in-depth, especially those with extremely high correlations. In this study, we benchmark the effectiveness and efficiency of top-ranked DL models. We quantify their robustness and inference times to validate their usefulness in real-time applications. Methods We benchmarked five DL models, which are the top-ranked systems for STS tasks: Convolutional Neural Network, BioSentVec, BioBERT, BlueBERT, and ClinicalBERT. We evaluated a random forest model as an additional baseline. For each model, we repeated the experiment 10 times, using the official training and testing sets. We reported 95% CI of the Wilcoxon rank-sum test on the average Pearson correlation (official evaluation metric) and running time. We further evaluated Spearman correlation, R², and mean squared error as additional measures. Results Using only the official training set, all models obtained highly effective results. BioSentVec and BioBERT achieved the highest average Pearson correlations (0.8497 and 0.8481, respectively). 
BioSentVec also had the highest results in 3 of 4 effectiveness measures, followed by BioBERT. However, their robustness to sentence pairs of different similarity levels varies significantly. A particular observation is that BERT models made the most errors (a mean squared error of over 2.5) on highly similar sentence pairs. They cannot capture highly similar sentence pairs effectively when they have different negation terms or word orders. In addition, time efficiency is dramatically different from the effectiveness results. On average, the BERT models were approximately 20 times and 50 times slower than the Convolutional Neural Network and BioSentVec models, respectively. This results in challenges for real-time applications. Conclusions Despite the excitement of further improving Pearson correlations in this data set, our results highlight that evaluations of the effectiveness and efficiency of STS models are critical. In future, we suggest more evaluations on the generalization capability and user-level testing of the models. We call for community efforts to create more biomedical and clinical STS data sets from different perspectives to reflect the multifaceted notion of sentence-relatedness.


2019 ◽  
Vol 23 (1) ◽  
pp. 125-142
Author(s):  
Helle Hein ◽  
Ljubov Jaanuska

In this paper, the Haar wavelet discrete transform, artificial neural networks (ANNs), and random forests (RFs) are applied to predict the location and severity of a crack in an Euler–Bernoulli cantilever subjected to transverse free vibration. An extensive investigation into two data collection sets and machine learning methods showed that the depth of a crack is more difficult to predict than its location. The data set of eight natural frequency parameters produces more accurate predictions of the crack depth, while the data set of eight Haar wavelet coefficients produces more precise predictions of the crack location. Furthermore, the analysis of the results showed that an ensemble of 50 ANNs trained by the Bayesian regularization and Levenberg–Marquardt algorithms slightly outperforms the RF.


Computers ◽  
2019 ◽  
Vol 8 (3) ◽  
pp. 59 ◽  
Author(s):  
Ayyaz-Ul-Haq Qureshi ◽  
Hadi Larijani ◽  
Nhamoinesu Mtetwa ◽  
Abbas Javed ◽  
Jawad Ahmad

The exponential growth of internet communications and increasing dependency of users upon software-based systems for most essential, everyday applications has raised the importance of network security. As attacks are on the rise, cybersecurity should be considered as a prime concern while developing new networks. In the past, numerous solutions have been proposed for intrusion detection; however, many of them are computationally expensive and require high memory resources. In this paper, we propose a new intrusion detection system using a random neural network and an artificial bee colony algorithm (RNN-ABC). The model is trained and tested with the benchmark NSL-KDD data set. Accuracy and other metrics, such as the sensitivity and specificity of the proposed RNN-ABC, are compared with the traditional gradient descent algorithm-based RNN. While the overall accuracy remains at 95.02%, the performance is also estimated in terms of mean of the mean squared error (MMSE), standard deviation of MSE (SDMSE), best mean squared error (BMSE), and worst mean squared error (WMSE) parameters, which further confirms the superiority of the proposed scheme over the traditional methods.
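The reported metrics derive from the confusion matrix of the detector on the test set. A minimal sketch with hypothetical counts chosen to be roughly consistent with the reported 95.02% accuracy (the counts themselves are not from the paper):

```python
# Hypothetical confusion-matrix counts for a binary intrusion detector,
# with "attack" as the positive class.
tp, fn, tn, fp = 4700, 300, 4802, 198

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # rate of attacks correctly detected
specificity = tn / (tn + fp)   # rate of normal traffic correctly passed
print(accuracy, sensitivity, specificity)
```

The MSE-based measures (MMSE, SDMSE, BMSE, WMSE) are summary statistics of the training error across repeated runs, which is how the stochastic ABC search is compared fairly against gradient descent.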


2015 ◽  
Author(s):  
Jeffrey W Hollister ◽  
W. Bryan Milstead ◽  
Betty J. Kreakie

Productivity of lentic ecosystems is well studied, and it is widely accepted that as nutrient inputs increase, productivity increases and lakes transition from lower trophic states (e.g. oligotrophic) to higher trophic states (e.g. eutrophic). These broad trophic state classifications are good predictors of ecosystem condition, services, and disservices (e.g. recreation, aesthetics, and harmful algal blooms). While the relationship between nutrients and trophic state provides reliable predictions, it requires in situ water quality data in order to parameterize the model. This limits the application of these models to lakes with existing and, more importantly, available water quality data. To address this, we take advantage of a large national lakes water quality database (i.e. the National Lakes Assessment), land use/land cover data, lake morphometry data, and other universally available data, and apply data mining approaches to predict trophic state. Using these data and random forests, we first model chlorophyll a, then classify the resultant predictions into trophic states. The full model estimates chlorophyll a with both in situ and universally available data. The mean squared error and adjusted R2 of this model were 0.09 and 0.8, respectively. The second model uses universally available GIS data only. The mean squared error was 0.22 and the adjusted R2 was 0.48. The accuracies of the trophic state classifications derived from the chlorophyll a predictions were 69% for the full model and 49% for the “GIS only” model. Random forests extend the usefulness of the class predictions by providing prediction probabilities for each lake. This allows us to make trophic state predictions and also indicate the level of uncertainty around those predictions. For the full model, these predicted class probabilities ranged from 0.42 to 1. For the GIS only model, they ranged from 0.33 to 0.96.
It is our conclusion that in situ data are required for better predictions, yet GIS and universally available data provide trophic state predictions, with estimated uncertainty, that still have the potential for a broad array of applications. The source code and data for this manuscript are available from https://github.com/USEPA/LakeTrophicModelling.
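The per-lake prediction probabilities come directly from the forest's vote fractions. A minimal sketch assuming scikit-learn, with synthetic stand-ins for the GIS predictors and trophic-state labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
# Hypothetical GIS-style predictors and binary trophic-state labels
# (0 = lower trophic state, 1 = higher) for 300 lakes.
X = rng.uniform(size=(300, 4))
y = (X[:, 0] + 0.2 * rng.normal(size=300) > 0.5).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:250], y[:250])
proba = clf.predict_proba(X[250:])

# The winning class's probability doubles as a per-lake certainty measure,
# analogous to the 0.33-1.0 ranges reported above.
confidence = proba.max(axis=1)
print(confidence.min(), confidence.max())
```

Lakes whose winning-class probability sits near the 1/k chance level (0.5 here with two classes) are the ones whose trophic state assignment should be treated with caution.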


2020 ◽  
Author(s):  
Japheth E. Gado ◽  
Gregg T. Beckham ◽  
Christina M. Payne

ABSTRACTAccurate prediction of the optimal catalytic temperature (Topt) of enzymes is vital in biotechnology, as enzymes with high Topt values are desired for enhanced reaction rates. Recently, a machine-learning method (TOME) for predicting Topt was developed. TOME was trained on a normally-distributed dataset with a median Topt of 37°C and less than five percent of Topt values above 85°C, limiting the method’s predictive capabilities for thermostable enzymes. Due to the distribution of the training data, the mean squared error on Topt values greater than 85°C is nearly an order of magnitude higher than the error on values between 30 and 50°C. In this study, we apply ensemble learning and resampling strategies that tackle the data imbalance to significantly decrease the error on high Topt values (>85°C) by 60% and increase the overall R2 value from 0.527 to 0.632. The revised method, TOMER, and the resampling strategies applied in this work are freely available to other researchers as a Python package on GitHub.
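The core difficulty is an imbalanced regression target: high-Topt examples are too rare to influence the fit. One simple resampling strategy of the kind applied here is to oversample the rare high-temperature examples; the sketch below (not TOMER's exact procedure, and with synthetic data) shows the idea:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical Topt-like targets: most values near 37 C, few above 85 C.
y = np.concatenate([rng.normal(37, 10, 950), rng.normal(90, 5, 50)])
X = y[:, None] + rng.normal(size=(1000, 1))

# Naive oversampling: duplicate each rare high-Topt example 9 extra times
# so a downstream regressor sees the tail far more often.
rare = y > 85
idx = rng.choice(np.flatnonzero(rare), size=rare.sum() * 9, replace=True)
X_bal = np.vstack([X, X[idx]])
y_bal = np.concatenate([y, y[idx]])

print((y_bal > 85).mean())  # rare fraction rises from ~4% to ~30%
```

In practice such duplication is combined with an ensemble over differently resampled training sets, which is the combination the paper reports reduces the high-Topt error by 60%.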


Rainfall prediction plays a significant role in agriculture, so accurate rainfall prediction is essential for the sound financial development of our nation. In this paper, we present a linear regression method to predict the yearly rainfall in different states of India. Linear regression is fitted to the data set, and the resulting coefficients are used to predict the yearly rainfall from the corresponding parameter values, so an estimate of the rainfall for given values and places can be established easily. We demonstrate how to predict the yearly rainfall in all the states from 1901 to 2015 using simple multiple linear regression. We then train the model using a train_test_split and analyze various performance measures such as mean squared error, root mean squared error, and R^2, and we visualize the data using scatter plots, box plots, and plots of expected versus predicted values.
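The described pipeline is a standard scikit-learn workflow. A minimal sketch with a synthetic stand-in for the rainfall table (monthly totals as predictors, annual rainfall as the target; the data are illustrative, not the 1901-2015 records):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)
# 115 hypothetical years of 12 monthly rainfall totals (mm).
months = rng.uniform(0, 300, size=(115, 12))
annual = months.sum(axis=1) + rng.normal(scale=20, size=115)

X_train, X_test, y_train, y_test = train_test_split(
    months, annual, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, pred)
print(rmse, r2)
```

The same three measures reported in the abstract (MSE, RMSE, R^2) fall out of `mean_squared_error` and `r2_score` on the held-out split.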


2009 ◽  
Vol 5 (4) ◽  
pp. 58-76
Author(s):  
Zoran Bosnic ◽  
Igor Kononenko

In machine learning, reliability estimates for individual predictions provide more information about the individual prediction error than the average accuracy of the predictive model (e.g. relative mean squared error). Such reliability estimates may represent decisive information in risk-sensitive applications of machine learning (e.g. medicine, engineering, and business), where they enable users to distinguish between more and less reliable predictions. In the authors’ previous work, they proposed eight reliability estimates for individual examples in regression and evaluated their performance. The results showed that the performance of each estimate strongly varies depending on the domain and regression model properties. In this paper, they empirically analyze the dependence of the reliability estimates’ performance on data set and model properties. They present results showing that the reliability estimates perform better when used with more accurate regression models, in domains with a greater number of examples, and in domains with less noisy data.
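One classical reliability estimate of this kind is the spread of an ensemble's individual predictions for a single query point: wide disagreement signals a less reliable prediction. A minimal sketch assuming scikit-learn (this is one illustrative estimate, not one of the paper's eight specifically):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(600, 1))
noise_scale = np.where(X[:, 0] < 0, 0.05, 1.0)   # clean half / noisy half
y = X[:, 0] + noise_scale * rng.normal(size=600)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Per-prediction reliability estimate: the standard deviation of the
# individual trees' predictions at each query point.
x_new = np.array([[-1.0], [1.0]])
per_tree = np.stack([t.predict(x_new) for t in forest.estimators_])
spread = per_tree.std(axis=0)
print(spread)  # larger for the query in the noisy half of the domain
```

This matches the paper's finding qualitatively: the estimate is most informative when the model is accurate and the domain's noise is localized, so disagreement tracks genuine local difficulty.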


2016 ◽  
Vol 78 (12-3) ◽  
Author(s):  
Saadi Ahmad Kamaruddin ◽  
Nor Azura Md Ghani ◽  
Norazan Mohamed Ramli

Neurocomputing has been adopted in the time-series forecasting arena, but outliers, which commonly occur in time-series data, can be harmful to network training, since such networks automatically discover patterns without prior assumptions and without loss of generality. In theory, the most common training approach for backpropagation relies on minimizing the ordinary least squares (OLS) criterion or, more specifically, the mean squared error (MSE). However, this approach is not robust when outliers exist in the training data, and it leads to inaccurate forecasts of future values. Therefore, in this paper, we present a new algorithm that applies the firefly algorithm to the least median of squares estimator (FFA-LMedS) for Backpropagation neural network nonlinear autoregressive (BPNN-NAR) and Backpropagation neural network nonlinear autoregressive moving average (BPNN-NARMA) models, to reduce the impact of outliers in time-series data. The proposed enhanced models are compared with existing enhanced models using M-estimators, Iterative LMedS (ILMedS) and Particle Swarm Optimization on LMedS (PSO-LMedS) on the basis of root mean squared error (RMSE) values, which is the main highlight of this paper. Real industrial monthly data of the Malaysian Aggregate cost indices from January 1980 to December 2012 (base year 1980=100), with different degrees of outlier contamination, are used in this research. We find that the enhanced BPNN-NARMA models using M-estimators, ILMedS and FFA-LMedS performed very well, with RMSE values close to zero. It is expected that the findings will assist the respective authorities involved in Malaysian construction projects in overcoming cost overruns.
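The robustness of the LMedS criterion is easy to see in isolation: it scores a candidate fit by the median squared residual, so a minority of gross outliers cannot dominate it the way they dominate the mean. A minimal sketch with NumPy, using a crude random search over a toy linear model in place of the firefly optimiser (the criterion is the same; the data are synthetic):

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 10, 60)
y = 2.0 * x + 1.0 + 0.2 * rng.normal(size=60)
y[:6] += 25.0            # inject 10% gross outliers

def lmeds(params):
    a, b = params
    return np.median((y - (a * x + b)) ** 2)   # least MEDIAN of squares

# Random search over (slope, intercept); the paper instead drives this
# criterion with the firefly algorithm inside network training.
candidates = rng.uniform([-5.0, -5.0], [5.0, 5.0], size=(5000, 2))
best = min(candidates, key=lmeds)
print(best)   # close to the true (2, 1) despite the outliers
```

An OLS fit on the same data would be pulled visibly toward the contaminated points, which is the failure mode the FFA-LMedS training is designed to avoid.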

