Ensemble Machine Learning Approach Improves Predicted Spatial Variation of Surface Soil Organic Carbon Stocks in Data-Limited Northern Circumpolar Region

2020 ◽  
Vol 3 ◽  
Author(s):  
Umakant Mishra ◽  
Sagar Gautam ◽  
William J. Riley ◽  
Forrest M. Hoffman

Various approaches of differing mathematical complexity are being applied for spatial prediction of soil properties. Regression kriging is a widely used hybrid approach for spatial prediction that combines correlation between soil properties and environmental factors with spatial autocorrelation between soil observations. In this study, we compared four machine learning approaches (gradient boosting machine, multivariate adaptive regression splines, random forest, and support vector machine) with regression kriging to predict the spatial variation of surface (0–30 cm) soil organic carbon (SOC) stocks at 250-m spatial resolution across the northern circumpolar permafrost region. We combined 2,374 soil profile observations (calibration datasets) with georeferenced datasets of environmental factors (climate, topography, land cover, bedrock geology, and soil types) to predict the spatial variation of surface SOC stocks. We evaluated the prediction accuracy at randomly selected sites (validation datasets) across the study area. We found that different techniques inferred different numbers of environmental factors and different relative importances of those factors for predicting SOC stocks. Regression kriging produced lower prediction errors than multivariate adaptive regression splines and support vector machine, and comparable prediction accuracy to gradient boosting machine and random forest. However, the ensemble median prediction of SOC stocks obtained from all four machine learning techniques showed the highest prediction accuracy. Although the choice among approaches for spatial prediction of soil properties will depend on the availability of soil and environmental datasets and computational resources, we conclude that the ensemble median prediction obtained from multiple machine learning approaches provides greater spatial detail and produces the highest prediction accuracy.
Thus an ensemble prediction approach can be a better choice than any single prediction technique for predicting the spatial variation of SOC stocks.
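The ensemble median idea above can be sketched with scikit-learn regressors on synthetic data. This is a minimal illustration, not the study's actual pipeline: the models, data, and hyperparameters are stand-ins (in particular, a MARS implementation is omitted, since scikit-learn does not ship one).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for SOC stocks vs. environmental covariates.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Individual learners; each is fit separately on the calibration data.
models = [
    GradientBoostingRegressor(random_state=0),
    RandomForestRegressor(random_state=0),
    SVR(),
]
preds = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])

# Ensemble median prediction: per-site median across the model predictions.
ensemble_median = np.median(preds, axis=1)
```

Because the median is taken per prediction site, the ensemble inherits each model's spatial detail while damping any single model's extreme errors.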

Stroke ◽  
2020 ◽  
Vol 51 (Suppl_1) ◽  
Author(s):  
Paul Litvak ◽  
Jeevan Medikonda ◽  
Girish Menon ◽  
Pitchaiah Mandava

Background: Patients suffering from subarachnoid hemorrhage (SAH) have poor long-term outcomes. There are predictive models for ischemic and hemorrhagic stroke; however, there is a paucity of models for SAH. Machine learning concepts were applied to build multi-stage Neural Network (NN), Support Vector Machine (SVM) and Keras/TensorFlow models to predict SAH outcomes. Methods: A database of ~800 aneurysmal SAH patients from Kasturba Medical College was utilized. Baseline variables of the World Federation of Neurosurgeons 5-point scale (WFNS 1-5), age, gender, and presence/absence of hypertension and diabetes were considered in Stage 1. Stage 2 included all Stage 1 variables along with presence/absence of radiologic signs of vasospasm and ischemia. Stage 3 included the variables of the earlier two stages along with the discharge Glasgow Outcome Scale (GOS 1-5). GOS at 3 months was predicted using 2-layer NN/SVM/Keras-TensorFlow models on the five-point categorical scale as well as dichotomized to dead/alive and favorable (GOS 4-5) or unfavorable (GOS 1-3). Prediction accuracy of the models was compared to the recorded GOS. Results: Prediction accuracy, shown as percentages (see Table), was similar for the SVM, NN and Keras/TensorFlow models across all three stages. Accuracy was remarkably higher with dichotomization than with the complete five-point GOS categorical scale. Conclusions: SVM, NN, and Keras/TensorFlow based machine learning models can be used to predict SAH outcomes to a high degree of accuracy. These powerful predictive models can be used to prognosticate and select patients into trials.
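The two dichotomizations of the five-point GOS described above can be written down directly; a minimal sketch with illustrative scores (not the authors' code or data):

```python
import numpy as np

# Example 3-month GOS scores (1 = dead ... 5 = good recovery).
gos = np.array([1, 2, 3, 4, 5, 4, 1])

# Favorable (GOS 4-5) vs. unfavorable (GOS 1-3) outcome.
favorable = gos >= 4

# Alive (GOS 2-5) vs. dead (GOS 1).
alive = gos >= 2
```

Collapsing five ordered categories into two balances the classes and removes near-boundary ambiguity, which is consistent with the markedly higher accuracy the authors report after dichotomization.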


Sensors ◽  
2019 ◽  
Vol 19 (2) ◽  
pp. 263 ◽  
Author(s):  
Meihua Yang ◽  
Dongyun Xu ◽  
Songchao Chen ◽  
Hongyi Li ◽  
Zhou Shi

Soil organic matter (SOM) and pH are essential soil fertility indicators of paddy soil in the middle-lower Yangtze Plain. Rapid, non-destructive and accurate determination of SOM and pH is vital to preventing soil degradation caused by inappropriate land management practices. Visible-near infrared (vis-NIR) spectroscopy with multivariate calibration can be used to effectively estimate soil properties. In this study, 523 soil samples were collected from paddy fields in the Yangtze Plain, China. Four machine learning approaches—partial least squares regression (PLSR), least squares-support vector machines (LS-SVM), extreme learning machines (ELM) and the Cubist regression model (Cubist)—were used to compare the prediction accuracy based on vis-NIR full bands and bands reduced using the genetic algorithm (GA). The coefficient of determination (R2), root mean square error (RMSE), and ratio of performance to inter-quartile distance (RPIQ) were used to assess the prediction accuracy. The ELM with GA-reduced bands was the best model for SOM (R2 = 0.81, RMSE = 5.17, RPIQ = 2.87) and pH (R2 = 0.76, RMSE = 0.43, RPIQ = 2.15). The performance of the LS-SVM for pH prediction did not differ significantly between the model with GA (R2 = 0.75, RMSE = 0.44, RPIQ = 2.08) and without GA (R2 = 0.74, RMSE = 0.45, RPIQ = 2.07). Although only a slight improvement was observed when ELM was used to predict SOM and pH with reduced bands (SOM: R2 = 0.81, RMSE = 5.17, RPIQ = 2.87; pH: R2 = 0.76, RMSE = 0.43, RPIQ = 2.15) compared with full bands (SOM: R2 = 0.81, RMSE = 5.18, RPIQ = 2.83; pH: R2 = 0.76, RMSE = 0.45, RPIQ = 2.07), the number of wavelengths was greatly reduced (SOM: 201 to 44; pH: 201 to 32). Thus, the ELM coupled with bands reduced by GA is recommended for prediction of paddy soil properties (SOM and pH) in the middle-lower Yangtze Plain.
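The RPIQ statistic used above has a simple definition: the inter-quartile range of the observed values divided by the RMSE of the predictions, so that larger values mean the model's error is small relative to the spread of the data. A minimal sketch:

```python
import numpy as np

def rpiq(y_true, y_pred):
    """Ratio of performance to inter-quartile distance: IQR(y_true) / RMSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    q1, q3 = np.percentile(y_true, [25, 75])
    return (q3 - q1) / rmse

# Example: constant error of 1 against observations 1..8 (IQR = 3.5).
print(rpiq(np.arange(1, 9), np.arange(1, 9) + 1))  # 3.5
```

Unlike RPD (which uses the standard deviation), RPIQ is robust to skewed property distributions, which is why it is common in soil spectroscopy.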


2020 ◽  
Author(s):  
Yulan Liang ◽  
Amin Gharipour ◽  
Erik Kelemen ◽  
Arpad Kelemen

Abstract Background: The identification of important proteins is critical for medical diagnosis and prognosis in common diseases. Diverse sets of computational tools have been developed for omics data reduction and protein selection. However, standard statistical models with single feature selection suffer from a multiple-testing burden and low power given the limited samples available. Furthermore, high correlations among proteins, with high redundancy and moderate effects, often lead to unstable selections and cause reproducibility issues. Ensemble feature selection in machine learning may identify a stable set of disease biomarkers that could improve the prediction performance of subsequent classification models and simplify their interpretation. In this study, we developed a three-stage homogeneous ensemble feature selection approach for both identifying proteins and improving prediction accuracy. This approach was implemented and applied to ovarian cancer proteogenomics data sets for two outcomes: 1) binary putative homologous recombination deficiency, positive or negative; and 2) multiple mRNA classes (differentiated, proliferative, immunoreactive, mesenchymal, and unknown). We conducted and compared various machine learning approaches with homogeneous ensemble feature selection, including random forest, support vector machine, and neural network, for predicting both binary and multi-class outcomes. Various performance criteria, including sensitivity, specificity, and kappa statistics, were used to assess prediction consistency and accuracy. Results: With the proposed three-stage homogeneous ensemble feature selection approach, prediction accuracy can be improved with limited samples by continuously reducing errors and redundancy; for example, Treebag provided 83% prediction accuracy (85% sensitivity and 81% specificity) for the binary ovarian outcome. For multi-class mRNA classification, our approach provided even better accuracy with increased sample size.
Conclusions: Despite the different prediction accuracies of the various models, the proposed homogeneous ensemble feature selection identified consistent sets of top-ranked important markers out of 9606 proteins linked to the binary disease and multiple mRNA class outcomes.
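A homogeneous ensemble feature selection (one base learner refit on resampled data, with rankings aggregated) can be sketched as follows. This is an illustrative single-stage version under assumptions, not the authors' exact three-stage procedure: random forest importances are ranked on bootstrap resamples of synthetic data and the most stably top-ranked features are kept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a protein expression matrix with few informative features.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

rng = np.random.default_rng(0)
n_features = X.shape[1]
rank_sum = np.zeros(n_features)

for _ in range(10):                                  # homogeneous: same learner each time
    idx = rng.integers(0, len(y), size=len(y))       # bootstrap resample
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X[idx], y[idx])
    # Convert importances to ranks (0 = most important) and accumulate.
    order = np.argsort(-rf.feature_importances_)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(n_features)
    rank_sum += ranks

# Stable markers: smallest total rank across resamples.
top_k = np.argsort(rank_sum)[:5]
```

Aggregating ranks over resamples is what gives the "consistent sets of top-ranked markers" property: a feature that is important only on one particular sample split will not survive the aggregation.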


2020 ◽  
Author(s):  
Hang Qiu ◽  
Lin Luo ◽  
Ziqi Su ◽  
Li Zhou ◽  
Liya Wang ◽  
...  

Abstract Background: Accumulating evidence has linked environmental exposures, such as ambient air pollution and meteorological factors, to the development and severity of cardiovascular diseases (CVDs), resulting in increased healthcare demand. Effective prediction of demand for healthcare services, particularly those associated with peak events of CVDs, can be useful in optimizing the allocation of medical resources. However, few studies have attempted to adopt machine learning approaches with excellent predictive abilities to forecast healthcare demand for CVDs. This study aims to develop and compare several machine learning models for predicting the peak demand days of CVD admissions, using hospital admissions data, air quality data and meteorological data in Chengdu, China from 2015 to 2017. Methods: Six machine learning algorithms, including logistic regression (LR), support vector machine (SVM), artificial neural network (ANN), random forest (RF), extreme gradient boosting (XGBoost), and light gradient boosting machine (LightGBM), were applied to build the predictive models with a unique feature set. The area under the receiver operating characteristic curve (AUC), logarithmic loss function, accuracy, sensitivity, specificity, precision, and F1 score were used to compare the predictive performance of the six models. Results: The LightGBM model exhibited the highest AUC (0.940, 95% CI: 0.900-0.980), which was significantly higher than that of LR (0.842, 95% CI: 0.783-0.901), SVM (0.834, 95% CI: 0.774-0.894) and ANN (0.890, 95% CI: 0.836-0.944), but did not differ significantly from that of RF (0.926, 95% CI: 0.879-0.974) and XGBoost (0.930, 95% CI: 0.878-0.982). In addition, the LightGBM model had the best logarithmic loss (0.218), accuracy (91.3%), specificity (94.1%), precision (0.695), and F1 score (0.725).
Feature importance identification indicated that meteorological conditions and air pollutants contributed 32% and 43% of the prediction, respectively. Conclusion: This study suggests that ensemble learning models, especially the LightGBM model, can effectively predict peak events of CVD admissions and could therefore be a very useful decision-making tool for medical resource management.
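AUC point estimates with confidence intervals like those reported above can be obtained for any classifier by bootstrapping the test set; a minimal sketch on synthetic scores (not the study's data or its exact CI method, which is not specified in the abstract):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic test-set labels and predicted peak-day probabilities.
y_true = rng.integers(0, 2, size=500)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=500), 0, 1)

aucs = []
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))  # bootstrap resample
    if len(np.unique(y_true[idx])) < 2:                   # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_score[idx]))

point = roc_auc_score(y_true, y_score)
lo, hi = np.percentile(aucs, [2.5, 97.5])  # 95% bootstrap CI
```

Non-overlap of such intervals is a rough screen for significance; a paired test (e.g. DeLong's test on the same test set) is the more rigorous comparison.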


2020 ◽  
Author(s):  
Albert Morera ◽  
Juan Martínez de Aragón ◽  
José Antonio Bonet ◽  
Jingjing Liang ◽  
Sergio de-Miguel

Abstract Background: Predicting biogeographical patterns from a large number of driving factors with complex interactions, correlations and non-linear dependences requires advanced analytical methods and modelling tools. This study compares different statistical and machine learning models for predicting fungal productivity biogeographical patterns, as a case study for thoroughly assessing the performance of alternative modelling approaches in providing accurate and ecologically consistent predictions. Methods: We evaluated and compared the performance of two statistical modelling techniques, namely generalized linear mixed models and geographically weighted regression, and four machine learning models, namely random forest, extreme gradient boosting, support vector machine and deep learning, in predicting fungal productivity. We used a systematic methodology based on substitution, random, spatial and climatic blocking combined with principal component analysis, together with an evaluation of the ecological consistency of spatially explicit model predictions. Results: Fungal productivity predictions were sensitive to the modelling approach and its complexity. Moreover, the importance assigned to different predictors varied between machine learning modelling approaches. Decision-tree-based models increased prediction accuracy by ~7% compared to other machine learning approaches and by more than 25% compared to statistical ones, and resulted in higher ecological consistency at the landscape level. Conclusions: Whereas a large number of predictors are often used in machine learning algorithms, in this study we show that proper variable selection is crucial to creating robust models for extrapolation in biophysically differentiated areas. When dealing with spatio-temporal data in the analysis of biogeographical patterns, climatic blocking is postulated as a highly informative technique to be used in cross-validation to assess the prediction error over larger scales.
Random forest was the best approach for prediction, both in sampling-like environments and in extrapolation beyond the spatial and climatic range of the modelling data.
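Spatial (or climatic) blocking in cross-validation can be implemented by assigning each observation to a block and holding out whole blocks, so that spatially autocorrelated neighbours never straddle the train/test split. A sketch with scikit-learn's GroupKFold; the grid-based block assignment and synthetic coordinates are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 120
X = rng.normal(size=(n, 4))                 # predictors
coords = rng.uniform(0, 100, size=(n, 2))   # plot coordinates

# Assign each plot to a 25x25 spatial block; blocks, not plots, are held
# out, so nearby (autocorrelated) plots cannot leak into the test fold.
blocks = (coords[:, 0] // 25).astype(int) * 4 + (coords[:, 1] // 25).astype(int)

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, groups=blocks):
    # No spatial block appears on both sides of the split.
    assert not set(blocks[train_idx]) & set(blocks[test_idx])
```

Climatic blocking works the same way, except the group labels come from climate strata (e.g. bins of temperature/precipitation) rather than map position, which is what lets it probe extrapolation beyond the climatic range of the training data.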


2020 ◽  
Author(s):  
Tejas Wadiwala ◽  
Vikas Trikha ◽  
Jinan Fiaidhi

This paper performs a comparative analysis of a brain-signals dataset using various machine learning classifiers: random forest, gradient boosting, support vector machine, and an extra-trees classifier. The comparative analysis is based on performance parameters such as accuracy, area under the ROC curve (AUC), specificity, recall, and precision. The key focus of this paper is to apply machine learning practices to an electroencephalogram (EEG) signals dataset provided by the Rochester Institute of Technology and to report meaningful results. EEG signals are typically recorded to diagnose problems related to the electrical activity of the brain, as they track brain-wave patterns to produce a definitive report on seizure activity. While applying these machine learning practices, various data preprocessing techniques were implemented to obtain clean, organized data and thereby better predictions and higher accuracy. Section II gives a comprehensive survey of existing work; section III describes the dataset used for this research.


Author(s):  
Pulung Hendro Prastyo ◽  
I Gede Yudi Paramartha ◽  
Michael S. Moses Pakpahan ◽  
Igi Ardiyanto

Breast cancer is the most common cancer among women (43.3 cases per 100,000 women), with the highest mortality (14.3 deaths per 100,000 women). Early detection is critical for survival. Using machine learning approaches, the problem can be effectively classified, predicted, and analyzed. In this study, we compared eight machine learning algorithms: Gaussian Naïve Bayes (GNB), k-Nearest Neighbors (K-NN), Support Vector Machine (SVM), Random Forest (RF), AdaBoost, Gradient Boosting (GB), XGBoost, and Multi-Layer Perceptron (MLP). The experiment was conducted using the Breast Cancer Wisconsin dataset, a confusion matrix, and 5-fold cross-validation. Experimental results showed that XGBoost provides the best performance: accuracy of 97.19%, recall of 96.75%, precision of 97.28%, F1-score of 96.99%, and AUC of 99.61%. Our results showed that XGBoost is the most effective method for predicting breast cancer in the Breast Cancer Wisconsin dataset.
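A comparison of this kind can be reproduced in a few lines with scikit-learn, which bundles the Breast Cancer Wisconsin (diagnostic) dataset. This sketch shows only two of the eight classifiers (XGBoost is a separate package, so it is not included here), and the scores will differ somewhat from the paper's:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Two of the compared algorithms; the SVM gets feature scaling,
# which it needs on this dataset's very differently scaled features.
models = {
    "GNB": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

# Mean accuracy under 5-fold cross-validation, as in the study.
scores = {name: cross_val_score(m, X, y, cv=5).mean()
          for name, m in models.items()}
```

The remaining six algorithms slot into the same `models` dict, which keeps the evaluation protocol identical across all candidates.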


2020 ◽  
Author(s):  
Tejas Wadiwala ◽  
Vikas Trikha ◽  
Jinan Fiaidhi

<p><b>This paper attempts to perform a comparative analysis of brain signals dataset using various machine learning classifiers such as random forest, gradient boosting, support vector machine, extra trees classifier. The comparative analysis is accomplished based on the performance parameters such as accuracy, area under the ROC curve (AUC), specificity, recall, and precision. The key focus of this paper is to exercise the machine learning practices over an Electroencephalogram (EEG) signals dataset provided by Rochester Institute of Technology and to provide meaningful results using the same. EEG signals are usually captivated to diagnose the problems related to the electrical activities of the brain as it tracks and records brain wave patterns to produce a definitive report on seizure activities of the brain. While exercising machine learning practices, various data preprocessing techniques were implemented to attain cleansed and organized data to predict better results and higher accuracy. Section II gives a comprehensive presurvey of existing work performed so far on the same; furthermore, section III sheds light on the dataset used for this research.</b></p>


2021 ◽  
pp. 1-33
Author(s):  
Stéphane Loisel ◽  
Pierrick Piette ◽  
Cheng-Hsien Jason Tsai

Abstract Modeling policyholders’ lapse behavior is important to a life insurer, since lapses affect pricing, reserving, profitability, liquidity, risk management, and the solvency of the insurer. In this paper, we apply two machine learning methods to lapse modeling. Then, we evaluate the performance of these two methods, along with two popular statistical methods, by means of statistical accuracy and a profitability measure. Moreover, we adopt an innovative point of view on the lapse prediction problem that comes from churn management: we transform the classification problem into a regression question and then perform optimization, which is new to lapse risk management. We apply the aforementioned four methods to a large real-world insurance dataset. The results show that Extreme Gradient Boosting (XGBoost) and support vector machine outperform logistic regression (LR) and classification and regression trees with respect to statistical accuracy, while LR performs as well as XGBoost in terms of retention gains. This highlights the importance of a proper validation metric when comparing different methods. The optimization after the transformation brings significant and consistent increases in economic gains. Therefore, the insurer should optimize its economic objective to achieve optimal lapse management.
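The idea of optimizing an economic objective over predicted lapse scores, rather than thresholding at 0.5 as in plain classification, can be sketched as a threshold search. All values and the gain function below are illustrative assumptions, not the paper's specification:

```python
import numpy as np

rng = np.random.default_rng(0)
lapse_prob = rng.uniform(size=1000)              # model's predicted lapse scores
policy_value = rng.uniform(50, 500, size=1000)   # value of retaining each policy

CONTACT_COST = 20.0   # cost of one retention action (assumed)
SUCCESS_RATE = 0.3    # chance the action prevents a lapse (assumed)

def expected_gain(threshold):
    """Net gain from targeting every policy scored above the threshold."""
    targeted = lapse_prob >= threshold
    # Averted lapses are worth their policy value; every contact costs money.
    return np.sum(targeted * (SUCCESS_RATE * lapse_prob * policy_value
                              - CONTACT_COST))

# Scan thresholds and keep the economically optimal one.
thresholds = np.linspace(0, 1, 101)
gains = np.array([expected_gain(t) for t in thresholds])
best_threshold = thresholds[np.argmax(gains)]
```

The optimal threshold generally differs from 0.5 because the objective weights each policy by its value, which is the paper's point about choosing a validation metric aligned with the economic goal.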


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Wentao Li ◽  
Qingxiang Wang ◽  
Xin Liu ◽  
Yanhong Yu

Abstract Background: Depression, a common mental disorder worldwide, imposes heavy burdens on families and society and differs from the fluctuating emotions and psychological pressures of daily life. Although body signs have been shown to manifest depression in general, few studies have used whole-body kinematic cues together with machine learning methods to aid depression recognition. Using the Kinect V2 device, we recorded simple kinematic skeleton data of each participant's body joints; spatial features and low-level features were extracted directly from the recorded original Kinect 3D coordinates. This study aimed to construct machine learning models from the preprocessed data that could be used for automatic depression classification. Methods: Taking patients' conditions into account and following psychiatrists' advice, a simple, purpose-designed stimulus task guided the collection of the human skeleton data. After extracting and preprocessing the original Kinect skeleton data, the experiment evaluated four widely used machine learning tools: Support Vector Machine, Logistic Regression, Random Forest and Gradient Boosting. Precision, recall, sensitivity, specificity, ROC curves, confusion matrices and related indicators, commonly used to evaluate classification methods, were calculated as performance measures. Results: Across 64 screened pairs in the depression and control groups, fully matched on age and gender, Gradient Boosting achieved the best performance, with a prediction accuracy of 76.92%. For gender-based depression recognition (females comprising 54.69% of the sample), the best-performing classifier, Gradient Boosting, achieved a prediction accuracy of 66.67% in the male group and 71.73% in the female group.
For age-based classification, the best model, Gradient Boosting, achieved a prediction accuracy of 76.92% in the older group (age > 40, 50% of the total) and 53.85% in the younger group (age <= 40). Conclusion: Depressed and non-depressed individuals can be well classified by computational models using Kinect-captured skeletal data. Gradient Boosting achieved the best performance among the four methods evaluated, and gender-based depression classification also reached reasonable accuracy. In particular, the recognition results for the older group were notably better than those for the younger group. These findings suggest that depression recognition based on kinematic skeletal data can serve as an effective tool for assisting depression analysis.

