Comparing Machine Learning Models and Hybrid Geostatistical Methods Using Environmental and Soil Covariates for Soil pH Prediction

2020 ◽  
Vol 9 (4) ◽  
pp. 276
Author(s):  
Panagiotis Tziachris ◽  
Vassilis Aschonitis ◽  
Theocharis Chatzistathis ◽  
Maria Papadopoulou ◽  
Ioannis (John) D. Doukas

In the current paper we assess different machine learning (ML) models and hybrid geostatistical methods for the prediction of soil pH using digital elevation model derivatives (environmental covariates) and co-located soil parameters (soil covariates). The study was located in the area of Grevena, Greece, where 266 disturbed soil samples were collected from randomly selected locations and analyzed in the laboratory of the Soil and Water Resources Institute. The models assessed were random forests (RF), random forests kriging (RFK), gradient boosting (GB), gradient boosting kriging (GBK), neural networks (NN), and neural networks kriging (NNK), together with multiple linear regression (MLR), ordinary kriging (OK), and regression kriging (RK), which, although not ML models, were included for comparison. The GB and RF models presented the best results in the study, with NN a close second. Applying OK to the ML models’ residuals did not have a major impact. Classical or hybrid geostatistical methods without ML (OK, MLR, and RK) exhibited worse prediction accuracy than the models that included ML. Furthermore, different implementations (methods and packages) of the same ML models were also assessed. For RF and GB, the implementations applied (ranger-ranger, randomForest-rf, xgboost-xgbTree, xgboost-xgbDART) led to similar results, whereas for NN the differences between the implementations used (nnet-nnet and nnet-avNNet) were more distinct. Finally, ML models tuned through a random search optimization method were compared with the same ML models using their default values. The results showed that the optimization process improved predictions only where the ML algorithm had a large number of hyperparameters to tune and the optimized values differed significantly from the defaults, as in the case of GB and NN, but not RF.
In general, the current study concluded that although RF and GB presented approximately the same prediction accuracy, RF gave more consistent results across different packages, different hyperparameter selection methods, and even the inclusion of OK in the ML models’ residuals.
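The default-versus-tuned comparison described above can be sketched with scikit-learn, pitting a default gradient-boosting regressor against one tuned by random search. This is a minimal illustration on synthetic data; the covariates, targets, and search space are assumptions, not the study's actual setup:

```python
# Sketch: default gradient boosting vs. a randomly-searched one (synthetic data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# 266 samples mirrors the study's sample count; features are synthetic.
X, y = make_regression(n_samples=266, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

default_gb = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Illustrative search space; the paper's actual grids are not reproduced here.
param_dist = {
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 5],
}
search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_dist, n_iter=8, cv=3, random_state=0,
)
tuned_gb = search.fit(X_tr, y_tr).best_estimator_

rmse_default = mean_squared_error(y_te, default_gb.predict(X_te)) ** 0.5
rmse_tuned = mean_squared_error(y_te, tuned_gb.predict(X_te)) ** 0.5
print(rmse_default, rmse_tuned)
```

Whether tuning helps depends, as the abstract notes, on how far the defaults are from the optimum for the algorithm in question.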

2021 ◽  
Author(s):  
Victor Fung ◽  
Jiaxin Zhang ◽  
Eric Juarez ◽  
Bobby Sumpter

Graph neural networks (GNNs) have received intense interest as a rapidly expanding class of machine learning models remarkably well-suited for materials applications. To date, a number of successful GNNs have been proposed and demonstrated for systems ranging from crystal stability to electronic property prediction and to surface chemistry and heterogeneous catalysis. However, a consistent benchmark of these models remains lacking, hindering the development and consistent evaluation of new models in the materials field. Here, we present a workflow and testing platform, MatDeepLearn, for quickly and reproducibly assessing and comparing GNNs and other machine learning models. We use this platform to optimize and evaluate a selection of top-performing GNNs on several representative datasets in computational materials chemistry. From our investigations we note the importance of hyperparameter selection and find roughly similar performances for the top models once optimized. We identify several strengths of GNNs over conventional models in cases with compositionally diverse datasets and in their overall flexibility with respect to inputs, due to learned rather than predefined representations. Meanwhile, several weaknesses of GNNs are also observed, including high data requirements, and suggestions for further improvement for applications in materials chemistry are proposed.
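The learned-representation idea behind these GNNs rests on message passing over a graph, which can be illustrated in a few lines of NumPy. This is a generic sketch of one graph-convolution step with symmetric normalization, not MatDeepLearn's actual models; the tiny graph, feature sizes, and weights are illustrative assumptions:

```python
# One message-passing (graph convolution) step on a toy 4-node graph.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)   # adjacency of a 4-node graph
X = rng.normal(size=(4, 3))                 # per-node (e.g. per-atom) features
W = rng.normal(size=(3, 8))                 # learnable weight matrix

# Symmetrically normalized adjacency with self-loops.
A_hat = A + np.eye(4)
d = A_hat.sum(axis=1)
A_norm = A_hat / np.sqrt(np.outer(d, d))

H = np.maximum(A_norm @ X @ W, 0.0)         # aggregate neighbors, transform, ReLU
graph_embedding = H.mean(axis=0)            # pooled graph-level representation
print(H.shape, graph_embedding.shape)
```

Stacking several such layers and learning `W` from data is what lets GNNs replace hand-defined materials descriptors with learned ones.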


Author(s):  
Juan Gómez-Sanchis ◽  
Emilio Soria-Olivas ◽  
Marcelino Martinez-Sober ◽  
Jose Blasco ◽  
Juan Guerrero ◽  
...  

This work presents a new approach for one of the main problems in the analysis of atmospheric phenomena, the prediction of atmospheric concentrations of different elements. The proposed methodology is more efficient than other classical approaches and is used in this work to predict tropospheric ozone concentration. The relevance of this problem stems from the fact that excessive ozone concentrations may cause several problems related to public health. Previous research by the authors of this work has shown that the classical approach to this problem (linear models) does not achieve satisfactory results in tropospheric ozone concentration prediction. The authors’ approach is based on Machine Learning (ML) techniques, which include algorithms related to neural networks, fuzzy systems and advanced statistical techniques for data processing. In this work, the authors focus on one of the main ML techniques, namely, neural networks. These models demonstrate their suitability for this problem both in terms of prediction accuracy and information extraction.
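As a rough illustration of the neural-network approach, a small feed-forward regressor can be fitted with scikit-learn. The features and target below are synthetic stand-ins, not real ozone or meteorological data:

```python
# Sketch: a small feed-forward neural network regressor of the kind applied
# to ozone concentration prediction (synthetic data, illustrative features).
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))   # stand-ins for e.g. temperature, wind, NOx, radiation
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + rng.normal(scale=0.1, size=300)

model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X, y)
r2 = model.score(X, y)
print(round(r2, 3))
```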


2020 ◽  
Vol 11 (1) ◽  
pp. 28
Author(s):  
Witness MAAKE ◽  
Terence VAN ZYL

The research aims to investigate the role of hidden orders in the structure of the average market impact curves in the five BRICS financial markets. The concept of market impact is central to the implementation of cost-effective trading strategies during financial order executions. We replicate the literature using data on visible orders from the five BRICS financial markets, then repeat that implementation to investigate the effect of hidden orders, whose dynamics we subsequently study. The research applies machine learning to estimate the sizes of hidden orders. We revisit the methodology of the literature to compare the average market impact curves in which true hidden orders are added to visible orders with the average market impact curves in which hidden order sizes are estimated via machine learning. The study discovers that: (1) hidden order sizes can be uncovered via machine learning techniques such as Generalized Linear Models (GLM), Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Random Forests (RF); and (2) there exists no set of market features that is consistently predictive of the sizes of hidden orders across different stocks. Artificial Neural Networks produce large R2 and small Mean Squared Error when predicting the hidden orders of individual stocks across the five studied markets. Random Forests produce the average price impact curves of visible and estimated hidden orders that are closest to the average market impact curves of visible and true hidden orders. In some markets, hidden orders produce a convex power-law far-right tail, in contrast to visible orders, which produce a concave power-law far-right tail. In some markets hidden orders affect the average price impact curves for orders smaller than the average order size, while in other markets hidden orders do not affect the structure of the average price impact curves.
The research therefore recommends ANN and RF as tools for uncovering hidden orders.
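The size-estimation step can be sketched as a regression problem: train a random forest on order-book-style features and score it with R2 and MSE, as the study does. The features and the synthetic relationship below are illustrative assumptions, not market data:

```python
# Sketch: estimating hidden order sizes with a random forest (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 5))  # stand-ins for e.g. spread, depth, volatility
hidden_size = 100 + 40 * features[:, 0] + 15 * features[:, 1] ** 2 \
    + rng.normal(scale=5, size=n)   # assumed (nonlinear) size relationship

X_tr, X_te, y_tr, y_te = train_test_split(features, hidden_size, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
pred = rf.predict(X_te)
r2 = r2_score(y_te, pred)
mse = mean_squared_error(y_te, pred)
print(round(r2, 2), round(mse, 2))
```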


Author(s):  
Pavel Kikin ◽  
Alexey Kolesnikov ◽  
Alexey Portnov ◽  
Denis Grischenko

The state of ecological systems, along with their general characteristics, is almost always described by indicators that vary in space and time, which significantly complicates the construction of mathematical models for predicting the state of such systems. One way to simplify and automate the construction of such predictive models is the use of machine learning methods. The article compares traditional machine learning algorithms with neural network based methods for predicting spatio-temporal series representing ecosystem data. The analysis and comparison covered the following algorithms and methods: logistic regression, random forest, gradient boosting on decision trees, SARIMAX, long short-term memory (LSTM) networks, and gated recurrent unit (GRU) networks. For the study, data sets were selected that have both spatial and temporal components: mosquito abundance, the number of dengue infections, the physical condition of tropical grove trees, and the water level in a river. The article discusses the necessary steps of preliminary data processing, depending on the algorithm used. In addition, the Kolmogorov complexity of the data sets was calculated as one of the parameters that can help formalize the choice of the most suitable algorithm when constructing mathematical models of spatio-temporal data. Based on the results of the analysis, recommendations are given on the application of particular methods and specific technical solutions, depending on the characteristics of the data set that describes a particular ecosystem.
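Kolmogorov complexity is uncomputable, so practical work approximates it; a common proxy is the compressed size of the serialized series. The article does not specify its estimator, so the stdlib sketch below is only one plausible approach:

```python
# Compression-based proxy for Kolmogorov complexity of a numeric series.
import random
import zlib

def complexity_proxy(series):
    """Approximate complexity as the zlib-compressed byte length."""
    data = ",".join(f"{v:.4f}" for v in series).encode()
    return len(zlib.compress(data, level=9))

random.seed(0)
constant = [1.0] * 200                          # highly regular series
alternating = [float(i % 2) for i in range(200)]  # periodic series
noisy = [random.random() for _ in range(200)]   # incompressible-looking series

print(complexity_proxy(constant),
      complexity_proxy(alternating),
      complexity_proxy(noisy))
```

Regular series compress far more than noisy ones, giving a rough, formalizable measure of how "hard" a data set is before choosing a model.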


10.2196/23938 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e23938
Author(s):  
Ruairi O'Driscoll ◽  
Jake Turicchi ◽  
Mark Hopkins ◽  
Cristiana Duarte ◽  
Graham W Horgan ◽  
...  

Background: Accurate solutions for the estimation of physical activity and energy expenditure at scale are needed for a range of medical and health research fields. Machine learning techniques show promise with research-grade accelerometers, and some evidence indicates that these techniques can be applied to more scalable commercial devices. Objective: This study aims to test the validity and out-of-sample generalizability of algorithms for the prediction of energy expenditure in several wearables (ie, Fitbit Charge 2, ActiGraph GT3-x, SenseWear Armband Mini, and Polar H7) using two laboratory data sets comprising different activities. Methods: Two laboratory studies (study 1: n=59, age 44.4 years, weight 75.7 kg; study 2: n=30, age 31.9 years, weight 70.6 kg), in which adult participants performed a sequential lab-based activity protocol consisting of resting, household, ambulatory, and nonambulatory tasks, were combined in this study. In both studies, accelerometer and physiological data were collected from the wearables alongside energy expenditure measured by indirect calorimetry. Three regression algorithms (random forest, gradient boosting, and neural networks) were used to predict metabolic equivalents (METs), and five classification algorithms (k-nearest neighbor, support vector machine, random forest, gradient boosting, and neural networks) were used to classify physical activity intensity as sedentary, light, or moderate to vigorous. Algorithms were evaluated using leave-one-subject-out cross-validations and out-of-sample validations. Results: The root mean square error (RMSE) was lowest for gradient boosting applied to SenseWear and Polar H7 data (0.91 METs), and in the classification task, gradient boosting applied to SenseWear and Polar H7 was the most accurate (85.5%). Fitbit models achieved an RMSE of 1.36 METs and 78.2% accuracy for classification.
Errors tended to increase in out-of-sample validations, with the SenseWear neural network achieving an RMSE of 1.22 METs in the regression task and the SenseWear gradient boosting and random forest achieving an accuracy of 80% in the classification task. Conclusions: Algorithms trained on combined data sets demonstrated high predictive accuracy, with a tendency for superior performance of random forests and gradient boosting for most but not all wearable devices. Predictions were poorer in the between-study validations, which creates uncertainty regarding the generalizability of the tested algorithms.
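The leave-one-subject-out evaluation can be sketched with scikit-learn's LeaveOneGroupOut, where each fold holds out all samples from one participant. The data below are synthetic; the subject IDs, features, and METs-like target are illustrative assumptions:

```python
# Sketch: leave-one-subject-out cross-validation on synthetic wearable data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_subjects, per_subject = 10, 20
X = rng.normal(size=(n_subjects * per_subject, 6))     # accelerometer/HR-like features
y = X[:, 0] * 2 + rng.normal(scale=0.2, size=len(X))   # METs-like target
groups = np.repeat(np.arange(n_subjects), per_subject) # one ID per participant

scores = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y,
    groups=groups, cv=LeaveOneGroupOut(),
    scoring="neg_root_mean_squared_error",
)
print(len(scores), -scores.mean())   # one RMSE per held-out subject
```

Grouping by subject, rather than splitting rows at random, is what makes the estimate reflect generalization to unseen people.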


Author(s):  
Gebreab K. Zewdie ◽  
David J. Lary ◽  
Estelle Levetin ◽  
Gemechu F. Garuma

Allergies to airborne pollen are a significant issue affecting millions of Americans. Consequently, accurately predicting the daily concentration of airborne pollen is of significant public benefit in providing timely alerts. This study presents a method for the robust estimation of the concentration of airborne Ambrosia pollen using a suite of machine learning approaches, including deep learning and ensemble learners, each of which utilizes data from the European Centre for Medium-Range Weather Forecasts (ECMWF) atmospheric weather and land surface reanalysis. The machine learning approaches used to develop the suite of empirical models are deep neural networks, extreme gradient boosting, random forests, and Bayesian ridge regression. The training data, comprising twenty-four years of daily pollen concentration measurements together with ECMWF weather and land surface reanalysis data from 1987 to 2011, were used to develop the predictive models, and the last six years of the dataset, from 2012 to 2017, were used to independently test their performance. The correlation coefficients between the estimated and actual pollen abundance on the independent validation dataset were 0.82, 0.81, 0.81, and 0.75 for the deep neural networks, random forest, extreme gradient boosting, and Bayesian ridge, respectively, showing that machine learning can be used to effectively forecast the concentrations of airborne pollen.
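The temporal validation scheme (train on 1987-2011, test on 2012-2017, score with the correlation coefficient) can be sketched as follows; the covariates and pollen-like target are synthetic stand-ins, not the ECMWF or pollen data:

```python
# Sketch: train on early years, test on later years, score with correlation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
years = np.repeat(np.arange(1987, 2018), 50)   # 31 synthetic "years" of samples
X = rng.normal(size=(len(years), 5))           # weather/land-surface-like covariates
y = X[:, 0] - 0.5 * X[:, 2] + rng.normal(scale=0.3, size=len(years))

train, test = years <= 2011, years >= 2012     # same split boundary as the paper
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[train], y[train])
r = np.corrcoef(y[test], rf.predict(X[test]))[0, 1]
print(round(r, 2))
```

Splitting by year rather than at random avoids leaking future weather into the training set.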


2019 ◽  
Author(s):  
Pascal Friederich ◽  
Gabriel dos Passos Gomes ◽  
Riccardo De Bin ◽  
Alan Aspuru-Guzik ◽  
David Balcells

Machine learning models, including neural networks, Bayesian optimization, gradient boosting and Gaussian processes, were trained with DFT data for the accurate, affordable and explainable prediction of hydrogen activation barriers in the chemical space surrounding Vaska's complex.
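As a rough sketch of the Gaussian-process component, scikit-learn's GaussianProcessRegressor yields predictions with uncertainty estimates, which supports the "explainable and affordable" use case. The descriptors and barrier-like target below are illustrative assumptions, not the paper's DFT data:

```python
# Sketch: Gaussian-process regression with predictive uncertainty (synthetic data).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(60, 3))   # stand-ins for ligand/electronic descriptors
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.05, size=60)  # barrier-like

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0)
gp.fit(X, y)
mean, std = gp.predict(X[:5], return_std=True)   # predictions with uncertainty
print(mean.shape, std.shape)
```

The per-point standard deviation is what makes Gaussian processes attractive when deciding which new complexes are worth a DFT calculation.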


2020 ◽  
Vol 38 (15_suppl) ◽  
pp. e14069-e14069
Author(s):  
Oguz Akbilgic ◽  
Ibrahim Karabayir ◽  
Hakan Gunturkun ◽  
Joseph F Pierre ◽  
Ashley C Rashe ◽  
...  

e14069 Background: There is growing interest in the links between cancer and the gut microbiome. However, the effect of chemotherapy upon the gut microbiome remains unknown. We studied whether machine learning can: 1) accurately classify subjects with cancer vs healthy controls and 2) whether this classification model is affected by chemotherapy exposure status. Methods: We used the American Gut Project data to build an extreme gradient boosting (XGBoost) model to distinguish between subjects with cancer and healthy controls using simple demographics and published microbiome data. We then further explored the selected features for cancer subjects based on chemotherapy exposure. Results: The cohort included 7,685 subjects, of whom 561 had cancer; 52.5% were female, 87.3% White, and the average age was 44.7 years (SD 17.7). The binary outcome variable represents cancer status. Among the 561 subjects with cancer, 94 were treated with chemotherapy agents before sampling of microbiomes. As predictors, there were four demographic variables (sex, race, age, BMI) and 1,812 operational taxonomic units (OTUs), each found in at least 2 subjects via RNA sequencing. We randomly split the data into 80% training and 20% hidden test sets. We then built an XGBoost model with 5-fold cross-validation using only the training data, yielding an AUC (with 95% CI) of 0.79 (0.77, 0.80), and obtained almost the same AUC on the hidden test data. Based on feature importance analysis, we identified the 12 most important features (Age, BMI and 12 OTUs; 4C0d-2, Brachyspirae, Methanosphaera, Geodermatophilaceae, Bifidobacteriaceae, Slackia, Staphylococcus, Acidaminoccus, Devosia, Proteus), rebuilt a model using only these features, and obtained an AUC of 0.80 (0.77, 0.83) on the hidden test data. The average predicted probabilities for controls, cancer patients who were exposed to chemotherapy, and cancer patients who were not were 0.071 (0.070, 0.073), 0.125 (0.110, 0.140), and 0.156 (0.148, 0.164), respectively.
There was no statistically significant difference in the levels of these 12 OTUs between cancer subjects treated with and without chemotherapy. Conclusions: Machine learning achieved moderately high accuracy in identifying patients’ cancer status from the microbiome. Despite the literature on microbiome-chemotherapy interaction, the levels of the 12 OTUs used in our model were not significantly different for cancer patients with or without chemotherapy exposure. Testing this model on other large population databases is needed for broader validation.
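The feature-importance step can be sketched as follows. Since XGBoost itself may not be available everywhere, scikit-learn's GradientBoostingClassifier stands in for it here; the mock OTU counts and cancer-like labels are illustrative assumptions, not the American Gut Project data:

```python
# Sketch: gradient-boosting feature importance on mock OTU-count predictors.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 600
X = rng.poisson(2.0, size=(n, 20)).astype(float)   # mock OTU abundance counts
logit = 0.8 * X[:, 0] - 0.6 * X[:, 3]              # only features 0 and 3 informative
y = (logit + rng.normal(size=n) > 1.0).astype(int) # synthetic case/control labels

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
top = np.argsort(clf.feature_importances_)[::-1][:5]
print(top)   # indices of the most important features
```

Retraining on only the top-ranked features, as the abstract describes, then tests whether the compact model retains the full model's AUC.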


Author(s):  
He Yang ◽  
Emma Li ◽  
Yi Fang Cai ◽  
Jiapei Li ◽  
George X. Yuan

The purpose of this paper is to establish a framework for extracting early-warning risk features for predicting financial distress, based on the XGBoost model and SHAP. Constructing early-warning risk features to predict the financial distress of companies is very important; compared with traditional statistical methods, data-driven machine learning models for financial early warning achieve better prediction accuracy, but they also bring difficulties, such as models that cannot be explained well. Recently, eXtreme Gradient Boosting (XGBoost), an ensemble learning algorithm based on gradient boosting, has become a hot topic in machine learning research due to its strong nonlinear pattern-recognition ability and high prediction accuracy in practice. In this study, the XGBoost algorithm is used to extract early-warning features for predicting the financial distress of listed companies: 76 financial risk features from seven categories and 14 non-financial risk features from four categories are collected to establish an early-warning system. In empirical tests with respect to AUC, KS, and Kappa, the numerical results show that, compared with the Logistic model, our XGBoost-based method has a much better ability to predict the financial distress risk of listed companies. Moreover, under the framework of SHAP (SHapley Additive exPlanations), we are able to give a reasonable explanation of the important risk features and of the ways they visibly influence financial distress.
The results show that the XGBoost approach to modelling early-warning features for financial distress not only achieves better prediction accuracy but is also explainable, which is significant for identifying early warnings of financial distress risk for listed companies in practice.
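The three reported evaluation metrics (AUC, KS, and Kappa) can be computed as sketched below on synthetic classifier scores; the labels, scores, and decision threshold are illustrative assumptions:

```python
# Sketch: computing AUC, the Kolmogorov-Smirnov statistic, and Cohen's kappa.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.metrics import cohen_kappa_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                     # synthetic labels
scores = y_true * 0.4 + rng.normal(scale=0.3, size=1000)   # informative scores

auc = roc_auc_score(y_true, scores)
# KS: max gap between the score distributions of the two classes.
ks = ks_2samp(scores[y_true == 1], scores[y_true == 0]).statistic
# Kappa needs hard labels; 0.2 is an illustrative threshold.
kappa = cohen_kappa_score(y_true, (scores > 0.2).astype(int))
print(round(auc, 2), round(ks, 2), round(kappa, 2))
```

AUC and KS score the ranking quality of the model, while Kappa measures agreement beyond chance after a threshold is chosen.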

