Comparative performance of generalized additive models and boosted regression trees for statistical modeling of incidental catch of wahoo (Acanthocybium solandri) in the Mexican tuna purse-seine fishery

2012 · Vol 233 · pp. 20-25
Author(s): Raul O. Martínez-Rincón, Sofía Ortega-García, Juan G. Vaca-Rodríguez
2020 · Vol 79 (Suppl 1) · pp. 1252.2-1253
Author(s): R. Garofoli, M. Resche-Rigon, M. Dougados, D. Van der Heijde, C. Roux, ...

Background: Axial spondyloarthritis (axSpA) is a chronic rheumatic disease that encompasses various clinical presentations: inflammatory chronic back pain, peripheral manifestations and extra-articular manifestations. The current nomenclature divides axSpA into radiographic (in the presence of radiographic sacroiliitis) and non-radiographic (in the absence of radiographic sacroiliitis, with or without MRI sacroiliitis). Given that the functional burden of the disease appears to be greater in patients with radiographic forms, it seems crucial to be able to predict which patients will be more likely to develop structural damage over time. Predictive factors for radiographic progression in axSpA have been identified through the use of traditional statistical models such as logistic regression. However, these models have some limitations. To overcome these limitations and improve predictive performance, machine learning (ML) methods have been developed.

Objectives: To compare ML models with traditional models for predicting radiographic progression in patients with early axSpA.

Methods: Study design: prospective French multicentric cohort study (DESIR cohort) with 5 years of follow-up. Patients: all patients included in the cohort, i.e. 708 patients with inflammatory back pain for >3 months but <3 years, highly suggestive of axSpA. Data from the first 5 years of follow-up were used. Statistical analyses: radiographic progression was defined as progression either at the spine (increase of at least 1 point per 2 years in mSASSS score) or at the sacroiliac joint (worsening of at least one grade in the mNY score between 2 visits). Traditional modelling: we first performed a bivariate analysis between our outcome (radiographic progression) and explanatory variables at baseline to select the variables to include in our models, and then built a logistic regression model (M1).
Variable selection for the traditional models was performed with 2 different methods: stepwise selection based on the Akaike Information Criterion (stepAIC) (M2), and the Least Absolute Shrinkage and Selection Operator (LASSO) (M3). We also performed a sensitivity analysis on all patients with a manual backward method (M4) after multiple imputation of missing data. Machine learning modelling: using the “SuperLearner” package in R, we modelled radiographic progression with stepAIC, LASSO, random forest, Discrete Bayesian Additive Regression Trees Samplers (DBARTS), Generalized Additive Models (GAM), multivariate adaptive polynomial spline regression (polymars), Recursive Partitioning And Regression Trees (RPART) and the Super Learner. Finally, the accuracy of the traditional and ML models was compared based on their 10-fold cross-validated AUC (cv-AUC).

Results: The 10-fold cv-AUCs for the traditional models were 0.79 and 0.78 for M2 and M3, respectively. The 3 best models in the ML approach were the GAM, DBARTS and Super Learner models, with 10-fold cv-AUCs of 0.77, 0.76 and 0.74, respectively (Table 1).

Table 1. Comparison of 10-fold cross-validated AUC between the best traditional and machine learning models.

Best models                                                   Cross-validated AUC
Traditional models
  M2 (stepAIC method)                                         0.79
  M3 (LASSO method)                                           0.78
Machine learning approach
  SL Discrete Bayesian Additive Regression Trees Samplers
  (DBARTS)                                                    0.76
  SL Generalized Additive Models (GAM)                        0.77
  Super Learner                                               0.74

AUC: Area Under the Curve; AIC: Akaike Information Criterion; LASSO: Least Absolute Shrinkage and Selection Operator; SL: SuperLearner. N = 295.

Conclusion: The traditional models predicted radiographic progression better than the ML models in this early axSpA population. Further ML algorithms, image-based or using other artificial intelligence methods (e.g. deep learning), might perform better than traditional models in this setting.

Acknowledgments: Thanks to the French National Society of Rheumatology and the DESIR cohort.

Disclosure of Interests: Romain Garofoli: None declared. Matthieu Resche-Rigon: None declared. Maxime Dougados: Grant/research support from AbbVie, Eli Lilly, Merck, Novartis, Pfizer and UCB Pharma; Consultant of AbbVie, Eli Lilly, Merck, Novartis, Pfizer and UCB Pharma; Speakers bureau: AbbVie, Eli Lilly, Merck, Novartis, Pfizer and UCB Pharma. Désirée van der Heijde: Consultant of AbbVie, Amgen, Astellas, AstraZeneca, BMS, Boehringer Ingelheim, Celgene, Cyxone, Daiichi, Eisai, Eli-Lilly, Galapagos, Gilead Sciences, Inc., Glaxo-Smith-Kline, Janssen, Merck, Novartis, Pfizer, Regeneron, Roche, Sanofi, Takeda, UCB Pharma; Director of Imaging Rheumatology BV. Christian Roux: None declared. Anna Moltó: Grant/research support from Pfizer, UCB; Consultant of AbbVie, BMS, MSD, Novartis, Pfizer, UCB.
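The study's model comparison hinges on 10-fold cross-validated AUC. A rough, hypothetical sketch of that evaluation loop in Python/scikit-learn, on synthetic data standing in for the baseline variables (the study itself used R's SuperLearner package; model names and settings here are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a baseline-features matrix (NOT the DESIR data).
X, y = make_classification(n_samples=295, n_features=20, n_informative=5,
                           random_state=0)

models = {
    "LASSO logistic": LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# 10-fold cross-validated AUC, the comparison metric used in the abstract.
cv_auc = {name: cross_val_score(m, X, y, cv=10, scoring="roc_auc").mean()
          for name, m in models.items()}
for name, auc in cv_auc.items():
    print(f"{name}: cv-AUC = {auc:.2f}")
```

Ranking models by the same cross-validated metric, rather than training fit, is what makes the traditional-vs-ML comparison fair.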


2021 · Vol 10 (5) · pp. 343
Author(s): Diana Sousa-Guedes, Marc Franch, Neftalí Sillero

Road networks are the main source of mortality for many species. Amphibians, which are in global decline, are the most road-killed fauna group, due to their activity patterns and preferred habitats. Many different methodologies, such as logistic regression, have been applied to model the relationship between the environment and road-kill events. Here, we compared the performance of five regression techniques in relating amphibian road-kill frequency to environmental variables. For this, we surveyed three country roads in northern Portugal in search of road-killed amphibians. To explain the presence of road-kills, we selected a set of environmental variables important for the presence of amphibians and the occurrence of road-kills. We compared the performances of five modeling techniques: (i) generalized linear models, (ii) generalized additive models, (iii) random forest, (iv) boosted regression trees, and (v) geographically weighted regression. The boosted regression trees and geographically weighted regression techniques performed best, with a percentage of deviance explained between 61.8% and 76.6% and between 55.3% and 66.7%, respectively. Moreover, geographically weighted regression showed a great advantage over the other techniques, as it allows mapping local parameter coefficients as well as local model performance (pseudo-R2). The results suggest that geographically weighted regression is a useful tool for road-kill modeling, as well as for better visualizing and mapping the spatial variability of the models.
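The techniques above are ranked by percentage of deviance explained. A minimal sketch of how that metric is computed for a logistic GLM, on made-up covariates (not the Portuguese survey data; variable names are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# Toy presence/absence of road-kills vs. two invented environmental variables.
rng = np.random.default_rng(0)
n = 300
dist_water = rng.uniform(0, 500, n)   # distance to water body (m)
traffic = rng.uniform(0, 1, n)        # scaled traffic intensity
logit = 2.0 - 0.01 * dist_water + 1.5 * traffic
presence = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([dist_water, traffic])
glm = LogisticRegression(C=1e6, max_iter=1000).fit(X, presence)  # ~unpenalized GLM

# Deviance D = 2 * n * mean negative log-likelihood; the null model
# predicts the overall prevalence for every site.
resid_dev = 2 * n * log_loss(presence, glm.predict_proba(X)[:, 1])
null_dev = 2 * n * log_loss(presence, np.full(n, presence.mean()))
pct_dev = 100 * (1 - resid_dev / null_dev)
print(f"% deviance explained: {pct_dev:.1f}")
```

The same ratio generalizes to GAMs and boosted trees, which is what lets the study compare such different model families on one scale.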


2019
Author(s): Adam B. Smith, Maria J. Santos

Abstract: Models of species’ distributions and niches are frequently used to infer the importance of range- and niche-defining variables. However, the degree to which these models can reliably identify important variables and quantify their influence remains unknown. Here we use a series of simulations to explore how well models can (1) discriminate between variables with different influence and (2) calibrate the magnitude of influence relative to an “omniscient” model. To quantify variable importance, we trained generalized additive models (GAMs), Maxent, and boosted regression trees (BRTs) on simulated data and tested their sensitivity to permutations in each predictor. Importance was inferred by calculating the correlation between permuted and unpermuted predictions, and by comparing the predictive accuracy of permuted and unpermuted predictions using AUC and the Continuous Boyce Index. In scenarios with one influential and one uninfluential variable, models were unable to discriminate reliably between variables under conditions that are normally challenging for generating accurate predictions: training occurrences <8-64; prevalence >0.5; small spatial extent; environmental data with coarse resolution when spatial autocorrelation is low; and correlation between environmental variables where |r| >0.7. When two variables influenced the distribution equally, importance was underestimated when species had narrow or intermediate niche breadth. Interactions between variables in how they shaped the niche did not affect inferences about their importance. When variables acted unequally, the effect of the stronger variable was overestimated. GAMs and Maxent discriminated between variables more reliably than BRTs, but no algorithm was consistently well-calibrated vis-à-vis the omniscient model. Algorithm-specific measures of importance like Maxent’s change-in-gain metric were less robust than the permutation test.
Overall, high predictive accuracy did not connote robust inferential capacity. As a result, requirements for reliably measuring variable importance are likely more stringent than for creating models with high predictive accuracy.
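The permutation test described above can be sketched in a few lines: nothing is refit; one predictor is shuffled and the drop in predictive accuracy is measured. A toy analogue of the one-influential/one-uninfluential scenario, using scikit-learn's GradientBoostingClassifier as a stand-in for the BRTs in the study:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Simulated occurrences: x0 drives the distribution, x1 is pure noise.
rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 2))
y = (rng.random(n) < 1 / (1 + np.exp(-2 * X[:, 0]))).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
base_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])

# Permutation importance: shuffle one column at a time, record the AUC drop.
importance = {}
for j in range(X.shape[1]):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])
    perm_auc = roc_auc_score(y, model.predict_proba(Xp)[:, 1])
    importance[j] = base_auc - perm_auc

print(importance)  # the influential variable should show the larger drop
```

In the easy case the drop for x0 dwarfs that for x1; the study's point is that under sparse data, high prevalence, or collinearity, this gap shrinks until the two variables become indistinguishable.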


2019
Author(s): Duarte S. Viana, Petr Keil, Alienor Jeliazkov

Abstract: Community ecologists and macroecologists have long sought to evaluate the importance of environmental conditions and other drivers in determining species composition across sites. Different methods have been used to estimate species-environment relationships while accounting for or partitioning the variation attributed to environment and spatial autocorrelation, but their differences and respective reliability remain poorly known. We compared the performance of four families of statistical methods in estimating the contribution of the environment and space to explaining variation in multi-species occurrence and abundance. These methods included distance-based regression (MRM), constrained ordination (RDA and CCA), generalised linear and additive models (GLM, GAM), and tree-based machine learning (regression trees, boosted regression trees, and random forests). Depending on the method, the spatial model consisted of either Moran’s Eigenvector Maps (MEM; in constrained ordination and GLM), smooth spatial splines (in GAM), or tree-based non-linear modelling of spatial coordinates (in machine learning). We simulated typical ecological data to assess the methods’ performance in (1) fitting environmental and spatial effects, and (2) partitioning the variation explained by the environmental and spatial effects. Differences in fitting performance among the major model types – (G)LM, GAM, machine learning – were reflected in the variation partitioning performance of the different methods. Machine learning methods, namely boosted regression trees, performed best overall. GAM performed similarly well, though likelihood optimisation did not converge for some empirical test data. The remaining methods performed worse under most simulated data variations (depending on the type of species data, sample size and coverage, autocorrelation range, and response shape).
Our results suggest that tree-based machine learning is a robust approach that can be widely used for variation partitioning. Our recommendations apply to single-species niche models, community ecology, and macroecology studies aiming at disentangling the relative contributions of space vs. environment and other drivers of variation in site-by-species matrices.
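The variation-partitioning scheme evaluated above can be sketched as three fits (environment-only, space-only, environment+space) whose explained-variance fractions are differenced. A simplified analogue using random forests with out-of-bag R² on simulated data (all variable names and effect sizes are hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated abundance: an environmental gradient plus a spatial trend.
rng = np.random.default_rng(2)
n = 400
coords = rng.uniform(0, 100, size=(n, 2))          # site coordinates
env = rng.normal(size=(n, 1))                      # environmental variable
abundance = (3 * env[:, 0] + 0.05 * coords[:, 0]   # env effect + spatial trend
             + rng.normal(scale=0.5, size=n))

def fit_r2(features):
    """Out-of-bag R² of a random forest fit to the given predictors."""
    m = RandomForestRegressor(n_estimators=300, oob_score=True,
                              random_state=0).fit(features, abundance)
    return m.oob_score_

r2_env = fit_r2(env)                               # environment only
r2_space = fit_r2(coords)                          # space only
r2_full = fit_r2(np.hstack([env, coords]))         # environment + space

# Classic partition: pure fractions by differencing, overlap by inclusion-exclusion.
print(f"pure env:   {r2_full - r2_space:.2f}")
print(f"pure space: {r2_full - r2_env:.2f}")
print(f"shared:     {r2_env + r2_space - r2_full:.2f}")
```

Using out-of-bag (rather than training) R² matters here: a forest's training fit is near-perfect for any predictor set, which would make the partition meaningless.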


2020 · Vol 653 · pp. 105-119
Author(s): J Hilliard, D Karlen, T Dix, S Markham, A Schulze

Capitellid polychaetes are ubiquitous throughout the world’s oceans and are often encountered in high abundance. We used an extensive dataset of species abundance and distribution records of the Capitella capitata complex, C. aciculata, C. jonesi, Heteromastus filiformis, Mediomastus ambiseta, and M. californiensis from Tampa Bay, Florida, USA, as a model system of closely related species filling a similar ecological niche. We sought to (1) characterize the spatial distribution of each species, (2) determine if a single species abundance modeling strategy could be applied to them all, and (3) assess environmental drivers of species distribution and abundance. We found that all species had a zero-inflated abundance distribution and there was spatial autocorrelation by bay regions. Lorenz curves were an effective tool to assess spatial patterns of species abundance across large areas. Bay segment, depth, and dissolved oxygen were the most important environmental drivers. Modeling was accomplished by comparing 6 different approaches: 4 generalized additive models (GAMs: Poisson, negative binomial, Tweedie, and zero-inflated Poisson distributions), hurdle models, and boosted regression trees. There was no single model with top performance for every species. However, GAM-Tweedie and hurdle models performed well overall and may be useful for studies of other benthic marine invertebrates.
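A hurdle model of the kind compared above couples a binary occurrence model with a count model fit only to the positive observations. A minimal sketch on made-up zero-inflated counts (hypothetical depth and dissolved-oxygen covariates; a truncated count likelihood, as in a full hurdle model, is approximated here by an ordinary Poisson fit on the positives):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, PoissonRegressor

# Zero-inflated toy counts: occurrence depends on depth, abundance
# (given presence) on dissolved oxygen. Entirely simulated.
rng = np.random.default_rng(3)
n = 500
depth = rng.uniform(0, 30, n)              # depth (m)
do = rng.uniform(2, 8, n)                  # dissolved oxygen (mg/L)
present = rng.binomial(1, 1 / (1 + np.exp(-(2 - 0.15 * depth))))
counts = present * rng.poisson(np.exp(0.3 * do - 0.5))

X = np.column_stack([depth, do])

# Hurdle part 1: does the species occur at all?
occ = LogisticRegression().fit(X, (counts > 0).astype(int))
# Hurdle part 2: expected count where it occurs (Poisson on positives).
pos = counts > 0
cnt = PoissonRegressor(alpha=0, max_iter=300).fit(X[pos], counts[pos])

# Expected abundance = P(presence) * E[count | presence].
expected = occ.predict_proba(X)[:, 1] * cnt.predict(X)
print(f"mean observed: {counts.mean():.2f}, mean predicted: {expected.mean():.2f}")
```

Splitting the zeros out this way is what lets the two parts respond to different drivers, matching the finding that occurrence and abundance need not share a single best model.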

