Testing the ability of species distribution models to infer variable importance

Mapping Intimacies ◽

10.1101/715904 ◽

2019 ◽

Cited By ~ 2

Author(s):

Adam B. Smith ◽

Maria J. Santos

Keyword(s):

Predictive Accuracy ◽

Permutation Test ◽

Generalized Additive Models ◽

Simulated Data ◽

Variable Importance ◽

Additive Models ◽

Environmental Data ◽

Boosted Regression Trees ◽

Distribution Models ◽

Model Algorithm

Abstract. Models of species’ distributions and niches are frequently used to infer the importance of range- and niche-defining variables. However, the degree to which these models can reliably identify important variables and quantify their influence remains unknown. Here we use a series of simulations to explore how well models can 1) discriminate between variables with different influence and 2) calibrate the magnitude of influence relative to an “omniscient” model. To quantify variable importance, we trained generalized additive models (GAMs), Maxent, and boosted regression trees (BRTs) on simulated data and tested their sensitivity to permutations in each predictor. Importance was inferred by calculating the correlation between permuted and unpermuted predictions, and by comparing predictive accuracy of permuted and unpermuted predictions using AUC and the Continuous Boyce Index. In scenarios with one influential and one uninfluential variable, models were unable to discriminate reliably between variables in conditions that are normally challenging for generating accurate predictions: training occurrences <8-64; prevalence >0.5; small spatial extent; environmental data with coarse resolution when spatial autocorrelation is low; and correlation between environmental variables where |r| >0.7. When two variables influenced the distribution equally, importance was underestimated when species had narrow or intermediate niche breadth. Interactions between variables in how they shaped the niche did not affect inferences about their importance. When variables acted unequally, the effect of the stronger variable was overestimated. GAMs and Maxent discriminated between variables more reliably than BRTs, but no algorithm was consistently well-calibrated vis-à-vis the omniscient model. Algorithm-specific measures of importance like Maxent’s change-in-gain metric were less robust than the permutation test. Overall, high predictive accuracy did not connote robust inferential capacity. As a result, requirements for reliably measuring variable importance are likely more stringent than for creating models with high predictive accuracy.
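
The correlation-based permutation test described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `permutation_importance_cor`, the toy logistic response `toy_model`, and the choice to report importance as 1 minus the mean correlation are all assumptions made here for illustration.

```python
import numpy as np

def permutation_importance_cor(predict, X, n_perm=20, seed=0):
    """Importance of each column of X, taken here (an assumption) as
    1 - mean Pearson correlation between predictions on the original
    data and predictions after shuffling that column."""
    rng = np.random.default_rng(seed)
    baseline = predict(X)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        cors = []
        for _ in range(n_perm):
            Xp = X.copy()
            # Shuffle one predictor; all other columns stay intact.
            Xp[:, j] = rng.permutation(Xp[:, j])
            cors.append(np.corrcoef(baseline, predict(Xp))[0, 1])
        importance[j] = 1.0 - np.mean(cors)
    return importance

# Hypothetical stand-in for a fitted model: the response depends on
# column 0 only, so column 1 is uninfluential by construction.
def toy_model(X):
    return 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
imp = permutation_importance_cor(toy_model, X)
print(imp)
```

In this toy case the influential column scores close to 1 and the uninfluential column scores 0, because shuffling it leaves the predictions unchanged. With a fitted GAM, Maxent, or BRT model, `predict` would be replaced by that model's prediction function.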

Performance evaluation of cetacean species distribution models developed using generalized additive models and boosted regression trees

Ecology and Evolution ◽

10.1002/ece3.6316 ◽

2020 ◽

Vol 10 (12) ◽

pp. 5759-5784

Author(s):

Elizabeth A. Becker ◽

James V. Carretta ◽

Karin A. Forney ◽

Jay Barlow ◽

Stephanie Brodie ◽

...

Keyword(s):

Performance Evaluation ◽

Species Distribution ◽

Species Distribution Models ◽

Generalized Additive Models ◽

Regression Trees ◽

Additive Models ◽

Boosted Regression Trees ◽

Distribution Models ◽

Cetacean Species

Empirical streamflow simulation for water resource management in data-scarce seasonal watersheds

Hydrology and Earth System Sciences Discussions ◽

10.5194/hessd-12-11083-2015 ◽

2015 ◽

Vol 12 (10) ◽

pp. 11083-11127 ◽

Cited By ~ 3

Author(s):

J. E. Shortridge ◽

S. D. Guikema ◽

B. F. Zaitchik

Keyword(s):

Water Resource Management ◽

Predictive Accuracy ◽

Generalized Additive Models ◽

Model Performance ◽

Additive Models ◽

Physical Models ◽

Multivariate Adaptive Regression Splines ◽

Learning Approaches ◽

Climate Conditions ◽

Monthly Streamflow

Abstract. In the past decade, certain methods for empirical rainfall–runoff modeling have seen extensive development and been proposed as a useful complement to physical hydrologic models, particularly in basins where data to support process-based models is limited. However, the majority of research has focused on a small number of methods, such as artificial neural networks, despite the development of multiple other approaches for non-parametric regression in recent years. Furthermore, this work has generally evaluated model performance based on predictive accuracy alone, while not considering broader objectives such as model interpretability and uncertainty that are important if such methods are to be used for planning and management decisions. In this paper, we use multiple regression and machine-learning approaches to simulate monthly streamflow in five highly-seasonal rivers in the highlands of Ethiopia and compare their performance in terms of predictive accuracy, error structure and bias, model interpretability, and uncertainty when faced with extreme climate conditions. While the relative predictive performance of models differed across basins, data-driven approaches were able to achieve reduced errors when compared to physical models developed for the region. Methods such as random forests and generalized additive models may have advantages in terms of visualization and interpretation of model structure, which can be useful in providing insights into physical watershed function. However, the uncertainty associated with model predictions under climate change should be carefully evaluated, since certain models (especially generalized additive models and multivariate adaptive regression splines) became highly variable when faced with high temperatures.

Comparative performance of generalized additive models and boosted regression trees for statistical modeling of incidental catch of wahoo (Acanthocybium solandri) in the Mexican tuna purse-seine fishery

Ecological Modelling ◽

10.1016/j.ecolmodel.2012.03.006 ◽

2012 ◽

Vol 233 ◽

pp. 20-25 ◽

Cited By ~ 31

Author(s):

Raul O. Martínez-Rincón ◽

Sofía Ortega-García ◽

Juan G. Vaca-Rodríguez

Keyword(s):

Statistical Modeling ◽

Generalized Additive Models ◽

Regression Trees ◽

Additive Models ◽

Boosted Regression Trees ◽

Purse Seine ◽

Comparative Performance ◽

Incidental Catch ◽

Purse Seine Fishery

Using generalized additive models to analyze biomedical non-linear longitudinal data

10.1101/2021.06.10.447970 ◽

2021 ◽

Author(s):

Ariel I. Mundo ◽

John R. Tipton ◽

Timothy J. Muldoon

Keyword(s):

Longitudinal Data ◽

Biomedical Research ◽

Repeated Measures ◽

Generalized Additive Models ◽

Simulated Data ◽

Additive Models ◽

Biomedical Literature ◽

Missing Observations ◽

Non Linear ◽

Biased Estimates

In biomedical research, the outcome of longitudinal studies has been traditionally analyzed using the repeated measures analysis of variance (rm-ANOVA) or more recently, linear mixed models (LMEMs). Although LMEMs are less restrictive than rm-ANOVA in terms of correlation and missing observations, both methodologies share an assumption of linearity in the measured response, which results in biased estimates and unreliable inference when they are used to analyze data where the trends are non-linear, which is a common occurrence in biomedical research. In contrast, generalized additive models (GAMs) relax the linearity assumption, and allow the data to determine the fit of the model while permitting missing observations and different correlation structures. Therefore, GAMs present an excellent choice to analyze non-linear longitudinal data in the context of biomedical research. This paper summarizes the limitations of rm-ANOVA and LMEMs and uses simulated data to visually show how both methods produce biased estimates when used on non-linear data. We also present the basic theory of GAMs, and using trends of oxygen saturation in tumors reported in the biomedical literature, we simulate example longitudinal data (2 treatment groups, 10 subjects per group, 6 repeated measures for each group) to demonstrate how these models can be computationally implemented. We show that GAMs are able to produce estimates that are consistent with the trends of biomedical non-linear data even in the case when missing observations exist (with 40% of the observations missing), allowing reliable inference from the data. To make this work reproducible, the code and data used in this paper are available at: https://github.com/aimundo/GAMs-biomedical-research.
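
The simulation design mentioned in this abstract (2 treatment groups, 10 subjects per group, 6 repeated measures, 40% of observations missing) can be sketched as follows. This is not the code from the authors' repository: the mean trends, the noise level, and all variable names are hypothetical values chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, n_subjects, n_times = 2, 10, 6
time = np.arange(n_times)

# Hypothetical non-linear mean trends, one per treatment group
# (illustrative values, not those reported in the paper).
mean_trend = np.stack([
    30 + 5.0 * time - 1.0 * time**2,
    30 - 2.0 * time + 0.5 * time**2,
])

# Simulated responses with shape (group, subject, time): each subject
# follows its group trend plus independent Gaussian noise.
y = mean_trend[:, None, :] + rng.normal(
    scale=2.0, size=(n_groups, n_subjects, n_times)
)

# Remove 40% of the observations completely at random.
n_missing = round(0.4 * y.size)
missing_idx = rng.choice(y.size, size=n_missing, replace=False)
y_missing = y.copy()
y_missing.flat[missing_idx] = np.nan

# Group-by-time means computed from the remaining observations.
observed_means = np.nanmean(y_missing, axis=1)
print(observed_means.round(1))
```

A data set of this shape is what a GAM, a linear mixed model, or rm-ANOVA would then be fitted to; the fitting step is left out here because it depends on the modelling library used.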

Machine learning methods for empirical streamflow simulation: a comparison of model accuracy, interpretability, and uncertainty in seasonal watersheds

Hydrology and Earth System Sciences ◽

10.5194/hess-20-2611-2016 ◽

2016 ◽

Vol 20 (7) ◽

pp. 2611-2628 ◽

Cited By ~ 70

Author(s):

Julie E. Shortridge ◽

Seth D. Guikema ◽

Benjamin F. Zaitchik

Keyword(s):

Machine Learning ◽

Predictive Accuracy ◽

Generalized Additive Models ◽

Additive Models ◽

Multivariate Adaptive Regression Splines ◽

Regression Splines ◽

Climate Conditions ◽

Machine Learning Methods ◽

Adaptive Regression ◽

Adaptive Regression Splines

Abstract. In the past decade, machine learning methods for empirical rainfall–runoff modeling have seen extensive development and been proposed as a useful complement to physical hydrologic models, particularly in basins where data to support process-based models are limited. However, the majority of research has focused on a small number of methods, such as artificial neural networks, despite the development of multiple other approaches for non-parametric regression in recent years. Furthermore, this work has often evaluated model performance based on predictive accuracy alone, while not considering broader objectives, such as model interpretability and uncertainty, that are important if such methods are to be used for planning and management decisions. In this paper, we use multiple regression and machine learning approaches (including generalized additive models, multivariate adaptive regression splines, artificial neural networks, random forests, and M5 cubist models) to simulate monthly streamflow in five highly seasonal rivers in the highlands of Ethiopia and compare their performance in terms of predictive accuracy, error structure and bias, model interpretability, and uncertainty when faced with extreme climate conditions. While the relative predictive performance of models differed across basins, data-driven approaches were able to achieve reduced errors when compared to physical models developed for the region. Methods such as random forests and generalized additive models may have advantages in terms of visualization and interpretation of model structure, which can be useful in providing insights into physical watershed function. However, the uncertainty associated with model predictions under extreme climate conditions should be carefully evaluated, since certain models (especially generalized additive models and multivariate adaptive regression splines) become highly variable when faced with high temperatures.

Comparative analysis of statistical tools to identify recruitment–environment relationships and forecast recruitment strength

ICES Journal of Marine Science ◽

10.1016/j.icesjms.2005.05.018 ◽

2005 ◽

Vol 62 (7) ◽

pp. 1256-1269 ◽

Cited By ~ 38

Author(s):

Bernard A. Megrey ◽

Yong-Woo Lee ◽

S. Allen Macklin

Keyword(s):

Linear Regression ◽

Generalized Additive Models ◽

Simulated Data ◽

Additive Models ◽

Clupea Harengus ◽

Influential Factor ◽

Data Simulation ◽

Sea Temperature ◽

The North ◽

The North Atlantic Oscillation

Abstract Many of the factors affecting recruitment in marine populations are still poorly understood, complicating the prediction of strong year classes. Despite numerous attempts, the complexity of the problem often seems beyond the capabilities of traditional statistical analysis paradigms. This study examines the utility of four statistical procedures to identify relationships between recruitment and the environment. Because we can never really know the parameters or underlying relationships of actual data, we chose to use simulated data with known properties and different levels of measurement error to test and compare the methods, especially their ability to forecast future recruitment states. Methods examined include traditional linear regression, non-linear regression, Generalized Additive Models (GAM), and Artificial Neural Networks (ANN). Each is compared according to its ability to recover known patterns and parameters from simulated data, as well as to accurately forecast future recruitment states. We also apply the methods to published Norwegian spring-spawning herring (Clupea harengus L.) spawner–recruit–environment data. Results were not consistently conclusive, but in general, flexible non-parametric methods such as GAMs and ANNs performed better than parametric approaches in both parameter estimation and forecasting. Even under controlled data simulation procedures, we saw evidence of spurious correlations. Models fit to the Norwegian spring-spawning herring data show the importance of sea temperature and spawning biomass. The North Atlantic Oscillation (NAO) did not appear to be an influential factor affecting herring recruitment.

Distribution and Catch Rate Characteristics of Narrow-Barred Spanish Mackerel (Scomberomorus commerson) in Relation to Oceanographic Factors in the Waters Around Taiwan

Frontiers in Marine Science ◽

10.3389/fmars.2021.770722 ◽

2021 ◽

Vol 8 ◽

Author(s):

Lu-Chi Chen ◽

Jinn-Shing Weng ◽

Muhamad Naimullah ◽

Po-Yuan Hsiao ◽

Chen-Te Tseng ◽

...

Keyword(s):

Southern Oscillation ◽

Generalized Additive Models ◽

Additive Models ◽

Environmental Data ◽

La Niña ◽

Northeast Monsoon ◽

Spatial And Temporal Patterns ◽

Spanish Mackerel ◽

La Nina ◽

Scomberomorus Commerson

This study investigated the relationship of the catch rates (CRs) of Spanish mackerel (Scomberomorus commerson) with oceanographic factors in the waters around Taiwan by using high-resolution fishery and environmental data for the period 2011–2016. The investigation results revealed that trammel nets accounted for 69.79% of the total catch of S. commerson and were operated mostly in the Taiwan Strait (TS). We noted seasonal variations in the distribution of high CRs. These CRs were observed in the southwestern TS, including the waters along the southwestern coast of Taiwan and around the Penghu Islands, and extended to the Taiwan Bank during autumn; they increased in winter. To predict the spatial and temporal patterns of Spanish mackerel density and their relationship with oceanographic and spatiotemporal variables, generalized additive models were used. These models explained 48.4% of the total deviance, which was consistent with the assumed Gaussian distribution. Moreover, all variables examined were significant CR predictors (p < 0.05). Latitude and longitude were the key factors influencing the spatiotemporal distribution of S. commerson, and sea surface chlorophyll a concentration was a key oceanographic factor. Observing projected changes in El Niño/Southern Oscillation events for S. commerson revealed that CRs were higher and distributed further southward during La Niña events than during other events. We inferred that the S. commerson distribution gradually moved toward the southwest with the northeast monsoon, which was enhanced during La Niña in winter.

Geocoding and spatio-temporal modeling of long-term PM2.5 and NO2 exposure: the Mexican Teachers' Cohort

10.21203/rs.3.rs-362282/v1 ◽

2021 ◽

Author(s):

Cervantes-Martínez Karla ◽

Riojas-Rodríguez Horacio ◽

Díaz-Ávalos Carlos ◽

Moreno-Macías Hortensia ◽

López-Ridaura Ruy ◽

...

Keyword(s):

Predictive Accuracy ◽

Generalized Additive Models ◽

Epidemiological Studies ◽

Additive Models ◽

Temporal Modeling ◽

Secondary Sources ◽

Temporal Models ◽

Out Of Sample ◽

Spatio Temporal

Abstract Epidemiological studies on the effects of air pollution in Mexico often use the environmental concentrations of monitors closest to the home as exposure proxies, yet this approach disregards the space gradients of pollutants and assumes that individuals have no intra-city mobility. Our aim was to develop high-resolution spatial and temporal models for predicting long-term exposure to PM2.5 and NO2 in a population of ~ 16 500 participants from the Mexican Teachers’ Cohort study. We geocoded the home and work addresses of participants. Using information from secondary sources on geographic and meteorological variables as well as other pollutants, we fitted two generalized additive models to predict monthly PM2.5 and NO2 concentrations in the 2004–2019 period. The models were evaluated through 10-fold cross validation. Both showed high predictive accuracy with out-of-sample data and no overfitting (CV RMSE = 0.102 for PM2.5 and CV RMSE = 4.497 for NO2). Participants were exposed to a monthly average of 24.38 (6.78) µg/m3 of PM2.5 and 28.21 (8.00) ppb of NO2 during the study period. These models offer a solid alternative for estimating PM2.5 and NO2 exposure with high spatio-temporal resolution for epidemiological studies in the Valle de México region.
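
The 10-fold cross-validated RMSE reported in this abstract can be illustrated with a short sketch. It does not reproduce the study's models: ordinary least squares stands in for the generalized additive models, the data are synthetic, and the function name `kfold_rmse` is hypothetical.

```python
import numpy as np

def kfold_rmse(X, y, k=10, seed=0):
    """Out-of-sample RMSE from k-fold cross-validation.
    Ordinary least squares is used as a stand-in model (an assumption);
    squared errors are pooled over all folds before taking the root."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    sq_err = []
    for test in folds:
        train = np.setdiff1d(idx, test)
        # Fit on the training folds only (intercept plus predictors).
        A = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        # Predict the held-out fold and record squared errors.
        B = np.column_stack([np.ones(len(test)), X[test]])
        sq_err.append((y[test] - B @ coef) ** 2)
    return float(np.sqrt(np.mean(np.concatenate(sq_err))))

# Synthetic example: linear signal plus Gaussian noise with sd 0.1.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=300)
cv_rmse = kfold_rmse(X, y)
print(cv_rmse)
```

Comparing this value with the RMSE on the training data is one common way to check for overfitting: a cross-validated RMSE close to the in-sample RMSE suggests little overfitting.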

Generalized Additive Models Analysis of the Atmospheric Pollutants Response to the Emission Reduction and Meteorology During the COVID-19 Lockdown in the North of Africa (Morocco)

10.21203/rs.3.rs-1029027/v1 ◽

2021 ◽

Author(s):

Salah Eddine Sbai ◽

Farida Bentayeb ◽

Hao Yin

Keyword(s):

Air Pollution ◽

Air Pollutants ◽

Emission Reduction ◽

Meteorological Factors ◽

Generalized Additive Models ◽

Additive Models ◽

Environmental Data ◽

Atmospheric Pollutants ◽

Photochemical Oxidation ◽

The North

Abstract Changes in climate and air quality due to the COVID-19 lockdown (LCD) have recently been the subject of several studies. This study investigates the contribution of meteorological factors and emission reduction to air pollution change over the north of Morocco using generalized additive models (GAMs), which have proved to be a robust technique for environmental data sets. It focuses on the main atmospheric pollutants in the region, including ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), particulate matter (PM2.5 and PM10), secondary inorganic aerosols (SIA), non-methane volatile organic compounds (NMVOC) and carbon monoxide (CO), taken from the regional air pollution dataset of the Copernicus Atmosphere Monitoring Service (CAMS). Our results indicate that secondary air pollutants (PM2.5, PM10 and O3) are more influenced by meteorological factors and by the other air pollutants reported in this study than primary air pollutants (NO2 and SO2) are. We found that meteorological factors contribute to O3, PM2.5, PM10 and SIA average mass concentration by 22%, 5%, 3% and 34% before LCD and by 28%, 19%, 5% and 42% during LCD, respectively. The increased effect of meteorological factors during LCD shows the contribution of photochemical oxidation to air pollution, due to the increase in atmospheric oxidants (O3 and OH radical) during LCD, which can explain the response of PM to emission reduction. Our study indicates that PM (PM2.5, PM10) is more strongly controlled by SO2, owing to the formation of sulfate particles, especially under high oxidant levels. The positive correlation between PM and the westward wind at 10 m (WW10M) and northward wind at 10 m (NW10M) indicates the involvement of sea salt particles transported from the Mediterranean Sea and Atlantic Ocean. This study shows the contribution of atmospheric oxidation capacity to air pollution change.

Zooplankton abundance trends and patterns in Shelikof Strait, western Gulf of Alaska, USA, 1990–2017

Journal of Plankton Research ◽

10.1093/plankt/fbaa019 ◽

2020 ◽

Vol 42 (3) ◽

pp. 334-354 ◽

Cited By ~ 2

Author(s):

David G Kimmel ◽

Janet T Duffy-Anderson

Keyword(s):

Time Series ◽

Generalized Additive Models ◽

Zooplankton Community ◽

Strong Relationship ◽

Gulf Of Alaska ◽

Additive Models ◽

Environmental Data ◽

Dynamic Factor ◽

Zooplankton Abundance ◽

Underlying Trend

Abstract A multivariate approach was used to analyze spring zooplankton abundance in Shelikof Strait, western Gulf of Alaska. Abundance of individual zooplankton taxa was related to environmental variables using generalized additive models. The most important variables that correlated with zooplankton abundance were water temperature, salinity and ordinal day (day of year when sample was collected). A long-term increase in abundance was found for the calanoid copepod Calanus pacificus, copepodite stage 5 (C5). A dynamic factor analysis (DFA) indicated one underlying trend in the multivariate environmental data that related to phases of the Pacific Decadal Oscillation. DFA of zooplankton time series also indicated one underlying trend where the positive phase was characterized by increases in the abundance of C. marshallae C5, C. pacificus C5, Eucalanus bungii C4, Pseudocalanus spp. C5 and Limacina helicina and declines in the abundance of Neocalanus cristatus C4 and Neocalanus spp. C4. The environmental and zooplankton DFA trends were not correlated over the length of the entire time period; however, the two time series were correlated post-2004. The strong relationship between environmental conditions, zooplankton abundance and time of sampling suggests that continued warming in the region may lead to changes in zooplankton community composition and timing of life history events during spring.
