Linear Models I: Regression; PCA of Predictor Variables

2019 ◽  
Vol 11 (3) ◽  
pp. 222 ◽  
Author(s):  
John Hogland ◽  
David L.R. Affleck

Remotely sensed data are commonly used as predictor variables in spatially explicit models depicting landscape characteristics of interest (response) across broad extents, at relatively fine resolution. To create these models, variables are spatially registered to a known coordinate system and used to link responses with predictor variable values. Inherently, this linking process introduces measurement error into the response and predictors, which in the latter case causes attenuation bias. Through simulations, our findings indicate that the spatial correlation of response and predictor variables and their corresponding spatial registration (co-registration) errors can have a substantial impact on the bias and accuracy of linear models. Additionally, in this study we evaluate spatial aggregation as a mechanism to minimize the impact of co-registration errors, assess the impact of subsampling within the extent of sample units, and provide a technique that can be used to both determine the extent of an observational unit needed to minimize the impact of co-registration and quantify the amount of error potentially introduced into predictive models.
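The attenuation effect mentioned in this abstract can be illustrated with a minimal sketch. It assumes classical, independent Gaussian measurement error added to a single predictor, which is a simplification and not the authors' spatial co-registration simulation. The function names, sample size, and variances are illustrative assumptions.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y regressed on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def simulate_attenuation(n=20000, true_slope=2.0, error_sd=1.0, seed=0):
    """Fit the same response on an error-free and an error-laden predictor."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [true_slope * x + rng.gauss(0.0, 0.5) for x in x_true]
    # Hypothetical stand-in for registration error: additive noise on the predictor.
    x_observed = [x + rng.gauss(0.0, error_sd) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_observed, y)

slope_clean, slope_noisy = simulate_attenuation()
# Under classical error the expected slope shrinks by var(x) / (var(x) + var(error)),
# which is 1 / (1 + 1) = 0.5 for the defaults above.
print(slope_clean, slope_noisy)
```

With the default settings the slope fitted on the noisy predictor should come out near half the true slope, while the slope fitted on the error-free predictor stays close to the true value.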


2018 ◽  
Vol 6 (1) ◽  
Author(s):  
Dominik Janzing ◽  
Bernhard Schölkopf

We study a model where one target variable $Y$ is correlated with a vector $\textbf{X}:=(X_1,\dots,X_d)$ of predictor variables being potential causes of $Y$. We describe a method that infers to what extent the statistical dependences between $\textbf{X}$ and $Y$ are due to the influence of $\textbf{X}$ on $Y$ and to what extent due to a hidden common cause (confounder) of $\textbf{X}$ and $Y$. The method relies on concentration of measure results for large dimensions $d$ and an independence assumption stating that, in the absence of confounding, the vector of regression coefficients describing the influence of each $\textbf{X}$ on $Y$ typically has ‘generic orientation’ relative to the eigenspaces of the covariance matrix of $\textbf{X}$. For the special case of a scalar confounder we show that confounding typically spoils this generic orientation in a characteristic way that can be used to quantitatively estimate the amount of confounding (subject to our idealized model assumptions).


2020 ◽  
Author(s):  
Connor McCabe ◽  
Max Andrew Halvorson ◽  
Kevin Michael King ◽  
Xiaolin Cao ◽  
Dale Sim Kim

Psychology research frequently involves the study of probabilities and counts. These are typically analyzed using generalized linear models (GLMs), which can produce these quantities via nonlinear transformation of model parameters. Interactions are central within many research applications of these models. To date, typical practice in evaluating interactions for probabilities or counts extends directly from linear approaches, in which evidence of an interaction effect is supported by using the product term coefficient between variables of interest. However, unlike linear models, interaction effects in GLMs describing probabilities and counts are not equal to product terms between predictor variables. Instead, interactions may be functions of the predictors of a model, requiring non-traditional approaches for interpreting these effects accurately. Here, we define interactions as change in a marginal effect of one variable as a function of change in another variable, and describe the use of partial derivatives and discrete differences for quantifying these effects. Using guidelines and simulated examples, we then use these approaches to describe how interaction effects should be estimated and interpreted for GLMs on probability and count scales. We conclude with an example using the Adolescent Brain Cognitive Development Study demonstrating how to correctly evaluate interaction effects in a logistic model.
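The point that a product-term coefficient does not equal the interaction on the probability scale can be sketched as follows. This is an illustrative example, not the authors' code: it assumes a two-predictor logistic model with hypothetical coefficients `b0`, `b1`, `b2`, `b12`, and computes the interaction both as a cross-partial derivative (continuous predictors) and as a second difference (binary predictors).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob(x1, x2, b0, b1, b2, b12):
    """Predicted probability from a logistic model with a product term."""
    return sigmoid(b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2)

def cross_partial(x1, x2, b0, b1, b2, b12):
    """Change in the marginal effect of x1 per unit change in x2 (d2p / dx1 dx2)."""
    p = prob(x1, x2, b0, b1, b2, b12)
    slope = p * (1.0 - p)
    return b12 * slope + (b1 + b12 * x2) * (b2 + b12 * x1) * slope * (1.0 - 2.0 * p)

def second_difference(b0, b1, b2, b12):
    """Interaction for two binary predictors as a difference of differences."""
    effect_when_x1_is_1 = prob(1, 1, b0, b1, b2, b12) - prob(1, 0, b0, b1, b2, b12)
    effect_when_x1_is_0 = prob(0, 1, b0, b1, b2, b12) - prob(0, 0, b0, b1, b2, b12)
    return effect_when_x1_is_1 - effect_when_x1_is_0

# Hypothetical coefficients with NO product term (b12 = 0).
interaction_at_origin = cross_partial(0.0, 0.0, -2.0, 1.0, 1.0, 0.0)
binary_interaction = second_difference(-2.0, 1.0, 1.0, 0.0)
print(interaction_at_origin, binary_interaction)
```

Even with the product-term coefficient set to zero, both quantities are nonzero here, because the nonlinear link makes the effect of one predictor on the probability depend on the level of the other.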


2016 ◽  
Vol 11 (2) ◽  
Author(s):  
Renke Lühken ◽  
Jörn Martin Gethmann ◽  
Petra Kranz ◽  
Pia Steffenhagen ◽  
Christoph Staubach ◽  
...  

This study analysed Culicoides presence-absence data from 46 sampling sites in Germany, where monitoring was carried out from April 2007 until May 2008. Culicoides presence-absence data were analysed in relation to land cover data, in order to study whether the prevalence of biting midges is correlated to land cover data with respect to the trapping sites. We differentiated eight scales, i.e. buffer zones with radii of 0.5, 1, 2, 3, 4, 5, 7.5 and 10 km, around each site, and chose several land cover variables. For each species, we built eight single-scale models (i.e. predictor variables from one of the eight scales for each model) based on averaged, generalised linear models and two multi-scale models (i.e. predictor variables from all of the eight scales) based on averaged, generalised linear models and generalised linear models with random forest variable selection. There were no significant differences between performance indicators of models built with land cover data from different buffer zones around the trapping sites. However, the overall performance of multi-scale models was higher than the alternatives. Furthermore, these models mostly achieved the best performance for the different species using the index area under the receiver operating characteristic curve. However, as also presented in this study, the relevance of the different variables could significantly differ between various scales, including the number of species affected and the positive or negative direction. This is an even more severe problem if multi-scale models are concerned, in which one model can have the same variable at different scales but with different directions, i.e. negative and positive direction of the same variable at different scales. However, multi-scale modelling is a promising approach to model the distribution of Culicoides species, accounting much more for the ecology of biting midges, which use different resources (breeding sites, hosts, etc.) at different scales.


HortScience ◽  
2020 ◽  
Vol 55 (7) ◽  
pp. 1111-1118
Author(s):  
Yun Kong ◽  
Xiangyue Kong ◽  
Youbin Zheng

Nondestructive estimation of individual shoot fresh weight (FW) from its measurable morphological traits is useful for a wide variety of purposes in pea shoot production. To predict individual shoot FW, nine regression models in total were developed, including two power models using stem diameter (SMD) or stem length (SML) as a variable, and seven linear models using some or all of the following variables: SMD, SML, leaflet length (LL), leaflet width (LW), stipule length (SEL), and stipule width (SEW). Among the nine models, the 6-variable linear equation had the highest coefficient of determination, R² = 0.92, indicating it is most effective at explaining the variation in FW. The linear equations including only one variable, SMD or SML, were, like the nonlinear equations (i.e., power models), the least effective. This finding suggests that there was a linear rather than nonlinear relationship between FW and the morphological variables. During stepwise regression, SEW and LW together were first removed from the 6-variable linear model without reducing the R², and then SEL, SMD, and SML were further removed one-by-one, which reduced the R² from 0.92 to 0.90, 0.85, and 0.71, respectively. The result suggests that SMD, SML, SEL, and LL were the four most important predictor variables for multivariable linear regression models to estimate FW, an idea that was also supported by path analysis. For the four linear models with 1–4 predictor variables from stepwise regression, the prediction accuracy of FW was evaluated based on the agreement between the predicted and measured values using another independent dataset. The 4- and 3-variable linear models (i.e., FW = −1.437 + 0.276 SMD + 0.010 SML + 0.022 LL + 0.013 SEL and FW = −1.383 + 0.308 SMD + 0.011 SML + 0.030 LL, respectively) were selected for their more accurate prediction than 1- and 2-variable linear models and relatively simpler forms than a 6-variable linear model. Although the prediction accuracy can be potentially affected by air temperature, light conditions, and harvesting time, the multilinear regression model is an effective approach for estimating fresh weight of individual pea shoots using their measurable morphological traits.
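As a small worked example, the two selected equations quoted in this abstract can be evaluated directly. The coefficients below are copied from the abstract. The measurement values are hypothetical, and the abstract does not state units, so the inputs are assumed to be in whatever units the original study used.

```python
def fresh_weight_4var(smd, sml, ll, sel):
    """4-variable linear model from the abstract (SMD, SML, LL, SEL)."""
    return -1.437 + 0.276 * smd + 0.010 * sml + 0.022 * ll + 0.013 * sel

def fresh_weight_3var(smd, sml, ll):
    """3-variable linear model from the abstract (SMD, SML, LL)."""
    return -1.383 + 0.308 * smd + 0.011 * sml + 0.030 * ll

# Hypothetical measurements for one shoot (illustrative values only).
shoot = {"smd": 4.0, "sml": 100.0, "ll": 30.0, "sel": 40.0}

fw4 = fresh_weight_4var(shoot["smd"], shoot["sml"], shoot["ll"], shoot["sel"])
fw3 = fresh_weight_3var(shoot["smd"], shoot["sml"], shoot["ll"])
print(round(fw4, 3), round(fw3, 3))
```

For these illustrative inputs the two equations give nearly the same estimate, which is consistent with the small drop in R² reported when SEL is removed.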


Author(s):  
Silvie Kafková ◽  
Lenka Křivánková

Actuaries in insurance companies try to find the best model for estimating insurance premiums. The premium depends on many risk factors, e.g. the car characteristics and the profile of the driver. In this paper, an analysis of a portfolio of vehicle insurance data using a generalized linear model (GLM) is performed. The main advantage of the approach presented in this article is that GLMs are not limited by inflexible preconditions. Our aim is to model the dependence of annual claim frequency on given risk factors. Based on a large real-world sample of data from 57 410 vehicles, the present study proposes a classification analysis approach that addresses the selection of predictor variables. The models with different predictor variables are compared by analysis of deviance and the Akaike information criterion (AIC). Based on this comparison, the model giving the best estimate of annual claim frequency is chosen. All statistical calculations are performed in the R environment, whose stats package provides functions for estimating GLM parameters and for analysis of deviance.
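The model comparison described in this abstract can be sketched on toy data. The example below is not the authors' analysis and does not use R: it assumes a Poisson claim-count model with a log link and an exposure offset, and a single categorical risk factor, for which the maximum likelihood rate in each group is simply claims divided by exposure. The records and group labels are invented for illustration.

```python
import math

def poisson_loglik(counts, means):
    """Poisson log-likelihood of observed counts given fitted means."""
    return sum(y * math.log(mu) - mu - math.lgamma(y + 1)
               for y, mu in zip(counts, means))

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Hypothetical policies: (driver group, exposure in years, number of claims).
records = [
    ("young", 1.0, 2), ("young", 0.5, 1), ("young", 1.0, 1),
    ("older", 1.0, 0), ("older", 1.0, 1), ("older", 0.5, 0),
    ("older", 1.0, 0), ("older", 1.0, 0),
]
counts = [claims for _, _, claims in records]

# Null model: one claim rate for the whole portfolio (1 parameter).
overall_rate = sum(counts) / sum(exposure for _, exposure, _ in records)
means_null = [overall_rate * exposure for _, exposure, _ in records]

# Group model: a separate claim rate per driver group (2 parameters).
group_rate = {}
for group in {g for g, _, _ in records}:
    claims = sum(c for g, _, c in records if g == group)
    exposure = sum(e for g, e, _ in records if g == group)
    group_rate[group] = claims / exposure
means_group = [group_rate[g] * exposure for g, exposure, _ in records]

loglik_null = poisson_loglik(counts, means_null)
loglik_group = poisson_loglik(counts, means_group)

# Analysis of deviance: drop in deviance when the risk factor is added.
deviance_drop = 2 * (loglik_group - loglik_null)
aic_null = aic(loglik_null, 1)
aic_group = aic(loglik_group, 2)
print(deviance_drop, aic_null, aic_group)
```

In this toy portfolio the model with the risk factor has the lower AIC, so it would be preferred. With real data the deviance drop would normally also be compared against a chi-square distribution with degrees of freedom equal to the number of added parameters.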


2001 ◽  
Vol 58 (7) ◽  
pp. 1265-1285 ◽  
Author(s):  
Edward J Gregr ◽  
Andrew W Trites

Whaling records from British Columbia coastal whaling stations reliably report the positions of 9592 whales killed between 1948 and 1967. We used this positional information and oceanographic data (bathymetry, temperature, and salinity) to predict critical habitat off the coast of British Columbia for sperm (Physeter macrocephalus), sei (Balaenoptera borealis), fin (Balaenoptera physalus), humpback (Megaptera novaeangliae), and blue (Balaenoptera musculus) whales. We used generalized linear models at annual and monthly time scales to relate whale occurrence to six predictor variables (month, depth, slope, depth class, and sea surface temperature and salinity). The models showed critical habitat for sei, fin, and male sperm whales along the continental slope and over a large area off the northwest coast of Vancouver Island. Habitat models for blue, humpback, and female sperm whales were relatively insensitive to the predictor variables, owing partially to the smaller sample sizes for these groups. The habitat predictions lend support to recent hypotheses about sperm whale breeding off British Columbia and identify humpback whale habitat in sheltered bays and straits throughout the coast. The habitat models also provide insights about the nature of the linkages between the environment and the distribution of whales in the North Pacific Ocean.

