eNetXplorer: an R package for the quantitative exploration of elastic net families for generalized linear models

bioRxiv ◽

10.1101/305870 ◽

2018 ◽

Cited By ~ 1

Author(s):

Julián Candia ◽

John S. Tsang

Keyword(s):

Generalized Linear Models ◽

Linear Models ◽

Statistical Significance ◽

Model Fitting ◽

Predictive Performance ◽

R Package ◽

Research Area ◽

Elastic Net ◽

Feature Identification ◽

Practical Applications

Background: Regularized generalized linear models (GLMs) are popular regression methods in bioinformatics, particularly useful in scenarios with fewer observations than parameters/features or when many of the features are correlated. In both ridge and lasso regularization, feature shrinkage is controlled by a penalty parameter λ. The elastic net introduces a mixing parameter α to tune the shrinkage continuously from ridge to lasso. Selecting α objectively and determining which features contributed significantly to prediction after model fitting remain a practical challenge given the paucity of available software to evaluate performance and statistical significance.

Results: eNetXplorer builds on top of glmnet to address the above issues for linear (Gaussian), binomial (logistic), and multinomial GLMs. It provides new functionalities to empower practical applications by using a cross-validation framework that assesses the predictive performance and statistical significance of a family of elastic net models (as α is varied) and of the corresponding features that contribute to prediction. The user can select which quality metrics to use to quantify the concordance between predicted and observed values, with defaults provided for each GLM. Statistical significance for each model (as defined by α) is determined based on comparison to a set of null models generated by random permutations of the response; the same permutation-based approach is used to evaluate the significance of individual features. In the analysis of large and complex biological datasets, such as transcriptomic and proteomic data, eNetXplorer provides summary statistics, output tables, and visualizations to help assess which subset(s) of features have predictive value for a set of response measurements, and to what extent those subset(s) of features can be expanded or reduced via regularization.

Conclusions: This package presents a framework and software for exploratory data analysis and visualization. By making regularized GLMs more accessible and interpretable, eNetXplorer guides the process to generate hypotheses based on features significantly associated with biological phenotypes of interest, e.g. to identify biomarkers for therapeutic responsiveness. eNetXplorer is also generally applicable to any research area that may benefit from predictive modeling and feature identification using regularized GLMs.

Availability and implementation: The package is available under the GPL-3 license at the CRAN repository, https://CRAN.R-project.org/package=eNetXplorer
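
The two ideas in this abstract, a penalty mixed between ridge and lasso by α and a permutation-based significance test, can be illustrated with a minimal Python sketch. This is not eNetXplorer's API (the package is written in R) and not its implementation: the function names are hypothetical, the penalty follows the glmnet convention, and the permutation test is a simplification that shuffles the observed response against fixed predictions, whereas the package compares against null models built from permuted responses. Pearson correlation is assumed here as the quality metric.

```python
import random


def elastic_net_penalty(beta, lam, alpha):
    """glmnet-style penalty: lam * ((1 - alpha) / 2 * sum(b^2) + alpha * sum(|b|)).

    alpha = 0 gives the ridge penalty, alpha = 1 gives the lasso penalty.
    """
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * ((1.0 - alpha) / 2.0 * l2 + alpha * l1)


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(predicted, observed, n_perm=1000, seed=0):
    """Simplified permutation test (assumption: no model refitting).

    Counts how often a randomly shuffled response correlates with the
    predictions at least as well as the true response does.
    """
    rng = random.Random(seed)
    observed_score = pearson(predicted, observed)
    shuffled = list(observed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(predicted, shuffled) >= observed_score:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

Evaluating `elastic_net_penalty` over a grid of α values for a fixed coefficient vector shows how the penalty moves continuously between the ridge and lasso extremes.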

eNetXplorer: an R package for the quantitative exploration of elastic net families for generalized linear models

BMC Bioinformatics ◽

10.1186/s12859-019-2778-5 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 5

Author(s):

Julián Candia ◽

John S Tsang

Keyword(s):

Generalized Linear Models ◽

Linear Models ◽

R Package ◽

Elastic Net

tscount: An R Package for Analysis of Count Time Series Following Generalized Linear Models

Journal of Statistical Software ◽

10.18637/jss.v082.i05 ◽

2017 ◽

Vol 82 (5) ◽

Cited By ~ 28

Author(s):

Tobias Liboschik ◽

Konstantinos Fokianos ◽

Roland Fried

Keyword(s):

Time Series ◽

Generalized Linear Models ◽

Linear Models ◽

R Package ◽

Count Time Series

Semi-automated simultaneous predictor selection for regression-SARIMA models

Statistics and Computing ◽

10.1007/s11222-020-09970-6 ◽

2020 ◽

Vol 30 (6) ◽

pp. 1759-1778

Author(s):

Aaron P. Lowther ◽

Paul Fearnhead ◽

Matthew A. Nunes ◽

Kjeld Jensen

Keyword(s):

Linear Models ◽

Model Fitting ◽

Selection Procedure ◽

Predictive Performance ◽

Mixed Integer ◽

Linear Regression Models ◽

Regression Residuals ◽

Wide Range ◽

Telecommunications Network ◽

Integral Role

Deciding which predictors to use plays an integral role in deriving statistical models in a wide range of applications. Motivated by the challenges of predicting events across a telecommunications network, we propose a semi-automated, joint model-fitting and predictor selection procedure for linear regression models. Our approach can model and account for serial correlation in the regression residuals, produces sparse and interpretable models and can be used to jointly select models for a group of related responses. This is achieved through fitting linear models under constraints on the number of nonzero coefficients using a generalisation of a recently developed mixed integer quadratic optimisation approach. The resultant models from our approach achieve better predictive performance on the motivating telecommunications data than methods currently used by industry.

Variable selection for high-dimensional generalized linear models with the weighted elastic-net procedure

Journal of Applied Statistics ◽

10.1080/02664763.2015.1078300 ◽

2015 ◽

Vol 43 (5) ◽

pp. 796-809 ◽

Cited By ~ 17

Author(s):

Xiuli Wang ◽

Mingqiu Wang

Keyword(s):

Variable Selection ◽

Generalized Linear Models ◽

Linear Models ◽

Elastic Net ◽

High Dimensional ◽

Selection For

Individual Loss Reserving Using a Gradient Boosting-Based Approach

Risks ◽

10.3390/risks7030079 ◽

2019 ◽

Vol 7 (3) ◽

pp. 79 ◽

Cited By ~ 2

Author(s):

Pigeon ◽

Duval

Keyword(s):

Generalized Linear Models ◽

Linear Models ◽

Gradient Boosting ◽

Insurance Company ◽

Practical Applications ◽

Individual Level ◽

Loss Reserving ◽

Boosting Algorithm ◽

Property And Casualty Insurance

In this paper, we propose models for non-life loss reserving combining traditional approaches such as Mack’s or generalized linear models and gradient boosting algorithm in an individual framework. These claim-level models use information about each of the payments made for each of the claims in the portfolio, as well as characteristics of the insured. We provide an example based on a detailed dataset from a property and casualty insurance company. We contrast some traditional aggregate techniques, at the portfolio-level, with our individual-level approach and we discuss some points related to practical applications.

modglm: An R package for testing, interpreting, and displaying interactions in generalized linear models of discrete data

10.31234/osf.io/6vmsa ◽

2021 ◽

Author(s):

Connor McCabe ◽

Max Andrew Halvorson ◽

Kevin Michael King ◽

Xiaolin Cao ◽

Dale Sim Kim

Keyword(s):

Open Source ◽

Generalized Linear Models ◽

Open Source Software ◽

Interaction Effect ◽

Software Package ◽

Linear Models ◽

R Package ◽

Interaction Effects ◽

Research Practice ◽

Nonlinear Scale

Many researchers hope to examine interaction effects using generalized linear models (GLMs) to predict outcomes on nonlinear scales. For instance, logistic and Poisson GLMs are used to estimate associations between predictors and outcomes in nonlinear probability and count scales, respectively. However, we (McCabe et al., 2021; Halvorson et al., in press) and others (Ai & Norton, 2003; Mize, 2019; Norton, Wang, & Ai, 2004) have shown that testing and interpreting interaction effects on these scales is not straightforward. GLMs require the application of partial derivatives and/or discrete differences to compute and probe interaction effects appropriately when models are interpreted on their nonlinear scale. Currently available open-source software does not provide methods of computing these interaction effects on probability and count scales, reflecting a central limitation in applying these methods in research practice. Here, we introduce `modglm`, an R-based software package that accompanies our manuscript providing recommendations for computing interaction effects in nonlinear probability and counts (McCabe et al., 2021). This software produces the interaction effect between two variables in generalized linear models of probabilities and counts and provides additional statistics and plotting utilities for evaluating and describing this effect.
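
To make the point about partial derivatives and discrete differences concrete, here is a minimal Python sketch for a logistic model with two predictors and a product term. It does not use or reproduce the `modglm` API; the function names and the coefficient names (`b0`, `b1`, `b2`, `b12`) are hypothetical. It computes the interaction on the probability scale as a cross-partial derivative (continuous predictors) and as a double difference (binary predictors).

```python
import math


def logistic(z):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def prob(x1, x2, b0, b1, b2, b12):
    """Predicted probability from a logistic model with a product term."""
    return logistic(b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2)


def interaction_continuous(x1, x2, b0, b1, b2, b12):
    """Cross-partial derivative d^2 p / (dx1 dx2) at the point (x1, x2)."""
    p = prob(x1, x2, b0, b1, b2, b12)
    slope = p * (1.0 - p)                 # first derivative of the logistic
    curvature = slope * (1.0 - 2.0 * p)   # second derivative of the logistic
    return b12 * slope + (b1 + b12 * x2) * (b2 + b12 * x1) * curvature


def interaction_discrete(b0, b1, b2, b12):
    """Double difference in probability for two binary (0/1) predictors."""
    p11 = prob(1, 1, b0, b1, b2, b12)
    p10 = prob(1, 0, b0, b1, b2, b12)
    p01 = prob(0, 1, b0, b1, b2, b12)
    p00 = prob(0, 0, b0, b1, b2, b12)
    return (p11 - p10) - (p01 - p00)
```

Calling `interaction_discrete(-2.0, 1.0, 1.0, 0.0)` returns a nonzero value even though the product-term coefficient is zero, which is why the coefficient alone does not describe the interaction on the probability scale.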

c060: Extended Inference with Lasso and Elastic-Net Regularized Cox and Generalized Linear Models

Journal of Statistical Software ◽

10.18637/jss.v062.i05 ◽

2014 ◽

Vol 62 (5) ◽

Cited By ~ 16

Author(s):

Martin Sill ◽

Thomas Hielscher ◽

Natalia Becker ◽

Manuela Zucknick

Keyword(s):

Generalized Linear Models ◽

Linear Models ◽

Elastic Net

Isolation Prevalence of Pulmonary Nontuberculous Mycobacteria in Ontario in 2007

Canadian Respiratory Journal ◽

10.1155/2011/865831 ◽

2011 ◽

Vol 18 (1) ◽

pp. 19-24 ◽

Cited By ~ 39

Author(s):

Mohammed Al Houqani ◽

Frances Jamieson ◽

Pamela Chedore ◽

Mauli Mehta ◽

Kevin May ◽

...

Keyword(s):

Generalized Linear Models ◽

Linear Models ◽

Negative Binomial ◽

Nontuberculous Mycobacteria ◽

Statistical Significance ◽

Population Changes ◽

Annual Increase ◽

Rapidly Growing Mycobacteria ◽

Mycobacterium Xenopi ◽

Published Research

BACKGROUND: The reported prevalence of pulmonary nontuberculous mycobacteria (NTM) infections is increasing.

OBJECTIVE: To determine the ‘isolation prevalence’ of NTM in 2007 and compare it with previously published research that examined the increasing rates of isolation of NTM from clinical pulmonary specimens between 1997 and 2003.

METHODS: Isolation prevalence was investigated retrospectively by reviewing a cohort of all positive pulmonary NTM culture results from the Tuberculosis and Mycobacteriology Laboratory, Public Health Laboratory (Toronto, Ontario) in 2007, which identifies at least 95% of NTM isolates in Ontario. Isolation prevalence was calculated as the number of persons with a pulmonary isolate in a calendar year divided by the contemporary population and expressed per 100,000 population. Changes in isolation prevalence from previous years were assessed for statistical significance using generalized linear models with a negative binomial distribution.

RESULTS: In 2007, 4160 pulmonary isolates of NTM were collected from 2463 patients. The isolation prevalence of all species (excluding Mycobacterium gordonae) was 19 per 100,000 population in 2007 – an increase from previous observations reported for Ontario – corresponding to an average annual increase of 8.5% from 1997 to 2007 (P<0.0001). Average annual increases in isolation prevalence of Mycobacterium avium complex (8.8%, P<0.0001) and Mycobacterium xenopi (7.3%, P=0.0005) were largely responsible for the overall increase, while prevalence rates of rapidly growing mycobacteria remained relatively stable.

CONCLUSION: The isolation prevalence of pulmonary NTM continues to increase significantly in Ontario, supporting the belief that pulmonary NTM disease is increasingly common.
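
The isolation prevalence defined in the methods is a simple rate, which the following Python sketch spells out. The function name is hypothetical and the figures in the usage line are illustrative round numbers, not data from the study.

```python
def isolation_prevalence(persons_with_isolate, population, per=100_000):
    """Persons with a pulmonary isolate in a calendar year, per `per` population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return persons_with_isolate * per / population


# Illustrative values only (not taken from the study).
example_rate = isolation_prevalence(1_900, 10_000_000)
```

The trend analysis reported in the abstract uses negative binomial generalized linear models, which this sketch does not attempt to reproduce.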

CPMCGLM: an R package for p-value adjustment when looking for an optimal transformation of a single explanatory variable in generalized linear models

BMC Medical Research Methodology ◽

10.1186/s12874-019-0711-2 ◽

2019 ◽

Vol 19 (1) ◽

Cited By ~ 2

Author(s):

Benoit Liquet ◽

Jérémie Riou

Keyword(s):

Generalized Linear Models ◽

Linear Models ◽

Explanatory Variable ◽

R Package ◽

P Value

glmGamPoi: Fitting Gamma-Poisson Generalized Linear Models on Single Cell Count Data

Bioinformatics ◽

10.1093/bioinformatics/btaa1009 ◽

2020 ◽

Author(s):

Constantin Ahlmann-Eltze ◽

Wolfgang Huber

Keyword(s):

Single Cell ◽

Poisson Distribution ◽

Generalized Linear Models ◽

Count Data ◽

Linear Models ◽

Differential Expression Analysis ◽

Source Code ◽

Principal Component ◽

R Package ◽

Single Cell Rna Sequencing

Motivation: The Gamma-Poisson distribution is a theoretically and empirically motivated model for the sampling variability of single cell RNA-sequencing counts (Grün et al., 2014; Svensson, 2020; Silverman et al., 2018; Hafemeister and Satija, 2019) and an essential building block for analysis approaches including differential expression analysis (Robinson et al., 2010; McCarthy et al., 2012; Anders and Huber, 2010; Love et al., 2014), principal component analysis (Townes et al., 2019) and factor analysis (Risso et al., 2018). Existing implementations for inferring its parameters from data often struggle with the size of single cell datasets, which can comprise millions of cells; at the same time, they do not take full advantage of the fact that zero and other small numbers are frequent in the data. These limitations have hampered uptake of the model, leaving room for statistically inferior approaches such as logarithm(-like) transformation.

Results: We present a new R package for fitting the Gamma-Poisson distribution to data with the characteristics of modern single cell datasets more quickly and more accurately than existing methods. The software can work with data on disk without having to load them into RAM simultaneously.

Availability: The package glmGamPoi is available from Bioconductor for Windows, macOS, and Linux, and source code is available on github.com/const-ae/glmGamPoi under a GPL-3 license.
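
For readers unfamiliar with the distribution itself, the following Python sketch evaluates the Gamma-Poisson (negative binomial) log-probability of a count. It is not the glmGamPoi API or its fitting algorithm; the function name is hypothetical, and it assumes one common parameterization in which the mean is `mu` and the variance is `mu + overdispersion * mu**2`.

```python
import math


def gamma_poisson_logpmf(y, mu, overdispersion):
    """Log-probability of count y with mean mu and variance mu + overdispersion * mu^2.

    Assumes mu > 0. With overdispersion <= 0 the Poisson limit is used.
    """
    if overdispersion <= 0:
        return y * math.log(mu) - mu - math.lgamma(y + 1)
    r = 1.0 / overdispersion  # "size" parameter of the negative binomial
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
```

Summing `math.exp(gamma_poisson_logpmf(y, mu, overdispersion))` over a wide range of counts is a quick sanity check that the probabilities add up to one and that the mean and variance match the stated parameterization.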
