Ensemble Linear Subspace Analysis of High-Dimensional Data

Entropy ◽  
2021 ◽  
Vol 23 (3) ◽  
pp. 324
Author(s):  
S. Ejaz Ahmed ◽  
Saeid Amiri ◽  
Kjell Doksum

Regression models provide prediction frameworks for multivariate mutual information analysis that uses information concepts when choosing covariates (also called features) that are important for analysis and prediction. We consider a high-dimensional regression framework where the number of covariates (p) exceeds the sample size (n). Recent work in high-dimensional regression analysis has embraced an ensemble subspace approach that consists of selecting random subsets of covariates with fewer than p covariates, doing statistical analysis on each subset, and then merging the results from the subsets. We examine conditions under which penalty methods such as Lasso perform better when used in the ensemble approach by computing mean squared prediction errors for simulations and a real data example. Linear models with both random and fixed designs are considered. We examine two versions of penalty methods: one where the tuning parameter is selected by cross-validation, and one where the final predictor is a trimmed average of individual predictors corresponding to the members of a set of fixed tuning parameters. We find that the ensemble approach improves on penalty methods for several important real data and model scenarios. The improvement occurs when covariates are strongly associated with the response and when the complexity of the model is high. In such cases, the trimmed average version of ensemble Lasso is often the best predictor.
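The subset-fit-merge recipe with a trimmed average can be sketched as follows; the subset count, subset size, and penalty level here are illustrative choices, not the authors' settings:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Simulated high-dimensional linear model with p > n.
n, p, n_subsets, subset_size = 60, 200, 30, 40
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:10] = 2.0                       # strong signals on the first 10 covariates
y = X @ beta + rng.standard_normal(n)
X_new = rng.standard_normal((5, p))   # test points to predict

# Fit a Lasso on each random covariate subset, then combine the per-subset
# predictions with a trimmed average (drop the extremes at each tail).
preds = []
for _ in range(n_subsets):
    cols = rng.choice(p, size=subset_size, replace=False)
    model = Lasso(alpha=0.1).fit(X[:, cols], y)
    preds.append(model.predict(X_new[:, cols]))
preds = np.sort(np.array(preds), axis=0)      # shape (n_subsets, n_test)
trim = n_subsets // 10                         # trim 10% from each tail
ensemble_pred = preds[trim:n_subsets - trim].mean(axis=0)
```

Trimming guards the merged prediction against subsets that happened to miss all of the informative covariates.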

2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Jia-Rou Liu ◽  
Po-Hsiu Kuo ◽  
Hung Hung

Large-p-small-n datasets are commonly encountered in modern biomedical studies. To detect the difference between two groups, conventional methods would fail to apply due to the instability in estimating variances in the t-test and a high proportion of tied values in AUC (area under the receiver operating characteristic curve) estimates. The significance analysis of microarrays (SAM) may also not be satisfactory, since its performance is sensitive to the tuning parameter, whose selection is not straightforward. In this work, we propose a robust rerank approach to overcome the above-mentioned difficulties. In particular, we obtain a rank-based statistic for each feature based on the concept of “rank-over-variable.” Techniques of “random subset” and “rerank” are then iteratively applied to rank features, and the leading features are selected for further studies. The proposed rerank approach is especially applicable for large-p-small-n datasets. Moreover, it is insensitive to the selection of tuning parameters, which is an appealing property for practical implementation. Simulation studies and real data analysis of pooling-based genome-wide association (GWA) studies demonstrate the usefulness of our method.
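A simplified illustration of the random-subset-plus-rerank idea: a rank-sum statistic stands in for the paper's rank-over-variable statistic (the exact definition differs), and features are reranked by how often they dominate a random subset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two groups of samples over p features (p >> n); the first 5 features differ.
n1, n2, p = 8, 8, 100
A = rng.standard_normal((n1, p))
B = rng.standard_normal((n2, p))
B[:, :5] += 3.0

# Rank-sum statistic per feature (a stand-in for the paper's statistic).
def rank_stat(a, b):
    pooled = np.concatenate([a, b])
    ranks = pooled.argsort().argsort() + 1
    expected = len(a) * (len(a) + len(b) + 1) / 2
    return abs(ranks[:len(a)].sum() - expected)

stats = np.array([rank_stat(A[:, j], B[:, j]) for j in range(p)])

# "Random subset" + "rerank": score each feature by how often it tops a
# random subset, then rank all features by that score.
wins = np.zeros(p)
for _ in range(500):
    subset = rng.choice(p, size=10, replace=False)
    wins[subset[stats[subset].argmax()]] += 1
ranking = np.argsort(-wins)   # leading features first
```

Because only comparisons of ranks are used, the procedure is robust to heavy tails and does not require a variance estimate.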


2012 ◽  
Vol 11 ◽  
pp. CIN.S9048 ◽  
Author(s):  
Shuhei Kaneko ◽  
Akihiro Hirakawa ◽  
Chikuma Hamada

Mining gene expression data to identify genes associated with patient survival, so that such genes can be used to achieve more accurate prognoses, is an ongoing problem in microarray-based cancer prognostic studies. The least absolute shrinkage and selection operator (lasso) is often used for gene selection and parameter estimation in high-dimensional microarray data. The lasso shrinks some of the coefficients to zero, with the amount of shrinkage determined by the tuning parameter, often chosen by cross-validation. The model determined by this cross-validation contains many false positives whose coefficients are actually zero. We propose a method for estimating the false positive rate (FPR) for lasso estimates in a high-dimensional Cox model. We performed a simulation study to examine the precision of the FPR estimate obtained by the proposed method. We applied the proposed method to real data and illustrated the identification of false positive genes.
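The false-positive phenomenon the authors target can be seen in a small simulation; note this sketch substitutes a linear model for the Cox model and simply counts false positives against the known truth, rather than implementing the paper's FPR estimator:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)

# Simulate p >> n data where only the first 5 coefficients are nonzero.
n, p, n_true = 50, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:n_true] = 1.5
y = X @ beta + rng.standard_normal(n)

# The cross-validation-chosen penalty typically admits false positives:
# selected features whose true coefficient is zero.
model = LassoCV(cv=5, random_state=0).fit(X, y)
selected = np.flatnonzero(model.coef_)
false_pos = [j for j in selected if j >= n_true]
empirical_fpr = len(false_pos) / max(len(selected), 1)
```

Cross-validation optimizes prediction error, not support recovery, which is why the selected model tends to be over-inclusive.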


2020 ◽  
Vol 7 (1) ◽  
pp. 209-226 ◽  
Author(s):  
Yunan Wu ◽  
Lan Wang

Penalized (or regularized) regression, as represented by lasso and its variants, has become a standard technique for analyzing high-dimensional data when the number of variables substantially exceeds the sample size. The performance of penalized regression relies crucially on the choice of the tuning parameter, which determines the amount of regularization and hence the sparsity level of the fitted model. The optimal choice of tuning parameter depends on both the structure of the design matrix and the unknown random error distribution (variance, tail behavior, etc.). This article reviews the current literature of tuning parameter selection for high-dimensional regression from both the theoretical and practical perspectives. We discuss various strategies that choose the tuning parameter to achieve prediction accuracy or support recovery. We also review several recently proposed methods for tuning-free high-dimensional regression.
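Two of the strategies such a review contrasts can be sketched side by side: a data-driven choice by cross-validation versus a theory-driven penalty at the familiar universal rate λ ≈ σ√(2 log p / n), which assumes the noise level σ is known (in practice it must be estimated, which is what motivates tuning-free methods):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(3)

n, p = 80, 300
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:8] = 1.0
sigma = 1.0
y = X @ beta + sigma * rng.standard_normal(n)

# Strategy 1: data-driven tuning by cross-validation.
lam_cv = LassoCV(cv=5, random_state=0).fit(X, y).alpha_

# Strategy 2: theory-driven penalty at the universal rate, assuming sigma known.
lam_theory = sigma * np.sqrt(2 * np.log(p) / n)
fit = Lasso(alpha=lam_theory).fit(X, y)
```

The two choices generally differ: cross-validation targets prediction accuracy, while the universal rate is motivated by support-recovery theory.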


2020 ◽  
Author(s):  
Philipp Doebler ◽  
Anna Doebler ◽  
Philip Buczak ◽  
Andreas Groll

Regression models with interaction terms are common models for moderating relationships. When several predictors from one group, e.g., genetic variables, are potentially moderated by several predictors from another, e.g., environmental variables, many interaction terms result. This complicates model interpretation, especially when coefficient signs point in different directions. By first forming a score for each group of predictors, the interaction model's dimension is greatly reduced. The hierarchical score model is an elegant one-step approach: score weights and regression model coefficients are estimated simultaneously by an alternating optimization (AO) algorithm. Especially in high-dimensional settings, scores remain an effective technique to reduce interaction model dimension, and we propose regularization to ensure sparsity and interpretability of the score weights. A non-trivial extension of the original AO algorithm is presented, which adds a lasso penalty, resulting in the alternating lasso optimization algorithm (ALOA). The hierarchical score model with ALOA is an interpretable statistical learning technique for moderation in potentially high-dimensional applications, and encompasses generalized linear models for the main interaction model. In addition to the lasso regularization, a screening procedure called regularization and residualization (RR) is proposed to avoid spurious interactions. ALOA tuning parameter choice and the RR screening procedure are investigated by simulations, and an illustrative application to lifetime depression risk and gene × environment interactions is provided.
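The alternating idea can be illustrated on a stripped-down version of the model: holding one block's score weights fixed makes the model linear in the other block's weights, so each half-step reduces to an ordinary lasso fit. This is a heavily simplified sketch (a pure interaction model, a fixed penalty, naive initialization), not the authors' ALOA:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)

# Toy gene (G) and environment (E) predictor blocks; the moderation
# signal lives in the product of one sparse score per block.
n, pg, pe = 100, 30, 10
G = rng.standard_normal((n, pg))
E = rng.standard_normal((n, pe))
wg_true = np.zeros(pg); wg_true[:3] = 1.0
we_true = np.zeros(pe); we_true[:2] = 1.0
y = (G @ wg_true) * (E @ we_true) + 0.1 * rng.standard_normal(n)

# Alternating lasso sketch: with the environment score fixed, the model is
# linear in the gene weights, and vice versa; each half-step is a lasso fit.
we = np.ones(pe)                     # naive initialization
for _ in range(20):
    se = E @ we                      # current environment score
    wg = Lasso(alpha=0.05).fit(G * se[:, None], y).coef_
    sg = G @ wg                      # current gene score
    we = Lasso(alpha=0.05).fit(E * sg[:, None], y).coef_
```

The lasso penalty on both weight vectors is what keeps the recovered scores sparse and hence interpretable.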


2020 ◽  
Vol 32 (6) ◽  
pp. 1168-1221
Author(s):  
Masaaki Takada ◽  
Taiji Suzuki ◽  
Hironori Fujisawa

Sparse regularization such as ℓ1 regularization is a quite powerful and widely used strategy for high-dimensional learning problems. The effectiveness of sparse regularization has been supported practically and theoretically by several studies. However, one of the biggest issues in sparse regularization is that its performance is quite sensitive to correlations between features. Ordinary ℓ1 regularization selects variables correlated with each other under weak regularization, which results in deterioration of not only its estimation error but also its interpretability. In this letter, we propose a new regularization method, independently interpretable lasso (IILasso), for generalized linear models. Our proposed regularizer suppresses selecting correlated variables, so that each active variable affects the response independently in the model. Hence, we can interpret regression coefficients intuitively, and the performance is also improved by avoiding overfitting. We analyze the theoretical property of the IILasso and show that the proposed method is advantageous for its sign recovery and achieves almost minimax optimal convergence rate. Synthetic and real data analyses also indicate the effectiveness of the IILasso.
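A coordinate-descent sketch of this kind of correlation-aware penalty: each coordinate's soft-threshold is inflated when correlated variables are already active, discouraging co-selection. The penalty weights `R` and all constants here are illustrative choices, not the IILasso paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(5)

# Design with two highly correlated features (0 and 1); only 0 is active.
n, p = 200, 10
Z = rng.standard_normal((n, p))
Z[:, 1] = Z[:, 0] + 0.05 * rng.standard_normal(n)   # corr(x0, x1) near 1
X = (Z - Z.mean(0)) / Z.std(0)
y = 2.0 * X[:, 0] + rng.standard_normal(n)

# Penalty (sketch): lam * sum_j (1 + sum_k R[j,k] * |b_k|) * |b_j|,
# where R[j,k] grows with |corr(x_j, x_k)|.
corr = np.abs(np.corrcoef(X, rowvar=False))
R = corr / (1.0 - np.minimum(corr, 0.99))           # one choice of weights
np.fill_diagonal(R, 0.0)

lam, b = 0.1, np.zeros(p)
for _ in range(200):                                # coordinate descent sweeps
    for j in range(p):
        r = y - X @ b + X[:, j] * b[j]              # partial residual
        rho = X[:, j] @ r / n
        thresh = lam * (1.0 + R[j] @ np.abs(b))     # inflated by active correlates
        b[j] = np.sign(rho) * max(abs(rho) - thresh, 0.0)
```

In this toy run, feature 0 enters first and its large coefficient inflates the threshold for its near-duplicate, feature 1, which stays at zero; plain lasso would tend to split weight across both.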

