Principal component of explained variance: an efficient and optimal data dimension reduction framework for association studies


2016 ◽  
Vol 27 (5) ◽  
pp. 1331-1350 ◽  
Author(s):  
Maxime Turgeon ◽  
Karim Oualkacha ◽  
Antonio Ciampi ◽  
Hanane Miftah ◽  
Golsa Dehghan ◽  
...  

The genomics era has led to an increase in the dimensionality of data collected in the investigation of biological questions. In this context, dimension-reduction techniques can be used to summarise high-dimensional signals into low-dimensional ones, to further test for association with one or more covariates of interest. This paper revisits one such approach, previously known as principal component of heritability and renamed here as principal component of explained variance (PCEV). As its name suggests, the PCEV seeks a linear combination of outcomes in an optimal manner, by maximising the proportion of variance explained by one or several covariates of interest. By construction, this method optimises power; however, due to its computational complexity, it has unfortunately received little attention in the past. Here, we propose a general analytical PCEV framework that builds on the assets of the original method, i.e. it is conceptually simple and free of tuning parameters. Moreover, our framework extends the range of applications of the original procedure by providing a computationally simple strategy for high-dimensional outcomes, along with exact and asymptotic testing procedures that drastically reduce its computational cost. We investigate the merits of the PCEV using an extensive set of simulations. Furthermore, the use of the PCEV approach is illustrated using three examples taken from the fields of epigenetics and brain imaging.
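To give a flavour of the optimisation at the heart of PCEV, the sketch below casts it as a generalised eigenvalue problem under a multivariate linear model: the loading vector maximises the ratio of covariate-explained variance to total variance of the linear combination of outcomes. This is a minimal illustration, not the authors' implementation; the `pcev` helper and the toy data are hypothetical.

```python
import numpy as np
from scipy.linalg import eigh, lstsq

def pcev(Y, X):
    """Illustrative PCEV: find the linear combination of the outcomes Y
    whose variance is maximally explained by the covariates X."""
    n = Y.shape[0]
    Yc = Y - Y.mean(axis=0)
    Xc = X - X.mean(axis=0)

    B, *_ = lstsq(Xc, Yc)                 # multivariate least-squares fit
    fitted = Xc @ B
    resid = Yc - fitted

    V_model = fitted.T @ fitted / n       # variance explained by X
    V_total = V_model + resid.T @ resid / n

    # Generalised eigenproblem: maximise w' V_model w / w' V_total w.
    evals, evecs = eigh(V_model, V_total)
    w = evecs[:, -1]                      # leading generalised eigenvector
    return w, Yc @ w, evals[-1]           # loadings, PCEV component, proportion explained

# Toy usage with simulated data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y = np.outer(X[:, 0], [0.5, -0.3, 0.0]) + rng.normal(size=(200, 3))
loadings, component, prop_explained = pcev(Y, X)
```

This direct form requires the total covariance to be invertible (more samples than outcomes); handling high-dimensional outcomes efficiently is precisely the contribution of the framework described above, which the sketch does not reproduce.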


2017 ◽  
Vol 15 (04) ◽  
pp. 1750017 ◽  
Author(s):  
Wentian Li ◽  
Jane E. Cerise ◽  
Yaning Yang ◽  
Henry Han

The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) similar to previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust with respect to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability of t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
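As a rough illustration only (not the authors' pipeline; the genotype matrix below is simulated), running PCA and t-SNE side by side on a samples-by-SNPs matrix with scikit-learn might look like this:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical genotype matrix: n samples x m SNPs coded 0/1/2.
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(300, 1000)).astype(float)

# Standardise SNP columns, as is commonly done before PCA on genotypes.
G = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-8)

pcs = PCA(n_components=20).fit_transform(G)      # linear projection
emb = TSNE(n_components=2, perplexity=30,
           init="pca", random_state=0).fit_transform(pcs)
# pcs[:, :2] gives the PCA view; emb gives the t-SNE view for comparison.
```

Pre-reducing with PCA before t-SNE, as above, is a common practical choice for large genotype matrices rather than a requirement of the method.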


Author(s):  
Ibrahim Balogun ◽  
Nii Attoh-Okine

Abstract. In discussions of track geometry, track safety takes precedence over other requirements because a shortfall often leads to unrecoverable loss. Track geometry is widely used as the index for safety evaluation, whether corrective or predictive, to determine the appropriate maintenance regime based on track conditions. A recent study has shown that track defect probability thresholds are best explored using a hybrid index. Hence, a dimension reduction technique that combines both safety components and geometry quality is needed. Representing track parameters in a lower-dimensional space without first evaluating covariate shift can distort the overall distribution, as the underlying discrepancies may compromise prediction accuracy. In this study, the authors applied a covariate shift framework to track geometry parameters before applying the dimension reduction techniques. While both principal component analysis (PCA) and t-distributed stochastic neighbour embedding (t-SNE) are viable techniques that express the probability distribution of parameters through correlations in their embedded space and their tendency to maximise variance, the distribution shift should be evaluated first. In conclusion, we demonstrate that our framework can detect and evaluate the likelihood of covariate shift in a high-dimensional track geometry defect problem.
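The abstract does not spell out the covariate shift framework itself, so the snippet below is only a generic stand-in: a common proxy for detecting covariate shift is to train a classifier to distinguish a reference sample of parameters from a new sample; a cross-validated AUC near 0.5 suggests similar distributions, while values near 1.0 suggest a pronounced shift. The `covariate_shift_score` helper and the example variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def covariate_shift_score(X_ref, X_new):
    """Crude covariate-shift check: train a classifier to tell the two
    samples apart; AUC near 0.5 suggests similar distributions,
    AUC near 1.0 suggests a pronounced shift."""
    X = np.vstack([X_ref, X_new])
    y = np.r_[np.zeros(len(X_ref)), np.ones(len(X_new))]
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# Example: compare an older batch of track geometry measurements to a newer one.
# score = covariate_shift_score(params_batch_a, params_batch_b)  # both (n_i, d) arrays
```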


2013 ◽  
Vol 303-306 ◽  
pp. 1101-1104 ◽  
Author(s):  
Yong De Hu ◽  
Jing Chang Pan ◽  
Xin Tan

Kernel entropy component analysis (KECA) reveals the structure of the original data through its kernel matrix. This structure is related to the Renyi entropy of the data, and KECA preserves it by keeping the data's Renyi entropy unchanged. This paper describes the original data by a small number of components for the purpose of dimension reduction. KECA is then applied to celestial spectra reduction and compared experimentally with principal component analysis (PCA) and kernel principal component analysis (KPCA). The experimental results show that KECA is an effective method for high-dimensional data reduction.
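A minimal numpy sketch of one common formulation of KECA (not necessarily the exact variant used in this paper): rather than keeping the kernel eigen-directions with the largest eigenvalues, KECA keeps those contributing most to a quadratic Renyi entropy estimate, i.e. the terms lambda_i * (1^T e_i)^2. The `keca` helper and the RBF bandwidth are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def keca(X, n_components=2, sigma=1.0):
    """Illustrative kernel entropy component analysis.

    Selects the kernel eigen-directions contributing most to a quadratic
    Renyi entropy estimate, then projects the data onto them."""
    K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma ** 2))  # RBF kernel matrix
    evals, evecs = eigh(K)                                      # eigenvalues ascending
    entropy_terms = evals * (evecs.sum(axis=0) ** 2)            # Renyi entropy contributions
    idx = np.argsort(entropy_terms)[::-1][:n_components]        # top entropy contributors
    # Projection onto the selected kernel eigen-directions.
    return evecs[:, idx] * np.sqrt(np.clip(evals[idx], 0, None))
```

The contrast with KPCA is simply the selection rule: KPCA would take `idx = np.argsort(evals)[::-1][:n_components]`, i.e. the largest eigenvalues regardless of their entropy contribution.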


Author(s):  
S. Schmitz ◽  
U. Weidner ◽  
H. Hammer ◽  
A. Thiele

Abstract. In this paper, the nonlinear dimension reduction algorithm Uniform Manifold Approximation and Projection (UMAP) is investigated to visualize information contained in high-dimensional feature representations of Polarimetric Interferometric Synthetic Aperture Radar (PolInSAR) data. Based on polarimetric parameters, target decomposition methods and interferometric coherences, a wide range of features is extracted that spans the high-dimensional feature space. UMAP is applied to determine a representation of the data in 2D and 3D Euclidean space, preserving local and global structures of the data while remaining suitable for classification. The performance of UMAP in terms of generating expressive visualizations is evaluated on PolInSAR data acquired by the F-SAR sensor and compared to that of Principal Component Analysis (PCA), Laplacian Eigenmaps (LE) and t-distributed Stochastic Neighbor Embedding (t-SNE). For this purpose, a visual analysis of 2D embeddings is performed. In addition, a quantitative analysis is provided for evaluating the preservation of information in low-dimensional representations with respect to the separability of different land cover classes. The results show that UMAP exceeds the capability of PCA and LE in these regards and is competitive with t-SNE.
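A schematic of such a comparison is sketched below; random features stand in for the PolInSAR feature stack, the third-party umap-learn package is assumed to be installed, and the parameter choices are illustrative rather than those used in the study.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import SpectralEmbedding, TSNE
import umap  # umap-learn package

# Hypothetical stand-in for the PolInSAR feature matrix: n pixels x d features.
rng = np.random.default_rng(2)
F = rng.normal(size=(2000, 40))

emb_pca  = PCA(n_components=2).fit_transform(F)
emb_le   = SpectralEmbedding(n_components=2).fit_transform(F)   # Laplacian eigenmaps
emb_tsne = TSNE(n_components=2, random_state=0).fit_transform(F)
emb_umap = umap.UMAP(n_components=2, n_neighbors=15,
                     min_dist=0.1, random_state=0).fit_transform(F)
# Each emb_* is an (n, 2) embedding that can be plotted and coloured by land cover class.
```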


2017 ◽  
Vol 10 (13) ◽  
pp. 355 ◽  
Author(s):  
Reshma Remesh ◽  
Pattabiraman. V

Dimensionality reduction techniques are used to reduce the complexity of analysing high-dimensional data sets. The raw input data set may have many dimensions, and analysis may be time-consuming and lead to wrong predictions if unnecessary data attributes are considered. Using dimensionality reduction techniques, one can reduce the dimensions of the input data towards accurate prediction at lower cost. In this paper, the different machine learning approaches used for dimensionality reduction, such as PCA, SVD, LDA, kernel principal component analysis and artificial neural networks, have been studied.
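For orientation, most of the surveyed techniques have off-the-shelf implementations in scikit-learn; the calls below on synthetic data are purely illustrative (note that LDA is supervised and therefore needs class labels).

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD, KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 50))
y = rng.integers(0, 3, size=500)        # class labels, needed only for LDA

X_pca  = PCA(n_components=10).fit_transform(X)
X_svd  = TruncatedSVD(n_components=10).fit_transform(X)
X_lda  = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised
X_kpca = KernelPCA(n_components=10, kernel="rbf").fit_transform(X)
```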


Author(s):  
YP Ju

A common strategy to handle simulation-based uncertainty quantification problems is adopting a metamodel to replace time-demanding calculations, such as computational fluid dynamics simulation or finite element analysis, within the Monte Carlo simulation process. However, most metamodel-assisted uncertainty quantification methods proposed so far suffer from the 'curse of dimensionality': the required number of evaluations, which determines the computational cost, increases exponentially with the dimensionality of the input uncertainty, resulting in unaffordable computational cost for high-dimensional problems. Another challenge emerges when the output uncertainty is a spatially varying field accommodating a huge number of spatial nodes. To solve these issues, here we propose a dimension-reduction metamodeling approach, in which the active subspace method is used to reduce the input dimensionality and the proper orthogonal decomposition method is used to reduce the output dimensionality of the spatially varying field. The relationship between the two methods is established using a support vector regression model. Through uncertainty quantification of seven stochastic analytical functions and one stochastic convection-diffusion equation, the proposed approach was verified to be fairly accurate in propagating high-dimensional input uncertainties to either a scalar value or a spatially varying output. The accuracy and efficiency of the proposed approach in dealing with even more practical simulation-based problems were then validated by uncertainty quantification of a compressor cascade with stochastic protrusions/dents distributed on the blade surface. This work provides an effective and versatile approach for simulation-based high-dimensional uncertainty quantification problems.
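The following is a schematic sketch of the pipeline described above (active subspace on the inputs, POD on the spatially varying output, SVR linking the two), with a cheap analytical function standing in for the expensive solver; the dimensions, the finite-difference gradients and all helper names are illustrative, not the paper's implementation.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(4)
d_in, n_train, n_nodes = 20, 80, 200           # input dim, training samples, spatial nodes

def solver(x):
    """Toy stand-in for an expensive simulation returning a spatial field."""
    s = np.linspace(0.0, 1.0, n_nodes)
    return np.sin(2 * np.pi * s * (1.0 + 0.1 * x[:3].sum())) * np.exp(-x[0] ** 2)

X = rng.normal(size=(n_train, d_in))
Y = np.array([solver(x) for x in X])           # snapshot matrix, n_train x n_nodes

# Active subspace on the inputs: eigen-decompose the average outer product of
# gradients of a scalar quantity of interest (crude finite differences here).
eps = 1e-4
qoi = Y.mean(axis=1)
grads = np.zeros((n_train, d_in))
for j in range(d_in):
    Xp = X.copy()
    Xp[:, j] += eps
    grads[:, j] = (np.array([solver(x) for x in Xp]).mean(axis=1) - qoi) / eps
evals, evecs = np.linalg.eigh(grads.T @ grads / n_train)
W1 = evecs[:, ::-1][:, :2]                     # 2-dimensional active subspace
Z = X @ W1                                     # reduced inputs

# POD on the outputs: truncated SVD of the centred snapshot matrix.
Ymean = Y.mean(axis=0)
U, s, Vt = np.linalg.svd(Y - Ymean, full_matrices=False)
modes = Vt[:3]                                 # leading POD modes
coeffs = (Y - Ymean) @ modes.T                 # POD coefficients per training sample

# SVR metamodels map reduced inputs to each POD coefficient.
svrs = [SVR().fit(Z, coeffs[:, k]) for k in range(modes.shape[0])]

def predict_field(x_new):
    """Cheap surrogate prediction of the full spatial field for a new input."""
    z = (x_new @ W1).reshape(1, -1)
    c = np.array([m.predict(z)[0] for m in svrs])
    return Ymean + c @ modes
```

Once the surrogate is built, Monte Carlo propagation only calls `predict_field`, so the number of expensive solver evaluations is limited to the training stage.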


2019 ◽  
Vol 8 (S3) ◽  
pp. 66-71
Author(s):  
T. Sudha ◽  
P. Nagendra Kumar

Data mining is one of the major areas of research, and clustering is one of its main functionalities. High dimensionality is one of the main issues in clustering, and dimensionality reduction can be used as a solution to this problem. The present work makes a comparative study of dimensionality reduction techniques, namely t-distributed stochastic neighbour embedding and probabilistic principal component analysis, in the context of clustering. High-dimensional data have been reduced to low-dimensional data using these two techniques, and cluster analysis has been performed on both the high-dimensional data and the resulting low-dimensional data sets with varying numbers of clusters. Mean squared error, time and space have been considered as parameters for comparison. The results show that the time taken to convert the high-dimensional data into low-dimensional data using probabilistic principal component analysis is higher than the time taken using t-distributed stochastic neighbour embedding, whereas the space required by the data set reduced through probabilistic principal component analysis is less than the storage space required by the data set reduced through t-distributed stochastic neighbour embedding.
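A minimal version of such a comparison is sketched below; scikit-learn's PCA is used here only as a stand-in for probabilistic principal component analysis, and the data, cluster count and `reduce_and_cluster` helper are illustrative. Time, the memory footprint of the reduced representation, and a mean squared error derived from k-means inertia serve as the comparison parameters.

```python
import time
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 100))               # stand-in for the high-dimensional data

def reduce_and_cluster(reducer, X, k=5):
    """Reduce X with the given reducer, cluster the result, and report
    elapsed time, storage of the reduced data, and mean squared error."""
    t0 = time.perf_counter()
    Z = reducer.fit_transform(X)
    elapsed = time.perf_counter() - t0
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    mse = km.inertia_ / len(Z)                 # mean squared distance to centroids
    return elapsed, Z.nbytes, mse

print(reduce_and_cluster(PCA(n_components=2), X))                  # PCA as a stand-in for PPCA
print(reduce_and_cluster(TSNE(n_components=2, random_state=0), X))
```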

