A Robust Supervised Variable Selection for Noisy High-Dimensional Data

2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Jan Kalina ◽  
Anna Schlenker

The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for the variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, the usual relevance and redundancy criteria have the disadvantage of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. In particular, redundancy is measured by a new regularized version of the coefficient of multiple correlation, and relevance is measured by a highly robust correlation coefficient based on least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise and outliers on the data, we also perform the computations for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers.
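The greedy selection scheme underlying MRMR-style methods can be sketched as follows. This is a minimal plain-correlation version for illustration only: the paper's MRRMRR replaces the relevance and redundancy measures with robust and regularized criteria, which are not reproduced here.

```python
import numpy as np

def mrmr_select(X, y, k):
    """Greedy Minimum Redundancy Maximum Relevance selection (sketch).

    Relevance  : |Pearson correlation| between each variable and the label.
    Redundancy : mean |correlation| with the already-selected variables.
    Score      : relevance minus redundancy (the "difference" criterion).
    """
    n, p = X.shape
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# toy example: variable 0 predicts y, variable 1 duplicates it, variable 2 is noise
rng = np.random.default_rng(0)
x0 = rng.normal(size=200)
X = np.column_stack([x0, x0 + 0.01 * rng.normal(size=200), rng.normal(size=200)])
y = (x0 > 0).astype(float)
picked = mrmr_select(X, y, 2)
```

Because the duplicate of the first pick is heavily penalized for redundancy, the second pick is the uncorrelated noise variable rather than the near-copy.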

2020 ◽  
Vol 49 (3) ◽  
pp. 421-437
Author(s):  
Genggeng Liu ◽  
Lin Xie ◽  
Chi-Hua Chen

Dimensionality reduction plays an important role in data processing for machine learning and data mining, making the processing of high-dimensional data more efficient. Dimensionality reduction extracts a low-dimensional feature representation of high-dimensional data; an effective dimensionality reduction method not only preserves most of the useful information in the original data but also removes useless noise. Dimensionality reduction methods can be applied to all types of data, especially image data. Although supervised learning methods have achieved good results in dimensionality reduction, their performance depends on the number of labeled training samples. With the growth of information on the internet, labeling data requires more resources and becomes more difficult. Therefore, using unsupervised learning to learn data features has great research value. In this paper, an unsupervised multilayered variational auto-encoder model is studied on text data, so that mapping high-dimensional features to low-dimensional features becomes efficient and the low-dimensional features retain as much of the main information as possible. Low-dimensional features obtained by different dimensionality reduction methods are compared with the dimensionality reduction results of the variational auto-encoder (VAE), and the method improves significantly over the comparison methods.
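The VAE objective this abstract relies on combines a reconstruction term with a KL divergence that regularizes the encoder's latent distribution toward the prior. A minimal numpy sketch of that loss (not the paper's multilayered model; the squared-error reconstruction term is an assumption):

```python
import numpy as np

def vae_loss(x, x_recon, mu, log_var):
    """Negated evidence lower bound for a Gaussian-latent VAE:
    reconstruction error plus the closed-form KL divergence between the
    encoder's diagonal Gaussian q(z|x) = N(mu, exp(log_var)) and the
    standard normal prior N(0, I)."""
    recon = np.sum((x - x_recon) ** 2)                      # squared-error reconstruction
    kl = 0.5 * np.sum(mu ** 2 + np.exp(log_var) - log_var - 1.0)
    return recon + kl

# the KL term vanishes exactly when the posterior equals the prior
mu = np.zeros(4)
log_var = np.zeros(4)
x = np.ones(8)
loss = vae_loss(x, x, mu, log_var)   # recon = 0 and KL = 0
```

Minimizing the KL term keeps the low-dimensional code close to the prior, which is what makes the learned features usable as a dimensionality reduction of the input.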


2013 ◽  
Vol 444-445 ◽  
pp. 604-609
Author(s):  
Guang Hui Fu ◽  
Pan Wang

LASSO is a very useful variable selection method for high-dimensional data, but it possesses neither the oracle property [Fan and Li, 2001] nor the group effect [Zou and Hastie, 2005]. In this paper, we first review four improved LASSO-type methods that satisfy the oracle property and/or the group effect, and then propose two new ones, called WFEN and WFAEN. Performance on both simulated and real data sets shows that WFEN and WFAEN are competitive with other LASSO-type methods.
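The shrinkage behavior that all these LASSO-type methods build on is the soft-thresholding operator, which is the exact LASSO solution for a single coefficient under an orthonormal design (the WFEN/WFAEN estimators themselves are not reproduced here):

```python
import numpy as np

def soft_threshold(z, lam):
    """LASSO coefficient update under an orthonormal design:
    shrink the least-squares estimate z toward zero by lam,
    truncating at exactly zero (this is what produces sparsity)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# large coefficients survive (shrunken by lam); small ones are set exactly to zero
z = np.array([3.0, -0.5, 1.5, 0.2])
beta = soft_threshold(z, lam=1.0)
```

The uniform shrinkage of large coefficients visible here is precisely why plain LASSO lacks the oracle property, motivating the adaptive and nonconvex variants the paper reviews.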


2019 ◽  
Vol 2019 ◽  
pp. 1-10 ◽  
Author(s):  
Zhibo Guo ◽  
Ying Zhang

It is very difficult to process and analyze high-dimensional data directly. Therefore, it is necessary to learn a potential subspace of high-dimensional data through excellent dimensionality reduction algorithms, so as to preserve the intrinsic structure of high-dimensional data and discard the less useful information. Principal component analysis (PCA) and linear discriminant analysis (LDA) are two popular dimensionality reduction methods for preprocessing high-dimensional sensor data. LDA comprises two basic methods, namely classic linear discriminant analysis and FS linear discriminant analysis. In this paper, a new method, called similar distribution discriminant analysis (SDDA), is proposed based on the similarity of the samples' distributions. Furthermore, a method for solving the optimal discriminant vectors is given. These discriminant vectors are orthogonal and nearly statistically uncorrelated. SDDA overcomes the disadvantages of PCA and LDA, and the features it extracts are more effective. The recognition performance of SDDA substantially exceeds that of PCA and LDA. Experiments on the Yale face database, the FERET face database, and the UCI multiple features dataset demonstrate that the proposed method is effective. The results show that SDDA obtains better performance than the comparison dimensionality reduction methods.
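For reference, the classic two-class LDA baseline that SDDA is compared against can be sketched in a few lines: the Fisher discriminant direction is the within-class scatter inverse applied to the difference of class means (a standard textbook construction, not the paper's SDDA solver).

```python
import numpy as np

def fisher_lda_direction(X0, X1, reg=1e-6):
    """Classic two-class Fisher discriminant: w ∝ Sw⁻¹ (m1 − m0),
    where Sw is the pooled within-class scatter matrix.
    A small ridge term keeps Sw invertible for high-dimensional data."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    Sw += reg * np.eye(Sw.shape[0])
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[3, 0], scale=1.0, size=(100, 2))
w = fisher_lda_direction(X0, X1)
# the classes separate only along the first axis, so w points mostly along it
```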


2017 ◽  
Author(s):  
Avimanyou Kumar Vatsa

Recently emerging approaches to high-throughput phenotyping have become important tools in unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low or high-dimensional data. Finding clusters in the entire space of high-dimensional data (HDD) is a challenging task, because the relative distances between any two objects converge to zero with increasing dimensionality. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural" -- that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data could differ in shape, size, density, and dimensionality; and they might not be disjoint. In principle, clustering such data could be done by dimension reduction methods. However, these methods convert many dimensions to a smaller set of dimensions that make the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) in the entire data space of the HDD. However, the existing subspace methods don't discover natural clusters. Therefore, in this dissertation I propose a novel data preprocessing method, demonstrating that a group of phenotypes are interdependent, and propose a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is relatively locally density adaptive, scalable, dynamic, and nonmetric in nature, and discovers natural clusters.
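The distance-concentration effect the dissertation cites ("relative distances between any two objects converge to zero with increasing dimensionality") is easy to demonstrate empirically; the following small experiment is illustrative only and is not part of DynaDASC:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=500):
    """(max − min) / min of the distances from the origin to n uniform
    random points. As the dimension grows, distances concentrate and this
    contrast shrinks: the 'curse of dimensionality' that motivates
    subspace clustering."""
    X = rng.uniform(size=(n, dim))
    d = np.linalg.norm(X, axis=1)
    return (d.max() - d.min()) / d.min()

low = relative_contrast(2)
high = relative_contrast(1000)
# the contrast in 1000 dimensions is far smaller than in 2
```

With no contrast between nearest and farthest neighbors, full-space distance-based clustering loses its footing, which is why the dissertation searches for clusters in subspaces instead.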


2011 ◽  
Vol 366 ◽  
pp. 456-459 ◽  
Author(s):  
Jun Yang ◽  
Ying Long Wang

Detecting outliers in a large set of data objects is a major data mining task, aiming at finding the different mechanisms responsible for different groups of objects in a data set. In high-dimensional data, such approaches are bound to deteriorate due to the notorious "curse of dimensionality". In this paper, we propose a novel approach named ODMC (Outlier Detection Based on Markov Chain), in which the effects of the "curse of dimensionality" are alleviated compared with purely distance-based approaches. A main advantage of the new approach is that it uses a key feature of an undirected weighted graph to calculate the outlier degree of each node. In a thorough experimental evaluation, we compare ODMC with ABOD and FindFPOF on various artificial and real data sets, and show that ODMC performs especially well on high-dimensional data.
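One way a Markov chain on a weighted similarity graph yields an outlier degree is through its stationary distribution: a random walk rarely visits isolated nodes. The sketch below is in the spirit of ODMC but uses a generic Gaussian-affinity construction; the paper's exact graph and score may differ.

```python
import numpy as np

def markov_outlier_scores(X, sigma=1.0, iters=200):
    """Outlier degree via the stationary distribution of a random walk on a
    fully connected similarity graph: a small stationary probability marks
    a node the walk rarely visits, i.e., an outlier."""
    # Gaussian similarity between all pairs, zero self-similarity
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-D ** 2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)     # row-stochastic transition matrix
    pi = np.full(len(X), 1.0 / len(X))
    for _ in range(iters):                   # power iteration to stationarity
        pi = pi @ P
    return pi                                # small pi[i] => outlier

# a tight cluster of four points plus one far-away point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = markov_outlier_scores(X)
# the fifth point receives the smallest stationary probability
```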


2020 ◽  
Vol 17 (2) ◽  
pp. 0550
Author(s):  
Ali Hameed Yousef ◽  
Omar Abdulmohsin Ali

The penalized regression model has received considerable attention for variable selection, and it plays an essential role in dealing with high-dimensional data. The arctangent penalty, denoted Atan, has recently been used as an efficient method for both estimation and variable selection. However, the Atan penalty is very sensitive to outliers in the response variable or to heavy-tailed error distributions, whereas least absolute deviation (LAD) is a good way to obtain robustness in regression estimation. The specific objective of this research is to propose a robust Atan estimator that combines these two ideas. Simulation experiments and real data applications show that the proposed LAD-Atan estimator has superior performance compared with other estimators.
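The combined objective can be written down directly: an absolute-deviation loss plus the arctangent penalty on the coefficients. The penalty form below follows the one commonly given in the Atan-penalty literature; treat the exact scaling constant, and the objective as a whole, as an illustrative assumption rather than the authors' estimator.

```python
import numpy as np

def atan_penalty(beta, lam, gamma=0.005):
    """Arctangent (Atan) penalty: a bounded, nonconvex alternative to the
    L1 norm that shrinks small coefficients hard while barely penalizing
    large ones (exact constant assumed, not taken from the paper)."""
    return lam * (gamma + 2.0 / np.pi) * np.sum(np.arctan(np.abs(beta) / gamma))

def lad_atan_objective(beta, X, y, lam):
    """Sketch of a LAD-Atan objective: the least-absolute-deviation term
    resists outliers in the response, the Atan penalty drives sparsity."""
    return np.sum(np.abs(y - X @ beta)) + atan_penalty(beta, lam)

# with beta = 0 the penalty vanishes and the objective reduces to sum |y|
X = np.ones((5, 2))
y = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
obj = lad_atan_objective(np.zeros(2), X, y, lam=0.5)
```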


2012 ◽  
Vol 12 (1) ◽  
pp. 44-64 ◽  
Author(s):  
Sara Johansson Fernstad ◽  
Jane Shaw ◽  
Jimmy Johansson

High-dimensional data sets containing hundreds of variables are difficult to explore, as traditional visualization methods often are unable to represent such data effectively. This is commonly addressed by employing dimensionality reduction prior to visualization. Numerous dimensionality reduction methods are available. However, few reduction approaches take the importance of several structures into account and few provide an overview of structures existing in the full high-dimensional data set. For exploratory analysis, as well as for many other tasks, several structures may be of interest. Exploration of the full high-dimensional data set without reduction may also be desirable. This paper presents flexible methods for exploratory analysis and interactive dimensionality reduction. Automated methods are employed to analyse the variables, using a range of quality metrics, providing one or more measures of ‘interestingness’ for individual variables. Through ranking, a single value of interestingness is obtained, based on several quality metrics, that is usable as a threshold for the most interesting variables. An interactive environment is presented in which the user is provided with many possibilities to explore and gain understanding of the high-dimensional data set. Guided by this, the analyst can explore the high-dimensional data set and interactively select a subset of the potentially most interesting variables, employing various methods for dimensionality reduction. The system is demonstrated through a use-case analysing data from a DNA sequence-based study of bacterial populations.
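The ranking step described above can be sketched with two stand-in quality metrics; the paper uses a richer, user-configurable set of metrics, so both the metric choices and their combination rule below are assumptions for illustration.

```python
import numpy as np

def interestingness_ranking(X, names):
    """Combine two simple per-variable quality metrics (variance and mean
    absolute correlation with the other variables) into a single
    'interestingness' value, then rank variables by it."""
    var_score = X.var(axis=0)
    corr = np.abs(np.corrcoef(X.T))
    np.fill_diagonal(corr, 0.0)
    corr_score = corr.mean(axis=0)
    # normalize each metric to [0, 1] and sum into one combined value
    combined = var_score / var_score.max() + corr_score / corr_score.max()
    order = np.argsort(-combined)
    return [names[i] for i in order], combined

# two correlated high-variance variables and one low-variance noise variable
rng = np.random.default_rng(2)
a = rng.normal(scale=3.0, size=300)
b = a + 0.1 * rng.normal(size=300)
c = 0.1 * rng.normal(size=300)
X = np.column_stack([a, b, c])
ranked, scores = interestingness_ranking(X, ["a", "b", "c"])
```

A threshold on the combined value then selects the subset of most interesting variables for the interactive reduction step.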


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Joshua T. Vogelstein ◽  
Eric W. Bridgeford ◽  
Minh Tang ◽  
Da Zheng ◽  
Christopher Douville ◽  
...  

To solve key biomedical problems, experimentalists now routinely measure millions or billions of features (dimensions) per sample, with the hope that data science techniques will be able to build accurate data-driven inferences. Because sample sizes are typically orders of magnitude smaller than the dimensionality of these data, valid inferences require finding a low-dimensional representation that preserves the discriminating information (e.g., whether the individual suffers from a particular disease). There is a lack of interpretable supervised dimensionality reduction methods that scale to millions of dimensions with strong statistical theoretical guarantees. We introduce an approach to extending principal components analysis by incorporating class-conditional moment estimates into the low-dimensional projection. The simplest version, Linear Optimal Low-Rank Projection, incorporates the class-conditional means. We prove, and substantiate with both synthetic and real data benchmarks, that Linear Optimal Low-Rank Projection and its generalizations lead to improved data representations for subsequent classification, while maintaining computational efficiency and scalability. Using multiple brain imaging datasets consisting of more than 150 million features, and several genomics datasets with more than 500,000 features, Linear Optimal Low-Rank Projection outperforms other scalable linear dimensionality reduction techniques in terms of accuracy, while requiring only a few minutes on a standard desktop computer.
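The simplest version described above (augmenting principal directions with the class-conditional mean difference) can be sketched as follows; this is an illustration of the idea, not the authors' reference implementation, and the orthonormalization step is an assumption.

```python
import numpy as np

def lol_projection(X, y, d):
    """Sketch of Linear Optimal Low-rank projection, simplest version:
    prepend the class-conditional mean difference to the top principal
    directions of the centered data, then orthonormalize."""
    m0 = X[y == 0].mean(axis=0)
    m1 = X[y == 1].mean(axis=0)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)  # principal directions
    A = np.vstack([(m1 - m0)[None, :], Vt[: d - 1]])
    Q, _ = np.linalg.qr(A.T)          # orthonormal columns spanning the same space
    return Q                          # project with X @ Q

# two well-separated Gaussian classes in 20 dimensions
rng = np.random.default_rng(0)
X0 = rng.normal(size=(50, 20))
X1 = rng.normal(size=(50, 20)) + 2.0
X = np.vstack([X0, X1])
y = np.array([0] * 50 + [1] * 50)
Q = lol_projection(X, y, d=3)
Z = X @ Q                             # 3-dimensional representation
```

Because the first projection direction is the class-mean difference, the classes remain separated in the low-dimensional representation, which is what supports the subsequent classification step.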


2015 ◽  
Vol 2015 ◽  
pp. 1-12 ◽  
Author(s):  
Sai Kiranmayee Samudrala ◽  
Jaroslaw Zola ◽  
Srinivas Aluru ◽  
Baskar Ganapathysubramanian

Dimensionality reduction refers to a set of mathematical techniques used to reduce complexity of the original high-dimensional data, while preserving its selected properties. Improvements in simulation strategies and experimental data collection methods are resulting in a deluge of heterogeneous and high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify key components underlying the spectral dimensionality reduction techniques, and propose their efficient parallel implementation. We show that the resulting framework can be used to process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate applicability of our framework we perform dimensionality reduction of 75,000 images representing morphology evolution during manufacturing of organic solar cells in order to identify how processing parameters affect morphology evolution.
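The key components shared by spectral dimensionality reduction techniques can be seen in a compact serial skeleton: build an affinity matrix, form a graph Laplacian, and embed with its eigenvectors. Each of these steps is what a parallel framework like the one described must distribute; the small Laplacian-eigenmaps sketch below is illustrative and makes no attempt at scale.

```python
import numpy as np

def laplacian_eigenmap(X, d, sigma=1.0):
    """Serial skeleton of a spectral dimensionality reduction method:
    Gaussian affinity matrix -> unnormalized graph Laplacian ->
    embedding from the smallest nontrivial eigenvectors."""
    D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-D2 / (2 * sigma ** 2))        # pairwise affinities
    deg = W.sum(axis=1)
    L = np.diag(deg) - W                      # unnormalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)            # ascending eigenvalues
    return vecs[:, 1 : d + 1]                 # skip the constant eigenvector

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
Y = laplacian_eigenmap(X, d=2)
```

At scale, the all-pairs affinity computation and the eigendecomposition dominate, which is why those are the components a parallel implementation must target.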

