Column Subset Selection
Recently Published Documents

TOTAL DOCUMENTS: 41 (five years: 6)
H-INDEX: 9 (five years: 0)

Cancers, 2021, Vol. 13 (17), pp. 4297
Author(s): Pratip Rana, Phuc Thai, Thang Dinh, Preetam Ghosh

Biologists seek to identify a small number of significant features that are important, non-redundant, and relevant from diverse omics data. For example, statistical methods such as LIMMA and DESeq distinguish differentially expressed genes between case and control groups from the transcript profile. Researchers also apply various column subset selection algorithms to genomics datasets for a similar purpose. Unfortunately, genes selected by such statistical or machine learning methods are often highly co-regulated, making their performance inconsistent. Here, we introduce a novel feature selection algorithm that selects highly disease-related and non-redundant features from a diverse set of omics datasets. We successfully applied this algorithm to three different biological problems: (a) disease-to-normal sample classification; (b) multiclass classification of different disease samples; and (c) disease subtype detection. In terms of classification ROC-AUC, false-positive rate, and false-negative rate, our algorithm outperformed other gene selection and differential expression (DE) methods on all six TCGA cancer datasets considered here, for both binary and multiclass classification problems. Moreover, genes picked by our algorithm improved disease-subtyping accuracy for four different cancer types over state-of-the-art methods. Hence, we posit that our proposed feature reduction method can help the community solve various problems, including the selection of disease-specific biomarkers, precision medicine design, and disease subtype detection.
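The general column-subset-selection idea behind such feature pickers can be sketched with a simple greedy, pivoted-QR-style procedure. The toy below is plain NumPy with random data standing in for a gene-expression matrix; it is an illustration of the generic technique, not the authors' algorithm. Orthogonalizing the remaining columns against each chosen one is what discourages picking co-regulated (near-duplicate) features:

```python
import numpy as np

def select_columns(X, k):
    """Greedily pick k column indices by residual norm (pivoted-QR style)."""
    R = X.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        norms[chosen] = -1.0          # never re-pick an already-chosen column
        j = int(np.argmax(norms))
        chosen.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)       # remove the chosen direction from all columns
    return chosen

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 200))    # e.g. 50 samples x 200 genes
idx = select_columns(X, 10)
print(len(idx), len(set(idx)))        # 10 10: ten distinct features selected
```

Each pass costs O(mn), so selecting k features is O(mnk); a correlated gene contributes little residual norm once its co-regulated partner is chosen, so it tends not to be picked.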


Author(s): Matthias Ryser, Felix M. Neuhauser, Christoph Hein, Pavel Hora, Markus Bambach

In this paper, we propose a new approach for the simulation-based support of tryout operations in deep drawing, which can be schematically classified as automatic knowledge acquisition. The central idea is to identify information-maximising positions for draw-in and local blank holder force sensors by solving the column subset selection problem with respect to the sensor sensitivities. Inverse surrogate models are then trained using the selected sensor signals as predictors and the material and process parameters as targets. The final models are able to observe the drawing process by estimating the current material and process parameters, which can then be compared to the target values to identify process corrections. The methodology is examined on an Audi A8L side panel frame using a set of 635 simulations, where 20 out of 21 material and process parameters can be estimated with an R2 value greater than 0.9. The result shows that the observational models are not only capable of estimating all but one process parameter with high accuracy, but also allow the determination of material parameters at the same time. Since no assumptions are made about the type of process, sensors, material, or process parameters, the proposed methodology can also be applied to other manufacturing processes and use cases.
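The two-stage pipeline described above can be sketched on synthetic data. Everything here is an illustrative assumption, not the paper's actual models: a random matrix stands in for the sensor sensitivities, a greedy selector stands in for the column subset selection step, and a plain least-squares fit stands in for the inverse surrogate:

```python
import numpy as np

def select_columns(A, k):
    """Greedy column subset selection by residual norm (pivoted-QR style)."""
    R = A.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(R, axis=0)
        norms[chosen] = -1.0
        j = int(np.argmax(norms))
        chosen.append(j)
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)
    return chosen

rng = np.random.default_rng(1)
n_params, n_sensors_all, n_sims = 5, 40, 300
S = rng.standard_normal((n_params, n_sensors_all))   # toy sensitivity matrix

# Step 1: pick sensors whose sensitivity columns add the most new directions.
sensors = select_columns(S, k=8)

# Simulated "experiments": parameters -> sensor readings (+ small noise).
P = rng.standard_normal((n_sims, n_params))          # material/process parameters
Y = P @ S + 0.01 * rng.standard_normal((n_sims, n_sensors_all))

# Step 2: inverse surrogate from the selected sensor signals to the parameters.
W, *_ = np.linalg.lstsq(Y[:, sensors], P, rcond=None)
P_hat = Y[:, sensors] @ W
r2 = 1 - np.sum((P - P_hat) ** 2) / np.sum((P - P.mean(0)) ** 2)
print(round(r2, 3))   # close to 1.0 on this easy linear toy setup
```

The paper's surrogates are trained on FEM simulations of a nonlinear process, so a linear least-squares fit is only a stand-in; the structure (select informative sensors, then regress parameters from their signals) is the point of the sketch.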


Author(s): Michał Dereziński, Rajiv Khanna, Michael W. Mahoney

The Column Subset Selection Problem (CSSP) and the Nyström method are among the leading tools for constructing interpretable low-rank approximations of large datasets by selecting a small but representative set of features or instances. A fundamental question in this area is: what is the cost of this interpretability, i.e., how well can a data subset of size k compete with the best rank-k approximation? We develop techniques which exploit spectral properties of the data matrix to obtain improved approximation guarantees that go beyond the standard worst-case analysis. Our approach leads to significantly better bounds for datasets with known rates of singular value decay, e.g., polynomial or exponential decay. Our analysis also reveals an intriguing phenomenon: the cost of interpretability as a function of k may exhibit multiple peaks and valleys, which we call a multiple-descent curve. A lower bound we establish shows that this behavior is not an artifact of our analysis, but rather an inherent property of the CSSP and Nyström tasks. Finally, using the example of a radial basis function (RBF) kernel, we show that both our improved bounds and the multiple-descent curve can be observed on real datasets simply by varying the RBF parameter.
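The "cost of interpretability" can be measured numerically: build a matrix with an exponentially decaying spectrum, pick k columns with a simple greedy selector (a stand-in, not the paper's analysis), and compare the column-subset error to the Eckart–Young optimum. Since a k-column projection has rank at most k, the ratio is always at least 1:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 2.0 ** -np.arange(n)                  # exponentially decaying singular values
A = U @ np.diag(s) @ V.T

# Greedy column choice (pivoted-QR style), purely as a stand-in selector.
k = 5
R, chosen = A.copy(), []
for _ in range(k):
    norms = np.linalg.norm(R, axis=0)
    norms[chosen] = -1.0
    j = int(np.argmax(norms))
    chosen.append(j)
    q = R[:, j] / np.linalg.norm(R[:, j])
    R -= np.outer(q, q @ R)

C = A[:, chosen]
P = C @ np.linalg.pinv(C)                 # projector onto span of chosen columns
css_err = np.linalg.norm(A - P @ A)       # Frobenius error of the column subset
best_err = np.sqrt(np.sum(s[k:] ** 2))    # best rank-k error (Eckart-Young)
print(css_err / best_err >= 1 - 1e-9)     # True: interpretability has a cost
```

Sweeping k and plotting `css_err / best_err` is the kind of curve on which the paper's multiple peaks and valleys appear; this snippet only verifies the ratio is well defined and at least 1 at a single k.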


2020
Author(s): Mohsen Joneidi, Saeed Vahidian, Ashkan Esmaeili, Siavash Khodadadeh

We propose a novel technique for finding representatives in a large, unsupervised dataset. The approach is based on the concept of self-rank, defined as the minimum number of samples needed to reconstruct all samples with an accuracy proportional to that of the rank-K approximation. Our proposed algorithm has linear complexity in the size of the original dataset and simultaneously provides an adaptive upper bound on the approximation ratio. These favorable characteristics help close a long-standing gap between practical and theoretical methods for finding representatives.
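The self-rank idea can be mimicked with a simple greedy sketch (not the authors' linear-time algorithm; the function name and tolerance are assumptions for illustration): keep adding the least-explained sample until every sample is reconstructed within a tolerance of the best rank-K error:

```python
import numpy as np

def representatives(X, K, tol=1.1):
    """Greedily add samples until all rows of X are reconstructed within
    tol times the best rank-K error (Frobenius norm)."""
    s = np.linalg.svd(X, compute_uv=False)
    budget = tol * np.sqrt(np.sum(s[K:] ** 2))   # rank-K error target
    R = X.astype(float).copy()                   # per-sample residuals
    chosen = []
    while np.linalg.norm(R) > budget and len(chosen) < X.shape[0]:
        j = int(np.argmax(np.linalg.norm(R, axis=1)))  # least-explained sample
        chosen.append(j)
        q = R[j] / np.linalg.norm(R[j])
        R -= np.outer(R @ q, q)                  # project all residuals off q
    return chosen

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 30))
X += 0.01 * rng.standard_normal(X.shape)         # near-rank-4 data
reps = representatives(X, K=4)
print(len(reps))                                 # a small number of samples
```

The number of samples the loop ends up choosing plays the role the abstract calls self-rank: for near-rank-4 data it stays close to 4, and the loop always terminates because each pick zeroes one sample's residual.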



