A High-Dimensional Modeling System Based on Analytical Hierarchy Process and Information Criteria

Mathematical Problems in Engineering ◽

10.1155/2021/6198317 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Tuba Koç

Keyword(s):

High Dimensional Data ◽

Information Criteria ◽

Classification Model ◽

High Dimensional ◽

Data Sets ◽

Suitable Model ◽

Classification Problems ◽

Dimensional Modeling ◽

Novel Approach ◽

Hierarchy Process

High-dimensional data sets frequently occur in several scientific areas, and special techniques are required to analyze these types of data sets. Especially, it becomes important to apply a suitable model in classification problems. In this study, a novel approach is proposed to estimate a statistical model for high-dimensional data sets. The proposed method uses analytical hierarchical process (AHP) and information criteria for determining the optimal PCs for the classification model. The high-dimensional “colon” and “gravier” datasets were used in evaluation part. Application results demonstrate that the proposed approach can be successfully used for modeling purposes.

Download Full-text

A high-dimensional classification approach based on class-dependent feature subspace

Industrial Management & Data Systems ◽

10.1108/imds-11-2016-0491 ◽

2017 ◽

Vol 117 (10) ◽

pp. 2325-2339

Author(s):

Fuzan Chen ◽

Harris Wu ◽

Runliang Dou ◽

Minqiang Li

Keyword(s):

Predictive Power ◽

High Dimensional Data ◽

Classification Model ◽

High Dimensional ◽

Svm Classifier ◽

Data Sets ◽

Content Type ◽

Classification Approach ◽

Dimensional Classification ◽

Feature Subspace

Purpose The purpose of this paper is to build a compact and accurate classifier for high-dimensional classification. Design/methodology/approach A classification approach based on class-dependent feature subspace (CFS) is proposed. CFS is a class-dependent integration of a support vector machine (SVM) classifier and associated discriminative features. For each class, our genetic algorithm (GA)-based approach evolves the best subset of discriminative features and SVM classifier simultaneously. To guarantee convergence and efficiency, the authors customize the GA in terms of encoding strategy, fitness evaluation, and genetic operators. Findings Experimental studies demonstrated that the proposed CFS-based approach is superior to other state-of-the-art classification algorithms on UCI data sets in terms of both concise interpretation and predictive power for high-dimensional data. Research limitations/implications UCI data sets rather than real industrial data are used to evaluate the proposed approach. In addition, only single-label classification is addressed in the study. Practical implications The proposed method not only constructs an accurate classification model but also obtains a compact combination of discriminative features. It is helpful for business makers to get a concise understanding of the high-dimensional data. Originality/value The authors propose a compact and effective classification approach for high-dimensional data. Instead of the same feature subset for all the classes, the proposed CFS-based approach obtains the optimal subset of discriminative feature and SVM classifier for each class. The proposed approach enhances both interpretability and predictive power for high-dimensional data.

Download Full-text

stepwiseCM: An R Package for Stepwise Classification of Cancer Samples Using Multiple Heterogeneous Data Sets

Cancer Informatics ◽

10.4137/cin.s13075 ◽

2014 ◽

Vol 13 ◽

pp. CIN.S13075

Author(s):

Askar Obulkasim ◽

Mark A van de Wiel

Keyword(s):

Waiting Times ◽

High Dimensional Data ◽

R Package ◽

Heterogeneous Data ◽

The Other ◽

High Dimensional ◽

Data Sets ◽

Classification Problems ◽

Data Types ◽

Crucial Difference

This paper presents the R/Bioconductor package stepwiseCM, which classifies cancer samples using two heterogeneous data sets in an efficient way. The algorithm is able to capture the distinct classification power of two given data types without actually combining them. This package suits for classification problems where two different types of data sets on the same samples are available. One of these data types has measurements on all samples and the other one has measurements on some samples. One is easy to collect and/or relatively cheap (eg, clinical covariates) compared to the latter (high-dimensional data, eg, gene expression). One additional application for which stepwiseCM is proven to be useful as well is the combination of two high-dimensional data types, eg, DNA copy number and mRNA expression. The package includes functions to project the neighborhood information in one data space to the other to determine a potential group of samples that are likely to benefit most by measuring the second type of covariates. The two heterogeneous data spaces are connected by indirect mapping. The crucial difference between the stepwise classification strategy implemented in this package and the existing packages is that our approach aims to be cost-efficient by avoiding measuring additional covariates, which might be expensive or patient-unfriendly, for a potentially large subgroup of individuals. Moreover, in diagnosis for these individuals test, results would be quickly available, which may lead to reduced waiting times and hence lower the patients’ distress. The improvement described remedies the key limitations of existing packages, and facilitates the use of the stepwiseCM package in diverse applications.

Download Full-text

An Advanced Mining Services in Predicting and Ranking User Vitality across Dynamic and High Dimensional Data Sets

SSRN Electronic Journal ◽

10.2139/ssrn.3395242 ◽

2019 ◽

Author(s):

Ch. Durga Bhavani ◽

Dr. A. Daveedu Raju ◽

Dr. V. Surya Narayana

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Data Sets

Download Full-text

Hybrid random subsample classifier ensemble for high dimensional data sets

International Journal of Hybrid Intelligent Systems ◽

10.3233/his-2012-0149 ◽

2012 ◽

Vol 9 (2) ◽

pp. 91-103 ◽

Cited By ~ 1

Author(s):

Santhosh Pathical ◽

Gursel Serpen

Keyword(s):

High Dimensional Data ◽

Classifier Ensemble ◽

High Dimensional ◽

Data Sets

Download Full-text

Fast Approximate Similarity Search in Extremely High-Dimensional Data Sets

21st International Conference on Data Engineering (ICDE'05) ◽

10.1109/icde.2005.66 ◽

2005 ◽

Cited By ~ 43

Author(s):

M.E. Houle ◽

Jun Sakuma

Keyword(s):

Similarity Search ◽

High Dimensional Data ◽

High Dimensional ◽

Data Sets ◽

Approximate Similarity Search ◽

Approximate Similarity

Download Full-text

Mahalanobis distance informed by clustering

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iay011 ◽

2018 ◽

Vol 8 (2) ◽

pp. 377-406

Author(s):

Almog Lahav ◽

Ronen Talmon ◽

Yuval Kluger

Keyword(s):

Mahalanobis Distance ◽

High Dimensional Data ◽

Hidden Variables ◽

Real Data ◽

Risk Groups ◽

High Dimensional ◽

Data Sets ◽

Kaplan Meier ◽

Data Points ◽

Survival Plot

Abstract A fundamental question in data analysis, machine learning and signal processing is how to compare between data points. The choice of the distance metric is specifically challenging for high-dimensional data sets, where the problem of meaningfulness is more prominent (e.g. the Euclidean distance between images). In this paper, we propose to exploit a property of high-dimensional data that is usually ignored, which is the structure stemming from the relationships between the coordinates. Specifically, we show that organizing similar coordinates in clusters can be exploited for the construction of the Mahalanobis distance between samples. When the observable samples are generated by a nonlinear transformation of hidden variables, the Mahalanobis distance allows the recovery of the Euclidean distances in the hidden space. We illustrate the advantage of our approach on a synthetic example where the discovery of clusters of correlated coordinates improves the estimation of the principal directions of the samples. Our method was applied to real data of gene expression for lung adenocarcinomas (lung cancer). By using the proposed metric we found a partition of subjects to risk groups with a good separation between their Kaplan–Meier survival plot.

Download Full-text

A Novel Approach to Visualization of High-Dimensional Data by Pairwise Fusion Matrices Using t-SNE

Current Topics on Mathematics and Computer Science Vol. 2 ◽

10.9734/bpi/ctmcs/v2/2338f ◽

2021 ◽

pp. 89-108

Author(s):

Mujtaba Husnain ◽

Malik Muhammad Saad Missen ◽

Shahzad Mumtaz ◽

Muhammad Muzzamil Luqman ◽

Mickael Coustaty ◽

...

Keyword(s):

High Dimensional Data ◽

High Dimensional ◽

Novel Approach

Download Full-text

Clustering Based Feature Data Selection Technique Algorithm for High Dimensional Data: A Novel Approach

10.9734/bpi/nvst/v7/5002f ◽

2021 ◽

pp. 33-38

Author(s):

Amos R ◽

Kowshik N ◽

Suraksha M. S

Keyword(s):

High Dimensional Data ◽

Data Selection ◽

High Dimensional ◽

Selection Technique ◽

Novel Approach

Download Full-text

Compact Belief Rule Base Learning for Classification with Evidential Clustering

Entropy ◽

10.3390/e21050443 ◽

2019 ◽

Vol 21 (5) ◽

pp. 443 ◽

Cited By ~ 1

Author(s):

Lianmeng Jiao ◽

Xiaojiao Geng ◽

Quan Pan

Keyword(s):

Classification System ◽

Optimization Procedure ◽

Classification Model ◽

Rule Base ◽

Data Sets ◽

Classification Problems ◽

Ecm Algorithm ◽

Rule Based ◽

Belief Rule Base ◽

Belief Function Theory

The belief rule-based classification system (BRBCS) is a promising technique for addressing different types of uncertainty in complex classification problems, by introducing the belief function theory into the classical fuzzy rule-based classification system. However, in the BRBCS, high numbers of instances and features generally induce a belief rule base (BRB) with large size, which degrades the interpretability of the classification model for big data sets. In this paper, a BRB learning method based on the evidential C-means clustering (ECM) algorithm is proposed to efficiently design a compact belief rule-based classification system (CBRBCS). First, a supervised version of the ECM algorithm is designed by means of weighted product-space clustering to partition the training set with the goals of obtaining both good inter-cluster separability and inner-cluster pureness. Then, a systematic method is developed to construct belief rules based on the obtained credal partitions. Finally, an evidential partition entropy-based optimization procedure is designed to get a compact BRB with a better trade-off between accuracy and interpretability. The key benefit of the proposed CBRBCS is that it can provide a more interpretable classification model on the premise of comparative accuracy. Experiments based on synthetic and real data sets have been conducted to evaluate the classification accuracy and interpretability of the proposal.

Download Full-text

Double-Arc Parallel Coordinates and its Axes re-Ordering Methods

Mobile Networks and Applications ◽

10.1007/s11036-019-01455-9 ◽

2020 ◽

Vol 25 (4) ◽

pp. 1376-1391

Author(s):

Liangfu Lu ◽

Wenbo Wang ◽

Zhiyuan Tan

Keyword(s):

High Dimensional Data ◽

Two Dimensions ◽

High Dimensional ◽

Data Sets ◽

Parallel Coordinates ◽

Practical Applications ◽

Data Volume ◽

Visual Clutter ◽

Correlation Information ◽

Value Decomposition

AbstractThe Parallel Coordinates Plot (PCP) is a popular technique for the exploration of high-dimensional data. In many cases, researchers apply it as an effective method to analyze and mine data. However, when today’s data volume is getting larger, visual clutter and data clarity become two of the main challenges in parallel coordinates plot. Although Arc Coordinates Plot (ACP) is a popular approach to address these challenges, few optimization and improvement have been made on it. In this paper, we do three main contributions on the state-of-the-art PCP methods. One approach is the improvement of visual method itself. The other two approaches are mainly on the improvement of perceptual scalability when the scale or the dimensions of the data turn to be large in some mobile and wireless practical applications. 1) We present an improved visualization method based on ACP, termed as double arc coordinates plot (DACP). It not only reduces the visual clutter in ACP, but use a dimension-based bundling method with further optimization to deals with the issues of the conventional parallel coordinates plot (PCP). 2)To reduce the clutter caused by the order of the axes and reveal patterns that hidden in the data sets, we propose our first dimensional reordering method, a contribution-based method in DACP, which is based on the singular value decomposition (SVD) algorithm. The approach computes the importance score of attributes (dimensions) of the data using SVD and visualize the dimensions from left to right in DACP according the score in SVD. 3) Moreover, a similarity-based method, which is based on the combination of nonlinear correlation coefficient and SVD algorithm, is proposed as well in the paper. To measure the correlation between two dimensions and explains how the two dimensions interact with each other, we propose a reordering method based on non-linear correlation information measurements. We mainly use mutual information to calculate the partial similarity of dimensions in high-dimensional data visualization, and SVD is used to measure global data. Lastly, we use five case scenarios to evaluate the effectiveness of DACP, and the results show that our approaches not only do well in visualizing multivariate dataset, but also effectively alleviate the visual clutter in the conventional PCP, which bring users a better visual experience.

Download Full-text