Bayesian Unidimensional Scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations

2017 ◽  
Author(s):  
Lan Huong Nguyen ◽  
Susan Holmes

Abstract
Background: Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous process, often unknown. Estimating the data points' "natural ordering" and its corresponding uncertainties can help researchers draw insights about the mechanisms involved.
Results: We introduce a Bayesian Unidimensional Scaling (BUDS) technique which extracts dominant sources of variation in high-dimensional datasets and produces visual data summaries, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their underlying trajectory and provides estimated uncertainty bounds. By statistically modeling dissimilarities and applying a DiSTATIS registration method to their posterior samples, we incorporate visualizations of uncertainty in the estimated data trajectory across different regions, using confidence contours for individual data points. We also illustrate the estimated overall data density across different areas by including density clouds. The one-dimensional coordinates recovered by BUDS help researchers discover sample attributes or covariates that drive the main variability in a dataset. We demonstrated the usefulness and accuracy of BUDS on published microbiome 16S, RNA-seq, and roll call data.
Conclusions: Our method effectively recovers and visualizes natural orderings present in datasets. Automatic visualization tools for data exploration and analysis are available at: https://nlhuong.shinyapps.io/visTrajectory/.
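BUDS is a Bayesian model, but its deterministic backbone, recovering one-dimensional coordinates from a dissimilarity matrix, can be sketched with classical (Torgerson) scaling. This is a minimal illustration of the scaling step only, not the paper's Bayesian posterior; the function name and toy data are ours:

```python
import numpy as np

def classical_1d_scaling(D):
    """Recover 1D coordinates from a pairwise dissimilarity matrix via
    classical (Torgerson) scaling: double-center the squared
    dissimilarities and keep the leading eigenvector."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    # the leading eigenpair gives the best 1D Euclidean embedding
    return vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))

# toy example: points on a line are recovered exactly, up to shift/reflection
truth = np.array([0.0, 0.1, 0.35, 0.4, 0.8, 1.0])
D = np.abs(truth[:, None] - truth[None, :])
coords = classical_1d_scaling(D)
```

For exactly one-dimensional data, the recovered coordinates match the true ones up to translation and reflection; BUDS additionally quantifies the uncertainty of each recovered coordinate.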

Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 317 ◽  
Author(s):  
Vincenzo Dentamaro ◽  
Donato Impedovo ◽  
Giuseppe Pirlo

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known "curse of dimensionality" problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. It is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
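LICIC's exact construction is not reproduced here, but the baseline it is compared against, SMOTE-style interpolation between minority-class neighbours, can be sketched as follows. The function name and parameters are illustrative; LICIC itself adds a step that preserves non-linearities, which this sketch omits:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """SMOTE-style oversampling: each synthetic minority sample is an
    interpolation between a random minority point and one of its k
    nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                  # random minority point
        j = rng.choice(neighbours[i])        # one of its k nearest neighbours
        lam = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class: 20 points in 3 dimensions, 15 synthetic samples
X_min = np.random.default_rng(2).normal(size=(20, 3))
S = smote_like_oversample(X_min, n_new=15, k=4)
```

Because every synthetic point lies on a segment between two real minority points, oversampling never leaves the convex hull of the minority class, which is also why plain SMOTE can struggle in very high dimensions.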


Author(s):  
Y. Wang ◽  
Yuan Yan Tang ◽  
Luoqing Li ◽  
Jianzhong Wang

This paper presents a novel classifier based on collaborative representation (CR) and multiple one-dimensional (1D) embedding, with applications to face recognition. The use of a multiple 1D embedding (1DME) framework in semi-supervised learning was first proposed by one of the authors, J. Wang, in 2014. The main idea of multiple 1D embedding is the following: given a high-dimensional dataset, we first map it onto several different 1D sequences on the line while keeping the proximity of data points in the original ambient high-dimensional space. By this means, a classification problem in high dimension reduces to one in a 1D framework, which can be efficiently solved by any classical 1D regularization method, for instance, an interpolation scheme. The dissimilarity metric plays an important role in learning a decent 1DME of the original dataset. Our other contribution is to develop a collaborative representation based dissimilarity (CRD) metric. Compared to the conventional Euclidean-distance-based metric, the proposed method leads to better results. Experimental results on real-world databases verify the efficacy of the proposed method.
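Collaborative representation codes a query jointly over all samples with a ridge penalty. One plausible way to turn the resulting codes into a per-sample dissimilarity is sketched below; the paper's exact CRD formulation may differ, and the function names are ours:

```python
import numpy as np

def cr_codes(X, x, lam=0.1):
    """Collaborative representation: code the query x jointly over all
    samples (columns of X) with ridge penalty lam."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ x)

def cr_dissimilarity(X, x, lam=0.1):
    """Score each sample i by the residual when x is approximated by
    sample i's contribution alone -- an illustrative variant of a
    CR-based dissimilarity."""
    alpha = cr_codes(X, x, lam)
    return np.linalg.norm(x[:, None] - X * alpha[None, :], axis=0)

# toy dictionary: four orthonormal samples; the query equals sample 2
X = np.eye(4)
x = np.zeros(4)
x[2] = 1.0
d = cr_dissimilarity(X, x, lam=0.1)  # sample 2 gets the smallest dissimilarity
```

Because the coding is collaborative, a sample that explains the query well receives a large coefficient and hence a small residual, which is the property the 1DME construction needs from its metric.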


2009 ◽  
Vol 6 (2) ◽  
pp. 217-227 ◽  
Author(s):  
Aswani Kumar

Domains such as text and images contain large amounts of redundancy and ambiguity among attributes, which results in considerable noise effects (i.e., the data is high-dimensional). Retrieving data from high-dimensional datasets is a big challenge. Dimensionality reduction techniques have been a successful avenue for automatically extracting latent concepts by removing noise and reducing the complexity of processing high-dimensional data. In this paper we conduct a systematic study comparing unsupervised dimensionality reduction techniques for the text retrieval task. We analyze these techniques in terms of complexity, approximation error, and retrieval quality, with experiments on four test document collections.
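A standard example of unsupervised dimensionality reduction for text retrieval is Latent Semantic Analysis via truncated SVD. The sketch below uses one common query fold-in convention (variants differ in how they weight the singular values); names and the tiny corpus are illustrative:

```python
import numpy as np

def lsa(term_doc, k):
    """Latent Semantic Analysis: a truncated SVD of the term-document
    matrix keeps the top-k latent concepts and discards noise."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]

def query_scores(q, U, s, Vt):
    """Fold the query into concept space and rank documents by cosine
    similarity to their concept-space representations (columns of Vt)."""
    q_hat = U.T @ q / s                      # query in concept space
    num = Vt.T @ q_hat
    denom = np.linalg.norm(Vt, axis=0) * np.linalg.norm(q_hat) + 1e-12
    return num / denom

# tiny corpus: 4 terms x 3 documents
term_doc = np.array([[1.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0]])
U, s, Vt = lsa(term_doc, k=2)
q = np.array([1.0, 1.0, 0.0, 0.0])          # query repeating document 0
scores = query_scores(q, U, s, Vt)
```

The rank parameter k trades approximation error against noise removal, which is exactly the complexity/quality trade-off the paper's comparison examines.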


Author(s):  
Dewi Pramudi Ismi ◽  
Shireen Panchoo ◽  
Murinto Murinto

With hundreds or thousands of features in high-dimensional data, the computational workload is challenging. In the classification process, features which do not contribute significantly to the prediction of classes add to the computational workload. The aim of this paper is therefore to use feature selection to decrease the computational load by reducing the size of high-dimensional data. Subsets of features that represent all features are selected; the process is thus two-fold: discarding irrelevant features and choosing a single feature to represent a group of redundant ones. There have been many studies regarding feature selection, for example backward feature selection and forward feature selection. In this study, a k-means clustering based feature selection is proposed. It is assumed that redundant features are located in the same cluster, whereas irrelevant features do not belong to any cluster. Two different high-dimensional datasets are used: 1) the Human Activity Recognition Using Smartphones (HAR) Dataset, containing 7352 data points, each with 561 features, and 2) the National Classification of Economic Activities Dataset, which contains 1080 data points, each with 857 features. Both datasets provide class label information for each data point. Our experiments show that k-means clustering based feature selection produces a subset of features that yields more than 80% classification accuracy.
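The idea of clustering features and keeping one representative per cluster can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; function names and the toy data are ours:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; rows of `points` are the items to cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

def select_features(X, k):
    """Cluster the columns (features) of X with k-means and keep, per
    cluster, the feature closest to its centroid: redundant features
    share a cluster, so one representative stands in for the group."""
    F = X.T                                   # one row per feature
    labels, centers = kmeans(F, k)
    selected = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if idx.size:
            dist = np.linalg.norm(F[idx] - centers[c], axis=1)
            selected.append(int(idx[dist.argmin()]))
    return sorted(selected)

# toy data: five features, two of them near-duplicates of others
rng = np.random.default_rng(3)
base = rng.normal(size=(50, 3))
noise = 0.01 * rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], base[:, 0] + noise[:, 0],
                     base[:, 1], base[:, 1] + noise[:, 1],
                     base[:, 2]])
selected = select_features(X, k=3)
```

With well-separated feature clusters, the selected subset contains one feature per redundancy group, shrinking the dataset before classification.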


2021 ◽  
Vol 12 (2) ◽  
pp. 144-148
Author(s):  
D. Usman ◽  
S.F. Sani

Clustering is a useful technique that organizes a large quantity of unordered data into a small number of meaningful and coherent clusters. Every clustering method is based on an index of similarity or dissimilarity between data points, and the true intrinsic structure of the data can only be correctly described if the similarity formula embedded in the clustering criterion function is appropriate. This paper uses squared Euclidean distance and Manhattan distance to investigate which method for measuring similarity between data objects in sparse, high-dimensional domains is fast, provides high-quality clustering results, and is consistent. The performance of the two methods is reported on simulated high-dimensional datasets.
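The two dissimilarity measures under comparison are elementary and can be stated directly (a minimal sketch; the variable values below are only an example):

```python
import numpy as np

def squared_euclidean(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return float(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
se = squared_euclidean(a, b)   # 3^2 + 4^2 + 0^2 = 25.0
mh = manhattan(a, b)           # 3 + 4 + 0 = 7.0
```

Squared Euclidean distance amplifies large per-coordinate gaps, while Manhattan distance weights all gaps linearly; in sparse high-dimensional data this difference can change which points a clustering criterion treats as close.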


Author(s):  
Joseph F. Boudreau ◽  
Eric S. Swanson

This chapter deals with two related problems occurring frequently in the physical sciences: first, the problem of estimating the value of a function from a limited number of data points; and second, the problem of calculating its value from a series approximation. Numerical methods for interpolating and extrapolating data are presented. The famous Lagrange interpolating polynomial is introduced and applied to one-dimensional and multidimensional problems. Cubic spline interpolation is introduced and an implementation in terms of Eigen classes is given. Several techniques for improving the convergence of Taylor series are discussed, including the Shanks transformation, Richardson extrapolation, and the use of Padé approximants. Conversion between representations with the quotient-difference algorithm is discussed. The exercises explore public transportation, human vision, the wine market, and SU(2) lattice gauge theory, among other topics.
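The Lagrange interpolating polynomial, one of the chapter's topics, can be evaluated directly from its defining product formula. This is a minimal sketch, not the book's Eigen-based C++ implementation:

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through the points
    (xs[i], ys[i]) at the location x: each basis polynomial is 1 at its
    own node and 0 at every other node."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# three samples of f(x) = x^2 determine the parabola exactly,
# so extrapolating to x = 3 recovers f(3) = 9
value = lagrange_interpolate([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 3.0)
```

Because n + 1 points determine a degree-n polynomial uniquely, interpolation through exact polynomial data reproduces the polynomial everywhere, though extrapolation far from the nodes is numerically fragile for noisy data.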


2021 ◽  
Vol 40 (3) ◽  
Author(s):  
Bo Hou ◽  
Yongbin Ge

Abstract
In this paper, by using the local one-dimensional (LOD) method, Taylor series expansion, and a correction for the third derivatives in the truncation error remainder, two high-order compact LOD schemes are established for solving the two- and three-dimensional advection equations, respectively. Both have fourth-order accuracy in time and space. Von Neumann analysis shows that the two schemes are unconditionally stable. Their consistency and convergence are also proved. Finally, numerical experiments confirm the accuracy and efficiency of the present schemes.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Van Hoan Do ◽  
Stefan Canzar

Abstract
Emerging single-cell technologies profile multiple types of molecules within individual cells. A fundamental step in the analysis of the produced high-dimensional data is their visualization using dimensionality reduction techniques such as t-SNE and UMAP. We introduce j-SNE and j-UMAP as their natural generalizations to the joint visualization of multimodal omics data. Our approach automatically learns the relative contribution of each modality to a concise representation of cellular identity that promotes discriminative features but suppresses noise. On eight datasets, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.
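A simplified view of joint multimodal visualization is to blend per-modality pairwise distances before embedding. The sketch below uses fixed uniform weights, whereas j-SNE and j-UMAP learn the per-modality weights together with the embedding; the function name and scaling choice are ours:

```python
import numpy as np

def joint_distances(modalities, weights=None):
    """Combine per-modality pairwise distance matrices into one joint
    dissimilarity matrix via a convex combination."""
    mats = []
    for X in modalities:
        d = np.linalg.norm(X[:, None] - X[None], axis=-1)
        mats.append(d / (d.max() + 1e-12))   # put modalities on a comparable scale
    if weights is None:
        weights = np.full(len(mats), 1.0 / len(mats))
    return sum(w * m for w, m in zip(weights, mats))

# toy multimodal data: RNA and protein measurements for the same 10 cells
rng = np.random.default_rng(4)
rna = rng.normal(size=(10, 5))
protein = rng.normal(size=(10, 3))
D_joint = joint_distances([rna, protein])
```

The resulting matrix can be fed to any distance-based embedding; learning the weights, as j-SNE/j-UMAP do, lets an informative modality dominate while a noisy one is suppressed.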

