Bayesian Unidimensional Scaling for visualizing uncertainty in high dimensional datasets with latent ordering of observations

2017 ◽  
Author(s):  
Lan Huong Nguyen ◽  
Susan Holmes

Abstract
Background: Detecting patterns in high-dimensional multivariate datasets is non-trivial. Clustering and dimensionality reduction techniques often help in discerning inherent structures. In biological datasets such as microbial community composition or gene expression data, observations can be generated from a continuous process, often unknown. Estimating the data points' "natural ordering" and its corresponding uncertainties can help researchers draw insights about the mechanisms involved.
Results: We introduce a Bayesian Unidimensional Scaling (BUDS) technique which extracts dominant sources of variation in high-dimensional datasets and produces visual data summaries, facilitating the exploration of a hidden continuum. The method maps multivariate data points to latent one-dimensional coordinates along their underlying trajectory and provides estimated uncertainty bounds. By statistically modeling dissimilarities and applying a DiSTATIS registration method to their posterior samples, we incorporate visualizations of uncertainty in the estimated data trajectory across different regions, using confidence contours for individual data points. We also illustrate the estimated overall data density across different areas by including density clouds. The one-dimensional coordinates recovered by BUDS help researchers discover sample attributes or covariates that drive the main variability in a dataset. We demonstrated the usefulness and accuracy of BUDS on published microbiome 16S, RNA-seq, and roll call data.
Conclusions: Our method effectively recovers and visualizes natural orderings present in datasets. Automatic visualization tools for data exploration and analysis are available at: https://nlhuong.shinyapps.io/visTrajectory/.
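BUDS is a Bayesian model, but its deterministic backbone, recovering one-dimensional coordinates from a dissimilarity matrix, can be sketched with classical (Torgerson) scaling. This is a minimal illustration of the scaling step only, not the paper's Bayesian posterior; the function name and toy data are ours:

```python
import numpy as np

def classical_1d_scaling(D):
    """Recover 1D coordinates from a pairwise dissimilarity matrix via
    classical (Torgerson) scaling: double-center the squared
    dissimilarities and keep the leading eigenvector."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)           # eigenvalues in ascending order
    # the leading eigenpair gives the best 1D Euclidean embedding
    return vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))

# toy example: points on a line are recovered exactly, up to shift/reflection
truth = np.array([0.0, 0.1, 0.35, 0.4, 0.8, 1.0])
D = np.abs(truth[:, None] - truth[None, :])
coords = classical_1d_scaling(D)
```

For exactly one-dimensional data, the recovered coordinates match the true ones up to translation and reflection; BUDS additionally quantifies the uncertainty of each recovered coordinate.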

Information ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 317 ◽  
Author(s):  
Vincenzo Dentamaro ◽  
Donato Impedovo ◽  
Giuseppe Pirlo

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, as well as classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because of imbalanced data and the high number of dimensions relative to the number of instances. In this study, a new oversampling technique called LICIC is presented as a valuable instrument for countering both class imbalance and the well-known "curse of dimensionality" problem. The method preserves non-linearities within the dataset while creating new instances without adding noise. It is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of this new technique when used with imbalanced, multiclass, and high-dimensional datasets.
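LICIC's exact construction is not reproduced here, but the baseline it is compared against, SMOTE-style interpolation between minority-class neighbours, can be sketched as follows. The function name and parameters are illustrative; LICIC itself adds a step that preserves non-linearities, which this sketch omits:

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """SMOTE-style oversampling: each synthetic minority sample is an
    interpolation between a random minority point and one of its k
    nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                  # random minority point
        j = rng.choice(neighbours[i])        # one of its k nearest neighbours
        lam = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# toy minority class: 20 points in 3 dimensions, 15 synthetic samples
X_min = np.random.default_rng(2).normal(size=(20, 3))
S = smote_like_oversample(X_min, n_new=15, k=4)
```

Because every synthetic point lies on a segment between two real minority points, oversampling never leaves the convex hull of the minority class, which is also why plain SMOTE can struggle in very high dimensions.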


Author(s):  
Y. Wang ◽  
Yuan Yan Tang ◽  
Luoqing Li ◽  
Jianzhong Wang

This paper presents a novel classifier based on collaborative representation (CR) and multiple one-dimensional (1D) embedding, with applications to face recognition. The use of a multiple 1D embedding (1DME) framework in semi-supervised learning was first proposed by one of the authors, J. Wang, in 2014. The main idea of multiple 1D embedding is the following: given a high-dimensional dataset, we first map it onto several different 1D sequences on the line while keeping the proximity of data points in the original ambient high-dimensional space. By this means, a classification problem in high dimension reduces to one in a 1D framework, which can be efficiently solved by any classical 1D regularization method, for instance, an interpolation scheme. The dissimilarity metric plays an important role in learning a decent 1DME of the original dataset. Our other contribution is to develop a collaborative representation based dissimilarity (CRD) metric. Compared to the conventional Euclidean-distance-based metric, the proposed method leads to better results. Experimental results on real-world databases verify the efficacy of the proposed method.
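Collaborative representation codes a query jointly over all samples with a ridge penalty. One plausible way to turn the resulting codes into a per-sample dissimilarity is sketched below; the paper's exact CRD formulation may differ, and the function names are ours:

```python
import numpy as np

def cr_codes(X, x, lam=0.1):
    """Collaborative representation: code the query x jointly over all
    samples (columns of X) with ridge penalty lam."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ x)

def cr_dissimilarity(X, x, lam=0.1):
    """Score each sample i by the residual when x is approximated by
    sample i's contribution alone -- an illustrative variant of a
    CR-based dissimilarity."""
    alpha = cr_codes(X, x, lam)
    return np.linalg.norm(x[:, None] - X * alpha[None, :], axis=0)

# toy dictionary: four orthonormal samples; the query equals sample 2
X = np.eye(4)
x = np.zeros(4)
x[2] = 1.0
d = cr_dissimilarity(X, x, lam=0.1)  # sample 2 gets the smallest dissimilarity
```

Because the coding is collaborative, a sample that explains the query well receives a large coefficient and hence a small residual, which is the property the 1DME construction needs from its metric.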


2009 ◽  
Vol 6 (2) ◽  
pp. 217-227 ◽  
Author(s):  
Aswani Kumar

Domains such as text and images contain large amounts of redundancy and ambiguity among attributes, which results in considerable noise effects (i.e., the data is high-dimensional). Retrieving data from high-dimensional datasets is a big challenge. Dimensionality reduction techniques have been a successful avenue for automatically extracting latent concepts by removing noise and reducing the complexity of processing high-dimensional data. In this paper we conduct a systematic study comparing unsupervised dimensionality reduction techniques for the text retrieval task. We analyze these techniques in terms of complexity, approximation error, and retrieval quality, with experiments on four test document collections.
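A standard example of unsupervised dimensionality reduction for text retrieval is Latent Semantic Analysis via truncated SVD. The sketch below uses one common query fold-in convention (variants differ in how they weight the singular values); names and the tiny corpus are illustrative:

```python
import numpy as np

def lsa(term_doc, k):
    """Latent Semantic Analysis: a truncated SVD of the term-document
    matrix keeps the top-k latent concepts and discards noise."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]

def query_scores(q, U, s, Vt):
    """Fold the query into concept space and rank documents by cosine
    similarity to their concept-space representations (columns of Vt)."""
    q_hat = U.T @ q / s                      # query in concept space
    num = Vt.T @ q_hat
    denom = np.linalg.norm(Vt, axis=0) * np.linalg.norm(q_hat) + 1e-12
    return num / denom

# tiny corpus: 4 terms x 3 documents
term_doc = np.array([[1.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 1.0, 0.0]])
U, s, Vt = lsa(term_doc, k=2)
q = np.array([1.0, 1.0, 0.0, 0.0])          # query repeating document 0
scores = query_scores(q, U, s, Vt)
```

The rank parameter k trades approximation error against noise removal, which is exactly the complexity/quality trade-off the paper's comparison examines.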


Author(s):  
Dewi Pramudi Ismi ◽  
Shireen Panchoo ◽  
Murinto Murinto

With hundreds or thousands of features in high-dimensional data, the computational workload is challenging. In the classification process, features which do not contribute significantly to the prediction of classes add to the computational workload. The aim of this paper is therefore to use feature selection to decrease the computational load by reducing the size of high-dimensional data. Subsets of features that represent all features are selected; the process is thus two-fold: discarding irrelevant features and choosing a single feature to represent a group of redundant ones. There have been many studies regarding feature selection, for example backward feature selection and forward feature selection. In this study, a k-means clustering based feature selection is proposed. It is assumed that redundant features are located in the same cluster, whereas irrelevant features do not belong to any cluster. Two different high-dimensional datasets are used: 1) the Human Activity Recognition Using Smartphones (HAR) Dataset, containing 7352 data points, each with 561 features, and 2) the National Classification of Economic Activities Dataset, which contains 1080 data points, each with 857 features. Both datasets provide class label information for each data point. Our experiments show that k-means clustering based feature selection produces a subset of features that yields more than 80% classification accuracy.
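The idea of clustering features and keeping one representative per cluster can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; function names and the toy data are ours:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's k-means; rows of `points` are the items to cluster."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centers[c] = points[labels == c].mean(axis=0)
    return labels, centers

def select_features(X, k):
    """Cluster the columns (features) of X with k-means and keep, per
    cluster, the feature closest to its centroid: redundant features
    share a cluster, so one representative stands in for the group."""
    F = X.T                                   # one row per feature
    labels, centers = kmeans(F, k)
    selected = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        if idx.size:
            dist = np.linalg.norm(F[idx] - centers[c], axis=1)
            selected.append(int(idx[dist.argmin()]))
    return sorted(selected)

# toy data: five features, two of them near-duplicates of others
rng = np.random.default_rng(3)
base = rng.normal(size=(50, 3))
noise = 0.01 * rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], base[:, 0] + noise[:, 0],
                     base[:, 1], base[:, 1] + noise[:, 1],
                     base[:, 2]])
selected = select_features(X, k=3)
```

With well-separated feature clusters, the selected subset contains one feature per redundancy group, shrinking the dataset before classification.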


2021 ◽  
Vol 12 (2) ◽  
pp. 144-148
Author(s):  
D. Usman ◽  
S.F. Sani

Clustering is a useful technique that organizes a large quantity of unordered data into a small number of meaningful and coherent clusters. Every clustering method is based on an index of similarity or dissimilarity between data points, and the true intrinsic structure of the data can only be correctly described if the similarity formula embedded in the clustering criterion function is appropriate. This paper uses squared Euclidean distance and Manhattan distance to investigate which method for measuring similarity between data objects in sparse, high-dimensional domains is fast, provides high-quality clustering results, and is consistent. The performance of the two methods is reported on simulated high-dimensional datasets.
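The two dissimilarity measures under comparison are elementary and can be stated directly (a minimal sketch; the variable values below are only an example):

```python
import numpy as np

def squared_euclidean(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return float(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Manhattan (city-block) distance: sum of absolute coordinate differences."""
    return float(np.sum(np.abs(a - b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
se = squared_euclidean(a, b)   # 3^2 + 4^2 + 0^2 = 25.0
mh = manhattan(a, b)           # 3 + 4 + 0 = 7.0
```

Squared Euclidean distance amplifies large per-coordinate gaps, while Manhattan distance weights all gaps linearly; in sparse high-dimensional data this difference can change which points a clustering criterion treats as close.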


Author(s):  
Joseph F. Boudreau ◽  
Eric S. Swanson

This chapter deals with two related problems occurring frequently in the physical sciences: first, the problem of estimating the value of a function from a limited number of data points; and second, the problem of calculating its value from a series approximation. Numerical methods for interpolating and extrapolating data are presented. The famous Lagrange interpolating polynomial is introduced and applied to one-dimensional and multidimensional problems. Cubic spline interpolation is introduced and an implementation in terms of Eigen classes is given. Several techniques for improving the convergence of Taylor series are discussed, including the Shanks transformation, Richardson extrapolation, and the use of Padé approximants. Conversion between representations with the quotient-difference algorithm is discussed. The exercises explore public transportation, human vision, the wine market, and SU(2) lattice gauge theory, among other topics.
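The Lagrange interpolating polynomial, one of the chapter's topics, can be evaluated directly from its defining product formula. This is a minimal sketch, not the book's Eigen-based C++ implementation:

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through the points
    (xs[i], ys[i]) at the location x: each basis polynomial is 1 at its
    own node and 0 at every other node."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# three samples of f(x) = x^2 determine the parabola exactly,
# so extrapolating to x = 3 recovers f(3) = 9
value = lagrange_interpolate([0.0, 1.0, 2.0], [0.0, 1.0, 4.0], 3.0)
```

Because n + 1 points determine a degree-n polynomial uniquely, interpolation through exact polynomial data reproduces the polynomial everywhere, though extrapolation far from the nodes is numerically fragile for noisy data.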


2021 ◽  
Vol 40 (3) ◽  
Author(s):  
Bo Hou ◽  
Yongbin Ge

Abstract
In this paper, by using the local one-dimensional (LOD) method, Taylor series expansion, and a correction for the third derivatives in the truncation error remainder, two high-order compact LOD schemes are established for solving the two- and three-dimensional advection equations, respectively. Both have fourth-order accuracy in time and space. Von Neumann analysis shows that the two schemes are unconditionally stable. Their consistency and convergence are also proved. Finally, numerical experiments confirm the accuracy and efficiency of the present schemes.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Van Hoan Do ◽  
Stefan Canzar

Abstract
Emerging single-cell technologies profile multiple types of molecules within individual cells. A fundamental step in the analysis of the produced high-dimensional data is their visualization using dimensionality reduction techniques such as t-SNE and UMAP. We introduce j-SNE and j-UMAP as their natural generalizations to the joint visualization of multimodal omics data. Our approach automatically learns the relative contribution of each modality to a concise representation of cellular identity that promotes discriminative features but suppresses noise. On eight datasets, j-SNE and j-UMAP produce unified embeddings that better agree with known cell types and that harmonize RNA and protein velocity landscapes.
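A simplified view of joint multimodal visualization is to blend per-modality pairwise distances before embedding. The sketch below uses fixed uniform weights, whereas j-SNE and j-UMAP learn the per-modality weights together with the embedding; the function name and scaling choice are ours:

```python
import numpy as np

def joint_distances(modalities, weights=None):
    """Combine per-modality pairwise distance matrices into one joint
    dissimilarity matrix via a convex combination."""
    mats = []
    for X in modalities:
        d = np.linalg.norm(X[:, None] - X[None], axis=-1)
        mats.append(d / (d.max() + 1e-12))   # put modalities on a comparable scale
    if weights is None:
        weights = np.full(len(mats), 1.0 / len(mats))
    return sum(w * m for w, m in zip(weights, mats))

# toy multimodal data: RNA and protein measurements for the same 10 cells
rng = np.random.default_rng(4)
rna = rng.normal(size=(10, 5))
protein = rng.normal(size=(10, 3))
D_joint = joint_distances([rna, protein])
```

The resulting matrix can be fed to any distance-based embedding; learning the weights, as j-SNE/j-UMAP do, lets an informative modality dominate while a noisy one is suppressed.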

