Clustering high-dimensional mixed data to uncover sub-phenotypes: joint analysis of phenotypic and genotypic data

D. McParland; C. M. Phillips; L. Brennan; H. M. Roche; I. C. Gormley

doi:10.1002/sim.7371

Feature selection algorithms for very high dimensional data and mixed data

10.32657/10356/41404 ◽

2008 ◽

Author(s):

Wen Yin Tang

Keyword(s):

Feature Selection ◽

High Dimensional Data ◽

Mixed Data ◽

High Dimensional ◽

Selection Algorithms ◽

Very High

Regularized Latent Class Model for Joint Analysis of High‐Dimensional Longitudinal Biomarkers and a Time‐to‐Event Outcome

Biometrics ◽

10.1111/biom.12964 ◽

2018 ◽

Vol 75 (1) ◽

pp. 69-77

Author(s):

Jiehuan Sun ◽

Jose D. Herazo‐Maya ◽

Philip L. Molyneaux ◽

Toby M. Maher ◽

Naftali Kaminski ◽

...

Keyword(s):

Latent Class ◽

Latent Class Model ◽

Joint Analysis ◽

High Dimensional ◽

Time To Event ◽

Class Model

Joint analysis of multiple high-dimensional data types using sparse matrix approximations of rank-1 with applications to ovarian and liver cancer

BioData Mining ◽

10.1186/s13040-016-0103-7 ◽

2016 ◽

Vol 9 (1) ◽

Cited By ~ 3

Author(s):

Gordon Okimoto ◽

Ashkan Zeinalzadeh ◽

Tom Wenska ◽

Michael Loomis ◽

James B. Nation ◽

...

Keyword(s):

Liver Cancer ◽

Sparse Matrix ◽

High Dimensional Data ◽

Joint Analysis ◽

High Dimensional ◽

Data Types ◽

Matrix Approximations

Unsupervised Learning for Large Scale Data: The ATHLOS Project

10.1101/2021.04.01.21254751 ◽

2021 ◽

Author(s):

Petros Barmpas ◽

Sotiris Tasoulis ◽

Aristidis G. Vrahatis ◽

Panagiotis Anagnostou ◽

Spiros Georgakopoulos ◽

...

Keyword(s):

Unsupervised Learning ◽

Real World ◽

Large Scale ◽

High Dimensional Data ◽

Experimental Studies ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Types ◽

Unified Framework

Recent technological advancements in domains such as biomedicine and health offer a plethora of big data for analysis. Part of this data pool comes from experimental studies that record many different features for each instance. This creates datasets of very high dimensionality with mixed data types, including both numerical and categorical variables. Unsupervised learning, meanwhile, has been shown to help with high-dimensional data, allowing the discovery of unknown patterns through clustering, visualization, dimensionality reduction, and, in some cases, their combination. This work highlights unsupervised learning methodologies for large-scale, high-dimensional data, showing the potential of a unified framework that combines the knowledge retrieved from clustering and visualization. The main purpose is to uncover hidden patterns in a high-dimensional mixed dataset, which we achieve through an application to a complex, real-world dataset. The experimental analysis indicates the existence of notable information, demonstrating the usefulness of the methodological framework for similar high-dimensional, mixed, real-world applications.

High dimensional latent Gaussian copula model for mixed data in imaging genetics

2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018) ◽

10.1109/isbi.2018.8363533 ◽

2018 ◽

Author(s):

Aiying Zhang ◽

Jian Fang ◽

Vince D. Calhoun ◽

Yu-ping Wang

Keyword(s):

Imaging Genetics ◽

Mixed Data ◽

Gaussian Copula ◽

High Dimensional ◽

Copula Model

Joint analysis of gene expression levels and histological images identifies genes associated with tissue morphology

10.1101/458711 ◽

2018 ◽

Cited By ~ 2

Author(s):

Jordan T. Ash ◽

Gregory Darnell ◽

Daniel Munro ◽

Barbara E. Engelhardt

Keyword(s):

Gene Expression ◽

Image Features ◽

Population Variation ◽

Joint Analysis ◽

High Dimensional ◽

Specific Cell ◽

Histological Image ◽

Expression Levels ◽

Histological Images ◽

Gene Expression Levels

Histological images are used to identify and to characterize complex phenotypes such as tumor stage. Our goal is to associate histological image phenotypes with high-dimensional genomic markers; the limitations to incorporating histological image phenotypes in genomic studies are that the relevant image features are difficult to identify and extract in an automated way, and confounders are difficult to control in this high-dimensional setting. In this paper, we use convolutional autoencoders and sparse canonical correlation analysis (CCA) on histological images and gene expression levels from paired samples to find subsets of genes whose expression values in a tissue sample correlate with subsets of morphological features from the corresponding sample image. We apply our approach, ImageCCA, to three data sets, two from TCGA and one from GTEx v6, and we find three types of biological associations. In TCGA, we find gene sets associated with the structure of the extracellular matrix and cell wall infrastructure, implicating uncharacterized genes in extracellular processes. Across studies, we find sets of genes associated with specific cell types, including muscle tissue and neuronal cells, and with cell type proportions in heterogeneous tissues. In the GTEx v6 data, we find image features that capture population variation in thyroid and in colon tissues associated with genetic variants, suggesting that genetic variation regulates population variation in tissue morphological traits. The software is publicly available at: https://github.com/daniel-munro/imageCCA.

Learning Moral Graphs in Construction of High-Dimensional Bayesian Networks for Mixed Data

Neural Computation ◽

10.1162/neco_a_01190 ◽

2019 ◽

Vol 31 (6) ◽

pp. 1183-1214 ◽

Cited By ~ 2

Author(s):

Suwa Xu ◽

Bochao Jia ◽

Faming Liang

Keyword(s):

Computational Complexity ◽

Bayesian Networks ◽

Bayesian Network ◽

Learning Algorithm ◽

Hill Climbing ◽

Mixed Data ◽

High Dimensional ◽

Large Set ◽

Worst Case ◽

Markov Network

Bayesian networks have been widely used in many scientific fields for describing the conditional independence relationships for a large set of random variables. This letter proposes a novel algorithm, the so-called p-learning algorithm, for learning moral graphs for high-dimensional Bayesian networks. The moral graph is a Markov network representation of the Bayesian network and also the key to construction of the Bayesian network for constraint-based algorithms. The consistency of the p-learning algorithm is justified under the small-n, large-p scenario. The numerical results indicate that the p-learning algorithm significantly outperforms the existing ones, such as the PC, grow-shrink, incremental association, semi-interleaved HITON, hill-climbing, and max-min hill-climbing. Under the sparsity assumption, the p-learning algorithm has a computational complexity of O(p^2) even in the worst case, while the existing algorithms have a computational complexity of O(p^3) in the worst case.
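
For readers unfamiliar with the term, the relationship between a Bayesian network and its moral graph can be sketched in a few lines. This is only the textbook moralization step applied to a DAG that is already known (connect parents that share a child, then drop edge directions); it is not the p-learning algorithm described in the abstract, which learns the moral graph from data. The function name `moral_graph` and the toy network are illustrative assumptions.

```python
from itertools import combinations


def moral_graph(parents):
    """Return the undirected edge set of the moral graph of a DAG.

    `parents` maps each node to a list of its parent nodes.
    Each edge is returned as a frozenset of its two endpoints.
    """
    edges = set()
    for child, pars in parents.items():
        pars = list(pars)
        # Drop directions: every parent-child arc becomes an undirected edge.
        for p in pars:
            edges.add(frozenset((p, child)))
        # "Marry" the parents: connect every pair of parents sharing this child.
        for a, b in combinations(pars, 2):
            edges.add(frozenset((a, b)))
    return edges


# Hypothetical toy DAG: A -> C <- B, C -> D
toy = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
moral = moral_graph(toy)
for edge in sorted(tuple(sorted(e)) for e in moral):
    print(edge)
```

In the toy network, A and B are not adjacent in the DAG but become adjacent in the moral graph because they share the child C.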

Latent class models for joint analysis of disease prevalence and high-dimensional semicontinuous biomarker data

Biostatistics ◽

10.1093/biostatistics/kxr024 ◽

2011 ◽

Vol 13 (1) ◽

pp. 74-88 ◽

Cited By ~ 10

Author(s):

Bo Zhang ◽

Zhen Chen ◽

Paul S. Albert

Keyword(s):

Latent Class ◽

Disease Prevalence ◽

Latent Class Models ◽

Joint Analysis ◽

High Dimensional ◽

Class Models ◽

Biomarker Data

High dimensional semiparametric latent graphical model for mixed data

Journal of the Royal Statistical Society Series B (Statistical Methodology) ◽

10.1111/rssb.12168 ◽

2016 ◽

Vol 79 (2) ◽

pp. 405-421 ◽

Cited By ~ 25

Author(s):

Jianqing Fan ◽

Han Liu ◽

Yang Ning ◽

Hui Zou

Keyword(s):

Graphical Model ◽

Mixed Data ◽

High Dimensional

Raking and Relabeling for Imbalanced Data

10.36227/techrxiv.17712122.v1 ◽

2022 ◽

Author(s):

Seunghwan Park ◽

Hae-Hwan Lee ◽

Jongho Im

Keyword(s):

High Dimensional Data ◽

Imbalanced Data ◽

Sampling Strategy ◽

Classification Performance ◽

Mixed Data ◽

Categorical Variables ◽

High Dimensional ◽

Data Generation ◽

Minority Class ◽

Imbalanced Data Classification

We consider the binary classification of imbalanced data. A dataset is imbalanced if the class proportions are heavily skewed. Imbalanced data classification is often challenging, especially for high-dimensional data, because unequal classes deteriorate classifier performance. Undersampling the majority class and oversampling the minority class are popular methods for constructing balanced samples, which improves classification performance. However, many existing sampling methods cannot be easily extended to high-dimensional data and mixed data, including categorical variables, because they often require approximating the attribute distributions, which becomes another critical issue. In this paper, we propose a new sampling strategy employing raking and relabeling procedures, such that the attribute values of the majority class are imputed for the values of the minority class in the construction of balanced samples. The proposed algorithms produce performance comparable to that of existing popular methods but are more flexible regarding the data shape and attribute size. The sampling algorithm is attractive in practice, considering that it does not require density estimation for synthetic data generation in oversampling and is not affected by mixed-type variables. In addition, the proposed sampling strategy is robust to the choice of classifier, in the sense that classification performance is not sensitive to which classifier is used.
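
The two baseline strategies the abstract mentions, undersampling the majority class and oversampling the minority class, can be sketched as follows. This is a minimal illustration of plain random resampling only; it is not the raking and relabeling procedure the paper proposes, and the function names and toy data are assumptions made for the example.

```python
import random


def _group_by_class(X, y):
    """Collect the rows of X under their class label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return groups


def random_undersample(X, y, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_min = min(len(rows) for rows in groups.values())
    Xb, yb = [], []
    for label, rows in groups.items():
        for row in rng.sample(rows, n_min):  # sample without replacement
            Xb.append(row)
            yb.append(label)
    return Xb, yb


def random_oversample(X, y, seed=0):
    """Randomly duplicate rows so every class has as many rows as the largest class."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    n_max = max(len(rows) for rows in groups.values())
    Xb, yb = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]  # with replacement
        for row in rows + extra:
            Xb.append(row)
            yb.append(label)
    return Xb, yb


# Hypothetical toy data: 8 majority rows (label 0), 2 minority rows (label 1),
# each row mixing a numeric and a categorical attribute.
X = [(float(i), "a") for i in range(8)] + [(100.0, "b"), (101.0, "b")]
y = [0] * 8 + [1] * 2

Xu, yu = random_undersample(X, y)
Xo, yo = random_oversample(X, y)
print(len(yu), yu.count(0), yu.count(1))
print(len(yo), yo.count(0), yo.count(1))
```

Because both functions only copy or drop whole rows, they work unchanged on mixed numeric and categorical attributes, but oversampling this way creates exact duplicates rather than new synthetic points.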
