tmap: topological analysis of population-scale microbiome data

10.1101/396960 ◽

2018 ◽

Cited By ~ 1

Author(s):

Tianhua Liao ◽

Yuchen Wei ◽

Mingjing Luo ◽

Guoping Zhao ◽

Haokui Zhou

Keyword(s):

Data Analysis ◽

Large Scale ◽

Topological Analysis ◽

Topological Data Analysis ◽

Link Type ◽

Online Documentation ◽

Microbiome Research ◽

Population Scale ◽

Microbiome Data ◽

Complex Dataset

Abstract: Population-scale microbiome studies pose specific challenges in data analysis, from enterotype analysis and identification of driver species to microbiome-wide association of host covariates. Applying advanced data mining techniques to high-dimensional, complex datasets is expected to keep pace with the rapid advancement of large-scale, integrative microbiome research. Here, we present tmap, a topological data analysis framework for population-scale microbiome studies. The framework captures the complex shape of large-scale microbiome data in a compressive network representation. We also develop network-based statistical analyses for driver species identification and microbiome-wide association analysis. tmap can be used to explore variation in a population-scale microbiome landscape and to study host-microbiome associations. Availability and implementation: tmap is available on GitHub (https://github.com/GPZ-Bioinfo/tmap), accompanied by online documentation and a tutorial (http://tmap.readthedocs.io).
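
The abstract does not spell out how the network representation is built. A common TDA construction for this kind of compressed network is the Mapper algorithm, and the sketch below illustrates that idea under stated assumptions; it is not tmap's API. The names `mapper_graph` and `single_linkage`, the 1-D lens, and the parameters `n_intervals`, `overlap`, and `eps` are all hypothetical choices made for illustration.

```python
import math
from itertools import combinations


def single_linkage(indices, points, eps):
    """Cluster `indices` by linking any two points at distance <= eps."""
    clusters, unvisited = [], set(indices)
    while unvisited:
        seed = unvisited.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, lens, n_intervals, overlap, eps):
    """Illustrative Mapper: cover the lens range, cluster each bin, link bins.

    `lens` holds one filter value per point. Each of the `n_intervals`
    equal-width bins is widened on both sides by overlap * width / 2.
    """
    lo, hi = min(lens), max(lens)
    width = (hi - lo) / n_intervals
    pad = width * overlap / 2
    nodes = []
    for k in range(n_intervals):
        start = lo + k * width - pad
        end = lo + (k + 1) * width + pad
        members = [i for i, v in enumerate(lens) if start <= v <= end]
        nodes.extend(single_linkage(members, points, eps))
    # Two nodes are connected when their clusters share at least one sample.
    edges = {(a, b) for a, b in combinations(range(len(nodes)), 2)
             if nodes[a] & nodes[b]}
    return nodes, edges
```

Each node is a cluster of samples and each edge marks shared samples, so a dataset of many samples collapses into a much smaller graph. In practice the lens would typically come from a dimensionality reduction of the microbiome profiles rather than a raw coordinate.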

Spectral and topological analysis of the cortical representation of the head position: does hypnotizability matter?

10.1101/442053 ◽

2018 ◽

Author(s):

Esther Ibanez-Marcelo ◽

Lisa Campioni ◽

Diego Manzoni ◽

Enrica L Santarcangelo ◽

Giovanni Petri

Keyword(s):

Spectral Analysis ◽

Data Analysis ◽

Large Scale ◽

Topological Analysis ◽

Head Position ◽

Topological Data Analysis ◽

Proprioceptive Information ◽

Head Positions ◽

First Time ◽

Topological Data

The aim of the study was to assess the EEG correlates of head positions, which have never been studied in humans, in participants with different psychophysiological characteristics, as encoded by their hypnotizability scores. This choice is motivated by earlier studies suggesting different processing of vestibular/neck proprioceptive information in subjects with high (highs) and low (lows) hypnotizability scores while they maintain their head rotated toward one side (RH). We analysed EEG signals recorded in 20 highs and 19 lows in basal conditions (head forward) and during RH, using spectral analysis, which captures changes localized to specific recording sites, and Topological Data Analysis (TDA), which instead describes large-scale differences in processing and representing sensorimotor information. Spectral analysis revealed significant differences related to head position for the alpha1, beta2, beta3, and gamma bands, but none related to hypnotizability. TDA instead revealed global hypnotizability-related differences in the strengths of the correlations among recording sites during RH. Significant changes were observed in lows on the left parieto-occipital side and in highs in the right fronto-parietal region. Significant differences between the two groups were found in the occipital region, where changes were larger in lows than in highs. The study reports the EEG correlates of head posture for the first time, indicates that hypnotizability modulates its representation and processing on a large scale, and shows that spectral and topological data analysis provide complementary results.

Identification of Stem Cells from Large Cell Populations with Topological Scoring

10.1101/2020.04.08.032102 ◽

2020 ◽

Author(s):

Mihaela E. Sardiu ◽

Andrew C. Box ◽

Jeff Haug ◽

Michael P. Washburn

Keyword(s):

Machine Learning ◽

Flow Cytometry ◽

Data Analysis ◽

Large Scale ◽

Topological Analysis ◽

Topological Data Analysis ◽

Cell Populations ◽

Hematopoietic Stem ◽

Flow Cytometry Data ◽

Genomics And Proteomics

Abstract: Machine learning and topological analysis methods are increasingly used on large-scale omics datasets. Modern high-dimensional flow cytometry datasets share many features with other omics datasets, such as those from genomics and proteomics. For example, genomics and proteomics datasets can be sparse and high-dimensional, and flow cytometry datasets can share these properties. This makes flow cytometry data a potentially suitable candidate for machine learning and topological scoring strategies, for example, to gain novel insights into patterns within the data. We have previously developed the Topological Score (TopS) and implemented it for the analysis of quantitative protein interaction network datasets. Here we show that the TopS approach for large-scale data analysis is applicable to a previously described flow cytometry sorted human hematopoietic stem cell dataset. We demonstrate that TopS is capable of effectively sorting this dataset into cell populations and identifying rare cell populations. We demonstrate the utility of TopS when coupled with multiple approaches, including topological data analysis, X-shift clustering, and t-Distributed Stochastic Neighbor Embedding (t-SNE). Our results suggest that TopS could be effectively used to analyze large-scale flow cytometry datasets to find rare cell populations.

tmap: an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies

Genome Biology ◽

10.1186/s13059-019-1871-4 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 4

Author(s):

Tianhua Liao ◽

Yuchen Wei ◽

Mingjing Luo ◽

Guo-Ping Zhao ◽

Haokui Zhou

Keyword(s):

Data Analysis ◽

Large Scale ◽

Association Studies ◽

Topological Data Analysis ◽

Association Patterns ◽

Integrative Framework ◽

Environmental Features ◽

Analytic Methods ◽

Population Scale ◽

Topological Data

Abstract: Untangling the complex variations of the microbiome associated with large-scale host phenotypes or environment types challenges the currently available analytic methods. Here, we present tmap, an integrative framework based on topological data analysis for population-scale microbiome stratification and association studies. The performance of tmap in detecting nonlinear patterns is validated in different simulation scenarios, which demonstrate its superiority over the most commonly used methods. Application of tmap to several population-scale microbiomes demonstrates its strength in revealing microbiome-associated host or environmental features and in understanding the systematic interrelations among their association patterns. tmap is available at https://github.com/GPZ-Bioinfo/tmap.

Tree-Aggregated Predictive Modeling of Microbiome Data

10.1101/2020.09.01.277632 ◽

2020 ◽

Author(s):

Jacob Bien ◽

Xiaohan Yan ◽

Léo Simpson ◽

Christian L. Müller

Keyword(s):

Data Analysis ◽

Predictive Modeling ◽

Large Scale ◽

High Throughput Sequencing ◽

Compositional Data ◽

Low Cost ◽

Primary Data ◽

Compositional Data Analysis ◽

Taxonomic Rank ◽

Microbiome Data

Abstract: Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven, parameter-free, and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, making user-defined aggregation obsolete while integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human-gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbial ecologists gain insights into the structure and functioning of the underlying ecosystem of interest.
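
For context, the conventional preprocessing step that the abstract contrasts with trac, grouping counts by a fixed taxonomic rank, can be sketched as follows. This is a minimal illustration of fixed-rank aggregation only, not of trac itself; the function name `aggregate_by_rank`, the dictionary layout, and the placeholder taxon labels are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_rank(counts, taxonomy, rank):
    """Sum per-variant counts over all variants sharing a label at `rank`.

    counts:   {variant_id: read count} for one sample
    taxonomy: {variant_id: {rank name: label}} (hypothetical layout)
    """
    totals = defaultdict(int)
    for variant, count in counts.items():
        totals[taxonomy[variant][rank]] += count
    return dict(totals)


# Placeholder data, invented purely for illustration.
taxonomy = {
    "asv1": {"phylum": "phylum_A", "genus": "genus_A1"},
    "asv2": {"phylum": "phylum_A", "genus": "genus_A2"},
    "asv3": {"phylum": "phylum_B", "genus": "genus_B1"},
}
counts = {"asv1": 5, "asv2": 0, "asv3": 2}
```

Here the rank is chosen by the user before modeling; the abstract's point is that trac instead learns the aggregation level from the data.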

Topological data analysis and diagnostics of compressible magnetohydrodynamic turbulence

Journal of Plasma Physics ◽

10.1017/s0022377818000752 ◽

2018 ◽

Vol 84 (4) ◽

Cited By ~ 2

Author(s):

I. Makarenko ◽

P. Bushby ◽

A. Fletcher ◽

R. Henderson ◽

N. Makarenko ◽

...

Keyword(s):

Data Analysis ◽

Random Fields ◽

Large Scale ◽

Betti Numbers ◽

Mean Field ◽

Topological Data Analysis ◽

Topological Measures ◽

Magnetohydrodynamic Simulations ◽

Random Flows ◽

Topological Data

The predictions of mean-field electrodynamics can now be probed using direct numerical simulations of random flows and magnetic fields. When modelling astrophysical magnetohydrodynamics, it is important to verify that such simulations are in agreement with observations. One of the main challenges in this area is to identify robust quantitative measures to compare structures found in simulations with those inferred from astrophysical observations. A similar challenge is to compare quantitatively results from different simulations. Topological data analysis offers a range of techniques, including the Betti numbers and persistence diagrams, that can be used to facilitate such a comparison. After describing these tools, we first apply them to synthetic random fields and demonstrate that, when the data are standardized in a straightforward manner, some topological measures are insensitive to either large-scale trends or the resolution of the data. Focusing upon one particular astrophysical example, we apply topological data analysis to H I observations of the turbulent interstellar medium (ISM) in the Milky Way and to recent magnetohydrodynamic simulations of the random, strongly compressible ISM. We stress that these topological techniques are generic and could be applied to any complex, multi-dimensional random field.
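
To make the Betti numbers mentioned above concrete, the sketch below counts them for a thresholded 2-D field: the zeroth Betti number is the number of connected regions above the threshold, and the first is the number of holes enclosed by those regions. It is a minimal illustration on a pixel grid, not the authors' pipeline; the function names and the choice of 8-connectivity for the foreground (paired with 4-connectivity for the background) are assumptions.

```python
from collections import deque

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]


def count_components(grid, value, neighbors):
    """Count connected regions of cells equal to `value` via breadth-first search."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != value or seen[r][c]:
                continue
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in neighbors:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return count


def betti_numbers(field, threshold):
    """Return (b0, b1) for the superlevel set {field >= threshold}."""
    mask = [[1 if v >= threshold else 0 for v in row] for row in field]
    # Pad with background so the unbounded outside is a single component.
    width = len(mask[0]) + 2
    padded = [[0] * width] + [[0] + row + [0] for row in mask] + [[0] * width]
    b0 = count_components(padded, 1, EIGHT)
    # Holes are the background components other than the outside one.
    b1 = count_components(padded, 0, FOUR) - 1
    return b0, b1
```

Sweeping `threshold` over the range of the field and recording how `b0` and `b1` change is the basic idea behind comparing fields by their topology; persistence diagrams additionally track at which thresholds each feature appears and disappears.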

Topological Data Analysis Highlights Novel Geographical Signatures of the Human Gut Microbiome

Frontiers in Artificial Intelligence ◽

10.3389/frai.2021.680564 ◽

2021 ◽

Vol 4 ◽

Author(s):

Eva Lymberopoulos ◽

Giorgia Isabella Gentili ◽

Muhannad Alomari ◽

Nikhil Sharma

Keyword(s):

Data Analysis ◽

Gut Microbiome ◽

Large Population ◽

Enrichment Analysis ◽

Population Based ◽

Topological Data Analysis ◽

Microbiome Composition ◽

Demographical Data ◽

Microbiome Data ◽

Topological Data

Background: There is growing interest in the connection between the gut microbiome and human health and disease. Conventional approaches to analysing microbiome data typically entail dimensionality reduction and assume linearity of the observed relationships; however, the microbiome is a highly complex ecosystem marked by non-linear relationships. In this study, we use topological data analysis (TDA) to explore differences and similarities between the gut microbiome across several countries. Methods: We used curated adult microbiome data at the genus level from the GMrepo database. The dataset contains OTU and demographical data of over 4,400 samples from 19 studies, spanning 12 countries. We analysed the data with tmap, an integrative framework for TDA specifically designed for stratification and enrichment analysis of population-based gut microbiome datasets. Results: We find associations between specific microbial genera and groups of countries. Specifically, both the USA and UK were significantly co-enriched with the proinflammatory genera Lachnoclostridium and Ruminiclostridium, while France and New Zealand were co-enriched with other, butyrate-producing, taxa of the order Clostridiales. Conclusion: The TDA approach demonstrates the overlap and distinctions of microbiome composition between and within countries. This yields unique insights into complex associations in the dataset, a finding not possible with conventional approaches. It highlights the potential utility of TDA as a complementary tool in microbiome research, particularly for large population-scale datasets, and suggests further analysis on the effects of diet and other regionally varying factors.

Tree-aggregated predictive modeling of microbiome data

Scientific Reports ◽

10.1038/s41598-021-93645-3 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Jacob Bien ◽

Xiaohan Yan ◽

Léo Simpson ◽

Christian L. Müller

Keyword(s):

Data Analysis ◽

Predictive Modeling ◽

Large Scale ◽

High Throughput Sequencing ◽

Compositional Data ◽

Low Cost ◽

Primary Data ◽

Compositional Data Analysis ◽

Taxonomic Rank ◽

Microbiome Data

Abstract: Modern high-throughput sequencing technologies provide low-cost microbiome survey data across all habitats of life at unprecedented scale. At the most granular level, the primary data consist of sparse counts of amplicon sequence variants or operational taxonomic units that are associated with taxonomic and phylogenetic group information. In this contribution, we leverage the hierarchical structure of amplicon data and propose a data-driven and scalable tree-guided aggregation framework to associate microbial subcompositions with response variables of interest. The excess number of zero or low count measurements at the read level forces traditional microbiome data analysis workflows to remove rare sequencing variants or group them by a fixed taxonomic rank, such as genus or phylum, or by phylogenetic similarity. By contrast, our framework, which we call trac (tree-aggregation of compositional data), learns data-adaptive taxon aggregation levels for predictive modeling, greatly reducing the need for user-defined aggregation in preprocessing while integrating seamlessly into the compositional data analysis framework. We illustrate the versatility of our framework in the context of large-scale regression problems in human gut, soil, and marine microbial ecosystems. We posit that the inferred aggregation levels provide highly interpretable taxon groupings that can help microbiome researchers gain insights into the structure and functioning of the underlying ecosystem of interest.

Efficient management and analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr

10.1101/190926 ◽

2017 ◽

Cited By ~ 1

Author(s):

Florian Privé ◽

Hugues Aschard ◽

Michael G.B. Blum

Keyword(s):

Data Analysis ◽

Large Scale ◽

Genomic Data ◽

Supplementary Information ◽

Risk Scores ◽

Analysis Pipeline ◽

Polygenic Risk ◽

Link Type ◽

Genome Wide ◽

R Packages

Abstract: Motivation: Genome-wide datasets produced for association studies have dramatically increased in size over the past few years, with modern datasets commonly including millions of variants measured in tens of thousands of individuals. This increase in data size is a major challenge that severely slows down genomic analyses. Specialized software for every part of the analysis pipeline has been developed to handle large genomic data. However, combining all of this software into a single data analysis pipeline can be technically difficult. Results: Here we present two R packages, bigstatsr and bigsnpr, that allow management and analysis of large-scale genomic data to be performed within a single comprehensive framework. To address large data size, the packages use memory-mapping to access data matrices stored on disk instead of in RAM. To perform data pre-processing and data analysis, the packages integrate most of the commonly used tools, either through transparent system calls to existing software or through updated or improved implementations of existing methods. In particular, the packages implement a fast derivation of Principal Component Analysis, functions to remove SNPs in Linkage Disequilibrium, and algorithms to learn Polygenic Risk Scores on millions of SNPs. We illustrate applications of the two R packages by analysing a case-control genomic dataset for celiac disease, performing an association study and computing Polygenic Risk Scores. Finally, we demonstrate the scalability of the R packages by analyzing a simulated genome-wide dataset including 500,000 individuals and 1 million markers on a single desktop computer. Availability: https://privefl.github.io/bigstatsr/ & https://privefl.github.io/bigsnpr/ Supplementary information: Supplementary data are available at Bioinformatics online.
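
The memory-mapping idea described above, reading matrix entries from a file on disk on demand rather than loading the whole matrix into RAM, can be sketched with Python's standard library. This is only an illustration of the general technique and does not reproduce the bigstatsr or bigsnpr interfaces, which are R packages; the row-major float64 file layout and the helper names `write_matrix`, `column_mean`, and `demo` are assumptions.

```python
import mmap
import os
import struct
import tempfile


def write_matrix(path, rows):
    """Store a row-major matrix of little-endian float64 values on disk."""
    with open(path, "wb") as fh:
        for row in rows:
            fh.write(struct.pack(f"<{len(row)}d", *row))


def column_mean(path, n_rows, n_cols, col):
    """Mean of one column, read through a memory map of the file."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = 0.0
            for r in range(n_rows):
                # Byte offset of entry (r, col); each float64 takes 8 bytes.
                offset = (r * n_cols + col) * 8
                (value,) = struct.unpack_from("<d", mm, offset)
                total += value
            return total / n_rows


def demo():
    """Write a small 3 x 2 matrix to a temporary file and average column 1."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_matrix(path, [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        return column_mean(path, n_rows=3, n_cols=2, col=1)
    finally:
        os.remove(path)
```

Only the pages of the file that are actually touched are brought into memory by the operating system, which is what lets this access pattern scale to matrices larger than RAM.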

Semantic segmentation of microscopic neuroanatomical data by combining topological priors with encoder-decoder deep networks

10.1101/2020.02.18.955237 ◽

2020 ◽

Cited By ~ 1

Author(s):

Samik Banerjee ◽

Lucas Magee ◽

Dingkang Wang ◽

Xu Li ◽

Bingxing Huo ◽

...

Keyword(s):

Data Analysis ◽

High Performance ◽

Large Scale ◽

Image Data ◽

Semantic Segmentation ◽

Scientific Data ◽

Error Rates ◽

Topological Data Analysis ◽

Hybrid Architecture ◽

Deep Networks

Understanding of neuronal circuitry at cellular resolution within the brain has relied on tract tracing methods, which involve careful observation and interpretation by experienced neuroscientists. With recent developments in imaging and digitization, this approach is no longer feasible for large-scale (terabyte to petabyte range) images. Machine learning based techniques, using deep networks, provide an efficient alternative. However, these methods rely on very large volumes of annotated images for training and have error rates that are too high for scientific data analysis, and thus require a significant volume of human-in-the-loop proofreading. Here we introduce a hybrid architecture combining prior structure in the form of topological data analysis methods, based on discrete Morse theory, with the best-in-class deep-net architectures for neuronal connectivity analysis. We show significant performance gains using our hybrid architecture on detection of topological structure (e.g. connectivity of neuronal processes and local intensity maxima on axons corresponding to synaptic swellings), with precision/recall close to 90% compared with human observers. We have adapted our architecture to a high-performance pipeline capable of semantic segmentation of light microscopic whole-brain image data into a hierarchy of neuronal compartments. We expect that the hybrid architecture incorporating discrete Morse techniques into deep nets will generalize to other data domains.

Topological data analysis of zebrafish patterns

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1917763117 ◽

2020 ◽

Vol 117 (10) ◽

pp. 5113-5124 ◽

Cited By ~ 4

Author(s):

Melissa R. McGuirl ◽

Alexandria Volkening ◽

Björn Sandstede

Keyword(s):

Data Analysis ◽

Pattern Formation ◽

Large Scale ◽

Model Organism ◽

Topological Data Analysis ◽

Wild Type ◽

Cell Dynamics ◽

Global Pattern ◽

Agent Based ◽

Topological Data

Self-organized pattern behavior is ubiquitous throughout nature, from fish schooling to collective cell dynamics during organism development. Qualitatively these patterns display impressive consistency, yet variability inevitably exists within pattern-forming systems on both microscopic and macroscopic scales. Quantifying variability and measuring pattern features can inform the underlying agent interactions and allow for predictive analyses. Nevertheless, current methods for analyzing patterns that arise from collective behavior capture only macroscopic features or rely on either manual inspection or smoothing algorithms that lose the underlying agent-based nature of the data. Here we introduce methods based on topological data analysis and interpretable machine learning for quantifying both agent-level features and global pattern attributes on a large scale. Because the zebrafish is a model organism for skin pattern formation, we focus specifically on analyzing its skin patterns as a means of illustrating our approach. Using a recent agent-based model, we simulate thousands of wild-type and mutant zebrafish patterns and apply our methodology to better understand pattern variability in zebrafish. Our methodology is able to quantify the differential impact of stochasticity in cell interactions on wild-type and mutant patterns, and we use our methods to predict stripe and spot statistics as a function of varying cellular communication. Our work provides an approach to automatically quantifying biological patterns and analyzing agent-based dynamics so that we can now answer critical questions in pattern formation at a much larger scale.
