cytometree: a binary tree algorithm for automatic gating in cytometry analysis

Mapping Intimacies ◽

10.1101/335554 ◽

2018 ◽

Cited By ~ 1

Author(s):

Daniel Commenges ◽

Chariff Alkhassim ◽

Raphael Gottardo ◽

Boris Hejblum ◽

Rodolphe Thiébaut

Keyword(s):

Flow Cytometry ◽

Binary Tree ◽

Computation Time ◽

R Package ◽

Information Criteria ◽

Supplementary Information ◽

Flow Cytometry Data ◽

Human Immunology ◽

Supplementary Material ◽

Unsupervised Algorithms

Abstract. Motivation: Flow cytometry is a powerful technology that allows the high-throughput quantification of dozens of surface and intracellular proteins at the single-cell level. It has become the most widely used technology for immunophenotyping of cells over the past three decades. As cytometry experiments grow more complex (more cells and more markers), traditional manual analysis of flow cytometry data has become untenable because it is subjective and time-consuming. Results: We present a new unsupervised algorithm called “cytometree” to perform automated population discovery (aka gating) in flow cytometry. cytometree is based on the construction of a binary tree, the nodes of which are subpopulations of cells. At each node, the marker distributions are modeled by mixtures of normal distributions. Node splitting is done according to a normalized difference of Akaike information criteria (AIC) between the two models. Post-processing of the tree structure and derived populations allows us to complete the annotation of the derived populations. The algorithm is shown to perform better than the state-of-the-art unsupervised algorithms previously proposed on panels introduced by the Flow Cytometry: Critical Assessment of Population Identification Methods (FlowCAP I) project. The algorithm is also applied to a T-cell panel proposed by the Human Immunology Project Consortium (HIPC) program, where it outperforms the best available unsupervised open-source algorithm while requiring the shortest computation time. Availability: An R package named “cytometree” is available on CRAN. Supplementary information: Supplementary data are available.
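The splitting rule described in this abstract can be illustrated with a small sketch. It is not the cytometree implementation: it assumes that "the two models" are a single normal and a two-component normal mixture fitted to one marker, that the difference is normalized as (AIC_1 - AIC_2) / (2n), and that a threshold of 0.1 decides the split. All function names, the normalization, and the threshold are assumptions made for illustration.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma * sigma) - (x - mu) ** 2 / (2 * sigma * sigma)

def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 log L."""
    return 2 * n_params - 2 * loglik

def loglik_one_normal(xs):
    """Maximised log-likelihood of a single normal (2 free parameters)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = max(math.sqrt(sum((x - mu) ** 2 for x in xs) / n), 1e-3)
    return sum(normal_logpdf(x, mu, sigma) for x in xs)

def loglik_two_normals(xs, n_iter=100):
    """Log-likelihood of a two-component normal mixture fitted by plain EM
    (5 free parameters: two means, two sds, one mixing weight)."""
    n = len(xs)
    ordered = sorted(xs)
    lower, upper = ordered[: n // 2], ordered[n // 2:]
    mu = [sum(lower) / len(lower), sum(upper) / len(upper)]
    centre = sum(xs) / n
    spread = max(math.sqrt(sum((x - centre) ** 2 for x in xs) / n), 1e-3)
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each cell belongs to component 0.
        resp = []
        for x in xs:
            p0 = weight[0] * math.exp(normal_logpdf(x, mu[0], sigma[0]))
            p1 = weight[1] * math.exp(normal_logpdf(x, mu[1], sigma[1]))
            total = p0 + p1
            resp.append(p0 / total if total > 0 else 0.5)
        # M-step: weighted updates of weight, mean and sd per component.
        for k in (0, 1):
            r = resp if k == 0 else [1 - v for v in resp]
            mass = max(sum(r), 1e-9)
            weight[k] = mass / n
            mu[k] = sum(ri * x for ri, x in zip(r, xs)) / mass
            var = sum(ri * (x - mu[k]) ** 2 for ri, x in zip(r, xs)) / mass
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor avoids collapse
    total = 0.0
    for x in xs:
        dens = sum(weight[k] * math.exp(normal_logpdf(x, mu[k], sigma[k])) for k in (0, 1))
        total += math.log(max(dens, 1e-300))
    return total

def normalized_aic_difference(xs):
    """Assumed normalisation: (AIC_unimodal - AIC_bimodal) / (2n)."""
    aic_one = aic(loglik_one_normal(xs), 2)
    aic_two = aic(loglik_two_normals(xs), 5)
    return (aic_one - aic_two) / (2 * len(xs))

THRESHOLD = 0.1  # hypothetical split threshold

rng = random.Random(0)
bimodal_marker = [rng.gauss(0, 1) for _ in range(300)] + [rng.gauss(6, 1) for _ in range(300)]
unimodal_marker = [rng.gauss(0, 1) for _ in range(600)]

d_bimodal = normalized_aic_difference(bimodal_marker)
d_unimodal = normalized_aic_difference(unimodal_marker)
print("split on bimodal marker:", d_bimodal > THRESHOLD)
print("split on unimodal marker:", d_unimodal > THRESHOLD)
```

In a tree-building loop, a node would be split on the marker with the largest such difference and left as a leaf when no marker exceeds the threshold. That selection rule is a natural reading of the abstract, not something it states.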


optimalFlow: optimal transport approach to flow cytometry gating and population matching

BMC Bioinformatics ◽

10.1186/s12859-020-03795-w ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Eustasio del Barrio ◽

Hristo Inouzhe ◽

Jean-Michel Loubes ◽

Carlos Matrán ◽

Agustín Mayo-Íscar

Keyword(s):

Flow Cytometry ◽

Supervised Learning ◽

Optimal Transport ◽

Cell Types ◽

R Package ◽

Supervised Machine Learning ◽

Intrinsic Variability ◽

Flow Cytometry Data ◽

Different Characteristics ◽

Better Than

Abstract. Background: Data obtained from flow cytometry present pronounced variability due to biological and technical reasons. Biological variability is a well-known phenomenon produced by measurements on different individuals, with different characteristics such as illness, age, sex, etc. The use of different settings for measurement, the variation of the conditions during experiments and the different types of flow cytometers are some of the technical causes of variability. This mixture of sources of variability makes the use of supervised machine learning for identification of cell populations difficult. The present work is conceived as a combination of strategies to facilitate the task of supervised gating. Results: We propose optimalFlowTemplates, based on a similarity distance and Wasserstein barycenters, which clusters cytometries and produces prototype cytometries for the different groups. We show that supervised learning, restricted to the new groups, performs better than the same techniques applied to the whole collection. We also present optimalFlowClassification, which uses a database of gated cytometries and optimalFlowTemplates to assign cell types to a new cytometry. We show that this procedure can outperform state-of-the-art techniques in the proposed datasets. Our code is freely available as optimalFlow, a Bioconductor R package at https://bioconductor.org/packages/optimalFlow. Conclusions: optimalFlowTemplates + optimalFlowClassification addresses the problem of using supervised learning while accounting for biological and technical variability. Our methodology provides a robust automated gating workflow that handles the intrinsic variability of flow cytometry data well. Our main innovation is the methodology itself and the optimal transport techniques that we apply to flow cytometry analysis.
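To give a feel for the Wasserstein barycenters mentioned in this abstract, here is a minimal one-dimensional sketch. It is not the optimalFlow method, which works on multivariate gated cytometries with its own similarity distance. It relies only on a standard fact: for equal-size, equally weighted 1D samples, the 2-Wasserstein distance pairs sorted values, and the equal-weight barycenter averages the sorted samples position by position. The function names and toy data are hypothetical.

```python
import math

def wasserstein2_1d(a, b):
    """2-Wasserstein distance between two equal-size 1D samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    pairs = zip(sorted(a), sorted(b))
    return math.sqrt(sum((x - y) ** 2 for x, y in pairs) / len(a))

def barycenter_1d(samples):
    """Equal-weight 2-Wasserstein barycenter of equal-size 1D samples:
    the position-wise mean of the sorted samples."""
    ordered = [sorted(s) for s in samples]
    size = len(ordered[0])
    if any(len(s) != size for s in ordered):
        raise ValueError("this sketch assumes samples of equal size")
    return [sum(s[i] for s in ordered) / len(ordered) for i in range(size)]

# Toy "marker intensities" from two hypothetical cytometries.
sample_a = [2.0, 0.0, 1.0]
sample_b = [4.0, 2.0, 3.0]

prototype = barycenter_1d([sample_a, sample_b])
print(prototype)                              # prototype sample
print(wasserstein2_1d(sample_a, sample_b))    # distance between the inputs
print(wasserstein2_1d(sample_a, prototype))   # distance input -> prototype
```

The prototype sits halfway between the two inputs in transport distance, which is the sense in which a barycenter acts as a template for a group of similar cytometries.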


Single cell network analysis with a mixture of Nested Effects Models

10.1101/258202 ◽

2018 ◽

Author(s):

Martin Pirkl ◽

Niko Beerenwinkel

Keyword(s):

Single Cell ◽

New Technologies ◽

Single Cells ◽

R Package ◽

Supplementary Information ◽

Data Sets ◽

Cell Network ◽

A Cell ◽

Supplementary Material ◽

Cell Data

Abstract. Motivation: New technologies allow for the elaborate measurement of different traits of single cells. These data promise to elucidate intra-cellular networks in unprecedented detail and further help to improve treatment of diseases like cancer. However, cell populations can be very heterogeneous. Results: We developed a mixture of Nested Effects Models (M&NEM) for single-cell data to simultaneously identify different cellular sub-populations and their corresponding causal networks to explain the heterogeneity in a cell population. For inference, we assign each cell to a network with a certain probability and iteratively update the optimal networks and cell probabilities in an Expectation Maximization scheme. We validate our method in the controlled setting of a simulation study and apply it to three data sets of pooled CRISPR screens generated previously by two novel experimental techniques, namely Crop-Seq and Perturb-Seq. Availability: The mixture Nested Effects Model (M&NEM) is available as the R package mnem at https://github.com/cbgethz/mnem/. Supplementary information: Supplementary data are available online.
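The "assign each cell to a network with a certain probability" step is the E-step of a generic mixture-model EM, sketched below. This is not the mnem package's code. In M&NEM the per-cell log-likelihoods would come from nested effects models and the M-step would also re-optimise each network. Here the log-likelihood matrix is made-up input, and only the mixing weights are updated. All names are hypothetical.

```python
import math

def e_step(log_lik, weights):
    """Return responsibilities resp[i][k]: probability that cell i belongs
    to component (network) k, given log_lik[i][k] and mixing weights."""
    resp = []
    for row in log_lik:
        scores = [math.log(w) + ll for w, ll in zip(weights, row)]
        top = max(scores)  # subtract the max for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        resp.append([u / total for u in unnorm])
    return resp

def update_weights(resp):
    """Partial M-step: mixing weights are the mean responsibilities."""
    n_cells = len(resp)
    n_components = len(resp[0])
    return [sum(row[k] for row in resp) / n_cells for k in range(n_components)]

# Made-up log-likelihoods of three cells under two candidate networks.
log_lik = [
    [-1.0, -1.0],  # equally well explained by both networks
    [-0.5, -3.0],  # favours network 0
    [-4.0, -0.2],  # favours network 1
]
weights = [0.5, 0.5]

resp = e_step(log_lik, weights)
weights = update_weights(resp)
print(resp)
print(weights)
```

A full EM loop would alternate these two functions with a step that refits each component's network to the responsibility-weighted data, stopping once the overall likelihood no longer improves.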


cyanoFilter: An R package to identify phytoplankton populations from flow cytometry data using cell pigmentation and granularity

Ecological Modelling ◽

10.1016/j.ecolmodel.2021.109743 ◽

2021 ◽

Vol 460 ◽

pp. 109743

Author(s):

Oluwafemi D. Olusoji ◽

Jurg W. Spaak ◽

Mark Holmes ◽

Thomas Neyens ◽

Marc Aerts ◽

...

Keyword(s):

Flow Cytometry ◽

R Package ◽

Flow Cytometry Data


Random Forest of Perfect Trees: Concept, Performance, Applications, and Perspectives

Bioinformatics ◽

10.1093/bioinformatics/btab074 ◽

2021 ◽

Author(s):

Jean-Michel Nguyen ◽

Pascal Jézéquel ◽

Pierre Gillois ◽

Luisa Silva ◽

Faouda Ben Azzouz ◽

...

Keyword(s):

Random Forest ◽

Information Criterion ◽

R Package ◽

Information Criteria ◽

Three Dimensions ◽

Supplementary Information ◽

Recursive Feature Elimination ◽

Support Vector ◽

Classification Errors ◽

New Type

Abstract. Motivation: The principle of Breiman's random forest (RF) is to build and assemble complementary classification trees in a way that maximizes their variability. We propose a new type of random forest that disobeys Breiman’s principles and involves building trees with no classification errors in very large quantities. We used a new type of decision tree that uses a neuron at each node as well as an innovative half Christmas tree structure. With these new RFs, we developed a score, based on a family of ten new statistical information criteria, called Nguyen information criteria (NICs), to evaluate the predictive qualities of features in three dimensions. Results: The first NIC allowed the Akaike information criterion to be minimized more quickly than data obtained with the Gini index when the features were introduced in a logistic regression model. The selected features based on the NICScore showed a slight advantage compared to the support vector machine-recursive feature elimination (SVM-RFE) method. We demonstrate that the inclusion of artificial neurons in tree nodes allows a large number of classifiers in the same node to be taken into account simultaneously and results in perfect trees without classification errors. Availability and implementation: The methods used to build the perfect trees in this article were implemented in the “ROP” R package, archived at https://cran.r-project.org/web/packages/ROP/index.html. Supplementary information: Supplementary data are available at Bioinformatics online.


Automated identification of maximal differential cell populations in flow cytometry data

10.1101/837765 ◽

2019 ◽

Author(s):

Alice Yue ◽

Cedric Chauve ◽

Maxwell Libbrecht ◽

Ryan R. Brinkman

Keyword(s):

Flow Cytometry ◽

Cell Population ◽

R Package ◽

Cell Populations ◽

Visualization Tool ◽

Automated Identification ◽

Flow Cytometry Data ◽

New Class ◽

Differential Cell ◽

Related Population

Abstract. We introduce a new cell population score called SpecEnr (specific enrichment) and describe a method that discovers robust and accurate candidate biomarkers from flow cytometry data. Our approach identifies a new class of candidate biomarkers we define as driver cell populations, whose abundance is associated with a sample class (e.g. disease), but not as a result of a change in a related population. We show that the driver cell populations we find are also easily interpretable using a lattice-based visualization tool. Our method is implemented in the R package flowGraph, freely available on GitHub (github.com/aya49/flowGraph), and will be available on Bioconductor.


SIMLR: a tool for large-scale single-cell analysis by multi-kernel learning

10.1101/118901 ◽

2017 ◽

Cited By ~ 9

Author(s):

Bo Wang ◽

Daniele Ramazzotti ◽

Luca De Sano ◽

Junjie Zhu ◽

Emma Pierson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cell Analysis ◽

R Package ◽

Supplementary Information ◽

Cell Analysis ◽

RNA-Seq ◽

A Cell ◽

Supplementary Material ◽

Public Datasets

Abstract. Motivation: We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via better visualization. Availability and Implementation: SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package. Supplementary Information: Supplementary data are available at Bioinformatics online.


MOSim: Multi-Omics Simulation in R

10.1101/421834 ◽

2018 ◽

Cited By ~ 5

Author(s):

Carlos Martínez-Mira ◽

Ana Conesa ◽

Sonia Tarazona

Keyword(s):

Time Series Data ◽

Simulated Data ◽

R Package ◽

Experimental Designs ◽

Supplementary Information ◽

Series Data ◽

Data Sets ◽

Expression Data ◽

Supplementary Material ◽

Omic Data

Abstract. Motivation: As new integrative methodologies are being developed to analyse multi-omic experiments, validation strategies are required for benchmarking. In silico approaches such as simulated data are popular as they are fast and cheap. However, few tools are available for creating synthetic multi-omic data sets. Results: MOSim is a new R package for easily simulating multi-omic experiments consisting of gene expression data, other regulatory omics and the regulatory relationships between them. MOSim supports different experimental designs including time series data. Availability: The package is freely available under the GPL-3 license from the Bitbucket repository (https://bitbucket.org/ConesaLab/mosim/). Supplementary information: Supplementary material is available at bioRxiv online.


flowPloidy: An R package for genome size and ploidy assessment of flow cytometry data

Applications in Plant Sciences ◽

10.1002/aps3.1164 ◽

2018 ◽

Vol 6 (7) ◽

pp. e01164 ◽

Cited By ~ 6

Author(s):

Tyler William Smith ◽

Paul Kron ◽

Sara L. Martin

Keyword(s):

Flow Cytometry ◽

Genome Size ◽

R Package ◽

Flow Cytometry Data


metagenomeFeatures: an R package for working with 16S rRNA reference databases and marker-gene survey feature data

Bioinformatics ◽

10.1093/bioinformatics/btz136 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3870-3872 ◽

Cited By ~ 1

Author(s):

Nathan D Olson ◽

Nidhi Shah ◽

Jayaram Kancherla ◽

Justin Wagner ◽

Joseph N Paulson ◽

...

Keyword(s):

16S rRNA ◽

Marker Gene ◽

R Package ◽

Supplementary Information ◽

Bioconductor Package ◽

rRNA Sequence ◽

16S rRNA Sequence ◽

Reference Databases ◽

Supplementary Material ◽

Database Comparison

Abstract. Summary: We developed the metagenomeFeatures R Bioconductor package along with annotation packages for three 16S rRNA databases (Greengenes, RDP and SILVA) to facilitate working with 16S rRNA databases and marker-gene survey feature data. The metagenomeFeatures package defines two classes, MgDb for working with 16S rRNA sequence databases, and mgFeatures for marker-gene survey feature data. The associated annotation packages provide a consistent interface to the different databases facilitating database comparison and exploration. The mgFeatures-class represents a crucial step in the development of a common data structure for working with 16S marker-gene survey data in R. Availability and implementation: https://bioconductor.org/packages/release/bioc/html/metagenomeFeatures.html. Supplementary information: Supplementary material is available at Bioinformatics online.


mmgenome: a toolbox for reproducible genome extraction from metagenomes

10.1101/059121 ◽

2016 ◽

Cited By ~ 42

Author(s):

Søren M. Karst ◽

Rasmus H. Kirkegaard ◽

Mads Albertsen

Keyword(s):

Optimal Strategy ◽

R Package ◽

Supplementary Information ◽

Data Generation ◽

Supplementary Data ◽

High Quality ◽

Standard Analysis ◽

Specific Population ◽

The Core ◽

Supplementary Material

Abstract. Summary: Recovery of population genomes is becoming a standard analysis in metagenomics and a multitude of different approaches exists. However, the workflows are complex, requiring data generation, binning, validation and finishing to generate high quality population genome bins. In addition, several different approaches are often used on the same dataset as the optimal strategy to extract a specific population genome varies. Here we introduce mmgenome: a toolbox for reproducible genome extraction from metagenomes. At the core of mmgenome is an R package that facilitates effortless integration of different binning strategies by collecting information on scaffolds. Genome binning is facilitated through integrated tools that support effortless visualizations, validation and calculation of key statistics. Full reproducibility and transparency are obtained through Rmarkdown, whereby every step can be recreated. Availability and implementation: The binning framework of mmgenome is implemented in R. Wrapper scripts for data generation and finishing are written in Perl. The mmgenome toolbox and associated step-by-step guides are available at http://madsalbertsen.github.io/mmgenome/. Supplementary information: Supplementary data are available at Bioinformatics online.
