projectR: an R/Bioconductor package for transfer learning via PCA, NMF, correlation and clustering

Gaurav Sharma; Carlo Colantuoni; Loyal A Goff; Elana J Fertig; Genevieve Stein-O’Brien

doi:10.1093/bioinformatics/btaa183

projectR: an R/Bioconductor package for transfer learning via PCA, NMF, correlation and clustering

Bioinformatics ◽

10.1093/bioinformatics/btaa183 ◽

2020 ◽

Vol 36 (11) ◽

pp. 3592-3593 ◽

Cited By ~ 2

Author(s):

Gaurav Sharma ◽

Carlo Colantuoni ◽

Loyal A Goff ◽

Elana J Fertig ◽

Genevieve Stein-O’Brien

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Single Cell Analysis ◽

Ground Truth ◽

Biological Data ◽

Supplementary Information ◽

Bioconductor Package ◽

Reduction Techniques ◽

Biological Phenomena ◽

Feature Discovery

Abstract Motivation Dimension reduction techniques are widely used to interpret high-dimensional biological data. Features learned from these methods are used to discover both technical artifacts and novel biological phenomena. Such feature discovery is critically important in the analysis of large single-cell datasets, where lack of a ground truth limits validation and interpretation. Transfer learning (TL) can be used to relate the features learned from one source dataset to a new target dataset to perform biologically driven validation by evaluating their use in or association with additional sample annotations in that independent target dataset. Results We developed an R/Bioconductor package, projectR, to perform TL for analyses of genomics data via TL of clustering, correlation and factorization methods. We then demonstrate the utility of TL for integrated data analysis with an example for spatial single-cell analysis. Availability and implementation projectR is available on Bioconductor and at https://github.com/genesofeve/projectR. Contact [email protected] or [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

projectR: An R/Bioconductor package for transfer learning via PCA, NMF, correlation, and clustering

10.1101/726547 ◽

2019 ◽

Cited By ~ 2

Author(s):

Gaurav Sharma ◽

Carlo Colantuoni ◽

Loyal A Goff ◽

Elana J Fertig ◽

Genevieve Stein-O’Brien

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Single Cell Analysis ◽

Ground Truth ◽

Biological Data ◽

Bioconductor Package ◽

Reduction Techniques ◽

Biological Phenomena ◽

Feature Discovery ◽

Dimension Reduction Techniques

Abstract Motivation Dimension reduction techniques are widely used to interpret high-dimensional biological data. Features learned from these methods are used to discover both technical artifacts and novel biological phenomena. Such feature discovery is critically important to large single-cell datasets, where lack of a ground truth limits validation and interpretation. Transfer learning (TL) can be used to relate the features learned from one source dataset to a new target dataset to perform biologically-driven validation by evaluating their use in or association with additional sample annotations in that independent target dataset. Results We developed an R/Bioconductor package, projectR, to perform TL for analyses of genomics data via TL of clustering, correlation, and factorization methods. We then demonstrate the utility of TL for integrated data analysis with an example for spatial single-cell analysis. Availability projectR is available on Bioconductor and at https://github.com/genesofeve/[email protected]; [email protected]

Presto scales Wilcoxon and auROC analyses to millions of observations

10.1101/653253 ◽

2019 ◽

Cited By ~ 6

Author(s):

Ilya Korsunsky ◽

Aparna Nathan ◽

Nghia Millard ◽

Soumya Raychaudhuri

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

Sparse Matrices ◽

R Package ◽

Biological Data ◽

Supplementary Information ◽

Wilcoxon Rank Sum Test ◽

Biological Data Analysis ◽

Simple Interface ◽

Operator Curve

Abstract Summary The related Wilcoxon rank sum test and area under the receiver operator curve are ubiquitous in high dimensional biological data analysis. Current implementations do not scale readily to the increasingly large datasets generated by novel high-throughput technologies, such as single cell RNAseq. We introduce a simple and scalable implementation of both analyses, available through the R package Presto. Presto scales to big datasets, with functions optimized for both dense and sparse matrices. On a sparse dataset of 1 million observations, 10 groups, and 1,000 features, Presto performed both rank-sum and auROC analyses in only 17 seconds, compared to 6.4 hours with base R functions. Presto also includes functions to seamlessly integrate with the Seurat single cell analysis pipeline and the Bioconductor SingleCellExperiment class. Presto enables the use of robust classical analyses on big data with a simple interface and optimized implementation. Availability and Implementation Presto is available as an R package at https://github.com/immunogenomics/[email protected] Information Vignettes are available with the Presto package.
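
The abstract above calls the Wilcoxon rank sum test and the auROC "related". A minimal sketch of that relationship follows, assuming the standard identity auROC = U / (n1 * n2), where U is the Mann-Whitney statistic obtained from the rank sum. This is plain illustrative Python and not Presto's implementation; the function names `midranks`, `rank_sum_auroc` and `auroc_by_pairs` are hypothetical.

```python
def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_auroc(group, rest):
    """Mann-Whitney U for `group` versus `rest`, and the matching auROC."""
    n1, n2 = len(group), len(rest)
    ranks = midranks(list(group) + list(rest))
    rank_sum = sum(ranks[:n1])           # Wilcoxon rank sum of `group`
    u = rank_sum - n1 * (n1 + 1) / 2     # Mann-Whitney U statistic
    return u, u / (n1 * n2)              # auROC = U / (n1 * n2)


def auroc_by_pairs(group, rest):
    """Direct definition: P(group value > rest value), ties count one half."""
    wins = sum((g > r) + 0.5 * (g == r) for g in group for r in rest)
    return wins / (len(group) * len(rest))


u, auc = rank_sum_auroc([3, 5, 7, 7], [1, 2, 5, 6])
print(u, auc)  # 12.5 0.78125
```

The rank-based route needs one sort of all observations, while the pairwise definition compares every pair across the two groups, which is one reason rank-based formulations are attractive for large datasets.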

breakpointR: an R/Bioconductor package to localize strand state changes in Strand-seq data

Bioinformatics ◽

10.1093/bioinformatics/btz681 ◽

2019 ◽

Cited By ~ 4

Author(s):

David Porubsky ◽

Ashley D Sanders ◽

Aaron Taudt ◽

Maria Colomé-Tatché ◽

Peter M Lansdorp ◽

...

Keyword(s):

Single Cell ◽

Sister Chromatid Exchanges ◽

Supplementary Information ◽

Specific Information ◽

Bioconductor Package ◽

Sequencing Data ◽

Sequencing Technique ◽

Important Addition ◽

State Changes ◽

Chromatid Exchanges

Abstract Motivation Strand-seq is a specialized single-cell DNA sequencing technique centered around the directionality of single-stranded DNA. Computational tools for Strand-seq analyses must capture the strand-specific information embedded in these data. Results Here we introduce breakpointR, an R/Bioconductor package specifically tailored to process and interpret single-cell strand-specific sequencing data obtained from Strand-seq. We developed breakpointR to detect local changes in strand directionality of aligned Strand-seq data, to enable fine-mapping of sister chromatid exchanges, germline inversion and to support global haplotype assembly. Given the broad spectrum of Strand-seq applications we expect breakpointR to be an important addition to currently available tools and extend the accessibility of this novel sequencing technique. Availability and implementation R/Bioconductor package https://bioconductor.org/packages/breakpointR. Supplementary information Supplementary data are available at Bioinformatics online.

CellBench: R/Bioconductor software for comparing single-cell RNA-seq analysis methods

Bioinformatics ◽

10.1093/bioinformatics/btz889 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2288-2290 ◽

Cited By ~ 3

Author(s):

Shian Su ◽

Luyi Tian ◽

Xueyi Dong ◽

Peter F Hickey ◽

Saskia Freytag ◽

...

Keyword(s):

Single Cell ◽

Ad Hoc ◽

Performance Metrics ◽

Single Cell Analysis ◽

Ground Truth ◽

Bioinformatic Analysis ◽

Rna Seq ◽

Effective Manner ◽

Cell Gene Expression ◽

The Many

Abstract Motivation Bioinformatic analysis of single-cell gene expression data is a rapidly evolving field. Hundreds of bespoke methods have been developed in the past few years to deal with various aspects of single-cell analysis and consensus on the most appropriate methods to use under different settings is still emerging. Benchmarking the many methods is therefore of critical importance and since analysis of single-cell data usually involves multi-step pipelines, effective evaluation of pipelines involving different combinations of methods is required. Current benchmarks of single-cell methods are mostly implemented with ad-hoc code that is often difficult to reproduce or extend, and exhaustive manual coding of many combinations is infeasible in most instances. Therefore, new software is needed to manage pipeline benchmarking. Results The CellBench R software facilitates method comparisons in either a task-centric or combinatorial way to allow pipelines of methods to be evaluated in an effective manner. CellBench automatically runs combinations of methods, provides facilities for measuring running time and delivers output in tabular form which is highly compatible with tidyverse R packages for summary and visualization. Our software has enabled comprehensive benchmarking of single-cell RNA-seq normalization, imputation, clustering, trajectory analysis and data integration methods using various performance metrics obtained from data with available ground truth. CellBench is also amenable to benchmarking other bioinformatics analysis tasks. Availability and implementation Available from https://bioconductor.org/packages/CellBench.

Destin: toolkit for single-cell analysis of chromatin accessibility

Bioinformatics ◽

10.1093/bioinformatics/btz141 ◽

2019 ◽

Vol 35 (19) ◽

pp. 3818-3820 ◽

Cited By ~ 10

Author(s):

Eugene Urrutia ◽

Li Chen ◽

Haibo Zhou ◽

Yuchao Jiang

Keyword(s):

Single Cell ◽

Single Cell Analysis ◽

New Technology ◽

R Package ◽

Chromatin Accessibility ◽

Supplementary Information ◽

Cell Type ◽

Statistical Framework ◽

Specific Association ◽

Accessible Chromatin

Abstract Summary Single-cell assay of transposase-accessible chromatin followed by sequencing (scATAC-seq) is an emerging new technology for the study of gene regulation with single-cell resolution. The data from scATAC-seq are unique: sparse, binary and highly variable even within the same cell type. As such, neither methods developed for bulk ATAC-seq nor single-cell RNA-seq data are appropriate. Here, we present Destin, a bioinformatic and statistical framework for comprehensive scATAC-seq data analysis. Destin performs cell-type clustering via weighted principal component analysis, weighting accessible chromatin regions by existing genomic annotations and publicly available regulomic datasets. The weights and additional tuning parameters are determined via model-based likelihood. We evaluated the performance of Destin using downsampled bulk ATAC-seq data of purified samples and scATAC-seq data from seven diverse experiments. Compared to existing methods, Destin was shown to outperform across all datasets and platforms. For demonstration, we further applied Destin to 2088 adult mouse forebrain cells and identified cell-type-specific association of previously reported schizophrenia GWAS loci. Availability and implementation Destin toolkit is freely available as an R package at https://github.com/urrutiag/destin. Supplementary information Supplementary data are available at Bioinformatics online.

Agile workflow for interactive analysis of mass cytometry data

10.1101/2020.05.28.120527 ◽

2020 ◽

Author(s):

Julia Casado ◽

Oskari Lehtonen ◽

Ville Rantanen ◽

Katja Kaipio ◽

Luca Pasquini ◽

...

Keyword(s):

Single Cell ◽

Peripheral Blood ◽

Large Scale ◽

Immune Cell ◽

Single Cell Analysis ◽

Supplementary Information ◽

Mass Cytometry ◽

Interactive Analysis ◽

Link Type ◽

Cell Subpopulations

Abstract Motivation Single-cell proteomics technologies, such as mass cytometry, have enabled characterization of cell-to-cell variation and cell populations at a single-cell resolution. These large amounts of data, however, require dedicated, interactive tools for translating the data into knowledge. Results We present a comprehensive, interactive method called Cyto to streamline analysis of large-scale cytometry data. Cyto is a workflow-based open-source solution that automates the use of state-of-the-art single-cell analysis methods with interactive visualization. We show the utility of Cyto by applying it to mass cytometry data from peripheral blood and high-grade serous ovarian cancer (HGSOC) samples. Our results show that Cyto is able to reliably capture the immune cell sub-populations from peripheral blood as well as cellular compositions of unique immune- and cancer cell subpopulations in HGSOC tumor and ascites samples. Availability The method is available as a Docker container at https://hub.docker.com/r/anduril/cyto and the user guide and source code are available at https://bitbucket.org/anduril-dev/[email protected] information Supplementary material is available and FCS files are hosted at flowrepository.org/id/FR-FCM-Z2LW

SIMLR: a tool for large-scale single-cell analysis by multi-kernel learning

10.1101/118901 ◽

2017 ◽

Cited By ~ 9

Author(s):

Bo Wang ◽

Daniele Ramazzotti ◽

Luca De Sano ◽

Junjie Zhu ◽

Emma Pierson ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Single Cell Analysis ◽

R Package ◽

Supplementary Information ◽

Cell Analysis ◽

Rna Seq ◽

A Cell ◽

Supplementary Material ◽

Public Datasets

Abstract Motivation We here present SIMLR (Single-cell Interpretation via Multi-kernel LeaRning), an open-source tool that implements a novel framework to learn a cell-to-cell similarity measure from single-cell RNA-seq data. SIMLR can be effectively used to perform tasks such as dimension reduction, clustering, and visualization of heterogeneous populations of cells. SIMLR was benchmarked against state-of-the-art methods for these three tasks on several public datasets, showing it to be scalable and capable of greatly improving clustering performance, as well as providing valuable insights by making the data more interpretable via a better visualization. Availability and Implementation SIMLR is available on GitHub in both R and MATLAB implementations. Furthermore, it is also available as an R package on [email protected] or [email protected] Information Supplementary data are available at Bioinformatics online.

NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

10.1101/2021.08.02.453487 ◽

2021 ◽

Author(s):

Federico Agostinis ◽

Chiara Romualdi ◽

Gabriele Sales ◽

Davide Risso

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Bioconductor Package ◽

Rna Seq ◽

Sequencing Data ◽

Bioconductor Project ◽

Single Cell Rna Sequencing

Summary: We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. Availability and implementation: NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/ Supplementary information: Supplementary data are available at Bioinformatics online.

Single-cell genomics to understand disease pathogenesis

Journal of Human Genetics ◽

10.1038/s10038-020-00844-3 ◽

2020 ◽

Vol 66 (1) ◽

pp. 75-84

Author(s):

Seitaro Nomura

Keyword(s):

Single Cell ◽

Molecular Mechanisms ◽

Single Cell Analysis ◽

Cell Analysis ◽

Disease Pathogenesis ◽

Single Cell Genomics ◽

Molecular Systems ◽

Biological Phenomena ◽

Molecular Behavior ◽

Unbiased Manner

Abstract Cells are minimal functional units in biological phenomena, and therefore single-cell analysis is needed to understand the molecular behavior leading to cellular function in organisms. In addition, omics analysis technology can be used to identify essential molecular mechanisms in an unbiased manner. Recently, single-cell genomics has unveiled hidden molecular systems leading to disease pathogenesis in patients. In this review, I summarize the recent advances in single-cell genomics for the understanding of disease pathogenesis and discuss future perspectives.

Agile workflow for interactive analysis of mass cytometry data

Bioinformatics ◽

10.1093/bioinformatics/btaa946 ◽

2020 ◽

Author(s):

Julia Casado ◽

Oskari Lehtonen ◽

Ville Rantanen ◽

Katja Kaipio ◽

Luca Pasquini ◽

...

Keyword(s):

Single Cell ◽

Peripheral Blood ◽

Large Scale ◽

Immune Cell ◽

Single Cell Analysis ◽

Supplementary Information ◽

Mass Cytometry ◽

Interactive Analysis ◽

Cell Subpopulations ◽

Cell Variation

Abstract Motivation Single-cell proteomics technologies, such as mass cytometry, have enabled characterization of cell-to-cell variation and cell populations at a single-cell resolution. These large amounts of data require dedicated, interactive tools for translating the data into knowledge. Results We present a comprehensive, interactive method called Cyto to streamline analysis of large-scale cytometry data. Cyto is a workflow-based open-source solution that automates the use of state-of-the-art single-cell analysis methods with interactive visualization. We show the utility of Cyto by applying it to mass cytometry data from peripheral blood and high-grade serous ovarian cancer (HGSOC) samples. Our results show that Cyto is able to reliably capture the immune cell sub-populations from peripheral blood and cellular compositions of unique immune- and cancer cell subpopulations in HGSOC tumor and ascites samples. Availability and implementation The method is available as a Docker container at https://hub.docker.com/r/anduril/cyto and the user guide and source code are available at https://bitbucket.org/anduril-dev/cyto. Supplementary information Supplementary data are available at Bioinformatics online.
