Applications of community detection algorithms to large biological datasets

Mapping Intimacies ◽

10.1101/547570 ◽

2019 ◽

Cited By ~ 1

Author(s):

Itamar Kanter ◽

Gur Yaari ◽

Tomer Kalisky

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Community Detection ◽

Large Scale ◽

Heuristic Algorithms ◽

Relevant Information ◽

Biological Data ◽

Sequence Information ◽

Single Experiment ◽

Or Genes

ABSTRACT Recent advances in data acquisition technologies in biology have led to major challenges in mining relevant information from large datasets. For example, single-cell RNA sequencing technologies are producing expression and sequence information from tens of thousands of cells in every single experiment. A common task in analyzing biological data is to cluster samples or features (e.g. genes) into groups sharing common characteristics. This is an NP-hard problem for which numerous heuristic algorithms have been developed. However, in many cases, the clusters created by these algorithms do not reflect biological reality. To overcome this, a Networks Based Clustering (NBC) approach was recently proposed, by which the samples or genes in the dataset are first mapped to a network and then community detection (CD) algorithms are used to identify clusters of nodes. Here, we created an open and flexible Python-based toolkit for NBC that enables easy and accessible network construction and community detection. We then tested the applicability of NBC for identifying clusters of cells or genes from previously published large-scale single-cell and bulk RNA-seq datasets. We show that NBC can be used to accurately and efficiently analyze large-scale datasets of RNA sequencing experiments.
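The two NBC steps named in this abstract (map samples to a network, then run community detection on it) can be illustrated with a minimal sketch. This is not the authors' toolkit: the function names `knn_graph` and `label_propagation`, the Euclidean k-nearest-neighbour graph, the deterministic label propagation used as a stand-in community detection algorithm, and the toy expression values are all assumptions made for illustration.

```python
import math
from collections import Counter


def knn_graph(samples, k):
    """Step 1 (assumed construction): link each sample to its k nearest
    neighbours by Euclidean distance, producing an undirected graph."""
    n = len(samples)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        by_distance = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(samples[i], samples[j]),
        )
        for j in by_distance[:k]:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def label_propagation(adj, max_iter=100):
    """Step 2 (stand-in CD algorithm): each node repeatedly adopts the most
    common label among its neighbours; ties go to the smallest label so the
    result is deterministic."""
    labels = {node: node for node in adj}
    for _ in range(max_iter):
        changed = False
        for node in sorted(adj):
            if not adj[node]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[nb] for nb in adj[node])
            best = max(counts.values())
            new_label = min(lab for lab, c in counts.items() if c == best)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


# Hypothetical toy data: six "cells" measured on two "genes".
cells = [
    (1.0, 0.0), (1.1, 0.1), (0.9, 0.2),   # first group
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),   # second group
]
graph = knn_graph(cells, k=2)
labels = label_propagation(graph)
```

On this toy input the first three cells end up sharing one label and the last three another. Real datasets would need a similarity measure, a choice of k, and a community detection algorithm selected for the data at hand.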


Comparison of high-throughput single-cell RNA sequencing data processing pipelines

Briefings in Bioinformatics ◽

10.1093/bib/bbaa116 ◽

2020 ◽

Author(s):

Mingxuan Gao ◽

Mingyi Ling ◽

Xinwei Tang ◽

Shun Wang ◽

Xu Xiao ◽

...

Keyword(s):

Data Processing ◽

Single Cell ◽

RNA Sequencing ◽

High Throughput ◽

Large Scale ◽

Evaluation Framework ◽

Integrated Analysis ◽

Sequencing Data ◽

Single Experiment ◽

Single Cell RNA Sequencing

Abstract With the development of single-cell RNA sequencing (scRNA-seq) technology, it has become possible to perform large-scale transcript profiling for tens of thousands of cells in a single experiment. Many analysis pipelines have been developed for data generated from different high-throughput scRNA-seq platforms, bringing a new challenge to users to choose a proper workflow that is efficient, robust and reliable for a specific sequencing platform. Moreover, as the amount of public scRNA-seq data has increased rapidly, integrated analysis of scRNA-seq data from different sources has become increasingly popular. However, it remains unclear whether such integrated analysis would be biased if the data were processed by different upstream pipelines. In this study, we encapsulated seven existing high-throughput scRNA-seq data processing pipelines with Nextflow, a general integrative workflow management framework, and evaluated their performance in terms of running time, computational resource consumption and data analysis consistency using eight public datasets generated from five different high-throughput scRNA-seq platforms. Our work provides a useful guideline for the selection of scRNA-seq data processing pipelines based on their performance on different real datasets. In addition, these guidelines can serve as a performance evaluation framework for future developments in high-throughput scRNA-seq data processing.


Comparison of High-Throughput Single-Cell RNA Sequencing Data Processing Pipelines

10.1101/2020.02.09.940221 ◽

2020 ◽

Cited By ~ 2

Author(s):

Mingxuan Gao ◽

Mingyi Ling ◽

Xinwei Tang ◽

Shun Wang ◽

Xu Xiao ◽

...

Keyword(s):

Data Processing ◽

Single Cell ◽

RNA Sequencing ◽

High Throughput ◽

Large Scale ◽

Evaluation Framework ◽

Integrated Analysis ◽

Sequencing Data ◽

Single Experiment ◽

Single Cell RNA Sequencing

Abstract With the development of single-cell RNA sequencing (scRNA-seq) technology, it has become possible to perform large-scale transcript profiling for tens of thousands of cells in a single experiment. Many analysis pipelines have been developed for data generated from different high-throughput scRNA-seq platforms, bringing a new challenge to users to choose a proper workflow that is efficient, robust and reliable for a specific sequencing platform. Moreover, as the amount of public scRNA-seq data has increased rapidly, integrated analysis of scRNA-seq data from different sources has become increasingly popular. However, it remains unclear whether such integrated analysis would be biased if the data were processed by different upstream pipelines. In this study, we encapsulated seven existing high-throughput scRNA-seq data processing pipelines with Nextflow, a general integrative workflow management framework, and evaluated their performances in terms of running time, computational resource consumption, and data processing consistency using nine public datasets generated from five different high-throughput scRNA-seq platforms. Our work provides a useful guideline for the selection of scRNA-seq data processing pipelines based on their performances on different real datasets. In addition, these guidelines can serve as a performance evaluation framework for future developments in high-throughput scRNA-seq data processing.


Orchestrating Single-Cell Analysis with Bioconductor

10.1101/590562 ◽

2019 ◽

Cited By ~ 5

Author(s):

Robert A. Amezquita ◽

Vince J. Carey ◽

Lindsay N. Carpp ◽

Ludwig Geistlinger ◽

Aaron T. L. Lun ◽

...

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Large Scale ◽

Biological Data ◽

High Dimensional ◽

Software Project ◽

Sequencing Analysis ◽

Data Generation ◽

Data Infrastructure ◽

Single Cell RNA Sequencing

Abstract Recent developments in experimental technologies such as single-cell RNA sequencing have enabled the profiling of a high-dimensional number of genome-wide features in individual cells, inspiring the formation of large-scale data generation projects quantifying unprecedented levels of biological variation at the single-cell level. The data generated in such projects exhibits unique characteristics, including increased sparsity and scale, in terms of both the number of features and the number of samples. Due to these unique characteristics, specialized statistical methods are required along with fast and efficient software implementations in order to successfully derive biological insights. Bioconductor - an open-source, open-development software project based on the R programming language - has pioneered the analysis of such high-throughput, high-dimensional biological data, leveraging a rich history of software and methods development that has spanned the era of sequencing. Featuring state-of-the-art computational methods, standardized data infrastructure, and interactive data visualization tools that are all easily accessible as software packages, Bioconductor has made it possible for a diverse audience to analyze data derived from cutting-edge single-cell assays. Here, we present an overview of single-cell RNA sequencing analysis for prospective users and contributors, highlighting the contributions towards this effort made by Bioconductor.


Vacuum-Driven Micropump with Support Columns: Toward Large Scale Single-Cell RNA-Sequencing

2018 International Conference on Manipulation, Automation and Robotics at Small Scales (MARSS) ◽

10.1109/marss.2018.8481157 ◽

2018 ◽

Author(s):

Kento Hisa ◽

Musashi Kakugawa ◽

Takayuki Shibata ◽

Moeto Nagai

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Large Scale ◽

Single Cell RNA Sequencing


Prokaryotic Single-Cell RNA Sequencing by In Situ Combinatorial Indexing

10.1101/866244 ◽

2019 ◽

Cited By ~ 3

Author(s):

Sydney B. Blattman ◽

Wenyan Jiang ◽

Panos Oikonomou ◽

Saeed Tavazoie

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Low Cost ◽

Bacterial Cells ◽

Single Experiment ◽

Bacterial Populations ◽

Single Cell RNA Sequencing ◽

Gene Expression Heterogeneity ◽

mRNA Polyadenylation

Abstract Despite longstanding appreciation of gene expression heterogeneity in isogenic bacterial populations, affordable and scalable technologies for studying single bacterial cells have been limited. While single-cell RNA sequencing (scRNA-seq) has revolutionized studies of transcriptional heterogeneity in diverse eukaryotic systems, application of scRNA-seq to prokaryotes has been hindered by their extremely low mRNA abundance, lack of mRNA polyadenylation, and thick cell walls. Here, we present Prokaryotic Expression-profiling by Tagging RNA In Situ and sequencing (PETRI-seq), a low-cost, high-throughput, prokaryotic scRNA-seq pipeline that overcomes these technical obstacles. PETRI-seq uses in situ combinatorial indexing to barcode transcripts from tens of thousands of cells in a single experiment. PETRI-seq captures single-cell transcriptomes of Gram-negative and Gram-positive bacteria with high purity and low bias, with median capture rates >200 mRNAs/cell for exponentially growing E. coli. These characteristics enable robust discrimination of cell states corresponding to different phases of growth. When applied to wild-type S. aureus, PETRI-seq revealed a rare sub-population of cells undergoing prophage induction. We anticipate broad utility of PETRI-seq in defining single-cell states and their dynamics in complex microbial communities.


Assessing the measurement transfer function of single-cell RNA sequencing

10.1101/045450 ◽

2016 ◽

Author(s):

Hannah R. Dueck ◽

Rizi Ai ◽

Adrian Camarena ◽

Bo Ding ◽

Reymundo Dominguez ◽

...

Keyword(s):

Transfer Function ◽

Single Cell ◽

RNA Sequencing ◽

Large Scale ◽

Transfer Functions ◽

Single Cells ◽

Biological Information ◽

Gene Detection ◽

Single Cell RNA Sequencing ◽

Scale Control

Abstract Recently, measurement of RNA at single-cell resolution has yielded surprising insights. Methods for single-cell RNA sequencing (scRNA-seq) have received considerable attention, but the broad reliability of single-cell methods and the factors governing their performance are still poorly understood. Here, we conducted a large-scale control experiment to assess the transfer function of three scRNA-seq methods and factors modulating the function. All three methods detected greater than 70% of the expected number of genes and had a 50% probability of detecting genes with abundance greater than 2 to 4 molecules. Despite the small number of molecules, sequencing depth significantly affected gene detection. While biases in detection and quantification were qualitatively similar across methods, the degree of bias differed, consistent with differences in molecular protocol. Measurement reliability increased with expression level for all methods and we conservatively estimate the measurement transfer functions to be linear above ~5-10 molecules. Based on these extensive control studies, we propose that RNA-seq of single cells has come of age, yielding quantitative biological information.


Unbiased classification of sensory neuron types by large-scale single-cell RNA sequencing

Nature Neuroscience ◽

10.1038/nn.3881 ◽

2014 ◽

Vol 18 (1) ◽

pp. 145-153 ◽

Cited By ~ 849

Author(s):

Dmitry Usoskin ◽

Alessandro Furlan ◽

Saiful Islam ◽

Hind Abdo ◽

Peter Lönnerberg ◽

...

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Sensory Neuron ◽

Large Scale ◽

Single Cell RNA Sequencing


A map of tumor–host interactions in glioma at single-cell resolution

GigaScience ◽

10.1093/gigascience/giaa109 ◽

2020 ◽

Vol 9 (10) ◽

Cited By ~ 3

Author(s):

Francesca Pia Caruso ◽

Luciano Garofano ◽

Fulvio D'Angelo ◽

Kai Yu ◽

Fuchou Tang ◽

...

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Cross Talk ◽

Large Scale ◽

Expression Profiles ◽

Gene Expression Profiles ◽

Sequencing Data ◽

Host Interaction ◽

Receptor Interactions ◽

Single Cell RNA Sequencing

ABSTRACT Background: Single-cell RNA sequencing is the reference technique for characterizing the heterogeneity of the tumor microenvironment. The composition of the various cell types making up the microenvironment can significantly affect the way in which the immune system activates cancer rejection mechanisms. Understanding the cross-talk signals between immune cells and cancer cells is of fundamental importance for the identification of immuno-oncology therapeutic targets. Results: We present a novel method, single-cell Tumor–Host Interaction tool (scTHI), to identify significantly activated ligand–receptor interactions across clusters of cells from single-cell RNA sequencing data. We apply our approach to uncover the ligand–receptor interactions in glioma using 6 publicly available human glioma datasets encompassing 57,060 gene expression profiles from 71 patients. By leveraging this large-scale collection we show that unexpected cross-talk partners are highly conserved across different datasets in the majority of the tumor samples. This suggests that shared cross-talk mechanisms exist in glioma. Conclusions: Our results provide a complete map of the active tumor–host interaction pairs in glioma that can be therapeutically exploited to reduce the immunosuppressive action of the microenvironment in brain tumor.


SSCC: a novel computational framework for rapid and accurate clustering large single cell RNA-seq data

10.1101/344242 ◽

2018 ◽

Cited By ~ 2

Author(s):

Xianwen Ren ◽

Liangtao Zheng ◽

Zemin Zhang

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Large Scale ◽

Random Projection ◽

RNA Seq ◽

Sequencing Data ◽

Computational Framework ◽

Human Blood Cells ◽

Single Cell RNA Sequencing ◽

Data Volume

ABSTRACT Clustering is a prevalent means of analyzing single-cell RNA sequencing data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are a pressing need. Here we propose a new clustering framework based on random projection and feature construction for large-scale single-cell RNA sequencing data, which greatly improves clustering accuracy, robustness and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, our method achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while consuming only 66% of the memory used by the widely used software package SC3. Compared to k-means, the accuracy improvement can reach 3-fold depending on the dataset. An R implementation of the framework is available from https://github.com/Japrin/sscClust.


A bivariate zero-inflated negative binomial model for identifying underlying dependence with application to single cell RNA sequencing data

10.1101/2020.03.06.977728 ◽

2020 ◽

Author(s):

Hunyong Cho ◽

Chuwen Liu ◽

John S. Preisser ◽

Di Wu

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Latent Variable ◽

Large Scale ◽

Negative Binomial ◽

Model Fitting ◽

Sequencing Data ◽

Excess Zeros ◽

Binomial Distributions ◽

Single Cell RNA Sequencing

Summary Measuring gene-gene dependence in single-cell RNA sequencing (scRNA-seq) count data is often of interest and remains challenging, because an unidentified portion of the zero counts represent non-detected RNA due to technical reasons. Conventional statistical methods that fail to account for technical zeros incorrectly measure the dependence among genes. To address this problem, we propose a bivariate zero-inflated negative binomial (BZINB) model constructed using a bivariate Poisson-gamma mixture with dropout indicators for the technical (excess) zeros. Parameters are estimated based on the EM algorithm and are used to measure the underlying dependence by decomposing the two sources of zeros. Compared to existing models, the proposed BZINB model is specifically designed for estimating dependence and is more flexible, while preserving the marginal zero-inflated negative binomial distributions. Additionally, it has a simple latent variable framework, allowing parameters to have clear and intuitive interpretations, and its computation is feasible with large-scale data. Using a recent scRNA-seq dataset, we illustrate model fitting and how the model-based measures can be different from naive measures. The inferential ability of the proposed model is evaluated in a simulation study. An R package ‘bzinb’ is available on CRAN.
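The "two sources of zeros" in this summary can be made concrete for a single gene with the standard univariate zero-inflated negative binomial distribution. The sketch below covers only that marginal case, not the bivariate BZINB model or the `bzinb` package API; the function names and the parameterization (dropout probability `pi`, size `r`, success probability `p`) are assumptions chosen for illustration.

```python
import math


def nb_pmf(y, r, p):
    """Negative binomial pmf with size r > 0 and success probability p:
    P(Y = y) = Gamma(y + r) / (Gamma(r) * y!) * p**r * (1 - p)**y."""
    log_pmf = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(p) + y * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def zinb_pmf(y, pi, r, p):
    """Zero-inflated negative binomial pmf: with probability pi the count is
    a technical (excess) zero, otherwise it is drawn from NB(r, p)."""
    if y == 0:
        # Zeros come from two sources: dropout, or a true NB zero.
        return pi + (1.0 - pi) * nb_pmf(0, r, p)
    return (1.0 - pi) * nb_pmf(y, r, p)


# Hypothetical parameters: 30% dropout, NB with r = 2 and p = 0.5.
p_zero_total = zinb_pmf(0, pi=0.3, r=2.0, p=0.5)
p_zero_biological = (1.0 - 0.3) * nb_pmf(0, r=2.0, p=0.5)
total_mass = sum(zinb_pmf(y, 0.3, 2.0, 0.5) for y in range(500))
```

Here `p_zero_total - p_zero_biological` equals the dropout probability, which is the share of zeros a naive count-based measure would misattribute to true absence of expression.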
