UMI-tools: Modelling sequencing errors in Unique Molecular Identifiers to improve quantification accuracy

K-mer counting with low memory consumption enables fast clustering of single-cell sequencing data without read alignment

10.1101/723833 ◽

2019 ◽

Author(s):

Christina Huan Shi ◽

Kevin Y. Yip

Keyword(s):

Single Cell ◽

State Of The Art ◽

Rna Seq ◽

Sequencing Data ◽

Memory Consumption ◽

Analysis Pipeline ◽

Cell Clusters ◽

Single Cell Sequencing ◽

Sequencing Errors ◽

Full Analysis

Abstract: K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
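To make the idea of counting k-mers and discarding likely false ones concrete, here is a minimal Python sketch. It is not the CQF-deNoise method: it uses an exact in-memory `Counter` and a single filtering pass after counting, whereas the abstract describes dynamic removal during counting. The function names and the `min_count` threshold are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads (exact counts)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_rare_kmers(counts, min_count):
    """Keep only k-mers seen at least min_count times.

    Rare k-mers are treated as probable sequencing errors; this is a
    simplifying assumption, not the criterion used by CQF-deNoise.
    """
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads: the second read ends in an "A" that mimics a sequencing error.
reads = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
counts = count_kmers(reads, k=4)
kept = drop_rare_kmers(counts, min_count=2)
```

In this toy input the k-mer `ACGA` occurs once and is dropped, while the four k-mers shared by all reads are kept.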


Standardizing unique molecular identifiers in SAM flags would benefit more than RNA-Seq

10.7287/peerj.preprints.2465v1 ◽

2016 ◽

Author(s):

Elisha D Roberson

Keyword(s):

Pcr Amplification ◽

Amplicon Sequencing ◽

Abundance Estimation ◽

Rna Seq ◽

Specific Expression ◽

High Coverage ◽

Different Types ◽

Pcr Duplicates ◽

Allele Specific ◽

Break Points

Unique Molecular Identifiers (UMIs) have been incorporated into RNA-Seq experiments to overcome issues with abundance estimation from samples that may have many PCR amplification cycles. However, the use of UMIs in many different types of sequencing experiments could be beneficial, including amplicon sequencing, ATAC-Seq, and ChIP-Seq. Furthermore, UMIs help to overcome artifacts in high-coverage DNA-Seq, and would enable more accurate RNA-Seq genotyping and allele-specific expression calculation. The main advantage of using UMIs is that identical molecules that are true PCR duplicates can be discerned from unique molecules with identical break points.
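The distinction drawn in the last sentence can be sketched in a few lines of Python. This is an illustrative sketch only: the read records and their field names (`chrom`, `pos`, `strand`, `umi`) are hypothetical, and UMIs are compared by exact match, with no correction for sequencing errors in the UMI itself.

```python
from collections import defaultdict


def collapse_pcr_duplicates(reads):
    """Group reads that share both mapping coordinates and UMI.

    Each group is treated as one original molecule; reads at the same
    coordinates with different UMIs stay in separate groups.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        groups[key].append(read["name"])
    return dict(groups)


# Three reads with identical break points; r1 and r2 also share a UMI.
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG"},
    {"name": "r2", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "AACG"},
    {"name": "r3", "chrom": "chr1", "pos": 100, "strand": "+", "umi": "TTGC"},
]
molecules = collapse_pcr_duplicates(reads)
```

Here `r1` and `r2` collapse into one molecule (PCR duplicates), while `r3` is kept as a distinct molecule even though it maps to the same position.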



CellBench: R/Bioconductor software for comparing single-cell RNA-seq analysis methods

Bioinformatics ◽

10.1093/bioinformatics/btz889 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2288-2290 ◽

Cited By ~ 3

Author(s):

Shian Su ◽

Luyi Tian ◽

Xueyi Dong ◽

Peter F Hickey ◽

Saskia Freytag ◽

...

Keyword(s):

Single Cell ◽

Ad Hoc ◽

Performance Metrics ◽

Single Cell Analysis ◽

Ground Truth ◽

Bioinformatic Analysis ◽

Rna Seq ◽

Effective Manner ◽

Cell Gene Expression ◽

The Many

Abstract. Motivation: Bioinformatic analysis of single-cell gene expression data is a rapidly evolving field. Hundreds of bespoke methods have been developed in the past few years to deal with various aspects of single-cell analysis and consensus on the most appropriate methods to use under different settings is still emerging. Benchmarking the many methods is therefore of critical importance and since analysis of single-cell data usually involves multi-step pipelines, effective evaluation of pipelines involving different combinations of methods is required. Current benchmarks of single-cell methods are mostly implemented with ad-hoc code that is often difficult to reproduce or extend, and exhaustive manual coding of many combinations is infeasible in most instances. Therefore, new software is needed to manage pipeline benchmarking. Results: The CellBench R software facilitates method comparisons in either a task-centric or combinatorial way to allow pipelines of methods to be evaluated in an effective manner. CellBench automatically runs combinations of methods, provides facilities for measuring running time and delivers output in tabular form which is highly compatible with tidyverse R packages for summary and visualization. Our software has enabled comprehensive benchmarking of single-cell RNA-seq normalization, imputation, clustering, trajectory analysis and data integration methods using various performance metrics obtained from data with available ground truth. CellBench is also amenable to benchmarking other bioinformatics analysis tasks. Availability and implementation: Available from https://bioconductor.org/packages/CellBench.
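The combinatorial benchmarking idea (run every combination of methods across pipeline stages, time each run, and return a table) can be sketched generically. The Python below is not the CellBench API, which is an R package; `run_combinations` and the toy stage functions are hypothetical names used only to illustrate the pattern.

```python
import itertools
import time


def run_combinations(data, *stages):
    """Run every combination of one method per stage and time each pipeline.

    Each stage is a dict mapping a method name to a function. The result is
    a list of row dicts, one per combination, suitable for a table.
    """
    rows = []
    for combo in itertools.product(*(stage.items() for stage in stages)):
        start = time.perf_counter()
        result = data
        for _, func in combo:
            result = func(result)
        elapsed = time.perf_counter() - start
        rows.append({
            "methods": tuple(name for name, _ in combo),
            "result": result,
            "seconds": elapsed,
        })
    return rows


# Toy stand-ins for a normalization stage and a summary stage.
normalizers = {
    "identity": lambda xs: xs,
    "scale_to_max": lambda xs: [x / max(xs) for x in xs],
}
summaries = {
    "mean": lambda xs: sum(xs) / len(xs),
    "max": max,
}
rows = run_combinations([1.0, 2.0, 4.0], normalizers, summaries)
```

Two methods in each of two stages yield four rows; adding a method to any stage extends the table without further manual coding.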


Addressing the looming identity crisis in single cell RNA-seq

10.1101/150524 ◽

2017 ◽

Cited By ~ 3

Author(s):

Megan Crow ◽

Anirban Paul ◽

Sara Ballouz ◽

Z. Josh Huang ◽

Jesse Gillis

Keyword(s):

Single Cell ◽

Large Scale ◽

Ad Hoc ◽

Meta Analysis ◽

High Specificity ◽

Cell Types ◽

Marker Genes ◽

Rna Seq ◽

Cortical Interneuron ◽

Large Sets

Abstract: Single cell RNA-sequencing technology (scRNA-seq) provides a new avenue to discover and characterize cell types, but the experiment-specific technical biases and analytic variability inherent to current pipelines may undermine the replicability of these studies. Meta-analysis of rapidly accumulating data is further hampered by the use of ad hoc naming conventions. Here we demonstrate our replication framework, MetaNeighbor, that allows researchers to quantify the degree to which cell types replicate across datasets, and to rapidly identify clusters with high similarity for further testing. We first measure the replicability of neuronal identity by comparing more than 13 thousand individual scRNA-seq transcriptomes, sampling with high specificity from within the data to define a range of robust practices. We then assess cross-dataset evidence for novel cortical interneuron subtypes identified by scRNA-seq and find that 24/45 cortical interneuron subtypes have evidence of replication in at least one other study. Identifying these putative replicates allows us to re-analyze the data for differential expression and provide lists of robust candidate marker genes. Across tasks we find that large sets of variably expressed genes can identify replicable cell types and subtypes with high accuracy, suggesting a general route forward for large-scale evaluation of scRNA-seq data.


MetaCell: analysis of single cell RNA-seq data using k-NN graph partitions

10.1101/437665 ◽

2018 ◽

Cited By ~ 10

Author(s):

Yael Baran ◽

Arnau Sebe-Pedros ◽

Yaniv Lubling ◽

Amir Giladi ◽

Elad Chomsky ◽

...

Keyword(s):

Single Cell ◽

Software Package ◽

Building Blocks ◽

Cell Populations ◽

Compact Groups ◽

Sampling Variance ◽

Statistical Control ◽

Rna Seq ◽

Cell Type ◽

Graph Partitions

Abstract: Single cell RNA-seq (scRNA-seq) has become the method of choice for analyzing mRNA distributions in heterogeneous cell populations. scRNA-seq only partially samples the cells in a tissue and the RNA in each cell, resulting in sparse data that challenge analysis. We develop a methodology that addresses scRNA-seq’s sparsity through partitioning the data into metacells: disjoint, homogenous and highly compact groups of cells, each exhibiting only sampling variance. Metacells constitute local building blocks for clustering and quantitative analysis of gene expression, while not enforcing any global structure on the data, thereby maintaining statistical control and minimizing biases. We illustrate the MetaCell framework by re-analyzing cell type and transcriptional gradients in peripheral blood and whole organism scRNA-seq maps. Our algorithms are implemented in the new MetaCell R/C++ software package.


An Interpretable Framework for Clustering Single-Cell RNA-Seq Datasets

10.1101/191254 ◽

2017 ◽

Author(s):

Jesse M. Zhang ◽

Jue Fan ◽

H. Christina Fan ◽

David Rosenfeld ◽

David N. Tse

Keyword(s):

Feature Selection ◽

Single Cell ◽

Computational Efficiency ◽

Software Package ◽

Rna Seq ◽

Cell Type ◽

Clustering Problem ◽

Unsupervised Analysis ◽

Multiple Levels ◽

Definition Of

Abstract. Background: With the recent proliferation of single-cell RNA-Seq experiments, several methods have been developed for unsupervised analysis of the resulting datasets. These methods often rely on unintuitive hyperparameters and do not explicitly address the subjectivity associated with clustering. Results: In this work, we present DendroSplit, an interpretable framework for analyzing single-cell RNA-Seq datasets that addresses both the clustering interpretability and clustering subjectivity issues. DendroSplit offers a novel perspective on the single-cell RNA-Seq clustering problem motivated by the definition of “cell type,” allowing us to cluster using feature selection to uncover multiple levels of biologically meaningful populations in the data. We analyze several landmark single-cell datasets, demonstrating both the method’s efficacy and computational efficiency. Conclusion: DendroSplit offers a clustering framework that is comparable to existing methods in terms of accuracy and speed but is novel in its emphasis on interpretability. We provide the full DendroSplit software package at https://github.com/jessemzhang/dendrosplit.


Exploring the single-cell RNA-seq analysis landscape with the scRNA-tools database

10.1101/206573 ◽

2017 ◽

Cited By ~ 2

Author(s):

Luke Zappia ◽

Belinda Phipson ◽

Alicia Oshlack

Keyword(s):

Single Cell ◽

Open Source ◽

Rapid Development ◽

Analysis Tool ◽

Rna Seq ◽

Link Type ◽

Analysis Tools ◽

Rapid Pace ◽

Cell Data ◽

Selection Of

Abstract: As single-cell RNA-sequencing (scRNA-seq) datasets have become more widespread the number of tools designed to analyse these data has dramatically increased. Navigating the vast sea of tools now available is becoming increasingly challenging for researchers. In order to better facilitate selection of appropriate analysis tools we have created the scRNA-tools database (www.scRNA-tools.org) to catalogue and curate analysis tools as they become available. Our database collects a range of information on each scRNA-seq analysis tool and categorises them according to the analysis tasks they perform. Exploration of this database gives insights into the areas of rapid development of analysis methods for scRNA-seq data. We see that many tools perform tasks specific to scRNA-seq analysis, particularly clustering and ordering of cells. We also find that the scRNA-seq community embraces an open-source approach, with most tools available under open-source licenses and preprints being extensively used as a means to describe methods. The scRNA-tools database provides a valuable resource for researchers embarking on scRNA-seq analysis and records of the growth of the field over time.

Author summary: In recent years single-cell RNA-sequencing technologies have emerged that allow scientists to measure the activity of genes in thousands of individual cells simultaneously. This means we can start to look at what each cell in a sample is doing instead of considering an average across all cells in a sample, as was the case with older technologies. However, while access to this kind of data presents a wealth of opportunities it comes with a new set of challenges. Researchers across the world have developed new methods and software tools to make the most of these datasets but the field is moving at such a rapid pace it is difficult to keep up with what is currently available. To make this easier we have developed the scRNA-tools database and website (www.scRNA-tools.org). Our database catalogues analysis tools, recording the tasks they can be used for, where they can be downloaded from and the publications that describe how they work. By looking at this database we can see that developers have focused on methods specific to single-cell data and that they embrace an open-source approach with permissive licensing, sharing of code and preprint publications.


Model-based deep embedding for constrained clustering analysis of single cell RNA-seq data

Nature Communications ◽

10.1038/s41467-021-22008-3 ◽

2021 ◽

Vol 12 (1) ◽

Author(s):

Tian Tian ◽

Jie Zhang ◽

Xiang Lin ◽

Zhi Wei ◽

Hakon Hakonarson

Keyword(s):

Single Cell ◽

Domain Knowledge ◽

Ad Hoc ◽

A Priori ◽

Unsupervised Clustering ◽

Rna Seq ◽

Clustering Methods ◽

Cell Type ◽

Type Assignment ◽

Deep Embedding

Abstract: Clustering is a critical step in single cell-based studies. Most existing methods support unsupervised clustering without the a priori exploitation of any domain knowledge. When confronted by the high dimensionality and pervasive dropout events of scRNA-Seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment. In such cases, the only recourse is for the user to manually and repeatedly tweak clustering parameters until acceptable clusters are found. Consequently, the path to obtaining biologically meaningful clusters can be ad hoc and laborious. Here we report a principled clustering method named scDCC, that integrates domain knowledge into the clustering step. Experiments on various scRNA-seq datasets from thousands to tens of thousands of cells show that scDCC can significantly improve clustering performance, facilitating the interpretability of clusters and downstream analyses, such as cell type assignment.


Effects of duplicated mapped read PCR artifacts on RNA-seq differential expression analysis based on qRNA-seq

10.1101/301259 ◽

2018 ◽

Author(s):

Anna C. Salzberg ◽

Jiafen Hu ◽

Elizabeth J. Conroy ◽

Nancy M. Cladel ◽

Robert M. Brucklacher ◽

...

Keyword(s):

Differentially Expressed Genes ◽

Differential Expression Analysis ◽

Pcr Amplification ◽

False Positives ◽

Differentially Expressed ◽

Rna Seq ◽

Gold Standard Method ◽

Pcr Duplicates ◽

Cdna Molecule

Abstract: Best practices for handling duplicated mapped reads in RNA-seq analyses have long been discussed, but a gold standard method has yet to be established, as such duplicates could originate from valid biological transcripts or they could be PCR-related artifacts. Here we used the NEXTflex™ qRNA-Seq™ (aka Molecular Indexing™) technology to identify PCR duplicates via the random attachment of unique molecular labels to each cDNA molecule prior to PCR amplification. We found that up to 64.3% of the single end and 19.3% of the mouse paired end duplicates originated from valid biological transcripts rather than PCR artifacts. For single end reads, either removing or retaining all duplicates resulted in a substantial number of false positives (up to 47.0%) and false negatives (up to 12.1%) in the sets of significantly differentially expressed genes. For paired end reads, only the alignment retaining all duplicates resulted in a substantial number of false positives. This is the first effort to evaluate the performance of qRNA-seq using ‘real-world’ biomedical samples, and we found that PCR duplicate identification provided minor benefits for paired end reads but greatly improved the sensitivity and specificity in the determination of the significantly differentially expressed genes for single end reads.
