CaMelia: imputation in single-cell methylomes based on local similarities between cells

Author(s):  
Jianxiong Tang ◽  
Jianxiao Zou ◽  
Mei Fan ◽  
Qi Tian ◽  
Jiyang Zhang ◽  
...  

Abstract Motivation Single-cell DNA methylation sequencing detects methylation levels with single-cell resolution, and this technology is advancing our understanding of the regulation of gene expression through epigenetic modifications. However, almost all current technologies suffer from the inherent problem of low coverage of CpGs. Therefore, addressing the inherent sparsity of raw data is essential for quantitative analysis of the whole genome. Results Here, we report CaMelia, a CatBoost gradient boosting method for predicting the missing methylation states based on the locally paired similarity of intercellular methylation patterns. On real single-cell methylation datasets, CaMelia yielded significant imputation performance gains over previous methods. Furthermore, applying the imputed data to the downstream analysis of cell-type identification, we found that CaMelia helped to discover more intercellular differentially methylated loci that were masked by the sparsity in raw data, and the clustering results demonstrated that CaMelia could preserve cell-cell relationships and improve the identification of cell types and cell subpopulations. Availability and implementation Python code is available at https://github.com/JxTang-bioinformatics/CaMelia. Supplementary information Supplementary data are available at Bioinformatics online.

2019 ◽  
Author(s):  
Marcus Alvarez ◽  
Elior Rahmani ◽  
Brandon Jew ◽  
Kristina M. Garske ◽  
Zong Miao ◽  
...  

Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. Contrary to single-cell RNA seq (scRNA-seq), we observe that snRNA-seq is commonly subject to contamination by high amounts of extranuclear background RNA, which can lead to identification of spurious cell types in downstream clustering analyses if overlooked. We present a novel approach to remove debris-contaminated droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: 1) human differentiating preadipocytes in vitro, 2) fresh mouse brain tissue, and 3) human frozen adipose tissue (AT) from six individuals. All three data sets showed various degrees of extranuclear RNA contamination. We observed that existing methods fail to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq data, we also successfully applied DIEM to single-cell data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data fast and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem.


2020 ◽  
Vol 52 (10) ◽  
pp. 468-477
Author(s):  
Alexander C. Zambon ◽  
Tom Hsu ◽  
Seunghee Erin Kim ◽  
Miranda Klinck ◽  
Jennifer Stowe ◽  
...  

Much of our understanding of the regulatory mechanisms governing the cell cycle in mammals has relied heavily on methods that measure the aggregate state of a population of cells. While instrumental in shaping our current understanding of cell proliferation, these approaches mask the genetic signatures of rare subpopulations such as quiescent (G0) and very slowly dividing (SD) cells. Results described in this study and those of others using single-cell analysis reveal that even in clonally derived immortalized cancer cells, ∼1–5% of cells can exhibit G0 and SD phenotypes. Therefore, to enable the study of these rare cell phenotypes, we established an integrated molecular, computational, and imaging approach to track, isolate, and genetically perturb single cells as they proliferate. A genetically encoded cell-cycle reporter (K67p-FUCCI) was used to track single cells as they traversed the cell cycle. A set of R scripts was written to quantify K67p-FUCCI over time. To enable further study of G0 and SD phenotypes, we retrofitted a live cell imaging system with a micromanipulator to enable single-cell targeting for functional validation studies. Single-cell analysis revealed HT1080 and MCF7 cells had a doubling time of ∼24 and ∼48 h, respectively, with high duration variability in G1 and G2 phases. Direct single-cell microinjection of mRNA encoding (GFP) achieves detectable GFP fluorescence within ∼5 h in both cell types. These findings, coupled with the possibility of targeting several hundreds of single cells, improve throughput and sensitivity over conventional methods to study rare cell subpopulations.


Author(s):  
Samuel Melton ◽  
Sharad Ramanathan

Abstract Motivation Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation https://github.com/smelton/SMD. Supplementary information Supplementary data are available at Bioinformatics online.


2019 ◽  
Vol 36 (6) ◽  
pp. 1779-1784 ◽  
Author(s):  
Chuanqi Wang ◽  
Jun Li

Abstract Motivation Scaling by sequencing depth is usually the first step of analysis of bulk or single-cell RNA-seq data, but estimating sequencing depth accurately can be difficult, especially for single-cell data, risking the validity of downstream analysis. It is thus of interest to eliminate the use of sequencing depth and analyze the original count data directly. Results We call an analysis method ‘scale-invariant’ (SI) if it gives the same result under different estimates of sequencing depth and hence can use the original count data without scaling. For the problem of classifying samples into pre-specified classes, such as normal versus cancerous, we develop a deep-neural-network based SI classifier named scale-invariant deep neural-network classifier (SINC). On nine bulk and single-cell datasets, the classification accuracy of SINC is better than or competitive with the best of eight other classifiers. SINC is easier to use and more reliable on data where proper sequencing depth is hard to determine. Availability and implementation The source code of SINC is available at https://www.nd.edu/~jli9/SINC.zip. Supplementary information Supplementary data are available at Bioinformatics online.
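
The scale-invariance property defined above can be illustrated with a toy classifier. The sketch below is not SINC (which is a deep neural network); it swaps in a nearest-centroid rule under cosine similarity, chosen only because cosine similarity is unchanged when a count vector is multiplied by a positive constant. The centroids, the sample, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; unchanged if u or v is multiplied by a positive constant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(counts, centroids):
    """Return the label of the centroid most similar to the count vector."""
    return max(centroids, key=lambda label: cosine(counts, centroids[label]))


# Hypothetical class centroids over three genes (illustrative numbers only).
centroids = {"normal": [10, 1, 1], "cancer": [1, 10, 5]}
sample = [3, 40, 18]

# Rescale the same sample by several assumed sequencing-depth factors.
# A scale-invariant method must return the same label every time.
labels = {classify([c * depth for c in sample], centroids)
          for depth in (0.5, 1, 7, 100)}
print(labels)
```

Because the predicted label does not depend on the depth factor, such a rule can be applied to raw counts without first estimating sequencing depth, which is the property the abstract calls scale-invariant.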


2020 ◽  
Vol 36 (11) ◽  
pp. 3585-3587
Author(s):  
Lin Wang ◽  
Francisca Catalan ◽  
Karin Shamardani ◽  
Husam Babikir ◽  
Aaron Diaz

Abstract Summary Single-cell data are being generated at an accelerating pace. How best to project data across single-cell atlases is an open problem. We developed a boosted learner that overcomes the greatest challenge with status quo classifiers: low sensitivity, especially when dealing with rare cell types. By comparing novel and published data from distinct scRNA-seq modalities that were acquired from the same tissues, we show that this approach preserves cell-type labels when mapping across diverse platforms. Availability and implementation https://github.com/diazlab/ELSA Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Qinglin Mei ◽  
Guojun Li ◽  
Zhengchang Su

Abstract Motivation Recent breakthroughs of single-cell RNA sequencing (scRNA-seq) technologies offer an exciting opportunity to identify heterogeneous cell types in complex tissues. However, the unavoidable biological noise and technical artifacts in scRNA-seq data as well as the high dimensionality of expression vectors make the problem highly challenging. Consequently, although numerous tools have been developed, their accuracy remains to be improved. Results Here, we introduce a novel clustering algorithm and tool RCSL (Rank Constrained Similarity Learning) to accurately identify various cell types using scRNA-seq data from a complex tissue. RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types. RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbour representation of a cell as its local similarity. The overall similarity of a cell to other cells is a linear combination of its global similarity and local similarity. RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized. Each block-diagonal submatrix is a cell cluster/type, corresponding to a connected component in the cognate similarity graph. When tested on 16 benchmark scRNA-seq datasets in which the cell types are well-annotated, RCSL substantially outperformed six state-of-the-art methods in accuracy and robustness as measured by three metrics. Availability The RCSL algorithm is implemented in R and can be freely downloaded at https://github.com/QinglinMei/[email protected], [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
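
Two ingredients named in this abstract, Spearman rank correlation as a global similarity and clusters read off as connected components of a similarity graph, can be sketched in a few lines. This is not the RCSL algorithm: it omits the learned local similarity and the rank-constrained block-diagonal optimisation, and it swaps in a fixed cut-off (the hypothetical `threshold` parameter) to build the graph. The expression values are made up.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks.

    Assumes neither vector is constant (otherwise the denominator is zero).
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def connected_components(similarity, threshold):
    """Label cells by connected component of the graph with edges >= threshold."""
    n = len(similarity)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for other in range(n):
                if labels[other] == -1 and similarity[node][other] >= threshold:
                    labels[other] = current
                    stack.append(other)
        current += 1
    return labels


# Four hypothetical cells measured over five genes.
cells = [
    [9, 7, 5, 1, 0],
    [8, 6, 4, 2, 1],
    [0, 1, 5, 7, 9],
    [1, 2, 4, 6, 8],
]
similarity = [[spearman(a, b) for b in cells] for a in cells]
labels = connected_components(similarity, threshold=0.5)
print(labels)
```

In this toy input the first two cells rank the genes identically and the last two rank them in the opposite order, so the graph splits into two components, one per group.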


2021 ◽  
Author(s):  
Hanbyeol Kim ◽  
Joongho Lee ◽  
Keunsoo Kang ◽  
Seokhyun Yoon

Abstract Cell type identification is a key step in the downstream analysis of single-cell RNA-seq experiments. Gene expression is indispensable information for this task: it is used to cluster cells, train the model, and set rejection thresholds. The problem is that expression levels are subject to batch effects arising from different platforms and preprocessing. We present MarkerCount, which uses the number of markers expressed, regardless of their expression level, to initially identify cell types and then reassigns cell types on a cluster basis. MarkerCount works in both reference-based and marker-based modes: the latter uses only existing lists of markers, while the former requires a pre-annotated dataset to train the model. Its performance was evaluated and compared with existing identifiers, both marker-based and reference-based, that can be customized with publicly available datasets and marker databases. The results show that MarkerCount provides stable performance compared with other reference-based and marker-based cell type identifiers.
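
The two steps described in this abstract, counting expressed markers irrespective of expression level and then reassigning labels per cluster, can be sketched as follows. This is a minimal illustration under stated assumptions, not the MarkerCount implementation: the function names, the `min_count` rejection rule, the majority-vote reassignment, and the marker lists are all hypothetical choices made for the example.

```python
from collections import Counter


def marker_count_label(expression, markers, min_count=1):
    """Pick the cell type with the most markers detected (count > 0).

    Expression levels are ignored beyond presence/absence; cells with fewer
    than min_count detected markers for the best type are left unassigned.
    """
    scores = {
        cell_type: sum(1 for gene in genes if expression.get(gene, 0) > 0)
        for cell_type, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_count else "unassigned"


def reassign_by_cluster(labels, clusters):
    """Replace each cell's label with the majority label of its cluster."""
    votes = {}
    for label, cluster in zip(labels, clusters):
        votes.setdefault(cluster, Counter())[label] += 1
    majority = {c: counter.most_common(1)[0][0] for c, counter in votes.items()}
    return [majority[c] for c in clusters]


# Illustrative marker lists and per-cell counts (made-up values).
markers = {
    "T cell": ["CD3D", "CD3E", "CD2"],
    "B cell": ["CD79A", "MS4A1", "CD19"],
}
cells = [
    {"CD3D": 50, "CD3E": 1, "MS4A1": 2},
    {"CD3D": 1, "CD79A": 3, "CD19": 1},
    {"CD3E": 4, "CD2": 1},
]
clusters = [0, 0, 0]  # assume an upstream clustering put all three cells together

initial = [marker_count_label(cell, markers) for cell in cells]
final = reassign_by_cluster(initial, clusters)
print(initial, final)
```

Here the second cell is first labelled by its two detected B cell markers, then relabelled to match the majority of its cluster. Counting presence rather than level means the very high `CD3D` count in the first cell carries no more weight than a single read, which is how such a scheme sidesteps platform-dependent expression scales.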


2021 ◽  
Author(s):  
Ye Zheng ◽  
Siqi Shen ◽  
Sündüz Keleş

Abstract Single-cell high-throughput chromatin conformation capture methodologies (scHi-C) enable profiling long-range genomic interactions at the single-cell resolution; however, data from these technologies are prone to technical noise and bias that, when unaccounted for, hinder downstream analysis. Here we developed a fast band normalization approach, BandNorm, and a deep generative modeling framework, 3DVI, to explicitly account for scHi-C specific technical biases. We present robust performances of BandNorm and 3DVI compared to existing state-of-the-art methods. BandNorm is effective in separating cell types, identification of interaction features, and recovery of cell-cell relationship, whereas de-noising by 3DVI successfully enables 3D compartments and domains recovery, especially for rare cell types.


2021 ◽  
Author(s):  
Jiayi Dong ◽  
Yin Zhang ◽  
Fei Wang

Abstract Background: With the development of modern sequencing technology, hundreds of thousands of single-cell RNA-sequencing (scRNA-seq) profiles make it possible to explore heterogeneity at the cell level, but analyzing them faces the challenges of high dimensionality and high sparsity. Dimensionality reduction is essential for downstream analysis, such as clustering to identify cell subpopulations. Usually, dimensionality reduction follows an unsupervised approach. Results: In this paper, we introduce a semi-supervised dimensionality reduction method named scSemiAE, which is based on an autoencoder model. It transfers the information contained in available datasets with cell subpopulation labels to guide the search for better low-dimensional representations, which can ease further analysis. Conclusions: Experiments on five public datasets show that scSemiAE outperforms both unsupervised and semi-supervised baselines, whether the transferred information, embodied in the number of labeled cells and labeled cell subpopulations, is large or small.


2021 ◽  
Vol 32 (3) ◽  
pp. 614-627
Author(s):  
Amin Abedini ◽  
Yuan O. Zhu ◽  
Shatakshee Chatterjee ◽  
Gabor Halasz ◽  
Kishor Devalaraja-Narashimha ◽  
...  

Background Microscopic analysis of urine sediment is probably the most commonly used diagnostic procedure in nephrology. The urinary cells, however, have not yet undergone careful unbiased characterization. Methods Single-cell transcriptomic analysis was performed on 17 urine samples obtained from five subjects at two different occasions, using both spot and 24-hour urine collection. A pooled urine sample from multiple healthy individuals served as a reference control. In total 23,082 cells were analyzed. Urinary cells were compared with human kidney and human bladder datasets to understand similarities and differences among the observed cell types. Results Almost all kidney cell types can be identified in urine, such as podocyte, proximal tubule, loop of Henle, and collecting duct, in addition to macrophages, lymphocytes, and bladder cells. The urinary cell–type composition was subject specific and reasonably stable using different collection methods and over time. Urinary cells clustered with kidney and bladder cells, such as urinary podocytes with kidney podocytes, and principal cells of the kidney and urine, indicating their similarities in gene expression. Conclusions A reference dataset for cells in human urine was generated. Single-cell transcriptomics enables detection and quantification of almost all types of cells in the kidney and urinary tract.

