Computational methods for the integrative analysis of single-cell data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa042 ◽

2020 ◽

Cited By ~ 2

Author(s):

Mattia Forcato ◽

Oriana Romano ◽

Silvio Bicciato

Keyword(s):

Single Cell ◽

Computational Methods ◽

Genomic Data ◽

Integrative Analysis ◽

Joint Analysis ◽

New Wave ◽

Multimodal Signals ◽

And Function ◽

Genomic Signals ◽

Molecular Layers

Abstract: Recent advances in single-cell technologies are providing exciting opportunities for dissecting tissue heterogeneity and investigating cell identity, fate and function. This is a pristine, exploding field that is flooding biologists with a new wave of data, each with its own specificities in terms of complexity and information content. The integrative analysis of genomic data, collected at different molecular layers from diverse cell populations, holds promise to address the full-scale complexity of biological systems. However, the combination of different single-cell genomic signals is computationally challenging, as these data are intrinsically heterogeneous for experimental, technical and biological reasons. Here, we describe the computational methods for the integrative analysis of single-cell genomic data, with a focus on the integration of single-cell RNA sequencing datasets and on the joint analysis of multimodal signals from individual cells.

From single nuclei to whole genome assemblies

10.1101/625814 ◽

2019 ◽

Cited By ~ 3

Author(s):

Merce Montoliu-Nerin ◽

Marisol Sánchez-García ◽

Claudia Bergin ◽

Manfred Grabherr ◽

Barbara Ellis ◽

...

Keyword(s):

Single Cell ◽

Large Scale ◽

Genomic Data ◽

Life Cycles ◽

Genomic Research ◽

Metagenomic Data ◽

Model Organisms ◽

Genomic Study ◽

And Function ◽

Genome Assemblies

Summary: A large proportion of Earth's biodiversity constitutes organisms that cannot be cultured, have cryptic life-cycles and/or live submerged within their substrates [1–4]. Genomic data are key to unravel both their identity and function [5]. The development of metagenomic methods [6,7] and the advent of single cell sequencing [8–10] have revolutionized the study of life and function of cryptic organisms by upending the need for large and pure biological material, and allowing generation of genomic data from complex or limited environmental samples. Genome assemblies from metagenomic data have so far been restricted to organisms with small genomes, such as bacteria [11], archaea [12] and certain eukaryotes [13]. On the other hand, single cell technologies have allowed the targeting of unicellular organisms, attaining a better resolution than metagenomics [8,9,14–16]; moreover, they have allowed the genomic study of cells from complex organisms one cell at a time [17,18]. However, single cell genomics are not easily applied to multicellular organisms formed by consortia of diverse taxa, and the generation of specific workflows for sequencing and data analysis is needed to expand genomic research to the entire tree of life, including sponges [19], lichens [3,20], intracellular parasites [21,22], and plant endophytes [23,24]. Among the most important plant endophytes are the obligate mutualistic symbionts, arbuscular mycorrhizal (AM) fungi, that pose an additional challenge with their multinucleate coenocytic mycelia [25]. Here, the development of a novel single nuclei sequencing and assembly workflow is reported. This workflow allows, for the first time, the generation of reference genome assemblies from large scale, unbiased sorted, and sequenced AM fungal nuclei, circumventing tedious, and often impossible, culturing efforts. This method opens infinite possibilities for studies of evolution and adaptation in these important plant symbionts and demonstrates that reference genomes can be generated from complex non-model organisms by isolating only a handful of their nuclei.

Coupled co-clustering-based unsupervised transfer learning for the integrative analysis of single-cell genomic data

Briefings in Bioinformatics ◽

10.1093/bib/bbaa347 ◽

2020 ◽

Author(s):

Pengcheng Zeng ◽

Jiaxuan Wangwu ◽

Zhixiang Lin

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Genomic Data ◽

Integrative Analysis ◽

Data Sets ◽

Clustering Methods ◽

Genomic Features ◽

Multiple Data ◽

Multiple Data Sets ◽

Cell Data

Abstract: Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as single-cell RNA sequencing (scRNA-seq), single-cell ATAC sequencing (scATAC-seq) or sc-methylation data alone, and only a few have been developed for the integrative analysis of multiple data types. The integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. In this paper, we propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information theoretic co-clustering framework. In co-clustering, both the cells and the genomic features are simultaneously clustered. Clustering similar genomic features reduces the noise in single-cell data and facilitates transfer of knowledge across single-cell datasets. We applied coupleCoC for the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. Our method coupleCoC is also computationally efficient and can scale up to large datasets. Availability: The software and datasets are available at https://github.com/cuhklinlab/coupleCoC.
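
The co-clustering idea in this abstract can be illustrated with a small sketch. It assumes the standard information-theoretic co-clustering objective, in which a good joint grouping of cells (rows) and features (columns) loses as little mutual information as possible when the count table is collapsed onto the clusters. The function names, the toy count table, and the cluster labels are hypothetical, and this is not the coupleCoC implementation, which additionally couples several datasets.

```python
import math

def mutual_information(table):
    """Mutual information (in bits) between rows and columns of a count table."""
    total = sum(sum(row) for row in table)
    row_p = [sum(row) / total for row in table]
    col_p = [sum(row[j] for row in table) / total for j in range(len(table[0]))]
    mi = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            if count > 0:
                p = count / total
                mi += p * math.log2(p / (row_p[i] * col_p[j]))
    return mi

def aggregate(table, row_labels, col_labels):
    """Collapse the table by summing counts within each (row cluster, column cluster)."""
    n_r, n_c = max(row_labels) + 1, max(col_labels) + 1
    out = [[0.0] * n_c for _ in range(n_r)]
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            out[row_labels[i]][col_labels[j]] += count
    return out

def information_loss(table, row_labels, col_labels):
    """Mutual information lost by co-clustering; smaller means a better co-clustering."""
    return mutual_information(table) - mutual_information(
        aggregate(table, row_labels, col_labels))

# Hypothetical toy data: 4 cells (rows) x 4 genomic features (columns).
counts = [[5, 4, 0, 0],
          [4, 5, 0, 0],
          [0, 0, 6, 3],
          [0, 0, 3, 6]]
loss_block = information_loss(counts, [0, 0, 1, 1], [0, 0, 1, 1])
loss_mixed = information_loss(counts, [0, 1, 0, 1], [0, 1, 0, 1])
print(loss_block < loss_mixed)
```

A full algorithm would search over the row and column labels to minimise this loss; the sketch only scores two fixed candidate labelings.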

Integrative analysis of single cell genomics data by coupled nonnegative matrix factorizations

10.1101/312348 ◽

2018 ◽

Author(s):

Zhana Duren ◽

Xi Chen ◽

Mahdi Zamanighomi ◽

Wanwen Zeng ◽

Ansuman T Satpathy ◽

...

Keyword(s):

Single Cell ◽

Nonnegative Matrix ◽

Integrative Analysis ◽

Matrix Factorizations ◽

Joint Analysis ◽

Rna Seq ◽

Single Cell Genomics ◽

Clustering Problem ◽

Two Samples ◽

Different Types

Abstract: When different types of functional genomics data are generated on single cells from different samples of cells from the same heterogeneous population, the clustering of cells in the different samples should be coupled. We formulate this “coupled clustering” problem as an optimization problem, and propose the method of coupled nonnegative matrix factorizations (coupled NMF) for its solution. The method is illustrated by the integrative analysis of single cell RNA-seq and single cell ATAC-seq data. Significance Statement: Biological samples are often heterogeneous mixtures of different types of cells. Suppose we have two single cell data sets, each providing information on a different cellular feature and generated on a different sample from this mixture. Then, the clustering of cells in the two samples should be coupled, as both clusterings reflect the underlying cell types in the same mixture. This “coupled clustering” problem is a new problem not covered by existing clustering methods. In this paper we develop an approach for its solution based on the coupling of two nonnegative matrix factorizations. The method should be useful for integrative single cell genomics analysis tasks such as the joint analysis of single cell RNA-seq and single cell ATAC-seq data.
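
As background for the factorization step, the following is a minimal sketch of a single, uncoupled nonnegative matrix factorization using the classic multiplicative updates for squared error. It is not the coupled NMF method of this paper: the coupling term that links the two factorizations is not specified in the abstract and is deliberately left out. The function names, the toy matrix, and the choice of two components are assumptions made for illustration.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(x, w, h):
    """Squared Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((a - b) ** 2 for xr, wr in zip(x, wh) for a, b in zip(xr, wr))

def nmf(x, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a nonnegative matrix X (features x cells) as W (features x k) times H (k x cells)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)] for a in range(k)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)] for i in range(n)]
    return w, h

# Hypothetical toy matrix: 4 features (rows) x 4 cells (columns).
x = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 6, 3],
     [0, 0, 3, 6]]
w, h = nmf(x, k=2)
# Assign each cell to the component with the largest loading in its column of H.
labels = [max(range(2), key=lambda a: h[a][j]) for j in range(len(x[0]))]
print(labels, frobenius_error(x, w, h))
```

In a coupled setting, two such factorizations (one per data type) would be optimized jointly so that their cluster structures agree; how that agreement is enforced is the contribution of the paper and is not reproduced here.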

coupleCoC+: an information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

10.1101/2021.02.17.431728 ◽

2021 ◽

Author(s):

Pengcheng Zeng ◽

Zhixiang Lin

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Genomic Data ◽

Cell Types ◽

Integrative Analysis ◽

Computationally Efficient ◽

Information Theoretic ◽

Mouse Cortex ◽

Source Data ◽

Target Data

Abstract: Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, and mouse cortex sc-methylation and scRNA-seq data, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.

scAMACE: Model-based approach to the joint analysis of single-cell data on chromatin accessibility, gene expression and methylation

10.1101/2021.03.29.437485 ◽

2021 ◽

Author(s):

Jiaxuan Wangwu ◽

Zexuan Sun ◽

Zhixiang Lin

Keyword(s):

Gene Expression ◽

Single Cell ◽

Cell Types ◽

Chromatin Accessibility ◽

Integrative Analysis ◽

Joint Analysis ◽

Data Types ◽

Link Type ◽

Complex Biological Process ◽

Cell Data

Abstract: The advancement in technologies and the growth of available single-cell datasets motivate integrative analysis of multiple single-cell genomic datasets. Integrative analysis of multimodal single-cell datasets combines complementary information offered by single-omic datasets and can offer deeper insights into complex biological processes. Clustering methods that identify the unknown cell types are among the first few steps in the analysis of single-cell datasets, and they are important for downstream analysis built upon the identified cell types. We propose scAMACE for the integrative analysis and clustering of single-cell data on chromatin accessibility, gene expression and methylation. We demonstrate that cell types are better identified and characterized through analyzing the three data types jointly. We develop an efficient expectation-maximization (EM) algorithm to perform statistical inference, and evaluate our method on both simulation studies and real-data applications. We also provide the GPU implementation of scAMACE, making it scalable to large datasets. The software and datasets are available at https://github.com/cuhklinlab/scAMACE_py (Python implementation) and https://github.com/cuhklinlab/scAMACE (R implementation).
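
To make the EM step concrete, here is a generic sketch of expectation-maximization for a Bernoulli mixture over binarized features, which is one simple way to cluster cells with a model-based approach. It is not the scAMACE model, whose latent structure across the three data types is not described in the abstract. The function name, the toy data, and the numerical floors are assumptions.

```python
import math
import random

def em_bernoulli_mixture(data, k, n_iter=50, seed=0):
    """Fit a k-component Bernoulli mixture to binary rows with EM."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / k] * k
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    log_liks = []
    resp = []
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each cell.
        resp, total_ll = [], 0.0
        for x in data:
            log_p = []
            for c in range(k):
                lp = math.log(weights[c])
                for xj, pj in zip(x, probs[c]):
                    lp += math.log(pj if xj else 1.0 - pj)
                log_p.append(lp)
            top = max(log_p)
            norm = top + math.log(sum(math.exp(v - top) for v in log_p))
            total_ll += norm
            resp.append([math.exp(v - norm) for v in log_p])
        log_liks.append(total_ll)
        # M-step: update mixing weights and per-feature probabilities.
        for c in range(k):
            nc = max(sum(r[c] for r in resp), 1e-12)  # floor avoids division by zero
            weights[c] = max(nc / n, 1e-12)           # floor avoids log(0)
            probs[c] = [
                min(max(sum(r[c] * x[j] for r, x in zip(resp, data)) / nc, 1e-6), 1 - 1e-6)
                for j in range(d)
            ]
    return weights, probs, resp, log_liks

# Hypothetical toy data: 6 cells x 4 binarized features.
data = [[1, 1, 0, 0], [1, 1, 0, 1], [1, 0, 0, 0],
        [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 1, 0]]
weights, probs, resp, log_liks = em_bernoulli_mixture(data, k=2)
# Hard cluster labels from the responsibilities of the final E-step.
labels = [max(range(2), key=lambda c: r[c]) for r in resp]
print(labels, log_liks[-1])
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check when adapting the loop to a richer model.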

Coupled Co-clustering-based Unsupervised Transfer Learning for the Integrative Analysis of Single-Cell Genomic Data

10.1101/2020.03.28.013938 ◽

2020 ◽

Author(s):

Pengcheng Zeng ◽

Jiaxuan WangWu ◽

Zhixiang Lin

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Learning Algorithm ◽

Genomic Data ◽

Integrative Analysis ◽

Data Sets ◽

Clustering Methods ◽

Data Types ◽

Multiple Data ◽

Multiple Data Sets

Abstract: Unsupervised methods, such as clustering methods, are essential to the analysis of single-cell genomic data. Most current clustering methods are designed for one data type only, such as scRNA-seq, scATAC-seq or sc-methylation data alone, and only a few have been developed for the integrative analysis of multiple data types. Integrative analysis of multimodal single-cell genomic data sets leverages the power in multiple data sets and can deepen the biological insight. We propose a coupled co-clustering-based unsupervised transfer learning algorithm (coupleCoC) for the integrative analysis of multimodal single-cell data. Our proposed coupleCoC builds upon the information theoretic co-clustering framework. We applied coupleCoC for the integrative analysis of scATAC-seq and scRNA-seq data, sc-methylation and scRNA-seq data, and scRNA-seq data from mouse and human. We demonstrate that coupleCoC improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic data sets. The software and data sets are available at https://github.com/cuhklinlab/coupleCoC.

coupleCoC+: An information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data

PLoS Computational Biology ◽

10.1371/journal.pcbi.1009064 ◽

2021 ◽

Vol 17 (6) ◽

pp. e1009064

Author(s):

Pengcheng Zeng ◽

Zhixiang Lin

Keyword(s):

Single Cell ◽

Transfer Learning ◽

Genomic Data ◽

Cell Types ◽

Integrative Analysis ◽

Computationally Efficient ◽

Information Theoretic ◽

Mouse Cortex ◽

Source Data ◽

Target Data

Technological advances have enabled us to profile multiple molecular layers at unprecedented single-cell resolution and the available datasets from multiple samples or domains are growing. These datasets, including scRNA-seq data, scATAC-seq data and sc-methylation data, usually have different powers in identifying the unknown cell types through clustering. So, methods that integrate multiple datasets can potentially lead to a better clustering performance. Here we propose coupleCoC+ for the integrative analysis of single-cell genomic data. coupleCoC+ is a transfer learning method based on the information-theoretic co-clustering framework. In coupleCoC+, we utilize the information in one dataset, the source data, to facilitate the analysis of another dataset, the target data. coupleCoC+ uses the linked features in the two datasets for effective knowledge transfer, and it also uses the information of the features in the target data that are unlinked with the source data. In addition, coupleCoC+ matches similar cell types across the source data and the target data. By applying coupleCoC+ to the integrative clustering of mouse cortex scATAC-seq data and scRNA-seq data, mouse and human scRNA-seq data, mouse cortex sc-methylation and scRNA-seq data, and human blood dendritic cells scRNA-seq data from two batches, we demonstrate that coupleCoC+ improves the overall clustering performance and matches the cell subpopulations across multimodal single-cell genomic datasets. coupleCoC+ has fast convergence and it is computationally efficient. The software is available at https://github.com/cuhklinlab/coupleCoC_plus.

Computational Approaches to Predict the Non-canonical DNAs

Current Bioinformatics ◽

10.2174/1574893614666190126143438 ◽

2019 ◽

Vol 14 (6) ◽

pp. 470-479 ◽

Cited By ~ 3

Author(s):

Nazia Parveen ◽

Amen Shamim ◽

Seunghee Cho ◽

Kyeong Kyu Kim

Keyword(s):

Computational Methods ◽

Genetic Instability ◽

Computational Prediction ◽

Structure And Function ◽

Sequence Motifs ◽

Computational Approaches ◽

Functional Roles ◽

And Function ◽

Genetic Events ◽

Insight Into

Background: Although most nucleotides in the genome form canonical double-stranded B-DNA, many repeated sequences transiently present as non-canonical conformations (non-B DNA) such as triplexes, quadruplexes, Z-DNA, cruciforms, and slipped/hairpins. Those noncanonical DNAs (ncDNAs) are not only associated with many genetic events such as replication, transcription, and recombination, but are also related to the genetic instability that results in the predisposition to disease. Due to the crucial roles of ncDNAs in cellular and genetic functions, various computational methods have been implemented to predict sequence motifs that generate ncDNA. Objective: Here, we review strategies for the identification of ncDNA motifs across the whole genome, which is necessary for further understanding and investigation of the structure and function of ncDNAs. Conclusion: There is a great demand for computational prediction of non-canonical DNAs that play key functional roles in gene expression and genome biology. In this study, we review the currently available computational methods for predicting the non-canonical DNAs in the genome. Current studies not only provide an insight into the computational methods for predicting the secondary structures of DNA but also increase our understanding of the roles of non-canonical DNA in the genome.
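
As one concrete instance of motif-based prediction, the sketch below scans a sequence for putative G-quadruplex motifs using a widely used consensus pattern (four runs of at least three guanines separated by loops of one to seven nucleotides). Whether this particular pattern is among the methods covered by the review is an assumption; the function name and the test sequence are hypothetical, and only the given strand is scanned.

```python
import re

# Consensus pattern G3+ N1-7 G3+ N1-7 G3+ N1-7 G3+ (assumed here for illustration).
G4_PATTERN = re.compile(r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_g4_motifs(sequence):
    """Return (start, end, motif) for each non-overlapping match on the given strand."""
    seq = sequence.upper()
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]

# Hypothetical example: a telomere-like repeat flanked by AT-rich sequence.
hits = find_g4_motifs("ATATGGGTTAGGGTTAGGGTTAGGGATAT")
print(hits)
```

A match only indicates that the sequence is compatible with quadruplex formation; scanning the reverse complement for the C-rich counterpart and validating candidates experimentally are separate steps.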

Corrigendum: FB5P-seq: FACS-Based 5-Prime End Single-Cell RNA-seq for Integrative Analysis of Transcriptome and Antigen Receptor Repertoire in B and T Cells

Frontiers in Immunology ◽

10.3389/fimmu.2020.02047 ◽

2020 ◽

Vol 11 ◽

Author(s):

Noudjoud Attaf ◽

Iñaki Cervera-Marzal ◽

Chuang Dong ◽

Laurine Gil ◽

Amédée Renand ◽

...

Keyword(s):

T Cells ◽

Single Cell ◽

Antigen Receptor ◽

Integrative Analysis ◽

Rna Seq ◽

Receptor Repertoire ◽

Prime End

Discovery of multi-dimensional modules by integrative analysis of cancer genomic data

Nucleic Acids Research ◽

10.1093/nar/gks725 ◽

2012 ◽

Vol 40 (19) ◽

pp. 9379-9391 ◽

Cited By ~ 172

Author(s):

Shihua Zhang ◽

Chun-Chi Liu ◽

Wenyuan Li ◽

Hui Shen ◽

Peter W. Laird ◽

...

Keyword(s):

Genomic Data ◽

Integrative Analysis
