GPseudoRank: a permutation sampler for single cell orderings

2017
Author(s): Magdalena E Strauß, John E Reid, Lorenz Wernisch

Abstract
Motivation: A number of pseudotime methods have provided point estimates of the ordering of cells for scRNA-seq data. A still limited number of methods also model the uncertainty of the pseudotime estimate. However, there is still a need for a method to sample from complicated and multi-modal distributions of orders, and to estimate changes in the amount of the uncertainty of the order during the course of a biological development, as this can support the selection of suitable cells for the clustering of genes or for network inference.
Results: In an application to a microarray data set our proposed method, GPseudoRank, identifies two modes of the distribution, each of them corresponding to point estimates of orders obtained by a different established method. In an application to scRNA-seq data we demonstrate the potential of GPseudoRank to identify phases of lower and higher pseudotime uncertainty during a biological process. GPseudoRank also correctly identifies cells precocious in their antiviral response.
Availability and implementation: Our method is available on GitHub: https://github.com/magStra/GPseudoRank.
Contact: magdalena.strauss@mrc-bsu.cam.ac.uk
Supplementary information: Supplementary materials are available.

2020
Author(s): Jianhao Peng, Ullas V. Chembazhi, Sushant Bangru, Ian M. Traniello, Auinash Kalsotra, ...

Abstract
Motivation: With the use of single-cell RNA sequencing (scRNA-Seq) technologies, it is now possible to acquire gene expression data for each individual cell in samples containing up to millions of cells. These cells can be further grouped into different states along an inferred cell differentiation path, which are potentially characterized by similar, but distinct enough, gene regulatory networks (GRNs). Hence, it would be desirable for scRNA-Seq GRN inference methods to capture the GRN dynamics across cell states. However, current GRN inference methods produce a unique GRN per input dataset (or independent GRNs per cell state), failing to capture these regulatory dynamics.
Results: We propose a novel single-cell GRN inference method, named SimiC, that jointly infers the GRNs corresponding to each state. SimiC models the GRN inference problem as a LASSO optimization problem with an added similarity constraint on the GRNs associated with contiguous cell states, which captures the inter-cell-state homogeneity. We show on mouse hepatocyte single-cell data generated after partial hepatectomy that, contrary to previous GRN methods for scRNA-Seq data, SimiC is able to capture the transcription factor (TF) dynamics across liver regeneration, as well as the cell-level behavior for the regulatory program of each TF across cell states. In addition, on a honey bee scRNA-Seq experiment, SimiC is able to capture the increased heterogeneity of cells in whole-brain tissue with respect to a regional analysis tissue, and the TFs associated specifically with each sequenced tissue.
Availability: SimiC is written in Python and includes an R API. It can be downloaded from https://github.com/jianhao2016/[email protected], [email protected]
Supplementary information: Supplementary data are available at the code repository.
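As a rough illustration of the kind of objective described above, a LASSO problem with a similarity penalty across contiguous cell states could be written as follows. The notation is an assumption made here for illustration and is not taken from the abstract: K ordered cell states, X_k and Y_k as the TF and target-gene expression matrices for the n_k cells in state k, W_k as the GRN weight matrix for state k, and lambda_1, lambda_2 as tuning parameters. The squared-Frobenius form of the similarity term is also an assumption.

```latex
\min_{W_1,\dots,W_K}\;
  \sum_{k=1}^{K} \frac{1}{n_k}\,\lVert Y_k - X_k W_k \rVert_F^2
  \;+\; \lambda_1 \sum_{k=1}^{K} \lVert W_k \rVert_1
  \;+\; \lambda_2 \sum_{k=1}^{K-1} \lVert W_k - W_{k+1} \rVert_F^2
```

In this sketch the first term is the per-state regression fit, the second enforces sparsity as in a standard LASSO, and the third discourages large differences between the networks of neighbouring states. Setting lambda_2 to zero would reduce it to independent LASSO problems, one per cell state.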


Author(s): Andres M Cifuentes-Bernal, Vu Vh Pham, Xiaomei Li, Lin Liu, Jiuyong Li, ...

Abstract
Motivation: microRNAs (miRNAs) are important gene regulators and they are involved in many biological processes, including cancer progression. Therefore, correctly identifying miRNA–mRNA interactions is a crucial task. To this end, a huge number of computational methods have been developed, but they mainly use the data at one snapshot and ignore the dynamics of a biological process. The recent development of single cell data and the boom in the exploration of cell trajectories using the ‘pseudotime’ concept have inspired us to develop a pseudotime-based method to infer the miRNA–mRNA relationships characterizing a biological process by taking into account the temporal aspect of the process.
Results: We have developed a novel approach, called pseudotime causality, to find the causal relationships between miRNAs and mRNAs during a biological process. We have applied the proposed method to both single cell and bulk sequencing datasets for Epithelial to Mesenchymal Transition, a key process in cancer metastasis. The evaluation results show that our method significantly outperforms existing methods in finding miRNA–mRNA interactions in both single cell and bulk data. The results suggest that utilizing the pseudotemporal information from the data helps reveal the gene regulation in a biological process much better than using the static information.
Availability and implementation: R scripts and datasets can be found at https://github.com/AndresMCB/PTC.
Supplementary information: Supplementary data are available at Bioinformatics online.


Author(s): I.Y. Boyko, D.S. Anisimov, L.L. Smolyakova, M.A. Ryazanov

In modern biomedical research aimed at finding methods for the early diagnosis of cancer, microarrays containing certain biological information about patients are used. Based on these data, patients are assigned to one of two classes, corresponding to the presence or absence of a diagnosis. In solving this problem, one of the steps with a decisive influence on the quality of classification is the selection of significant features. This paper proposes a criterion for the selection of significant features based on the ledge-coefficient of correlation. The ledge-coefficient was previously used to estimate the degree of interrelation between numerical and binary features. For two sets of microarray data, comparative examples of binary classification are presented using three feature selection algorithms, three dimensionality reduction methods, and six classification models. Using the ledge-criterion for feature selection gave a classification quality comparable to that of common feature selection methods such as the t-test and U-test. For the peptide microarray data set considered in the paper, the effectiveness of the projection to latent structures method had previously been established. Using this method in combination with the selection of significant features by the ledge-criterion gave a higher classification quality measure.


2019, Vol 56 (2), pp. 117-138
Author(s): Małgorzata Ćwiklińska-Jurkowska

Summary: The usefulness of combining methods is examined using the example of microarray cancer data sets, where expression levels of huge numbers of genes are reported. Problems of discrimination into two groups are examined on three data sets relating to the expression of huge numbers of genes. For the three examined microarray data sets, the cross-validation errors evaluated on the remaining half of the whole data set, not used earlier for the selection of genes, were used as measures of classifier performance. Common single procedures for the selection of genes, Prediction Analysis of Microarrays (PAM) and Significance Analysis of Microarrays (SAM), were compared with the fusion of eight selection procedures, or of a smaller subset of five of them, excluding SAM or PAM. Merging five or eight selection methods gave similar results. Based on the misclassification rates for the three examined microarray data sets, for any examined ensemble of classifiers, the combining of gene selection methods was not superior to single PAM or SAM selection for two of the examined data sets. Additionally, the procedure of heterogeneous combining of five base classifiers (k-nearest neighbors, SVM linear and SVM radial with parameter c=1, shrunken centroids regularized classifier (SCRDA) and nearest mean classifier) proved to significantly outperform resampling classifiers such as bagging decision trees. Heterogeneously combined classifiers also outperformed double bagging for some ranges of gene numbers and data sets, but merging is generally not superior to random forests. The preliminary step of combining gene rankings was generally not essential for the performance for either heterogeneously or homogeneously combined classifiers.


2020
Author(s): Hongyu Li, Zhichao Xu, Taylor Adams, Naftali Kaminski, Hongyu Zhao

Abstract
Motivation: Recent development of single cell sequencing technologies has made it possible to identify genes with different expression (DE) levels at the cell type level between different groups of samples. However, the often-low sample size of single cell data limits the statistical power to identify DE genes.
Results: In this article, we propose to borrow information through known biological networks. Our approach is based on a Markov Random Field (MRF) model to appropriately accommodate gene network information as well as dependencies among cells to identify cell-type specific DE genes. We implement an Expectation-Maximization (EM) algorithm with mean field-like approximation to estimate model parameters and a Gibbs sampler to infer DE status. A simulation study shows that our method has better power to detect cell-type specific DE genes than conventional methods while appropriately controlling the type I error rate. The usefulness of our method is demonstrated through its application to study the pathogenesis and biological processes of idiopathic pulmonary fibrosis (IPF) using a single-cell RNA-sequencing (scRNA-seq) data set, which contains 18,150 protein-coding genes across 38 cell types on lung tissues from 32 IPF patients and 28 normal controls.
Availability: The algorithm is implemented in R. The source code can be downloaded at https://github.com/eddiehli/[email protected]
Supplementary information: Supplementary data are available online.


2019
Author(s): Atul Deshpande, Li-Fang Chu, Ron Stewart, Anthony Gitter

Abstract
Advances in single-cell transcriptomics enable measuring the gene expression of individual cells, allowing cells to be ordered by their state in a dynamic biological process. Many algorithms assign ‘pseudotimes’ to each cell, representing the progress along the biological process. Ordering the expression data according to such pseudotimes can be valuable for understanding the underlying regulator-gene interactions in a biological process, such as differentiation. However, the distribution of cells sampled along a transitional process, and hence that of the pseudotimes assigned to them, is not uniform. This prevents using many standard mathematical methods for analyzing the ordered gene expression states. We present Single-Cell Inference of Networks using Granger Ensembles (SCINGE), an algorithm for gene regulatory network inference from single-cell gene expression data. Given ordered single-cell data, SCINGE uses kernel-based Granger Causality regression, which smooths the irregular pseudotimes and missing expression values. It then aggregates the predictions from an ensemble of regression analyses with a modified Borda method to compile a ranked list of candidate interactions between transcriptional regulators and their target genes. In two mouse embryonic stem cell differentiation case studies, SCINGE outperforms other contemporary algorithms for gene network reconstruction. However, a more detailed examination reveals caveats about transcriptional network reconstruction with single-cell RNA-seq data. Network inference methods, including SCINGE, may have near random performance for predicting the targets of many individual regulators even if the aggregate performance is good. In addition, in some cases including cells’ pseudotime values can hurt the performance of network reconstruction methods. A MATLAB implementation of SCINGE is available at https://github.com/gitter-lab/SCINGE.
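To make the rank-aggregation step more concrete, here is a minimal Python sketch of a plain Borda count over several ranked lists of candidate interactions. It is not SCINGE's modified Borda method, whose details the abstract does not give. The function name `borda_aggregate`, the string encoding of interactions, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Combine ranked lists with a plain Borda count (hypothetical helper).

    Each ranking lists candidate interactions from most to least confident.
    An item at position p in a list of length n earns n - p points.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties broken alphabetically so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], str(item)))


# Three hypothetical regression runs, each ranking the same candidate edges.
ensemble = [
    ["a->b", "c->d", "e->f"],
    ["c->d", "a->b", "e->f"],
    ["a->b", "e->f", "c->d"],
]
consensus = borda_aggregate(ensemble)
print(consensus)
```

A plain Borda count treats every ranking equally and ignores how strongly each run supports an edge, which is presumably one reason a modified variant is used in practice.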


2015
Author(s): Majid Mohammadi, Hossein Sharifi Noghabi, Ghosheh Abed Hodtani, Habib Rajabi Mashhadi

One of the central challenges in cancer research is identifying significant genes among thousands of others on a microarray. Since preventing the outbreak and progression of cancer is the ultimate goal in bioinformatics and computational biology, detecting the genes that are most involved is crucial. In this article, we propose a Maximum-Minimum Correntropy Criterion (MMCC) approach for the selection of biologically meaningful genes from microarray data sets. The approach is stable, fast, robust against diverse noise and outliers, and competitively accurate in comparison with other algorithms. Moreover, the optimal number of features for each data set is determined via an evolutionary optimization process. Through broad experimental evaluation on 25 commonly used microarray data sets, MMCC is shown to be significantly better than other well-known gene selection algorithms. Surprisingly, high classification accuracy with a Support Vector Machine (SVM) is achieved with fewer than 10 genes selected by MMCC in all of the cases.


Author(s): Irzam Sarfraz, Muhammad Asif, Joshua D Campbell

Abstract
Motivation: R Experiment objects such as the SummarizedExperiment or SingleCellExperiment are data containers for storing one or more matrix-like assays along with associated row and column data. These objects have been used to facilitate the storage and analysis of high-throughput genomic data generated from technologies such as single-cell RNA sequencing. One common computational task in many genomics analysis workflows is to perform subsetting of the data matrix before applying down-stream analytical methods. For example, one may need to subset the columns of the assay matrix to exclude poor-quality samples or subset the rows of the matrix to select the most variable features. Traditionally, a second object is created that contains the desired subset of assay from the original object. However, this approach is inefficient as it requires the creation of an additional object containing a copy of the original assay and leads to challenges with data provenance.
Results: To overcome these challenges, we developed an R package called ExperimentSubset, which is a data container that implements classes for efficient storage and streamlined retrieval of assays that have been subsetted by rows and/or columns. These classes are able to inherently provide data provenance by maintaining the relationship between the subsetted and parent assays. We demonstrate the utility of this package on a single-cell RNA-seq dataset by storing and retrieving subsets at different stages of the analysis while maintaining a lower memory footprint. Overall, the ExperimentSubset is a flexible container for the efficient management of subsets.
Availability and implementation: ExperimentSubset package is available at Bioconductor: https://bioconductor.org/packages/ExperimentSubset/ and GitHub: https://github.com/campbio/ExperimentSubset.
Supplementary information: Supplementary data are available at Bioinformatics online.
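The idea of keeping subsets as references to a parent assay rather than as copies can be sketched in a few lines of Python. This is a toy illustration only, not the ExperimentSubset API (which is an R package). The class `SubsetContainer` and all of its method names are hypothetical, and assays are plain lists of rows to keep the example dependency-free.

```python
class SubsetContainer:
    """Hypothetical container: subsets hold indices into a parent assay, not copies."""

    def __init__(self):
        self.assays = {}   # name -> full matrix (list of rows)
        self.subsets = {}  # name -> (parent name, row indices, column indices)

    def add_assay(self, name, matrix):
        self.assays[name] = matrix

    def create_subset(self, name, parent, rows, cols):
        # A parent may be a full assay or another subset, which preserves provenance.
        if parent not in self.assays and parent not in self.subsets:
            raise KeyError(f"unknown parent: {parent}")
        self.subsets[name] = (parent, list(rows), list(cols))

    def get(self, name):
        # Full assays are returned directly; subsets are materialised on demand.
        if name in self.assays:
            return self.assays[name]
        parent, rows, cols = self.subsets[name]
        data = self.get(parent)
        return [[data[r][c] for c in cols] for r in rows]

    def parent_of(self, name):
        return self.subsets[name][0] if name in self.subsets else None


container = SubsetContainer()
container.add_assay("counts", [[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Drop the second row (for example, a low-quality feature), keep all columns.
container.create_subset("filtered", "counts", rows=[0, 2], cols=[0, 1, 2])
# Subset of a subset: indices are relative to "filtered", not to "counts".
container.create_subset("variable", "filtered", rows=[1], cols=[0, 2])
print(container.get("variable"))
```

Storing only indices keeps memory use low and records which parent each subset came from, at the cost of recomputing the subset on every retrieval.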

