genesorteR: Feature Ranking in Clustered Single Cell Data

2019 ◽  
Author(s):  
Mahmoud M Ibrahim ◽  
Rafael Kramann

Abstract: Marker genes identified in single cell experiments are expected to be highly specific to a certain cell type and highly expressed in that cell type. Detecting a gene by differential expression analysis does not necessarily satisfy those two conditions and is typically computationally expensive for large cell numbers. Here we present genesorteR, an R package that ranks features in single cell data in a manner consistent with the expected definition of marker genes in experimental biology research. We benchmark genesorteR using various data sets and show that it is distinctly more accurate in large single cell data sets compared to other methods. genesorteR is orders of magnitude faster than current implementations of differential expression analysis methods, can operate on data containing millions of cells and is applicable to both single cell RNA-Seq and single cell ATAC-Seq data. genesorteR is available at https://github.com/mahmoudibrahim/genesorteR.

2021 ◽  
Author(s):  
Xi Yu ◽  
Xiaofei Lv

Abstract: Tongue cancer, one of the most malignant oral cancers, is highly invasive and has a high risk of recurrence. At present, tongue cancer is often not recognized until an advanced stage, so the opportunity for early diagnosis is easily missed. It is therefore important to find markers that can predict the occurrence and progression of tongue cancer. Bioinformatics analysis plays an important role in identifying marker genes, and GEO and TCGA are important public databases for this purpose. In addition to expression data, the TCGA database also contains the corresponding clinical data. In this study, we selected three GEO datasets that met the inclusion criteria: GSE13601, GSE34105 and GSE34106. These datasets were combined using the SVA package to prepare the data for differential expression analysis, and the LIMMA package was then used to identify differentially expressed genes (DEGs) with thresholds of p < 0.05 and |log2(FC)| ≥ 1.5. This yielded 170 DEGs (104 up-regulated, 66 down-regulated). In addition, the DEseq package was used for differential expression analysis of the samples in the TCGA database using the same criteria, yielding 1589 DEGs (644 up-regulated, 945 down-regulated). Comparing these two sets of DEGs gave 5 common up-regulated DEGs (CCL20, SCG5, SPP1, KRT75 and FOLR3) and 15 common down-regulated DEGs. Further functional analysis of the DEGs showed that CCL20, SCG5 and SPP1 are closely related to prognosis and may be therapeutic targets in tongue squamous cell carcinoma (TSCC).
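The thresholding and overlap steps described in this abstract can be sketched in a few lines. This is a minimal illustration only, not the authors' pipeline (which used R packages): the gene names and statistics are hypothetical placeholders, and the helper names `split_degs` and `common_degs` are assumptions introduced here. Only the cutoffs (p < 0.05, |log2(FC)| ≥ 1.5) come from the abstract.

```python
# Toy differential expression tables: gene -> (log2 fold change, p-value).
# All gene names and numbers are hypothetical placeholders, not study data.
geo_results = {
    "GENE_A": (2.1, 0.001),
    "GENE_B": (-1.8, 0.010),
    "GENE_C": (1.6, 0.200),   # fails the p-value cutoff
    "GENE_D": (0.9, 0.001),   # fails the fold-change cutoff
    "GENE_E": (-2.4, 0.003),
}
tcga_results = {
    "GENE_A": (1.9, 0.004),
    "GENE_B": (-0.5, 0.020),  # fails the fold-change cutoff
    "GENE_E": (-3.0, 0.001),
    "GENE_F": (2.7, 0.002),
}

P_CUTOFF = 0.05        # threshold stated in the abstract
LOG2FC_CUTOFF = 1.5    # threshold stated in the abstract


def split_degs(results):
    """Return (up, down) sets of genes passing both cutoffs."""
    up, down = set(), set()
    for gene, (log2fc, p_value) in results.items():
        if p_value < P_CUTOFF and abs(log2fc) >= LOG2FC_CUTOFF:
            (up if log2fc > 0 else down).add(gene)
    return up, down


def common_degs(results_a, results_b):
    """Intersect up- and down-regulated DEGs from two analyses separately."""
    up_a, down_a = split_degs(results_a)
    up_b, down_b = split_degs(results_b)
    return up_a & up_b, down_a & down_b


common_up, common_down = common_degs(geo_results, tcga_results)
print("common up-regulated:", sorted(common_up))
print("common down-regulated:", sorted(common_down))
```

Intersecting the up- and down-regulated sets separately keeps a gene that changes direction between the two analyses from being counted as a common DEG.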


2017 ◽  
Author(s):  
Charlotte Soneson ◽  
Mark D. Robinson

Background: As single-cell RNA-seq (scRNA-seq) is becoming increasingly common, the amount of publicly available data grows rapidly, generating a useful resource for computational method development and extension of published results. Although processed data matrices are typically made available in public repositories, the procedure to obtain these varies widely between data sets, which may complicate reuse and cross-data set comparison. Moreover, while many statistical methods for performing differential expression analysis of scRNA-seq data are becoming available, their relative merits and the performance compared to methods developed for bulk RNA-seq data are not sufficiently well understood. Results: We present conquer, a collection of consistently processed, analysis-ready public single-cell RNA-seq data sets. Each data set has count and transcripts per million (TPM) estimates for genes and transcripts, as well as quality control and exploratory analysis reports. We use a subset of the data sets available in conquer to perform an extensive evaluation of the performance and characteristics of statistical methods for differential gene expression analysis, evaluating a total of 30 statistical approaches on both experimental and simulated scRNA-seq data. Conclusions: Considerable differences are found between the methods in terms of the number and characteristics of the genes that are called differentially expressed. Pre-filtering of lowly expressed genes can have important effects on the results, particularly for some of the methods originally developed for analysis of bulk RNA-seq data. Generally, however, methods developed for bulk RNA-seq analysis do not perform notably worse than those developed specifically for scRNA-seq.


2020 ◽  
Vol 36 (10) ◽  
pp. 3156-3161 ◽  
Author(s):  
Chong Chen ◽  
Changjing Wu ◽  
Linjie Wu ◽  
Xiaochen Wang ◽  
Minghua Deng ◽  
...  

Motivation: Single cell RNA-sequencing (scRNA-seq) technology enables whole transcriptome profiling at single cell resolution and holds great promise in many biological and medical applications. Nevertheless, scRNA-seq often fails to capture expressed genes, leading to the prominent dropout problem. These dropouts cause many problems in downstream analysis, such as a significant increase in noise, power loss in differential expression analysis and obscuring of gene-to-gene or cell-to-cell relationships. Imputation of these dropout values can be beneficial in scRNA-seq data analysis. Results: In this article, we model the dropout imputation problem as robust matrix decomposition. This model has minimal assumptions and allows us to develop a computationally efficient imputation method called scRMD. Extensive data analysis shows that scRMD can accurately recover the dropout values and help to improve downstream analysis such as differential expression analysis and clustering analysis. Availability and implementation: The R package scRMD is available at https://github.com/XiDsLab/scRMD. Supplementary information: Supplementary data are available at Bioinformatics online.


2021 ◽  
Author(s):  
Jordan W. Squair ◽  
Matthieu Gautier ◽  
Claudia Kathe ◽  
Mark A. Anderson ◽  
Nicholas D. James ◽  
...  

Differential expression analysis in single-cell transcriptomics enables the dissection of cell-type-specific responses to perturbations such as disease, trauma, or experimental manipulation. While many statistical methods are available to identify differentially expressed genes, the principles that distinguish these methods and their performance remain unclear. Here, we show that the relative performance of these methods is contingent on their ability to account for variation between biological replicates. Methods that ignore this inevitable variation are biased and prone to false discoveries. Indeed, the most widely used methods can discover hundreds of differentially expressed genes in the absence of biological differences. Our results suggest an urgent need for a paradigm shift in the methods used to perform differential expression analysis in single-cell data.


Author(s):  
Yixuan Qiu ◽  
Jiebiao Wang ◽  
Jing Lei ◽  
Kathryn Roeder

Motivation: Marker genes, defined as genes that are expressed primarily in a single cell type, can be identified from the single cell transcriptome; however, such data are not always available for the many uses of marker genes, such as deconvolution of bulk tissue. Marker genes for a cell type, however, are highly correlated in bulk data, because their expression levels depend primarily on the proportion of that cell type in the samples. Therefore, when many tissue samples are analyzed, it is possible to identify these marker genes from the correlation pattern. Results: To capitalize on this pattern, we develop a new algorithm to detect marker genes by combining published information about likely marker genes with bulk transcriptome data in the form of a semi-supervised algorithm. The algorithm then exploits the correlation structure of the bulk data to refine the published marker genes by adding or removing genes from the list. Availability and implementation: We implement this method as an R package markerpen, hosted on CRAN (https://CRAN.R-project.org/package=markerpen). Supplementary information: Supplementary data are available at Bioinformatics online.
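The reasoning that marker genes of one cell type correlate across bulk samples can be illustrated with a small simulation. This sketch does not implement markerpen or its semi-supervised algorithm; it only assumes a toy model in which each marker's bulk expression is proportional to the cell-type proportion plus Gaussian noise. All expression scales, noise levels and variable names are hypothetical choices made for the illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


rng = random.Random(0)  # fixed seed so the run is reproducible
n_samples = 200

# Assumed toy model: the proportion of one cell type varies across bulk samples.
proportion = [rng.uniform(0.1, 0.9) for _ in range(n_samples)]

# Two hypothetical marker genes whose bulk expression tracks that proportion.
marker_1 = [100 * p + rng.gauss(0, 2) for p in proportion]
marker_2 = [50 * p + rng.gauss(0, 1) for p in proportion]

# A hypothetical gene whose expression does not depend on the proportion.
unrelated = [40 + rng.gauss(0, 5) for _ in range(n_samples)]

r_markers = pearson(marker_1, marker_2)
r_unrelated = pearson(marker_1, unrelated)
print(f"marker vs marker:    r = {r_markers:.2f}")
print(f"marker vs unrelated: r = {r_unrelated:.2f}")
```

Under these assumptions the two markers are strongly correlated because both are driven by the same proportion, while the unrelated gene shows only chance-level correlation.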


2020 ◽  
Author(s):  
Jinjin Tian ◽  
Jiebiao Wang ◽  
Kathryn Roeder

Motivation: Gene-gene co-expression networks (GCN) are of biological interest for the useful information they provide for understanding gene-gene interactions. The advent of single cell RNA-sequencing allows us to examine more subtle gene co-expression occurring within a cell type. Many imputation and denoising methods have been developed to deal with the technical challenges observed in single cell data; meanwhile, several simulators have been developed for benchmarking and assessing these methods. Most of these simulators, however, either do not incorporate gene co-expression or generate co-expression in an inconvenient manner. Results: Therefore, with the focus on gene co-expression, we propose a new simulator, ESCO, which adopts the idea of the copula to impose gene co-expression while preserving the strengths of available simulators, which perform well at simulating marginal gene expression. Using ESCO, we assess the performance of imputation methods on GCN recovery and find that imputation generally helps GCN recovery when the data are not too sparse, and the ensemble imputation method works best among leading methods. In contrast, imputation fails to help in the presence of an excessive fraction of zero counts, where simple data aggregating methods are a better choice. These findings are further verified with mouse and human brain cell data. Availability: The ESCO implementation is available as R package SplatterESCO (https://github.com/JINJINT/SplatterESCO).


Author(s):  
Jonas C. Schupp ◽  
Taylor S. Adams ◽  
Carlos Cosme Jr. ◽  
Micha Sam Brickman Raredon ◽  
Yifan Yuan ◽  
...  

Background: The cellular diversity of the lung endothelium has not been systematically characterized in humans. Here, we provide a reference atlas of human lung endothelial cells (ECs) to facilitate a better understanding of the phenotypic diversity and composition of cells comprising the lung endothelium. Methods: We reprocessed human control single cell RNA sequencing (scRNAseq) data from six datasets. EC populations were characterized through iterative clustering with subsequent differential expression analysis. Marker genes were validated by fluorescent microscopy and in situ hybridization. scRNAseq of primary lung ECs cultured in vitro was performed. The signaling network between different lung cell types was studied. For cross-species analysis or disease relevance, we applied the same methods to scRNAseq data obtained from mouse lungs or from human lungs with pulmonary hypertension. Results: Six lung scRNAseq datasets were reanalyzed and annotated to identify over 15,000 vascular ECs from 73 individuals. Differential expression analysis of ECs revealed signatures corresponding to endothelial lineage, including pan-endothelial, pan-vascular and subpopulation-specific marker gene sets. Beyond the broad cellular categories of lymphatic, capillary, arterial and venous ECs, we found previously indistinguishable subpopulations. Among venous ECs, we identified two populations: pulmonary-venous ECs (COL15A1neg) localized to the lung parenchyma and systemic-venous ECs (COL15A1pos) localized to the airways and the visceral pleura. Among capillary ECs, we confirmed their subclassification into recently discovered aerocytes, characterized by EDNRB, SOSTDC1 and TBX2, and general capillary ECs. We confirmed that all six endothelial cell types, including the systemic-venous ECs and aerocytes, are present in mice and identified endothelial marker genes conserved in humans and mice. Ligand-receptor connectome analysis revealed important homeostatic crosstalk of ECs with other lung resident cell types. scRNAseq of commercially available primary lung ECs demonstrated a loss of their native lung phenotype in culture. scRNAseq revealed that the endothelial diversity is maintained in pulmonary hypertension. Our manuscript is accompanied by an online data mining tool (www.LungEndothelialCellAtlas.com). Conclusions: Our integrated analysis provides a comprehensive and well-crafted reference atlas of lung endothelial cells in the normal lung and confirms and describes in detail previously unrecognized endothelial populations across a large number of humans and mice.


2018 ◽  
Vol 34 (19) ◽  
pp. 3340-3348 ◽  
Author(s):  
Zhijin Wu ◽  
Yi Zhang ◽  
Michael L Stitzel ◽  
Hao Wu

2018 ◽  
Vol 19 (1) ◽  
Author(s):  
Wenan Chen ◽  
Yan Li ◽  
John Easton ◽  
David Finkelstein ◽  
Gang Wu ◽  
...  
