Outlier detection for improved differential splicing quantification from RNA-Seq experiments with replicates

10.1101/104059 ◽

2017 ◽

Cited By ~ 1

Author(s):

Scott Norton ◽

Jorge Vaquero-Garcia ◽

Yoseph Barash

Keyword(s):

Outlier Detection ◽

Statistical Power ◽

Probability Model ◽

Real Life ◽

Supplementary Information ◽

Rna Seq ◽

Differential Splicing ◽

Experimental Conditions ◽

Link Type ◽

Extensive Evaluation

Abstract Motivation A key component in many RNA-Seq based studies is contrasting multiple replicates from different experimental conditions. In this setup replicates play a key role, as they make it possible to capture the underlying biological variability inherent to the compared conditions, as well as experimental variability. However, what constitutes a “bad” replicate is not necessarily well defined. Consequently, researchers might discard valuable data, or downstream analysis may be hampered by failed experiments. Results Here we develop a probability model to weigh a given RNA-Seq sample as a representative of an experimental condition when performing alternative splicing analysis. We demonstrate that this model detects outlier samples which are consistently and significantly different compared to other samples from the same condition. Moreover, we show that instead of discarding such samples the proposed weighting scheme can be used to downweight samples and specific splicing variations suspected as outliers, gaining statistical power. These weights can then be used for differential splicing (DS) analysis, where the resulting algorithm offers a generalization of the MAJIQ algorithm. Using both synthetic and real-life data we perform an extensive evaluation of the improved MAJIQ algorithm in different scenarios involving perturbed samples, mislabeled samples, no-signal groups, and different levels of coverage, showing it compares favorably to other tools. Overall, this work offers an outlier detection algorithm that can be combined with any splicing pipeline, a generalized and improved version of MAJIQ for differential splicing detection, and an evaluation pipeline researchers can use to evaluate which algorithm may work best for their needs. Availability Program is accessible via http://majiq.biociphers.org/norton_et_al_2017/ Contact http://[email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Style transfer with variational autoencoders is a promising approach to RNA-Seq data harmonization and analysis

10.1101/791962 ◽

2019 ◽

Author(s):

N. Russkikh ◽

D. Antonets ◽

D. Shtokalo ◽

A. Makarov ◽

Y. Vyatkin ◽

...

Keyword(s):

Prediction Accuracy ◽

Supplementary Information ◽

Rna Seq ◽

Style Transfer ◽

Data Harmonization ◽

Link Type ◽

Proposed Model ◽

Technical Factors ◽

Neural Network Classifiers

Abstract Motivation Transcriptomic data are frequently used in research on biomarker genes of different diseases and biological states. The most common tasks there are data harmonization and treatment outcome prediction. Both of them can be addressed via the style transfer approach. Either technical factors or any biological details about the samples which we would like to control (gender, biological state, treatment etc.) can be used as style components. Results The proposed style transfer solution is based on Conditional Variational Autoencoders, Y-Autoencoders and adversarial feature decomposition. To measure the quality of the style transfer quantitatively, we used neural network classifiers that predict the style and semantics after training on real expression data. Comparison with several existing style-transfer based approaches shows that the proposed model has the highest style prediction accuracy on all considered datasets while having comparable or the best semantics prediction accuracy. Availability https://github.com/NRshka/[email protected] Supplementary information FigShare.com (https://dx.doi.org/10.6084/m9.figshare.9925115)

Metric Learning on Expression Data for Gene Function Prediction

10.1101/651042 ◽

2019 ◽

Author(s):

Stavros Makrodimitris ◽

Marcel J.T. Reinders ◽

Roeland C.H.J. van Ham

Keyword(s):

Pearson Correlation ◽

Metric Learning ◽

Specific Weight ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Experimental Conditions ◽

Expression Of Genes ◽

Guilt By Association ◽

Python Package

Abstract Motivation Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results To address both types of effects, we developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression, and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Availability MLC is available as a Python package at www.github.com/stamakro/[email protected] Supplementary information Supplementary data are available online.

orfipy: a fast and flexible tool for extracting ORFs

10.1101/2020.10.20.348052 ◽

2020 ◽

Author(s):

Urminder Singh ◽

Eve Syrkin Wurtele

Keyword(s):

Open Reading Frames ◽

Supplementary Information ◽

Rna Seq ◽

Flexible Tool ◽

Coding Regions ◽

Link Type ◽

Alternative Reading Frames ◽

Downstream Analysis ◽

Fine Tune ◽

Reading Frames

Summary Searching for ORFs in transcripts is a critical step prior to annotating coding regions in newly-sequenced genomes and searching for alternative reading frames within known genes. With the tremendous increase in RNA-Seq data, faster tools are needed to handle large input datasets. These tools should be versatile enough to fine-tune search criteria and allow efficient downstream analysis. Here we present a new Python-based tool, orfipy, which allows the user to flexibly search for open reading frames in FASTA sequences. The search is rapid and fully customizable, with a choice of FASTA and BED output formats. Availability and implementation orfipy is implemented in Python and is compatible with Python v3.6 and higher. Source code: https://github.com/urmi-21/orfipy. Installation: from the source, or via PyPI (https://pypi.org/project/orfipy) or bioconda (https://anaconda.org/bioconda/orfipy). Contact [email protected], [email protected] Supplementary information Supplementary data are available at https://github.com/urmi-21/orfipy
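
As a rough illustration of the kind of search the summary describes, the sketch below scans the three forward reading frames of a sequence for ORFs. It is not orfipy's implementation or API: the function name `find_orfs`, the `min_len` parameter, and the choice of an ATG start with TAA/TAG/TGA stops are assumptions made for this example, and the reverse strand is ignored.

```python
# Hypothetical helper, not part of orfipy: a naive forward-strand ORF scan.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed standard stop codons


def find_orfs(seq, min_len=30):
    """Return (start, end, frame) tuples; end is exclusive and includes the stop codon."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if end - start >= min_len:
                    orfs.append((start, end, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

Calling `find_orfs("CCATGAAATAGCC", min_len=9)` reports the single ORF `ATGAAATAG` as `(2, 11, 2)`. A production tool would also handle the reverse complement, alternative start codons and streaming of large FASTA files.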

Shark: fishing relevant reads in an RNA-Seq sample

Bioinformatics ◽

10.1093/bioinformatics/btaa779 ◽

2020 ◽

Author(s):

Luca Denti ◽

Yuri Pirola ◽

Marco Previtali ◽

Tamara Ceccato ◽

Gianluca Della Vedova ◽

...

Keyword(s):

Alternative Splicing ◽

High Throughput ◽

Supplementary Information ◽

Supplementary Data ◽

Rna Seq ◽

Differential Splicing ◽

Massive Datasets ◽

Gene Assignment ◽

Alignment Free ◽

Input Dataset

Abstract Motivation Recent advances in high-throughput RNA-Seq technologies make it possible to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. Results We introduce a novel computational problem, called gene assignment, and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample the reads that most probably were sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. Availability and implementation The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
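
To make the idea of alignment-free read filtering concrete, here is a minimal sketch that keeps a read when most of its k-mers occur in a gene panel. It is a simplified stand-in, not Shark's actual algorithm or interface: the names `kmers` and `assign_reads`, the parameters `k` and `min_fraction`, and the plain Python set used as the index are all assumptions for illustration.

```python
# Hypothetical sketch of k-mer based read filtering; not Shark's implementation.

def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def assign_reads(reads, genes, k=5, min_fraction=0.6):
    """Keep reads whose k-mers are mostly found in the gene panel.

    reads: iterable of read sequences.
    genes: dict mapping gene name to sequence.
    Only the forward strand is considered in this sketch.
    """
    panel = set()
    for gene_seq in genes.values():
        panel |= kmers(gene_seq, k)

    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:  # read shorter than k
            continue
        shared = len(read_kmers & panel) / len(read_kmers)
        if shared >= min_fraction:
            kept.append(read)
    return kept
```

For example, with `genes = {"g1": "ACGTACGGTCA"}` and `k=4`, the read `"ACGTACGG"` is kept and `"TTTTTTTT"` is discarded. Lowering `min_fraction` is one way such a filter could tolerate reads that span novel splice junctions, at the cost of keeping more irrelevant reads.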

ClusterMap: compare multiple single cell RNA-Seq datasets across different experimental conditions

Bioinformatics ◽

10.1093/bioinformatics/btz024 ◽

2019 ◽

Vol 35 (17) ◽

pp. 3038-3045 ◽

Cited By ~ 7

Author(s):

Xin Gao ◽

Deqing Hu ◽

Madelaine Gogol ◽

Hua Li

Keyword(s):

Single Cell ◽

Molecular Mechanisms ◽

Population Level ◽

Supplementary Information ◽

Marker Genes ◽

Rna Seq ◽

Matching Problem ◽

Experimental Conditions ◽

Cut Method ◽

Underlying Mechanisms

Abstract Motivation Single cell RNA-Seq (scRNA-Seq) facilitates the characterization of cell type heterogeneity and developmental processes. Further study of single cell profiles across different conditions enables the understanding of biological processes and underlying mechanisms at the sub-population level. However, developing proper methodology to compare multiple scRNA-Seq datasets remains challenging. Results We have developed ClusterMap, a systematic method and workflow to facilitate the comparison of scRNA-seq profiles across distinct biological contexts. Using hierarchical clustering of the marker genes of each sub-group, ClusterMap matches the sub-types of cells across different samples and provides ‘similarity’ as a metric to quantify the quality of the match. We introduce a purity tree cut method designed specifically for this matching problem. We use a Circos plot and a regrouping method to visualize the results concisely. Furthermore, we propose a new metric, ‘separability’, to summarize sub-population changes among all sample pairs. In the case studies, we demonstrate that ClusterMap can provide further insight into the different molecular mechanisms of cellular sub-populations across different conditions. Availability and implementation ClusterMap is implemented in R and available at https://github.com/xgaoo/ClusterMap. Supplementary information Supplementary data are available at Bioinformatics online.

GREIN: An Interactive Web Platform for Reanalyzing GEO RNA-seq Data

10.1101/326223 ◽

2018 ◽

Cited By ~ 1

Author(s):

Naim Al Mahi ◽

Mehdi Fazel Najafabadi ◽

Marcin Pilarczyk ◽

Michal Kouril ◽

Mario Medvedovic

Keyword(s):

Gene Expression ◽

User Interfaces ◽

Web Application ◽

Statistical Power ◽

Functional Characterization ◽

Gene Expression Omnibus ◽

Rna Seq ◽

Link Type ◽

Front End ◽

User Friendly

Abstract The vast amount of RNA-seq data deposited in Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA) is still a grossly underutilized resource for biomedical research. To remove technical roadblocks for reusing these data, we have developed a web-application GREIN (GEO RNA-seq Experiments Interactive Navigator) which provides user-friendly interfaces to manipulate and analyze GEO RNA-seq data. GREIN is powered by the back-end computational pipeline for uniform processing of RNA-seq data and the large number (>6,500) of already processed datasets. The front-end user interfaces provide a wealth of user-analytics options including sub-setting and downloading processed data, interactive visualization, statistical power analyses, construction of differential gene expression signatures and their comprehensive functional characterization, and connectivity analysis with LINCS L1000 data. The combination of the massive amount of back-end data and front-end analytics options driven by user-friendly interfaces makes GREIN a unique open-source resource for re-using GEO RNA-seq data. GREIN is accessible at: https://shiny.ilincs.org/grein, the source code at: https://github.com/uc-bd2k/grein, and the Docker container at: https://hub.docker.com/r/ucbd2k/grein.

Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data

10.1101/532192 ◽

2019 ◽

Author(s):

Héctor Climente-González ◽

Chloé-Agathe Azencott ◽

Samuel Kaski ◽

Makoto Yamada

Keyword(s):

Single Cell ◽

Synthetic Data ◽

Real Data ◽

Supplementary Information ◽

Rna Seq ◽

Link Type ◽

Model Free ◽

Computational Overhead ◽

Expression Microarrays ◽

And Function

Abstract Motivation Finding nonlinear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have crucial drawbacks, among others lack of parsimony, non-convexity, and computational overhead. Here we present the block HSIC Lasso, a nonlinear feature selector that does not suffer from these drawbacks. Results We compare the block HSIC Lasso to other state-of-the-art feature selection techniques on synthetic data and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA-seq, and GWAS. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than features selected by other techniques. As a proof of concept, we applied the block HSIC Lasso to a single-cell RNA-seq experiment on mouse hippocampus. We discovered that many genes linked in the past to brain development and function are involved in the biological differences between the types of neurons. Availability Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on GitHub (https://github.com/riken-aip/pyHSICLasso) and PyPI (https://pypi.org/project/pyHSICLasso). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

Comprehensive evaluation of computational cell-type quantification methods for immuno-oncology

10.1101/463828 ◽

2018 ◽

Cited By ~ 4

Author(s):

Gregor Sturm ◽

Francesca Finotello ◽

Florent Petitprez ◽

Jitao David Zhang ◽

Jan Baumbach ◽

...

Keyword(s):

Tumor Microenvironment ◽

Single Cell ◽

Computational Methods ◽

Immune Cell ◽

Comprehensive Evaluation ◽

Supplementary Information ◽

Rna Seq ◽

Cell Type ◽

Link Type ◽

Real World Datasets

Abstract Motivation The composition and density of immune cells in the tumor microenvironment profoundly influence tumor progression and success of anti-cancer therapies. Flow cytometry, immunohistochemistry staining, or single-cell sequencing is often unavailable such that we rely on computational methods to estimate the immune-cell composition from bulk RNA-sequencing (RNA-seq) data. Various methods have been proposed recently, yet their capabilities and limitations have not been evaluated systematically. A general guideline leading the research community through cell type deconvolution is missing. Results We developed a systematic approach for benchmarking such computational methods and assessed the accuracy of tools at estimating nine different immune- and stromal cells from bulk RNA-seq samples. We used a single-cell RNA-seq dataset of ∼11,000 cells from the tumor microenvironment to simulate bulk samples of known cell type proportions, and validated the results using independent, publicly available gold-standard estimates. This allowed us to analyze and condense the results of more than a hundred thousand predictions to provide an exhaustive evaluation across seven computational methods over nine cell types and ∼1,800 samples from five simulated and real-world datasets. We demonstrate that computational deconvolution performs at high accuracy for well-defined cell-type signatures and propose how fuzzy cell-type signatures can be improved. We suggest that future efforts should be dedicated to refining cell population definitions and finding reliable signatures. Availability A snakemake pipeline to reproduce the benchmark is available at https://github.com/grst/immune_deconvolution_benchmark. An R package allows the community to perform integrated deconvolution using different methods (https://grst.github.io/immunedeconv). Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.
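
The abstract mentions simulating bulk samples of known cell type proportions from single-cell data. The sketch below shows one simple way such a simulation could look: sample cells according to target proportions and average their expression vectors. It is not the benchmark's snakemake pipeline; the function `simulate_bulk`, its parameters, and the plain-list data layout are assumptions chosen for illustration.

```python
import random

# Hypothetical sketch of pseudo-bulk simulation; not the authors' pipeline.

def simulate_bulk(cells_by_type, proportions, n_cells=500, seed=0):
    """Average the expression of cells sampled with the given type proportions.

    cells_by_type: dict mapping cell type to a list of expression vectors.
    proportions: dict mapping cell type to its target fraction.
    Returns (bulk_profile, realized_fractions).
    """
    rng = random.Random(seed)
    types = list(proportions)
    weights = [proportions[t] for t in types]
    n_genes = len(next(iter(cells_by_type.values()))[0])

    total = [0.0] * n_genes
    counts = {t: 0 for t in types}
    for _ in range(n_cells):
        cell_type = rng.choices(types, weights=weights)[0]
        cell = rng.choice(cells_by_type[cell_type])  # sample with replacement
        counts[cell_type] += 1
        for g, value in enumerate(cell):
            total[g] += value

    bulk = [value / n_cells for value in total]
    realized = {t: counts[t] / n_cells for t in types}
    return bulk, realized
```

Because the realized fractions are returned alongside the mixed profile, a deconvolution method's estimates can be compared against a known ground truth, which is the point of a simulated benchmark.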

RNA splicing analysis using heterogeneous and large RNA-seq datasets

10.1101/2021.11.03.467086 ◽

2021 ◽

Author(s):

Jorge Vaquero-Garcia ◽

Joseph K Aicher ◽

Paul Jewell ◽

Matthew R K Gazzara ◽

Caleb Matthew Radens ◽

...

Keyword(s):

Rna Splicing ◽

Large Scale ◽

Splice Variants ◽

Synthetic Data ◽

Large Datasets ◽

Splicing Regulation ◽

Rna Seq ◽

Differential Splicing ◽

Experimental Conditions ◽

Benchmark Datasets

The ubiquity of RNA-seq has led to many methods that use RNA-seq data to analyze variations in RNA splicing. However, available methods are not well suited for handling heterogeneous and large datasets. Such datasets scale to thousands of samples across dozens of experimental conditions, exhibit increased variability compared to biological replicates, and involve thousands of unannotated splice variants resulting in increased transcriptome complexity. We describe here a suite of algorithms and tools implemented in the MAJIQ v2 package to address challenges in detection, quantification, and visualization of splicing variations from such datasets. Using both large scale synthetic data and GTEx v8 as benchmark datasets, we demonstrate that the approaches in MAJIQ v2 outperform existing methods. We then apply the MAJIQ v2 package to analyze differential splicing across 2,335 samples from 13 brain subregions, demonstrating its ability to offer new insights into brain subregion-specific splicing regulation.

Metric learning on expression data for gene function prediction

Bioinformatics ◽

10.1093/bioinformatics/btz731 ◽

2019 ◽

Author(s):

Stavros Makrodimitris ◽

Marcel J T Reinders ◽

Roeland C H J van Ham

Keyword(s):

Pearson Correlation ◽

Metric Learning ◽

Specific Weight ◽

Supplementary Information ◽

Expression Data ◽

Rna Seq ◽

Experimental Conditions ◽

Expression Of Genes ◽

Guilt By Association ◽

Python Package

Abstract Motivation Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, when using RNA-Seq datasets with many experimental conditions from diverse sources, only a subset of the experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similarly functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest. Results To address this, we developed Metric Learning for Co-expression (MLC), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and, in addition, if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric performance. Moreover, our method is particularly good at more specific terms, which are the most interesting. Finally, by observing the sample weights for a particular GO term, one can identify which experiments are important for learning that term and potentially identify novel conditions that are relevant, as demonstrated by experiments in both A. thaliana and Pseudomonas aeruginosa. Availability and implementation MLC is available as a Python package at www.github.com/stamakro/MLC. Supplementary information Supplementary data are available at Bioinformatics online.
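
The weighted co-expression measure at the heart of this abstract can be illustrated with a sample-weighted Pearson correlation. The sketch below only evaluates the correlation for weights supplied by the caller; it does not learn the weights, which is what MLC itself does, and the function name `weighted_pearson` is an assumption rather than part of the MLC package.

```python
import math

# Hypothetical helper: Pearson correlation with per-sample weights.
# MLC learns such weights per GO term; here they are simply given.

def weighted_pearson(x, y, w):
    """Return the weighted Pearson correlation of x and y under sample weights w."""
    total = float(sum(w))
    w = [wi / total for wi in w]  # normalize weights to sum to 1
    mx = sum(wi * xi for wi, xi in zip(w, x))
    my = sum(wi * yi for wi, yi in zip(w, y))
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / math.sqrt(vx * vy)
```

With equal weights this reduces to the ordinary Pearson correlation. Setting a sample's weight to zero removes its influence entirely, which is how a weighting scheme can restrict co-expression to the samples that are informative for a given term.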
