Batch Effect Correction of RNA-seq Data through Sample Distance Matrix Adjustment

Mapping Intimacies ◽

10.1101/669739 ◽

2019 ◽

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Simulated Data ◽

Distance Matrix ◽

Batch Effect ◽

Rna Seq ◽

Sequencing Data ◽

Gene Detection ◽

Gene Differential Expression ◽

Optimal Linear ◽

Sample Pattern ◽

Sample Distance

Abstract: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. We present scBatch, a numerical algorithm that conducts batch effect correction on the count matrix of RNA sequencing (RNA-seq) data. Unlike traditional methods, scBatch starts by establishing an ideal correction of the sample distance matrix that effectively reflects the underlying biological subgroups, without considering the actual correction of the raw count matrix itself. It then seeks an optimal linear transformation of the count matrix to approximate the established sample pattern. The benefit of this approach is that the final result is not restricted by assumptions on the mechanism of the batch effect. As a result, the method yields good clustering and gene differential expression (DE) results. We compared scBatch with the leading batch effect removal methods ComBat and mnnCorrect on simulated data, real bulk RNA-seq data, and real single-cell RNA-seq data. The comparisons demonstrated that scBatch achieved better sample clustering and DE gene detection results.
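The two-step idea in this abstract (fix a target sample pattern first, then find a linear transformation of the data that reproduces it) can be sketched in a few lines. This is a simplified illustration, not the scBatch algorithm: it assumes the target is a matrix of squared Euclidean distances between samples, assumes at least as many genes as samples, and swaps in a closed-form solution in place of the authors' numerical optimization. The function names (`sym_power`, `fit_linear_transform`, `pairwise_sq_dist`) and the toy data are hypothetical.

```python
import numpy as np


def sym_power(M, power, eps=1e-12):
    """Raise a symmetric positive semi-definite matrix to a real power.

    Eigenvalues at or below `eps` are treated as zero (pseudo-inverse
    behaviour for negative powers).
    """
    vals, vecs = np.linalg.eigh(M)
    vals = np.clip(vals, 0.0, None)
    out = np.zeros_like(vals)
    keep = vals > eps
    out[keep] = vals[keep] ** power
    return (vecs * out) @ vecs.T  # V diag(out) V^T


def pairwise_sq_dist(Y):
    """Squared Euclidean distances between the columns (samples) of Y."""
    gram = Y.T @ Y
    d = np.diag(gram)
    return d[:, None] + d[None, :] - 2.0 * gram


def fit_linear_transform(X, target_sq_dist):
    """Find W (samples x samples) so that X @ W has the target distances.

    X is genes x samples. Assumption: X.T @ X is invertible and the
    target is a valid squared Euclidean distance matrix.
    """
    n = X.shape[1]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    G = -0.5 * J @ target_sq_dist @ J     # target Gram matrix (classical MDS)
    K = X.T @ X                           # current Gram matrix
    # W^T K W = G^{1/2} K^{-1/2} K K^{-1/2} G^{1/2} = G
    return sym_power(K, -0.5) @ sym_power(G, 0.5)


# Toy data: 50 genes, 6 samples, and an arbitrary target sample pattern.
rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(50, 6)).astype(float)
target = pairwise_sq_dist(rng.normal(size=(3, 6)))
W = fit_linear_transform(X, target)
Y = X @ W
print(np.abs(pairwise_sq_dist(Y) - target).max())  # residual of the fit
```

Note that the transformed matrix `Y` generally contains negative, non-integer values, so in this sketch it is no longer a count matrix; choosing the target pattern itself is the harder part and is not covered here.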


scBatch: batch-effect correction of RNA-seq data through sample distance matrix adjustment

Bioinformatics ◽

10.1093/bioinformatics/btaa097 ◽

2020 ◽

Vol 36 (10) ◽

pp. 3115-3123 ◽

Cited By ~ 3

Author(s):

Teng Fei ◽

Tianwei Yu

Keyword(s):

Single Cell ◽

Differential Expression Analysis ◽

Distance Matrix ◽

Real Data ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Rna Seq ◽

Sequencing Data ◽

Gene Differential Expression

Abstract. Motivation: Batch effect is a frequent challenge in deep sequencing data analysis that can lead to misleading conclusions. Existing methods do not correct batch effects satisfactorily, especially with single-cell RNA sequencing (RNA-seq) data. Results: We present scBatch, a numerical algorithm for batch-effect correction on bulk and single-cell RNA-seq data with emphasis on improving both clustering and gene differential expression analysis. scBatch is not restricted by assumptions on the mechanism of batch-effect generation. As shown in simulations and real data analyses, scBatch outperforms benchmark batch-effect correction methods. Availability and implementation: The R package is available at github.com/tengfei-emory/scBatch. The code to generate results and figures in this article is available at github.com/tengfei-emory/scBatch-paper-scripts. Supplementary information: Supplementary data are available at Bioinformatics online.


SimFuse: A Novel Fusion Simulator for RNA Sequencing (RNA-Seq) Data

BioMed Research International ◽

10.1155/2015/780519 ◽

2015 ◽

Vol 2015 ◽

pp. 1-5 ◽

Cited By ~ 2

Author(s):

Yuxiang Tan ◽

Yann Tambouret ◽

Stefano Monti

Keyword(s):

Sample Size ◽

Rna Sequencing ◽

High Throughput Sequencing ◽

Performance Metrics ◽

Simulated Data ◽

Real Data ◽

Rna Seq ◽

Sequencing Data ◽

Detection Algorithms ◽

Fusion Detection

The performance evaluation of fusion detection algorithms from high-throughput sequencing data crucially relies on the availability of data with known positive and negative cases of gene rearrangements. The use of simulated data circumvents some shortcomings of real data by generation of an unlimited number of true and false positive events, and the consequent robust estimation of accuracy measures, such as precision and recall. Although a few simulated fusion datasets from RNA Sequencing (RNA-Seq) are available, they are of limited sample size. This makes it difficult to systematically evaluate the performance of RNA-Seq based fusion-detection algorithms. Here, we present SimFuse to address this problem. SimFuse utilizes real sequencing data as the fusions’ background to closely approximate the distribution of reads from a real sequencing library and uses a reference genome as the template from which to simulate fusions’ supporting reads. To assess the supporting read-specific performance, SimFuse generates multiple datasets with various numbers of fusion supporting reads. Compared to an extant simulated dataset, SimFuse gives users control over the supporting read features and the sample size of the simulated library, based on which the performance metrics needed for the validation and comparison of alternative fusion-detection algorithms can be rigorously estimated.


Circall: fast and accurate methodology for discovery of circular RNAs from paired-end RNA-sequencing data

BMC Bioinformatics ◽

10.1186/s12859-021-04418-8 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Dat Thanh Nguyen ◽

Quang Thinh Trac ◽

Thi-Hau Nguyen ◽

Ha-Nam Nguyen ◽

Nir Ohad ◽

...

Keyword(s):

Rna Sequencing ◽

Simulated Data ◽

High Sensitivity ◽

Circular Rna ◽

Computational Time ◽

Circular Rnas ◽

Rna Seq ◽

Sequencing Data ◽

Mapping Algorithm ◽

False Discovery Rate Method

Abstract. Background: Circular RNA (circRNA) is an emerging class of RNA molecules that is attracting researchers because of its potential to serve as a marker for diagnosis or prognosis, or as a therapeutic target, in cancer, cardiovascular, and autoimmune diseases. Current methods for detection of circRNA from RNA sequencing (RNA-seq) focus mostly on improving the mapping quality of reads supporting the back-splicing junction (BSJ) of a circRNA to eliminate false positives (FPs). We show that mapping information alone often cannot predict whether a BSJ-supporting read is derived from a true circRNA, thus increasing the rate of FP circRNAs. Results: We have developed Circall, a novel circRNA detection method from RNA-seq. Circall controls the FPs using a robust multidimensional local false discovery rate method based on the length and expression of circRNAs. It is computationally highly efficient because it uses a quasi-mapping algorithm for fast and accurate RNA read alignments. We applied Circall to two simulated datasets and three experimental datasets of human cell lines. The results show that Circall achieves high sensitivity and precision in the simulated data. In the experimental datasets it performs well against current leading methods. Circall is also substantially faster than the other methods, particularly for large datasets. Conclusions: With its better performance in circRNA detection and in computational time, Circall facilitates the analysis of circRNAs in large numbers of samples. Circall is implemented in C++ and R, and is available for use at https://www.meb.ki.se/sites/biostatwiki/circall and https://github.com/datngu/Circall.


NewWave: a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA-seq data

10.1101/2021.08.02.453487 ◽

2021 ◽

Author(s):

Federico Agostinis ◽

Chiara Romualdi ◽

Gabriele Sales ◽

Davide Risso

Keyword(s):

Dimensionality Reduction ◽

Single Cell ◽

R Package ◽

Batch Effect ◽

Supplementary Information ◽

Bioconductor Package ◽

Rna Seq ◽

Sequencing Data ◽

Bioconductor Project ◽

Single Cell Rna Sequencing

Summary: We present NewWave, a scalable R/Bioconductor package for the dimensionality reduction and batch effect removal of single-cell RNA sequencing data. To achieve scalability, NewWave uses mini-batch optimization and can work with out-of-memory data, enabling users to analyze datasets with millions of cells. Availability and implementation: NewWave is implemented as an open-source R package available through the Bioconductor project at https://bioconductor.org/packages/NewWave/ Supplementary information: Supplementary data are available at Bioinformatics online.


On the Analysis of Transcriptional Noise From RNA-sequencing Data

10.1101/2021.04.06.438605 ◽

2021 ◽

Author(s):

Kristoffer Vitting-Seerup

Keyword(s):

Rna Sequencing ◽

Simulated Data ◽

Cellular Biology ◽

Rna Seq ◽

Sequencing Data ◽

Transcriptional Noise ◽

Bioinformatic Tools ◽

Specific Focus ◽

Significant Step

RNA-sequencing (RNA-seq) has revolutionized our understanding of molecular and cellular biology. A cornerstone of RNA-seq analysis is the set of bioinformatic tools that quantify the data. To evaluate the efficacy of these tools, scientists rely heavily on simulation of RNA-seq. Recently, Varabyou et al. took simulation of RNA-seq data to the next level by providing simulated data that include transcriptional noise. While this represents a significant step forward in our ability to perform realistic benchmarks of RNA-seq tools, the data provided by Varabyou et al. need refinement. In the following, I suggest a few improvements with a specific focus on splicing noise.


Adaptive Somatic Mutations Calls with Deep Learning and Semi-Simulated Data

10.1101/079087 ◽

2016 ◽

Cited By ~ 6

Author(s):

Remi Torracinta ◽

Laurent Mesnard ◽

Susan Levine ◽

Rita Shaknovich ◽

Maureen Hanson ◽

...

Keyword(s):

Deep Learning ◽

High Throughput Sequencing ◽

Probabilistic Models ◽

Somatic Mutations ◽

Simulated Data ◽

Good Representation ◽

Rna Seq ◽

Sequencing Data ◽

Somatic Variation ◽

Feed Forward Neural Network

Abstract: A number of approaches have been developed to call somatic variation in high-throughput sequencing data. Here, we present an adaptive approach to calling somatic variations. Our approach trains a deep feed-forward neural network with semi-simulated data. Semi-simulated datasets are constructed by planting somatic mutations in real datasets where no mutations are expected. Using semi-simulated data makes it possible to train the models with millions of training examples, a usual requirement for successfully training deep learning models. We initially focus on calling variations in RNA-Seq data. We derive semi-simulated datasets from real RNA-Seq data, which offer a good representation of the data the models will be applied to. We test the models on independent semi-simulated data as well as pure simulations. On independent semi-simulated data, models achieve an AUC of 0.973. When tested on semi-simulated exome DNA datasets, we find that the models trained on RNA-Seq data remain predictive (sensitivity 0.4 and specificity 0.9 at a cutoff of P ≥ 0.9), albeit with lower overall performance (AUC = 0.737). Interestingly, while the models generalize across assays, training on RNA-Seq data lowers the confidence for a group of mutations. Haloplex exome-specific training was also performed, demonstrating that the approach can produce probabilistic models tuned for specific assays and protocols. We found that the method adapts to the characteristics of the experimental protocol. We further illustrate these points by training a model for a trio somatic experimental design, in which germline DNA of both parents is available in addition to data about the individual. These models are distributed with Goby (http://goby.campagnelab.org).
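For readers unfamiliar with the metrics quoted in this abstract, the following minimal sketch shows how sensitivity and specificity at a probability cutoff, and the AUC, are computed from labelled predictions. The labels and probabilities are made-up toy values, not data from the paper, and the function names (`sens_spec`, `auc`) are hypothetical.

```python
def sens_spec(labels, probs, cutoff):
    """Sensitivity and specificity when calling a mutation if prob >= cutoff."""
    tp = sum(1 for y, p in zip(labels, probs) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(labels, probs) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(labels, probs) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(labels, probs) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, probs):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [p for y, p in zip(labels, probs) if y == 1]
    neg = [p for y, p in zip(labels, probs) if y == 0]
    score = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                score += 1.0
            elif p == q:
                score += 0.5
    return score / (len(pos) * len(neg))


# Toy example: 1 = planted mutation, 0 = no mutation.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
probs = [0.95, 0.92, 0.60, 0.30, 0.91, 0.20, 0.10, 0.05]

print(sens_spec(labels, probs, cutoff=0.9))  # (0.5, 0.75)
print(auc(labels, probs))                    # 0.875
```

A high cutoff such as P ≥ 0.9 trades sensitivity for specificity, which is why a model can report modest sensitivity alongside high specificity at that threshold.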


scCODE: an R package for personalized differentially expressed gene detection on single-cell RNA-sequencing data

10.1101/2021.11.18.469072 ◽

2021 ◽

Author(s):

Jiawei Zou ◽

Miaochen Wang ◽

Zhen Zhang ◽

Zheqi Liu ◽

Xiaobin Zhang ◽

...

Keyword(s):

Single Cell ◽

Differentially Expressed Gene ◽

R Package ◽

Differentially Expressed ◽

Rna Seq ◽

Sequencing Data ◽

Gene Detection ◽

Gene Filtering ◽

Consensus Optimization ◽

Detection Strategies

Differential expression (DE) gene detection in single-cell RNA-seq (scRNA-seq) data is a key step in understanding the biological question under investigation. We find that DE methods, together with gene filtering, have a profound impact on DE gene identification, and that different datasets benefit from personalized DE gene detection strategies. Existing tools do not take gene filtering into consideration and cannot evaluate DE performance on real datasets without prior knowledge of the true results. Based on two new metrics, we propose scCODE (single cell Consensus Optimization of Differentially Expressed gene detection), an R package to automatically optimize DE gene detection for each experimental scRNA-seq dataset.


Advancing clinical genomics and precision medicine with GVViZ: FAIR bioinformatics platform for variable gene-disease annotation, visualization, and expression analysis

Human Genomics ◽

10.1186/s40246-021-00336-1 ◽

2021 ◽

Vol 15 (1) ◽

Author(s):

Zeeshan Ahmed ◽

Eduard Gibert Renart ◽

Saman Zeeshan ◽

XinQi Dong

Keyword(s):

Data Analysis ◽

Patient Care ◽

Expression Analysis ◽

High Throughput ◽

Gene Annotation ◽

Next Generation Sequencing Data ◽

Rna Seq ◽

Sequencing Data ◽

Complex Disorders ◽

Transcriptomics Data

Abstract. Background: Genetic disposition is considered critical for identifying subjects at high risk for disease development. Investigating disease-causing and high and low expressed genes can support finding the root causes of uncertainties in patient care. However, independent and timely high-throughput next-generation sequencing data analysis is still a challenge for non-computational biologists and geneticists. Results: In this manuscript, we present a findable, accessible, interactive, and reusable (FAIR) bioinformatics platform, i.e., GVViZ (visualizing genes with disease-causing variants). GVViZ is a user-friendly, cross-platform, and database application for RNA-seq-driven variable and complex gene-disease data annotation and expression analysis with a dynamic heat map visualization. GVViZ has the potential to find patterns across millions of features and extract actionable information, which can support the early detection of complex disorders and the development of new therapies for personalized patient care. The execution of GVViZ is based on a set of simple instructions that users without a computational background can follow to design and perform customized data analysis. It can assimilate patients’ transcriptomics data with the public, proprietary, and our in-house developed gene-disease databases to query, easily explore, and access information on gene annotation and classified disease phenotypes with greater visibility and customization. To test its performance and understand the clinical and scientific impact of GVViZ, we present GVViZ analysis for different chronic diseases and conditions, including Alzheimer’s disease, arthritis, asthma, diabetes mellitus, heart failure, hypertension, obesity, osteoporosis, and multiple cancer disorders. The results are visualized using GVViZ and can be exported as image (PNG/TIFF) and text (CSV) files that include gene names, Ensembl (ENSG) IDs, quantified abundances, expressed transcript lengths, and annotated oncology and non-oncology diseases. Conclusions: We emphasize that automated and interactive visualization should be an indispensable component of modern RNA-seq analysis, which is currently not the case. However, experts in clinics and researchers in life sciences can use GVViZ to visualize and interpret the transcriptomics data, making it a powerful tool to study the dynamics of gene expression and regulation. Furthermore, with successful deployment in clinical settings, GVViZ has the potential to enable high-throughput correlations between patient diagnoses based on clinical and transcriptomics data.


Transcriptome Analysis of Responses to Dengue Virus 2 Infection in Aedes albopictus (Skuse) C6/36 Cells

Viruses ◽

10.3390/v13020343 ◽

2021 ◽

Vol 13 (2) ◽

pp. 343

Author(s):

Manjin Li ◽

Dan Xing ◽

Duo Su ◽

Di Wang ◽

Heting Gao ◽

...

Keyword(s):

Dengue Virus ◽

Aedes Albopictus ◽

Bioinformatics Analysis ◽

Interaction Mechanism ◽

Transcriptional Analysis ◽

Functional Verification ◽

Mosquito Vector ◽

Rna Seq ◽

Sequencing Data ◽

Qrt Pcr

Dengue virus (DENV), a member of the Flavivirus genus of the Flaviviridae family, can cause dengue fever (DF) and more serious diseases and thus imposes a heavy burden worldwide. As the main vector of DENV, mosquitoes are a serious hazard. After infection, they induce a complex host–pathogen interaction mechanism. Our goal is to further study the interaction mechanism of viruses in homologous, sensitive, and repeatable C6/36 cell vectors. Transcriptome sequencing (RNA-Seq) technology was applied to the host transcript profiles of C6/36 cells infected with DENV2. Then, bioinformatics analysis was used to identify significant differentially expressed genes (DEGs) and the associated biological processes. Quantitative reverse transcription-polymerase chain reaction (qRT-PCR) was performed to verify the sequencing data. A total of 1239 DEGs were found by transcriptional analysis comparing dengue virus-infected and uninfected Aedes albopictus C6/36 cells, among which 1133 were upregulated and 106 were downregulated. Further bioinformatics analysis showed that the upregulated DEGs were significantly enriched in signaling pathways such as the MAPK, Hippo, FoxO, Wnt, mTOR, and Notch pathways, as well as in metabolic pathways and cellular physiological processes such as autophagy, endocytosis, and apoptosis. Downregulated DEGs were mainly enriched in DNA replication, pyrimidine metabolism, and repair pathways, including BER, NER, and MMR. The concordance between the RNA-Seq and qRT-PCR data was very high (92.3%). The results of this study provide more information about DENV2 infection of C6/36 cells at the transcriptome level, laying a foundation for further research on mosquito vector–virus interactions. These data provide candidate antiviral genes that can be used for further functional verification in the future.


Transcriptomic and ChIP-seq Integrative Analysis Reveals Important Roles of Epigenetically Regulated lncRNAs in Placental Development in Meishan Pigs

Genes ◽

10.3390/genes11040397 ◽

2020 ◽

Vol 11 (4) ◽

pp. 397

Author(s):

Dadong Deng ◽

Xihong Tan ◽

Kun Han ◽

Ruimin Ren ◽

Jianhua Cao ◽

...

Keyword(s):

Differentially Expressed ◽

Rna Seq ◽

Sequencing Data ◽

Placental Development ◽

Cytoskeleton Organization ◽

New Class ◽

Chromatin Immunoprecipitation Sequencing ◽

Non Coding Rnas ◽

Two Stages ◽

Regulatory Functions

The development of the placental fold, which increases the maternal–fetal interacting surface area, is of primary importance for the growth of the fetus throughout the whole pregnancy. However, the mechanisms involved remain to be fully elucidated. Increasing evidence has revealed that long non-coding RNAs (lncRNAs) are a new class of RNAs with regulatory functions and could be epigenetically regulated by histone modifications. In this study, 141 lncRNAs (including 73 up-regulated and 68 down-regulated lncRNAs) were identified to be differentially expressed in the placentas of pigs during the establishment and expanding stages of placental fold development. The differentially expressed lncRNAs and genes (DElncRNA-DEgene) co-expression network analysis revealed that these differentially expressed lncRNAs (DElncRNAs) were mainly enriched in pathways of cell adhesion, cytoskeleton organization, epithelial cell differentiation and angiogenesis, indicating that the DElncRNAs are related to the major events that occur during placental fold development. In addition, we integrated the RNA-seq (RNA sequencing) data with the ChIP-seq (chromatin immunoprecipitation sequencing) data of H3K4me3/H3K27ac produced from the placental samples of pigs from the two stages (gestational days 50 and 95). The analysis revealed that the changes in H3K4me3 and/or H3K27ac levels were significantly associated with the changes in the expression levels of 37 DElncRNAs. Furthermore, several H3K4me3/H3K27ac-lncRNAs were characterized to be significantly correlated with genes functionally related to placental development. Thus, this study provides new insights into understanding the mechanisms for the placental development of pigs.
