Consistent Reanalysis of Genome-wide Imprinting Studies in Plants Using Generalized Linear Models Increases Concordance across Datasets

2017 ◽  
Author(s):  
Stefan Wyder ◽  
Michael T. Raissig ◽  
Ueli Grossniklaus

ABSTRACT Genomic imprinting leads to different expression levels of maternally and paternally derived alleles. In recent years, major progress has been made in identifying novel imprinted candidate genes in plants, owing to affordable next-generation sequencing technologies. However, reports on sequencing the transcriptome of hybrid F1 seed tissues strongly disagree about how many and which genes are imprinted. This raises questions about the relative impact of biological, environmental, technical, and analytic differences or biases. Here, we adopt a statistical approach, frequently used in RNA-seq data analysis, which properly models count overdispersion and considers replicate information of reciprocal crosses. We show that our statistical pipeline outperforms other methods in identifying imprinted genes in simulated and real data. Accordingly, reanalysis of genome-wide imprinting studies in Arabidopsis and maize shows that, at least for the Arabidopsis dataset, an increased agreement across datasets can be observed. For maize, however, consistent reanalysis did not yield a larger overlap between the datasets. This suggests that the discrepancy across publications might be partially due to different analysis pipelines but that technical, biological, and environmental factors underlie much of the discrepancy between datasets. Finally, we show that the set of genes that can be characterized regarding allelic bias by all studies with minimal confidence is small (~8,000/27,416 genes for Arabidopsis and ~12,000/39,469 for maize). In conclusion, we propose to use biologically replicated reciprocal crosses, high sequence coverage, and a generalized linear model approach to identify differentially expressed alleles in developing seeds.
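
As a rough illustration of what "count overdispersion" means here, the sketch below estimates a negative-binomial dispersion from replicate counts by the method of moments, assuming the common parameterization Var = mu + phi * mu^2. It is not the authors' pipeline; the function name and the example counts are hypothetical.

```python
from statistics import mean, variance


def moment_dispersion(counts):
    """Method-of-moments estimate of phi, assuming Var = mu + phi * mu**2.

    Needs at least two replicates. Returns 0.0 when the sample variance
    does not exceed the mean (no evidence of overdispersion).
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance, n - 1 denominator
    return max(0.0, (var - mu) / mu ** 2)


# Hypothetical read counts for one allele across four biological replicates.
replicate_counts = [12, 30, 7, 41]
phi = moment_dispersion(replicate_counts)
print(round(phi, 2))  # about 0.45, i.e. far more spread than a Poisson model allows
```

A Poisson model would force phi to zero, which is one reason replicate-aware count models are preferred when biological variation between crosses is substantial.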

2020 ◽  
Author(s):  
Estefania Mancini ◽  
Andres Rabinovich ◽  
Javier Iserte ◽  
Marcelo Yanovsky ◽  
Ariel Chernomoretz

Abstract Genome-wide analysis of alternative splicing has been a very active field of research since the early days of next-generation sequencing (NGS) technologies. Since then, ever-growing data availability and the development of increasingly sophisticated analysis methods have uncovered the complexity of the general splicing repertoire. However, independently of the considered quantification methodology, very often changes in variant concentration profiles can be hard to disentangle. In order to tackle this problem we present ASpli2, a computational suite implemented in R, that allows the identification of changes in both annotated and novel alternative splicing events and can deal with complex experimental designs. Our analysis workflow relies on the analysis of differential usage of subgenic features in combination with a junction-based description of local splicing changes. Analyzing simulated and real data, we found that the consolidation of these signals resulted in a robust proxy of the occurrence of splicing alterations. While junction-based signals allowed us to uncover annotated as well as non-annotated events, bin-associated signals notably increased recall capabilities at a very competitive performance in terms of precision.


Author(s):  
Mancini Estefania ◽  
Rabinovich Andres ◽  
Iserte Javier ◽  
Yanovsky Marcelo ◽  
Chernomoretz Ariel

Abstract Motivation: Genome-wide analysis of alternative splicing has been a very active field of research since the early days of next-generation sequencing technologies. Since then, ever-growing data availability and the development of increasingly sophisticated analysis methods have uncovered the complexity of the general splicing repertoire. A large number of splicing analysis methodologies exist, each of them presenting its own strengths and weaknesses. For instance, methods exclusively relying on junction information do not take advantage of the large majority of reads produced in an RNA-seq assay, isoform reconstruction methods might not detect novel intron retention events, some solutions can only handle canonical splicing events, and many existing methods can only perform pairwise comparisons. Results: In this contribution, we present ASpli, a computational suite implemented in the R statistical language, that allows the identification of changes in both annotated and novel alternative-splicing events and can deal with simple, multi-factor or paired experimental designs. Our integrative computational workflow, which applies the same GLM model to different sets of reads and junctions, allows computation of complementary splicing signals. Analyzing simulated and real data, we found that the consolidation of these signals resulted in a robust proxy of the occurrence of splicing alterations. While the analysis of junctions allowed us to uncover annotated as well as non-annotated events, read coverage signals notably increased recall capabilities at a very competitive performance when compared against other state-of-the-art splicing analysis algorithms. Availability and implementation: ASpli is freely available from the Bioconductor project site https://doi.org/doi:10.18129/B9.bioc.ASpli. Supplementary information: Supplementary data are available at Bioinformatics online.


2017 ◽  
Vol 1 (Special Issue-Supplement) ◽  
pp. 237-237
Author(s):  
Reddaiah Bodanapu ◽  
Krishna Lalam ◽  
Durga Khandekar ◽  
Navitha Kokkonda ◽  
Sivarama Prasad Lekkala ◽  
...  

Author(s):  
Aaron T. L. Lun ◽  
Gordon K. Smyth

Abstract RNA sequencing (RNA-seq) is widely used to study gene expression changes associated with treatments or biological conditions. Many popular methods for detecting differential expression (DE) from RNA-seq data use generalized linear models (GLMs) fitted to the read counts across independent replicate samples for each gene. This article shows that the standard formula for the residual degrees of freedom (d.f.) in a linear model is overstated when the model contains fitted values that are exactly zero. Such fitted values occur whenever all the counts in a treatment group are zero as well as in more complex models such as those involving paired comparisons. This misspecification results in underestimation of the genewise variances and loss of type I error control. This article proposes a formula for the reduced residual d.f. that restores error control in simulated RNA-seq data and improves detection of DE genes in a real data analysis. The new approach is implemented in the quasi-likelihood framework of the edgeR software package. The results of this article also apply to RNA-seq analyses that apply linear models to log-transformed counts, such as those in the limma software package, and more generally to any count-based GLM where exactly zero fitted values are possible.
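
The effect is easiest to see in a one-way layout (one mean per treatment group). The sketch below is a simplified illustration under that assumption only; it is not the general formula from the article, nor the edgeR implementation, and the function name and counts are hypothetical. A group whose counts are all zero has a fitted value of exactly zero and perfectly zero residuals, so it is treated here as contributing no residual d.f.

```python
def residual_df_one_way(groups):
    """Return (standard_df, reduced_df) for one gene in a one-way layout.

    groups maps a group label to that group's list of read counts.
    standard_df is the usual n - p. reduced_df drops the n_g - 1 residual
    d.f. of every group whose counts are all zero.
    """
    n = sum(len(counts) for counts in groups.values())
    p = len(groups)  # one coefficient per group
    standard_df = n - p
    reduced_df = sum(len(counts) - 1 for counts in groups.values() if any(counts))
    return standard_df, reduced_df


# Hypothetical gene that is silent in the control group.
print(residual_df_one_way({"control": [0, 0, 0], "treated": [5, 9, 7]}))  # (4, 2)
```

Dividing the same residual sum of squares (or deviance) by 4 instead of 2 would halve the variance estimate, which matches the direction of the bias described in the abstract.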


2017 ◽  
Author(s):  
Luke Zappia ◽  
Belinda Phipson ◽  
Alicia Oshlack

Abstract As single-cell RNA sequencing technologies have rapidly developed, so have analysis methods. Many methods have been tested, developed and validated using simulated datasets. Unfortunately, current simulations are often poorly documented, their similarity to real data is not demonstrated, or reproducible code is not available. Here we present the Splatter Bioconductor package for simple, reproducible and well-documented simulation of single-cell RNA-seq data. Splatter provides an interface to multiple simulation methods including Splat, our own simulation, based on a gamma-Poisson distribution. Splat can simulate single populations of cells, populations with multiple cell types or differentiation paths.
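
For readers unfamiliar with the gamma-Poisson hierarchy, here is a minimal standard-library sketch of the sampling idea: draw a cell-specific rate from a gamma distribution, then draw the observed count from a Poisson with that rate. It illustrates the distribution only and is not Splatter's API or the full Splat model; the function names, the mean of 5.0 and the dispersion of 0.5 are assumptions chosen for the example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method.

    Adequate for small means only; exp(-lam) underflows for very large lam.
    """
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_gamma_poisson(mean, dispersion, rng):
    """Draw one count whose marginal variance is mean + dispersion * mean**2."""
    shape = 1.0 / dispersion
    scale = mean * dispersion  # gamma mean = shape * scale = mean
    rate = rng.gammavariate(shape, scale)
    return sample_poisson(rate, rng)


rng = random.Random(0)  # fixed seed so the simulation is reproducible
counts = [sample_gamma_poisson(5.0, 0.5, rng) for _ in range(2000)]
sample_mean = sum(counts) / len(counts)
sample_var = sum((c - sample_mean) ** 2 for c in counts) / (len(counts) - 1)
print(sample_mean, sample_var)  # the variance should clearly exceed the mean
```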


F1000Research ◽  
2016 ◽  
Vol 5 ◽  
pp. 888
Author(s):  
Elizabeth Baskin ◽  
Peter DeFord ◽  
Allison F. Dennis ◽  
Ian Misner ◽  
Frederick J. Tan ◽  
...  

The rapid rise of high-throughput, data intensive experimental techniques has thrust many biologists into the role of data analyst – a role many biologists feel ill-equipped to fill. Novices often struggle to find the resources and expertise they need to analyze their experimental results in a wet-lab environment. To fill this need, we developed an educational resource as part of a National Center for Biotechnology Information (NCBI) hackathon. Using RNA-seq as a model, our tutorial guides new users through the steps of data analysis, while placing an emphasis on understanding the motivation behind choices made in the process. To advance the goal of providing a deeper understanding of the analysis process, we developed a new tool, bamDiff. bamDiff allows users to compare the performance of multiple RNA-seq aligners, allowing users to select the most appropriate aligner for the data in question and experimental end-goal. Our tutorial is accessible via a GitHub wiki, with associated data and software provided on an Amazon Machine Image (AMI), which can be completed at no cost to the user through the Amazon Educate Program. Following the hackathon, our tutorial was integrated into the October 2015 offering of NCBI NOW (Next Generation Sequencing (NGS) Online Workshop), a free online experience targeting individuals new to NGS analysis.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Nicholas F. Lahens ◽  
Thomas G. Brooks ◽  
Dimitra Sarantopoulou ◽  
Soumyashant Nayak ◽  
Cris Lawrence ◽  
...  

Abstract Background: The accurate interpretation of RNA-Seq data presents a moving target as scientists continue to introduce new experimental techniques and analysis algorithms. Simulated datasets are an invaluable tool to accurately assess the performance of RNA-Seq analysis methods. However, existing RNA-Seq simulators focus on modeling the technical biases and artifacts of sequencing, rather than on simulating the original RNA samples. A first step in simulating RNA-Seq is to simulate RNA. Results: To fill this need, we developed the Configurable And Modular Program Allowing RNA Expression Emulation (CAMPAREE), a simulator using empirical data to simulate diploid RNA samples at the level of individual molecules. We demonstrated CAMPAREE’s use for generating idealized coverage plots from real data, and for adding the ability to generate allele-specific data to existing RNA-Seq simulators that do not natively support this feature. Conclusions: Separating input sample modeling from library preparation/sequencing offers added flexibility for both users and developers to mix-and-match different sample and sequencing simulators to suit their specific needs. Furthermore, the ability to maintain sample and sequencing simulators independently provides greater agility to incorporate new biological findings about transcriptomics and new developments in sequencing technologies. Additionally, by simulating at the level of individual molecules, CAMPAREE has the potential to model molecules transcribed from the same genes as a heterogeneous population of transcripts with different states of degradation and processing (splicing, editing, etc.). CAMPAREE was developed in Python, is open source, and freely available at https://github.com/itmat/CAMPAREE.


Reproduction ◽  
2013 ◽  
Vol 145 (6) ◽  
pp. 587-596 ◽  
Author(s):  
Xiangyang Miao ◽  
Qingmiao Luo

The Small-tail Han sheep and the Surabaya fur sheep are two local breeds in North China, which are characterized by high fecundity and low prolificacy, respectively. Significant genetic differences between these two breeds have prompted increasing interest in the identification and utilization of major prolificacy genes in these sheep. High prolificacy is a complex trait, and it is difficult to comprehensively identify the candidate genes related to this trait using a single molecular biology technique. To understand the molecular mechanisms of fecundity and provide more information about high prolificacy candidate genes in high- and low-fecundity sheep, we explored the utility of next-generation sequencing technology in this work. A total of 1.8 Gb of sequencing reads were obtained and resulted in more than 20 000 contigs that averaged ∼300 bp in length. Ten differentially expressed genes were further verified by quantitative real-time RT-PCR to confirm the reliability of the RNA-seq results. Our work will provide a basis for future research on sheep reproduction.


2019 ◽  
Vol 21 (5) ◽  
pp. 1756-1765
Author(s):  
Bo Sun ◽  
Liang Chen

Abstract Mapping of expression quantitative trait loci (eQTLs) facilitates interpretation of the regulatory path from genetic variants to their associated disease or traits. High-throughput sequencing of RNA (RNA-seq) has expedited the exploration of these regulatory variants. However, eQTL mapping is usually confronted with the analysis challenges caused by overdispersion and excessive dropouts in RNA-seq. The heavy-tailed distribution of gene expression violates the assumption of Gaussian distributed errors in linear regression for eQTL detection, which results in increased Type I or Type II errors. Applying rank-based inverse normal transformation (INT) can make the expression values more normally distributed. However, INT causes information loss and leads to uninterpretable effect size estimation. After comprehensive examination of the impact of overdispersion and excessive dropouts, we propose to apply a robust model, quantile regression, to map eQTLs for genes with a high degree of overdispersion or a large number of dropouts. Simulation studies show that quantile regression has the desired robustness to outliers and dropouts, and it significantly improves eQTL mapping. From a real data analysis, the most significant eQTL discoveries differ between quantile regression and the conventional linear model. Such discrepancy becomes more prominent when the dropout effect or the overdispersion effect is large. All the results suggest that quantile regression provides more reliable and accurate eQTL mapping than conventional linear models. It deserves more attention for large-scale eQTL mapping.
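
To make the rank-based inverse normal transformation concrete, the following sketch maps expression values to standard-normal quantiles of their ranks. The abstract does not specify the rank offset or tie handling; the Blom offset c = 3/8 and average ranks for ties are assumptions made here, and the function name and example values are hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=3.0 / 8.0):
    """Rank-based INT: z_i = Phi^{-1}((rank_i - c) / (n - 2c + 1)).

    Tied values (e.g. dropouts recorded as 0) receive their average rank.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical expression values: two dropouts and one extreme outlier.
expression = [0, 0, 3, 250, 12]
z = inverse_normal_transform(expression)
print([round(v, 2) for v in z])
```

Only the ordering survives: the outlier 250 ends up one rank step above 12, and the two dropouts collapse to a single value, which illustrates the information loss the abstract mentions.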

