Simultaneous estimation of transcript abundances and transcript specific fragment distributions of RNA-Seq data with the Mix2 model

2014 ◽  
Author(s):  
Andreas Tuerk ◽  
Gregor Wiktorin

Abstract Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragmentation bias, which is not represented appropriately by current statistical models of RNA-Seq data. Another, less investigated, source of error is the inaccuracy of transcript start and end annotations. This article introduces the Mix2 (rd. “mixquare”) model, which uses a mixture of probability distributions to model the transcript specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the EM algorithm and are tied between similar transcripts. Transcript specific shift and scale parameters allow the Mix2 model to automatically correct inaccurate transcript start and end annotations. Experiments are conducted on synthetic data covering 7 genes of different complexity, 4 types of fragment bias and correct as well as incorrect transcript start and end annotations. Abundance estimates obtained by Cufflinks 2.2.0, PennSeq and the Mix2 model show superior performance of the Mix2 model in the vast majority of test conditions. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz, subject to the enclosed license. Additional experimental data are available in the supplement.

2014 ◽  
Author(s):  
Andreas Tuerk ◽  
Gregor Wiktorin ◽  
Serhat Güler

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragment bias, which is not represented appropriately by current statistical models of RNA-Seq data. This article introduces the Mix2 (rd. "mixquare") model, which uses a mixture of probability distributions to model the transcript specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the Expectation Maximization (EM) algorithm resulting in simultaneous estimates of the transcript abundances and transcript specific positional biases. Experiments are conducted on synthetic data and the Universal Human Reference (UHR) and Brain (HBR) sample from the Microarray quality control (MAQC) data set. Comparing the correlation between qPCR and FPKM values to state-of-the-art methods Cufflinks and PennSeq, we obtain an increase in R² value from 0.44 to 0.6 and from 0.34 to 0.54. In the detection of differential expression between UHR and HBR the true positive rate increases from 0.44 to 0.71 at a false positive rate of 0.1. Finally, the Mix2 model is used to investigate biases present in the MAQC data. This reveals 5 dominant biases which deviate from the common assumption of a uniform fragment distribution. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz.
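
The abstract above states that a mixture of probability distributions is fitted with the EM algorithm. As a rough illustration of that general technique only, the following sketch runs EM on a one-dimensional Gaussian mixture over relative fragment positions. The choice of Gaussian components, the function names and all parameters are assumptions made for this example; it is not the Mix2 software and does not estimate transcript abundances.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_gaussian_mixture(xs, means, sds, weights, n_iter=50, min_sd=1e-3):
    """Fit a 1D Gaussian mixture to xs with EM (hypothetical helper, illustration only).

    xs      : observed values, e.g. fragment positions scaled to [0, 1]
    means, sds, weights : initial parameters, one entry per mixture component
    """
    means, sds, weights = list(means), list(sds), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in xs:
            p = [weights[j] * normal_pdf(x, means[j], sds[j]) for j in range(k)]
            total = sum(p) or 1e-300  # guard against division by zero on underflow
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and standard deviation per component.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj <= 0.0:
                continue  # component received no mass; keep its old parameters
            weights[j] = nj / len(xs)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sds[j] = max(math.sqrt(var), min_sd)  # floor keeps the density finite
    return means, sds, weights


if __name__ == "__main__":
    # Toy positions clustered near both ends of a transcript (made-up data).
    positions = [0.10, 0.12, 0.08, 0.11, 0.09, 0.90, 0.88, 0.92, 0.91, 0.89]
    print(em_gaussian_mixture(positions, [0.3, 0.7], [0.2, 0.2], [0.5, 0.5]))
```

The floor on the standard deviation is a design choice for this sketch: without it a component that collapses onto a single point would produce an unbounded density.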


Sensors ◽  
2020 ◽  
Vol 20 (20) ◽  
pp. 5966
Author(s):  
Ke Wang ◽  
Gong Zhang

The challenge of small data has emerged in synthetic aperture radar automatic target recognition (SAR-ATR) problems. Most SAR-ATR methods are data-driven and require a lot of training data that are expensive to collect. To address this challenge, we propose a recognition model that incorporates meta-learning and amortized variational inference (AVI). Specifically, the model consists of global parameters and task-specific parameters. The global parameters, trained by meta-learning, construct a common feature extractor shared between all recognition tasks. The task-specific parameters, modeled by probability distributions, can adapt to new tasks with a small amount of training data. To reduce the computation and storage cost, the task-specific parameters are inferred by AVI implemented with set-to-set functions. Extensive experiments were conducted on a real SAR dataset to evaluate the effectiveness of the model. The results of the proposed approach compared with those of the latest SAR-ATR methods show the superior performance of our model, especially on recognition tasks with limited data.


2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Ruizhu Huang ◽  
Charlotte Soneson ◽  
Pierre-Luc Germain ◽  
Thomas S.B. Schmidt ◽  
Christian Von Mering ◽  
...  

Abstract treeclimbR is for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection on entities across resolutions and gives additional flexibility to uncover biological associations.


BMC Genomics ◽  
2021 ◽  
Vol 22 (1) ◽  
Author(s):  
Shuhua Zhan ◽  
Cortland Griswold ◽  
Lewis Lukens

Abstract Background Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays presents a model species to assess these questions, given that it has genotypically distinct and well-studied genetic lines. Results In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.


2019 ◽  
Vol 623 ◽  
pp. A156 ◽  
Author(s):  
H. E. Delgado ◽  
L. M. Sarro ◽  
G. Clementini ◽  
T. Muraveva ◽  
A. Garofalo

In a recent study we analysed period–luminosity–metallicity (PLZ) relations for RR Lyrae stars using the Gaia Data Release 2 (DR2) parallaxes. It built on a previous work that was based on the first Gaia Data Release (DR1), and also included period–luminosity (PL) relations for Cepheids and RR Lyrae stars. The method used to infer the relations from Gaia DR2 data and one of the methods used for Gaia DR1 data was based on a Bayesian model, the full description of which was deferred to a subsequent publication. This paper presents the Bayesian method for the inference of the parameters of PL(Z) relations used in those studies, the main feature of which is to manage the uncertainties on observables in a rigorous and well-founded way. The method encodes the probability relationships between the variables of the problem in a hierarchical Bayesian model and infers the posterior probability distributions of the PL(Z) relationship coefficients using Markov chain Monte Carlo simulation techniques. We evaluate the method with several semi-synthetic data sets and apply it to a sample of 200 fundamental and first-overtone RR Lyrae stars for which Gaia DR1 parallaxes and literature Ks-band mean magnitudes are available. We define and test several hyperprior probabilities to verify their adequacy and check the sensitivity of the solution with respect to the prior choice. The main conclusion of this work, based on the test with semi-synthetic Gaia DR1 parallaxes, is the absolute necessity of incorporating the existing correlations between the period, metallicity, and parallax measurements in the form of model priors in order to avoid systematically biased results, especially in the case of non-negligible uncertainties in the parallaxes. The relation coefficients obtained here have been superseded by those presented in our recent paper that incorporates the findings of this work and the more recent Gaia DR2 measurements.


Geophysics ◽  
1986 ◽  
Vol 51 (2) ◽  
pp. 332-346 ◽  
Author(s):  
Daniel H. Rothman

Conventional approaches to residual statics estimation obtain solutions by performing linear inversion of observed traveltime deviations. A crucial component of these procedures is picking time delays; gross errors in these picks are known as “cycle skips” or “leg jumps” and are the bane of linear traveltime inversion schemes. This paper augments Rothman (1985), which demonstrated that the estimation of large statics in noise‐contaminated data is posed better as a nonlinear, rather than as a linear, inverse problem. Cycle skips then appear as local (secondary) minima of the resulting nonlinear optimization problem. In the earlier paper, a Monte Carlo technique from statistical mechanics was adapted to perform global optimization, and the technique was applied to synthetic data. Here I present an application of a similar Monte Carlo method to field data from the Wyoming Overthrust belt. Key changes, however, have led to a more efficient and practical algorithm. The new technique performs explicit crosscorrelation of traces. Instead of picking the peaks of these crosscorrelation functions, the method transforms the crosscorrelation functions to probability distributions and then draws random numbers from the distributions. Estimates of statics are now iteratively updated by this procedure until convergence to the optimal stack is achieved. Here I also derive several theoretical properties of the algorithm. The method is expressed as a Markov chain, in which the equilibrium (steady‐state) distribution is the Gibbs distribution of statistical mechanics.
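
The core step described above, turning a crosscorrelation function into a probability distribution and drawing a time shift from it rather than picking the peak, can be sketched as follows. The abstract does not give the exact transform, so the exponential (Gibbs-like) weighting with a `temperature` parameter is an assumption made here, as are all function names and the toy traces.

```python
import math
import random


def crosscorrelation(trace, reference, max_lag):
    """Return {lag: sum_t trace[t + lag] * reference[t]} for |lag| <= max_lag."""
    cc = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for t, ref_value in enumerate(reference):
            s = t + lag
            if 0 <= s < len(trace):
                total += trace[s] * ref_value
        cc[lag] = total
    return cc


def to_distribution(cc, temperature):
    """Map crosscorrelation values to probabilities (assumed form: exp(value / T))."""
    peak = max(cc.values())
    # Subtracting the peak before exponentiating avoids overflow.
    weights = {lag: math.exp((value - peak) / temperature) for lag, value in cc.items()}
    norm = sum(weights.values())
    return {lag: w / norm for lag, w in weights.items()}


def draw_static(dist, rng):
    """Draw one lag at random according to the distribution."""
    lags = list(dist)
    return rng.choices(lags, weights=[dist[lag] for lag in lags], k=1)[0]


if __name__ == "__main__":
    reference = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
    trace = [0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 1.0, 0.0]  # reference delayed by 2 samples
    dist = to_distribution(crosscorrelation(trace, reference, max_lag=3), temperature=0.5)
    print(draw_static(dist, random.Random(0)))
```

In this sketch a high temperature spreads the probability over many lags, while a low temperature concentrates it on the crosscorrelation peak, so that sampling approaches conventional peak picking.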


2019 ◽  
Vol 35 (23) ◽  
pp. 5039-5047 ◽  
Author(s):  
Gabrielle Deschamps-Francoeur ◽  
Vincent Boivin ◽  
Sherif Abou Elela ◽  
Michelle S Scott

Abstract Motivation Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons. Availability and implementation The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco. Supplementary information Supplementary data are available at Bioinformatics online.
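
To make the idea of proportionally distributing multimapped reads concrete, here is a minimal sketch. It assumes that each group of multimapped reads is split among its candidate genes in proportion to their uniquely mapped read counts, with an even split when none of the candidates has unique reads. That weighting rule, the function name and the input format are assumptions for illustration and are not taken from the CoCo source code.

```python
def distribute_multimapped(unique_counts, multimapped_groups):
    """Split multimapped reads among candidate genes (hypothetical helper).

    unique_counts      : {gene: number of uniquely mapped reads}
    multimapped_groups : iterable of (genes, n_reads), where genes is the tuple
                         of genes the n_reads multimapped reads align to
    Returns {gene: corrected count} as floats.
    """
    counts = {gene: float(n) for gene, n in unique_counts.items()}
    for genes, n_reads in multimapped_groups:
        total_unique = sum(unique_counts.get(gene, 0) for gene in genes)
        for gene in genes:
            if total_unique > 0:
                # Assumed rule: share proportional to uniquely mapped reads.
                share = unique_counts.get(gene, 0) / total_unique
            else:
                # No unique evidence for any candidate: split evenly.
                share = 1.0 / len(genes)
            counts[gene] = counts.get(gene, 0.0) + n_reads * share
    return counts


if __name__ == "__main__":
    unique = {"A": 30, "B": 10, "C": 0, "D": 0}  # made-up counts
    groups = [(("A", "B"), 8), (("C", "D"), 4)]
    print(distribute_multimapped(unique, groups))
```

Because every group's shares sum to one, the total number of reads is preserved: no multimapped read is discarded or counted twice.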


Author(s):  
Yoshiaki Yasumizu ◽  
Atsushi Hara ◽  
Shimon Sakaguchi ◽  
Naganari Ohkura

Abstract Summary The possibility that RNA transcripts from clinical samples contain plenty of virus RNAs has not been pursued actively so far. We here developed a new tool for analyzing virus-transcribed mRNAs, not virus copy numbers, in the data of bulk and single-cell RNA-sequencing of human cells. Our pipeline, named VIRTUS (VIRal Transcript Usage Sensor), was able to detect 762 viruses including herpesviruses, retroviruses and even SARS-CoV-2 (COVID-19), and quantify their transcripts in the sequence data. This tool thus enabled simultaneously detecting infected cells, the composition of multiple viruses within the cell, and the endogenous host-gene expression profile of the cell. This bioinformatics method would be instrumental in addressing the possible effects of covertly infecting viruses on certain diseases and developing new treatments to target such viruses. Availability and implementation: VIRTUS is implemented using Common Workflow Language and Docker under a CC-NC license. VIRTUS is freely available at https://github.com/yyoshiaki/VIRTUS. Supplementary information Supplementary data are available at Bioinformatics online.


mBio ◽  
2011 ◽  
Vol 2 (2) ◽  
Author(s):  
Lindsey Bomar ◽  
Michele Maltz ◽  
Sophie Colston ◽  
Joerg Graf

ABSTRACT The vast majority of bacterial species remain uncultured, and this severely limits the investigation of their physiology, metabolic capabilities, and role in the environment. High-throughput sequencing of RNA transcripts (RNA-seq) allows the investigation of the diverse physiologies from uncultured microorganisms in their natural habitat. Here, we report the use of RNA-seq for characterizing the metatranscriptome of the simple gut microbiome from the medicinal leech Hirudo verbana and for utilizing this information to design a medium for cultivating members of the microbiome. Expression data suggested that a Rikenella-like bacterium, the most abundant but uncultured symbiont, forages on sulfated- and sialated-mucin glycans that are fermented, leading to the secretion of acetate. Histological stains were consistent with the presence of sulfated and sialated mucins along the crop epithelium. The second dominant symbiont, Aeromonas veronii, grows in two different microenvironments and is predicted to utilize either acetate or carbohydrates. Based on the metatranscriptome, a medium containing mucin was designed, which enabled the cultivation of the Rikenella-like bacterium. Metatranscriptomes shed light on microbial metabolism in situ and provide critical clues for directing the culturing of uncultured microorganisms. By choosing a condition under which the desired organism is rapidly proliferating and focusing on highly expressed genes encoding hydrolytic enzymes, binding proteins, and transporters, one can identify an organism’s nutritional preferences and design a culture medium. IMPORTANCE The number of prokaryotes on the planet has been estimated to exceed 10³⁰ cells, and the overwhelming majority of them have evaded cultivation, making it difficult to investigate their ecological, medical, and industrial relevance. The application of transcriptomics based on high-throughput sequencing of RNA transcripts (RNA-seq) to microorganisms in their natural environment can provide investigators with insight into their physiologies under optimal growth conditions. We utilized RNA-seq to learn more about the uncultured and cultured symbionts that comprise the relatively simple digestive-tract microbiome of the medicinal leech. The expression data revealed highly expressed hydrolytic enzymes and transporters that provided critical clues for the design of a culture medium enabling the isolation of the previously uncultured Rikenella-like symbiont. This directed culturing method will greatly aid efforts aimed at understanding uncultured microorganisms, including beneficial symbionts, pathogens, and ecologically relevant microorganisms, by facilitating genome sequencing, physiological characterization, and genetic manipulation of the previously uncultured microbes.


Author(s):  
Nanne J. Van Der Zijpp

The problem of estimating time-varying origin-destination matrices from time series of traffic counts is extended to allow for the use of partial vehicle trajectory observations. These may be obtained by using automated vehicle identification (AVI), for example, automated license plate recognition, but they may also originate from floating car data. The central problem definition allows for the use of data from induction loops and AVI equipment at arbitrary (but fixed) locations and allows for the presence of random error in traffic counts and misrecognition at the AVI stations. Although the described methods may be extended to more complex networks, the application addressed involves a single highway corridor in which no route choice alternatives exist. Analysis of the problem leads to an expression for the mutual dependencies between link volume observations and AVI data and the formulation of an estimation problem with inequality constraints. A number of traditional estimation procedures such as discounted constrained least squares (DCLS) and the Kalman filter are described, and a new procedure referred to as Bayesian updating is proposed. The advantage of this new procedure is that it deals with the inequality constraints in an appropriate statistical manner. Experiments with a large number of synthetic data sets indicate in all cases a reduction of the error of estimation due to usage of trajectory counts and, compared with the traditional DCLS and Kalman filtering methods, a superior performance of the Bayesian updating procedure.

