Simultaneous estimation of transcript abundances and transcript specific fragment distributions of RNA-Seq data with the Mix2 model


10.1101/005918 ◽

2014 ◽

Author(s):

Andreas Tuerk ◽

Gregor Wiktorin

Keyword(s):

Probability Distributions ◽

Synthetic Data ◽

Superior Performance ◽

Simultaneous Estimation ◽

Rna Seq ◽

Scale Parameters ◽

Rna Transcripts ◽

Abundance Estimates ◽

The Em Algorithm ◽

Specific Fragment

Abstract: Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragmentation bias, which is not represented appropriately by current statistical models of RNA-Seq data. Another, less investigated, source of error is the inaccuracy of transcript start and end annotations. This article introduces the Mix2 (rd. “mixquare”) model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the EM algorithm and are tied between similar transcripts. Transcript-specific shift and scale parameters allow the Mix2 model to automatically correct inaccurate transcript start and end annotations. Experiments are conducted on synthetic data covering 7 genes of different complexity, 4 types of fragment bias and correct as well as incorrect transcript start and end annotations. Abundance estimates obtained by Cufflinks 2.2.0, PennSeq and the Mix2 model show superior performance of the Mix2 model in the vast majority of test conditions. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz, subject to the enclosed license. Additional experimental data are available in the supplement.


Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates

10.1101/011767 ◽

2014 ◽

Author(s):

Andreas Tuerk ◽

Gregor Wiktorin ◽

Serhat Güler

Keyword(s):

Probability Distributions ◽

False Positive Rate ◽

Synthetic Data ◽

True Positive Rate ◽

Rna Seq ◽

Microarray Quality Control ◽

Data Set ◽

Rna Transcripts ◽

Positive Rate ◽

Fragment Distribution

Quantification of RNA transcripts with RNA-Seq is inaccurate due to positional fragment bias, which is not represented appropriately by current statistical models of RNA-Seq data. This article introduces the Mix2 (rd. "mixquare") model, which uses a mixture of probability distributions to model the transcript-specific positional fragment bias. The parameters of the Mix2 model can be efficiently trained with the Expectation Maximization (EM) algorithm, resulting in simultaneous estimates of the transcript abundances and transcript-specific positional biases. Experiments are conducted on synthetic data and the Universal Human Reference (UHR) and Brain (HBR) samples from the Microarray Quality Control (MAQC) data set. Comparing the correlation between qPCR and FPKM values to the state-of-the-art methods Cufflinks and PennSeq, we obtain an increase in R² value from 0.44 to 0.6 and from 0.34 to 0.54. In the detection of differential expression between UHR and HBR, the true positive rate increases from 0.44 to 0.71 at a false positive rate of 0.1. Finally, the Mix2 model is used to investigate biases present in the MAQC data. This reveals 5 dominant biases which deviate from the common assumption of a uniform fragment distribution. The Mix2 software is available at http://www.lexogen.com/fileadmin/uploads/bioinfo/mix2model.tgz.
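
The abundance part of such an EM scheme can be illustrated with a minimal sketch. It is not the Mix2 implementation: it assumes the per-fragment likelihoods P(fragment | transcript) are already known and held fixed, whereas the abstract describes estimating the positional distributions at the same time. The function name `em_abundances` and the input layout are hypothetical.

```python
def em_abundances(lik, n_iter=500, tol=1e-12):
    """Estimate relative transcript abundances by EM.

    lik[f][t] is an assumed, fixed likelihood P(fragment f | transcript t).
    Returns a list of mixing weights that sums to 1.
    """
    n_tx = len(lik[0])
    alpha = [1.0 / n_tx] * n_tx  # start from uniform abundances
    for _ in range(n_iter):
        totals = [0.0] * n_tx
        for row in lik:
            # E-step: responsibility of each transcript for this fragment
            weighted = [a * p for a, p in zip(alpha, row)]
            norm = sum(weighted)
            if norm == 0.0:
                continue  # fragment is incompatible with every transcript
            for t, w in enumerate(weighted):
                totals[t] += w / norm
        used = sum(totals)
        if used == 0.0:
            raise ValueError("no fragment is compatible with any transcript")
        # M-step: new abundances are the normalised expected counts
        new_alpha = [x / used for x in totals]
        converged = max(abs(n - a) for n, a in zip(new_alpha, alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    return alpha


# Three fragments fit only transcript 0, one fits only transcript 1.
print(em_abundances([[1, 0], [1, 0], [1, 0], [0, 1]]))
```

In a full model of the kind described above, the M-step would also update the parameters of the positional mixture components; here only the abundances are re-estimated.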


SAR Target Recognition via Meta-Learning and Amortized Variational Inference

Sensors ◽

10.3390/s20205966 ◽

2020 ◽

Vol 20 (20) ◽

pp. 5966

Author(s):

Ke Wang ◽

Gong Zhang

Keyword(s):

Target Recognition ◽

Probability Distributions ◽

Automatic Target Recognition ◽

Variational Inference ◽

Training Data ◽

Superior Performance ◽

Small Data ◽

Meta Learning ◽

Radar Automatic Target Recognition ◽

Global Parameters

The challenge of small data has emerged in synthetic aperture radar automatic target recognition (SAR-ATR) problems. Most SAR-ATR methods are data-driven and require a lot of training data that are expensive to collect. To address this challenge, we propose a recognition model that incorporates meta-learning and amortized variational inference (AVI). Specifically, the model consists of global parameters and task-specific parameters. The global parameters, trained by meta-learning, construct a common feature extractor shared between all recognition tasks. The task-specific parameters, modeled by probability distributions, can adapt to new tasks with a small amount of training data. To reduce the computation and storage cost, the task-specific parameters are inferred by AVI implemented with set-to-set functions. Extensive experiments were conducted on a real SAR dataset to evaluate the effectiveness of the model. The results of the proposed approach compared with those of the latest SAR-ATR methods show the superior performance of our model, especially on recognition tasks with limited data.


treeclimbR pinpoints the data-dependent resolution of hierarchical hypotheses

Genome Biology ◽

10.1186/s13059-021-02368-1 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Ruizhu Huang ◽

Charlotte Soneson ◽

Pierre-Luc Germain ◽

Thomas S.B. Schmidt ◽

Christian Von Mering ◽

...

Keyword(s):

Single Cell ◽

Synthetic Data ◽

Cell Types ◽

Data Driven ◽

Rna Seq ◽

Hierarchical Trees

Abstract: treeclimbR is for analyzing hierarchical trees of entities, such as phylogenies or cell types, at different resolutions. It proposes multiple candidates that capture the latent signal and pinpoints branches or leaves that contain features of interest, in a data-driven way. It outperforms currently available methods on synthetic data, and we highlight the approach on various applications, including microbiome and microRNA surveys as well as single-cell cytometry and RNA-seq datasets. With the emergence of various multi-resolution genomic datasets, treeclimbR provides a thorough inspection on entities across resolutions and gives additional flexibility to uncover biological associations.


Zea mays RNA-seq estimated transcript abundances are strongly affected by read mapping bias

BMC Genomics ◽

10.1186/s12864-021-07577-3 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Shuhua Zhan ◽

Cortland Griswold ◽

Lewis Lukens

Keyword(s):

Gene Expression ◽

Zea Mays ◽

Reference Genome ◽

Transcript Abundance ◽

Gene Transcript ◽

Rna Seq ◽

Individual Genome ◽

Abundance Estimates ◽

Mapping Bias ◽

Quantify Gene Expression

Background: Genetic variation for gene expression is a source of phenotypic variation for natural and agricultural species. The common approach to map and to quantify gene expression from genetically distinct individuals is to assign their RNA-seq reads to a single reference genome. However, RNA-seq reads from alleles dissimilar to this reference genome may fail to map correctly, causing transcript levels to be underestimated. Presently, the extent of this mapping problem is not clear, particularly in highly diverse species. We investigated whether mapping bias occurred and whether chromosomal features were associated with mapping bias. Zea mays presents a model species to assess these questions, given it has genotypically distinct and well-studied genetic lines. Results: In Zea mays, the inbred B73 genome is the standard reference genome and template for RNA-seq read assignments. In the absence of mapping bias, B73 and a second inbred line, Mo17, would each have an approximately equal number of regulatory alleles that increase gene expression. Remarkably, Mo17 had 2–4 times fewer such positively acting alleles than did B73 when RNA-seq reads were aligned to the B73 reference genome. Reciprocally, over one-half of the B73 alleles that increased gene expression were not detected when reads were aligned to the Mo17 genome template. Genes at dissimilar chromosomal ends were strongly affected by mapping bias, and genes at more similar pericentromeric regions were less affected. Biased transcript estimates were higher in untranslated regions and lower in splice junctions. Bias occurred across software and alignment parameters. Conclusions: Mapping bias very strongly affects gene transcript abundance estimates in maize, and bias varies across chromosomal features. Individual genome or transcriptome templates are likely necessary for accurate transcript estimation across genetically variable individuals in maize and other species.


Hierarchical Bayesian model to infer PL(Z) relations using Gaia parallaxes

Astronomy and Astrophysics ◽

10.1051/0004-6361/201832945 ◽

2019 ◽

Vol 623 ◽

pp. A156 ◽

Cited By ~ 3

Author(s):

H. E. Delgado ◽

L. M. Sarro ◽

G. Clementini ◽

T. Muraveva ◽

A. Garofalo

Keyword(s):

Bayesian Model ◽

Probability Distributions ◽

Synthetic Data ◽

Full Description ◽

Hierarchical Bayesian ◽

Hierarchical Bayesian Model ◽

Rr Lyrae Stars ◽

Rr Lyrae ◽

Data Release ◽

Lyrae Stars

In a recent study we analysed period–luminosity–metallicity (PLZ) relations for RR Lyrae stars using the Gaia Data Release 2 (DR2) parallaxes. It built on a previous work that was based on the first Gaia Data Release (DR1), and also included period–luminosity (PL) relations for Cepheids and RR Lyrae stars. The method used to infer the relations from Gaia DR2 data and one of the methods used for Gaia DR1 data was based on a Bayesian model, the full description of which was deferred to a subsequent publication. This paper presents the Bayesian method for the inference of the parameters of PL(Z) relations used in those studies, the main feature of which is to manage the uncertainties on observables in a rigorous and well-founded way. The method encodes the probability relationships between the variables of the problem in a hierarchical Bayesian model and infers the posterior probability distributions of the PL(Z) relationship coefficients using Markov chain Monte Carlo simulation techniques. We evaluate the method with several semi-synthetic data sets and apply it to a sample of 200 fundamental and first-overtone RR Lyrae stars for which Gaia DR1 parallaxes and literature Ks-band mean magnitudes are available. We define and test several hyperprior probabilities to verify their adequacy and check the sensitivity of the solution with respect to the prior choice. The main conclusion of this work, based on the test with semi-synthetic Gaia DR1 parallaxes, is the absolute necessity of incorporating the existing correlations between the period, metallicity, and parallax measurements in the form of model priors in order to avoid systematically biased results, especially in the case of non-negligible uncertainties in the parallaxes. The relation coefficients obtained here have been superseded by those presented in our recent paper that incorporates the findings of this work and the more recent Gaia DR2 measurements.


Automatic estimation of large residual statics corrections

Geophysics ◽

10.1190/1.1442092 ◽

1986 ◽

Vol 51 (2) ◽

pp. 332-346 ◽

Cited By ~ 184

Author(s):

Daniel H. Rothman

Keyword(s):

Monte Carlo ◽

Statistical Mechanics ◽

Probability Distributions ◽

Synthetic Data ◽

Linear Inversion ◽

Traveltime Inversion ◽

Automatic Estimation ◽

Crucial Component ◽

Practical Algorithm ◽

Overthrust Belt

Conventional approaches to residual statics estimation obtain solutions by performing linear inversion of observed traveltime deviations. A crucial component of these procedures is picking time delays; gross errors in these picks are known as “cycle skips” or “leg jumps” and are the bane of linear traveltime inversion schemes. This paper augments Rothman (1985), which demonstrated that the estimation of large statics in noise‐contaminated data is posed better as a nonlinear, rather than as a linear, inverse problem. Cycle skips then appear as local (secondary) minima of the resulting nonlinear optimization problem. In the earlier paper, a Monte Carlo technique from statistical mechanics was adapted to perform global optimization, and the technique was applied to synthetic data. Here I present an application of a similar Monte Carlo method to field data from the Wyoming Overthrust belt. Key changes, however, have led to a more efficient and practical algorithm. The new technique performs explicit crosscorrelation of traces. Instead of picking the peaks of these crosscorrelation functions, the method transforms the crosscorrelation functions to probability distributions and then draws random numbers from the distributions. Estimates of statics are now iteratively updated by this procedure until convergence to the optimal stack is achieved. Here I also derive several theoretical properties of the algorithm. The method is expressed as a Markov chain, in which the equilibrium (steady‐state) distribution is the Gibbs distribution of statistical mechanics.
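
The step of turning a crosscorrelation function into a probability distribution and sampling from it can be sketched as follows. This is an illustration only, not the paper's algorithm: the exponential (Gibbs-style) weighting with a `temperature` parameter is an assumption made here, and the function names are hypothetical.

```python
import math
import random


def xcorr_to_probs(xcorr, temperature):
    """Map crosscorrelation values to probabilities.

    Assumed transform: p_i proportional to exp(xcorr_i / temperature).
    The maximum is subtracted first for numerical stability.
    """
    peak = max(xcorr)
    weights = [math.exp((c - peak) / temperature) for c in xcorr]
    total = sum(weights)
    return [w / total for w in weights]


def draw_shift(shifts, xcorr, temperature, rng):
    """Draw one candidate static shift at random instead of picking the peak."""
    probs = xcorr_to_probs(xcorr, temperature)
    return rng.choices(shifts, weights=probs, k=1)[0]


rng = random.Random(0)
shifts = [-4, 0, 4]          # candidate time shifts (arbitrary units)
xcorr = [0.1, 0.9, 0.3]      # made-up crosscorrelation values
print(xcorr_to_probs(xcorr, temperature=1.0))
print(draw_shift(shifts, xcorr, temperature=0.01, rng=rng))
```

At a high temperature the draw is close to uniform, so secondary peaks (cycle skips) can be visited and later abandoned; at a low temperature the draw concentrates on the largest crosscorrelation value.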


CoCo: RNA-seq read assignment correction for nested genes and multimapped reads

Bioinformatics ◽

10.1093/bioinformatics/btz433 ◽

2019 ◽

Vol 35 (23) ◽

pp. 5039-5047 ◽

Cited By ~ 6

Author(s):

Gabrielle Deschamps-Francoeur ◽

Vincent Boivin ◽

Sherif Abou Elela ◽

Michelle S Scott

Keyword(s):

Supplementary Information ◽

Rna Seq ◽

Non Coding Rna ◽

Abundance Estimates ◽

Gene Coverage ◽

Nested Genes ◽

Quantification Accuracy ◽

Whole Transcriptome Analysis ◽

Whole Transcriptome ◽

Generation Sequencing

Motivation: Next-generation sequencing techniques revolutionized the study of RNA expression by permitting whole transcriptome analysis. However, sequencing reads generated from nested and multi-copy genes are often either misassigned or discarded, which greatly reduces both quantification accuracy and gene coverage. Results: Here we present count corrector (CoCo), a read assignment pipeline that takes into account the multitude of overlapping and repetitive genes in the transcriptome of higher eukaryotes. CoCo uses a modified annotation file that highlights nested genes and proportionally distributes multimapped reads between repeated sequences. CoCo salvages over 15% of discarded aligned RNA-seq reads and significantly changes the abundance estimates for both coding and non-coding RNA as validated by PCR and bedgraph comparisons. Availability and implementation: The CoCo software is an open source package written in Python and available from http://gitlabscottgroup.med.usherbrooke.ca/scott-group/coco. Supplementary information: Supplementary data are available at Bioinformatics online.
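
Proportional distribution of multimapped reads can be sketched in a few lines. This is not CoCo's code or API: the rule used here (shares proportional to each gene's uniquely mapped count, with an even split when no gene in the group has unique reads) is an assumption for illustration, as are the function name and the input layout.

```python
def distribute_multimapped(unique_counts, multimapped_groups):
    """Split multimapped reads between the genes they map to.

    unique_counts: dict gene -> number of uniquely mapped reads.
    multimapped_groups: list of (genes, n_reads) pairs, where n_reads
    reads map equally well to every gene in the tuple `genes`.
    Assumed rule: shares are proportional to the unique counts.
    """
    counts = {gene: float(c) for gene, c in unique_counts.items()}
    for genes, n_reads in multimapped_groups:
        total_unique = sum(unique_counts.get(g, 0) for g in genes)
        for g in genes:
            if total_unique > 0:
                share = unique_counts.get(g, 0) / total_unique
            else:
                share = 1.0 / len(genes)  # fallback: split evenly
            counts[g] = counts.get(g, 0.0) + n_reads * share
    return counts


# Gene A has three times the unique reads of gene B, so it receives
# three quarters of the 8 reads that map to both.
print(distribute_multimapped({"A": 30, "B": 10}, [(("A", "B"), 8)]))
```

The total number of reads is preserved: every multimapped read is counted exactly once, spread across the genes it could originate from.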


VIRTUS: a pipeline for comprehensive virus analysis from conventional RNA-seq data

Bioinformatics ◽

10.1093/bioinformatics/btaa859 ◽

2020 ◽

Author(s):

Yoshiaki Yasumizu ◽

Atsushi Hara ◽

Shimon Sakaguchi ◽

Naganari Ohkura

Keyword(s):

Sequence Data ◽

Supplementary Information ◽

Clinical Samples ◽

Host Gene Expression ◽

Rna Seq ◽

Viral Transcript ◽

Rna Transcripts ◽

Copy Numbers ◽

Infected Cells ◽

New Treatments

Summary: The possibility that RNA transcripts from clinical samples contain plenty of virus RNAs has not been pursued actively so far. Here we developed a new tool for analyzing virus-transcribed mRNAs, not virus copy numbers, in bulk and single-cell RNA-sequencing data of human cells. Our pipeline, named VIRTUS (VIRal Transcript Usage Sensor), was able to detect 762 viruses including herpesviruses, retroviruses and even SARS-CoV-2 (COVID-19), and quantify their transcripts in the sequence data. This tool thus enabled simultaneous detection of infected cells, the composition of multiple viruses within the cell, and the endogenous host-gene expression profile of the cell. This bioinformatics method would be instrumental in addressing the possible effects of covertly infecting viruses on certain diseases and developing new treatments to target such viruses. Availability and implementation: VIRTUS is implemented using Common Workflow Language and Docker under a CC-NC license. VIRTUS is freely available at https://github.com/yyoshiaki/VIRTUS. Supplementary information: Supplementary data are available at Bioinformatics online.


Directed Culturing of Microorganisms Using Metatranscriptomics

mBio ◽

10.1128/mbio.00012-11 ◽

2011 ◽

Vol 2 (2) ◽

Cited By ~ 92

Author(s):

Lindsey Bomar ◽

Michele Maltz ◽

Sophie Colston ◽

Joerg Graf

Keyword(s):

High Throughput ◽

Culture Medium ◽

High Throughput Sequencing ◽

Bacterial Species ◽

Hydrolytic Enzymes ◽

Medicinal Leech ◽

Expression Data ◽

Rna Seq ◽

Rna Transcripts ◽

Uncultured Microorganisms

Abstract: The vast majority of bacterial species remain uncultured, and this severely limits the investigation of their physiology, metabolic capabilities, and role in the environment. High-throughput sequencing of RNA transcripts (RNA-seq) allows the investigation of the diverse physiologies from uncultured microorganisms in their natural habitat. Here, we report the use of RNA-seq for characterizing the metatranscriptome of the simple gut microbiome from the medicinal leech Hirudo verbana and for utilizing this information to design a medium for cultivating members of the microbiome. Expression data suggested that a Rikenella-like bacterium, the most abundant but uncultured symbiont, forages on sulfated- and sialated-mucin glycans that are fermented, leading to the secretion of acetate. Histological stains were consistent with the presence of sulfated and sialated mucins along the crop epithelium. The second dominant symbiont, Aeromonas veronii, grows in two different microenvironments and is predicted to utilize either acetate or carbohydrates. Based on the metatranscriptome, a medium containing mucin was designed, which enabled the cultivation of the Rikenella-like bacterium. Metatranscriptomes shed light on microbial metabolism in situ and provide critical clues for directing the culturing of uncultured microorganisms. By choosing a condition under which the desired organism is rapidly proliferating and focusing on highly expressed genes encoding hydrolytic enzymes, binding proteins, and transporters, one can identify an organism’s nutritional preferences and design a culture medium. Importance: The number of prokaryotes on the planet has been estimated to exceed 10^30 cells, and the overwhelming majority of them have evaded cultivation, making it difficult to investigate their ecological, medical, and industrial relevance. The application of transcriptomics based on high-throughput sequencing of RNA transcripts (RNA-seq) to microorganisms in their natural environment can provide investigators with insight into their physiologies under optimal growth conditions. We utilized RNA-seq to learn more about the uncultured and cultured symbionts that comprise the relatively simple digestive-tract microbiome of the medicinal leech. The expression data revealed highly expressed hydrolytic enzymes and transporters that provided critical clues for the design of a culture medium enabling the isolation of the previously uncultured Rikenella-like symbiont. This directed culturing method will greatly aid efforts aimed at understanding uncultured microorganisms, including beneficial symbionts, pathogens, and ecologically relevant microorganisms, by facilitating genome sequencing, physiological characterization, and genetic manipulation of the previously uncultured microbes.


Dynamic Origin-Destination Matrix Estimation from Traffic Counts and Automated Vehicle Identification Data

Transportation Research Record Journal of the Transportation Research Board ◽

10.3141/1607-13 ◽

1997 ◽

Vol 1607 (1) ◽

pp. 87-94 ◽

Cited By ~ 43

Author(s):

Nanne J. Van Der Zijpp

Keyword(s):

Synthetic Data ◽

Inequality Constraints ◽

Bayesian Updating ◽

Problem Definition ◽

Superior Performance ◽

License Plate ◽

Traffic Counts ◽

Matrix Estimation ◽

Automated Vehicle ◽

Vehicle Identification

The problem of estimating time-varying origin-destination matrices from time series of traffic counts is extended to allow for the use of partial vehicle trajectory observations. These may be obtained by using automated vehicle identification (AVI), for example, automated license plate recognition, but they may also originate from floating car data. The central problem definition allows for the use of data from induction loops and AVI equipment at arbitrary (but fixed) locations and allows for the presence of random error in traffic counts and misrecognition at the AVI stations. Although the described methods may be extended to more complex networks, the application addressed involves a single highway corridor in which no route choice alternatives exist. Analysis of the problem leads to an expression for the mutual dependencies between link volume observations and AVI data and the formulation of an estimation problem with inequality constraints. A number of traditional estimation procedures such as discounted constrained least squares (DCLS) and the Kalman filter are described, and a new procedure referred to as Bayesian updating is proposed. The advantage of this new procedure is that it deals with the inequality constraints in an appropriate statistical manner. Experiments with a large number of synthetic data sets indicate in all cases a reduction of the error of estimation due to usage of trajectory counts and, compared with the traditional DCLS and Kalman filtering methods, a superior performance of the Bayesian updating procedure.
