Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences

2016 ◽  
Author(s):  
Matthias Siebert ◽  
Johannes Söding

Abstract: Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k-1 act as priors for those of order k. This Bayesian Markov model (BMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BMMs achieve significantly (p<0.063) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26%-101%. BMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BMMs. The Bayesian Markov Model motif discovery software BaMM!motif is available under GPL at http://github.com/soedinglab/BaMMmotif.
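The idea that order k-1 conditional probabilities act as priors for order k can be illustrated with a small sketch. The abstract does not give the formula, so the code below assumes one common form: the lower-order estimate enters as pseudocounts weighted by a strength `alpha`. The function names, the single shared `alpha`, and the position-independent counts are simplifications made for illustration, not the BaMM!motif implementation.

```python
from collections import Counter

ALPHABET = "ACGT"

def count_kmers(seqs, max_order):
    """Count every k-mer of length 1 .. max_order + 1 in the sequences."""
    counts = Counter()
    for s in seqs:
        for k in range(max_order + 1):
            for i in range(len(s) - k):
                counts[s[i:i + k + 1]] += 1
    return counts

def bmm_prob(counts, context, x, alpha=2.0):
    """Estimate P(x | context), using the next-lower order as the prior.

    Assumed form (not taken from the abstract):
        p_k(x | c) = (n(c + x) + alpha * p_{k-1}(x | c[1:])) / (n(c) + alpha)
    Order 0 falls back to a uniform prior over the alphabet.
    """
    if context == "":
        total = sum(counts[b] for b in ALPHABET)
        return (counts[x] + alpha / len(ALPHABET)) / (total + alpha)
    # Prior: the same nucleotide conditioned on a context one base shorter.
    lower = bmm_prob(counts, context[1:], x, alpha)
    total = sum(counts[context + b] for b in ALPHABET)
    return (counts[context + x] + alpha * lower) / (total + alpha)
```

With few observations of a context the estimate stays close to the lower-order probability, and with many observations the counts dominate, which is one way to read "automatically adapts model complexity to the amount of available data".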

2020 ◽  
Vol 2020 ◽  
pp. 1-13
Author(s):  
Lulu Deng ◽  
Long Li ◽  
Cheng Zou ◽  
Chengchi Fang ◽  
Changchun Li

A growing body of evidence shows that alternative polyadenylation (APA) events with different polyadenylation sites (PASs) contribute to posttranscriptional regulation. However, little is known about the detailed molecular features of PASs and their role in porcine fast and slow skeletal muscles through microRNAs (miRNAs) and RNA binding proteins (RBPs). In this study, we combined single-molecule real-time sequencing and Illumina RNA-seq datasets to comprehensively analyze polyadenylation in pigs. We identified a total of 10,334 PASs, of which 8734 were characterized by reference genome annotation. 32.86% of PAS-associated genes were determined to have more than one PAS. Further analysis demonstrated that tissue-specific PASs between fast and slow muscles were enriched in skeletal muscle development pathways. In addition, we obtained 1407 target genes regulated by APA events through potential binding of 69 miRNAs and 28 RBPs in variable 3′ UTR regions, some of which are involved in myofiber transformation. Furthermore, the de novo motif search confirmed that the canonical motif AAUAAA is the most commonly used, and that the three types of PASs may be related to the strength of motifs. In summary, our results provide a useful annotation of PASs for the pig transcriptome and suggest that APA may play a role in fast and slow muscle development under the regulation of miRNAs and RBPs.


Author(s):  
Najla Ksouri ◽  
Jaime A. Castro-Mondragón ◽  
Francesc Montardit-Tardà ◽  
Jacques van Helden ◽  
Bruno Contreras-Moreira ◽  
...  

Abstract: Identification of functional regulatory elements encoded in plant genomes is a fundamental need to understand gene regulation. While much attention has been given to model species such as Arabidopsis thaliana, little is known about regulatory motifs in other plant genera. Here, we describe an accurate bottom-up approach using the online workbench RSAT::Plants for a versatile ab-initio motif discovery taking Prunus persica as a model. These predictions rely on the construction of a co-expression network to generate modules with similar expression trends and assess the effect of increasing upstream region length on the sensitivity of motif discovery. Applying two discovery algorithms, 18 out of 45 modules were found to be enriched in motifs typical of well-known transcription factor families (bHLH, bZip, BZR, CAMTA, DOF, E2FE, AP2-ERF, Myb-like, NAC, TCP, WRKY) and a novel motif. Our results indicate that a small number of input sequences and a short promoter length are preferable to minimize the amount of uninformative signals in peach. The spatial distribution of TF binding sites revealed an unbalanced distribution where motifs tend to lie around the transcriptional start site region. The reliability of this approach was also benchmarked in Arabidopsis thaliana, where it recovered the expected motifs from promoters of genes containing ChIPseq peaks. Overall, this paper presents a glimpse of the peach regulatory components at genome scale and provides a general protocol that can be applied to many other species. Additionally, an RSAT Docker container was released to facilitate similar analyses on other species or to reproduce our results.

One sentence summary: Motif prediction depends on the promoter size. A proximal promoter region defined as an interval of -500 bp to +200 bp seems to be the adequate stretch to predict de novo regulatory motifs in peach.


2018 ◽  
Author(s):  
Maya Polishchuk ◽  
Inbal Paz ◽  
Zohar Yakhini ◽  
Yael Mandel-Gutfreund

2019 ◽  
Author(s):  
Mallika Parulekar ◽  
Leelavati Narlikar

Abstract: With rapid advances in experimental methods that map transcription start sites (TSSs) at a high resolution, there is a need to characterize the sequence diversity of TSS neighborhoods. Most current techniques scan for previously discovered elements, such as the TATA box, the INR motif, CpG islands, etc. to categorize promoters into different classes. Reliance on such elements hinders the discovery of novel elements. On the other hand, methods that use standard motif discovery to discover de novo promoter elements are also limited by the fact that a motif is picked up only if it is over-represented in the dataset. An element that appears only in a small set of promoters can thus be missed. We previously developed a clustering-based approach that uses no prior knowledge of elements to solve this problem [1]. That method uses Gibbs sampling to learn the model parameters, but is untenable on large datasets. Here we propose a new, fast method called MINDFUL, that uses a greedy k-means-like approach to cluster promoters aligned by TSSs into diverse classes, while also learning the optimal value of k. It is general enough to be used for any data that has categorical variables, and is not restricted to DNA.
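A k-means-like clustering of aligned categorical sequences can be sketched as follows. This is not MINDFUL: the abstract gives no algorithmic details, so the sketch assumes per-cluster position-specific frequency profiles, hard assignment by log-likelihood, a fixed k (the learning of k is omitted), and a deterministic round-robin initialisation. All function names and the `pseudo` parameter are hypothetical.

```python
import math

def fit_profiles(seqs, labels, k, alphabet="ACGT", pseudo=1.0):
    """Per-cluster, per-position symbol frequencies with pseudocounts."""
    width = len(seqs[0])
    profiles = []
    for c in range(k):
        members = [s for s, lab in zip(seqs, labels) if lab == c]
        denom = len(members) + pseudo * len(alphabet)
        profiles.append([
            {a: (sum(s[pos] == a for s in members) + pseudo) / denom
             for a in alphabet}
            for pos in range(width)
        ])
    return profiles

def log_lik(seq, profile):
    """Log-likelihood of one aligned sequence under one cluster profile."""
    return sum(math.log(profile[i][ch]) for i, ch in enumerate(seq))

def cluster(seqs, k, max_iter=50):
    """Alternate profile fitting and hard reassignment until labels settle."""
    labels = [i % k for i in range(len(seqs))]  # assumed initialisation
    for _ in range(max_iter):
        profiles = fit_profiles(seqs, labels, k)
        new_labels = [
            max(range(k), key=lambda c: log_lik(s, profiles[c]))
            for s in seqs
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Because the profiles are plain categorical distributions, the same loop applies to any fixed-width data over a finite alphabet by passing a different `alphabet` to `fit_profiles`.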


2020 ◽  
Vol 17 (171) ◽  
pp. 20200600
Author(s):  
Ibrahim Sultan ◽  
Vincent Fromion ◽  
Sophie Schbath ◽  
Pierre Nicolas

Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. The central idea of this model is to improve the probabilistic representation of the promoter DNA sequences by incorporating covariates summarizing expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). A dedicated trans-dimensional Markov chain Monte Carlo algorithm adjusts the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe exact position relative to the transcription start site, and chooses the expression covariates relevant for each motif. All parameters are estimated simultaneously, for many motifs and many expression covariates. The method is applied to a dataset of transcription start sites and expression profiles available for Listeria monocytogenes. The results validate the approach and provide a new global view of the transcription regulatory network of this important pathogen. Remarkably, a previously unreported motif is found in promoter regions of ribosomal protein genes, suggesting a role in the regulation of growth.


2019 ◽  
Author(s):  
Ibrahim Sultan ◽  
Vincent Fromion ◽  
Sophie Schbath ◽  
Pierre Nicolas

Abstract: Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model of promoter DNA sequences that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. Two main novelties are to allow overlaps between motif occurrences and to incorporate covariates summarising expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). All parameters are estimated using a dedicated trans-dimensional Markov chain Monte Carlo algorithm that adjusts, simultaneously, for many motifs and many expression covariates: the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe position with respect to the transcription start site, and the choice of relevant expression covariates. A data-set of transcription start sites and expression profiles available for Listeria monocytogenes is analysed. The results validate the approach and provide a new global view of the transcription regulatory network of this important model food-borne pathogen. A previously unreported motif that may play an important role in the regulation of growth was found in promoter regions of ribosomal protein genes.


2020 ◽  
Author(s):  
Timothy L. Bailey

Abstract: Sequence motif discovery algorithms can identify novel sequence patterns that perform biological functions in DNA, RNA and protein sequences, for example the binding site motifs of DNA- and RNA-binding proteins. The STREME algorithm presented here advances the state-of-the-art in ab initio motif discovery in terms of both accuracy and versatility. Using in vivo DNA (ChIP-seq) and RNA (CLIP-seq) data, and validating motifs with reference motifs derived from in vitro data, we show that STREME is more accurate, sensitive, thorough and rapid than several widely used algorithms (DREME, HOMER, MEME, Peak-motifs and Weeder). STREME’s capabilities include the ability to find motifs in datasets with hundreds of thousands of sequences, to find both short and long motifs (from 3 to 30 positions), to perform differential motif discovery in pairs of sequence datasets, and to find motifs in sequences over virtually any alphabet (DNA, RNA, protein and user-defined alphabets). Unlike most motif discovery algorithms, STREME accurately estimates and reports the statistical significance of each motif that it discovers. STREME is easy to use via its web server at http://meme-suite.org, and is fully integrated with the widely-used MEME Suite of sequence analysis tools, which can be freely downloaded at the same web site for non-commercial use.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Yogeeshwar Ajjugal ◽  
Narendar Kolimi ◽  
Thenmalarchelvi Rathinavelan

Abstract: CGG tandem repeat expansion in the 5′-untranslated region of the fragile X mental retardation-1 (FMR1) gene leads to unusual nucleic acid conformations, hence causing genetic instabilities. We show that the number of G…G (in CGG repeat) or C…C (in CCG repeat) mismatches (other than A…T, T…A, C…G and G…C canonical base pairs) dictates the secondary structural choice of the sense and antisense strands of the FMR1 gene and their corresponding transcripts in fragile X-associated tremor/ataxia syndrome (FXTAS). The circular dichroism (CD) spectra and electrophoretic mobility shift assay (EMSA) reveal that CGG DNA (sense strand of the FMR1 gene) and its transcript favor a quadruplex structure. CD, EMSA and molecular dynamics (MD) simulations also show that more than four C…C mismatches cannot be accommodated in the RNA duplex consisting of the CCG repeat (antisense transcript); instead, it favors an i-motif conformational intermediate. Such a preference for unusual secondary structures provides a convincing justification for the RNA foci formation due to the sequestration of RNA-binding proteins to the bidirectional transcripts and the repeat-associated non-AUG translation that are observed in FXTAS. The results presented here also suggest that small molecule modulators that can destabilize FMR1 CGG DNA and RNA quadruplex structures could be promising candidates for treating FXTAS.


2021 ◽  
pp. 1-6
Author(s):  
Miriam C. Aziz ◽  
Patricia N. Schneider ◽  
Gemma L. Carvill

Developmental and epileptic encephalopathies (DEEs) describe a subset of neurodevelopmental disorders categorized by refractory epilepsy that is often associated with intellectual disability and autism spectrum disorder. The majority of DEEs are now known to have a genetic basis with de novo coding variants accounting for the majority of cases. More recently, a small number of individuals have been identified with intronic SCN1A variants that result in alternative splicing events that lead to ectopic inclusion of poison exons (PEs). PEs are short highly conserved exons that contain a premature truncation codon, and when spliced into the transcript, lead to premature truncation and subsequent degradation by nonsense-mediated decay. The reason for the inclusion/exclusion of these PEs is not entirely clear, but research suggests an autoregulatory role in gene expression and protein abundance. This is seen in proteins such as RNA-binding proteins and serine/arginine-rich proteins. Recent studies have focused on targeting these PEs as a method for therapeutic intervention. Targeting PEs using antisense oligonucleotides (ASOs) has been shown to be effective in modulating alternative splicing events by decreasing the amount of transcripts harboring PEs, thus increasing the abundance of full-length transcripts and thereby the amount of protein in haploinsufficient genes implicated in DEE. In the age of personalized medicine, cellular and animal models of the genetic epilepsies have become essential in developing and testing novel precision therapeutics, including PE-targeting ASOs in a subset of DEEs.

