MINDFUL: A Method to Identify Novel and Diverse Signals with Fast, Unsupervised Learning

Mapping Intimacies ◽

10.1101/805820 ◽

2019 ◽

Author(s):

Mallika Parulekar ◽

Leelavati Narlikar

Keyword(s):

Motif Discovery ◽

De Novo ◽

Cpg Islands ◽

Categorical Variables ◽

Model Parameters ◽

Fast Method ◽

Transcription Start ◽

Transcription Start Sites ◽

Optimal Value ◽

Small Set

Abstract With rapid advances in experimental methods that map transcription start sites (TSSs) at a high resolution, there is a need to characterize the sequence diversity of TSS neighborhoods. Most current techniques scan for previously discovered elements, such as the TATA box, the INR motif, CpG islands, etc. to categorize promoters into different classes. Reliance on such elements hinders the discovery of novel elements. On the other hand, methods that use standard motif discovery to discover de novo promoter elements are also limited by the fact that a motif is picked up only if it is over-represented in the dataset. An element that appears only in a small set of promoters can thus be missed. We previously developed a clustering-based approach that uses no prior knowledge of elements to solve this problem [1]. That method uses Gibbs sampling to learn the model parameters, but is untenable on large datasets. Here we propose a new, fast method called MINDFUL that uses a greedy k-means-like approach to cluster promoters aligned by TSSs into diverse classes, while also learning the optimal value of k. It is general enough to be used for any data that has categorical variables, and is not restricted to DNA.
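
As an illustration of what a k-means-like procedure on categorical, TSS-aligned sequences can look like, the sketch below alternates between assigning each sequence to the cluster whose per-position nucleotide profile gives it the highest log-likelihood and refitting the profiles. It is not the MINDFUL algorithm: the names `fit_profile`, `farthest_first_seeds` and `cluster`, the pseudocount, and the farthest-first initialization are all assumptions made for the example, and k is fixed here rather than learned.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # any categorical alphabet would work the same way


def fit_profile(seqs, pseudocount=1.0):
    """Per-position categorical distribution for one cluster (assumed model)."""
    profile = []
    for pos in range(len(seqs[0])):
        counts = Counter(s[pos] for s in seqs)
        total = len(seqs) + pseudocount * len(ALPHABET)
        profile.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return profile


def log_likelihood(seq, profile):
    """Log-probability of an aligned sequence under a cluster profile."""
    return sum(math.log(profile[i][ch]) for i, ch in enumerate(seq))


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def farthest_first_seeds(seqs, k):
    """Deterministic initialization: repeatedly pick the sequence farthest from the seeds."""
    seeds = [seqs[0]]
    while len(seeds) < k:
        seeds.append(max(seqs, key=lambda s: min(hamming(s, t) for t in seeds)))
    return seeds


def cluster(seqs, k, n_iter=50):
    """Hard-assignment clustering of equal-length sequences into k classes."""
    profiles = [fit_profile([s]) for s in farthest_first_seeds(seqs, k)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            max(range(k), key=lambda c: log_likelihood(s, profiles[c])) for s in seqs
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [s for s, lab in zip(seqs, labels) if lab == c]
            if members:  # keep the previous profile if a cluster empties
                profiles[c] = fit_profile(members)
    return labels
```

Calling `cluster(seqs, 2)` on a list of equal-length strings returns one integer label per sequence. Choosing k automatically, as the abstract describes, would need an additional model-selection step that this sketch leaves out.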

Statistical modelling of bacterial promoter sequences for regulatory motif discovery with the help of transcriptome data: application to Listeria monocytogenes

Journal of The Royal Society Interface ◽

10.1098/rsif.2020.0600 ◽

2020 ◽

Vol 17 (171) ◽

pp. 20200600

Author(s):

Ibrahim Sultan ◽

Vincent Fromion ◽

Sophie Schbath ◽

Pierre Nicolas

Keyword(s):

Listeria Monocytogenes ◽

Dna Sequences ◽

Motif Discovery ◽

De Novo ◽

Expression Profiles ◽

Monte Carlo Algorithm ◽

Transcriptome Data ◽

Ribosomal Protein Genes ◽

Transcription Start ◽

Transcription Start Sites

Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. The central idea of this model is to improve the probabilistic representation of the promoter DNA sequences by incorporating covariates summarizing expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). A dedicated trans-dimensional Markov chain Monte Carlo algorithm adjusts the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe exact position relative to the transcription start site, and chooses the expression covariates relevant for each motif. All parameters are estimated simultaneously, for many motifs and many expression covariates. The method is applied to a dataset of transcription start sites and expression profiles available for Listeria monocytogenes. The results validate the approach and provide a new global view of the transcription regulatory network of this important pathogen. Remarkably, a previously unreported motif is found in promoter regions of ribosomal protein genes, suggesting a role in the regulation of growth.

Statistical modelling of bacterial promoter sequences for regulatory motif discovery with the help of transcriptome data: application to Listeria monocytogenes

10.1101/723346 ◽

2019 ◽

Author(s):

Ibrahim Sultan ◽

Vincent Fromion ◽

Sophie Schbath ◽

Pierre Nicolas

Keyword(s):

Listeria Monocytogenes ◽

Dna Sequences ◽

Motif Discovery ◽

De Novo ◽

Expression Profiles ◽

Monte Carlo Algorithm ◽

Transcriptome Data ◽

Transcription Start ◽

Data Set ◽

Transcription Start Sites

Abstract Automatic de novo identification of the main regulons of a bacterium from genome and transcriptome data remains a challenge. To address this task, we propose a statistical model of promoter DNA sequences that can use information on exact positions of the transcription start sites and condition-dependent expression profiles. Two main novelties are to allow overlaps between motif occurrences and to incorporate covariates summarising expression profiles (e.g. coordinates in projection spaces or hierarchical clustering trees). All parameters are estimated using a dedicated trans-dimensional Markov chain Monte Carlo algorithm that adjusts, simultaneously, for many motifs and many expression covariates: the width and palindromic properties of the corresponding position-weight matrices, the number of parameters to describe position with respect to the transcription start site, and the choice of relevant expression covariates. A dataset of transcription start sites and expression profiles available for Listeria monocytogenes is analysed. The results validate the approach and provide a new global view of the transcription regulatory network of this important model food-borne pathogen. A previously unreported motif that may play an important role in the regulation of growth was found in promoter regions of ribosomal protein genes.

Genome-wide analysis reveals strong correlation between CpG islands with nearby transcription start sites of genes and their tissue specificity

Gene ◽

10.1016/j.gene.2005.01.012 ◽

2005 ◽

Vol 350 (2) ◽

pp. 129-136 ◽

Cited By ~ 63

Author(s):

Riu Yamashita ◽

Yutaka Suzuki ◽

Sumio Sugano ◽

Kenta Nakai

Keyword(s):

Strong Correlation ◽

Tissue Specificity ◽

Cpg Islands ◽

Transcription Start ◽

Transcription Start Sites ◽

Genome Wide Analysis ◽

Genome Wide

Digital Restriction Enzyme Analysis of Methylation (DREAM) by Next Generation Sequencing Yields High Resolution Maps of DNA Methylation.

Blood ◽

10.1182/blood.v114.22.567.567 ◽

2009 ◽

Vol 114 (22) ◽

pp. 567-567

Author(s):

Jaroslav Jelinek ◽

Shoudan Liang ◽

Marcos R. H. Estecio ◽

Rong He ◽

Yue Lu ◽

...

Keyword(s):

Dna Methylation ◽

High Resolution ◽

Restriction Site ◽

Cpg Islands ◽

Normal Blood ◽

Enzyme Analysis ◽

Restriction Enzyme Analysis ◽

Transcription Start ◽

Transcription Start Sites ◽

Generation Sequencing

Abstract Methylation of CpG dinucleotides in DNA is a key epigenetic feature important for X chromosome inactivation, silencing of retrotransposons and genomic imprinting. DNA methylation undergoes complex changes in leukemia, most notably methylation of CpG islands at promoters and associated gene silencing. The direct comparison of epigenomes in normal and neoplastic blood cells will likely increase our understanding of the complex pathology of leukemia. We have developed a digital restriction enzyme analysis of methylation (DREAM) for quantitative mapping of DNA methylation with high resolution on the genome-wide scale. To perform the analysis, genomic DNA is sequentially digested with a pair of enzymes recognizing the same restriction site (CCCGGG) containing a CpG dinucleotide. The first enzyme, SmaI, cuts only at unmethylated CpG and leaves blunt ends. The second enzyme, XmaI, is not blocked by methylation and leaves a short 5' overhang. The enzymes thus create methylation-specific signatures at ends of digested DNA fragments. These are deciphered by next generation sequencing. Methylation levels for each sequenced restriction site are calculated based on the numbers of DNA molecules with the methylated or unmethylated signatures. Using the DREAM method and sequencing on the Illumina Gene Analyzer II platform, we analyzed DNA methylation in a normal adult blood sample. We acquired 32.5 million sequence tags; of these, 16.6 million were mapped to SmaI/XmaI sites unique in the human genome. With a threshold of minimum 5-fold coverage, we obtained quantitative information on the DNA methylation level of 85,171 CpG sites (23% of all genomic SmaI/XmaI sites) in 21,240 genes. The accuracy of DREAM methylation data was validated by a strong correlation with the bisulfite pyrosequencing analysis of 49 genes (R=0.83) and of spiked-in plasmid DNA. 
In normal blood, methylation was strikingly bimodal, with 39% of sites showing methylation levels below 5% and 28% of sites being hypermethylated at levels >95%. Methylation was largely absent within CpG islands (CGI) and more prevalent outside (non-CGI). Close to transcription start sites (within 500 bp), methylation >75% was found only in 0.65% of CGIs compared to 14% of non-CGIs (P<0.001). The methylated CGI promoters were significantly enriched for genes expressed in spermatogenesis and likely correspond to a class of potential cancer-testis antigens previously identified. Away from transcription start sites (>2 kb), methylation >75% was found in 24% of CGIs compared to 72% of non-CGIs (P<0.001). Transcription end regions were methylated in 20% of CGIs compared to 68% of non-CGIs (P<0.001). Also, we observed that 1.4% of CGIs had evidence of half methylation (35-65%), representing potentially imprinted genes. Indeed, this class includes known imprinted regions at chromosomes 8q24.3 and 11p15. Finally, we compared non-CGI promoters showing significant methylation to those free of methylation. Unmethylated promoters were more likely to be expressed in normal blood, and to encode genes involved in metabolic processes and their regulation. In conclusion, high resolution quantitative methylation analysis is feasible using the DREAM method, and reveals important classes of genes based on methylation in normal blood. Disclosures: No relevant conflicts of interest to declare.
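
The per-site calculation described in the abstract (methylation level from counts of methylated and unmethylated signatures, with a minimum 5-fold coverage threshold) can be sketched as follows. The plain ratio, the function names `methylation_levels` and `bimodality_summary`, and the input layout are assumptions for illustration; the published method may apply corrections that are not shown here.

```python
def methylation_levels(counts, min_coverage=5):
    """Map each site to methylated / (methylated + unmethylated) reads.

    counts: dict of site id -> (methylated_reads, unmethylated_reads).
    Sites with fewer than min_coverage total reads are skipped.
    """
    levels = {}
    for site, (meth, unmeth) in counts.items():
        total = meth + unmeth
        if total < min_coverage:
            continue  # not enough reads for a quantitative estimate
        levels[site] = meth / total
    return levels


def bimodality_summary(levels, low=0.05, high=0.95):
    """Fractions of sites below `low` and above `high` methylation."""
    n = len(levels)
    if n == 0:
        return 0.0, 0.0
    below = sum(1 for v in levels.values() if v < low) / n
    above = sum(1 for v in levels.values() if v > high) / n
    return below, above
```

For example, a site with 9 methylated and 1 unmethylated reads gets a level of 0.9, while a site with only 3 reads in total is dropped by the default coverage threshold.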

Human T-Cell Leukemia Virus Type 1 Integration Target Sites in the Human Genome: Comparison with Those of Other Retroviruses

Journal of Virology ◽

10.1128/jvi.02752-06 ◽

2007 ◽

Vol 81 (12) ◽

pp. 6731-6741 ◽

Cited By ~ 103

Author(s):

David Derse ◽

Bruce Crise ◽

Yuan Li ◽

Gerald Princler ◽

Nicole Lum ◽

...

Keyword(s):

Site Selection ◽

Integration Site ◽

Leukemia Virus ◽

Cpg Islands ◽

Transcription Start ◽

Cell Leukemia Virus Type ◽

Retroviral Integration ◽

Transcription Start Sites ◽

Human T Cell ◽

T Cell Leukemia Virus

ABSTRACT Retroviral integration into the host genome is not entirely random, and integration site preferences vary among different retroviruses. Human immunodeficiency virus (HIV) prefers to integrate within active genes, whereas murine leukemia virus (MLV) prefers to integrate near transcription start sites and CpG islands. On the other hand, integration of avian sarcoma-leukosis virus (ASLV) shows little preference either for genes, transcription start sites, or CpG islands. While host cellular factors play important roles in target site selection, the viral integrase is probably the major viral determinant. It is reasonable to hypothesize that retroviruses with similar integrases have similar preferences for target site selection. Although integration profiles are well defined for members of the lentivirus, spumaretrovirus, alpharetrovirus, and gammaretrovirus genera, no members of the deltaretroviruses, for example, human T-cell leukemia virus type 1 (HTLV-1), have been evaluated. We have mapped 541 HTLV-1 integration sites in human HeLa cells and show that HTLV-1, like ASLV, does not specifically target transcription units and transcription start sites. Comparing the integration sites of HTLV-1 with those of ASLV, HIV, simian immunodeficiency virus, MLV, and foamy virus, we show that global and local integration site preferences correlate with the sequence/structure of virus-encoded integrases, supporting the idea that integrase is the major determinant of retroviral integration site selection. Our results suggest that the global integration profiles of other retroviruses could be predicted from phylogenetic comparisons of the integrase proteins. Our results show that retroviruses that engender different insertional mutagenesis risks can have similar integration profiles.

Optimal and regulated transcription facilitates formation of damage-induced cohesion

10.1101/2020.12.20.423707 ◽

2020 ◽

Author(s):

Pei-Shang Wu ◽

Donald P. Cameron ◽

Jan Grosser ◽

Laura Baranello ◽

Lena Ström

Keyword(s):

De Novo ◽

Transcriptional Response ◽

Chromatin Accessibility ◽

Regulation Of Transcription ◽

Transcription Start ◽

Transcription Start Sites ◽

Expression Of Genes ◽

Chromatid Cohesion ◽

Active Transcription ◽

Histone Exchange

Abstract The SMC complex cohesin mediates sister chromatid cohesion established during replication, and damage-induced cohesion formed in response to DSBs post replication. The translesion synthesis polymerase Polη is required for damage-induced cohesion through a hitherto unknown mechanism. Since Polη is functionally associated with transcription, and transcription triggers de novo cohesion in S. pombe, we hypothesized that active transcription facilitates damage-induced cohesion in S. cerevisiae. Here, we found that expression of genes involved in chromatin assembly and positive transcription regulation was relatively enriched in WT compared to Polη-deficient cells (rad30Δ). The rad30Δ mutant showed a dysregulated transcriptional response and increased cohesin binding around transcription start sites. Perturbing histone exchange at promoters adversely affected damage-induced cohesion, similarly to deletion of RAD30. Conversely, altering chromatin accessibility or regulation of transcription elongation suppressed the lack of damage-induced cohesion in rad30Δ cells. These results indicate that Polη promotes damage-induced cohesion through its role in transcription, and support the model that regulated transcription facilitates formation of damage-induced cohesion.

Epigenetic Drug Treatment Globally Induces Cryptic Transcription Start Sites Encoded in Long Terminal Repeats

Blood ◽

10.1182/blood.v128.22.3931.3931 ◽

2016 ◽

Vol 128 (22) ◽

pp. 3931-3931

Author(s):

Michael Daskalakis ◽

David Brocks ◽

Christopher Schmidt ◽

Daofeng Li ◽

Jing Li ◽

...

Keyword(s):

Drug Treatment ◽

De Novo ◽

Protein Isoforms ◽

Epigenetic Therapy ◽

Transcription Start ◽

Altered Expression ◽

Transcription Start Sites ◽

Fusion Transcripts ◽

Dnmt Inhibitors ◽

Epigenetic Drug

Abstract Epigenetic drugs are currently used for the treatment of several hematologic malignancies, but their mechanism of action remains poorly understood. By using a previously described reporter cell line for epigenetic reactivation of the DAPK1 locus, we have shown that epigenetic treatment causes transcription from uncharacterized intronic transcription start sites (TSSs), thereby generating DAPK1 mRNA with novel first exons. Based on these findings, we analyzed whether inhibition of DNA-Methyltransferases (DNMTs), Histone deacetylases (HDACs), or both resulted in the genome-wide induction of non-canonical TSSs. While epigenetic treatment altered expression of known promoter sites, we observed that both HDAC- and DNMT-inhibitors predominantly induced de novo transcription from cryptic promoters encoded in long-terminal repeat (LTR) retrotransposons. These LTR-associated 'treatment induced, not-annotated TSS' (TINATs) are currently not annotated and normally silenced in almost all cell types with the exception of testicular and thymic tissue. These TINATs arose most commonly from LTR12 elements, particularly LTR12C (which apparently provides 50% of all TINATs). TINAT activation after DNMT-inhibitors (DNMTi) coincided with DNA hypomethylation and gain in H3K4me3, H3K9ac, and H3K27ac histone marks. In contrast, with HDAC-inhibitors (HDACi), only canonical TSSs were induced in association with histone acetylation, whereas TINATs were induced via an as yet unknown mechanism. Nevertheless, both inhibitors convergently induced unidirectional transcription from identical TINAT sites. Moreover, we found a consensus GATA2 binding motif which strongly distinguished LTR12Cs with TINATs from LTR12Cs without TINATs, supporting the idea that GATA2 is likely the upstream transcription factor responsible for TINAT activation. 
TINATs originating from non-canonical TSSs located within introns of protein-coding genes frequently spliced into downstream exons, thereby creating LTR/non-LTR fusion transcripts that harbor novel sequence in place of canonical exon sequence at their 5' end. The resulting transcripts encode truncated or chimeric open reading frames which are translated into currently uncharacterized protein isoforms with predicted abnormal functions or immunogenic potential, the latter based on their foreign sequence and capability of being presented on MHC-class I molecules. In summary, we could show that DNMTi and/or HDACi do not predominantly alter the expression of canonical genes, but induce de novo transcription of LTRs, especially of the LTR12 family, resulting in numerous fusion transcripts that encode novel protein isoforms which might have the potential to influence cell proliferation or might be an elegant explanation for the priming effect of epigenetic therapy. Ongoing experiments are investigating the functional mechanisms of TINAT reactivation upon epigenetic drug treatment, and future proteomic approaches combined with T-cell cytotoxicity assays will further shed light on the interaction between epigenetic and immune therapy and the role of ERV-derived antigen presentation. Disclosures Lübbert: Janssen-Cilag: Other: Travel Funding, Research Funding; Ratiopharm: Other: Study drug valproic acid; Celgene: Other: Travel Funding.

Bayesian Markov models consistently outperform PWMs at predicting motifs in nucleotide sequences

10.1101/047647 ◽

2016 ◽

Author(s):

Matthias Siebert ◽

Johannes Söding

Keyword(s):

Motif Discovery ◽

Rna Binding ◽

Markov Models ◽

De Novo ◽

Model Complexity ◽

Transcription Start Sites ◽

Regulatory Motifs ◽

Dna And Rna ◽

The Standard Model ◽

Polyadenylation Sites

Abstract Position weight matrices (PWMs) are the standard model for DNA and RNA regulatory motifs. In PWMs nucleotide probabilities are independent of nucleotides at other positions. Models that account for dependencies need many parameters and are prone to overfitting. We have developed a Bayesian approach for motif discovery using Markov models in which conditional probabilities of order k-1 act as priors for those of order k. This Bayesian Markov model (BMM) training automatically adapts model complexity to the amount of available data. We also derive an EM algorithm for de-novo discovery of enriched motifs. For transcription factor binding, BMMs achieve significantly (p<0.063) higher cross-validated partial AUC than PWMs in 97% of 446 ChIP-seq ENCODE datasets and improve performance by 36% on average. BMMs also learn complex multipartite motifs, improving predictions of transcription start sites, polyadenylation sites, bacterial pause sites, and RNA binding sites by 26%-101%. BMMs never performed worse than PWMs. These robust improvements argue in favour of generally replacing PWMs by BMMs. The Bayesian Markov Model motif discovery software BaMM!motif is available under GPL at http://github.com/soedinglab/BaMMmotif.
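
The idea that order k-1 conditional probabilities act as priors for order k can be illustrated with a pseudocount-style estimate in which each higher-order probability is pulled toward its lower-order counterpart. This is a minimal sketch of that general idea, not the BaMM!motif implementation: the functions `count_kmers` and `cond_prob`, the homogeneous (position-independent) counts, and the pseudocount strengths in `alphas` are all assumptions.

```python
from collections import Counter

ALPHABET = "ACGT"


def count_kmers(seqs, max_order):
    """Count all substrings of length 1 .. max_order + 1 in the training sequences."""
    counts = Counter()
    for s in seqs:
        for k in range(1, max_order + 2):
            for i in range(len(s) - k + 1):
                counts[s[i:i + k]] += 1
    return counts


def cond_prob(x, context, counts, alphas):
    """Estimate P(x | context), using the next-lower order as the prior.

    alphas[k] is the (hypothetical) prior strength at order k = len(context).
    With no observations of the context, the estimate falls back to the
    lower-order probability; with many, the counts dominate.
    """
    k = len(context)
    if k == 0:
        prior = 1.0 / len(ALPHABET)  # uniform prior for order 0
    else:
        prior = cond_prob(x, context[1:], counts, alphas)
    n_ctx = sum(counts[context + a] for a in ALPHABET)
    return (counts[context + x] + alphas[k] * prior) / (n_ctx + alphas[k])
```

Because the lower-order estimate is itself normalized, the probabilities for a given context sum to one at every order, and contexts never seen in training simply inherit the shorter-context estimate. That fallback is one way model complexity can adapt to the amount of available data.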

Unique Integration Profiles of Gammaretrovirus, Lentivirus, and Foamy Virus Transduced Dog Long-Term Repopulating Cells.

Blood ◽

10.1182/blood.v108.11.3252.3252 ◽

2006 ◽

Vol 108 (11) ◽

pp. 3252-3252

Author(s):

Kirsten A. Keyser ◽

Brian C. Beard ◽

Grant Trobridge ◽

Laura J. Peterson ◽

Dan Miller ◽

...

Keyword(s):

Gene Therapy ◽

Foamy Virus ◽

Genetic Diseases ◽

Cpg Islands ◽

Vector System ◽

Transcription Start ◽

Transcription Start Sites ◽

Integration Sites ◽

Clinical Gene Therapy ◽

Retrovirus Integration

Abstract Retroviral vectors have been the most effective gene delivery vehicles for hematopoietic stem cell (HSC) gene therapy and patients are now being successfully treated for genetic diseases. Following the success of gene therapy for inherited disorders, such as X-linked severe combined immunodeficiency (SCID), three of the patients developed overt leukemia, and a spontaneous expansion of gene-marked cells has been described in two patients treated in an X-linked chronic granulomatous clinical trial. In spite of these outcomes clinical gene therapy trials continue to show efficacy for patients with genetic diseases. Valuable data has been generated regarding retrovirus integration profiles and factors that can lead to clonal dominance (i.e. enhancer activation and multiple integration sites) in murine models. Large animal studies extend these analyses to clinically predictive models, that should more accurately indicate what would occur in human clinical trials. We analyzed the retrovirus integration profile in dogs transplanted with cells gene-modified with either gammaretrovirus (n=5 dogs), HIV-derived lentivirus (n=6 dogs), or foamy virus (n=2 dogs) vectors using a sensitive LAM-PCR method modified to facilitate high-throughput analysis. The samples used for LAM-PCR ranged from 95–600 days after transplantation. We analyzed over 14,500 sequence reads and were able to unambiguously align a total of 555 unique integration sites to the dog genome with 82 unique gammaretroviral integrants, 210 unique lentiviral integrants, and 263 unique foamy viral integrants. We defined the integration patterns relative to transcription units, a subset of previously defined proto-oncogenes and CpG islands. The most prevalent clustering within 50 kilobases of RefSeq transcription start sites and into CpG islands was seen in gammaretroviral integrants (73.2% and 3.7%, respectively) and to a lesser extent in foamy viral integrants (53.6% and 3.4%, respectively). 
Regarding integration into RefSeq genes, lentiviral integrants showed the most significant increase (58.6%) relative to random integration analysis (37.3%). Gammaretroviral integrants were also significantly increased both in proto-oncogenes (7.3%) and within 50 kilobases of the transcription start site of proto-oncogenes (7.3%). In addition, even though fewer sites have been analyzed and localized for gammaretrovirus vectors, compared to lentiviral and foamy viral sites, we found two distinct gammaretroviral integrants in the MDS/Evi1 locus with no lentiviral or foamy viral integrants localized in this region. While no single factor distinguishes one of the retroviruses as the ‘safest’ vector system, the low frequency of integration of foamy virus into RefSeq genes and the decreased density of lentiviral integrants around transcription start sites suggest that these vectors may be preferred relative to gammaretroviral vectors. Additionally, high titer self-inactivating (SIN) vector design has been achieved using both lentivirus and foamy virus which should decrease the risk of enhancer activation relative to intact LTR constructs. These aspects and others, including weaker/regulated internal promoters, and chromatin insulators are important factors that should be considered when designing vectors for clinical gene therapy applications.

Diversification of CpG-Island Promoters Revealed by Comparative Analysis Between Human and Rhesus Monkey Genomes

Mammalian Genome ◽

10.1007/s00335-020-09844-2 ◽

2020 ◽

Vol 31 (7-8) ◽

pp. 240-251

Author(s):

Saki Aoto ◽

Mayu Fushimi ◽

Kei Yura ◽

Kohji Okamura

Keyword(s):

Rhesus Monkey ◽

Cpg Island ◽

Cpg Islands ◽

Housekeeping Genes ◽

Transcription Start ◽

Cpg Dinucleotides ◽

Transcription Start Sites ◽

Cpg Sites ◽

Mammalian Genomes ◽

Ncbi Refseq

Abstract While CpG dinucleotides are significantly reduced compared to other dinucleotides in mammalian genomes, they can congregate and form CpG islands, which localize around the 5ʹ regions of genes, where they function as promoters. CpG-island promoters are generally unmethylated and are often found in housekeeping genes. However, their nucleotide sequences and existence per se are not conserved between humans and mice, which may be due to evolutionary gain and loss of the regulatory regions. In this study, human and rhesus monkey genomes, with moderately conserved sequences, were compared at base resolution. Using transcription start site data, we first validated our methods’ ability to identify orthologous promoters and indicated a limitation of using the 5ʹ end of curated gene models, such as NCBI RefSeq, as their transcription start sites. We found that, in addition to deamination mutations, insertions and deletions of bases, repeats, and long fragments contributed to the mutations of CpG dinucleotides. We also observed that the G + C contents tended to change in CpG-poor environments, while CpG content was altered in G + C-rich environments. While loss of CpG islands can be caused by gradual decreases in CpG sites, gain of these islands appears to require two distinct nucleotide-altering steps. Taken together, our findings provide novel insights into the process of acquisition and diversification of CpG-island promoters in vertebrates.
