Deep learning reveals evolutionary conservation and divergence of sequence properties underlying gene regulatory enhancers across mammals

2017 ◽  
Author(s):  
Ling Chen ◽  
Alexandra E. Fish ◽  
John A. Capra

Abstract In mammals, genomic regions with enhancer activity turn over rapidly; in contrast, gene expression patterns and transcription factor binding preferences are largely conserved. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. We tested this hypothesis by quantifying the conservation of sequence patterns underlying histone-mark defined enhancers across six diverse mammals in two machine learning frameworks. We first trained support vector machine (SVM) classifiers based on the frequency spectrum of short DNA sequence patterns. These classifiers accurately identified many adult liver, developing limb, and developing brain enhancers in each species. Then, we applied these classifiers across species and found that classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. This indicates that the short sequence patterns predictive of enhancers are largely conserved. We also observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. To test the conservation of more complex sequence patterns, we trained convolutional neural networks (CNNs) on enhancer sequences in each species. The CNNs demonstrated better performance overall, but worse cross-species generalization than SVMs, suggesting the importance of combinatorial interactions between motifs, but less conservation of these more complex sequence patterns. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Furthermore, short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution, with evolutionary change in more complex sequence patterns.
Author summary Alterations in gene expression levels are a driving force of both speciation and complex disease; therefore, it is of great importance to understand the mechanisms underlying the evolution and function of gene regulatory DNA sequences. Recent studies have revealed that while gene expression patterns and transcription factor binding preferences are broadly conserved across diverse animals, there is extensive turnover in distal gene regulatory regions, called enhancers, between closely related species. We investigate this seeming incongruence by analyzing genome-wide enhancer datasets from six diverse mammalian species. We trained two machine-learning classifiers, a k-mer spectrum support vector machine (SVM) and a convolutional neural network (CNN), to distinguish enhancers from the genomic background. The k-mer spectrum SVM models the occurrences of short sequence patterns, while the CNN models both the short sequence patterns and their combinatorial patterns. Both the SVM and CNN enhancer prediction models trained in one species are able to predict enhancers in the same cellular context in other species. CNNs performed better at predicting enhancers within each species, but they generalized less well across species than the SVMs. This argues that the short sequence properties encoding regulatory activity are remarkably conserved across more than 180 million years of mammalian evolution, with more evolutionary turnover in the more complex combinations of the conserved short sequence motifs.
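The k-mer spectrum representation mentioned in this abstract can be illustrated with a minimal sketch. It only shows how a DNA sequence might be turned into a fixed-length vector of short-pattern frequencies that a classifier such as an SVM could consume. The function name `kmer_spectrum`, the choice of k, the normalization to frequencies, and the toy sequence are all assumptions for illustration; the abstract does not state the settings used in the study.

```python
from collections import Counter
from itertools import product


def kmer_spectrum(seq, k=3):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has one entry per possible k-mer over A/C/G/T, in
    lexicographic order. Windows containing other symbols (e.g. N)
    are skipped.
    """
    seq = seq.upper()
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= set(alphabet)
    )
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


# Hypothetical toy sequence, not data from the study.
features = kmer_spectrum("ACGTACGTNACG", k=2)
```

Because every sequence maps to the same 4^k-dimensional vector regardless of species, a model fitted on vectors from one genome can be applied unchanged to vectors from another, which is what makes the cross-species comparison described above possible.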

Blood ◽  
2004 ◽  
Vol 104 (11) ◽  
pp. 2897-2897
Author(s):  
Torsten Haferlach ◽  
Helmut Loeffler ◽  
Alexander Kohlmann ◽  
Martin Dugas ◽  
Wolfgang Hiddemann ◽  
...  

Abstract Balanced chromosomal rearrangements leading to fusion genes on the molecular level define distinct biological subsets in AML. The four balanced rearrangements (t(15;17), t(8;21), inv(16), and 11q23/MLL) show a close correlation to cytomorphology and gene expression patterns. We here focused on seven AML with t(8;16)(p11;p13). This translocation is rare (7/3515 cases in our cohort). It is more frequently found in therapy-related AML than in de novo AML (3/258 t-AML, and 4/3287 de novo, p=0.0003). Cytomorphologically, AML with t(8;16) is characterized by striking features: in all 7 cases the positivity for myeloperoxidase on bone marrow smears was >70% and, intriguingly, in parallel >80% of blast cells stained strongly positive for non-specific esterase (NSE) in all cases. Thus, these cases cannot be classified according to FAB categories. These data suggest that AML-t(8;16) arises from a very early stem cell with both myeloid and monoblastic potential. Furthermore, we detected erythrophagocytosis in 6/7 cases, which has been described as a specific feature in AML with t(8;16). Four patients had chromosomal aberrations in addition to t(8;16); 3 of these were t-AML, all showing aberrations of 7q. Survival was poor: 0, 1, 1, 2, 20, and 18+ (after alloBMT) months, with one patient lost to follow-up. We then analyzed gene expression patterns in 4 cases (Affymetrix U133A+B). First we compared t(8;16) AML with 46 AML FAB M1, 41 M4, 9 M5a, and 16 M5b, all with normal karyotype. Hierarchical clustering and principal component analyses (PCA) revealed that t(8;16) AML were intercalating with FAB M4 and M5b and did not cluster near M1. Thus, monocytic characteristics influence the gene expression pattern more strongly than myeloid characteristics. Next we compared the t(8;16) AML with the 4 other balanced subtypes according to the WHO classification (t(15;17): 43; t(8;21): 40; inv(16): 49; 11q23/MLL-rearrangements: 50). Using support vector machines, the overall accuracy for correct subgroup assignment was 97.3% (10-fold CV), and 96.8% (2/3 training and 1/3 test set, 100 runs). In PCA and hierarchical cluster analysis the t(8;16) cases were grouped in the vicinity of the 11q23 cases. However, in a pairwise comparison these two subgroups could be discriminated with an accuracy of 94.4% (10-fold CV). Genes with a specific expression in AML-t(8;16) were further investigated in pathway analyses (Ingenuity). 15 of the top 100 genes associated with AML-t(8;16) were involved in the CMYC-pathway, with up-regulation of BCOR, COXB5, CDK10, FLI1, HNRPA2B1, NSEP1, PDIP38, RAD50, SUPT5H, TLR2 and USP33, and down-regulation of ERG, GATA2, NCOR2 and RPS20. CEBP beta, known to play a role in myelomonocytic differentiation, was also up-regulated in t(8;16)-AML. Ten additional genes out of the 100 top differentially expressed genes were also involved in this pathway, with up-regulation of DDB2, HIST1H3D, NSAP1, PTPNS1, RAN, USP4, TRIM8, ZNF278 and down-regulation of KIT and MBD2. In conclusion, AML with t(8;16) is a specific subtype of AML with unique characteristics in morphology and gene expression patterns. It is more frequently found in t-AML, and outcome is inferior to that of other AML with balanced translocations. Due to its unique features, it is a candidate for inclusion into the WHO classification as a specific entity.
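The 10-fold cross-validation protocol behind the accuracy figures quoted above can be sketched as follows. This is only an illustration of the evaluation loop: the classifier is a nearest-class-mean stand-in, not a support vector machine, and the helper names (`k_fold_indices`, `cross_val_accuracy`, `fit_centroids`, `predict_centroid`) and the toy data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Overall accuracy: every sample is predicted exactly once, by a
    model trained on the other k-1 folds."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Stand-in classifier (nearest class mean), NOT an SVM.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(model, row):
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(model, key=lambda lab: dist(model[lab]))


# Hypothetical, well-separated toy data (two classes, two features).
X = [[0.1 * i, 0.0] for i in range(10)] + [[5.0 + 0.1 * i, 5.0] for i in range(10)]
y = ["A"] * 10 + ["B"] * 10
accuracy = cross_val_accuracy(fit_centroids, predict_centroid, X, y, k=10)
```

The repeated 2/3 training and 1/3 test evaluation mentioned in the abstract would follow the same pattern, with a fresh random split on each of the 100 runs instead of a fixed partition into folds.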


2018 ◽  
Author(s):  
Karl Kumbier ◽  
Sumanta Basu ◽  
James B. Brown ◽  
Susan Celniker ◽  
Bin Yu

Abstract Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically “black-boxes,” learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest (iRF) algorithm took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF (s-iRF), describes “subsets” of rules that frequently occur on RF decision paths. We refer to these “rule subsets” as signed interactions. Signed interactions share not only the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.
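The idea of a signed interaction, a set of features together with the side of the threshold each one falls on along a decision path, can be illustrated with a toy sketch. This is not the s-iRF procedure: it enumerates feature subsets by brute force and simply counts how often each one occurs, which is an assumption made here for readability. The path encoding, function names, feature names, and thresholds are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def signed_features(path):
    """Map a decision path to its set of signed features.

    A path is a list of (feature, threshold, goes_right) splits;
    goes_right=True means the path follows feature > threshold.
    """
    return frozenset(
        (feat, "+" if goes_right else "-") for feat, _, goes_right in path
    )


def signed_interaction_prevalence(paths, order=2):
    """Fraction of paths on which each signed feature subset of the
    given size appears."""
    counts = Counter()
    for path in paths:
        for subset in combinations(sorted(signed_features(path)), order):
            counts[subset] += 1
    return {subset: c / len(paths) for subset, c in counts.items()}


# Hypothetical decision paths with made-up feature names and thresholds.
paths = [
    [("tf_a", 0.5, True), ("tf_b", 1.2, True)],
    [("tf_a", 0.7, True), ("tf_b", 0.9, True), ("tf_c", 0.1, False)],
    [("tf_a", 0.4, False), ("tf_c", 0.3, False)],
]
prevalence = signed_interaction_prevalence(paths)
```

In this toy input the pair "tf_a high, tf_b high" appears on two of the three paths, so it would rank above pairs seen only once. Note that the sketch discards the threshold values, whereas the abstract also requires similar thresholding behavior.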


2015 ◽  
Author(s):  
Konstantin N Kozlov ◽  
Vitaly V Gursky ◽  
Ivan V Kulakovskiy ◽  
Maria G Samsonova

Background: The detailed analysis of transcriptional regulation is crucially important for understanding biological processes. The gap gene network in Drosophila attracts great interest among researchers studying mechanisms of transcriptional regulation. It implements the most upstream regulatory layer of the segmentation gene network. The knowledge of molecular mechanisms involved in gap gene regulation is far less complete than that of the genetics of the system. Mathematical modeling goes beyond insights gained by genetics and molecular approaches. It allows us to reconstruct wild-type gene expression patterns in silico, infer the underlying regulatory mechanism, and prove its sufficiency. Results: We developed a new model that provides a dynamical description of gap gene regulatory systems, using detailed DNA-based information, as well as spatial transcription factor concentration data at varying time points. We showed that this model correctly reproduces gap gene expression patterns in wild-type embryos and is able to predict gap expression patterns in Kr mutants and four reporter constructs. We used a four-fold cross-validation test and fitting to a random dataset to validate the model and prove its sufficiency in describing the data. The identifiability analysis showed that most model parameters are well identifiable. We reconstructed the gap gene network topology and studied the impact of individual transcription factor binding sites on the model output. We measured this impact by calculating the site regulatory weight as a normalized difference between the residual sum of squares error for the set of all annotated sites and for the set from which the site of interest was left out. Conclusions: The reconstructed topology of the gap gene network is in agreement with previous modeling results and data from the literature. We showed that 1) the regulatory weights of transcription factor binding sites show very weak correlation with their PWM score; 2) sites with low regulatory weight are important for the model output; 3) functionally important sites are not exclusively located in cis-regulatory elements, but are rather dispersed through the regulatory region. It is of importance that some of the sites with high functional impact in the hb, Kr and kni regulatory regions coincide with strong sites annotated and verified in DNase I footprint assays. Keywords: transcription; thermodynamics; reaction-diffusion; Drosophila
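The site regulatory weight described in this abstract can be written out as a formula. The abstract gives only the verbal definition, so the symbols, the sign convention, and the choice of normalizer below are assumptions: S denotes the set of all annotated binding sites, RSS(·) the residual sum of squares of the model fitted with a given site set, and the difference is divided by the full-set error.

```latex
w_i \;=\; \frac{\mathrm{RSS}\bigl(S \setminus \{i\}\bigr) - \mathrm{RSS}(S)}{\mathrm{RSS}(S)}
```

Under this reading, a large positive w_i means that leaving out site i noticeably worsens the fit, while a value near zero means the model output barely depends on that site.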


2021 ◽  
Author(s):  
José Aguilar-Rodríguez ◽  
Joshua L. Payne

The relationship between genotype and phenotype is central to our understanding of development, evolution, and disease. This relationship is known as the genotype-phenotype map. Gene regulatory circuits occupy a central position in this map, because they control when, where, and to what extent genes are expressed, and thus drive fundamental physiological, developmental, and behavioral processes in living organisms as different as bacteria and humans. Mutations that affect these gene expression patterns are often implicated in disease, so it is important that gene regulatory circuits are robust to mutation. Such mutations can also bring forth beneficial phenotypic variation that embodies or leads to evolutionary adaptations or innovations. Here we review recent theoretical and experimental work that sheds light on the robustness and evolvability of gene regulatory circuits.


Blood ◽  
2004 ◽  
Vol 104 (11) ◽  
pp. 471-471
Author(s):  
Torsten Haferlach ◽  
Wolfgang Kern ◽  
Alexander Kohlmann ◽  
Martin Dugas ◽  
Sylvia Merk ◽  
...  

Abstract MDS and AML are discriminated by percentages of blasts in the bone marrow (BM) according to the FAB as well as to the WHO classification. However, thresholds are arbitrary and demonstrate only limited reproducibility in interlaboratory testing. Thus, other parameters have been assessed to discriminate these entities with respect to diagnosis and prognosis. In particular, in the majority of cases common karyotype aberrations have been observed between MDS and AML which have a higher prognostic impact than blast percentages. We applied gene expression profiling (U133A+B, Affymetrix) in 70 MDS and 238 AML cases. In accordance with the WHO classification we excluded cases with balanced translocations (i.e. t(8;21), t(15;17), inv(16), or 11q23), which are classified as AML irrespective of BM blast percentage. First we aimed at identifying genes whose expression correlated with blast count (Spearman correlation). Out of the top 50 genes this analysis revealed only the FLT3 gene, which showed a higher expression in cases with high blast count, while 12 genes with a higher expression in cases with lower blast counts were identified (ANXA3, ARG1, CAMP, CD24, CEACAM1, CEACAM6, CEACAM8, CRISP3, KIAA0922, LCN2, MMP9, STOM). Most of the latter genes are expressed in mature granulocytes and are involved in differentiation and apoptosis. In a second step we performed class prediction using support vector machines (SVM) to separate MDS and AML according to blast percentages as defined in the WHO classification (<5%: RA and 5q- syndrome; 5–9%: RAEB-1; 10–19%: RAEB-2; >19% AML). Using 10-fold cross validation and support vector machines, the overall prediction accuracy was only 80%. In detail, 230/238 AML cases were correctly assigned to the AML group while 8 cases were classified as MDS RAEB-2. However, none of the RA, 5q- syndrome, and RAEB-1 cases were correctly assigned to their respective groups; all were classified as either AML or RAEB-2. Furthermore, only 16 of 38 RAEB-2 cases were correctly predicted, while the 20 remaining cases were assigned to the AML group. Thus, no clear gene expression patterns were identified which correlated with AML and MDS subtypes according to the WHO classification. Taking the common genetic background observed in MDS and AML into account, both entities were categorized in a third step according to cytogenetics and classified based on their gene expression profiles. In order to assess the impact of the common genetic background, the largest cytogenetically defined subgroups were compared to each other, i.e. AML and MDS with normal karyotype and with complex aberrant karyotype. Intriguingly, while correct classification of AML or MDS was found in 91% of cases, classification into the correct cytogenetic groups was achieved in 95%. Consequently, all cases were divided into two groups, complex aberrant karyotype (n=60) and other or no aberrations (n=248), irrespective of AML or MDS. A classification into these groups also yielded an accuracy of 93%. Our data suggest that, as revealed by gene expression profiling, the biology of MDS and AML correlates highly with cytogenetics and less with the percentage of BM blasts. These results strengthen the need for a revision of the current MDS and AML classification centered on genetic abnormalities, which may also be used for clinical decisions.
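The first analysis step in this abstract, ranking genes by the Spearman correlation between their expression and blast count, can be sketched in a few lines. The functions below compute the rank correlation for a single gene, using average ranks for ties. The function names and the toy values are hypothetical and do not come from the study.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: blast percentage and expression of one gene per case.
blast_pct = [3, 8, 15, 40, 75]
expression = [1.2, 1.9, 2.4, 5.1, 9.8]
rho = spearman(blast_pct, expression)
```

Repeating this for every gene and sorting by the coefficient would give a ranked list of the kind the abstract refers to, with positive values for genes expressed more highly at high blast counts and negative values for the opposite pattern.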


Pneumologie ◽  
2018 ◽  
Vol 72 (S 01) ◽  
pp. S8-S9
Author(s):  
M Bauer ◽  
H Kirsten ◽  
E Grunow ◽  
P Ahnert ◽  
M Kiehntopf ◽  
...  
