The Sheep and the Goats: Distinguishing transcriptional enhancers in a complex chromatin landscape

2018 ◽  
Author(s):  
Anne Sonnenschein ◽  
Ian Dworkin ◽  
David N. Arnosti

ABSTRACT Predicting regulatory function of non-coding DNA using genomic information remains a major goal in genomics, and an important step in interpreting the cis-regulatory code. Regulatory capacity can be partially inferred from transcription factor occupancy, histone modifications, motif enrichment, and evolutionary conservation. However, combinations of these features in well-studied systems such as Drosophila have limited predictive accuracy. Here we examine the current limits of computational enhancer prediction by applying machine-learning methods to an extensive set of genomic features, validating predictions with the Fly Enhancer Resource, which characterized the transcriptional activity of approximately fifteen percent of the genome. Supervised machine learning trained on a range of genomic features identifies active elements with a high degree of accuracy, but is less successful at distinguishing tissue-specific expression patterns. Consistent with previous observations of their widespread genomic interactions, many transcription factors were associated with enhancers not known to be direct functional targets. Interestingly, no single factor was necessary for enhancer identification, although binding by the "pioneer" transcription factor Zelda was the most predictive feature for enhancer activity. Using an increasing number of predictive features improved classification with diminishing returns. Thus, additional single-timepoint ChIP data may have only marginal utility for discerning true regulatory regions. On the other hand, spatially and temporally differentiated genomic features may provide more power for this type of computational enhancer identification. Inclusion of new types of information distinct from current chromatin-immunoprecipitation data may enable more precise identification of enhancers, and further insight into the features that distinguish their biological functions.

2020 ◽  
Author(s):  
Etienne Becht ◽  
Daniel Tolstrup ◽  
Charles-Antoine Dutertre ◽  
Florent Ginhoux ◽  
Evan W. Newell ◽  
...  

Abstract Modern immunologic research increasingly requires high-dimensional analyses in order to understand the complex milieu of cell types that comprise the tissue microenvironments of disease. To achieve this, we developed Infinity Flow, which combines hundreds of overlapping flow cytometry panels using machine learning to enable the simultaneous analysis of the co-expression patterns of hundreds of surface-expressed proteins across millions of individual cells. In this study, we demonstrate that this approach allows comprehensive analysis of the cellular constituency of the steady-state murine lung and identifies novel cellular heterogeneity in the lungs of melanoma metastasis-bearing mice. We show that by using supervised machine learning, Infinity Flow enhances the accuracy and depth of clustering or dimensionality reduction algorithms. Infinity Flow is a highly scalable, low-cost and accessible solution to single-cell proteomics in complex tissues.


2018 ◽  
Author(s):  
Karl Kumbier ◽  
Sumanta Basu ◽  
James B. Brown ◽  
Susan Celniker ◽  
Bin Yu

Abstract Advances in supervised learning have enabled accurate prediction in biological systems governed by complex interactions among biomolecules. However, state-of-the-art predictive algorithms are typically “black-boxes,” learning statistical interactions that are difficult to translate into testable hypotheses. The iterative Random Forest (iRF) algorithm took a step towards bridging this gap by providing a computationally tractable procedure to identify the stable, high-order feature interactions that drive the predictive accuracy of Random Forests (RF). Here we refine the interactions identified by iRF to explicitly map responses as a function of interacting features. Our method, signed iRF (s-iRF), describes “subsets” of rules that frequently occur on RF decision paths. We refer to these “rule subsets” as signed interactions. Signed interactions not only share the same set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We describe stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate our proposed approach in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of enhancer activity, s-iRF recovers one of the few experimentally validated high-order interactions and suggests novel enhancer elements where this interaction may be active. In the case of spatial gene expression patterns, s-iRF recovers all 11 reported links in the gap gene network. By refining the process of interaction recovery, our approach has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.


2020 ◽  
Author(s):  
Ruslan M. Deviatiiarov ◽  
Anna Gams ◽  
Roman Syunyaev ◽  
Tatiana V. Tatarinova ◽  
Oleg Gusev ◽  
...  

Abstract Genome regulatory elements play a critical role during cardiac development and maintenance of normal physiological homeostasis, and genome-wide association studies have identified a large number of SNPs associated with cardiovascular diseases localized in intergenic zones. We used cap analysis of gene expression (CAGE) to identify transcription start sites (TSS) with one-nucleotide resolution, which effectively maps genome regulatory elements in a representative collection of human heart tissues. Here we present a comprehensive and fully annotated CAGE atlas of human promoters and enhancers from four chambers of non-diseased human donor hearts, including both atria and ventricles. We have identified 10,528 novel regulatory elements, of which 2,750 are classified as TSS and 4,258 as novel enhancers, which were validated with ChIP-seq libraries and motif enrichment analysis. We found that heart-region-specific expression patterns are primarily based on alternative promoter and specific enhancer activity. Our study significantly increased evidence of the association of variants located in regulatory elements with heart morphology and pathologies. The precise location of cardiac disease-related SNPs within the regulatory regions and their correlation with a specific cell type offers a new understanding of genetic heart diseases.


F1000Research ◽  
2018 ◽  
Vol 7 ◽  
pp. 1933 ◽  
Author(s):  
Ruipeng Lu ◽  
Peter K. Rogan

Background: The distribution and composition of cis-regulatory modules composed of transcription factor (TF) binding site (TFBS) clusters in promoters substantially determine gene expression patterns and TF targets. TF knockdown experiments have revealed that TF binding profiles and gene expression levels are correlated. We use TFBS features within accessible promoter intervals to predict genes with similar tissue-wide expression patterns and TF targets. Methods: Genes with correlated expression patterns across 53 tissues and TF targets were respectively identified from Bray-Curtis Similarity and TF knockdown experiments. Corresponding promoter sequences were reduced to DNase I-accessible intervals; TFBSs were then identified within these intervals using information theory-based position weight matrices for each TF (iPWMs) and clustered. Features from information-dense TFBS clusters predicted these genes with machine learning classifiers, which were evaluated for accuracy, specificity and sensitivity. Mutations in TFBSs were analyzed in silico to examine their impact on cluster densities and the regulatory states of target genes. Results: We initially chose the glucocorticoid receptor gene (NR3C1), whose regulation has been extensively studied, to test this approach. SLC25A32 and TANK were found to exhibit the most similar expression patterns to NR3C1. A Decision Tree classifier exhibited the largest area under the Receiver Operating Characteristic (ROC) curve in detecting such genes. Target gene predictions were confirmed using siRNA knockdown of TFs, which proved more accurate than predictions made after CRISPR/CAS9 inactivation. In silico mutation analyses of TFBSs also revealed that one or more information-dense TFBS clusters in promoters are required for accurate target gene prediction.
Conclusions: Machine learning based on TFBS information density, organization, and chromatin accessibility accurately identifies gene targets with comparable tissue-wide expression patterns. Multiple information-dense TFBS clusters in promoters appear to protect promoters from effects of deleterious binding site mutations in a single TFBS that would otherwise alter regulation of these genes.
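
The Methods above rank genes by Bray-Curtis Similarity of their expression across tissues. The abstract does not give the authors' implementation, so the following is only a minimal sketch of the standard definition (one minus the Bray-Curtis dissimilarity), assuming non-negative expression values; the function name and the toy vectors are hypothetical.

```python
def bray_curtis_similarity(x, y):
    """Return 1 - Bray-Curtis dissimilarity for two equal-length,
    non-negative expression profiles (assumed input format)."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    # Sum of absolute per-tissue differences.
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    # Sum of all values across both profiles.
    denominator = sum(a + b for a, b in zip(x, y))
    if denominator == 0:
        # Two all-zero profiles are treated as identical (a convention chosen here).
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical expression levels across three tissues (illustration only).
gene_a = [2.0, 0.0, 4.0]
gene_b = [1.0, 1.0, 4.0]
similarity = bray_curtis_similarity(gene_a, gene_b)
```

Identical profiles score 1 and profiles with no shared expression score 0, so genes can be ranked by this value against a reference gene.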


2022 ◽  
Author(s):  
Gabriela Garcia ◽  
Tharanga Kariyawasam ◽  
Anton Lord ◽  
Cristiano Costa ◽  
Lana Chaves ◽  
...  

Abstract We describe the first application of the near-infrared spectroscopy (NIRS) technique to detect Plasmodium falciparum and P. vivax malaria parasites through the skin of malaria positive and negative human subjects. NIRS is a rapid, non-invasive and reagent-free technique which involves rapid interaction of a beam of light with a biological sample to produce diagnostic signatures in seconds. We used a handheld, miniaturized spectrometer to shine NIRS light on the ear, arm and finger of P. falciparum (n=7) and P. vivax (n=20) positive people and malaria negative individuals (n=33) in a malaria endemic setting in Brazil. Supervised machine learning algorithms were applied to predict malaria infection status in independent individuals (n=12). Separate machine learning algorithms for differentiating P. falciparum from P. vivax infected subjects were developed using spectra from the arm and ear of P. falciparum and P. vivax infected subjects (n=108), and the resultant model predicted infection in spectra of their fingers (n=54). NIRS non-invasively detected malaria positive and negative individuals that were excluded from the model with 100% sensitivity, 83% specificity and 92% accuracy (n=12) with spectra collected from the arm. Moreover, NIRS also correctly differentiated P. vivax from P. falciparum positive individuals with a predictive accuracy of 93% (n=54). These findings are promising, but further work on a larger scale is needed to address several gaps in knowledge and establish the full capacity of NIRS as a non-invasive diagnostic tool for malaria. It is recommended that the tool be further evaluated in multiple epidemiological and demographic settings where other factors such as age, mixed infection and skin colour can be incorporated into predictive algorithms to produce more robust models for universal diagnosis of malaria.
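
The sensitivity, specificity and accuracy figures quoted above follow from a confusion matrix. The abstract does not report the underlying counts, so the sketch below uses hypothetical counts (6 true positives, 0 false negatives, 5 true negatives, 1 false positive) chosen only because they are consistent with the reported percentages for n=12; the function name is likewise an assumption.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # fraction of infected individuals detected
    specificity = tn / (tn + fp)   # fraction of uninfected individuals cleared
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return sensitivity, specificity, accuracy


# Hypothetical counts for 12 held-out individuals (not taken from the study).
sens, spec, acc = diagnostic_metrics(tp=6, fp=1, tn=5, fn=0)
print(f"sensitivity={sens:.0%} specificity={spec:.0%} accuracy={acc:.0%}")
```

With a test set this small, a single misclassified individual moves specificity by about 17 percentage points, which is one reason the abstract calls for larger-scale evaluation.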


2021 ◽  
Author(s):  
Shinya IWASE ◽  
Taka-aki Nakada ◽  
Tadanaga Shimada ◽  
Takehiko Oami ◽  
Takashi Shimazui ◽  
...  

Abstract Background: Machine learning can predict outcomes and determine variables contributing to precise prediction, and can thus classify patients with different risk factors of outcomes. This study aimed to investigate the predictive accuracy for mortality and length of stay in intensive care unit (ICU) patients using machine learning, and to identify the variables contributing to the precise prediction or classification of patients. Methods: Patients (n=12,747) admitted to the ICU at Chiba University Hospital were randomly assigned to the training and test cohorts. After learning using the variables on admission in the training cohort, the area under the curve (AUC) was analyzed in the test cohort to evaluate the predictive accuracy of the supervised machine learning classifiers, including random forest (RF), for outcomes (primary outcome, mortality; secondary outcome, length of ICU stay). The rank of the variables that contributed to the machine learning prediction was confirmed, and cluster analysis of the patients with risk factors of mortality was performed to identify the important variables associated with patient outcomes. Results: Machine learning using RF revealed a high predictive value for mortality, with an AUC of 0.945. In addition, RF showed high predictive value for short and long ICU stays, with AUCs of 0.881 and 0.889, respectively. Lactate dehydrogenase (LDH) was identified as a variable contributing to the precise prediction in machine learning for both mortality and length of ICU stay. LDH was also identified as a contributing variable to classify patients into sub-populations based on different risk factors of mortality. Conclusion: The machine learning algorithm could predict mortality and length of stay in ICU patients with high accuracy. LDH was identified as a contributing variable in mortality and length of ICU stay prediction and could be used to classify patients based on mortality risk.


Plants ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 2721
Author(s):  
Chao Tan ◽  
Huilei Qiao ◽  
Ming Ma ◽  
Xue Wang ◽  
Yunyun Tian ◽  
...  

The basic helix-loop-helix (bHLH) transcription factor family is one of the largest transcription factor families in plants and plays crucial roles in plant development. Melon is an important horticultural plant as well as an attractive model plant for studying fruit ripening. However, the bHLH gene family of melon has not yet been identified, and its functions in fruit growth and ripening have seldom been studied. In this study, 118 bHLH genes were identified in the melon genome. These CmbHLH genes were unevenly distributed on chromosomes 1 to 12, and five CmbHLHs were tandem repeats on chromosomes 4 and 8. There were 13 intron distribution patterns among the CmbHLH genes. Phylogenetic analysis illustrated that these CmbHLHs could be classified into 16 subfamilies. Expression patterns of the CmbHLH genes were studied using transcriptome data. Tissue-specific expression of the CmbHLH32 gene was analysed by quantitative RT-PCR. The results showed that the CmbHLH32 gene was highly expressed in female flowers and early developmental stage fruit. Transgenic melon lines overexpressing CmbHLH32 were generated, and overexpression of CmbHLH32 resulted in early fruit ripening compared to wild type. The CmbHLH transcription factor family was identified and analysed for the first time in melon, and overexpression of CmbHLH32 affected the ripening time of melon fruit. These findings lay a foundation for further study on the role of bHLH family members in the growth and development of melon.


2019 ◽  
Author(s):  
Anup P. Challa ◽  
Andrew L. Beam ◽  
Min Shen ◽  
Tyler Peryea ◽  
Robert R. Lavieri ◽  
...  

Abstract Pregnant women are an especially vulnerable population, given the sensitivity of a developing fetus to chemical exposures. However, prescribing behavior for the gravid patient is guided by limited human data and conflicting cases of adverse outcomes due to the exclusion of pregnant populations from randomized, controlled trials. These factors increase risk for adverse drug outcomes and reduce quality of care for pregnant populations. Herein, we propose the application of artificial intelligence to systematically predict the teratogenicity of a prescriptible small molecule from information inherent to the drug. Using unsupervised and supervised machine learning, our model probes all small molecules with known structure and teratogenicity data published in research-amenable formats to identify patterns among structural, meta-structural, and in vitro bioactivity data for each drug and its teratogenicity score. With this workflow, we discovered three chemical functionalities that predispose a drug towards increased teratogenicity and two moieties with potentially protective effects. Our models predict three clinically relevant classes of teratogenicity with AUC = 0.8 and nearly double the predictive accuracy of a blind control for the same task, suggesting successful modeling. We also present extensive barriers to translational research that restrict data-driven studies in pregnancy and therapeutically “orphan” pregnant populations. Collectively, this work represents a first-in-kind platform for the application of computing to study and predict teratogenicity.

