Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer SVMs Using Integrated Gradients

2018 ◽  
Author(s):  
Avanti Shrikumar ◽  
Eva Prakash ◽  
Anshul Kundaje

Abstract. Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos are available at http://bit.ly/gkmexplainvids.

2019 ◽  
Vol 35 (14) ◽  
pp. i173-i182 ◽  
Author(s):  
Avanti Shrikumar ◽  
Eva Prakash ◽  
Anshul Kundaje

Abstract. Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines. Availability and implementation: Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain. Supplementary information: Supplementary data are available at Bioinformatics online.


2018 ◽  
Author(s):  
Peyton Greenside ◽  
Tyler Shimko ◽  
Polly Fordyce ◽  
Anshul Kundaje

Abstract. Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models. Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics. Availability: Code is available at https://github.com/kundajelab/dfim. Contact: [email protected]


Micromachines ◽  
2018 ◽  
Vol 9 (11) ◽  
pp. 587 ◽  
Author(s):  
Malgorzata Straka ◽  
Benjamin Shafer ◽  
Srikanth Vasudevan ◽  
Cristin Welle ◽  
Loren Rieth

Characterizing the aging processes of electrodes in vivo is essential in order to elucidate the changes of the electrode–tissue interface and the device. However, commonly used impedance measurements at 1 kHz are insufficient for determining electrode viability, with measurements being prone to false positives. We implanted cohorts of five iridium oxide (IrOx) and six platinum (Pt) Utah arrays into the sciatic nerve of rats, and collected the electrochemical impedance spectroscopy (EIS) up to 12 weeks or until array failure. We developed a method to classify the shapes of the magnitude and phase spectra, and correlated the classifications to circuit models and electrochemical processes at the interface likely responsible. We found categories of EIS characteristic of iridium oxide tip metallization, platinum tip metallization, tip metal degradation, encapsulation degradation, and wire breakage in the lead. We also fitted the impedance spectra as features to a fine-Gaussian support vector machine (SVM) algorithm for both IrOx and Pt tipped arrays, with a prediction accuracy for categories of 95% and 99%, respectively. Together, this suggests that these simple and computationally efficient algorithms are sufficient to explain the majority of variance across a wide range of EIS data describing Utah arrays. These categories were assessed over time, providing insights into the degradation and failure mechanisms for both the electrode–tissue interface and wire bundle. Methods developed in this study will allow for a better understanding of how EIS can characterize the physical changes to electrodes in vivo.


2019 ◽  
Author(s):  
Shubhada R. Kulkarni ◽  
D. Marc Jones ◽  
Klaas Vandepoele

Abstract. Determining where transcription factors (TFs) bind in genomes provides insights into which transcriptional programs are active across organs, tissue types, and environmental conditions. Recent advances in high-throughput profiling of regulatory DNA have yielded large amounts of information about chromatin accessibility. Interpreting the functional significance of these datasets requires knowledge of which regulators are likely to bind these regions. This can be achieved by using information about TF binding preferences, or motifs, to identify TF binding events that are likely to be functional. Although different approaches exist to map motifs to DNA sequences, a systematic evaluation of these tools in plants is missing. Here we compare four motif mapping tools widely used in the Arabidopsis research community and evaluate their performance using chromatin immunoprecipitation datasets for 40 TFs. Downstream gene regulatory network (GRN) reconstruction was found to be sensitive to the motif mapper used. We further show that the low recall of FIMO, one of the most frequently used motif mapping tools, can be overcome by using an Ensemble approach, which combines results from different mapping tools. Several examples are provided demonstrating how the Ensemble approach extends our view on transcriptional control for TFs active in different biological processes. Finally, a new protocol is presented to efficiently derive more complete cell type-specific GRNs through the integrative analysis of open chromatin regions, known binding site information, and expression datasets.


2016 ◽  
Author(s):  
Genivaldo Gueiros Z. Silva ◽  
Bas E. Dutilh ◽  
Robert A. Edwards

Abstract. Summary: Metagenomics approaches rely on identifying the presence of organisms in the microbial community from a set of unknown DNA sequences. Sequence classification has valuable applications in multiple important areas of medical and environmental research. Here we introduce FOCUS2, an update of the previously published computational method FOCUS. FOCUS2 was tested with 10 simulated and 543 real metagenomes, demonstrating that the program is more sensitive, faster, and more computationally efficient than existing methods. Availability: The Python implementation is freely available at https://edwards.sdsu.edu/FOCUS2. Supplementary information: available at Bioinformatics online.


2017 ◽  
Vol 35 (4) ◽  
pp. 837-854 ◽  
Author(s):  
Cristina M Alexandre ◽  
James R Urton ◽  
Ken Jean-Baptiste ◽  
John Huddleston ◽  
Michael W Dorrity ◽  
...  

Abstract. Variation in regulatory DNA is thought to drive phenotypic variation, evolution, and disease. Prior studies of regulatory DNA and transcription factors across animal species highlighted a fundamental conundrum: Transcription factor binding domains and cognate binding sites are conserved, while regulatory DNA sequences are not. It remains unclear how conserved transcription factors and dynamic regulatory sites produce conserved expression patterns across species. Here, we explore regulatory DNA variation and its functional consequences within Arabidopsis thaliana, using chromatin accessibility to delineate regulatory DNA genome-wide. Unlike in previous cross-species comparisons, the positional homology of regulatory DNA is maintained among A. thaliana ecotypes and less nucleotide divergence has occurred. Of the ∼50,000 regulatory sites in A. thaliana, we found that 15% varied in accessibility among ecotypes. Some of these accessibility differences were associated with extensive, previously unannotated sequence variation, encompassing many deletions and ancient hypervariable alleles. Unexpectedly, for the majority of such regulatory sites, nearby gene expression was unaffected. Nevertheless, regulatory sites with high levels of sequence variation and differential chromatin accessibility were the most likely to be associated with differential gene expression. Finally, and most surprising, we found that the vast majority of differentially accessible sites show no underlying sequence variation. We argue that these surprising results highlight the necessity to consider higher-order regulatory context in evaluating regulatory variation and predicting its phenotypic consequences.


2021 ◽  
Author(s):  
Eva Prakash ◽  
Avanti Shrikumar ◽  
Anshul Kundaje

Deep neural networks and support vector machines have been shown to accurately predict genomewide signals of regulatory activity from raw DNA sequences. These models are appealing in part because they can learn predictive DNA sequence features without prior assumptions. Several methods such as in-silico mutagenesis, GradCAM, DeepLIFT, Integrated Gradients and GkmExplain have been developed to reveal these learned features. However, the behavior of these methods on regulatory genomic data remains an area of active research. Although prior work has benchmarked these methods on simulated datasets with known ground-truth motifs, these simulations employed highly simplified regulatory logic that is not representative of the genome. In this work, we propose a novel pipeline for designing simulated data that comes closer to modeling the complexity of regulatory genomic DNA. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments and use these datasets to benchmark different interpretation methods based on their ability to identify ground-truth motifs. We find that a GradCAM-based method, which was reported to perform well on a more simplified dataset, does not do well on this dataset (particularly when using an architecture with shorter convolutional kernels in the first layer), and we theoretically show that this is expected based on the nature of regulatory genomic data. We also show that Integrated Gradients sometimes performs worse than gradient-times-input, likely owing to its linear interpolation path. We additionally explore the impact of user-defined settings on the interpretation methods, such as the choice of "reference"/"baseline", and identify recommended settings for genomics. Our analysis suggests several promising directions for future research on these model interpretation methods. Code and links to data are available at https://github.com/kundajelab/interpret-benchmark.
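
The comparison between Integrated Gradients and gradient-times-input mentioned above can be illustrated with a minimal sketch. It is not the code from the cited repository: the toy model, its hand-written gradient, the all-zeros baseline, and the function names are assumptions made for illustration. Integrated Gradients averages gradients along the straight line from a baseline ("reference") to the input, whereas gradient-times-input uses only the gradient at the input itself.

```python
from typing import Callable, List, Sequence


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum.

    Gradients are averaged along the linear interpolation path from
    `baseline` to `x`, then scaled by (x - baseline).
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_fn(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def gradient_times_input(
    grad_fn: Callable[[Sequence[float]], List[float]],
    x: Sequence[float],
) -> List[float]:
    """Gradient evaluated at the input only, multiplied elementwise by the input."""
    return [g * xi for g, xi in zip(grad_fn(x), x)]


# Hypothetical toy model (an assumption, not a genomics model):
# f(x) = x0 * x1 + x2 ** 2
def toy_model(x: Sequence[float]) -> float:
    return x[0] * x[1] + x[2] ** 2


def toy_grad(x: Sequence[float]) -> List[float]:
    return [x[1], x[0], 2.0 * x[2]]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]  # the choice of baseline is a user-defined setting
ig = integrated_gradients(toy_grad, x, baseline)
gxi = gradient_times_input(toy_grad, x)
```

In this toy case the Integrated Gradients attributions sum to `toy_model(x) - toy_model(baseline)`, while the gradient-times-input attributions do not; changing `baseline` changes the Integrated Gradients attributions, which is why the choice of reference matters.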


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Ebraheem Alzahrani ◽  
Wajdi Alghamdi ◽  
Malik Zaka Ullah ◽  
Yaser Daanial Khan

Abstract. Proteins are a vital component of cells that perform physiological functions to ensure smooth operations of bodily functions. Identification of a protein's function involves a detailed understanding of the structure of proteins. Stress proteins are essential mediators of several responses to cellular stress and are categorized based on their structural characteristics. These proteins are found to be conserved across many eukaryotic and prokaryotic lineages and demonstrate varied crucial functional activities inside a cell. The in vivo, ex vivo, and in vitro identification of stress proteins is a time-consuming and costly task. This study is aimed at the identification of stress protein sequences with the aid of mathematical modelling and machine learning methods to supplement the aforementioned wet lab methods. The model developed using Random Forest showed remarkable results with 91.1% accuracy, while models based on neural network and support vector machine showed 87.7% and 47.0% accuracy, respectively. Based on evaluation results, it was concluded that the random forest-based classifier surpassed all other predictors and is suitable for use in practical applications for the identification of stress proteins. A live web server is available at http://biopred.org/stressprotiens, while the web server code is available at https://github.com/abdullah5naveed/SRP_WebServer.git


2019 ◽  
Vol 35 (22) ◽  
pp. 4640-4646 ◽  
Author(s):  
Xi Han ◽  
Xiaonan Wang ◽  
Kang Zhou

Abstract. Motivation: Protein activity is a significant characteristic for recombinant proteins which can be used as biocatalysts. High activity of proteins reduces the cost of biocatalysts. A model that can predict protein activity from amino acid sequence is highly desired, as it aids experimental improvement of proteins. However, only limited data for protein activity are currently available, which prevents the development of such models. Since protein activity and solubility are correlated for some proteins, the publicly available solubility dataset may be adopted to develop models that can predict protein solubility from sequence. The models could serve as a tool to indirectly predict protein activity from sequence. In the literature, predicting protein solubility from sequence has been intensively explored, but the predicted solubility represented in binary values from all the developed models was not suitable for guiding experimental designs to improve protein solubility. Here we propose new machine learning (ML) models for improving protein solubility in vivo. Results: We first implemented a novel approach that predicted protein solubility in continuous numerical values instead of binary ones. After combining it with various ML algorithms, we achieved an R² of 0.4115 when the support vector machine algorithm was used. Continuous values of solubility are more meaningful in protein engineering, as they enable researchers to choose proteins with higher predicted solubility for experimental validation, while binary values fail to distinguish proteins with the same value, since there are only two possible values and many proteins share one. Availability and implementation: We present the ML workflow as a series of IPython notebooks hosted on GitHub (https://github.com/xiaomizhou616/protein_solubility). The workflow can be used as a template for analysis of other expression and solubility datasets.
Supplementary information: Supplementary data are available at Bioinformatics online.
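
For reference, the R² reported above is presumably the conventional coefficient of determination; the abstract does not state which definition was used, so the formula below is an assumption. Here $y_i$ denotes a measured solubility value, $\hat{y}_i$ the corresponding model prediction, and $n$ the number of proteins evaluated (symbols introduced here for illustration).

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i
```

Under this definition, a value of 0.4115 would mean the model accounts for roughly 41% of the variance in the solubility values it was evaluated on.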

