Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

Mapping Intimacies ◽

10.1101/302711 ◽

2018 ◽

Cited By ~ 2

Author(s):

Peyton Greenside ◽

Tyler Shimko ◽

Polly Fordyce ◽

Anshul Kundaje

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Chromatin Accessibility ◽

Feature Interaction ◽

Core Motif ◽

Feature Interactions ◽

Binding Models ◽

Regulatory Dna Sequences ◽

Regulatory Dna

Abstract

Motivation: Transcription factors bind regulatory DNA sequences in a combinatorial manner to modulate gene expression. Deep neural networks (DNNs) can learn the cis-regulatory grammars encoded in regulatory DNA sequences associated with transcription factor binding and chromatin accessibility. Several feature attribution methods have been developed for estimating the predictive importance of individual features (nucleotides or motifs) in any input DNA sequence to its associated output prediction from a DNN model. However, these methods do not reveal higher-order feature interactions encoded by the models.

Results: We present a new method called Deep Feature Interaction Maps (DFIM) to efficiently estimate interactions between all pairs of features in any input DNA sequence. DFIM accurately identifies ground truth motif interactions embedded in simulated regulatory DNA sequences. DFIM identifies synergistic interactions between GATA1 and TAL1 motifs from in vivo TF binding models. DFIM reveals epistatic interactions involving nucleotides flanking the core motif of the Cbf1 TF in yeast from in vitro TF binding models. We also apply DFIM to regulatory sequence models of in vivo chromatin accessibility to reveal interactions between regulatory genetic variants and proximal motifs of target TFs as validated by TF binding quantitative trait loci. Our approach makes significant strides in improving the interpretability of deep learning models for genomics.

Availability: Code is available at https://github.com/kundajelab/dfim.

Contact: [email protected]
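As an illustration of what a pairwise (epistatic) feature interaction means, the following is a minimal sketch based on a brute-force double-knockout calculation. It is not the DFIM algorithm, whose internals the abstract does not describe. The scoring function `toy_model`, the motif strings, and the example sequence are all hypothetical stand-ins for a trained model and real data.

```python
def toy_model(seq):
    """Hypothetical stand-in for a trained model: two motifs score
    1.0 each on their own and gain an extra 2.0 when both are present."""
    has_a = "GATA" in seq
    has_b = "CAGCTG" in seq
    score = 0.0
    if has_a:
        score += 1.0
    if has_b:
        score += 1.0
    if has_a and has_b:
        score += 2.0  # synergy term: this is what the interaction score detects
    return score


def knock_out(seq, start, length, fill="N"):
    """Replace a feature (start, length) with a neutral filler character."""
    return seq[:start] + fill * length + seq[start + length:]


def interaction_score(model, seq, feat_a, feat_b):
    """Double-knockout epistasis: f(wt) - f(-A) - f(-B) + f(-A,-B).
    Zero when the two features contribute additively."""
    wild_type = model(seq)
    without_a = model(knock_out(seq, *feat_a))
    without_b = model(knock_out(seq, *feat_b))
    without_both = model(knock_out(knock_out(seq, *feat_a), *feat_b))
    return wild_type - without_a - without_b + without_both


# Hypothetical example sequence containing both motifs.
example_seq = "TTGATAACCCAGCTGTT"
feat_a = (example_seq.find("GATA"), 4)
feat_b = (example_seq.find("CAGCTG"), 6)
print(interaction_score(toy_model, example_seq, feat_a, feat_b))
```

A score of zero would indicate that the two features contribute independently; a positive score indicates synergy. Scoring all feature pairs this way requires a number of model evaluations that grows quadratically with the number of features, which is the kind of cost an efficient method would aim to avoid.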

Discovering epistatic feature interactions from neural network models of regulatory DNA sequences

Bioinformatics ◽

10.1093/bioinformatics/bty575 ◽

2018 ◽

Vol 34 (17) ◽

pp. i629-i637 ◽

Cited By ~ 23

Author(s):

Peyton Greenside ◽

Tyler Shimko ◽

Polly Fordyce ◽

Anshul Kundaje

Keyword(s):

Neural Network ◽

Dna Sequences ◽

Network Models ◽

Neural Network Models ◽

Feature Interactions ◽

Regulatory Dna Sequences ◽

Regulatory Dna

Gkmexplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer SVMs Using Integrated Gradients

10.1101/457606 ◽

2018 ◽

Cited By ~ 1

Author(s):

Avanti Shrikumar ◽

Eva Prakash ◽

Anshul Kundaje

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

Chromatin Accessibility ◽

Support Vector ◽

Computationally Efficient ◽

Link Type ◽

Novel Approach ◽

Mutation Impact ◽

Regulatory Dna

Abstract

Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose gkmexplain: a novel approach inspired by the method of Integrated Gradients for interpreting gkm-SVM models. Using simulated regulatory DNA sequences, we show that gkmexplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. We use a novel motif discovery method called TF-MoDISco to recover consolidated TF motifs from gkm-SVM models of in vivo TF binding by aggregating predictive patterns identified by gkmexplain. Finally, we find that mutation impact scores derived through gkmexplain using gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines consistently outperform deltaSVM and ISM at identifying regulatory genetic variants (dsQTLs). Code and example notebooks replicating the workflow are available at https://github.com/kundajelab/gkmexplain. Explanatory videos are available at http://bit.ly/gkmexplainvids.
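For readers unfamiliar with Integrated Gradients, the method this abstract cites as its inspiration, the following is a minimal sketch of the generic technique on a toy differentiable function. It is not gkmexplain and does not involve gapped k-mer kernels. The function `toy_model`, its weights, and the all-zeros baseline are assumptions chosen for illustration, and gradients are approximated by finite differences rather than computed analytically.

```python
import math


def toy_model(x):
    """Hypothetical differentiable model: sigmoid of a weighted sum
    plus one pairwise term (not a gkm-SVM)."""
    z = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[2]
    return 1.0 / (1.0 + math.exp(-z))


def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2.0 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x
    (midpoint Riemann sum), then scale by (x - baseline)."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i, g in enumerate(grad):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model, x, baseline)
# Completeness property: attributions sum (approximately) to f(x) - f(baseline).
print(attributions, sum(attributions), toy_model(x) - toy_model(baseline))
```

The printed sum of attributions should closely match the difference between the model output at the input and at the baseline, which is the completeness property that distinguishes Integrated Gradients from plain gradient-times-input scores.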

GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs

Bioinformatics ◽

10.1093/bioinformatics/btz322 ◽

2019 ◽

Vol 35 (14) ◽

pp. i173-i182 ◽

Cited By ~ 12

Author(s):

Avanti Shrikumar ◽

Eva Prakash ◽

Anshul Kundaje

Keyword(s):

Dna Sequences ◽

Motif Discovery ◽

Chromatin Accessibility ◽

Supplementary Information ◽

Support Vector ◽

Computationally Efficient ◽

Sequence Patterns ◽

Mutation Impact ◽

Regulatory Dna

Abstract

Summary: Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM) or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo transcription factor (TF) binding data, we recover consolidated, non-redundant TF motifs. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines.

Availability and implementation: Code and example notebooks to reproduce results are at https://github.com/kundajelab/gkmexplain.

Supplementary information: Supplementary data are available at Bioinformatics online.

DNA thermodynamics shape chromosome organization and topology

Biochemical Society Transactions ◽

10.1042/bst20120334 ◽

2013 ◽

Vol 41 (2) ◽

pp. 548-553 ◽

Cited By ~ 13

Author(s):

Andrew A. Travers ◽

Georgi Muskhelishvili

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Chromosome Organization ◽

Topological Properties ◽

Genetic Organization ◽

And Topology ◽

Dna Translocases

How much information is encoded in the DNA sequence of an organism? We argue that the informational, mechanical and topological properties of DNA are interdependent and act together to specify the primary characteristics of genetic organization and chromatin structures. Superhelicity generated in vivo, in part by the action of DNA translocases, can be transmitted to topologically sensitive regions encoded by less stable DNA sequences.

Simian virus 40 major late promoter: an upstream DNA sequence required for efficient in vitro transcription

Molecular and Cellular Biology ◽

10.1128/mcb.4.1.133-141.1984 ◽

1984 ◽

Vol 4 (1) ◽

pp. 133-141

Author(s):

J Brady ◽

M Radonovich ◽

M Thoren ◽

G Das ◽

N P Salzman

Keyword(s):

Dna Sequence ◽

Dna Sequences ◽

Simian Virus 40 ◽

Promoter Sequence ◽

In Vitro Transcription ◽

Simian Virus ◽

Upstream Promoter ◽

Upstream Promoter Sequence

We have previously identified an 11-base DNA sequence, 5'-G-G-T-A-C-C-T-A-A-C-C-3' (simian virus 40 [SV40] map position 294 to 304), which is important in the control of SV40 late RNA expression in vitro and in vivo (Brady et al., Cell 31:625-633, 1982). We report here the identification of another domain of the SV40 late promoter. A series of mutants with deletions extending from SV40 map position 0 to 300 was prepared by nuclease BAL 31 treatment. The cloned templates were then analyzed for efficiency and accuracy of late SV40 RNA expression in the Manley in vitro transcription system. Our studies showed that, in addition to the promoter domain near map position 300, there are essential DNA sequences between nucleotide positions 74 and 95 that are required for efficient expression of late SV40 RNA. Included in this SV40 DNA sequence were two of the six GGGCGG SV40 repeat sequences and an 11-nucleotide segment which showed strong homology with the upstream sequences required for the efficient in vitro and in vivo expression of the histone H2A gene. This upstream promoter sequence supported transcription with the same efficiency even when it was moved 72 nucleotides closer to the major late cap site. In vitro promoter competition analysis demonstrated that the upstream promoter sequence, independent of the 294 to 304 promoter element, is capable of binding polymerase-transcription factors required for SV40 late gene transcription. Finally, we show that DNA sequences which control the specificity of RNA initiation at nucleotide 325 lie downstream of map position 294.

Borders of Cis-Regulatory DNA Sequences Preferentially Harbor the Divergent Transcription Factor Binding Motifs in the Human Genome

Frontiers in Genetics ◽

10.3389/fgene.2018.00571 ◽

2018 ◽

Vol 9 ◽

Author(s):

Jia-Hsin Huang ◽

Ryan Shun-Yuen Kwan ◽

Zing Tsung-Yeh Tsai ◽

Tzu-Chieh Lin ◽

Huai-Kuang Tsai

Keyword(s):

Transcription Factor ◽

Human Genome ◽

Dna Sequences ◽

Transcription Factor Binding ◽

Binding Motifs ◽

Factor Binding ◽

Regulatory Dna Sequences ◽

Transcription Factor Binding Motifs ◽

Divergent Transcription ◽

Regulatory Dna

Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays

PLoS ONE ◽

10.1371/journal.pone.0218073 ◽

2019 ◽

Vol 14 (6) ◽

pp. e0218073 ◽

Cited By ~ 17

Author(s):

Rajiv Movva ◽

Peyton Greenside ◽

Georgi K. Marinov ◽

Surag Nair ◽

Avanti Shrikumar ◽

...

Keyword(s):

Neural Network ◽

Dna Sequences ◽

Genetic Variants ◽

Network Models ◽

Massively Parallel ◽

Neural Network Models ◽

Regulatory Dna Sequences ◽

Reporter Assays ◽

Regulatory Dna

Molecular identification of regulatory DNA sequences for basal and gamma-interferon induced expression of HLA DRα in human multiforme glioblastoma cell lines

Journal of Neuroimmunology ◽

10.1016/0165-5728(87)90152-4 ◽

1987 ◽

Vol 16 (1) ◽

pp. 15

Author(s):

Patricia Basta ◽

Paula Sherman ◽

Jenny Ting

Keyword(s):

Cell Lines ◽

Molecular Identification ◽

Dna Sequences ◽

Glioblastoma Cell ◽

Induced Expression ◽

Regulatory Dna Sequences ◽

Regulatory Dna ◽

Glioblastoma Cell Lines

Functional Comparison of the Upstream Regulatory DNA Sequences of Four Human Epidermal Keratin Genes

Journal of Investigative Dermatology ◽

10.1111/1523-1747.ep12460939 ◽

1991 ◽

Vol 96 (2) ◽

pp. 162-167 ◽

Cited By ~ 25

Author(s):

Chuan-Kui Jiang ◽

Howard S Epstein ◽

Marjana Tomic ◽

Irwin M Freedberg ◽

Miroslav Blumenberg

Keyword(s):

Dna Sequences ◽

Regulatory Dna Sequences ◽

Regulatory Dna

STRUCTURAL AND FUNCTIONAL CHARACTERIZATION OF PUTATIVE REGULATORY DNA SEQUENCES OF FCP GENES IN THE CENTRIC DIATOM CYCLOTELLA CRYPTICA

Diatom Research ◽

10.1080/0269249x.2008.9705735 ◽

2008 ◽

Vol 23 (1) ◽

pp. 31-49 ◽

Cited By ~ 5

Author(s):

Tanja Brakemann ◽

Frank Becker ◽

Peter Kroth ◽

Erhard Rhiel

Keyword(s):

Dna Sequences ◽

Functional Characterization ◽

Regulatory Dna Sequences ◽

Regulatory Dna
