Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks

Peter K. Koo; Antonio Majdandzic; Matthew Ploenzke; Praveen Anand; Steffan B. Paul

doi:10.1371/journal.pcbi.1008925

PLoS Computational Biology ◽

10.1371/journal.pcbi.1008925 ◽

2021 ◽

Vol 17 (5) ◽

pp. e1008925

Author(s):

Peter K. Koo ◽

Antonio Majdandzic ◽

Matthew Ploenzke ◽

Praveen Anand ◽

Steffan B. Paul

Keyword(s):

Neural Networks ◽

Protein Interactions ◽

Effect Size ◽

Deep Neural Networks ◽

Rna Binding ◽

Rna Binding Proteins ◽

Population Level ◽

Sequence Motifs ◽

Convolutional Network ◽

Importance Analysis

Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network, which we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
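The population-level effect size described in this abstract can be illustrated with a minimal sketch. It assumes one common reading of the idea: embed a putative pattern at a fixed position in many background sequences and take the difference between the mean model prediction with and without the pattern. Everything here is hypothetical and for illustration only: `toy_model` is a stand-in scoring function, not ResidualBind, and the uniform random backgrounds, the motif `UGCAUG`, and the embedding position are assumptions rather than details taken from the paper.

```python
import random

ALPHABET = "ACGU"


def toy_model(seq):
    """Hypothetical stand-in for a trained network: counts 'UGCAUG' occurrences."""
    motif = "UGCAUG"
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))


def sample_background(n, length, rng):
    """Draw n uniform random RNA sequences (an assumed background distribution)."""
    return ["".join(rng.choice(ALPHABET) for _ in range(length)) for _ in range(n)]


def embed(seq, pattern, position):
    """Overwrite part of seq with pattern, keeping the sequence length fixed."""
    return seq[:position] + pattern + seq[position + len(pattern):]


def global_importance(model, backgrounds, pattern, position):
    """Mean prediction with the pattern embedded minus mean prediction without it."""
    with_pattern = [model(embed(s, pattern, position)) for s in backgrounds]
    without_pattern = [model(s) for s in backgrounds]
    return (sum(with_pattern) / len(with_pattern)
            - sum(without_pattern) / len(without_pattern))


rng = random.Random(0)
backgrounds = sample_background(500, 41, rng)

# Effect of the pattern the toy model responds to, and of a control pattern.
effect = global_importance(toy_model, backgrounds, "UGCAUG", 17)
control = global_importance(toy_model, backgrounds, "AAAAAA", 17)
print(f"UGCAUG: {effect:.3f}  AAAAAA: {control:.3f}")
```

Averaging over many backgrounds is what makes the estimate a population-level quantity rather than an explanation of a single sequence. The same function could be called with two patterns embedded at varying distances to probe spacing or interaction effects.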


Global Importance Analysis: A Method to Quantify Importance of Genomic Features in Deep Neural Networks

10.1101/2020.09.08.288068 ◽

2020 ◽

Author(s):

Peter K. Koo ◽

Matthew Ploenzke ◽

Praveen Anand ◽

Steffan B. Paul ◽

Antonio Majdandzic

Keyword(s):

Neural Networks ◽

Effect Size ◽

Deep Neural Networks ◽

Rna Binding ◽

Rna Binding Proteins ◽

Sequence Motifs ◽

Single Nucleotide Variants ◽

Convolutional Network ◽

Model Predictions ◽

Importance Analysis

Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. For model interpretability, attribution methods have been employed to reveal learned patterns that resemble sequence motifs. First-order attribution methods only quantify the independent importance of single nucleotide variants in a given sequence; they do not provide the effect size of motifs (or their interactions with other patterns) on model predictions. Here we introduce global importance analysis (GIA), a new model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a new convolutional network, which we call ResidualBind, and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.


Interpreting Deep Neural Networks Beyond Attribution Methods: Quantifying Global Importance of Genomic Features

10.1101/2020.02.19.956896 ◽

2020 ◽

Cited By ~ 1

Author(s):

Peter K. Koo ◽

Matt Ploenzke

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Population Level ◽

Computational Genomics ◽

Great Success ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genomic Features ◽

High Performing ◽

Importance Analysis

Although deep neural networks (DNNs) have found great success at improving performance on various prediction tasks in computational genomics, it remains difficult to understand why they make any given prediction. In genomics, the main approaches to interpret a high-performing DNN are to visualize learned representations via weight visualizations and attribution methods. While these methods can be informative, each has strong limitations. For instance, attribution methods only uncover the independent contribution of single nucleotide variants in a given sequence. Here we discuss and argue for global importance analysis, which can quantify population-level importance of putative features and their interactions learned by a DNN. We highlight recent work that has benefited from this interpretability approach and then discuss connections between global importance analysis and causality.


De novo prediction of RNA-protein interactions with Graph Neural Networks

10.1101/2021.09.28.462100 ◽

2021 ◽

Author(s):

Viplove Arora ◽

Guido Sanguinetti

Keyword(s):

Neural Networks ◽

Protein Interactions ◽

Large Scale ◽

Rna Binding ◽

De Novo ◽

Rna Binding Proteins ◽

Large Data ◽

Data Sets ◽

Graph Neural Networks ◽

Post Transcriptional Regulation

RNA-binding proteins (RBPs) are key co- and post-transcriptional regulators of gene expression, playing a crucial role in many biological processes. Experimental methods like CLIP-seq have enabled the identification of transcriptome-wide RNA-protein interactions for select proteins; however, the time- and resource-intensive nature of these technologies calls for the development of computational methods to complement their predictions. Here we leverage recent, large-scale CLIP-seq experiments to construct a de novo predictor of RNA-protein interactions based on graph neural networks (GNN). We show that the GNN method can not only predict missing links in an RNA-protein network, but also predict the entire complement of targets of previously unassayed proteins, and even reconstruct the entire network of RNA-protein interactions in different conditions based on minimal information. Our results demonstrate the potential of machine learning methods to extract useful information on post-transcriptional regulation from large data sets.


rec-Y2H matrix screening reveals a vast potential for direct protein-protein interactions among RNA binding proteins

10.1101/2020.09.14.296160 ◽

2020 ◽

Author(s):

Benjamin Lang ◽

Jae-Seong Yang ◽

Mireia Garriga-Canut ◽

Silvia Speroni ◽

Maria Gili ◽

...

Keyword(s):

Protein Interactions ◽

Binding Proteins ◽

Rna Binding ◽

Rna Binding Proteins ◽

Interaction Networks ◽

Sequence Motifs ◽

Protein Protein Interactions ◽

Binding Interaction ◽

Transcriptional Gene Regulation

RNA-binding proteins (RBPs) are crucial factors of post-transcriptional gene regulation and their modes of action are intensely investigated. At the center of attention are RNA motifs that guide where RBPs bind. However, sequence motifs are often poor predictors of RBP-RNA interactions in vivo. It is hence believed that many RBPs recognize RNAs as complexes, to increase specificity and regulatory possibilities. To probe the potential for complex formation among RBPs, we assembled a library of 978 mammalian RBPs and used rec-Y2H screening to detect direct interactions between RBPs, sampling >600K interactions. We discovered 1994 new interactions and demonstrate that interacting RBPs bind RNAs adjacently in vivo. We further find that the mRNA binding region and motif preferences of RBPs can deviate, depending on their adjacently binding interaction partners. Finally, we reveal novel RBP interaction networks among major RNA processing steps and show that splicing-impairing RBP mutations observed in cancer rewire spliceosomal interaction networks.


Deep neural networks for interpreting RNA binding protein target preferences

10.1101/518191 ◽

2019 ◽

Cited By ~ 2

Author(s):

Mahsa Ghanbari ◽

Uwe Ohler

Keyword(s):

Binding Sites ◽

Deep Neural Networks ◽

Rna Binding ◽

Rna Binding Proteins ◽

Predictive Score ◽

Sequence Motifs ◽

Multiple Sources ◽

Complex Features ◽

Regulatory Functions

Deep learning has become a powerful paradigm to analyze the binding sites of regulatory factors including RNA-binding proteins (RBPs), owing to its ability to learn complex features from possibly multiple sources of raw data. However, the interpretability of these models, which is crucial to improve our understanding of RBP binding preferences and functions, has not yet been investigated in significant detail. We have designed a multitask and multimodal deep neural network for characterizing in vivo RBP binding preferences. The model incorporates not only the sequence but also the region type of the binding sites as input, which boosts prediction performance. To interpret the model, we quantified the contribution of the input features to the predictive score of each RBP. Learning across multiple RBPs at once, we are able to avoid experimental biases and to identify the RNA sequence motifs and transcript context patterns that are the most important for the predictions of each individual RBP. Our findings are consistent with known motifs and binding behaviors of RBPs and can provide new insights about the regulatory functions of RBPs.


PlncRNADB: A Repository of Plant lncRNAs and lncRNA-RBP Protein Interactions

Current Bioinformatics ◽

10.2174/1574893614666190131161002 ◽

2019 ◽

Vol 14 (7) ◽

pp. 621-627 ◽

Cited By ~ 3

Author(s):

Youhuang Bai ◽

Xiaozhuan Dai ◽

Tiantian Ye ◽

Peijing Zhang ◽

Xu Yan ◽

...

Keyword(s):

Protein Interactions ◽

Binding Proteins ◽

Rna Binding ◽

Rna Binding Proteins ◽

Populus Trichocarpa ◽

Noncoding Rnas ◽

Reference Database ◽

Protein Coding ◽

Arabidopsis Lyrata ◽

User Friendly

Background: Long noncoding RNAs (lncRNAs) are endogenous noncoding RNAs, arbitrarily longer than 200 nucleotides, that play critical roles in diverse biological processes. LncRNAs exist in different genomes ranging from animals to plants. Objective: PlncRNADB is a searchable database of lncRNA sequences and annotation in plants. Methods: We built a pipeline for lncRNA prediction in plants, providing a convenient utility for users to quickly distinguish potential noncoding RNAs from protein-coding transcripts. Results: More than five thousand lncRNAs are collected from four plant species (Arabidopsis thaliana, Arabidopsis lyrata, Populus trichocarpa and Zea mays) in PlncRNADB. Moreover, our database provides the relationship between lncRNAs and various RNA-binding proteins (RBPs), which can be displayed through a user-friendly web interface. Conclusion: PlncRNADB can serve as a reference database to investigate the lncRNAs and their interaction with RNA-binding proteins in plants. The PlncRNADB is freely available at http://bis.zju.edu.cn/PlncRNADB/.


Compendium of Methods to Uncover RNA-Protein Interactions In Vivo

Methods and Protocols ◽

10.3390/mps4010022 ◽

2021 ◽

Vol 4 (1) ◽

pp. 22

Author(s):

Mrinmoyee Majumder ◽

Viswanathan Palanisamy

Keyword(s):

Gene Expression ◽

Protein Interactions ◽

Rna Binding ◽

Rna Binding Proteins ◽

Expression Patterns ◽

Control Of Gene Expression ◽

Regulatory Pathways ◽

Advantages And Disadvantages ◽

Protein Nucleic Acid

Control of gene expression is critical in shaping the genotype and phenotype of pro- and eukaryotic organisms. The gene expression regulatory pathways rely solely on protein–protein and protein–nucleic acid interactions, which determine the fate of the nucleic acids. RNA–protein interactions play a significant role in co- and post-transcriptional regulation to control gene expression. RNA-binding proteins (RBPs) are a diverse group of macromolecules that bind to RNA and play an essential role in RNA biology by regulating pre-mRNA processing, maturation, nuclear transport, stability, and translation. Hence, studies aimed at investigating RNA–protein interactions are essential to advance our knowledge of gene expression patterns associated with health and disease. Here we discuss the long-established and current technologies that are widely used to study RNA–protein interactions in vivo. We also present the advantages and disadvantages of each method discussed in the review.


RNA-Centric Approaches to Profile the RNA–Protein Interaction Landscape on Selected RNAs

Non-Coding RNA ◽

10.3390/ncrna7010011 ◽

2021 ◽

Vol 7 (1) ◽

pp. 11 ◽

Cited By ~ 1

Author(s):

André P. Gerber

Keyword(s):

Mass Spectrometry ◽

Protein Interactions ◽

Regulatory Networks ◽

Rna Binding ◽

Rna Binding Proteins ◽

Protein Complexes ◽

Cell Protein ◽

Transcriptional Regulatory Networks ◽

Technological Advances

RNA–protein interactions frame post-transcriptional regulatory networks and modulate transcription and epigenetics. While the technological advances in RNA sequencing have significantly expanded the repertoire of RNAs, recently developed biochemical approaches combined with sensitive mass-spectrometry have revealed hundreds of previously unrecognized and potentially novel RNA-binding proteins. Nevertheless, a major challenge remains to understand how the thousands of RNA molecules and their interacting proteins assemble and control the fate of each individual RNA in a cell. Here, I review recent methodological advances to approach this problem through systematic identification of proteins that interact with particular RNAs in living cells. Thereby, a specific focus is given to in vivo approaches that involve crosslinking of RNA–protein interactions through ultraviolet irradiation or treatment of cells with chemicals, followed by capture of the RNA under study with antisense-oligonucleotides and identification of bound proteins with mass-spectrometry. Several recent studies defining interactomes of long non-coding RNAs, viral RNAs, as well as mRNAs are highlighted, and short reference is given to recent in-cell protein labeling techniques. These recent experimental improvements could open the door for broader applications and to study the remodeling of RNA–protein complexes upon different environmental cues and in disease.


Evaluation of Power Insulator Detection Efficiency with the Use of Limited Training Dataset

Applied Sciences ◽

10.3390/app10062104 ◽

2020 ◽

Vol 10 (6) ◽

pp. 2104

Author(s):

Michał Tomaszewski ◽

Paweł Michalski ◽

Jakub Osuchowski

Keyword(s):

Neural Network ◽

Neural Networks ◽

Object Detection ◽

Convolutional Neural Network ◽

Deep Neural Networks ◽

Detection Efficiency ◽

Training Data ◽

Training Dataset ◽

Training Set ◽

Convolutional Network

This article presents an analysis of the effectiveness of object detection in digital images with the application of a limited quantity of input. The possibility of using a limited set of learning data was achieved by developing a detailed scenario of the task, which strictly defined the conditions of detector operation in the considered case of a convolutional neural network. The described solution utilizes known architectures of deep neural networks in the process of learning and object detection. The article compares the detection results of the most popular deep neural networks while maintaining a limited training set composed of a specific number of selected images from diagnostic video. The analyzed input material was recorded during an inspection flight conducted along high-voltage lines. The object detector was built for a power insulator. The main contribution of the presented paper is the evidence that a limited training set (in our case, just 60 training frames) could be used for object detection, assuming an outdoor scenario with low variability of environmental conditions. Deciding which network will generate the best result for such a limited training set is not a trivial task. The conducted research suggests that deep neural networks will achieve different levels of effectiveness depending on the amount of training data. The most beneficial results were obtained for two convolutional neural networks: the faster region-convolutional neural network (faster R-CNN) and the region-based fully convolutional network (R-FCN). Faster R-CNN reached the highest AP (average precision) at a level of 0.8 for 60 frames. The R-FCN model gained a worse AP result; however, it can be noted that the number of input samples has a significantly lower influence on the obtained results than in the case of other CNN models, which, in the authors’ assessment, is a desired feature in the case of a limited training set.


LPI-HyADBS: a hybrid framework for lncRNA-protein interaction prediction integrating feature selection and classification

BMC Bioinformatics ◽

10.1186/s12859-021-04485-x ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Liqian Zhou ◽

Qi Duan ◽

Xiongfei Tian ◽

He Xu ◽

Jianxin Tang ◽

...

Keyword(s):

Feature Selection ◽

Protein Interactions ◽

Cross Validation ◽

Rna Binding ◽

Rna Binding Proteins ◽

Biological Information ◽

Hybrid Framework ◽

Protein Interaction Prediction ◽

Single Dataset ◽

Feature Selection Approach

Background: Long noncoding RNAs (lncRNAs) have dense linkages with a plethora of important cellular activities. lncRNAs exert functions by linking with corresponding RNA-binding proteins. Since experimental techniques to detect lncRNA-protein interactions (LPIs) are laborious and time-consuming, a few computational methods have been reported for LPI prediction. However, computation-based LPI identification methods have the following limitations: (1) Most methods were evaluated on a single dataset, and researchers may thus fail to measure their generalization ability. (2) The majority of methods were validated under cross validation on lncRNA-protein pairs and did not investigate the performance under other cross validations, especially cross validation on independent lncRNAs and independent proteins. (3) lncRNAs and proteins have abundant biological information, and how to select informative features needs further investigation. Results: Under a hybrid framework (LPI-HyADBS) integrating feature selection based on AdaBoost, and classification models including deep neural network (DNN), extreme gradient Boost (XGBoost), and SVM with a penalty Coefficient of misclassification (C-SVM), this work focuses on finding new LPIs. First, five datasets are arranged. Each dataset contains lncRNA sequences, protein sequences, and an LPI network. Second, biological features of lncRNAs and proteins are acquired based on Pyfeat. Third, the obtained features of lncRNAs and proteins are selected based on AdaBoost and concatenated to depict each LPI sample. Fourth, DNN, XGBoost, and C-SVM are used to classify lncRNA-protein pairs based on the concatenated features. Finally, a hybrid framework is developed to integrate the classification results from the above three classifiers. LPI-HyADBS is compared to six classical LPI prediction approaches (LPI-SKF, LPI-NRLMF, Capsule-LPI, LPI-CNNCP, LPLNP, and LPBNI) on five datasets under 5-fold cross validations on lncRNAs, proteins, lncRNA-protein pairs, and independent lncRNAs and independent proteins. The results show LPI-HyADBS has the best LPI prediction performance under four different cross validations. In particular, LPI-HyADBS obtains better classification ability than the other six approaches under the constructed independent dataset. Case analyses suggest that there is relevance between ZNF667-AS1 and Q15717. Conclusions: Integrating a feature selection approach based on AdaBoost and three classification techniques including DNN, XGBoost, and C-SVM, this work develops a hybrid framework to identify new linkages between lncRNAs and proteins.
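The difference between cross validation on lncRNA-protein pairs and cross validation on lncRNAs can be sketched as follows. This is only one plausible reading of those terms, not the authors' evaluation code: pair-level folds shuffle individual pairs, while lncRNA-level folds keep all pairs of a given lncRNA together so that held-out lncRNAs are unseen during training. The function names and the toy data (10 lncRNAs by 4 proteins) are hypothetical.

```python
import random


def pair_folds(pairs, k, seed=0):
    """Split individual pairs into k folds; one lncRNA may appear in several folds."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def lncrna_folds(pairs, k, seed=0):
    """Split by lncRNA so all pairs of a held-out lncRNA land in the same fold."""
    lncrnas = sorted({lnc for lnc, _ in pairs})
    random.Random(seed).shuffle(lncrnas)
    fold_of = {lnc: i % k for i, lnc in enumerate(lncrnas)}
    folds = [[] for _ in range(k)]
    for lnc, prot in pairs:
        folds[fold_of[lnc]].append((lnc, prot))
    return folds


def leaked_lncrnas(folds):
    """Count lncRNAs that appear in more than one fold."""
    seen = {}
    for idx, fold in enumerate(folds):
        for lnc, _ in fold:
            seen.setdefault(lnc, set()).add(idx)
    return sum(len(v) > 1 for v in seen.values())


# Hypothetical toy data: every one of 10 lncRNAs paired with 4 proteins.
pairs = [(f"lnc{i}", f"prot{j}") for i in range(10) for j in range(4)]

by_pair = pair_folds(pairs, 5)
by_lncrna = lncrna_folds(pairs, 5)

print("lncRNAs shared across pair-level folds:", leaked_lncrnas(by_pair))
print("lncRNAs shared across lncRNA-level folds:", leaked_lncrnas(by_lncrna))
```

Grouping by protein instead of lncRNA would follow the same pattern with the tuple positions swapped.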
