RPITER: A Hierarchical Deep Learning Framework for ncRNA–Protein Interaction Prediction

2019 ◽  
Vol 20 (5) ◽  
pp. 1070 ◽  
Author(s):  
Cheng Peng ◽  
Siyu Han ◽  
Hui Zhang ◽  
Ying Li

Non-coding RNAs (ncRNAs) play crucial roles in multiple fundamental biological processes, such as post-transcriptional gene regulation, and are implicated in many complex human diseases. Most ncRNAs function by interacting with corresponding RNA-binding proteins, so research on ncRNA–protein interaction is key to understanding ncRNA function. However, the experimental techniques for identifying RNA–protein interactions (RPIs) are still expensive and time-consuming. Owing to the complex molecular mechanism of ncRNA–protein interaction and the lack of conservation of ncRNAs, especially long ncRNAs (lncRNAs), predicting ncRNA–protein interaction remains a challenge. Deep learning-based models have become the state of the art in a range of biological sequence analysis problems because of their strong feature-learning ability. In this study, we proposed a hierarchical deep learning framework, RPITER, to predict RNA–protein interaction. For sequence coding, we improved the conjoint triad feature (CTF) coding method by adding more primary sequence information and sequence structure information. For model design, RPITER employed two basic neural network architectures: the convolutional neural network (CNN) and the stacked auto-encoder (SAE). Comprehensive experiments were performed on five benchmark datasets from the PDB and NPInter databases to analyze and compare the performance of different sequence coding methods and prediction models. We found that the CNN and SAE architectures have powerful fitting abilities for the k-mer features of RNA and protein sequences. The improved CTF coding method showed a performance gain over the original CTF method. Moreover, RPITER performed well in predicting RNA–protein interaction (RPI) and could outperform most of the previous methods.
On five widely used RPI datasets (RPI369, RPI488, RPI1807, RPI2241 and NPInter), RPITER obtained AUC values of 0.821, 0.911, 0.990, 0.957 and 0.985, respectively. The proposed RPITER could be a complementary method for predicting RPIs and constructing RPI networks, which would help advance biological research on ncRNAs and lncRNAs.
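The conjoint triad feature mentioned in this abstract can be illustrated with a minimal sketch. It covers only the basic, unimproved CTF for a protein sequence. The seven-class amino-acid grouping and the frequency normalization below are assumptions (the abstract specifies neither), and the function name conjoint_triad is hypothetical.

```python
from itertools import product

# Seven-class amino-acid grouping commonly used for conjoint triad features.
# ASSUMPTION: the abstract does not list the classes it uses.
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}

# All 7^3 = 343 possible triads over the reduced alphabet, in a fixed order.
TRIADS = ["".join(p) for p in product("0123456", repeat=3)]


def conjoint_triad(protein):
    """Return a 343-dimensional triad-frequency vector for a protein string.

    Residues outside the 20 standard amino acids are skipped.
    Normalization by window count is a simplification chosen for this sketch.
    """
    reduced = [AA_TO_CLASS[aa] for aa in protein.upper() if aa in AA_TO_CLASS]
    counts = dict.fromkeys(TRIADS, 0)
    for i in range(len(reduced) - 2):
        counts["".join(reduced[i:i + 3])] += 1
    total = max(len(reduced) - 2, 1)  # avoid division by zero on short input
    return [counts[t] / total for t in TRIADS]
```

A vector like this, concatenated with an analogous k-mer vector for the RNA, is the kind of fixed-length input that a CNN or stacked auto-encoder could consume.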

2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Xiongfei Tian ◽  
Ling Shen ◽  
Zhenwu Wang ◽  
Liqian Zhou ◽  
Lihong Peng

Abstract Long noncoding RNAs (lncRNAs) regulate many biological processes by interacting with corresponding RNA-binding proteins. Identifying lncRNA–protein interactions (LPIs) is important for characterizing the biological functions and mechanisms of lncRNAs. Existing computational methods have been effectively applied to LPI prediction. However, the majority of them were evaluated on only one LPI dataset, which results in prediction bias. More importantly, some models cannot discover possible LPIs for new lncRNAs (or proteins). In addition, prediction performance remains limited. To address these problems, we develop a Deep Forest-based LPI prediction method (LPIDF). First, five LPI datasets are obtained and the corresponding sequence information of lncRNAs and proteins is collected. Second, features of lncRNAs and proteins are constructed based on four-nucleotide composition and BioSeq2vec with an encoder-decoder structure, respectively. Finally, a deep forest model with a cascade forest structure is developed to find new LPIs. We compare LPIDF with four classical association prediction models based on three fivefold cross validations on lncRNAs, proteins, and LPIs. LPIDF obtains better average AUCs of 0.9012, 0.6937 and 0.9457, and the best average AUPRs of 0.9022, 0.6860, and 0.9382, respectively, for the three CVs, significantly outperforming other methods. The results suggest that the lncRNA FTX may interact with the protein P35637, a prediction that needs further validation.
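The three cross-validation settings named in this abstract can be sketched as grouped fold assignment. The sketch assumes that "cross validation on lncRNAs" means every pair involving a held-out lncRNA goes to the same fold, and likewise for proteins. That reading, the function name grouped_folds and the toy identifiers are assumptions, not details from the abstract.

```python
import random


def grouped_folds(pairs, key_index, k=5, seed=0):
    """Split (lncRNA, protein) pairs into k folds by entity.

    key_index = 0 groups by lncRNA, key_index = 1 groups by protein.
    All pairs sharing the chosen entity land in the same fold, so a test
    fold only contains entities never seen in the training folds.
    """
    entities = sorted({pair[key_index] for pair in pairs})
    rng = random.Random(seed)
    rng.shuffle(entities)
    fold_of = {entity: i % k for i, entity in enumerate(entities)}
    folds = [[] for _ in range(k)]
    for pair in pairs:
        folds[fold_of[pair[key_index]]].append(pair)
    return folds


# Toy usage with made-up identifiers: 10 lncRNAs x 3 proteins.
toy_pairs = [(f"lnc{i}", f"prot{j}") for i in range(10) for j in range(3)]
lncrna_folds = grouped_folds(toy_pairs, key_index=0)
```

Cross validation on the pairs themselves would instead shuffle the pairs directly. That is an easier setting, which is consistent with the protein-level AUC reported above being the lowest of the three.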


2018 ◽  
Author(s):  
Kaiming Zhang ◽  
Xiaoyong Pan ◽  
Yang Yang ◽  
Hong-Bin Shen

Abstract Circular RNAs (circRNAs), with their crucial roles in gene regulation and disease development, have become a rising star in the RNA world. Many previous wet-lab studies focused on the interaction mechanisms between circRNAs and RNA-binding proteins (RBPs), as knowledge of circRNA-RBP association is very important for understanding the functions of circRNAs. Recently, abundant CLIP-Seq experimental data have made the large-scale identification and analysis of circRNA-RBP interactions possible, yet no computational tool based on machine learning has been developed. We present a new deep learning-based method, CRIP (CircRNAs Interact with Proteins), for the prediction of RBP binding sites on circRNAs, using only the RNA sequences. In order to fully exploit the sequence information, we propose a stacked codon-based encoding scheme and a hybrid deep learning architecture, in which a convolutional neural network (CNN) learns high-level abstract features and a recurrent neural network (RNN) learns long-range dependencies in the sequences. We construct 37 datasets including sequence fragments of binding sites on circRNAs, and each set corresponds to one RBP. The experimental results show that the new encoding scheme is superior to the existing feature representation methods for RNA sequences, and the hybrid network outperforms conventional classifiers by a large margin, where both the CNN and RNN components contribute to the performance improvement. To the best of our knowledge, CRIP is the first machine learning-based tool specialized in the prediction of circRNA-RBP interactions, and it is expected to play an important role in large-scale functional analysis of circRNAs.
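The abstract does not spell out the stacked codon-based encoding, so the sketch below shows only one plausible reading: slide a window of three nucleotides along the sequence with stride 1 and one-hot encode each triplet. The 64-codon alphabet, the stride and the name stacked_codon_one_hot are assumptions and are not necessarily CRIP's exact scheme.

```python
from itertools import product

# All 64 nucleotide triplets over the RNA alphabet, in a fixed order.
CODONS = ["".join(p) for p in product("ACGU", repeat=3)]
CODON_INDEX = {codon: i for i, codon in enumerate(CODONS)}


def stacked_codon_one_hot(rna):
    """Encode an RNA string as a list of one-hot rows, one per overlapping triplet.

    ASSUMPTION: stride-1 overlapping triplets and a 64-way one-hot code.
    Triplets containing unknown symbols (e.g. N) become all-zero rows.
    """
    seq = rna.upper().replace("T", "U")
    rows = []
    for i in range(len(seq) - 2):
        row = [0] * len(CODONS)
        idx = CODON_INDEX.get(seq[i:i + 3])
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows
```

The resulting matrix has one row per sequence position, which is the shape a 1D convolution followed by a recurrent layer would typically take as input.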


2021 ◽  
Author(s):  
Lei Deng ◽  
Wenjuan Nie ◽  
Jiaojiao Zhao ◽  
Jingpu Zhang

Abstract Background: Viral infections and diseases are caused by various viruses and involve protein-protein interactions (PPIs) between virus and host, which are a threat to human health. Studying virus-host PPIs is beneficial for understanding the mechanism of viral infection and developing new treatment drugs. Although several computational methods for predicting virus-host PPIs have been proposed, most of them rely on traditional machine learning algorithms, which makes hidden high-level features difficult to extract. Results: We proposed a novel hybrid deep learning framework combining four CNN layers and an LSTM to predict virus-host PPIs using only protein sequence information. The CNN can extract nonlinear position-related features of the protein sequence, and the LSTM can capture long-term dependencies. L1-regularized logistic regression is applied to eliminate noise and redundant information. Our model achieved the best performance on the benchmark dataset and an independent set compared with other existing methods. Conclusion: Our hybrid deep neural network is useful for predicting virus-host PPIs from protein sequence alone and achieved the best prediction performance among the existing methods compared, which makes it promising for virus-host PPI prediction.


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
Dipendra Jha ◽  
Vishu Gupta ◽  
Logan Ward ◽  
Zijiang Yang ◽  
Christopher Wolverton ◽  
...  

Abstract The application of machine learning (ML) techniques in materials science has attracted significant attention in recent years, due to their impressive ability to efficiently extract data-driven linkages from various input materials representations to their output properties. While the application of traditional ML techniques has become quite ubiquitous, there have been limited applications of more advanced deep learning (DL) techniques, primarily because big materials datasets are relatively rare. Given the demonstrated potential and advantages of DL and the increasing availability of big materials datasets, it is attractive to go for deeper neural networks in a bid to boost model performance, but in reality, it leads to performance degradation due to the vanishing gradient problem. In this paper, we address the question of how to enable deeper learning for cases where big materials data is available. Here, we present a general deep learning framework based on Individual Residual learning (IRNet) composed of very deep neural networks that can work with any vector-based materials representation as input to build accurate property prediction models. We find that the proposed IRNet models can not only successfully alleviate the vanishing gradient problem and enable deeper learning, but also lead to significantly (up to 47%) better model accuracy as compared to plain deep neural networks and traditional ML techniques for a given input materials representation in the presence of big data.
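The residual idea behind this abstract can be sketched in a few lines of plain Python. The sketch assumes that "individual residual learning" means each fully connected layer carries its own shortcut connection, so a layer outputs x + f(x). The layer shapes, the omission of batch normalization and the function names are simplifying assumptions, not the published IRNet architecture.

```python
def dense_relu(x, weights, bias):
    """Fully connected layer followed by ReLU; weights is a list of rows."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def residual_layer(x, weights, bias):
    """One layer with its own shortcut: output = x + relu(W x + b).

    The weight matrix must be square so input and output shapes match.
    """
    fx = dense_relu(x, weights, bias)
    return [xi + fi for xi, fi in zip(x, fx)]


def irnet_like_forward(x, layers):
    """Apply a stack of residual layers; layers is a list of (weights, bias)."""
    for weights, bias in layers:
        x = residual_layer(x, weights, bias)
    return x
```

With all-zero weights, a stack of any depth returns its input unchanged. This illustrates why the shortcut keeps a direct path for the signal and its gradient, whereas a plain stack with the same weights would collapse the input to zeros.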


2021 ◽  
Vol 9 (Suppl 3) ◽  
pp. A874-A874
Author(s):  
David Soong ◽  
Anantharaman Muthuswamy ◽  
Clifton Drew ◽  
...  

Background: Recent advances in machine learning and digital pathology have enabled a variety of applications including predicting tumor grade and genetic subtypes, quantifying the tumor microenvironment (TME), and identifying prognostic morphological features from H&E whole slide images (WSI). These supervised deep learning models require large quantities of images manually annotated with cellular- and tissue-level details by pathologists, which limits scale and generalizability across cancer types and imaging platforms. Here we propose a semi-supervised deep learning framework that automatically annotates biologically relevant image content from hundreds of solid tumor WSI with minimal pathologist intervention, thus improving quality and speed of analytical workflows aimed at deriving clinically relevant features. Methods: The dataset consisted of >200 H&E images across >10 solid tumor types (e.g. breast, lung, colorectal, cervical, and urothelial cancers) from advanced disease patients. WSI were first partitioned into small tiles of 128 μm for feature extraction using a 50-layer convolutional neural network pre-trained on the ImageNet database. Dimensionality reduction and unsupervised clustering were applied to the resultant embeddings and image clusters were identified with enriched histological and morphological characteristics. A random subset of representative tiles (<0.5% of whole slide tissue areas) from these distinct image clusters was manually reviewed by pathologists and assigned to eight histological and morphological categories: tumor, stroma/connective tissue, necrotic cells, lymphocytes, red blood cells, white blood cells, normal tissue and glass/background.
This dataset allowed the development of a multi-label deep neural network to segment morphologically distinct regions and detect/quantify histopathological features in WSI. Results: As representative image tiles within each image cluster were morphologically similar, expert pathologists were able to assign annotations to multiple images in parallel, effectively at 150 images/hour. Five-fold cross-validation showed average prediction accuracy of 0.93 [0.8–1.0] and area under the curve of 0.90 [0.8–1.0] over the eight image categories. As an extension of this classifier framework, all whole slide H&E images were segmented and composite lymphocyte, stromal, and necrotic content per patient tumor was derived and correlated with estimates by pathologists (p<0.05). Conclusions: A novel and scalable deep learning framework for annotating and learning H&E features from a large unlabeled WSI dataset across tumor types was developed. This automated approach accurately identified distinct histomorphological features, with significantly reduced labeling time and effort required for pathologists. Further, this classifier framework was extended to annotate regions enriched in lymphocytes, stromal, and necrotic cells – important TME contexture with clinical relevance for patient prognosis and treatment decisions.
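The tiling step described in the Methods can be sketched as a grid computation. The sketch assumes the slide resolution is known in microns per pixel (mpp), that tiles do not overlap, and that partial tiles at the edges are dropped. None of these details are stated in the abstract, and the function name tile_grid and the example dimensions are hypothetical.

```python
def tile_grid(width_px, height_px, mpp, tile_um=128.0):
    """Return (x, y, size) tuples for non-overlapping square tiles.

    width_px, height_px: slide dimensions in pixels at the chosen level.
    mpp: microns per pixel at that level (ASSUMED to be known).
    tile_um: physical tile edge length in microns (128 in the abstract).
    Partial tiles at the right and bottom edges are dropped.
    """
    tile_px = int(round(tile_um / mpp))
    if tile_px <= 0:
        raise ValueError("tile size must be at least one pixel")
    return [
        (x, y, tile_px)
        for y in range(0, height_px - tile_px + 1, tile_px)
        for x in range(0, width_px - tile_px + 1, tile_px)
    ]


# Example: a 1024 x 512 pixel region at 0.5 microns per pixel
# gives 256-pixel tiles arranged in a 4 x 2 grid.
tiles = tile_grid(1024, 512, mpp=0.5)
```

Each tile would then be cropped, passed through the pre-trained network to obtain an embedding, and clustered. Those later steps depend on imaging and deep learning libraries and are not sketched here.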


Author(s):  
Tahani Aljohani ◽  
Alexandra I. Cristea

Massive Open Online Courses (MOOCs) have become universal learning resources, and the COVID-19 pandemic is rendering these platforms even more necessary. In this paper, we seek to improve Learner Profiling (LP), i.e. estimating the demographic characteristics of learners in MOOC platforms. We have focused on examining models that show promise elsewhere but have never been examined in the LP area (deep learning models), based on effective textual representations. As the first LP characteristic, we predict the employment status of learners. We compare sequential and parallel ensemble deep learning architectures based on Convolutional Neural Networks and Recurrent Neural Networks, obtaining a high average accuracy of 96.3% for our best method. Next, we predict the gender of learners based on syntactic knowledge from the text. We compare different tree-structured Long-Short-Term Memory models (as state-of-the-art candidates) and provide our novel version of a Bi-directional composition function for existing architectures. In addition, we evaluate 18 different combinations of word-level encoding and sentence-level encoding functions. Based on these results, we show that our Bi-directional model outperforms all other models and that the highest accuracy among our models is obtained by the combination of a FeedForward Neural Network and the Stack-augmented Parser-Interpreter Neural Network (82.60% prediction accuracy). We argue that the prediction models recommended for both demographic characteristics examined in this study can achieve high accuracy. This is also the first time a sound methodological approach toward improving accuracy for learner demographics classification on MOOCs has been proposed.


2019 ◽  
Vol 10 (1) ◽  
Author(s):  
Jordy Homing Lam ◽  
Yu Li ◽  
Lizhe Zhu ◽  
Ramzan Umarov ◽  
Hanlun Jiang ◽  
...  

Abstract Protein-RNA interaction plays important roles in post-transcriptional regulation. However, the task of predicting these interactions given a protein structure is difficult. Here we show that, by leveraging a deep learning model NucleicNet, attributes such as binding preference of RNA backbone constituents and different bases can be predicted from local physicochemical characteristics of protein structure surface. On a diverse set of challenging RNA-binding proteins, including Fem-3-binding-factor 2, Argonaute 2 and Ribonuclease III, NucleicNet can accurately recover interaction modes discovered by structural biology experiments. Furthermore, we show that, without seeing any in vitro or in vivo assay data, NucleicNet can still achieve consistency with experiments, including RNAcompete, Immunoprecipitation Assay, and siRNA Knockdown Benchmark. NucleicNet can thus serve to provide quantitative fitness of RNA sequences for given binding pockets or to predict potential binding pockets and binding RNAs for previously unknown RNA binding proteins.

