Accurate Prediction of Genome-wide RNA Secondary Structure Profile Based On Extreme Gradient Boosting

Mapping Intimacies ◽

10.1101/610782 ◽

2019 ◽

Cited By ~ 1

Author(s):

Yaobin Ke ◽

Jiahua Rao ◽

Huiying Zhao ◽

Yutong Lu ◽

Nong Xiao ◽

...

Keyword(s):

Secondary Structure ◽

High Throughput ◽

Rna Secondary Structure ◽

Pearson Correlation ◽

Supplementary Information ◽

Gradient Boosting ◽

Synonymous Mutations ◽

Periodic Distribution ◽

Genome Wide ◽

Extreme Gradient Boosting

Abstract Motivation Many studies have shown that RNA secondary structure plays a vital role in fundamental cellular processes, such as protein synthesis, mRNA processing, mRNA assembly, ribosome function and eukaryotic spliceosomes. Identifying RNA secondary structure is a key step toward understanding the common mechanisms underlying the translation process. Recently, a few experimental methods have been developed to measure genome-wide RNA secondary structure profiles through high-throughput sequencing, and they have been successfully applied to genomes including yeast and human. However, these high-throughput methods usually have low precision and struggle to cover all nucleotides of an RNA because of limited sequencing coverage. Results In this study, we developed a new method for predicting the genome-wide RNA secondary structure profile (TH-GRASP) from RNA sequence based on eXtreme Gradient Boosting (XGBoost). The method achieves areas under the receiver operating characteristic curve (AUC) greater than 0.9 on three different datasets, and an AUC of 0.892 in an independent test on the recently released Zika virus RNA dataset. These AUCs represent a consistent increase of >6% over the recently developed CROSS method, which is trained with a shallow neural network. Further analysis of the 1000 Genomes Project data showed that our predicted unpaired probabilities at mutation sites are highly correlated with the minor allele frequencies (MAF) of synonymous mutations, non-synonymous mutations, and mutations in the 3’ and 5’ UTRs, with Pearson correlation coefficients (PCCs) all above 0.8. These PCCs are consistently higher than those generated by the RNAplfold method. Moreover, an investigation over all human mRNAs indicated a periodic distribution of the predicted unpaired probability on codons, and a decrease in paired probability at the boundaries with the 5’ and 3’ untranslated regions.
These results highlight that TH-GRASP can effectively remove experimental noise and can make predictions for nucleotides with low or no coverage by fitting high-throughput genomic data for RNA secondary structure profiles. They also suggest that building models on high-throughput experimental data might be a future direction for replacing analytical methods. Availability TH-GRASP is available for academic use at https://github.com/sysu-yanglab/TH-GRASP. Supplementary information Supplementary data are available online.
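An AUC such as those quoted in this abstract can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise definition, not code from TH-GRASP; the function name `roc_auc`, the label convention and the toy scores are all assumptions made for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: label 1 marks the positive class, label 0 the negative class.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
```

In this toy case 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9. The quadratic loop is chosen for readability; a rank-based formulation would be preferable on genome-scale data.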


Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting

Bioinformatics ◽

10.1093/bioinformatics/btaa534 ◽

2020 ◽

Vol 36 (17) ◽

pp. 4576-4582

Author(s):

Yaobin Ke ◽

Jiahua Rao ◽

Huiying Zhao ◽

Yutong Lu ◽

Nong Xiao ◽

...

Keyword(s):

Secondary Structure ◽

High Throughput ◽

Rna Secondary Structure ◽

High Throughput Sequencing ◽

Supplementary Information ◽

Gradient Boosting ◽

Synonymous Mutations ◽

Periodic Distribution ◽

Genome Wide ◽

Extreme Gradient Boosting

Abstract Motivation RNA secondary structure plays a vital role in fundamental cellular processes, and identifying RNA secondary structure is a key step toward understanding RNA functions. Recently, a few experimental methods have been developed to profile genome-wide RNA secondary structure, i.e. the pairing probability of each nucleotide, through high-throughput sequencing techniques. However, these high-throughput methods have low precision and cannot cover all nucleotides because of limited sequencing coverage. Results Here, we have developed a new method for predicting the genome-wide RNA secondary structure profile from RNA sequence based on the extreme gradient boosting technique. The method achieves areas under the receiver operating characteristic curve (AUC) >0.9 on three different datasets, and an AUC of 0.888 in another independent test on the recently released Zika virus data. These AUCs are consistently >5% greater than those of the recently developed CROSS method, which is based on a shallow neural network. Further analysis of the 1000 Genomes Project data showed that our predicted unpaired probabilities are highly correlated (>0.8) with the minor allele frequencies at synonymous mutations, non-synonymous mutations, and mutations in untranslated regions, and that these correlations are higher than those generated by RNAplfold. Moreover, the prediction over all human mRNAs is consistent with the previous observation that there is a periodic distribution of unpaired probability on codons. The accurate predictions by our method indicate that such a model trained on genome-wide experimental data might be an alternative to analytical methods. Availability and implementation GRASP is available for academic use at https://github.com/sysu-yanglab/GRASP. Supplementary information Supplementary data are available online.
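One simple way to look for the codon periodicity mentioned in this abstract is to average a per-nucleotide unpaired probability separately over the first, second and third codon positions. The sketch below illustrates that idea only and is not taken from GRASP; the function name `mean_by_codon_position` and the toy profile are assumptions, and the input is assumed to be already trimmed to an in-frame coding region.

```python
def mean_by_codon_position(cds_profile):
    """Average a per-nucleotide profile over codon positions 1, 2 and 3.

    cds_profile: list of floats covering only the coding region, starting
    at the first nucleotide of the start codon (an assumption of this sketch).
    """
    sums = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for i, value in enumerate(cds_profile):
        phase = i % 3  # 0, 1, 2 correspond to codon positions 1, 2, 3
        sums[phase] += value
        counts[phase] += 1
    # Positions with no data yield NaN rather than raising an error.
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Hypothetical unpaired probabilities for three consecutive codons.
profile = [0.2, 0.5, 0.8, 0.3, 0.4, 0.9, 0.1, 0.6, 0.7]
means = mean_by_codon_position(profile)
```

If the three averages differ clearly and the pattern repeats across transcripts, the profile has a three-nucleotide period.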


Improved RNA secondary structure modeling through high-throughput crowdsourced RNA design initiatives

10.26226/morressier.5ebd45acffea6f735881af01 ◽

2020 ◽

Author(s):

Hannah Wayment-Steele

Keyword(s):

Secondary Structure ◽

High Throughput ◽

Rna Secondary Structure ◽

Structure Modeling ◽

Rna Design


Faculty Opinions recommendation of Genome-wide measurement of RNA secondary structure in yeast.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.5187965.7487056 ◽

2010 ◽

Author(s):

Anuj Kumar ◽

Cole Johnson

Keyword(s):

Secondary Structure ◽

Rna Secondary Structure ◽

Genome Wide


Classification of Hot Spots using XGBoost and LightGBM Algorithms

International Journal of Engineering and Advanced Technology - Regular Issue ◽

10.35940/ijeat.e9459.069520 ◽

2020 ◽

Vol 9 (5) ◽

pp. 722-724

Keyword(s):

Computational Methods ◽

Protein Interactions ◽

Hot Spots ◽

Cell Metabolism ◽

Pearson Correlation ◽

Classification Performance ◽

Gradient Boosting ◽

Support Vector ◽

Extreme Gradient Boosting ◽

Hub Proteins

Protein-protein interactions (PPIs) play a significant role in biological functions such as cell metabolism, immune response and signal transduction. Hot spots are a small fraction of interface residues that provide substantial binding energy in PPIs. Identifying hot spots is therefore important for discovering and analyzing molecular medicines and diseases. The current strategy, alanine scanning, is not suited to large-scale applications because the technique is costly and tedious. The existing computational methods have poor classification performance and prediction accuracy. They are concerned with the topological structure and gene expression of hub proteins. The proposed system focuses on hot spots of hub proteins by eliminating redundant and highly correlated features using the Pearson correlation coefficient and Support Vector Machine based feature elimination. Extreme Gradient Boosting and LightGBM algorithms are used to ensemble a set of weak classifiers into a strong classifier. The proposed system shows better accuracy than the existing computational methods. The model can also be used to predict accurate molecular inhibitors for specific PPIs.
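Removing highly correlated features with the Pearson correlation coefficient can be sketched as a greedy filter: keep a feature only if its absolute correlation with every feature already kept stays below a threshold. The code below is an illustrative sketch under that assumption, not the procedure from the paper; the 0.9 threshold, the function names and the feature names are hypothetical, and the Support Vector Machine based elimination step is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(features, threshold=0.9):
    """Greedily keep features whose |r| with every kept feature is <= threshold.

    features: dict mapping a feature name to its list of values.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature table: the second column is a rescaled copy of the first.
features = {
    "accessibility": [1.0, 2.0, 3.0, 4.0, 5.0],
    "accessibility_scaled": [2.0, 4.0, 6.0, 8.0, 10.0],
    "conservation": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_correlated(features)
```

The result depends on the order in which features are visited, so in practice features are often sorted by some relevance score before filtering.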


SubMito-XGBoost: predicting protein submitochondrial localization by fusing multiple feature information and eXtreme gradient boosting

Bioinformatics ◽

10.1093/bioinformatics/btz734 ◽

2019 ◽

Vol 36 (4) ◽

pp. 1074-1081 ◽

Cited By ~ 25

Author(s):

Bin Yu ◽

Wenying Qiu ◽

Cheng Chen ◽

Anjun Ma ◽

Jing Jiang ◽

...

Keyword(s):

Drug Design ◽

Protein Function ◽

Predictive Performance ◽

Supplementary Information ◽

Gradient Boosting ◽

Source Codes ◽

Extreme Gradient Boosting ◽

Auto Correlation Function ◽

Abnormal Mitochondria ◽

Leave One Out

Abstract Motivation Mitochondria are an essential organelle in most eukaryotes. They not only play an important role in energy metabolism but also take part in many critical cytopathological processes. Abnormal mitochondria can trigger a series of human diseases, such as Parkinson's disease, multifactor disorder and Type-II diabetes. Protein submitochondrial localization enables the understanding of protein function in studying disease pathogenesis and drug design. Results We proposed a new method, SubMito-XGBoost, for protein submitochondrial localization prediction. Three steps are included: (i) the g-gap dipeptide composition (g-gap DC), pseudo-amino acid composition (PseAAC), auto-correlation function (ACF) and Bi-gram position-specific scoring matrix (Bi-gram PSSM) are employed to extract protein sequence features, (ii) Synthetic Minority Oversampling Technique (SMOTE) is used to balance samples, and the ReliefF algorithm is applied for feature selection and (iii) the obtained feature vectors are fed into XGBoost to predict protein submitochondrial locations. SubMito-XGBoost has obtained satisfactory prediction results by the leave-one-out-cross-validation (LOOCV) compared with existing methods. The prediction accuracies of the SubMito-XGBoost method on the two training datasets M317 and M983 were 97.7% and 98.9%, which are 2.8–12.5% and 3.8–9.9% higher than other methods, respectively. The prediction accuracy of the independent test set M495 was 94.8%, which is significantly better than the existing studies. The proposed method also achieves satisfactory predictive performance on plant and non-plant protein submitochondrial datasets. SubMito-XGBoost also plays an important role in new drug design for the treatment of related diseases. Availability and implementation The source codes and data are publicly available at https://github.com/QUST-AIBBDRC/SubMito-XGBoost/. Supplementary information Supplementary data are available at Bioinformatics online.
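Of the sequence features listed in step (i), the g-gap dipeptide composition is the simplest to state: it is the frequency of each ordered pair of amino acids separated by g intervening residues, giving a 400-dimensional vector. The sketch below illustrates that definition and is not taken from the SubMito-XGBoost source code; the function name, the handling of non-standard residues and the example sequence are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence, g=0):
    """Return the 400 frequencies of residue pairs separated by g residues.

    g=0 gives the ordinary (adjacent) dipeptide composition.
    Pairs containing a non-standard residue (e.g. X) are skipped, which
    is a choice made for this sketch.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = 0
    for i in range(len(sequence) - g - 1):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(pairs)
    return [counts[p] / total for p in pairs]


# Hypothetical toy sequence; with g=1 only the pairs A_A and C_C occur.
vector = g_gap_dipeptide_composition("ACACAC", g=1)
```

Vectors computed for several values of g can be concatenated with the other descriptors before balancing, feature selection and classification.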


incaRNAfbinv 2.0: a webserver and software with motif control for fragment-based design of RNAs

Bioinformatics ◽

10.1093/bioinformatics/btaa039 ◽

2020 ◽

Vol 36 (9) ◽

pp. 2920-2922

Author(s):

Matan Drory Retwitzer ◽

Vladimir Reinharz ◽

Alexander Churkin ◽

Yann Ponty ◽

Jérôme Waldispühl ◽

...

Keyword(s):

Secondary Structure ◽

Rna Secondary Structure ◽

Rna Folding ◽

Practical Interest ◽

Supplementary Information ◽

Command Line ◽

Design Tools ◽

Additional Information ◽

Non Coding Rna ◽

Rna Design

Abstract Summary RNA design has conceptually evolved from the inverse RNA folding problem. In the classical inverse RNA problem, the user inputs an RNA secondary structure and receives an output RNA sequence that folds into it. Although modern RNA design methods are based on the same principle, a finer control over the resulting sequences is sought. As an important example, a substantial number of non-coding RNA families show high preservation in specific regions while being more flexible in others, and this information should be utilized in the design. By using the additional information, RNA design tools can help solve problems of practical interest in the growing fields of synthetic biology and nanotechnology. incaRNAfbinv 2.0 utilizes a fragment-based approach, enabling control of specific RNA secondary structure motifs. The new version allows significantly more control over the general RNA shape, and also allows specific restrictions to be expressed for each motif separately, in addition to other advanced features. Availability and implementation incaRNAfbinv 2.0 is available through a standalone package and a web-server at https://www.cs.bgu.ac.il/incaRNAfbinv. Source code, command-line and GUI wrappers can be found at https://github.com/matandro/RNAsfbinv. Supplementary information Supplementary data are available at Bioinformatics online.


A comprehensive database of high-throughput sequencing-based RNA secondary structure probing data (Structure Surfer)

BMC Bioinformatics ◽

10.1186/s12859-016-1071-0 ◽

2016 ◽

Vol 17 (1) ◽

Cited By ~ 15

Author(s):

Nathan D. Berkowitz ◽

Ian M. Silverman ◽

Daniel M. Childress ◽

Hilal Kazan ◽

Li-San Wang ◽

...

Keyword(s):

Data Structure ◽

Secondary Structure ◽

High Throughput ◽

Rna Secondary Structure ◽

High Throughput Sequencing ◽

Comprehensive Database ◽

Structure Probing


StructureFold: genome-wide RNA secondary structure mapping and reconstruction in vivo

Bioinformatics ◽

10.1093/bioinformatics/btv213 ◽

2015 ◽

Vol 31 (16) ◽

pp. 2668-2675 ◽

Cited By ~ 36

Author(s):

Yin Tang ◽

Emil Bouvier ◽

Chun Kit Kwok ◽

Yiliang Ding ◽

Anton Nekrutenko ◽

...

Keyword(s):

Secondary Structure ◽

Rna Secondary Structure ◽

Structure Mapping ◽

Genome Wide


Machine learning techniques to predict daily rainfall amount

Journal Of Big Data ◽

10.1186/s40537-021-00545-4 ◽

2021 ◽

Vol 8 (1) ◽

Author(s):

Chalachew Muluken Liyew ◽

Haileyesus Amsaya Melese

Keyword(s):

Machine Learning ◽

Pearson Correlation ◽

Daily Rainfall ◽

Learning Model ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Correlation Technique ◽

Learning Techniques ◽

Machine Learning Model ◽

Extreme Gradient Boosting

Abstract Predicting the amount of daily rainfall improves agricultural productivity and secures food and water supply to keep citizens healthy. To predict rainfall, several studies have applied data mining and machine learning techniques to environmental datasets from different countries. An erratic rainfall distribution in the country affects the agriculture on which the country's economy depends. Wise use of rainwater should be planned and practiced in the country to minimize the problems of drought and flood. The main objective of this study is to identify the relevant atmospheric features that cause rainfall and to predict the intensity of daily rainfall using machine learning techniques. The Pearson correlation technique was used to select relevant environmental variables, which were used as input for the machine learning model. The dataset was collected from the local meteorological office at Bahir Dar City, Ethiopia, to measure the performance of three machine learning techniques (Multivariate Linear Regression, Random Forest, and Extreme Gradient Boost). Root mean squared error and mean absolute error were used to measure the performance of the machine learning models. The results of the study revealed that the Extreme Gradient Boosting machine learning algorithm performed better than the others.
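The two error measures named in this abstract differ in how they weight large misses: root mean squared error squares each residual before averaging, while mean absolute error averages the residual magnitudes directly. A minimal sketch of both follows; the function names and the rainfall values are hypothetical and are not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error of two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily rainfall in millimetres: observed versus forecast.
observed = [0.0, 2.0, 5.0, 10.0]
forecast = [1.0, 2.0, 3.0, 14.0]
rmse_value = rmse(observed, forecast)
mae_value = mae(observed, forecast)
```

Here the single 4 mm miss dominates the RMSE, which is why it comes out larger than the MAE of 1.75 mm; reporting both gives a sense of how much of the error comes from a few large misses.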


PhyloFold: Precise and Swift Prediction of RNA Secondary Structures to Incorporate Phylogeny among Homologs

10.1101/2020.03.05.975797 ◽

2020 ◽

Author(s):

Masaki Tagashira

Keyword(s):

Secondary Structure ◽

Rna Secondary Structure ◽

Prediction Accuracy ◽

Structural Alignment ◽

Source Code ◽

Secondary Structures ◽

Supplementary Information ◽

Supplementary Data ◽

Link Type ◽

Structural Alignments

Abstract Motivation The simultaneous consideration of sequence alignment and RNA secondary structure, or structural alignment, is known to help predict more accurate secondary structures of homologs. However, the consideration is heavy and can be done only roughly to decompose structural alignments. Results The PhyloFold method, which predicts secondary structures of homologs considering likely pairwise structural alignments, was developed in this study. The method shows the best prediction accuracy while demanding comparable running time compared to conventional methods. Availability The source code of the programs implemented in this study is available on https://github.com/heartsh/phylofold and https://github.com/heartsh/phyloalifold. Contact [email protected]. Supplementary information Supplementary data are available.
