A Bi-LSTM Based Ensemble Algorithm for Prediction of Protein Secondary Structure

Hailong Hu; Zhong Li; Arne Elofsson; Shangxin Xie

doi:10.3390/app9173538

Applied Sciences ◽

10.3390/app9173538 ◽

2019 ◽

Vol 9 (17) ◽

pp. 3538 ◽

Cited By ~ 3

Author(s):

Hailong Hu ◽

Zhong Li ◽

Arne Elofsson ◽

Shangxin Xie

Keyword(s):

Secondary Structure ◽

Cross Validation ◽

State Of The Art ◽

Protein Secondary Structure ◽

Ensemble Methods ◽

Ensemble Model ◽

Training Process ◽

Independent Test ◽

Test Sets ◽

Fold Cross Validation

The prediction of protein secondary structure continues to be an active area of research in bioinformatics. In this paper, a Bi-LSTM based ensemble model is developed for the prediction of protein secondary structure. The ensemble model, which uses a dual loss function, consists of five sub-models that are joined by a final Bi-LSTM layer. In contrast to existing ensemble methods, which generally train each sub-model separately and then join them, this ensemble model and its sub-models can be trained simultaneously, and the performance of each model can be observed and compared during training. Three independent test sets (data1199, the 513-protein Cuff & Barton set (CB513), and 203 proteins from the Critical Assessment of protein Structure Prediction (CASP203)) are employed to test the method. On average, the ensemble model achieved 84.3% Q3 accuracy and an 81.9% segment overlap measure (SOV) score using 10-fold cross-validation. This is an improvement of up to 1% over some state-of-the-art methods for protein secondary structure prediction.
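As a rough illustration of the Q3 figure quoted above: Q3 is commonly defined as the fraction of residues whose three-state label (helix H, strand E, coil C) is predicted correctly. The sketch below assumes that definition; the function name `q3_accuracy` and the toy sequences are hypothetical and are not taken from the paper.

```python
def q3_accuracy(true_ss: str, pred_ss: str) -> float:
    """Fraction of residues whose 3-state label (H, E or C) matches."""
    if len(true_ss) != len(pred_ss):
        raise ValueError("true and predicted strings must have equal length")
    if not true_ss:
        raise ValueError("sequences must not be empty")
    # Count positions where the predicted state equals the observed state.
    correct = sum(t == p for t, p in zip(true_ss, pred_ss))
    return correct / len(true_ss)


# Toy example (made-up labels): one of nine residues is mispredicted.
observed = "CHHHHEEEC"
predicted = "CHHHCEEEC"
print(f"Q3 = {q3_accuracy(observed, predicted):.3f}")
```

The SOV score is segment-based rather than residue-based and is not covered by this sketch.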


A novel computational model for predicting potential LncRNA-disease associations based on both direct and indirect features of LncRNA-disease pairs

BMC Bioinformatics ◽

10.1186/s12859-020-03906-7 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Yubin Xiao ◽

Zheng Xiao ◽

Xiang Feng ◽

Zhiping Chen ◽

Linai Kuang ◽

...

Keyword(s):

Computational Model ◽

Cross Validation ◽

State Of The Art ◽

Prediction Methods ◽

Good Prediction ◽

Average Case ◽

Comparison Results ◽

Disease Associations ◽

Fold Cross Validation

Abstract Background: Accumulating evidence has demonstrated that long non-coding RNAs (lncRNAs) are closely associated with human diseases, and knowing the relationships between lncRNAs and diseases is useful for diagnosis and treatment. Because traditional biological experiments are costly and time-consuming, more and more computational methods have been proposed in recent years to infer potential lncRNA-disease associations. However, these state-of-the-art prediction methods have various limitations as well. Results: In this manuscript, a novel computational model named FVTLDA is proposed to infer potential lncRNA-disease associations. Its major novelty lies in the integration of direct and indirect features related to lncRNA-disease associations, such as the feature vectors of lncRNA-disease pairs and their corresponding association probability fractions, which ensures that FVTLDA can make predictions for diseases without known related lncRNAs and for lncRNAs without known related diseases. Moreover, FVTLDA neither relies solely on known lncRNA-disease associations nor requires any negative samples, which ensures that it can infer potential lncRNA-disease associations more equitably and effectively than traditional state-of-the-art prediction methods. Additionally, to avoid the limitations of single-model prediction techniques, we combine FVTLDA with Multiple Linear Regression (MLR) and with an Artificial Neural Network (ANN) for data analysis. Simulation experiment results show that FVTLDA with MLR achieves AUCs of 0.8909, 0.8936 and 0.8970 in 5-fold cross-validation (fivefold CV), 10-fold cross-validation (tenfold CV) and leave-one-out cross-validation (LOOCV), respectively, while FVTLDA with ANN achieves AUCs of 0.8766, 0.8830 and 0.8807 in fivefold CV, tenfold CV and LOOCV, respectively. Furthermore, in case studies of gastric cancer, leukemia and lung cancer, 8, 8 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with MLR, and 8, 7 and 8 of the top 10 predicted by FVTLDA with ANN, have been verified by recent literature. In comparison with the representative prediction model KATZLDA, FVTLDA with MLR and FVTLDA with ANN achieve average case study contrast scores of 0.8429 and 0.8515, respectively, both notably higher than the average case study contrast score of 0.6375 achieved by KATZLDA. Conclusion: The simulation results show that FVTLDA has good prediction performance, making it a good supplement to future bioinformatics research.
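To make the AUC figures in this abstract concrete, here is a minimal sketch of what an AUC value measures: the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The function name `auc` and the toy labels and scores are hypothetical; this is not the FVTLDA evaluation code, and how the authors chose candidate negatives is not stated in the abstract.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy example: 1 = known association, 0 = candidate treated as negative.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]
print(auc(labels, scores))
```

The pairwise form is quadratic in the number of samples; it is used here only because it mirrors the definition directly.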


Decision-Tree Based Meta-Strategy Improved Accuracy of Disorder Prediction and Identified Novel Disordered Residues Inside Binding Motifs

International Journal of Molecular Sciences ◽

10.3390/ijms19103052 ◽

2018 ◽

Vol 19 (10) ◽

pp. 3052 ◽

Cited By ~ 7

Author(s):

Bi Zhao ◽

Bin Xue

Keyword(s):

Cross Validation ◽

Prediction Performance ◽

Computational Techniques ◽

Binding Motifs ◽

Intrinsically Disordered ◽

Biological Studies ◽

Independent Test ◽

Overall Performance ◽

Improved Accuracy ◽

Fold Cross Validation

Using computational techniques to identify intrinsically disordered residues is practical and effective in biological studies. Designing novel high-accuracy strategies is therefore worthwhile while existing strategies still leave considerable room for improvement. Among many possibilities, a meta-strategy that integrates the results of multiple individual predictors has been broadly used to improve overall performance. Nonetheless, a simple and direct integration of individual predictors may not effectively improve performance. In this project, dual-threshold two-step significance voting and neural networks were used to integrate the predictive results of four individual predictors: DisEMBL, IUPred, VSL2, and ESpritz. The new meta-strategy significantly improved the prediction of intrinsically disordered residues compared to all four individual predictors and another four recently designed predictors. The improvement was validated using five-fold cross-validation and on independent test datasets.


Global observation-based climatology of precipitation occurrence and peak intensity

10.5194/egusphere-egu2020-7837 ◽

2020 ◽

Author(s):

Hylke Beck ◽

Seth Westra ◽

Eric Wood

Keyword(s):

Land Surface ◽

Regression Models ◽

Cross Validation ◽

Climate Models ◽

Daily Precipitation ◽

State Of The Art ◽

Coefficient Of Determination ◽

Peak Intensity ◽

Uncertainty Estimates ◽

Fold Cross Validation

We introduce a unique set of global observation-based climatologies of daily precipitation (P) occurrence (related to the lower tail of the P distribution) and peak intensity (related to the upper tail of the P distribution). The climatologies were produced using Random Forest (RF) regression models trained with an unprecedented collection of daily P observations from 93,138 stations worldwide. Five-fold cross-validation was used to evaluate the generalizability of the approach and to quantify uncertainty globally. The RF models were found to provide highly satisfactory performance, yielding cross-validation coefficient of determination (R²) values from 0.74 for the 15-year return-period daily P intensity to 0.86 for the >0.5 mm d⁻¹ daily P occurrence. The performance of the RF models was consistently superior to that of state-of-the-art reanalysis (ERA5) and satellite (IMERG) products. The highest P intensities over land were found along the western equatorial coast of Africa, in India, and along coastal areas of Southeast Asia. Using a 0.5 mm d⁻¹ threshold, P was estimated to occur on 23.2% of days on average over the global land surface (excluding Antarctica). The climatologies including uncertainty estimates will be released as the Precipitation DISTribution (PDIST) dataset via www.gloh2o.org/pdist. We expect the dataset to be useful for numerous purposes, such as the evaluation of climate models, the bias correction of gridded P datasets, and the design of hydraulic structures in poorly gauged regions.


Hermes: an ensemble machine learning architecture for protein secondary structure prediction

10.1101/640656 ◽

2019 ◽

Author(s):

Larry Bliss ◽

Ben Pascoe ◽

Samuel K Sheppard

Keyword(s):

Machine Learning ◽

Protein Structure ◽

Secondary Structure ◽

Structure Prediction ◽

Cross Validation ◽

Secondary Structure Prediction ◽

Protein Structures ◽

Lower Boundary ◽

Protein Secondary Structure ◽

Homologous Proteins

Abstract Motivation: Protein structure prediction, which combines theoretical chemistry and bioinformatics, is an increasingly important technique in biotechnology and biomedical research, for example in the design of novel enzymes and drugs. Here, we present a new ensemble bi-layered machine learning architecture that builds directly on ten existing pipelines, providing rapid, high-accuracy, 3-state secondary structure prediction of proteins. Results: After training on 1348 solved protein structures, we evaluated the model with four independent datasets: JPRED4, compiled by the authors of the successful predictor of the same name, and CASP11, CASP12 and CASP13, assembled by the Critical Assessment of protein Structure Prediction consortium, which runs biennial experiments focused on objective testing of predictors. These rigorous, pre-established protocols included 7-fold cross-validation and blind testing. This led to a mean Hermes accuracy of 95.5%, significantly (p<0.05) better than the ten previously published models analysed in this paper. Furthermore, Hermes yielded a reduction in standard deviation, fewer lower-boundary outliers, and reduced dependency on solved structures of homologous proteins, as measured by NEFF score. This architecture provides advantages over other pipelines, while remaining accessible to users at any level of bioinformatics experience. Availability and Implementation: The source code for Hermes is freely available at: https://github.com/HermesPrediction/Hermes. This page also includes the cross-validation with corresponding models, and all training/testing data presented in this study with predictions and accuracy.


A Novel Computational Model for Predicting Potential LncRNA-Disease Associations based on Both Direct and Indirect Features of LncRNA-Disease Pairs

10.21203/rs.2.18937/v3 ◽

2020 ◽

Author(s):

Yubin Xiao ◽

Zheng Xiao ◽

Xiang Feng ◽

Zhiping Chen ◽

Linai Kuang ◽

...

Keyword(s):

Computational Model ◽

Cross Validation ◽

State Of The Art ◽

Prediction Methods ◽

Good Prediction ◽

Average Case ◽

Comparison Results ◽

Disease Associations ◽

Fold Cross Validation

Abstract Background: Accumulating evidence has demonstrated that long non-coding RNAs (lncRNAs) are closely associated with human diseases, and knowing the relationships between lncRNAs and diseases is useful for diagnosis and treatment. Because traditional biological experiments are costly and time-consuming, more and more computational methods have been proposed in recent years to infer potential lncRNA-disease associations. However, these state-of-the-art prediction methods have various limitations as well. Results: In this manuscript, a novel computational model named FVTLDA is proposed to infer potential lncRNA-disease associations. Its major novelty lies in the integration of direct and indirect features related to lncRNA-disease associations, such as the feature vectors of lncRNA-disease pairs and their corresponding association probability fractions, which ensures that FVTLDA can make predictions for diseases without known related lncRNAs and for lncRNAs without known related diseases. Moreover, FVTLDA neither relies solely on known lncRNA-disease associations nor requires any negative samples, which ensures that it can infer potential lncRNA-disease associations more equitably and effectively than traditional state-of-the-art prediction methods. Additionally, to avoid the limitations of single-model prediction techniques, we combine FVTLDA with Multiple Linear Regression (MLR) and with an Artificial Neural Network (ANN) for data analysis. Simulation experiment results show that FVTLDA with MLR achieves AUCs of 0.8909, 0.8936 and 0.8970 in 5-fold cross-validation (5-fold CV), 10-fold cross-validation (10-fold CV) and leave-one-out cross-validation (LOOCV), respectively, while FVTLDA with ANN achieves AUCs of 0.8766, 0.8830 and 0.8807 in 5-fold CV, 10-fold CV and LOOCV, respectively. Furthermore, in case studies of gastric cancer, leukemia and lung cancer, 8, 8 and 8 of the top 10 candidate lncRNAs predicted by FVTLDA with MLR, and 8, 7 and 8 of the top 10 predicted by FVTLDA with ANN, have been verified by recent literature. In comparison with the representative prediction model KATZLDA, FVTLDA with MLR and FVTLDA with ANN achieve average case study contrast scores of 0.8429 and 0.8515, respectively, both notably higher than the average case study contrast score of 0.6375 achieved by KATZLDA. Conclusion: The simulation results show that FVTLDA has good prediction performance, making it a good supplement to future bioinformatics research.


ProteinUnet2 for Fast Protein Secondary Structure Prediction: A Step Towards Proper Evaluation

10.21203/rs.3.rs-900318/v1 ◽

2021 ◽

Author(s):

Katarzyna Stapor ◽

Krzysztof Kotowski ◽

Tomasz Smolarczyk ◽

Irena Roterman

Keyword(s):

Secondary Structure ◽

Structure Prediction ◽

Secondary Structure Prediction ◽

State Of The Art ◽

Protein Secondary Structure ◽

Evaluation Study ◽

Practical Significance ◽

Statistical Methodology ◽

Extensive Evaluation ◽

Benchmark Datasets

Abstract Background: The importance of protein secondary structure (SS) prediction is widely known; solving it enables learning about the role of a protein in organisms. As experimental methods are expensive and sometimes impossible, many SS predictors, mainly based on different machine learning methods, have been proposed over many years. SS prediction, as an imbalanced classification problem, should not be judged by the commonly used Q3/Q8 metrics. Moreover, as the benchmark datasets are not random samples, classical statistical null hypothesis testing based on the Neyman-Pearson approach is not appropriate. Also, state-of-the-art predictors usually have relatively long prediction times. Results: We present a new deep network, ProteinUnet2, for SS prediction, which is based on the U-Net convolutional architecture. We also propose a new statistical methodology for prediction performance assessment based on the significance from Fisher-Pitman permutation tests accompanied by practical significance measured by Cohen's effect size. Through an extensive evaluation study, we report the performance of ProteinUnet2 in comparison with two state-of-the-art methods, SAINT and SPOT-1D, on the benchmark datasets TEST2016, TEST2018, and CASP12. Conclusions: Our results suggest that ProteinUnet2 has much shorter prediction times while matching (or outperforming) the mentioned predictors. We strongly believe that our proposed statistical methodology will be adopted and used (and even expanded) by the research community.
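The pairing of a permutation test with an effect size can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes two independent samples of per-protein scores, uses the absolute difference in means as the test statistic, approximates the permutation distribution by Monte Carlo resampling, and computes Cohen's d with a pooled standard deviation. The function names and toy scores are hypothetical.

```python
import random
from statistics import mean, stdev


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled std. dev."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabelling of the pooled scores
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Toy per-protein scores for two hypothetical predictors.
model_a = [0.81, 0.84, 0.79, 0.88, 0.83, 0.86]
model_b = [0.78, 0.80, 0.77, 0.85, 0.79, 0.82]
print("d =", round(cohens_d(model_a, model_b), 3))
print("p =", round(permutation_p_value(model_a, model_b), 4))
```

Reporting both numbers separates the question "is the difference unlikely under relabelling?" from "is the difference large enough to matter?".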


A Methodology to Determine the Subset of Heuristics for Hyperheuristics through Metalearning for Solving Graph Coloring and Capacitated Vehicle Routing Problems

Complexity ◽

10.1155/2021/6660572 ◽

2021 ◽

Vol 2021 ◽

pp. 1-22

Author(s):

Lucero Ortiz-Aguilar ◽

Martín Carpio ◽

Alfonso Rojas-Domínguez ◽

Manuel Ornelas-Rodriguez ◽

H. J. Puga-Soberanes ◽

...

Keyword(s):

Vehicle Routing ◽

Graph Coloring ◽

Cross Validation ◽

State Of The Art ◽

Statistical Tests ◽

Statistical Comparison ◽

Offline Learning ◽

Comparison Of The Results ◽

Fold Cross Validation ◽

Capacitated Vehicle

In this work, we focus on the problem of selecting low-level heuristics in a hyperheuristic approach with offline learning, for the solution of instances of different problem domains. The objective is to improve the performance of the offline hyperheuristic approach by identifying equivalence classes in a set of instances of different problems and selecting the best-performing heuristics in each of them. In the proposed methodology, the first step takes a set of instances of all problems; the generic characteristics of each instance and the performance of the heuristics on it are used to define feature vectors and to group the instances into classes. Metalearning with statistical tests is used to select the heuristics for each class. Finally, we used a Naive Bayes classifier to test the set of instances with k-fold cross-validation, and we compared all results statistically with the best-known values. In this research, the methodology was tested by applying it to the capacitated vehicle routing problem (CVRP) and the graph coloring problem (GCP). The experimental results show that the proposed methodology can improve the performance of the offline hyperheuristic approach, correctly identifying the classes of instances and applying the appropriate heuristics in each case. This conclusion is based on a statistical comparison of the results obtained with the state-of-the-art results for each instance.


Prediction of Protein Secondary Structure

Jurnal Teknologi ◽

10.11113/jt.v35.605 ◽

2012 ◽

Author(s):

Satya Nanda Vel Arjunan ◽

Safaai Deris ◽

Rosli Md Illias

Keyword(s):

Protein Structure ◽

Secondary Structure ◽

Structure Prediction ◽

Large Scale ◽

Secondary Structure Prediction ◽

State Of The Art ◽

Protein Structures ◽

Protein Secondary Structure ◽

Protein Secondary Structure Prediction ◽

General Guide

In the wake of large-scale DNA sequencing projects, accurate tools are needed to predict protein structures. The problem of predicting protein structure from DNA sequence remains fundamentally unsolved even after more than three decades of intensive research. In this paper, the fundamental theory of protein structure is presented as a general guide to protein secondary structure prediction research. An overview of the state of the art in sequence analysis and some principles of the methods involved will be described. Key words: Protein secondary structure prediction; Neural networks


Derivation and Validation of a Record Linkage Algorithm between EMS and the Emergency Department

10.1101/124313 ◽

2017 ◽

Author(s):

Colby Redfield ◽

Abdulhakim Tlimat ◽

Yoni Halpern ◽

David Schoenfeld ◽

Edward Ullman ◽

...

Keyword(s):

Record Linkage ◽

Cross Validation ◽

Supervised Machine Learning ◽

Multivariate Logistic Regression Model ◽

Data Sets ◽

Multivariate Logistic Regression ◽

Linkage Algorithm ◽

Limited Success ◽

Test Sets ◽

Fold Cross Validation

Abstract Background: Linking EMS electronic patient care reports (ePCRs) to ED records can give clinicians access to vital information that can alter management. It can also create rich databases for research and quality improvement. Unfortunately, previous attempts at ePCR-ED record linkage have had limited success. Objective: To derive and validate an automated record linkage algorithm between EMS ePCRs and ED records using supervised machine learning. Methods: All consecutive ePCRs from a single EMS provider between June 2013 and June 2015 were included. A primary reviewer matched ePCRs to a list of ED patients to create a gold standard. Age, gender, last name, first name, social security number (SSN), and date of birth (DOB) were extracted. Data were randomly split into 80%/20% training and test data sets. We derived missing indicators, identical indicators, edit distances, and percent differences. A multivariate logistic regression model was trained using 5k fold cross-validation, using label k-fold, L2 regularization, and class re-weighting. Results: A total of 14,032 ePCRs were included in the study. Inter-rater reliability between the primary and secondary reviewer had a Kappa of 0.9. The algorithm had a sensitivity of 99.4%, a PPV of 99.9% and an AUC of 0.99 in both the training and test sets. DOB match had the highest odds ratio of 16.9, followed by last name match (10.6). SSN match had an odds ratio of 3.8. Conclusions: We were able to successfully derive and validate a probabilistic record linkage algorithm from a single EMS ePCR provider to our hospital EMR.
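The derived features named in the Methods (missing indicators, identical indicators, edit distances) can be sketched for a single field as below. This is an illustration only: the abstract does not say which edit distance was used, so Levenshtein distance is an assumption here, and the function names `levenshtein` and `pair_features` and the example names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def pair_features(epcr_value, ed_value):
    """Features comparing one field of an ePCR with the same field of an ED record."""
    if epcr_value is None or ed_value is None:
        return {"missing": 1, "identical": 0, "edit_distance": None}
    return {
        "missing": 0,
        "identical": int(epcr_value == ed_value),
        "edit_distance": levenshtein(epcr_value, ed_value),
    }


# Toy example: a last name with a one-letter transcription difference.
print(pair_features("Smith", "Smyth"))
print(pair_features(None, "Smith"))
```

Features like these, computed per field for each candidate record pair, would then be the inputs to a classifier such as the logistic regression model the abstract describes.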


Analyzing performance of classifiers for medical datasets

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.15.11370 ◽

2018 ◽

Vol 7 (2.15) ◽

pp. 136 ◽

Cited By ~ 1

Author(s):

Rosaida Rosly ◽

Mokhairi Makhtar ◽

Mohd Khalid Awang ◽

Mohd Isa Awang ◽

Mohd Nordin Abdul Rahman

Keyword(s):

Breast Cancer ◽

Cross Validation ◽

Ensemble Methods ◽

Data Sets ◽

Ensemble Classifiers ◽

Classification Models ◽

Data Set ◽

Mining Tool ◽

Fold Cross Validation

This paper analyses the performance of classification models, using single classifiers and ensemble combinations, with the Breast Cancer Wisconsin and Hepatitis data sets as training data sets. It presents a comparison of different classifiers based on 10-fold cross-validation using a data mining tool. In this experiment, various classifiers are implemented, including three popular ensemble methods (boosting, bagging and stacking) for the combinations. The results show that, for the Breast Cancer Wisconsin data set, the single Naïve Bayes (NB) classifier and the bagging+NB combination achieved the same, highest accuracy (97.51%) compared to other combinations of ensemble classifiers. For the Hepatitis data set, the stacking+Multi-Layer Perceptron (MLP) combination achieved a higher accuracy of 86.25%. Using ensemble classifiers may therefore improve the results. In future, a multi-classifier approach will be proposed by introducing a fusion at the classification level between these classifiers to obtain higher classification accuracy.
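For readers unfamiliar with the 10-fold cross-validation used in this comparison, the splitting step can be sketched as follows. This is a generic, non-stratified illustration and not the data mining tool's implementation; the function name `k_fold_indices` and the sample count are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Each sample appears in exactly one test fold across the 10 rounds.
for round_no, (train, test) in enumerate(k_fold_indices(25, k=10), start=1):
    print(round_no, len(train), len(test))
```

A classifier would be fitted on `train` and scored on `test` in each round, and the reported accuracy is typically the average over the k rounds.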
