Recent Advances in the Prediction of Protein Structural Classes: Feature Descriptors and Machine Learning Algorithms

Lin Zhu; Mehdi D. Davari; Wenjin Li

doi:10.3390/cryst11040324

Crystals ◽

10.3390/cryst11040324 ◽

2021 ◽

Vol 11 (4) ◽

pp. 324

Author(s):

Lin Zhu ◽

Mehdi D. Davari ◽

Wenjin Li

Keyword(s):

Machine Learning ◽

Protein Structure ◽

Protein Function ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Class Prediction ◽

Structural Class ◽

Protein Structural Class ◽

Feature Descriptors ◽

New Feature

In the postgenomic age, rapid growth in the number of sequence-known proteins has been accompanied by much slower growth in the number of structure-known proteins (as a result of experimental limitations), and a widening gap between the two is evident. Because protein function is linked to protein structure, successful prediction of protein structure is of significant importance in protein function identification. Foreknowledge of protein structural class can help improve protein structure prediction with significant medical and pharmaceutical implications. Thus, a fast, suitable, reliable, and reasonable computational method for protein structural class prediction has become pivotal in bioinformatics. Here, we review recent efforts in protein structural class prediction from protein sequence, with particular attention paid to new feature descriptors, which extract information from protein sequence, and the use of machine learning algorithms in both feature selection and the construction of new classification models. These new feature descriptors include amino acid composition, sequence order, physicochemical properties, multiprofile Bayes, and secondary structure-based features. Machine learning methods, such as artificial neural networks (ANNs), support vector machine (SVM), K-nearest neighbor (KNN), random forest, deep learning, and examples of their application are discussed in detail. We also present our view on possible future directions, challenges, and opportunities for the applications of machine learning algorithms for prediction of protein structural classes.
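
Of the descriptors listed above, amino acid composition is the simplest to illustrate. The following is a minimal sketch, not the review's own implementation: the function name `amino_acid_composition` and the choice to silently skip non-standard residue codes are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes, in a fixed order so that
# every sequence maps to a feature vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue.

    Characters outside the 20 standard codes (for example 'X') are ignored;
    that handling is an assumption of this sketch.
    """
    seq = sequence.upper()
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]
```

A vector like this can be fed to any of the classifiers named in the abstract. Because it discards residue order, the sequence-order descriptors the abstract also lists serve to complement it.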

Image-based effective feature generation for protein structural class and ligand binding prediction

PeerJ Computer Science ◽

10.7717/peerj-cs.253 ◽

2020 ◽

Vol 6 ◽

pp. e253

Author(s):

Nafees Sadique ◽

Al Amin Neaz Ahmed ◽

Md Tajul Islam ◽

Md. Nawshad Pervage ◽

Swakkhar Shatabda

Keyword(s):

Machine Learning ◽

Ligand Binding ◽

Tertiary Structure ◽

Learning Algorithms ◽

Three Dimensional ◽

Machine Learning Algorithms ◽

Binding Prediction ◽

Structural Class ◽

Protein Structural Class ◽

Atom Bond

Proteins are the building blocks of all cells in humans and in every other living organism, and they perform most of the work within those organisms. Proteins are macromolecules: polymers built from amino acid monomers. The tertiary structure of a protein is its three-dimensional shape, and it governs the protein’s functions, classification and binding sites. If two protein structures are alike, then the two proteins can be of the same kind, implying a similar structural class and similar ligand binding properties. In this paper, we use protein tertiary structure to generate effective features for structural-similarity applications, namely detecting structural class and ligand binding. First, we analyze the effectiveness of a group of image-based features for predicting the structural class of a protein. These features are derived from the image generated by the distance matrix of the tertiary structure of a given protein. They include the local binary pattern (LBP) histogram, the Gabor filtered LBP histogram, the separate row multiplication matrix with uniform LBP histogram, the neighbor block subtraction matrix with uniform LBP histogram, and atom bond. The separate row multiplication matrix and neighbor block subtraction matrix filters, as well as atom bond, are our novel contributions. The experiments were done on a standard benchmark dataset. We demonstrate the effectiveness of these features across a large variety of supervised machine learning algorithms. Experiments suggest that the support vector machine is the best-performing classifier on the selected dataset using this set of features. We believe the excellent performance of Hybrid LBP in terms of accuracy will motivate researchers and practitioners to use it to identify protein structural class. To facilitate that, a classification model using Hybrid LBP is readily available at http://brl.uiu.ac.bd/PL/.
Protein-ligand binding governs the tasks of biological receptors, which help to cure diseases, among other roles. Therefore, predicting binding between a protein and a ligand is important for understanding a protein’s activity and for accelerating docking computations in virtual screening-based drug design. Protein-ligand binding prediction requires the three-dimensional tertiary structure of the target protein that is to be searched for ligand binding. In this paper, we propose a supervised learning algorithm for predicting protein-ligand binding: a similarity-based clustering approach that uses the same set of features. Our algorithm works better than the most popular and widely used machine learning algorithms.
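
The first feature in that list, an LBP histogram computed over a distance-matrix image, can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: it uses the basic 8-neighbour LBP (not the uniform or Gabor filtered variants), treats the raw pairwise distances as pixel intensities, and assumes one 3D coordinate per residue. The function names are hypothetical.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points (one point per residue)."""
    return [[math.dist(a, b) for b in coords] for a in coords]


# Neighbour offsets in clockwise order, starting at the top-left pixel.
_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
            (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_histogram(matrix):
    """Return a 256-bin histogram of basic 8-neighbour LBP codes.

    Each interior cell is compared with its eight neighbours; a neighbour
    that is >= the centre sets one bit of an 8-bit code. Border cells are
    skipped because they lack a full neighbourhood.
    """
    hist = [0] * 256
    n_rows, n_cols = len(matrix), len(matrix[0])
    for i in range(1, n_rows - 1):
        for j in range(1, n_cols - 1):
            center = matrix[i][j]
            code = 0
            for bit, (di, dj) in enumerate(_OFFSETS):
                if matrix[i + di][j + dj] >= center:
                    code |= 1 << bit
            hist[code] += 1
    return hist
```

The histogram has a fixed length regardless of protein size, which is what lets proteins of different lengths share one feature space.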

Evaluation and Comparison of Machine Learning Algorithms for Solar Flare Class Prediction

10.1109/aimv53313.2021.9671015 ◽

2021 ◽

Author(s):

Savita R. Gandhi ◽

Aishawariya Athawale ◽

Hetvi Julasana ◽

Suchit Purohit

Keyword(s):

Machine Learning ◽

Solar Flare ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Class Prediction

Amino Acid k-mer Feature Extraction for Quantitative Antimicrobial Resistance (AMR) Prediction by Machine Learning and Model Interpretation for Biological Insights

Biology ◽

10.3390/biology9110365 ◽

2020 ◽

Vol 9 (11) ◽

pp. 365

Author(s):

Taha ValizadehAslani ◽

Zhengqiao Zhao ◽

Bahrad A. Sokhansanj ◽

Gail L. Rosen

Keyword(s):

Machine Learning ◽

Feature Extraction ◽

Amino Acid ◽

Computational Complexity ◽

Antimicrobial Resistance ◽

Learning Algorithms ◽

Extraction Methods ◽

Machine Learning Algorithms ◽

Model Interpretation ◽

New Feature

Machine learning algorithms can learn mechanisms of antimicrobial resistance from DNA sequence data without any a priori information. Interpreting a trained machine learning model can be used to validate the model and to obtain new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers, have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we propose a new feature extraction method, counting amino acid k-mers (oligopeptides), which provides easier model interpretation than counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We also built a new feature selection pipeline for extracting important features, so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that use only a small number of features and can still predict resistance accurately.
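
Counting amino acid k-mers amounts to sliding a window of length k over a protein sequence and tallying each oligopeptide. The sketch below assumes the sequence is already translated to one-letter amino acid codes; the function name `count_kmers` is hypothetical, and the authors' pipeline may differ in how it translates, filters or normalises.

```python
from collections import Counter


def count_kmers(protein, k):
    """Count overlapping amino acid k-mers (oligopeptides) in a protein sequence.

    Returns a Counter mapping each k-mer to its number of occurrences.
    A sequence shorter than k yields an empty Counter.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = protein.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, `count_kmers("MKVMKV", 3)` counts `MKV` twice and `KVM` and `VMK` once each. Each distinct k-mer then becomes one feature column, which is why a selected feature can be read directly as a short peptide.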

AUTOMATED SELECTION OF INPUTS FOR LOG PREDICTION MODELS USING A NEW FEATURE SELECTION METHOD

10.30632/spwla-2021-0091 ◽

2021 ◽

Author(s):

Ravi Arkalgud ◽

Andrew McDonald ◽

Ross Brackenridge ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Prediction Models ◽

Learning Algorithms ◽

Feature Selection Method ◽

Machine Learning Algorithms ◽

Target Feature ◽

Geomechanical Properties ◽

New Feature ◽

Selection Of

Automation is becoming an integral part of our daily lives as technology and techniques rapidly develop. Many automation workflows are now routinely applied within the geoscience domain. The basic structure of automation, and the success of the resulting models, fundamentally hinges on the appropriate choice of parameters and the speed of processing. The entire process demands that the data fed into any machine learning model be of good quality. Advances in well logging technology over decades have enabled the collection of vast amounts of data across wells and fields. This poses a major issue in automating petrophysical workflows: it is necessary to ensure that the data being fed in are appropriate and fit for purpose. The selection of features (logging curves) and parameters for machine learning algorithms has therefore become a topic at the forefront of related research. Inappropriate feature selections can lead to erroneous results and reduced precision, and have proved to be computationally expensive. Experienced Eye (EE) is a novel methodology, derived from Domain Transfer Analysis (DTA), which seeks to identify and elicit the optimum input curves for modelling. During the EE solution process, relationships between the input variables and target variables are developed, based on characteristics and attributes of the inputs instead of statistical averages. The relationships so developed between variables can then be ranked appropriately and selected for the modelling process. This paper focuses on three distinct petrophysical data scenarios where inputs are ranked prior to modelling: prediction of continuous permeability from discrete core measurements, prediction of porosity from multiple logging measurements, and finally the prediction of key geomechanical properties. Each input curve is ranked against a target feature.
For each case study, the best ranked features were carried forward to the modelling stage, and the results are validated alongside conventional interpretation methods. Ranked features were also compared between different machine learning algorithms: DTA, Neural Networks and Multiple Linear Regression. Results are compared with the available data for various case studies. The use of the new feature selection method has been proven to improve the accuracy and precision of prediction results from multiple modelling algorithms.

Improving the Prediction of Protein Structural Class for Low-Similarity Sequences by Incorporating Evolutionary and Structural Information

Journal of Advanced Computational Intelligence and Intelligent Informatics ◽

10.20965/jaciii.2016.p0402 ◽

2016 ◽

Vol 20 (3) ◽

pp. 402-411 ◽

Cited By ~ 2

Author(s):

Liang Kong ◽

Lingfu Kong ◽

Rong Jing

Keyword(s):

Protein Function ◽

Structural Information ◽

Sequence Similarity ◽

Computational Method ◽

Evolutionary Information ◽

Support Vector ◽

Local Alignment ◽

Class Prediction ◽

Structural Class ◽

Protein Structural Class

Protein structural class prediction is beneficial to the study of protein function, regulation and interactions. However, protein structural class prediction for low-similarity sequences (i.e., below 40% in pairwise sequence similarity) remains a challenging problem. In this study, a novel computational method is proposed to accurately predict protein structural class for low-similarity sequences. The method is based on a support vector machine in conjunction with integrated features from evolutionary information, generated with the position-specific iterative basic local alignment search tool (PSI-BLAST), and from predicted secondary structure. Prediction accuracies evaluated by jackknife tests are reported on two widely used low-similarity benchmark datasets (25PDB and 1189), reaching overall accuracies of 89.3% and 87.9%, which are significantly higher than those achieved by state-of-the-art methods in protein structural class prediction. The experimental results suggest that our method could serve as an effective alternative to existing methods in protein structural classification, especially for low-similarity sequences.
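
The jackknife test used for evaluation holds out each protein in turn, trains on all the others, and predicts the held-out one. The sketch below illustrates only that protocol. As a loudly stated substitution, it plugs in a 1-nearest-neighbour classifier in place of the paper's support vector machine, and the feature vectors are plain lists of numbers; all function names are hypothetical.

```python
import math


def nearest_neighbor_predict(train_x, train_y, query):
    """Stand-in classifier (NOT the paper's SVM): label of the closest training point."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_y[best]


def jackknife_accuracy(features, labels, predict=nearest_neighbor_predict):
    """Leave-one-out accuracy: each sample is predicted from all the others."""
    correct = 0
    for i in range(len(features)):
        # Build the training set without sample i.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if predict(train_x, train_y, features[i]) == labels[i]:
            correct += 1
    return correct / len(features)
```

Any classifier with the same `predict(train_x, train_y, query)` shape can be passed in, so the protocol stays separate from the model being evaluated.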

Emotion Recognition via Facial Expression: Utilization of Numerous Feature Descriptors in Different Machine Learning Algorithms

TENCON 2018 - 2018 IEEE Region 10 Conference ◽

10.1109/tencon.2018.8650192 ◽

2018 ◽

Cited By ~ 2

Author(s):

John Chris T. Kwong ◽

Felan Carlo C. Garcia ◽

Patricia Angela R. Abu ◽

Rosula S. J. Reyes

Keyword(s):

Machine Learning ◽

Facial Expression ◽

Emotion Recognition ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Feature Descriptors

Deep Recurrent Neural Network for Protein Function Prediction from Sequence

10.1101/103994 ◽

2017 ◽

Cited By ~ 14

Author(s):

Xueliang Leon Liu

Keyword(s):

Machine Learning ◽

Protein Function ◽

Short Term Memory ◽

Amino Acid Sequences ◽

Machine Learning Algorithms ◽

Class Prediction ◽

Iron Storage ◽

Deep Recurrent Neural Network ◽

Wide Range ◽

Protein Functions

As high-throughput biological sequencing becomes faster and cheaper, the need to extract useful information from sequencing becomes ever more paramount, often limited by low-throughput experimental characterizations. For proteins, accurate prediction of their functions directly from their primary amino-acid sequences has been a long-standing challenge. Here, machine learning using artificial recurrent neural networks (RNN) was applied towards classification of protein function directly from primary sequence without sequence alignment, heuristic scoring or feature engineering. The RNN models containing long-short-term-memory (LSTM) units trained on public, annotated datasets from UniProt achieved high performance for in-class prediction of four important protein functions tested, particularly compared to other machine learning algorithms using sequence-derived protein features. RNN models were also used for out-of-class predictions of phylogenetically distinct protein families with similar functions, including proteins of the CRISPR-associated nuclease, ferritin-like iron storage and cytochrome P450 families. Applying the trained RNN models to the partially unannotated UniRef100 database predicted not only candidates validated by existing annotations but also currently unannotated sequences. Some RNN predictions for the ferritin-like iron sequestering function were experimentally validated, even though their sequences differ significantly from known, characterized proteins and from each other and cannot be easily predicted using popular bioinformatics methods. As sequencing and experimental characterization data increase rapidly, the machine-learning approach based on RNN could be useful for discovery and prediction of homologues for a wide range of protein functions.

Supervised machine learning algorithms for protein structure classification

Computational Biology and Chemistry ◽

10.1016/j.compbiolchem.2009.04.004 ◽

2009 ◽

Vol 33 (3) ◽

pp. 216-223 ◽

Cited By ~ 53

Author(s):

Pooja Jain ◽

Jonathan M. Garibaldi ◽

Jonathan D. Hirst

Keyword(s):

Machine Learning ◽

Protein Structure ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Protein Structure Classification

An empirical comparison of individual machine learning techniques and ensemble approaches in protein structural class prediction

Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005. ◽

10.1109/ijcnn.2005.1555886 ◽

2006 ◽

Cited By ~ 2

Author(s):

V.G. Bittencourt ◽

M.C.C. Abreut ◽

M.C.P. de Souto ◽

A.M. de P. Canuto

Keyword(s):

Machine Learning ◽

Machine Learning Techniques ◽

Empirical Comparison ◽

Class Prediction ◽

Structural Class ◽

Protein Structural Class ◽

Learning Techniques

Supplemental Material for One Model to Rule Them All? Using Machine Learning Algorithms to Determine the Number of Factors in Exploratory Factor Analysis

Psychological Methods ◽

10.1037/met0000262.supp ◽

2020 ◽

Keyword(s):

Machine Learning ◽

Factor Analysis ◽

Exploratory Factor Analysis ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Number Of Factors