scholarly journals Recent Advances in the Prediction of Protein Structural Classes: Feature Descriptors and Machine Learning Algorithms

Crystals ◽  
2021 ◽  
Vol 11 (4) ◽  
pp. 324
Author(s):  
Lin Zhu ◽  
Mehdi D. Davari ◽  
Wenjin Li

In the postgenomic age, rapid growth in the number of sequence-known proteins has been accompanied by much slower growth in the number of structure-known proteins (as a result of experimental limitations), and a widening gap between the two is evident. Because protein function is linked to protein structure, successful prediction of protein structure is of significant importance in protein function identification. Foreknowledge of protein structural class can help improve protein structure prediction with significant medical and pharmaceutical implications. Thus, a fast, suitable, reliable, and reasonable computational method for protein structural class prediction has become pivotal in bioinformatics. Here, we review recent efforts in protein structural class prediction from protein sequence, with particular attention paid to new feature descriptors, which extract information from protein sequence, and the use of machine learning algorithms in both feature selection and the construction of new classification models. These new feature descriptors include amino acid composition, sequence order, physicochemical properties, multiprofile Bayes, and secondary structure-based features. Machine learning methods, such as artificial neural networks (ANNs), support vector machine (SVM), K-nearest neighbor (KNN), random forest, deep learning, and examples of their application are discussed in detail. We also present our view on possible future directions, challenges, and opportunities for the applications of machine learning algorithms for prediction of protein structural classes.

2020 ◽  
Vol 6 ◽  
pp. e253
Author(s):  
Nafees Sadique ◽  
Al Amin Neaz Ahmed ◽  
Md Tajul Islam ◽  
Md. Nawshad Pervage ◽  
Swakkhar Shatabda

Proteins are the building blocks of all cells in both human and all living creatures of the world. Most of the work in the living organism is performed by proteins. Proteins are polymers of amino acid monomers which are biomolecules or macromolecules. The tertiary structure of protein represents the three-dimensional shape of a protein. The functions, classification and binding sites are governed by the protein’s tertiary structure. If two protein structures are alike, then the two proteins can be of the same kind implying similar structural class and ligand binding properties. In this paper, we have used the protein tertiary structure to generate effective features for applications in structural similarity to detect structural class and ligand binding. Firstly, we have analyzed the effectiveness of a group of image-based features to predict the structural class of a protein. These features are derived from the image generated by the distance matrix of the tertiary structure of a given protein. They include local binary pattern (LBP) histogram, Gabor filtered LBP histogram, separate row multiplication matrix with uniform LBP histogram, neighbor block subtraction matrix with uniform LBP histogram and atom bond. Separate row multiplication matrix and neighbor block subtraction matrix filters, as well as atom bond, are our novels. The experiments were done on a standard benchmark dataset. We have demonstrated the effectiveness of these features over a large variety of supervised machine learning algorithms. Experiments suggest support vector machines is the best performing classifier on the selected dataset using the set of features. We believe the excellent performance of Hybrid LBP in terms of accuracy would motivate the researchers and practitioners to use it to identify protein structural class. To facilitate that, a classification model using Hybrid LBP is readily available for use at http://brl.uiu.ac.bd/PL/. Protein-ligand binding is accountable for managing the tasks of biological receptors that help to cure diseases and many more. Therefore, binding prediction between protein and ligand is important for understanding a protein’s activity or to accelerate docking computations in virtual screening-based drug design. Protein-ligand binding prediction requires three-dimensional tertiary structure of the target protein to be searched for ligand binding. In this paper, we have proposed a supervised learning algorithm for predicting protein-ligand binding, which is a similarity-based clustering approach using the same set of features. Our algorithm works better than the most popular and widely used machine learning algorithms.


Biology ◽  
2020 ◽  
Vol 9 (11) ◽  
pp. 365
Author(s):  
Taha ValizadehAslani ◽  
Zhengqiao Zhao ◽  
Bahrad A. Sokhansanj ◽  
Gail L. Rosen

Machine learning algorithms can learn mechanisms of antimicrobial resistance from the data of DNA sequence without any a priori information. Interpreting a trained machine learning algorithm can be exploited for validating the model and obtaining new information about resistance mechanisms. Different feature extraction methods, such as SNP calling and counting nucleotide k-mers have been proposed for presenting DNA sequences to the model. However, there are trade-offs between interpretability, computational complexity and accuracy for different feature extraction methods. In this study, we have proposed a new feature extraction method, counting amino acid k-mers or oligopeptides, which provides easier model interpretation compared to counting nucleotide k-mers and reaches the same or even better accuracy in comparison with different methods. Additionally, we have trained machine learning algorithms using different feature extraction methods and compared the results in terms of accuracy, model interpretability and computational complexity. We have built a new feature selection pipeline for extraction of important features so that new AMR determinants can be discovered by analyzing these features. This pipeline allows the construction of models that only use a small number of features and can predict resistance accurately.


2021 ◽  
Author(s):  
Ravi Arkalgud ◽  
◽  
Andrew McDonald ◽  
Ross Brackenridge ◽  
◽  
...  

Automation is becoming an integral part of our daily lives as technology and techniques rapidly develop. Many automation workflows are now routinely being applied within the geoscience domain. The basic structure of automation and its success of modelling fundamentally hinges on the appropriate choice of parameters and speed of processing. The entire process demands that the data being fed into any machine learning model is essentially of good quality. The technological advances in well logging technology over decades have enabled the collection of vast amounts of data across wells and fields. This poses a major issue in automating petrophysical workflows. It necessitates to ensure that, the data being fed is appropriate and fit for purpose. The selection of features (logging curves) and parameters for machine learning algorithms has therefore become a topic at the forefront of related research. Inappropriate feature selections can lead erroneous results, reduced precision and have proved to be computationally expensive. Experienced Eye (EE) is a novel methodology, derived from Domain Transfer Analysis (DTA), which seeks to identify and elicit the optimum input curves for modelling. During the EE solution process, relationships between the input variables and target variables are developed, based on characteristics and attributes of the inputs instead of statistical averages. The relationships so developed between variables can then be ranked appropriately and selected for modelling process. This paper focuses on three distinct petrophysical data scenarios where inputs are ranked prior to modelling: prediction of continuous permeability from discrete core measurements, porosity from multiple logging measurements and finally the prediction of key geomechanical properties. Each input curve is ranked against a target feature. For each case study, the best ranked features were carried forward to the modelling stage, and the results are validated alongside conventional interpretation methods. Ranked features were also compared between different machine learning algorithms: DTA, Neural Networks and Multiple Linear Regression. Results are compared with the available data for various case studies. The use of the new feature selection has been proven to improve accuracy and precision of prediction results from multiple modelling algorithms.


Author(s):  
Liang Kong ◽  
◽  
Lingfu Kong ◽  
Rong Jing ◽  

Protein structural class prediction is beneficial to study protein function, regulation and interactions. However, protein structural class prediction for low-similarity sequences (i.e., below 40% in pairwise sequence similarity) remains a challenging problem at present. In this study, a novel computational method is proposed to accurately predict protein structural class for low-similarity sequences. This method is based on support vector machine in conjunction with integrated features from evolutionary information generated with position specific iterative basic local alignment search tool (PSI-BLAST) and predicted secondary structure. Various prediction accuracies evaluated by the jackknife tests are reported on two widely-used low-similarity benchmark datasets (25PDB and 1189), reaching overall accuracies 89.3% and 87.9%, which are significantly higher than those achieved by state-of-the-art in protein structural class prediction. The experimental results suggest that our method could serve as an effective alternative to existing methods in protein structural classification, especially for low-similarity sequences.


2017 ◽  
Author(s):  
Xueliang Leon Liu

AbstractAs high-throughput biological sequencing becomes faster and cheaper, the need to extract useful information from sequencing becomes ever more paramount, often limited by low-throughput experimental characterizations. For proteins, accurate prediction of their functions directly from their primary amino-acid sequences has been a long standing challenge. Here, machine learning using artificial recurrent neural networks (RNN) was applied towards classification of protein function directly from primary sequence without sequence alignment, heuristic scoring or feature engineering. The RNN models containing long-short-term-memory (LSTM) units trained on public, annotated datasets from UniProt achieved high performance for in-class prediction of four important protein functions tested, particularly compared to other machine learning algorithms using sequence-derived protein features. RNN models were used also for out-of-class predictions of phylogenetically distinct protein families with similar functions, including proteins of the CRISPR-associated nuclease, ferritin-like iron storage and cytochrome P450 families. Applying the trained RNN models on the partially unannotated UniRef100 database predicted not only candidates validated by existing annotations but also currently unannotated sequences. Some RNN predictions for the ferritin-like iron sequestering function were experimentally validated, even though their sequences differ significantly from known, characterized proteins and from each other and cannot be easily predicted using popular bioinformatics methods. As sequencing and experimental characterization data increases rapidly, the machine-learning approach based on RNN could be useful for discovery and prediction of homologues for a wide range of protein functions.


Sign in / Sign up

Export Citation Format

Share Document