Incorporating Amino Acids Composition and Functional Domains for Identifying Bacterial Toxin Proteins

Bigram-PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix

BMC Molecular and Cell Biology ◽

10.1186/s12860-019-0240-1 ◽

2019 ◽

Vol 20 (S2) ◽

Cited By ~ 2

Author(s):

Abel Chandra ◽

Alok Sharma ◽

Abdollah Dehzangi ◽

Daichi Shigemizu ◽

Tatsuhiko Tsunoda

Keyword(s):

Amino Acids ◽

Cell Biology ◽

Covalent Modification ◽

Evolutionary Information ◽

Support Vector ◽

Computational Techniques ◽

Post Translational Modification ◽

Statistical Measures ◽

Scoring Matrices ◽

Lysine Residues

Background: Post-translational modification (PTM) is a biological process whereby proteomes are modified, affecting normal cell biology and hence pathogenesis. A number of PTMs have been discovered in recent years, and lysine phosphoglycerylation is one of the more recent. Even with a large number of proteins being sequenced in the post-genomic era, identifying phosphoglycerylation remains a major challenge because experimental approaches are costly, time-consuming and inefficient. To overcome this issue, computational techniques have emerged to identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far are limited in their ability to correctly predict this covalent modification. Results: We propose a new predictor, Bigram-PGK, which uses evolutionary information of amino acids to predict phosphoglycerylated sites. A benchmark dataset containing experimentally labelled sites is employed for this purpose, and profile bigram occurrences are calculated from position specific scoring matrices of amino acids in the protein sequences. The reported sensitivity, specificity, precision, accuracy, Matthews correlation coefficient and area under the ROC curve are 0.9642, 0.8973, 0.8253, 0.9193, 0.8330 and 0.9306, respectively. Conclusions: The proposed predictor, based on evolutionary-information features and a support vector machine classifier, shows great potential for distinguishing phosphoglycerylated from non-phosphoglycerylated lysine residues when compared against existing predictors. The data and software of this work can be acquired from https://github.com/abelavit/Bigram-PGK.
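
The abstract does not spell out how profile bigram occurrences are computed, so the following is only a minimal sketch of one commonly used definition: entry (m, n) sums the product of column m at position i and column n at position i + 1 over consecutive rows of the matrix. The function name `profile_bigram`, the toy matrix and the assumption that the matrix rows are already normalised are illustrative and are not taken from the Bigram-PGK software.

```python
def profile_bigram(pssm):
    """Return a flattened K*K bigram feature vector from an L x K matrix.

    Entry (m, n) sums pssm[i][m] * pssm[i + 1][n] over consecutive rows i.
    Assumption: rows are already normalised (for example to probabilities).
    """
    if not pssm:
        return []
    k = len(pssm[0])
    features = [[0.0] * k for _ in range(k)]
    # Walk over every pair of consecutive sequence positions.
    for row, next_row in zip(pssm, pssm[1:]):
        for m in range(k):
            for n in range(k):
                features[m][n] += row[m] * next_row[n]
    # Flatten row by row so the result can be fed to a classifier.
    return [value for feature_row in features for value in feature_row]


# Toy 3-residue matrix with only 2 columns (a real PSSM has 20 columns).
toy_pssm = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
bigram = profile_bigram(toy_pssm)
```

With a real 20-column PSSM this yields a 400-dimensional vector per sequence segment, regardless of segment length, which is what makes the representation convenient for a fixed-input classifier such as an SVM.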

Predicting Protein Producibility in Filamentous Fungi

10.1101/138560 ◽

2017 ◽

Author(s):

Karmen L Dykstra ◽

Juho Rousu ◽

Mikko Arvas

Keyword(s):

Filamentous Fungi ◽

Predictive Performance ◽

Support Vector ◽

E Coli ◽

Machine Learning Methods ◽

Vector Machines ◽

Production Host ◽

Domain Information ◽

Protein Dataset ◽

Variable Performance

In this paper we study the problem of predicting the producibility of recombinant proteins in filamentous fungi, especially T. reesei, using machine learning methods. We train supervised and semi-supervised support vector machines with protein sequences, represented by their amino acid composition as well as protein family and domain information. Our results indicate, somewhat surprisingly, that only a modest number of proteins with experimental data is required to build a state-of-the-art classifier and that additional unlabeled sequences in semi-supervised models do not increase predictive performance. Our experiments in cross-species prediction show that models trained on the protein dataset of the filamentous fungus A. niger can be generalized to predict protein producibility in T. reesei, and vice versa, without sacrificing too much accuracy, despite approximately 500 million years of divergence between the two species. However, predictors trained on E. coli and S. cerevisiae datasets gave variable performance when applied to the filamentous fungi datasets, indicating that while protein producibility prediction can be generalized across related species, fully generic prediction tools applicable to any protein production host may not be realistic to achieve.
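
As an illustration of the amino acid composition representation mentioned in this abstract, here is a minimal sketch that maps a sequence to a 20-dimensional vector of residue frequencies. The function name `amino_acid_composition`, the alphabetical residue ordering and the choice to skip non-standard symbols are assumptions made for this example, not details from the paper.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        # Assumption: non-standard symbols (X, B, Z, gaps) are ignored.
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


composition = amino_acid_composition("AACD")
```

Family and domain information would typically be appended to this vector as additional features; how the paper encodes them is not stated in the abstract.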

Prediction of post-translational modification sites using multiple kernel support vector machine

PeerJ ◽

10.7717/peerj.3261 ◽

2017 ◽

Vol 5 ◽

pp. e3261 ◽

Cited By ~ 7

Author(s):

BingHua Wang ◽

Minghui Wang ◽

Ao Li

Keyword(s):

Protein Function ◽

Predictive Performance ◽

Computational Prediction ◽

Computational Method ◽

Support Vector ◽

Prediction Methods ◽

Sequence Information ◽

Post Translational Modification ◽

Multiple Kernel ◽

Local Sequence

Protein post-translational modification (PTM) is an important mechanism involved in the regulation of protein function. Because experimental identification is costly and labor-intensive, many computational methods are currently available for predicting PTM sites from local protein sequence information in the context of conserved motifs. Here we propose a novel computational method that combines multiple kernels in a support vector machine (SVM) to predict PTM sites, including phosphorylation, O-linked glycosylation, acetylation, sulfation and nitration. To make full use of local sequence information and site-modification relationships, we developed a local sequence kernel and a Gaussian interaction profile kernel, respectively. The kernels were then combined to train an SVM that leverages the information in each kernel to boost predictive performance. We compared the proposed method with existing PTM prediction methods. The experimental results revealed that the proposed method achieved comparable or better performance than the existing prediction methods, suggesting the feasibility of the developed kernels and the usefulness of the proposed method for PTM site prediction.
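
The abstract does not define the two kernels or how they are weighted, so the sketch below only illustrates the general idea of combining kernels: build one Gram matrix per kernel and take a weighted sum. The position-match kernel is a hypothetical stand-in for the paper's local sequence kernel, and the names `match_kernel`, `gaussian_kernel`, `combine_kernels`, the toy data and the equal weights are all assumptions.

```python
import math


def match_kernel(window_a, window_b):
    """Hypothetical stand-in: fraction of positions with identical residues."""
    matches = sum(1 for a, b in zip(window_a, window_b) if a == b)
    return matches / len(window_a)


def gaussian_kernel(profile_a, profile_b, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||a - b||^2) on numeric profiles."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(profile_a, profile_b))
    return math.exp(-gamma * squared_distance)


def gram_matrix(samples, kernel):
    """Evaluate the kernel on every pair of samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


def combine_kernels(gram_matrices, weights):
    """Weighted sum of Gram matrices; non-negative weights keep it a valid kernel."""
    size = len(gram_matrices[0])
    combined = [[0.0] * size for _ in range(size)]
    for gram, weight in zip(gram_matrices, weights):
        for i in range(size):
            for j in range(size):
                combined[i][j] += weight * gram[i][j]
    return combined


# Toy data: two sequence windows and two made-up interaction profiles.
windows = ["AKS", "AKT"]
profiles = [[1.0, 0.0], [1.0, 0.0]]
combined = combine_kernels(
    [gram_matrix(windows, match_kernel), gram_matrix(profiles, gaussian_kernel)],
    weights=[0.5, 0.5],
)
```

A combined Gram matrix of this kind can be passed to any SVM implementation that accepts precomputed kernels.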

Identification of Protein Pupylation Sites Using Bi-Profile Bayes Feature Extraction and Ensemble Learning

Mathematical Problems in Engineering ◽

10.1155/2013/283129 ◽

2013 ◽

Vol 2013 ◽

pp. 1-7 ◽

Cited By ~ 4

Author(s):

Xiaowei Zhao ◽

Jian Zhang ◽

Qiao Ning ◽

Pingping Sun ◽

Zhiqiang Ma ◽

...

Keyword(s):

Feature Extraction ◽

Correlation Coefficient ◽

Posttranslational Modifications ◽

Protein Identification ◽

Matthews Correlation Coefficient ◽

Predictive Performance ◽

Computational Prediction ◽

Training Dataset ◽

Support Vector ◽

Lysine Residues

Pupylation, one of the most important posttranslational modifications of proteins, typically takes place when prokaryotic ubiquitin-like protein (Pup) is attached to specific lysine residues on a target protein. Identification of pupylation substrates and their corresponding sites will facilitate the understanding of the molecular mechanism of pupylation. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of pupylation sites is highly desirable for its convenience and speed. In this study, a new bioinformatics tool named EnsemblePup was developed that uses an ensemble of support vector machine classifiers to predict pupylation sites. The highlight of EnsemblePup is its use of Bi-profile Bayes feature extraction as the encoding scheme. Using 5-fold cross validation on the training dataset, EnsemblePup achieved a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. The experimental results suggest that EnsemblePup might be useful for identifying and annotating potential pupylation sites in proteins of interest. A web server for predicting pupylation sites was developed.
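
For readers unfamiliar with the measures quoted above, the following sketch computes sensitivity, specificity, accuracy and the Matthews correlation coefficient from the four cells of a binary confusion matrix, using their standard definitions. The function name `classification_metrics` and the counts in the example are made up for illustration and are not results from the paper.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Standard binary metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Illustrative counts only: 10 positive sites and 10 negative sites.
metrics = classification_metrics(tp=8, tn=9, fp=1, fn=2)
```

Because accuracy is a weighted average of sensitivity and specificity, it always lies between the two when computed on the same set of predictions.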

Predicting the oligomeric states of fluorescent proteins

10.7287/peerj.preprints.922 ◽

2015 ◽

Author(s):

Saw Simeon ◽

Watshara Shoombuatong ◽

Likit Preeyanon ◽

Virapong Prachayasittikul ◽

Chanin Nantasenamat

Keyword(s):

Amino Acid ◽

Fluorescent Proteins ◽

Computational Prediction ◽

Amino Acid Sequences ◽

Machine Learning Algorithms ◽

Support Vector ◽

Dipeptide Composition ◽

Network Support ◽

Protein Tagging ◽

Fold Cross Validation

Currently, monomeric fluorescent proteins (FPs) are ideal markers for protein tagging. The prediction of oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate protein engineering efforts to create monomeric FPs by saving time and money. To the best of our knowledge, this study represents the first computational model for predicting and analyzing FP oligomerization directly from amino acid sequences. An exhaustive dataset consisting of 397 unique FP oligomeric states was compiled from the literature. FPs were described by 3 classes of protein descriptors: amino acid composition, dipeptide composition and physicochemical properties. The oligomeric states of FPs were predicted using a decision tree (DT) algorithm, and the results demonstrated that DT provided robust performance, with accuracies in the ranges of 79.97-81.72% and 80.76-82.63% for the internal (e.g. 10-fold cross-validation) and external sets, respectively. This approach was also benchmarked against other common machine learning algorithms such as artificial neural network, support vector machine and random forest. A thorough analysis of amino acid sequence features was conducted to provide informative insights into FP oligomerization, which may aid in engineering novel monomeric fluorescent proteins. The following differentiating characteristics of monomeric and oligomeric fluorescent proteins were derived from DT: (i) substitution of any amino acid to Glu led to the reduction of aggregated proteins and (ii) oligomerization of FPs appears to be stabilized by several hydrophobic contacts. Datasets and R source code are available at http://dx.doi.org/10.6084/m9.figshare.1348575.
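
Dipeptide composition, one of the three descriptor classes named in this abstract, is usually defined as the relative frequency of each of the 400 ordered residue pairs among adjacent positions. The sketch below follows that usual definition; the function name `dipeptide_composition` and the handling of non-standard residues are assumptions, and the authors' own implementation is in R rather than Python.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so feature positions are stable.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return the relative frequency of each adjacent residue pair."""
    sequence = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        # Assumption: pairs containing non-standard symbols are skipped.
        if pair in counts:
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[pair] / total for pair in DIPEPTIDES]


features = dipeptide_composition("AKAK")
```

Unlike plain amino acid composition, this descriptor retains some information about local residue order, at the cost of a much sparser 400-dimensional vector.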

iPVP-MCV: A Multi-Classifier Voting Model for the Accurate Identification of Phage Virion Proteins

Symmetry ◽

10.3390/sym13081506 ◽

2021 ◽

Vol 13 (8) ◽

pp. 1506

Author(s):

Haitao Han ◽

Wenhong Zhu ◽

Chenchen Ding ◽

Taigang Liu

Keyword(s):

Predictive Performance ◽

Amino Acid Sequences ◽

Majority Voting ◽

Support Vector ◽

Icosahedral Symmetry ◽

Structural Protein ◽

Virion Protein ◽

Accurate Identification ◽

Evolutionary Features ◽

Voting Model

The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. The phage virion protein (PVP), a type of bacteriophage structural protein, is an essential component of infectious viral particles and is responsible for multiple biological functions. Accurate identification of PVPs is of great significance for understanding the interaction between phages and host bacteria and for developing new antimicrobial drugs or antibiotics. However, traditional experimental approaches for identifying PVPs are often time-consuming and laborious. Therefore, computational methods that can efficiently and accurately identify PVPs are desirable. In this study, we propose a multi-classifier voting model called iPVP-MCV to improve the prediction of PVPs from their amino acid sequences. First, three types of evolutionary features were extracted from position-specific scoring matrix (PSSM) profiles to represent PVPs and non-PVPs. Then, a set of baseline models was trained using the support vector machine (SVM) algorithm, one for each type of feature descriptor. Finally, the outputs of these baseline models were integrated by majority voting to construct the proposed method, iPVP-MCV. Our results demonstrated that the proposed iPVP-MCV model was superior to existing methods in a rigorous independent dataset test.
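
The majority voting step described above can be sketched as follows: each baseline model contributes one predicted label per protein, and the most frequent label wins. The function name `majority_vote` and the toy predictions are illustrative assumptions; the abstract does not say how ties would be broken, which cannot occur with three models and two classes.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote."""
    voted = []
    # zip(*...) groups the votes that all models cast for the same sample.
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        voted.append(label)
    return voted


# Toy predictions from three hypothetical baseline models (1 = PVP, 0 = non-PVP).
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]
ensemble = majority_vote([model_a, model_b, model_c])
```

Using an odd number of baseline models, as with the three feature types here, guarantees a strict majority for every binary prediction.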

Predicting the oligomeric states of fluorescent proteins

10.7287/peerj.preprints.922v1 ◽

2015 ◽

Author(s):

Saw Simeon ◽

Watshara Shoombuatong ◽

Likit Preeyanon ◽

Virapong Prachayasittikul ◽

Chanin Nantasenamat

Keyword(s):

Amino Acid ◽

Fluorescent Proteins ◽

Computational Prediction ◽

Amino Acid Sequences ◽

Machine Learning Algorithms ◽

Support Vector ◽

Dipeptide Composition ◽

Network Support ◽

Protein Tagging ◽

Fold Cross Validation

Currently, monomeric fluorescent proteins (FPs) are ideal markers for protein tagging. The prediction of oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate protein engineering efforts to create monomeric FPs by saving time and money. To the best of our knowledge, this study represents the first computational model for predicting and analyzing FP oligomerization directly from amino acid sequences. An exhaustive dataset consisting of 397 unique FP oligomeric states was compiled from the literature. FPs were described by 3 classes of protein descriptors: amino acid composition, dipeptide composition and physicochemical properties. The oligomeric states of FPs were predicted using a decision tree (DT) algorithm, and the results demonstrated that DT provided robust performance, with accuracies in the ranges of 79.97-81.72% and 80.76-82.63% for the internal (e.g. 10-fold cross-validation) and external sets, respectively. This approach was also benchmarked against other common machine learning algorithms such as artificial neural network, support vector machine and random forest. A thorough analysis of amino acid sequence features was conducted to provide informative insights into FP oligomerization, which may aid in engineering novel monomeric fluorescent proteins. The following differentiating characteristics of monomeric and oligomeric fluorescent proteins were derived from DT: (i) substitution of any amino acid to Glu led to the reduction of aggregated proteins and (ii) oligomerization of FPs appears to be stabilized by several hydrophobic contacts. Datasets and R source code are available at http://dx.doi.org/10.6084/m9.figshare.1348575.

PredHydroxy: computational prediction of protein hydroxylation site locations based on the primary structure

Molecular BioSystems ◽

10.1039/c4mb00646a ◽

2015 ◽

Vol 11 (3) ◽

pp. 819-825 ◽

Cited By ~ 16

Author(s):

Shao-Ping Shi ◽

Xiang Chen ◽

Hao-Dong Xu ◽

Jian-Ding Qiu

Keyword(s):

Amino Acids ◽

Support Vector Machines ◽

Primary Structure ◽

Computational Prediction ◽

Support Vector ◽

Quality Indices ◽

High Quality ◽

Vector Machines ◽

Amino Acids Composition

A predictor PredHydroxy, based on position weight amino acids composition, 8 high-quality indices and support vector machines, is designed to identify hydroxyproline and hydroxylysine sites.

Based on 9-gram Coding of Amino Acids Predicting Proteases Types by Using Support Vector Machine

Recent Patents on Computer Science ◽

10.2174/2213275911205030220 ◽

2012 ◽

Vol 5 (3) ◽

pp. 220-225 ◽

Cited By ~ 1

Author(s):

Cunshuan Xu ◽

Ruijia Shi

Keyword(s):

Amino Acids ◽

Support Vector Machine ◽

Support Vector

Prediction of Incident Cancers in the Lifelines Population-Based Cohort

Cancers ◽

10.3390/cancers13092133 ◽

2021 ◽

Vol 13 (9) ◽

pp. 2133

Author(s):

Francisco O. Cortés-Ibañez ◽

Sunil Belur Nagaraj ◽

Ludo Cornelissen ◽

Gerjan J. Navis ◽

Bert van der Vegt ◽

...

Keyword(s):

Cancer Incidence ◽

Binary Classification ◽

Predictive Performance ◽

Population Based ◽

Support Vector ◽

Clinical Variables ◽

Incident Cancer ◽

History Of ◽

Diagnosis Of Cancer ◽

Auc Value

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) <0.75 was obtained; the highest AUC value was for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machines AUC = 0.83 (95% CI 0.78–0.88)), and age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance of incident cancers in the Lifelines cohort.
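
The AUC values quoted above can be read as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case. The sketch below computes AUC directly from that pairwise definition, counting ties as half; the function name `roc_auc` and the toy labels and scores are illustrative and unrelated to the Lifelines data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for positive_score in positives:
        for negative_score in negatives:
            if positive_score > negative_score:
                wins += 1.0
            elif positive_score == negative_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy example: 2 cases (label 1) and 2 non-cases (label 0).
auc = roc_auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
```

This quadratic-time form is meant for clarity; for cohorts with more than a hundred thousand participants a rank-based computation would be the practical choice.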
