Incorporating Amino Acids Composition and Functional Domains for Identifying Bacterial Toxin Proteins

2014 ◽  
Vol 2014 ◽  
pp. 1-7 ◽  
Author(s):  
Min-Gang Su ◽  
Chien-Hsun Huang ◽  
Tzong-Yi Lee ◽  
Yu-Ju Chen ◽  
Hsin-Yi Wu

Aside from their role in pathogenesis, bacterial toxins have also been used for medical purposes, such as drugs for cancer and immune diseases. Correctly identifying bacterial toxins and their types (endotoxins and exotoxins) has a great impact on cell biology research and therapy development. However, experimental methods for identifying bacterial toxins are time-consuming and labor-intensive, implying an urgent need for computational prediction. We are therefore motivated to develop a method for the computational identification of bacterial toxins based on amino acid sequences and functional domain information. In this study, a nonredundant dataset of 167 bacterial toxins, comprising 77 exotoxins and 90 endotoxins, is used to train predictive models with support vector machines (SVMs). Cross-validation shows that the SVM models trained with amino acid composition and dipeptide composition yield accuracies of 96.07% and 92.50%, respectively. For discriminating endotoxins from exotoxins, the SVM models trained with amino acid composition and dipeptide composition achieve accuracies of 95.71% and 92.86%, respectively. After incorporating functional domain information, the predictive performance is further improved. On an independent dataset, the proposed method identifies and classifies bacterial toxins more effectively than either of the other two feature sets used alone, which may aid in bacterial biomedical development.
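The two sequence-derived feature sets named in this abstract can be illustrated with a minimal sketch. It assumes the usual definitions (fraction of each of the 20 standard residues, and fraction of each of the 400 ordered residue pairs); the function names and the handling of non-standard residues are choices made here for illustration, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 ordered pairs


def _clean(seq):
    # Assumption: residues outside the 20 standard letters are simply dropped.
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def amino_acid_composition(seq):
    """Return 20 values: the fraction of each standard residue in seq."""
    seq = _clean(seq)
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return 400 values: the fraction of each ordered residue pair in seq."""
    seq = _clean(seq)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)  # number of overlapping pairs
    return [counts[d] / total for d in DIPEPTIDES]


aac = amino_acid_composition("AACD")
dpc = dipeptide_composition("AACD")
```

Either vector could then be passed to an SVM implementation as one row of the feature matrix; how functional domain information is encoded and combined with these vectors is not specified in the abstract and is left out here.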

2019 ◽  
Vol 20 (S2) ◽  
Author(s):  
Abel Chandra ◽  
Alok Sharma ◽  
Abdollah Dehzangi ◽  
Daichi Shigemizu ◽  
Tatsuhiko Tsunoda

Background: Post-translational modification (PTM) is a biological process whereby proteomes are modified, which affects normal cell biology and hence pathogenesis. A number of PTMs have been discovered in recent years, and lysine phosphoglycerylation is one of the fairly recent discoveries. Even with a large number of proteins being sequenced in the post-genomic era, the identification of phosphoglycerylation remains a big challenge due to the cost, time consumption and inefficiency of experimental efforts. To overcome this issue, computational techniques have emerged to identify phosphoglycerylated lysine residues. However, the computational techniques proposed so far are limited in their ability to correctly predict this covalent modification. Results: We propose a new predictor called Bigram-PGK, which uses evolutionary information of amino acids to predict phosphoglycerylated sites. A benchmark dataset containing experimentally labelled sites is employed for this purpose, and profile bigram occurrences are calculated from position-specific scoring matrices of amino acids in the protein sequences. The sensitivity, specificity, precision, accuracy, Matthews correlation coefficient and area under the ROC curve of this work are reported to be 0.9642, 0.8973, 0.8253, 0.9193, 0.8330 and 0.9306, respectively. Conclusions: The proposed predictor, based on evolutionary information features and a support vector machine classifier, has shown great potential to effectively predict phosphoglycerylated and non-phosphoglycerylated lysine residues when compared against existing predictors. The data and software of this work can be acquired from https://github.com/abelavit/Bigram-PGK.
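As a rough illustration of a profile bigram feature, the sketch below uses one common formulation: for a profile with one row per sequence position and one column per amino acid, entry (m, n) of the bigram matrix sums the product of column m at position i and column n at position i+1. Whether Bigram-PGK normalises the PSSM first, or restricts the calculation to a window around each lysine, is not stated in the abstract, so those details are not modelled; the function name is hypothetical.

```python
def profile_bigram(pssm):
    """Flattened bigram matrix of a profile.

    pssm: list of rows, one per sequence position, each with the same
          number of columns (20 for a real PSSM).
    Returns a list of n_cols * n_cols values where
    B[m][n] = sum over i of pssm[i][m] * pssm[i + 1][n].
    """
    n_cols = len(pssm[0]) if pssm else 0
    bigram = [[0.0] * n_cols for _ in range(n_cols)]
    # Walk over every pair of consecutive positions in the profile.
    for current, following in zip(pssm, pssm[1:]):
        for m, p_m in enumerate(current):
            if p_m == 0:
                continue  # zero entries contribute nothing
            for n, p_n in enumerate(following):
                bigram[m][n] += p_m * p_n
    return [value for row in bigram for value in row]


# Toy 3-column profile (a real PSSM has 20 columns).
toy = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
features = profile_bigram(toy)
```

For a 20-column profile this gives a fixed-length vector of 400 values regardless of sequence length, which is what makes it convenient as SVM input.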


2017 ◽  
Author(s):  
Karmen L Dykstra ◽  
Juho Rousu ◽  
Mikko Arvas

In this paper we study the problem of predicting the producibility of recombinant proteins in filamentous fungi, especially T. reesei, using machine learning methods. We train supervised and semi-supervised support vector machines with protein sequences, represented by their amino acid composition as well as protein family and domain information. Our results indicate, somewhat surprisingly, that only a modest number of proteins with experimental data is required to build a state-of-the-art classifier, and that additional unlabeled sequences in semi-supervised models do not improve predictive performance. Our experiments in cross-species prediction show that models trained on the protein dataset of the filamentous fungus A. niger can be generalized to predict protein producibility in T. reesei, and vice versa, without sacrificing too much accuracy, despite approximately 500 million years of divergence. However, predictors trained on E. coli and S. cerevisiae datasets gave variable performance when applied to the filamentous fungi datasets, indicating that while protein producibility prediction can be generalized across related species, fully generic prediction tools applicable to any protein production host may not be realistic to achieve.


PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e3261 ◽  
Author(s):  
BingHua Wang ◽  
Minghui Wang ◽  
Ao Li

Protein post-translational modification (PTM) is an important mechanism involved in the regulation of protein function. Given the high cost and labor-intensive nature of experimental identification, many computational methods are currently available for predicting PTM sites by using protein local sequence information in the context of conserved motifs. Here we propose a novel computational method that combines multiple kernels in support vector machines (SVMs) to predict PTM sites including phosphorylation, O-linked glycosylation, acetylation, sulfation and nitration. To make the most of local sequence information and site-modification relationships, we developed a local sequence kernel and a Gaussian interaction profile kernel, respectively. The multiple kernels were then combined to train an SVM that leverages the kernel information to boost predictive performance. We compared the proposed method with existing PTM prediction methods. The experimental results revealed that the proposed method achieved comparable or better performance than the existing prediction methods, suggesting the feasibility of the developed kernels and the usefulness of the proposed method in PTM site prediction.
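The idea of combining kernels can be sketched as a weighted sum of kernel matrices. The abstract does not define either kernel, so both functions below are stand-ins chosen for illustration: `window_identity_kernel` (fraction of matching positions between two equal-length sequence windows) is a hypothetical placeholder for the local sequence kernel, and `gaussian_profile_kernel` is a standard Gaussian (RBF) kernel applied to per-site profile vectors. The weights and `gamma` are likewise assumptions.

```python
import math


def window_identity_kernel(a, b):
    """Stand-in sequence kernel: fraction of positions where two windows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def gaussian_profile_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2) on two profile vectors."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.exp(-gamma * squared_distance)


def kernel_matrix(items, kernel):
    """Evaluate a kernel on every pair of items."""
    return [[kernel(a, b) for b in items] for a in items]


def combine_kernels(matrices, weights):
    """Weighted sum of equally sized kernel matrices."""
    n = len(matrices[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


windows = ["ACD", "ACE"]            # toy sequence windows around two sites
profiles = [[1.0, 0.0], [1.0, 0.0]]  # toy interaction profiles for the same sites

k_seq = kernel_matrix(windows, window_identity_kernel)
k_gip = kernel_matrix(profiles, gaussian_profile_kernel)
k_combined = combine_kernels([k_seq, k_gip], weights=[0.5, 0.5])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the combined matrix can be handed to any SVM implementation that accepts a precomputed kernel, for example scikit-learn's `SVC(kernel="precomputed")`.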


2013 ◽  
Vol 2013 ◽  
pp. 1-7 ◽  
Author(s):  
Xiaowei Zhao ◽  
Jian Zhang ◽  
Qiao Ning ◽  
Pingping Sun ◽  
Zhiqiang Ma ◽  
...  

Pupylation, one of the most important posttranslational modifications of proteins, typically takes place when prokaryotic ubiquitin-like protein (Pup) is attached to specific lysine residues on a target protein. Identification of pupylation substrates and their corresponding sites will facilitate the understanding of the molecular mechanism of pupylation. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of pupylation sites is highly desirable for its convenience and speed. In this study, a new bioinformatics tool named EnsemblePup was developed that uses an ensemble of support vector machine classifiers to predict pupylation sites. The highlight of EnsemblePup is its use of Bi-profile Bayes feature extraction as the encoding scheme. In 5-fold cross-validation on the training dataset, EnsemblePup achieved a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. The experimental results suggest that EnsemblePup might be useful for identifying and annotating potential pupylation sites in proteins of interest. A web server for predicting pupylation sites was developed.
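The four measures quoted in this abstract all derive from a binary confusion matrix. The sketch below uses the standard definitions; the function name and the example counts are illustrative only and are not taken from the paper.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    total = tp + fn + tn + fp
    accuracy = (tp + tn) / total if total else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Made-up counts: 10 positive sites and 10 negative sites.
metrics = binary_metrics(tp=8, fn=2, tn=9, fp=1)
```

When all measures come from the same confusion matrix, accuracy is a weighted average of sensitivity and specificity (weighted by the share of positive and negative samples), so it falls between the two.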


2015 ◽  
Author(s):  
Saw Simeon ◽  
Watshara Shoombuatong ◽  
Likit Preeyanon ◽  
Virapong Prachayasittikul ◽  
Chanin Nantasenamat

Currently, monomeric fluorescent proteins (FPs) are ideal markers for protein tagging. The prediction of oligomeric states is helpful for enhancing live biomedical imaging. Computational prediction of FP oligomeric states can accelerate protein engineering efforts to create monomeric FPs by saving time and money. To the best of our knowledge, this study represents the first computational model for predicting and analyzing FP oligomerization directly from amino acid sequences. An exhaustive dataset consisting of 397 unique FP oligomeric states was compiled from the literature. FPs were described by three classes of protein descriptors: amino acid composition, dipeptide composition and physicochemical properties. The oligomeric states of FPs were predicted using the decision tree (DT) algorithm, and the results demonstrated that DT provided robust performance, with accuracies in the ranges of 79.97-81.72% and 80.76-82.63% for the internal (e.g. 10-fold cross-validation) and external sets, respectively. This approach was also benchmarked against other common machine learning algorithms such as artificial neural network, support vector machine and random forest. A thorough analysis of amino acid sequence features was conducted to provide informative insights into FP oligomerization, which may aid in engineering novel monomeric fluorescent proteins. The following differentiating characteristics of monomeric and oligomeric fluorescent proteins were derived from DT: (i) substitution of any amino acid with Glu led to a reduction in aggregated proteins and (ii) oligomerization of FPs appears to be stabilized by several hydrophobic contacts. Datasets and R source code are available at http://dx.doi.org/10.6084/m9.figshare.1348575.


Symmetry ◽  
2021 ◽  
Vol 13 (8) ◽  
pp. 1506
Author(s):  
Haitao Han ◽  
Wenhong Zhu ◽  
Chenchen Ding ◽  
Taigang Liu

The classic structure of a bacteriophage is commonly characterized by complex symmetry. The head of the structure features icosahedral symmetry, whereas the tail features helical symmetry. The phage virion protein (PVP), a type of bacteriophage structural protein, is an essential component of the infectious viral particles and is responsible for multiple biological functions. Accurate identification of PVPs is of great significance for comprehending the interaction between phages and host bacteria and for developing new antimicrobial drugs or antibiotics. However, traditional experimental approaches for identifying PVPs are often time-consuming and laborious. Therefore, the development of computational methods that can efficiently and accurately identify PVPs is desired. In this study, we proposed a multi-classifier voting model called iPVP-MCV to enhance the predictive performance for PVPs based on their amino acid sequences. First, three types of evolutionary features were extracted from the position-specific scoring matrix (PSSM) profiles to represent PVPs and non-PVPs. Then, a set of baseline models was trained using the support vector machine (SVM) algorithm combined with each type of feature descriptor. Finally, the outputs of these baseline models were integrated by majority voting to construct the proposed method, iPVP-MCV. Our results demonstrated that the proposed iPVP-MCV model was superior to existing methods in a rigorous independent dataset test.
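The final integration step described in this abstract, majority voting over baseline model outputs, can be sketched as follows. The code assumes each baseline model has already produced one hard label per sample; the function name and the toy predictions are illustrative and not taken from iPVP-MCV.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one voted label per sample.

    predictions: list of lists, one inner list per model, all the same length.
    With an odd number of binary classifiers a tie cannot occur; otherwise
    the label that was counted first among the tied labels is returned.
    """
    voted = []
    for labels in zip(*predictions):  # labels from every model for one sample
        label, _count = Counter(labels).most_common(1)[0]
        voted.append(label)
    return voted


# Toy outputs of three baseline models on four samples (1 = PVP, 0 = non-PVP).
model_outputs = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]
ensemble_labels = majority_vote(model_outputs)
```

Using three baseline models, one per feature type, keeps the vote free of ties for a two-class problem.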


2015 ◽  
Vol 11 (3) ◽  
pp. 819-825 ◽  
Author(s):  
Shao-Ping Shi ◽  
Xiang Chen ◽  
Hao-Dong Xu ◽  
Jian-Ding Qiu

A predictor PredHydroxy, based on position weight amino acids composition, 8 high-quality indices and support vector machines, is designed to identify hydroxyproline and hydroxylysine sites.


Cancers ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 2133
Author(s):  
Francisco O. Cortés-Ibañez ◽  
Sunil Belur Nagaraj ◽  
Ludo Cornelissen ◽  
Gerjan J. Navis ◽  
Bert van der Vegt ◽  
...  

Cancer incidence is rising, and accurate prediction of incident cancers could be relevant to understanding and reducing cancer incidence. The aim of this study was to develop machine learning (ML) models that could predict an incident diagnosis of cancer. Participants without any history of cancer within the Lifelines population-based cohort were followed for a median of 7 years. Data were available for 116,188 cancer-free participants and 4232 incident cancer cases. At baseline, socioeconomic, lifestyle, and clinical variables were assessed. The main outcome was an incident cancer during follow-up (excluding skin cancer), based on linkage with the national pathology registry. The performance of three ML algorithms was evaluated using supervised binary classification to identify incident cancers among participants. Elastic net regularization and the Gini index were used for variable selection. An overall area under the receiver operating characteristic curve (AUC) <0.75 was obtained; the highest AUC values were for prostate cancer (random forest AUC = 0.82 (95% CI 0.77–0.87), logistic regression AUC = 0.81 (95% CI 0.76–0.86), and support vector machines AUC = 0.83 (95% CI 0.78–0.88)); age was the most important predictor in these models. Linear and non-linear ML algorithms including socioeconomic, lifestyle, and clinical variables produced a moderate predictive performance for incident cancers in the Lifelines cohort.
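For readers unfamiliar with the AUC figures quoted above, the measure can be computed directly from its pairwise definition: the probability that a randomly chosen case receives a higher score than a randomly chosen non-case, with ties counted as half. The sketch below is a plain illustration of that definition on made-up scores; it is not the evaluation code of the study, and it does not produce confidence intervals.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for cases, 0 for non-cases.
    scores: model scores, higher meaning more likely to be a case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one case and one non-case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # case ranked above non-case
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of samples, so for a cohort of this size a rank-based implementation from a statistics library would be the practical choice.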

