Identification of Protein Pupylation Sites Using Bi-Profile Bayes Feature Extraction and Ensemble Learning

Mathematical Problems in Engineering ◽

10.1155/2013/283129 ◽

2013 ◽

Vol 2013 ◽

pp. 1-7 ◽

Cited By ~ 4

Author(s):

Xiaowei Zhao ◽

Jian Zhang ◽

Qiao Ning ◽

Pingping Sun ◽

Zhiqiang Ma ◽

...

Keyword(s):

Feature Extraction ◽

Correlation Coefficient ◽

Posttranslational Modifications ◽

Protein Identification ◽

Matthews Correlation Coefficient ◽

Predictive Performance ◽

Computational Prediction ◽

Training Dataset ◽

Support Vector ◽

Lysine Residues

Pupylation, one of the most important posttranslational modifications of proteins, typically takes place when prokaryotic ubiquitin-like protein (Pup) is attached to specific lysine residues on a target protein. Identification of pupylation substrates and their corresponding sites will facilitate the understanding of the molecular mechanism of pupylation. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of pupylation sites is highly desirable for its convenience and speed. In this study, a new bioinformatics tool named EnsemblePup was developed that uses an ensemble of support vector machine classifiers to predict pupylation sites. The key feature of EnsemblePup is its use of Bi-profile Bayes feature extraction as the encoding scheme. In 5-fold cross-validation on the training dataset, EnsemblePup achieved a sensitivity of 79.49%, a specificity of 82.35%, an accuracy of 85.43%, and a Matthews correlation coefficient of 0.617. When compared with other existing methods on a benchmark dataset, EnsemblePup provided better predictive performance, with a sensitivity of 80.00%, a specificity of 83.33%, an accuracy of 82.00%, and a Matthews correlation coefficient of 0.629. These results suggest that EnsemblePup might be useful for identifying and annotating potential pupylation sites in proteins of interest. A web server for predicting pupylation sites was also developed.
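
The four measures quoted in this abstract all derive from the confusion matrix of a binary classifier. The following is a minimal sketch of those standard definitions, not the authors' code. The function name `binary_metrics` and the counts are hypothetical: the abstract does not report the underlying counts, and the values below were chosen only because they are consistent with the benchmark figures quoted above.

```python
import math


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical counts (assumption): 20 positive and 30 negative sites.
m = binary_metrics(tp=16, fp=5, tn=25, fn=4)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these assumed counts the sketch gives a sensitivity of 0.80, a specificity of about 0.8333, an accuracy of 0.82, and an MCC of about 0.629.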


Detection of misinformation on garlic and COVID-19 in Twitter: A machine learning-based approach (Preprint)

10.2196/preprints.33056 ◽

2021 ◽

Author(s):

Myeong Gyu Kim ◽

Jae Hyun Kim ◽

Kyungim Kim

Keyword(s):

Machine Learning ◽

Social Media ◽

Latent Dirichlet Allocation ◽

Predictive Performance ◽

Machine Learning Algorithms ◽

Training Dataset ◽

Polynomial Kernel ◽

Support Vector ◽

Accurate Information ◽

Probability Number

BACKGROUND Garlic-related misinformation is prevalent whenever a virus outbreak occurs. With the outbreak of coronavirus disease 2019 (COVID-19), garlic-related misinformation is again spreading through social media sites, including Twitter. Machine learning-based approaches can be used to detect misinformation in large volumes of tweets. OBJECTIVE This study aimed to develop machine learning algorithms for detecting misinformation on garlic and COVID-19 on Twitter. METHODS This study used 5,929 original tweets mentioning garlic and COVID-19. Tweets were manually labeled as misinformation, accurate information, or others. We tested the following algorithms: k-nearest neighbors; random forest; support vector machine (SVM) with linear, radial, and polynomial kernels; and neural network. Features for machine learning included user-based features (verified account, user type, number of followers, and follower rate) and text-based features (uniform resource locator, negation, sentiment score, Latent Dirichlet Allocation topic probability, number of retweets, and number of favorites). The model with the highest accuracy on the training dataset (70% of the overall dataset) was tested on a test dataset (30% of the overall dataset). Predictive performance was measured using overall accuracy, sensitivity, specificity, and balanced accuracy. RESULTS The SVM with the polynomial kernel showed the highest accuracy of 0.670. The model also showed a balanced accuracy of 0.757, sensitivity of 0.819, and specificity of 0.696 for misinformation. Important features in the misinformation and accurate information classes included topic 4 (common myths), topic 13 (garlic-specific myths), number of followers, topic 11 (misinformation on social media), and follower rate. Topic 3 (cooking recipes) was the most important feature in the others class. CONCLUSIONS Our SVM model showed good performance in detecting misinformation. The results of our study will help detect misinformation related to garlic and COVID-19. It could also be applied to prevent misinformation related to dietary supplements in the event of a future outbreak of a disease other than COVID-19.
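
The balanced accuracy reported for the misinformation class can be checked against the reported sensitivity and specificity. The sketch below assumes the common per-class definition, the mean of sensitivity and specificity; the abstract does not state which definition the authors used, and the function name is hypothetical.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Per-class balanced accuracy as the mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Values quoted in the abstract for the misinformation class.
value = balanced_accuracy(0.819, 0.696)
print(round(value, 4))
```

Under that assumption the result is about 0.7575, which agrees with the reported 0.757 up to rounding of the inputs.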


Machine Learning Readmission Risk Modeling: A Pediatric Case Study

BioMed Research International ◽

10.1155/2019/8532892 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9 ◽

Cited By ~ 3

Author(s):

Patricio Wolff ◽

Manuel Graña ◽

Sebastián A. Ríos ◽

Maria Begoña Yarza

Keyword(s):

Machine Learning ◽

Multilayer Perceptron ◽

Naive Bayes ◽

Class Imbalance ◽

Predictive Performance ◽

Naïve Bayes ◽

Distribution Model ◽

Training Dataset ◽

Support Vector ◽

Pediatric Hospital

Background. Hospital readmission prediction in pediatric hospitals has received little attention. Studies have focused on readmission frequency analysis stratified by disease and demographic/geographic characteristics, but there are no predictive modeling approaches, which may be useful to identify preventable readmissions that constitute a major portion of the cost attributed to readmissions. Objective. To assess the all-cause readmission predictive performance achieved by machine learning techniques in the emergency department of a pediatric hospital in Santiago, Chile. Materials. An all-cause admissions dataset was collected over six consecutive years in a pediatric hospital in Santiago, Chile. The variables collected are the same as those used to determine the administrative cost of the child's treatment. Methods. Retrospective predictive analysis of 30-day readmission was formulated as a binary classification problem. We report classification results achieved with various model building approaches after data curation and preprocessing for correction of class imbalance. We compute repeated cross-validation (RCV) with a decreasing number of folds to assess performance and sensitivity to the effect of imbalance in the test set and to training set size. Results. The increase in recall due to SMOTE class imbalance correction is large and statistically significant. The Naive Bayes (NB) approach achieves the best AUC (0.65); however, the shallow multilayer perceptron has the best PPV and f-score (5.6 and 10.2, respectively). NB and support vector machines (SVM) give comparable results if we consider AUC, PPV, and f-score ranking for all RCV experiments. The high recall of the deep multilayer perceptron is due to a high false positive ratio. There is no detectable effect of the number of folds in the RCV on the predictive performance of the algorithms. Conclusions. We recommend the use of Naive Bayes (NB) with a Gaussian distribution model as the most robust modeling approach for pediatric readmission prediction, achieving the best results across all training dataset sizes. The results show that the approach could be applied to detect preventable readmissions.


Prediction of novel mouse TLR9 agonists using a random forest approach

BMC Molecular and Cell Biology ◽

10.1186/s12860-019-0241-0 ◽

2019 ◽

Vol 20 (S2) ◽

Author(s):

Varun Khanna ◽

Lei Li ◽

Johnson Fung ◽

Shoba Ranganathan ◽

Nikolai Petrovsky

Keyword(s):

Machine Learning ◽

Random Forest ◽

Correlation Coefficient ◽

Matthews Correlation Coefficient ◽

Learning Algorithms ◽

Ensemble Classifier ◽

Innate Immune ◽

Machine Learning Algorithms ◽

Support Vector ◽

Random Forest Algorithm

Background Toll-like receptor 9 is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODN) containing unmethylated cytosine-guanine (CpG) motifs. Due to the considerable number of rotatable bonds in ODNs, high-throughput in silico screening for potential TLR9 activity via traditional structure-based virtual screening approaches of CpG ODNs is challenging. In the current study, we present a machine learning based method for predicting novel mouse TLR9 (mTLR9) agonists based on features including count and position of motifs, the distance between the motifs, and graphically derived features such as the radius of gyration and moment of inertia. We employed an in-house experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling. Results Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed other algorithms for TLR9 activity prediction on our dataset. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier in test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including 'CC', 'GG', 'AG', 'CCCG' and 'CGGC' were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay, with 91 of the 100 selected ODNs showing high activity, confirming the accuracy of the model in predicting mTLR9 activity. Conclusion We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
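
Repeated random down-sampling, as described in this abstract, can be sketched as follows. This is an illustrative sketch only: the function names, the class sizes, and the majority-vote combination are assumptions, since the abstract does not say how the 20 models' outputs were combined or what the class ratio was. The base learner is omitted so the example stays in the standard library; each index subset would be used to train one model.

```python
import random
from collections import Counter


def downsampled_subsets(labels, n_models=20, seed=0):
    """Yield index lists in which every class is cut to the minority-class size."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    n_min = min(len(indices) for indices in by_class.values())
    for _ in range(n_models):
        subset = []
        for indices in by_class.values():
            # Sample without replacement within each class.
            subset.extend(rng.sample(indices, n_min))
        yield sorted(subset)


def majority_vote(predictions):
    """Combine the per-model predictions for one sample (assumed rule)."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical imbalanced labels: 10 actives and 90 inactives.
labels = [1] * 10 + [0] * 90
subsets = list(downsampled_subsets(labels, n_models=20))
print(len(subsets), len(subsets[0]))
print(majority_vote([1, 0, 1, 1, 0]))
```

Each subset is balanced, so every model in the ensemble sees all minority samples plus a different random draw of the majority class.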


Incorporating Amino Acids Composition and Functional Domains for Identifying Bacterial Toxin Proteins

BioMed Research International ◽

10.1155/2014/972692 ◽

2014 ◽

Vol 2014 ◽

pp. 1-7 ◽

Cited By ~ 2

Author(s):

Min-Gang Su ◽

Chien-Hsun Huang ◽

Tzong-Yi Lee ◽

Yu-Ju Chen ◽

Hsin-Yi Wu

Keyword(s):

Amino Acids ◽

Cell Biology ◽

Predictive Performance ◽

Computational Prediction ◽

Amino Acid Sequences ◽

Bacterial Toxins ◽

Bacterial Toxin ◽

Support Vector ◽

Functional Domain ◽

Domain Information

Aside from their role in pathogenesis, bacterial toxins have also been used for medical purposes, such as drugs for cancer and immune diseases. Correctly identifying bacterial toxins and their types (endotoxins and exotoxins) has great impact on cell biology research and therapy development. However, experimental methods for identifying bacterial toxins are time-consuming and labor-intensive, implying an urgent need for computational prediction. Thus, we are motivated to develop a method for computational identification of bacterial toxins based on amino acid sequences and functional domain information. In this study, a nonredundant dataset of 167 bacterial toxins, including 77 exotoxins and 90 endotoxins, is used to train the predictive model with support vector machines (SVMs). The cross-validation evaluation shows that the SVM models trained with amino acid and dipeptide composition yield accuracies of 96.07% and 92.50%, respectively. For discriminating endotoxins from exotoxins, the SVM models trained with amino acid and dipeptide composition achieve accuracies of 95.71% and 92.86%, respectively. After incorporating functional domain information, the predictive performance is further improved. On an independent dataset, the proposed method identifies and classifies bacterial toxins more effectively than the other two features, which may aid in bacterial biomedical development.
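
Amino acid composition and dipeptide composition are simple frequency encodings of a protein sequence. The sketch below shows the usual definitions (20 and 400 features, respectively). It is not the authors' implementation, and the function names and example sequence are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq: str) -> list:
    """Fraction of each standard residue in the sequence (20 values)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def dipeptide_composition(seq: str) -> list:
    """Fraction of each ordered residue pair among adjacent pairs (400 values)."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must have at least two residues")
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = len(seq) - 1
    return [pairs[a + b] / total for a, b in product(AMINO_ACIDS, repeat=2)]


# Hypothetical toy sequence.
aac = amino_acid_composition("ACCA")
dpc = dipeptide_composition("ACCA")
print(len(aac), len(dpc))
```

Non-standard residue symbols are counted in the sequence length but contribute to no feature, which is one of several possible conventions.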


Prediction of post-translational modification sites using multiple kernel support vector machine

PeerJ ◽

10.7717/peerj.3261 ◽

2017 ◽

Vol 5 ◽

pp. e3261 ◽

Cited By ~ 7

Author(s):

BingHua Wang ◽

Minghui Wang ◽

Ao Li

Keyword(s):

Protein Function ◽

Predictive Performance ◽

Computational Prediction ◽

Computational Method ◽

Support Vector ◽

Prediction Methods ◽

Sequence Information ◽

Post Translational Modification ◽

Multiple Kernel ◽

Local Sequence

Protein post-translational modification (PTM) is an important mechanism involved in the regulation of protein function. Given the high cost and labor intensity of experimental identification, many computational methods are currently available for predicting PTM sites using protein local sequence information in the context of conserved motifs. Here we propose a novel computational method that combines multiple kernel support vector machines (SVM) for predicting PTM sites, including phosphorylation, O-linked glycosylation, acetylation, sulfation and nitration. To make full use of local sequence information and site-modification relationships, we developed a local sequence kernel and a Gaussian interaction profile kernel, respectively. Multiple kernels were further combined to train the SVM so as to leverage kernel information efficiently and boost predictive performance. We compared the proposed method with existing PTM prediction methods. The experimental results revealed that the proposed method performed comparably to or better than the existing prediction methods, suggesting the feasibility of the developed kernels and the usefulness of the proposed method in PTM site prediction.


Feature Extraction of Ship-Radiated Noise Based on Enhanced Variational Mode Decomposition, Normalized Correlation Coefficient and Permutation Entropy

Entropy ◽

10.3390/e22040468 ◽

2020 ◽

Vol 22 (4) ◽

pp. 468 ◽

Cited By ~ 1

Author(s):

Dongri Xie ◽

Hamada Esmaiel ◽

Haixin Sun ◽

Jie Qi ◽

Zeyad A. H. Qasem

Keyword(s):

Feature Extraction ◽

Correlation Coefficient ◽

Extraction Method ◽

Recognition Rate ◽

Permutation Entropy ◽

Support Vector ◽

Variational Mode Decomposition ◽

Feature Extraction Method ◽

Radiated Noise ◽

Mode Decomposition

Due to the complexity and variability of underwater acoustic channels, ship-radiated noise (SRN) detected using passive sonar is prone to distortion. Entropy-based feature extraction can improve this situation to some extent. However, it is impractical to extract the entropy feature directly from the detected SRN signals. In addition, existing conventional methods lack suitable de-noising in the presence of marine environmental noise. To this end, this paper proposes a novel feature extraction method based on enhanced variational mode decomposition (EVMD), normalized correlation coefficient (norCC), permutation entropy (PE), and the particle swarm optimization-based support vector machine (PSO-SVM). First, EVMD is used to obtain a group of intrinsic mode functions (IMFs) from the SRN signals. The noise-dominant IMFs are then eliminated by de-noising prior to PE calculation. Next, the correlation coefficient between each signal-dominant IMF and the raw signal, and the PE of each signal-dominant IMF, are calculated. After this, the norCC is used to weight the corresponding PE, and the sum of these weighted PEs is taken as the final feature parameter. Finally, the feature vectors are fed into the PSO-SVM multi-class classifier to classify the SRN samples. The experimental results demonstrate that the recognition rate of the proposed methodology is up to 100%, which is much higher than that of existing methods. Hence, the method proposed in this paper is more suitable for the feature extraction of SRN signals.
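
The entropy and weighting steps in this abstract can be sketched as follows. `permutation_entropy` follows the standard ordinal-pattern (Bandt-Pompe) definition, normalized to [0, 1]. `weighted_pe_feature` is an assumption about the weighting: the abstract does not say how the correlation coefficients are normalized, so the sketch simply scales them to sum to 1. The decomposition into IMFs (EVMD) is not shown, and all names and inputs are hypothetical.

```python
import math
from collections import Counter


def permutation_entropy(signal, order=3, delay=1):
    """Normalized permutation entropy of a 1-D sequence (0 = regular, 1 = random)."""
    n = len(signal) - (order - 1) * delay
    if n <= 0:
        raise ValueError("signal too short for the chosen order and delay")
    patterns = Counter()
    for i in range(n):
        window = [signal[i + j * delay] for j in range(order)]
        # The ordinal pattern is the index order that sorts the window.
        patterns[tuple(sorted(range(order), key=window.__getitem__))] += 1
    entropy = sum(-(c / n) * math.log(c / n) for c in patterns.values())
    return entropy / math.log(math.factorial(order))


def weighted_pe_feature(corr_coeffs, pes):
    """Sum of PEs weighted by correlation coefficients scaled to sum to 1 (assumed)."""
    total = sum(corr_coeffs)
    return sum((c / total) * pe for c, pe in zip(corr_coeffs, pes))


print(permutation_entropy(list(range(10))))               # monotonic ramp
print(permutation_entropy([0, 1, 0, 1, 0, 1, 0, 1, 0], order=2))
print(weighted_pe_feature([1.0, 3.0], [0.2, 0.6]))
```

A monotonic ramp produces a single ordinal pattern and therefore zero entropy, while the alternating sequence uses both order-2 patterns equally and reaches the maximum.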


Position-Specific Analysis and Prediction of Protein Pupylation Sites Based on Multiple Features

BioMed Research International ◽

10.1155/2013/109549 ◽

2013 ◽

Vol 2013 ◽

pp. 1-9 ◽

Cited By ~ 15

Author(s):

Xiaowei Zhao ◽

Jiangyan Dai ◽

Qiao Ning ◽

Zhiqiang Ma ◽

Minghao Yin ◽

...

Keyword(s):

Posttranslational Modifications ◽

Computational Prediction ◽

Support Vector ◽

Good Prediction ◽

Accurate Identification ◽

Multiple Features ◽

Specific Analysis ◽

Experimental Approaches ◽

Optimal Feature

Pupylation is one of the most important posttranslational modifications of proteins; accurate identification of pupylation sites will facilitate the understanding of the molecular mechanism of pupylation. Besides conventional experimental approaches, computational prediction of pupylation sites is highly desirable for its convenience and speed. In this study, we developed a novel predictor of pupylation sites. First, the maximum relevance minimum redundancy (mRMR) and incremental feature selection methods were applied to five kinds of features to select the optimal feature set. Then the prediction model was built on the optimal feature set using the support vector machine algorithm. As a result, the overall jackknife success rate of the new predictor on a newly constructed benchmark dataset was 0.764, and the Matthews correlation coefficient was 0.522, indicating good prediction. Feature analysis showed that all feature types contributed to the prediction of protein pupylation sites. Further site-specific feature analysis revealed that the features of sites surrounding the central lysine contributed more to the determination of pupylation sites than the other sites.


Min3: Predict microRNA target gene using an improved binding-site representation method and support vector machine

Journal of Bioinformatics and Computational Biology ◽

10.1142/s021972001950032x ◽

2019 ◽

Vol 17 (05) ◽

pp. 1950032

Author(s):

Tinghua Huang ◽

Xiali Huang ◽

Min Yao

Keyword(s):

Binding Site ◽

Target Genes ◽

Experimental Testing ◽

Mrna Level ◽

Computational Prediction ◽

Training Dataset ◽

Support Vector ◽

Microrna Target ◽

Specificity And Sensitivity ◽

Representation Method

MicroRNAs are single-stranded noncoding RNAs known to down-regulate target genes at the protein or mRNA level. Computational prediction of targets is essential for elucidating the detailed functions of microRNA. However, the prediction specificity and sensitivity of existing algorithms still need to be improved to generate useful hypotheses for subsequent experimental testing. A new microRNA binding-site representation method was developed, which uses four symbols “[Formula: see text]”, “:”, “[Formula: see text]”, and “[Formula: see text]” (indicating paired, unpaired, insertion, and bulge, respectively) to represent the status of each nucleotide base pair in the microRNA binding site. New features were established from the information in every two adjacent symbols. There are 12 possible combinations, and the frequency of each defines a set of novel and useful features. A comprehensive training dataset was constructed for mammalian microRNAs, with positive targets obtained from the microRNA target depository in the miRTarbase and negative targets derived from pseudo-microRNA bindings. An SVM model was established using the training dataset, and a new software tool called Min3 was developed. The performance of Min3 was assessed with the intensively studied examples of miR-155 and miR-92a. Prediction results showed that Min3 can discover 47% of experimentally confirmed targets on average. The overlap is above 20% on average when compared with TargetScan and miRanda. Annotations of the public microRNA datasets showed that there is a negative effect (up-regulation) of the Min3 targets for the knockout/knockdown of miR-155 and miR-92a. Six top-ranked targets were selected for validation by wet-lab experiments, and five of them showed a regulation effect. Min3 can be a good alternative to current microRNA target discovery software. This tool is available at https://sourceforge.net/projects/mirt3 .


PhytoAFP: In Silico Approaches for Designing Plant-Derived Antifungal Peptides

Antibiotics ◽

10.3390/antibiotics10070815 ◽

2021 ◽

Vol 10 (7) ◽

pp. 815

Author(s):

Atul Tyagi ◽

Sudeep Roy ◽

Sanjay Singh ◽

Manoj Semwal ◽

Ajit K. Shasany ◽

...

Keyword(s):

Matthews Correlation Coefficient ◽

Emerging Infectious Diseases ◽

Training Dataset ◽

Support Vector ◽

Composition Analysis ◽

Position Preference ◽

Motif Analysis ◽

Data Set ◽

Physiochemical Properties ◽

Antifungal Peptides

Emerging infectious diseases (EID) are serious problems caused by fungi in humans and plant species. They are a severe threat to food security worldwide. In our current work, we have developed a support vector machine (SVM)-based model that attempts to design and predict therapeutic plant-derived antifungal peptides (PhytoAFP). The residue composition analysis shows a preference for C, G, K, R, and S amino acids. Position preference analysis shows that residues G, K, R, and A dominate the N-terminal. Similarly, residues N, S, C, and G prefer the C-terminal. Motif analysis reveals the presence of motifs like NYVF, NYVFP, YVFP, NYVFPA, and VFPA. We have developed two models using various input functions such as mono-, di-, and tripeptide composition, as well as binary, hybrid, and physiochemical properties, based on methods that are applied to the main data set. The TPC-based monopeptide composition model achieved the highest accuracy, 94.4%, with a Matthews correlation coefficient (MCC) of 0.89. The second-best model, based on dipeptides, achieved an accuracy of 94.28% with an MCC of 0.89 on the training dataset.


OFFLINE YORÙBÁ HANDWRITTEN WORD RECOGNITION USING GEOMETRIC FEATURE EXTRACTION AND SUPPORT VECTOR MACHINE CLASSIFIER

MALAYSIAN JOURNAL OF COMPUTING ◽

10.24191/mjoc.v5i2.8947 ◽

2020 ◽

Vol 5 (2) ◽

pp. 504

Author(s):

Matthias Omotayo Oladele ◽

Temilola Morufat Adepoju ◽

Olaide Abiodun Olatoke ◽

Oluwaseun Adewale Ojo

Keyword(s):

Support Vector Machine ◽

Feature Extraction ◽

Word Recognition ◽

Support Vector Machine Classifier ◽

Recognition Accuracy ◽

Recognition System ◽

Support Vector ◽

Geometric Features ◽

Total Length ◽

Yoruba Language

Yorùbá is one of the three main languages spoken in Nigeria. It is a tonal language that carries accents on its vowels. The Yorùbá alphabet has twenty-five (25) letters, one of which is a digraph (GB). Because typing handwritten Yorùbá documents is difficult, there is a need for a handwriting recognition system that can convert handwritten texts to digital format. This study discusses an offline Yorùbá handwritten word recognition system (OYHWR) that recognizes Yorùbá uppercase letters. Handwritten characters and words were obtained from different writers using the paint application and M708 graphics tablets. The characters were used for training and the words were used for testing. The images were pre-processed and their geometric features were extracted using zoning and gradient-based feature extraction. Geometric features are the different line types that form a particular character, such as vertical, horizontal, and diagonal lines. The geometric features used are the number of horizontal lines, number of vertical lines, number of right diagonal lines, number of left diagonal lines, total length of all horizontal lines, total length of all vertical lines, total length of all right-slanting lines, total length of all left-slanting lines, and the area of the skeleton. The characters are divided into 9 zones, and gradient feature extraction was used to extract the horizontal and vertical components and geometric features in each zone. The words were fed into the support vector machine classifier and the performance was evaluated based on recognition accuracy. Because the support vector machine is a two-class classifier, a multiclass least squares support vector machine (LSSVM) was used for word recognition. The one-vs-one strategy and RBF kernel were used, and the recognition accuracies obtained for the tested words were 66.7%, 83.3%, 85.7%, 87.5%, and 100%. The low recognition rate for some of the words could be a result of similarity in the extracted features.
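
Zoning, as mentioned in this abstract, splits a character image into a grid and computes features per zone. The sketch below assumes a 3 x 3 grid (9 zones) over a binary image and uses a foreground pixel count as a stand-in feature; the paper's actual per-zone features are the line-based geometric features listed above, which are not reproduced here. The function name and the toy image are hypothetical.

```python
def zone_pixel_counts(image, rows=3, cols=3):
    """Count foreground pixels (value 1) in each zone of a rows x cols grid."""
    height, width = len(image), len(image[0])
    counts = []
    for r in range(rows):
        # Integer arithmetic spreads any remainder across the zones.
        r0, r1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            c0, c1 = c * width // cols, (c + 1) * width // cols
            counts.append(
                sum(image[y][x] for y in range(r0, r1) for x in range(c0, c1))
            )
    return counts


# Hypothetical 6 x 6 binary image with a vertical stroke in the first column.
image = [[1, 0, 0, 0, 0, 0] for _ in range(6)]
features = zone_pixel_counts(image)
print(features)
```

The stroke falls entirely in the left-hand zones, so only the first zone of each grid row has a non-zero count.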
