HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection

Computational and Mathematical Methods in Medicine ◽

10.1155/2020/1384749 ◽

2020 ◽

Vol 2020 ◽

pp. 1-10 ◽

Cited By ~ 3

Author(s):

Xiuzhi Sang ◽

Wanyue Xiao ◽

Huiwen Zheng ◽

Yang Yang ◽

Taigang Liu

Keyword(s):

Feature Selection ◽

Dna Binding ◽

Binding Proteins ◽

Biological Activities ◽

Dna Binding Proteins ◽

Gradient Boosting ◽

Support Vector ◽

Svm Classifier ◽

Cross Covariance ◽

Extreme Gradient Boosting

Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science due to its crucial role in all aspects of biological activities. Even though considerable efforts have been devoted to developing powerful computational methods to solve this problem, it is still a challenging task in the field of bioinformatics. A hidden Markov model (HMM) profile has been proved to provide important clues for improving the prediction performance of DBPs. In this paper, we propose a method, called HMMPred, which extracts the features of amino acid composition and auto- and cross-covariance transformation from the HMM profiles, to help train a machine learning model for identification of DBPs. Then, a feature selection technique is performed based on the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. The experimental results tested on two benchmark datasets show that the proposed method is superior to most of the existing methods and could serve as an alternative tool to identify DBPs.
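The feature-extraction step described in this abstract (amino acid composition plus auto- and cross-covariance computed from a profile matrix) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `aac_features` and `acc_features`, the `max_lag` parameter, the normalization by the number of overlapping residues, and the 3-column toy profile (a real HMM profile has 20 columns) are all hypothetical choices made here. The XGBoost selection and SVM training steps are not shown.

```python
from typing import List

Profile = List[List[float]]  # one row per residue, one column per amino-acid type


def aac_features(profile: Profile) -> List[float]:
    """Composition features: the mean of each profile column."""
    n_rows = len(profile)
    n_cols = len(profile[0])
    return [sum(row[j] for row in profile) / n_rows for j in range(n_cols)]


def acc_features(profile: Profile, max_lag: int) -> List[float]:
    """Auto- and cross-covariance features for lags 1..max_lag.

    For each lag g and column pair (j, k), average the product of the
    mean-centred value in column j at position i and the mean-centred
    value in column k at position i + g. j == k gives auto-covariance,
    j != k gives cross-covariance.
    """
    n_rows = len(profile)
    n_cols = len(profile[0])
    if not 1 <= max_lag < n_rows:
        raise ValueError("max_lag must satisfy 1 <= max_lag < number of residues")
    means = aac_features(profile)
    feats: List[float] = []
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            for k in range(n_cols):
                total = sum(
                    (profile[i][j] - means[j]) * (profile[i + lag][k] - means[k])
                    for i in range(n_rows - lag)
                )
                feats.append(total / (n_rows - lag))
    return feats


# Toy profile with 3 columns instead of 20, purely for illustration (assumption).
toy_profile = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feature_vector = aac_features(toy_profile) + acc_features(toy_profile, max_lag=1)
```

With 20 columns and a maximum lag of G, this layout would give 20 composition values plus 400 × G covariance values per protein, which is the kind of high-dimensional vector a feature selection step would then prune.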


DBP-GAPred: An intelligent method for prediction of DNA-binding proteins types by enhanced evolutionary profile features with ensemble learning

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720021500189 ◽

2021 ◽

pp. 2150018

Author(s):

Omar Barukab ◽

Farman Ali ◽

Sher Afzal Khan

Keyword(s):

Dna Binding ◽

Binding Proteins ◽

Biological Activities ◽

Dna Binding Proteins ◽

Support Vector ◽

The Novel ◽

Double Stranded Dna ◽

Difference Formula ◽

Ensemble Strategy ◽

Inflammatory Drugs

DNA-binding proteins (DBPs) play an influential role in diverse biological activities such as DNA replication, slicing, repair, and transcription. Some DBPs are indispensable for understanding many types of human cancers (e.g., lung, breast, and liver cancer) and chronic diseases (e.g., AIDS/HIV, asthma), while others are involved in the design of antibiotics, steroids, and anti-inflammatory drugs. These crucial processes are closely related to DBP types. DBPs are categorized into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs). A few computational predictors have been reported for discriminating ssDBPs from dsDBPs. However, due to the limitations of the existing methods, an intelligent computational system is still highly desirable. In this work, features from protein sequences are discovered by extending the notion of dipeptide composition (DPC), evolutionary difference formula (EDF), and K-separated bigram (KSB) into the position-specific scoring matrix (PSSM). The highly intrinsic information was encoded by a compression approach named discrete cosine transform (DCT), and the model was trained with a support vector machine (SVM). The prediction performance was further boosted by the genetic algorithm (GA) ensemble strategy. The novel predictor (DBP-GAPred) acquired 1.89%, 0.28%, and 6.63% higher accuracies on jackknife, 10-fold, and independent dataset tests, respectively, than the best existing predictor. These outcomes confirm the superiority of our method over the existing predictors.
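The compression step mentioned in this abstract can be illustrated with a plain discrete cosine transform that keeps only the leading coefficients. This is a hedged sketch: it uses the unnormalized DCT-II, and the function names `dct2` and `compress` and the `keep` parameter are assumptions made for illustration. The abstract does not state which DCT variant, scaling, or number of retained coefficients the authors used.

```python
import math
from typing import List


def dct2(signal: List[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_i x[i] * cos(pi / N * (i + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(signal))
        for k in range(n)
    ]


def compress(signal: List[float], keep: int) -> List[float]:
    """Keep the first `keep` low-frequency coefficients as a compact descriptor."""
    return dct2(signal)[:keep]


# A constant signal concentrates all of its energy in the first coefficient.
flat = [1.0, 1.0, 1.0, 1.0]
coeffs = dct2(flat)
```

Low-frequency coefficients carry most of the energy of a smooth signal, which is why truncating the transform is a common way to shorten a feature vector while retaining its overall shape.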


StackPDB: predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier

10.1101/2020.08.24.264267 ◽

2020 ◽

Author(s):

Qingmei Zhang ◽

Peishun Liu ◽

Yu Han ◽

Yaqun Zhang ◽

Xue Wang ◽

...

Keyword(s):

Dna Binding ◽

Binding Proteins ◽

Ensemble Classifier ◽

Dna Binding Proteins ◽

Position Specific Scoring Matrix ◽

Gradient Boosting ◽

Feature Subset ◽

Extreme Gradient Boosting ◽

Independent Test ◽

Scoring Matrix

DNA binding proteins (DBPs) not only play an important role in all aspects of genetic activities such as DNA replication, recombination, repair, and modification but also are used as key components of antibiotics, steroids, and anticancer drugs in the field of drug discovery. Identifying DBPs has become one of the most challenging problems in the domain of proteomics research. Considering the high cost and inefficiency of experimental methods, constructing a detailed DBP prediction model has become an urgent problem for researchers. In this paper, we propose a stacked ensemble classifier-based method for predicting DBPs called StackPDB. Firstly, pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), position-specific scoring matrix-transition probability composition (PSSM-TPC), evolutionary distance transformation (EDT), and residue probing transformation (RPT) are applied to extract protein sequence features. Secondly, extreme gradient boosting-recursive feature elimination (XGB-RFE) is employed to obtain an excellent feature subset. Finally, the best features are applied to the stacked ensemble classifier composed of XGBoost, LightGBM, and SVM to construct StackPDB. Under leave-one-out cross-validation (LOOCV), StackPDB obtains high ACC and MCC on PDB1075, 93.44% and 0.8687, respectively. In addition, the ACC values on the independent test datasets PDB186 and PDB180 are 84.41% and 90.00%, respectively, and the corresponding MCC values are 0.6882 and 0.7997. The results on the training dataset and the independent test datasets show that StackPDB has a strong ability to predict DBPs.
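The two metrics reported in this abstract, accuracy (ACC) and the Matthews correlation coefficient (MCC), follow from the four confusion-matrix counts. The sketch below uses the standard definitions. The function names are illustrative, and returning 0.0 when the MCC denominator vanishes is a convention assumed here, not something stated in the abstract.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (assumed convention),
    since the coefficient is undefined in that case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

MCC ranges from -1 to 1 and stays informative when the positive and negative classes are imbalanced, which is why it is commonly reported alongside ACC for DBP benchmarks.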


MK-FSVM-SVDD: A Multiple Kernel-based Fuzzy SVM Model for Predicting DNA-binding Proteins via Support Vector Data Description

Current Bioinformatics ◽

10.2174/1574893615999200607173829 ◽

2020 ◽

Vol 15 ◽

Author(s):

Yi Zou ◽

Hongjie Wu ◽

Xiaoyi Guo ◽

Li Peng ◽

Yijie Ding ◽

...

Keyword(s):

Dna Binding ◽

Binding Proteins ◽

Detection Efficiency ◽

Dna Binding Proteins ◽

Support Vector ◽

Support Vector Data Description ◽

Vector Data ◽

Data Description ◽

Multiple Kernel ◽

Svm Model

Background: Detecting DNA-binding proteins (DBPs) based on biological and chemical methods is time-consuming and expensive. Objective: In recent years, the rise of computational biology methods based on Machine Learning (ML) has greatly improved the detection efficiency of DBPs. Method: In this study, a Multiple Kernel-based Fuzzy SVM Model with Support Vector Data Description (MK-FSVM-SVDD) is proposed to predict DBPs. Firstly, six features are extracted from the protein sequence. Secondly, multiple kernels are constructed from these sequence features. Then, the multiple kernels are integrated by Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Next, fuzzy membership scores of training samples are calculated with Support Vector Data Description (SVDD). Finally, FSVM is trained and employed to detect new DBPs. Results: Our model is tested on several benchmark datasets. Compared with other methods, MK-FSVM-SVDD achieves the best Matthews Correlation Coefficient (MCC) on PDB186 (0.7250) and PDB2272 (0.5476). Conclusion: We can conclude that MK-FSVM-SVDD is more suitable than the common SVM as the classifier for DNA-binding protein identification.
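The quantity at the heart of centered kernel alignment can be sketched as below: each kernel matrix is centered, and the alignment is the normalized Frobenius inner product of the two centered matrices. This only scores one kernel against an ideal label kernel. The weight optimization that CKA-MKL performs across several kernels is not shown, and all function names and the toy data are assumptions made for illustration.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def center_kernel(k: Matrix) -> Matrix:
    """Double-centre a square kernel matrix (equivalent to H K H with H = I - 11^T/n)."""
    n = len(k)
    row_means = [sum(row) / n for row in k]
    col_means = [sum(k[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [k[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def centered_alignment(k1: Matrix, k2: Matrix) -> float:
    """Centred alignment; 0.0 is returned if either centred kernel is all zeros."""
    c1, c2 = center_kernel(k1), center_kernel(k2)
    denom = math.sqrt(frobenius_inner(c1, c1) * frobenius_inner(c2, c2))
    return frobenius_inner(c1, c2) / denom if denom > 0 else 0.0


def linear_kernel(xs: Sequence[Sequence[float]]) -> Matrix:
    """Gram matrix of dot products between feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in xs] for u in xs]


def ideal_kernel(labels: Sequence[int]) -> Matrix:
    """Target kernel y y^T built from +1 / -1 labels."""
    return [[float(yi * yj) for yj in labels] for yi in labels]


# Toy one-dimensional features and labels (assumption, for illustration only).
toy_x = [[1.0], [2.0], [-1.0], [-2.0]]
toy_y = [1, 1, -1, -1]
score = centered_alignment(linear_kernel(toy_x), ideal_kernel(toy_y))
```

A kernel whose alignment with the label kernel is close to 1 separates the classes well on its own, so alignment scores are a natural basis for weighting several kernels before combining them.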


FTWSVM-SR: DNA-Binding Proteins Identification via Fuzzy Twin Support Vector Machines on Self-Representation

Interdisciplinary Sciences Computational Life Sciences ◽

10.1007/s12539-021-00489-6 ◽

2021 ◽

Author(s):

Yi Zou ◽

Yijie Ding ◽

Li Peng ◽

Quan Zou

Keyword(s):

Support Vector Machines ◽

Dna Binding ◽

Binding Proteins ◽

Dna Binding Proteins ◽

Support Vector ◽

Twin Support Vector Machines ◽

Vector Machines


Sequence Based Prediction of DNA-Binding Proteins Based on Hybrid Feature Selection Using Random Forest and Gaussian Naïve Bayes

PLoS ONE ◽

10.1371/journal.pone.0086703 ◽

2014 ◽

Vol 9 (1) ◽

pp. e86703 ◽

Cited By ~ 81

Author(s):

Wangchao Lou ◽

Xiaoqing Wang ◽

Fan Chen ◽

Yixiao Chen ◽

Bo Jiang ◽

...

Keyword(s):

Feature Selection ◽

Random Forest ◽

Dna Binding ◽

Binding Proteins ◽

Naive Bayes ◽

Dna Binding Proteins ◽

Naïve Bayes


Techniques for Detecting Malware Traffic: A Comprehensive Approach to Feature Selection and Classification

International Journal for Research in Applied Science and Engineering Technology ◽

10.22214/ijraset.2021.39088 ◽

2021 ◽

Vol 9 (12) ◽

pp. 1-10

Author(s):

Harsha A K

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Random Forest ◽

Learning Algorithms ◽

Malware Detection ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Steady Increase ◽

Extreme Gradient Boosting

Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional malware detection approaches, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
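The area-under-the-curve metric quoted in this abstract can be computed directly from classifier scores on a held-out test set. The sketch below uses the pairwise (Mann-Whitney) formulation of ROC AUC: the fraction of malicious/benign pairs in which the malicious sample receives the higher score, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration and do not come from the paper's dataset.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outranks a negative.

    `labels` holds 1 for the positive class (for example, malicious) and 0 otherwise.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy test-set labels and classifier scores (illustrative values only).
test_labels = [0, 0, 1, 1]
test_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(test_labels, test_scores)
```

The pairwise form is quadratic in the number of samples, so for large traffic captures a rank-based implementation would be preferable. The value it returns is the same.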


Identification of DNA-binding proteins via Hypergraph based Laplacian Support Vector Machine

Current Bioinformatics ◽

10.2174/1574893616666210806091922 ◽

2021 ◽

Vol 16 ◽

Author(s):

Yuqing Qian ◽

Hao Meng ◽

Weizhong Lu ◽

Zhijun Liao ◽

Yijie Ding ◽

...

Keyword(s):

Machine Learning ◽

Dna Binding ◽

Large Scale ◽

Binding Proteins ◽

Predictive Accuracy ◽

Dna Binding Proteins ◽

Research Field ◽

Support Vector ◽

Data Sets ◽

Independent Test

Background: The identification of DNA binding proteins (DBP) is an important research field. Experiment-based methods for detecting DBP are time-consuming and labor-intensive. Objective: To solve the problem of large-scale DBP identification, some machine learning methods have been proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBP. Methods: In our study, we extract six types of features (including NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) is employed to train the predictive model. Our method is tested on the PDB186, PDB1075, PDB2272 and PDB14189 data sets. Result: Compared with other methods, our model achieves the best results on benchmark data sets. Conclusion: Accuracies of 87.1% and 74.2% are achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.


Evaluating Variable Selection and Machine Learning Algorithms for Estimating Forest Heights by Combining Lidar and Hyperspectral Data

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi9090507 ◽

2020 ◽

Vol 9 (9) ◽

pp. 507

Author(s):

Sanjiwana Arjasakusuma ◽

Sandiaga Swahyu Kusuma ◽

Stuart Phinn

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Learning Algorithms ◽

Principal Component ◽

Hyperspectral Data ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Support Vector ◽

Forest Height ◽

Extreme Gradient Boosting

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGbtree and XGBdart) and linear (XGBlin) classifiers were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it could reduce 95% of the data to select the 29 most important variables from the initial 516 variables from lidar metrics and hyperspectral data.
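The regression metrics reported in this abstract (R², RMSE, and bias) can be computed from paired observed and predicted heights as sketched below. The definitions are the standard ones. Taking bias as the mean of predicted minus observed values is an assumption, since the abstract does not state the sign convention, and nRMSE is omitted because its normalizer is not given. Function names and the sample values are illustrative only.

```python
import math
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


def bias(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean signed error, predicted minus observed (assumed sign convention)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


# Illustrative heights in metres, not taken from the study.
heights_obs = [1.0, 2.0, 3.0]
heights_pred = [2.0, 2.0, 2.0]
```

In this toy case the predictions equal the observed mean everywhere, so R² is 0 and the bias is 0 even though the RMSE is not, which shows why the three values are reported together.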


Identification of DNA-binding proteins using support vector machines and evolutionary profiles

BMC Bioinformatics ◽

10.1186/1471-2105-8-463 ◽

2007 ◽

Vol 8 (1) ◽

pp. 463 ◽

Cited By ~ 127

Author(s):

Manish Kumar ◽

Michael M Gromiha ◽

Gajendra PS Raghava

Keyword(s):

Support Vector Machines ◽

Dna Binding ◽

Binding Proteins ◽

Dna Binding Proteins ◽

Support Vector ◽

Vector Machines


FC-SVM: DNA binding Proteins prediction with Average Blocks (AB) descriptors using SVM with FC feature Selection

2019 International Conference on Sustainable Information Engineering and Technology (SIET) ◽

10.1109/siet48054.2019.8986070 ◽

2019 ◽

Author(s):

Achmad Ridok ◽

Nashi Widodo ◽

Wayan Firdaus Mahmudy ◽

Muhaimin Rifai

Keyword(s):

Feature Selection ◽

Dna Binding ◽

Binding Proteins ◽

Dna Binding Proteins
