HMMPred: Accurate Prediction of DNA-Binding Proteins Based on HMM Profiles and XGBoost Feature Selection

2020 ◽  
Vol 2020 ◽  
pp. 1-10 ◽  
Author(s):  
Xiuzhi Sang ◽  
Wanyue Xiao ◽  
Huiwen Zheng ◽  
Yang Yang ◽  
Taigang Liu

Prediction of DNA-binding proteins (DBPs) has become a popular research topic in protein science because of the crucial role DBPs play in all aspects of biological activities. Even though considerable efforts have been devoted to developing powerful computational methods to solve this problem, it is still a challenging task in the field of bioinformatics. A hidden Markov model (HMM) profile has been shown to provide important clues for improving the prediction performance of DBPs. In this paper, we propose a method, called HMMPred, which extracts the features of amino acid composition and auto- and cross-covariance transformation from the HMM profiles, to help train a machine learning model for identification of DBPs. Then, a feature selection technique is performed based on the extreme gradient boosting (XGBoost) algorithm. Finally, the selected optimal features are fed into a support vector machine (SVM) classifier to predict DBPs. Experimental results on two benchmark datasets show that the proposed method is superior to most of the existing methods and could serve as an alternative tool to identify DBPs.
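The feature extraction step named in this abstract can be illustrated with a small sketch. The abstract gives no formulas, so the code below assumes the commonly used definitions: composition as the mean of each profile column, and a lagged covariance between column j1 at position i and column j2 at position i + lag (auto-covariance when j1 equals j2, cross-covariance otherwise). The function names and the toy profile are hypothetical, and the XGBoost selection and SVM steps are not shown.

```python
from typing import List

Profile = List[List[float]]  # one row per residue, one column per amino acid type


def column_means(profile: Profile) -> List[float]:
    """Composition-style features: the mean of each profile column."""
    length = len(profile)
    return [sum(row[j] for row in profile) / length for j in range(len(profile[0]))]


def lagged_covariance(profile: Profile, j1: int, j2: int, lag: int) -> float:
    """Covariance between column j1 at position i and column j2 at position i + lag.

    j1 == j2 gives an auto-covariance term, j1 != j2 a cross-covariance term.
    """
    length = len(profile)
    if not 0 < lag < length:
        raise ValueError("lag must satisfy 0 < lag < sequence length")
    means = column_means(profile)
    total = sum(
        (profile[i][j1] - means[j1]) * (profile[i + lag][j2] - means[j2])
        for i in range(length - lag)
    )
    return total / (length - lag)


def acc_features(profile: Profile, max_lag: int) -> List[float]:
    """Concatenate all auto- and cross-covariance terms for lags 1..max_lag."""
    n_cols = len(profile[0])
    return [
        lagged_covariance(profile, j1, j2, lag)
        for lag in range(1, max_lag + 1)
        for j1 in range(n_cols)
        for j2 in range(n_cols)
    ]


# Toy 4-residue, 2-column profile (assumption: a real HMM profile has 20 columns).
toy_profile = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
features = column_means(toy_profile) + acc_features(toy_profile, max_lag=2)
```

With 20 columns and a maximum lag G, this layout yields 20 composition values plus 400 × G covariance terms, which is the kind of high-dimensional vector a feature selection step would then prune.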

Author(s):  
Omar Barukab ◽  
Farman Ali ◽  
Sher Afzal Khan

DNA-binding proteins (DBPs) play an influential role in diverse biological activities such as DNA replication, slicing, repair, and transcription. Some DBPs are indispensable for understanding many types of human cancers (e.g., lung, breast, and liver cancer) and chronic diseases (e.g., AIDS/HIV, asthma), while others are involved in the design of antibiotics, steroids, and anti-inflammatory drugs. These crucial processes are closely related to DBP types. DBPs are categorized into single-stranded DNA-binding proteins (ssDBPs) and double-stranded DNA-binding proteins (dsDBPs). A few computational predictors have been reported for discriminating ssDBPs from dsDBPs. However, due to the limitations of the existing methods, an intelligent computational system is still highly desirable. In this work, features are derived from protein sequences by extending the notions of dipeptide composition (DPC), evolutionary difference formula (EDF), and K-separated bigram (KSB) to the position-specific scoring matrix (PSSM). The highly intrinsic information was encoded using the discrete cosine transform (DCT), a compression approach, and the model was trained with a support vector machine (SVM). The prediction performance was further boosted by a genetic algorithm (GA) ensemble strategy. The novel predictor (DBP-GAPred) achieved 1.89%, 0.28%, and 6.63% higher accuracies on jackknife, 10-fold, and independent dataset tests, respectively, than the best existing predictor. These outcomes confirm the superiority of our method over the existing predictors.
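As a rough illustration of one descriptor named in this abstract, the sketch below computes a K-separated bigram matrix from a PSSM. The abstract does not state the formula, so this assumes the usual definition: entry (m, n) sums, over all positions i, the product of the score for amino acid m at position i and the score for amino acid n at position i + k. The function name and toy matrix are hypothetical; the DCT compression and GA ensemble steps are not shown.

```python
from typing import List


def k_separated_bigram(pssm: List[List[float]], k: int) -> List[List[float]]:
    """Return the bigram matrix B where B[m][n] = sum_i pssm[i][m] * pssm[i + k][n]."""
    length = len(pssm)
    n_cols = len(pssm[0])
    if not 0 < k < length:
        raise ValueError("k must satisfy 0 < k < sequence length")
    return [
        [
            sum(pssm[i][m] * pssm[i + k][n] for i in range(length - k))
            for n in range(n_cols)
        ]
        for m in range(n_cols)
    ]


# Toy 3-residue, 2-column matrix (assumption: a real PSSM has 20 columns
# and is often normalized before this transform).
toy_pssm = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
bigram_k1 = k_separated_bigram(toy_pssm, k=1)
bigram_k2 = k_separated_bigram(toy_pssm, k=2)
```

For a 20-column PSSM each value of k yields a 20 × 20 matrix, which is typically flattened into 400 features regardless of the sequence length.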


2020 ◽  
Author(s):  
Qingmei Zhang ◽  
Peishun Liu ◽  
Yu Han ◽  
Yaqun Zhang ◽  
Xue Wang ◽  
...  

DNA-binding proteins (DBPs) not only play an important role in all aspects of genetic activities such as DNA replication, recombination, repair, and modification but are also used as key components of antibiotics, steroids, and anticancer drugs in the field of drug discovery. Identifying DBPs has become one of the most challenging problems in proteomics research. Given the high cost and inefficiency of experimental methods, constructing a detailed DBP prediction model has become an urgent problem for researchers. In this paper, we propose a method based on a stacked ensemble classifier for predicting DBPs, called StackPDB. First, pseudo amino acid composition (PseAAC), pseudo position-specific scoring matrix (PsePSSM), position-specific scoring matrix-transition probability composition (PSSM-TPC), evolutionary distance transformation (EDT), and residue probing transformation (RPT) are applied to extract protein sequence features. Second, extreme gradient boosting-recursive feature elimination (XGB-RFE) is employed to obtain an excellent feature subset. Finally, the best features are fed to a stacked ensemble classifier composed of XGBoost, LightGBM, and SVM to construct StackPDB. Under leave-one-out cross-validation (LOOCV), StackPDB obtains an ACC of 93.44% and an MCC of 0.8687 on PDB1075. On the independent test datasets PDB186 and PDB180, the ACC values are 84.41% and 90.00% and the MCC values are 0.6882 and 0.7997, respectively. The results on the training dataset and the independent test datasets show that StackPDB has strong predictive ability for DBPs.
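The two metrics reported in this abstract, ACC and MCC, can both be computed from the four counts of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the toy counts are illustrative only and are not taken from the paper.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range [-1, 1].

    Returns 0.0 when any marginal count is zero, a common convention
    for the otherwise undefined 0/0 case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy counts, not data from any of the listed studies.
toy_acc = accuracy(tp=4, tn=4, fp=1, fn=1)
toy_mcc = matthews_corrcoef(tp=4, tn=4, fp=1, fn=1)
```

Unlike ACC, MCC uses all four counts symmetrically, so it stays informative when the positive and negative classes differ in size.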


2020 ◽  
Vol 15 ◽  
Author(s):  
Yi Zou ◽  
Hongjie Wu ◽  
Xiaoyi Guo ◽  
Li Peng ◽  
Yijie Ding ◽  
...  

Background: Detecting DNA-binding proteins (DBPs) with biological and chemical methods is time-consuming and expensive. Objective: In recent years, the rise of computational biology methods based on Machine Learning (ML) has greatly improved the detection efficiency of DBPs. Method: In this study, a Multiple Kernel-based Fuzzy SVM Model with Support Vector Data Description (MK-FSVM-SVDD) is proposed to predict DBPs. First, six features are extracted from the protein sequence. Second, multiple kernels are constructed from these sequence features. Then, the multiple kernels are integrated by Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL). Next, fuzzy membership scores of the training samples are calculated with Support Vector Data Description (SVDD). Finally, the FSVM is trained and employed to detect new DBPs. Results: Our model is tested on several benchmark datasets. Compared with other methods, MK-FSVM-SVDD achieves the best Matthews Correlation Coefficient (MCC) on PDB186 (0.7250) and PDB2272 (0.5476). Conclusion: We conclude that MK-FSVM-SVDD is more suitable than a common SVM as the classifier for identifying DNA-binding proteins.
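The quantity at the core of centered kernel alignment can be sketched as follows. This is a minimal illustration under assumptions: it scores how well one kernel matrix agrees with another (here, an "ideal" kernel built from the labels) after both are centered, using the standard normalized Frobenius inner product. How CKA-MKL turns such scores into kernel weights is an optimization step that the abstract does not describe and that is not shown here; all names and toy values are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def center_kernel(kernel: Matrix) -> Matrix:
    """Center a square kernel matrix: subtract row and column means, add the grand mean."""
    n = len(kernel)
    row_means = [sum(row) / n for row in kernel]
    col_means = [sum(kernel[i][j] for i in range(n)) / n for j in range(n)]
    grand_mean = sum(row_means) / n
    return [
        [kernel[i][j] - row_means[i] - col_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def frobenius_inner(a: Matrix, b: Matrix) -> float:
    """Sum of element-wise products of two equally sized matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def centered_alignment(k1: Matrix, k2: Matrix) -> float:
    """Cosine-like similarity between two centered kernels (0.0 if either is constant)."""
    c1, c2 = center_kernel(k1), center_kernel(k2)
    denom = math.sqrt(frobenius_inner(c1, c1) * frobenius_inner(c2, c2))
    return frobenius_inner(c1, c2) / denom if denom else 0.0


# Toy example: labels define the ideal kernel y y^T; the identity matrix is a
# deliberately uninformative candidate kernel.
labels = [1.0, 1.0, -1.0, -1.0]
ideal_kernel = [[a * b for b in labels] for a in labels]
identity_kernel = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
score_ideal = centered_alignment(ideal_kernel, ideal_kernel)
score_identity = centered_alignment(identity_kernel, ideal_kernel)
```

A kernel whose alignment with the ideal kernel is higher would, in this kind of scheme, be expected to receive a larger weight when the kernels are combined.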


PLoS ONE ◽  
2014 ◽  
Vol 9 (1) ◽  
pp. e86703 ◽  
Author(s):  
Wangchao Lou ◽  
Xiaoqing Wang ◽  
Fan Chen ◽  
Yixiao Chen ◽  
Bo Jiang ◽  
...  

Author(s):  
Harsha A K

Abstract: Since the advent of encryption, there has been a steady increase in malware being transmitted over encrypted networks. Traditional approaches to detecting malware, such as packet content analysis, are inefficient in dealing with encrypted data. In the absence of actual packet contents, we can make use of other features such as packet size, arrival time, source and destination addresses, and other such metadata to detect malware. Such information can be used to train machine learning classifiers to distinguish malicious from benign packets. In this paper, we offer an efficient malware detection approach using machine learning classification algorithms such as support vector machine, random forest, and extreme gradient boosting. We employ an extensive feature selection process to reduce the dimensionality of the chosen dataset. The dataset is then split into training and testing sets. Machine learning algorithms are trained using the training set. These models are then evaluated against the testing set in order to assess their respective performances. We further attempt to tune the hyperparameters of the algorithms in order to achieve better results. Random forest and extreme gradient boosting algorithms performed exceptionally well in our experiments, resulting in area under the curve values of 0.9928 and 0.9998, respectively. Our work demonstrates that malware traffic can be effectively classified using conventional machine learning algorithms and also shows the importance of dimensionality reduction in such classification problems. Keywords: Malware Detection, Extreme Gradient Boosting, Random Forest, Feature Selection.
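The area under the ROC curve reported in this abstract can be understood as the probability that a randomly chosen malicious sample receives a higher score than a randomly chosen benign one, with ties counted as one half. The sketch below computes it directly from that pairwise definition; the function name, the 1/0 label encoding, and the toy scores are assumptions for illustration, not data from the paper.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels are assumed to be 1 (malicious) or 0 (benign). This pairwise form
    is quadratic in the number of samples, which is fine for a small demo.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy classifier scores: one positive (0.4) is ranked below one negative (0.5).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Because the value depends only on the ranking of scores, it is unaffected by the decision threshold, which makes it convenient for comparing classifiers such as random forest and gradient boosting on the same test split.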


2021 ◽  
Vol 16 ◽  
Author(s):  
Yuqing Qian ◽  
Hao Meng ◽  
Weizhong Lu ◽  
Zhijun Liao ◽  
Yijie Ding ◽  
...  

Background: The identification of DNA-binding proteins (DBPs) is an important research field. Experiment-based methods for detecting DBPs are time-consuming and labor-intensive. Objective: To solve the problem of large-scale DBP identification, several machine learning methods have been proposed. However, these methods have insufficient predictive accuracy. Our aim is to develop a sequence-based machine learning model to predict DBPs. Methods: In our study, we extract six types of features (NMBAC, GE, MCD, PSSM-AB, PSSM-DWT, and PsePSSM) from protein sequences. We use Multiple Kernel Learning based on the Hilbert-Schmidt Independence Criterion (MKL-HSIC) to estimate the optimal kernel. Then, we construct a hypergraph model to describe the relationship between labeled and unlabeled samples. Finally, a Laplacian Support Vector Machine (LapSVM) is employed to train the predictive model. Our method is tested on the PDB186, PDB1075, PDB2272, and PDB14189 data sets. Result: Compared with other methods, our model achieves the best results on benchmark data sets. Conclusion: Accuracies of 87.1% and 74.2% are achieved on PDB186 (independent test of PDB1075) and PDB2272 (independent test of PDB14189), respectively.


2020 ◽  
Vol 9 (9) ◽  
pp. 507
Author(s):  
Sanjiwana Arjasakusuma ◽  
Sandiaga Swahyu Kusuma ◽  
Stuart Phinn

Machine learning has been employed for various mapping and modeling tasks using input variables from different sources of remote sensing data. For feature selection involving data of high spatial and spectral dimensionality, various methods have been developed and incorporated into the machine learning framework to ensure an efficient and optimal computational process. This research aims to assess the accuracy of various feature selection and machine learning methods for estimating forest height using AISA (airborne imaging spectrometer for applications) hyperspectral bands (479 bands) and airborne light detection and ranging (lidar) height metrics (36 metrics), alone and combined. Feature selection and dimensionality reduction using Boruta (BO), principal component analysis (PCA), simulated annealing (SA), and genetic algorithm (GA) in combination with machine learning algorithms such as multivariate adaptive regression spline (MARS), extra trees (ET), support vector regression (SVR) with radial basis function, and extreme gradient boosting (XGB) with trees (XGBtree and XGBdart) and linear (XGBlin) classifiers were evaluated. The results demonstrated that the combinations of BO-XGBdart and BO-SVR delivered the best model performance for estimating tropical forest height by combining lidar and hyperspectral data, with R² = 0.53 and RMSE = 1.7 m (18.4% of nRMSE and 0.046 m of bias) for BO-XGBdart and R² = 0.51 and RMSE = 1.8 m (15.8% of nRMSE and −0.244 m of bias) for BO-SVR. Our study also demonstrated the effectiveness of BO for variable selection; it reduced the data by 95%, selecting the 29 most important variables from the initial 516 lidar and hyperspectral variables.
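The four error statistics quoted in this abstract (R², RMSE, nRMSE, and bias) can be sketched as below. The abstract does not define them, so the code makes explicit assumptions: bias is the mean of predicted minus observed values, R² is one minus the ratio of residual to total sum of squares, and nRMSE is RMSE divided by the mean observed value, expressed as a percentage. Other conventions exist (for example, normalizing by the observed range, or taking R² as a squared correlation), and the function name and toy heights are hypothetical.

```python
import math
from typing import Dict, Sequence


def regression_metrics(observed: Sequence[float], predicted: Sequence[float]) -> Dict[str, float]:
    """Return RMSE, nRMSE (percent of mean observed), bias, and R^2."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "rmse": rmse,
        "nrmse_percent": 100.0 * rmse / mean_obs,  # assumption: normalized by the mean
        "bias": sum(errors) / n,                   # assumption: predicted minus observed
        "r2": 1.0 - ss_res / ss_tot,
    }


# Toy forest heights in metres, not data from the study.
toy_observed = [10.0, 12.0, 14.0, 16.0]
toy_predicted = [11.0, 12.0, 13.0, 18.0]
toy_metrics = regression_metrics(toy_observed, toy_predicted)
```

Reporting bias alongside RMSE separates systematic over- or under-estimation from overall scatter, which is why two models with similar RMSE can still differ in bias.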

