Identifying Phage Virion Proteins by Using Two-Step Feature Selection Methods

Jiu-Xin Tan; Fu-Ying Dao; Hao Lv; Peng-Mian Feng; Hui Ding

doi:10.3390/molecules23082000

Molecules ◽

10.3390/molecules23082000 ◽

2018 ◽

Vol 23 (8) ◽

pp. 2000 ◽

Cited By ~ 15

Author(s):

Jiu-Xin Tan ◽

Fu-Ying Dao ◽

Hao Lv ◽

Peng-Mian Feng ◽

Hui Ding

Keyword(s):

Feature Selection ◽

Cross Validation ◽

Support Vector ◽

Virion Protein ◽

Accurate Identification ◽

Machine Learning Methods ◽

Minimal Redundancy ◽

Maximal Relevance ◽

Optimal Feature ◽

Fold Cross Validation

Accurate identification of phage virion proteins is not only a key step toward understanding their function but also helps clarify the lysis mechanism of the bacterial cell. Because traditional experimental methods for identifying phage virion proteins are time-consuming and costly, machine learning methods that identify them accurately and efficiently are urgently needed. In this work, a support vector machine (SVM) based method was proposed by combining multiple sets of optimal g-gap dipeptide compositions. Analysis of variance (ANOVA) and minimal-redundancy-maximal-relevance (mRMR), together with incremental feature selection (IFS), were applied to select the optimal feature set. In the five-fold cross-validation test, the proposed method achieved an overall accuracy of 87.95%. We believe the proposed method will be an efficient and powerful tool for scientists studying phage virion proteins.
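The g-gap dipeptide composition mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual definition (frequency of residue pairs separated by g intervening residues, over the 20 standard amino acids); the function name and the handling of non-standard residues are illustrative choices, not details taken from the paper.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def g_gap_dipeptide_composition(sequence, g):
    """Return the 400 g-gap dipeptide frequencies of a protein sequence.

    A g-gap dipeptide is a pair of residues separated by g other residues;
    g = 0 gives the ordinary adjacent dipeptide composition.
    """
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    total = len(sequence) - g - 1  # number of residue pairs at this gap
    if total <= 0:
        return [0.0] * len(pairs)
    for i in range(total):
        pair = sequence[i] + sequence[i + g + 1]
        if pair in counts:  # assumption: skip non-standard residues such as X
            counts[pair] += 1
    return [counts[p] / total for p in pairs]


if __name__ == "__main__":
    vector = g_gap_dipeptide_composition("ACAC", 1)
    print(len(vector), max(vector))
```

Feature vectors for several values of g could then be ranked (for example by ANOVA) and concatenated before training the classifier.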

Abstract 473: Identification of Apolipoproteins Using Feature Selection Technique

Arteriosclerosis Thrombosis and Vascular Biology ◽

10.1161/atvb.36.suppl_1.473 ◽

2016 ◽

Vol 36 (suppl_1) ◽

Author(s):

Hua Tang ◽

Hao Lin

Keyword(s):

Support Vector Machine ◽

Cross Validation ◽

Support Vector ◽

Feature Subset ◽

Risk Markers ◽

Dipeptide Composition ◽

Accurate Identification ◽

Feature Selection Technique ◽

Physiological Importance ◽

Fold Cross Validation

Objective: Apolipoproteins are of great physiological importance and are associated with different diseases such as dyslipidemia, thrombogenesis and angiocardiopathy. Apolipoproteins have therefore emerged as key risk markers and important research targets, yet the types of apolipoproteins have not been fully elucidated. Accurate identification of apolipoproteins is crucial to the comprehension of cardiovascular diseases and to drug design. The aim of this study is to develop a powerful model to precisely identify apolipoproteins. Approach and Results: We manually collected from UniProt a non-redundant dataset of 53 apolipoproteins and 136 non-apolipoproteins with sequence identity of less than 40%. After formulating the protein sequence samples with g-gap dipeptide composition (here g = 1 to 10), analysis of variance (ANOVA) was adopted to find the feature subset that achieves the best accuracy. A Support Vector Machine (SVM) was then used to perform classification. The predictive model was evaluated using five-fold cross-validation, which yielded a sensitivity of 96.2%, a specificity of 99.3%, and an accuracy of 98.4%. The study indicated that the proposed method could be a feasible means of conducting preliminary analyses of apolipoproteins. Conclusion: We demonstrated that apolipoproteins can be predicted from their primary sequences. We also discovered a special dipeptide distribution in apolipoproteins. These findings open new perspectives for improving apolipoprotein prediction by considering the specific dipeptides. We expect that these findings will help to improve drug development against angiocardiopathy. Key words: Apolipoproteins, Angiocardiopathy, Support Vector Machine
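The reported sensitivity, specificity and accuracy can be related to a confusion matrix with a short worked sketch. The abstract gives only the class sizes (53 positives, 136 negatives) and the three percentages; the individual counts below (51 true positives, 135 true negatives) are an inferred assumption that happens to reproduce those percentages, not figures stated by the authors.

```python
def sensitivity(tp, fn):
    """Fraction of positives that are correctly identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negatives that are correctly identified."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are correctly classified."""
    return (tp + tn) / (tp + tn + fp + fn)


# Assumed counts (not reported in the abstract): 53 positives, 136 negatives.
TP, FN = 51, 2
TN, FP = 135, 1

if __name__ == "__main__":
    print(round(100 * sensitivity(TP, FN), 1))
    print(round(100 * specificity(TN, FP), 1))
    print(round(100 * accuracy(TP, TN, FP, FN), 1))
```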

Predictor Selection for Bacterial Vaginosis Diagnosis Using Decision Tree and Relief Algorithms

Applied Sciences ◽

10.3390/app10093291 ◽

2020 ◽

Vol 10 (9) ◽

pp. 3291

Author(s):

Jesús F. Pérez-Gómez ◽

Juana Canul-Reich ◽

José Hernández-Torruco ◽

Betania Hernández-Ocaña

Keyword(s):

Feature Selection ◽

Decision Tree ◽

Bacterial Vaginosis ◽

Cross Validation ◽

Performance Comparison ◽

Support Vector ◽

Ongoing Research ◽

Selection For ◽

Comparison Of The Results ◽

Fold Cross Validation

Requiring only a few relevant characteristics from patients when diagnosing bacterial vaginosis is highly useful for physicians, as it reduces the time needed to collect the data. The result would be a dataset of patients that can be diagnosed more accurately using only a subset of informative or relevant features rather than the entire feature set. As such, this is a feature selection (FS) problem. In this work, decision tree and Relief algorithms were used as feature selectors. Experiments were conducted on a real dataset for bacterial vaginosis with 396 instances and 252 features/attributes. The dataset was obtained from universities located in Baltimore and Atlanta. The FS algorithms produced feature rankings, from which the top fifteen features formed a new dataset that was used as input to both support vector machine (SVM) and logistic regression (LR) classifiers. For performance evaluation, averages over 30 runs of 10-fold cross-validation were reported, with balanced accuracy, sensitivity, and specificity as performance measures. Performance using all features was compared with performance using only the top fifteen. The top-ranked attributes were similar to those reported in the literature. This study is part of ongoing research investigating a range of feature selection and classification methods.
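The Relief ranking step described above can be sketched as follows. This is a simplified, deterministic variant of basic two-class Relief (it visits every instance instead of sampling, and assumes numeric features and at least two instances per class); the function names and the toy data are hypothetical and are not taken from the study.

```python
def relief_weights(X, y):
    """Basic two-class Relief: reward features that differ on the nearest
    miss and penalize features that differ on the nearest hit."""
    n, d = len(X), len(X[0])
    # Feature ranges used to scale differences into [0, 1].
    spans = []
    for j in range(d):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, b, j):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(a, b, j) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        hit = min(hits, key=lambda k: distance(X[i], X[k]))
        miss = min(misses, key=lambda k: distance(X[i], X[k]))
        for j in range(d):
            weights[j] += (diff(X[i], X[miss], j) - diff(X[i], X[hit], j)) / n
    return weights


def top_k_features(weights, k):
    """Indices of the k highest-weighted features, best first."""
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:k]


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 is noise.
    X = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
    y = [0, 0, 1, 1]
    print(top_k_features(relief_weights(X, y), 1))
```

In the setting of the abstract, `top_k_features(weights, 15)` would correspond to keeping the fifteen top-ranked attributes.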

FEATURE SELECTION FOR IDENTIFYING PROTEIN-DISORDERED REGIONS

Biomedical Engineering Applications Basis and Communications ◽

10.4015/s1016237210001839 ◽

2010 ◽

Vol 22 (02) ◽

pp. 119-125 ◽

Cited By ~ 1

Author(s):

Hui-Huang Hsu ◽

Cheng-Wei Hsieh

Keyword(s):

Feature Selection ◽

Structure Prediction ◽

Cross Validation ◽

Tertiary Structure ◽

Information Gain ◽

Support Vector ◽

Stable Performance ◽

Selection For ◽

Fold Cross Validation ◽

Disordered Regions

Determining the structure of a protein is not an easy task; it usually involves a time-consuming and costly process in the wet lab. Using computational methods to predict a protein's tertiary structure from its primary structure (the amino acid sequence) is therefore desirable. Disordered regions are segments of a protein that do not have a fixed conformation, which makes structure prediction harder. These disordered regions are also functionally important for a protein. In this research, we aim to identify such regions with a focus on selecting a proper feature set. Three methods, namely F-score, information gain (IG), and k-medoids clustering, are used for feature selection. The support vector machine (SVM) is then used for classification. The results show that classification accuracy can be raised with a smaller feature set. The k-medoids clustering feature selection reduces the number of features from 440 to 150 and improves the accuracy from 84.66% to 86.81% in five-fold cross-validation. It also performs more stably than F-score and IG.
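As an illustration of one of the three selectors named above, the sketch below computes a per-feature F-score for a two-class problem. It assumes the commonly used definition (squared deviations of the class means from the overall mean, divided by the sum of the within-class sample variances); the abstract does not state which exact formula the authors used, and the function name is hypothetical.

```python
from statistics import mean, variance


def f_score(values, labels):
    """F-score of one feature for binary labels (1 = positive, 0 = negative).

    Larger values indicate that the feature separates the classes better.
    Each class needs at least two samples for the sample variance.
    """
    pos = [v for v, label in zip(values, labels) if label == 1]
    neg = [v for v, label in zip(values, labels) if label == 0]
    overall = mean(values)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within


if __name__ == "__main__":
    labels = [0, 0, 0, 1, 1, 1]
    print(f_score([1, 2, 3, 7, 8, 9], labels))  # well-separated feature
    print(f_score([1, 8, 3, 2, 9, 7], labels))  # noisy feature
```

Features would then be sorted by this score and the top-ranked ones kept for the classifier.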

A Two-Step Feature Selection Method to Predict Cancerlectins by Multiview Features and Synthetic Minority Oversampling Technique

BioMed Research International ◽

10.1155/2018/9364182 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 4

Author(s):

Runtao Yang ◽

Chengjin Zhang ◽

Lina Zhang ◽

Rui Gao

Keyword(s):

Feature Selection ◽

Molecular Mechanisms ◽

Cross Validation ◽

Selection Process ◽

Feature Selection Method ◽

Imbalanced Data ◽

Computational Method ◽

Accurate Identification ◽

Comparison Results ◽

Fold Cross Validation

Cancerlectins have an inhibitory effect on the growth of cancer cells and are currently being employed as therapeutic agents. The accurate identification of cancerlectins should provide insight into the molecular mechanisms of cancers. In this study, a new computational method based on the RF (Random Forest) algorithm is proposed to further improve the identification of cancerlectins. A hybrid feature space is built before feature selection by combining different individual feature spaces: CTD (Composition, Transition, and Distribution), PseAAC (Pseudo Amino Acid Composition), PSSM (Position-Specific Scoring Matrix), and disorder. SMOTE (Synthetic Minority Oversampling Technique) is applied to address the imbalanced data problem. To reduce feature redundancy and computational complexity, we propose a two-step feature selection process to select informative features. A 5-fold cross-validation technique is used to evaluate the various prediction strategies. The proposed method achieves a sensitivity of 0.779, a specificity of 0.717, an accuracy of 0.748, and an MCC (Matthews Correlation Coefficient) of 0.497. The prediction results are also compared with other existing methods on the same dataset using 5-fold cross-validation. The comparison results demonstrate the high effectiveness of our method for predicting cancerlectins.
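The four reported measures are linked through the confusion matrix, which the following sketch makes explicit. It assumes that the evaluated classes are of equal size (plausible after SMOTE, but not stated in the abstract), so the per-class rates can stand in for counts; under that assumption the reported sensitivity and specificity reproduce the reported accuracy and MCC.

```python
from math import sqrt


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix entries."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Assumption: balanced classes, so rates are used in place of counts.
SENSITIVITY, SPECIFICITY = 0.779, 0.717
TP, FN = SENSITIVITY, 1 - SENSITIVITY
TN, FP = SPECIFICITY, 1 - SPECIFICITY

if __name__ == "__main__":
    print(round((TP + TN) / 2, 3))         # accuracy under balanced classes
    print(round(mcc(TP, FP, TN, FN), 3))   # MCC under balanced classes
```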

Complaint Classification Using the Support Vector Machine (SVM) Method on the iRaise Helpdesk Facebook Group Account

Jurnal CoreIT Jurnal Hasil Penelitian Ilmu Komputer dan Teknologi Informasi ◽

10.24014/coreit.v3i1.3552 ◽

2018 ◽

Vol 3 (1) ◽

pp. 24

Author(s):

Fatmawati Fatmawati ◽

Muhammad Affandes

Keyword(s):

Support Vector Machine ◽

Feature Selection ◽

Cross Validation ◽

Support Vector ◽

Microsoft Word ◽

Customer Services ◽

Fold Cross Validation

Abstract: The iRaise Helpdesk Facebook Group is a social media service used by PTIPD UIN Suska Riau as the customer service channel for its academic system. Because the academic system recently migrated from its predecessor, SIMAK, to iRaise, problems still arise and become complaints from its users. PTIPD still processes complaint data manually using Microsoft Word and Excel. This study therefore classifies iRaise system problems into multiple classes, namely login, KRS (study plan), grades, and personal, using the Support Vector Machine (SVM) method with an RBF kernel. The dataset contains 1040 complaint records. Testing was performed with the RapidMiner application using 10-fold cross-validation and measured with a confusion matrix. The experiments show a highest accuracy of 95.67%, obtained without feature selection at C=2 and [...]. Keywords: confusion matrix, cross validation, iRaise, complaints, classification, RapidMiner, support vector machine.
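A multiclass confusion matrix of the kind used for evaluation here can be sketched in a few lines. The four class labels are the category names given in the abstract (nilai means grades); the example predictions are made up for illustration and do not come from the study, which used RapidMiner rather than custom code.

```python
CLASSES = ["login", "krs", "nilai", "personal"]  # categories named in the abstract


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Share of samples on the diagonal (correctly classified)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    # Hypothetical labels, for illustration only.
    truth = ["login", "login", "krs", "nilai", "personal", "krs"]
    guess = ["login", "krs", "krs", "nilai", "personal", "krs"]
    m = confusion_matrix(truth, guess, CLASSES)
    for row in m:
        print(row)
    print(accuracy(m))
```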

Combination of Support Vector Machine and K-Fold cross-validation for prediction of long-term degradation of the compressive strength of marine concrete

International Journal of Computational Physics Series ◽

10.29167/a1i1p120-130 ◽

2018 ◽

Vol 1 (1) ◽

pp. 120-130 ◽

Cited By ~ 1

Author(s):

Chunxiang Qian ◽

Wence Kang ◽

Hao Ling ◽

Hua Dong ◽

Chengyao Liang ◽

...

Keyword(s):

Support Vector Machine ◽

Environmental Factors ◽

Cross Validation ◽

Concrete Strength ◽

Simulation Method ◽

Support Vector ◽

Svm Model ◽

Artificial Neural Network Ann ◽

Influence Degree ◽

Fold Cross Validation

A Support Vector Machine (SVM) model optimized by K-fold cross-validation was built to predict and evaluate the degradation of concrete strength in a complicated marine environment. Several other models, such as an Artificial Neural Network (ANN) and a Decision Tree (DT), were also built and compared with the SVM to determine which could make the most accurate predictions. Both material and environmental factors that influence the results were considered. The material factors mainly involved the original concrete strength and the amount of cement replaced by fly ash and slag. The environmental factors consisted of the concentrations of Mg²⁺, SO₄²⁻ and Cl⁻, the temperature, and the exposure time. The prediction results indicated that the optimized SVM model performed better than the other models in predicting concrete strength. Based on the SVM model, a simulation method of variable limitation was used to determine the sensitivity of the various factors and their degree of influence on the degradation of concrete strength.
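The K-fold cross-validation used to tune the model can be sketched as an index splitter. This is a generic, unshuffled version written for illustration; the abstract does not say how the folds were formed or what value of K was used, so the function name and the example sizes are assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in k_fold_indices(10, 3):
        print(test, len(train))
```

Each candidate model setting would be trained on `train` and scored on `test` for every fold, and the mean score over the K folds used to pick the setting.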

Human Face Recognition Based on the Local Binary Pattern Algorithm

Emitor: Jurnal Teknik Elektro ◽

10.23917/emitor.v17i2.6232 ◽

2017 ◽

Vol 17 (2) ◽

pp. 29-38

Author(s):

Ratih Purwati ◽

Gunawan Ariyanto

Keyword(s):

Computer Vision ◽

Support Vector Machine ◽

Face Recognition ◽

Local Binary Pattern ◽

Cross Validation ◽

Support Vector ◽

Fold Cross Validation

Face recognition is a computer technology for identifying human faces from digital images stored in a database. A human face changes shape according to its expression. Human facial expressions resemble one another, so determining whose expression a given image shows can be somewhat difficult. Face recognition remains an active research topic in computer vision. Human faces are frequently used in features of social media applications such as Snapchat, Snapgram from Instagram, and many other social media applications that use this technology. This study analyzes human facial expression recognition using Local Binary Pattern features and seeks the most effective extension of the basic Local Binary Pattern algorithm by combining Histogram Equalization, Support Vector Machine, and K-fold cross-validation, in order to improve the recognition of human face images. The study uses several human face databases: JAFFE, which contains face images of 10 Japanese women with 7 emotional expressions (angry, sad, happy, disgusted, surprised, afraid, and neutral); YALE, which contains face images of Americans; and CALTECH, which consists of 450 images of 896 x 592 pixels stored in JPEG format. The data were then adjusted to the facial texture of each image.
From the combination of the three methods above and the experiments conducted, the best human face recognition results were obtained with the JAFFE dataset at a resolution of 92 x 112 pixels; a high level of processor usage can affect computation speed when running the system, thereby producing more accurate predictions.
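The basic Local Binary Pattern operator underlying this study can be sketched for a 3 x 3 neighborhood. The neighbor ordering (clockwise from the top-left pixel) and the use of "greater than or equal to" in the comparison are common conventions assumed here; the abstract does not specify which variant the authors implemented, and the function names are hypothetical.

```python
def lbp_code(patch):
    """8-bit LBP code of the center pixel of a 3x3 patch (list of 3 rows)."""
    center = patch[1][1]
    # Neighbors visited clockwise starting at the top-left pixel (assumed order).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels of a 2D image."""
    histogram = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            patch = [row[c - 1:c + 2] for row in image[r - 1:r + 2]]
            histogram[lbp_code(patch)] += 1
    return histogram


if __name__ == "__main__":
    patch = [[6, 5, 2], [7, 6, 1], [9, 8, 7]]
    print(lbp_code(patch))
```

The histogram is the texture descriptor that would be passed to a classifier such as an SVM.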

The Animal Classification: An Evaluation of Different Transfer Learning Pipeline

Mekatronika ◽

10.15282/mekatronika.v3i1.6680 ◽

2021 ◽

Vol 3 (1) ◽

pp. 27-31

Author(s):

Ken-ji Ee ◽

Ahmad Fakhri Bin Ab. Nasir ◽

Anwar P. P. Abdul Majeed ◽

Mohd Azraai Mohd Razman ◽

Nur Hafieza Ismail

Keyword(s):

Transfer Learning ◽

Classification System ◽

Cross Validation ◽

Support Vector ◽

Svm Classifier ◽

Average Classification Accuracy ◽

Validation Technique ◽

Search Approach ◽

Fold Cross Validation

An animal classification system is a technology that classifies the animal class (type) automatically and is useful in many applications. Many types of learning models have recently been applied to this technology. Nonetheless, extracting and classifying animal features is non-trivial, particularly in the deep learning approach to a successful animal classification system. Transfer Learning (TL) has been demonstrated to be a powerful tool for extracting essential features. However, the use of such a method in animal classification applications is somewhat limited. The present study aims to determine a suitable pipeline of TL with a conventional classifier for animal classification. VGG16 and VGG19 were used to extract features and were then coupled with either a k-Nearest Neighbour (k-NN) or a Support Vector Machine (SVM) classifier. Prior to that, a total of 4000 images were gathered, covering five classes: cows, goats, buffalos, dogs, and cats. The data were split in an 80:20 ratio for training and testing. The classifiers' hyperparameters were tuned by a grid search approach using five-fold cross-validation. The study showed that the best TL pipeline is VGG16 with an optimised SVM, which yielded an average classification accuracy of 0.975. The findings of the present investigation could facilitate animal classification applications, e.g. for monitoring animals in wildlife.
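The grid search used for hyperparameter tuning can be sketched generically. In the sketch below, `score_fn` is a stand-in for the mean five-fold cross-validation accuracy of a classifier trained with the given parameters; the parameter names `C` and `gamma`, the grid values, and the toy scoring function are all hypothetical and are not taken from the study.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # e.g. mean cross-validated accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical score that peaks at C = 1, gamma = 0.1."""
    return -abs(params["C"] - 1) - abs(params["gamma"] - 0.1)


if __name__ == "__main__":
    grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1]}
    print(grid_search(grid, toy_score))
```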

Feature Selection Using Random Forest Algorithm to Diagnose Tuberculosis From Lung CT Images

AI Innovation in Medical Imaging Diagnostics - Advances in Medical Technologies and Clinical Practice ◽

10.4018/978-1-7998-3092-4.ch005 ◽

2021 ◽

pp. 92-100

Author(s):

Beaulah Jeyavathana Rajendran ◽

Kanimozhi K. V.

Keyword(s):

Feature Selection ◽

Random Forest ◽

The Body ◽

Support Vector ◽

Feature Descriptor ◽

Feature Sets ◽

Modified Particle Swarm Optimization ◽

Tuberculosis Disease ◽

Optimal Feature ◽

Lung Ct

Tuberculosis is a hazardous infectious disease characterized by the growth of tubercles in the tissues. It mainly affects the lungs but can also affect other parts of the body. The disease can be readily diagnosed by radiologists. The main objective of this chapter is to obtain the best solution, selected by means of modified particle swarm optimization, which is regarded as the optimal feature descriptor. Five stages are used to detect tuberculosis: image pre-processing, lung segmentation, feature extraction, feature selection, and classification. These stages are used in medical image processing to identify tuberculosis. In feature extraction, the GLCM approach is used to extract the features, and the optimal features are selected from the extracted feature sets by random forest. Finally, a support vector machine classifier is used for image classification. Experiments were carried out and intermediate results obtained. The proposed system achieves better classification accuracy than the existing method.
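The GLCM (gray-level co-occurrence matrix) feature extraction step can be illustrated with a small sketch. It assumes an image already quantized to integer gray levels and a single horizontal offset; the chapter's actual offsets, number of gray levels and chosen texture features are not given in the abstract, so contrast is shown only as one common example.

```python
def glcm(image, levels, dr=0, dc=1):
    """Count co-occurrences of gray levels (i, j) at pixel offset (dr, dc)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def glcm_contrast(matrix):
    """Contrast: squared gray-level difference weighted by pair probability."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total


if __name__ == "__main__":
    image = [[0, 0, 1], [1, 2, 2]]  # toy image with 3 gray levels
    m = glcm(image, levels=3)
    print(m, glcm_contrast(m))
```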

PredAmyl-MLP: Prediction of Amyloid Proteins Using Multilayer Perceptron

Computational and Mathematical Methods in Medicine ◽

10.1155/2020/8845133 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Yanjuan Li ◽

Zitong Zhang ◽

Zhixia Teng ◽

Xiaoyan Liu

Keyword(s):

Feature Extraction ◽

Feature Selection ◽

Prediction Model ◽

Multilayer Perceptron ◽

Type Ii Diabetes ◽

Cross Validation ◽

Experimental Results ◽

Type Ii ◽

Fold Cross Validation ◽

Better Than

Amyloid is generally an aggregate of insoluble fibrin; its abnormal deposition is the pathogenic mechanism of various diseases, such as Alzheimer’s disease and type II diabetes. Therefore, accurately identifying amyloid is necessary to understand its role in pathology. We proposed a machine learning-based prediction model called PredAmyl-MLP, which consists of three steps: feature extraction, feature selection, and classification. In the feature extraction step, seven feature extraction algorithms and different combinations of them are investigated, and the combination of SVMProt-188D and tripeptide composition (TPC) is selected according to the experimental results. In the feature selection step, maximum relevance maximum distance (MRMD) and binomial distribution (BD) are each used to remove redundant or noisy features, and the appropriate features are selected according to the experimental results. In the classification step, we employed a multilayer perceptron (MLP) to train the prediction model. The 10-fold cross-validation results show that the overall accuracy of PredAmyl-MLP reached 91.59%, and the performance was better than that of existing methods.
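The tripeptide composition (TPC) features named in this abstract can be sketched as the normalized counts of all overlapping three-residue windows, giving an 8000-dimensional vector over the 20 standard amino acids. The function name and the treatment of windows containing non-standard residues (counted in the denominator but not in the vector) are assumptions made for this illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]  # 8000 entries


def tripeptide_composition(sequence):
    """Frequency of each of the 8000 tripeptides among overlapping windows."""
    windows = [sequence[i:i + 3] for i in range(len(sequence) - 2)]
    counts = {}
    for window in windows:
        counts[window] = counts.get(window, 0) + 1
    n = len(windows)
    if n == 0:
        return [0.0] * len(TRIPEPTIDES)
    return [counts.get(t, 0) / n for t in TRIPEPTIDES]


if __name__ == "__main__":
    vector = tripeptide_composition("ACDAC")
    print(len(vector), sum(1 for x in vector if x > 0))
```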
