Regulatory genes identification within functional genomics experiments for tissue classification into binary classes via machine learning techniques

Author(s):  
Bushra Wazir ◽  
Dost Muhammad Khan ◽  
Umair Khalil ◽  
Muhammad Hamraz ◽  
Naz Gul ◽  
...  

Abstract Objectives: The aim of this study is to filter out the most informative genes that mainly regulate the target tissue class, increase classification accuracy, reduce the curse of dimensionality, and discard redundant and irrelevant genes. Methods: This paper presents gene selection using a bagging sub-forest (BSF). The proposed method derives gene importance from the idea underlying the standard random forest algorithm. The new method is compared with three state-of-the-art methods, i.e., Wilcoxon, masked painter, and proportional overlapped score (POS). These methods were applied to five datasets, i.e., Colon, Lymph node breast cancer, Leukemia, Serrated colorectal carcinomas, and Breast Cancer. For the comparison, the top 20 genes were selected by each gene selection method, and random forest (RF) and support vector machine (SVM) classifiers were then applied to assess predictive performance on the datasets restricted to the selected genes. Classification accuracy, Brier score, and sensitivity were used as performance measures. Results: The proposed method gave better results than the other feature selection methods with both the random forest and SVM classifiers on all the datasets. Conclusion: The proposed method showed improved performance in terms of classification accuracy, Brier score, and sensitivity, and hence could be used as a novel method for gene selection to classify tissue samples into their correct classes. Key Words: Gene selection, classification, random forest, cancer, microarray gene expression.
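
Of the three performance measures named in the abstract above, the Brier score is probably the least familiar. The following is a minimal sketch of the standard binary Brier score, assuming labels coded as 0/1 and a classifier that outputs the probability of class 1; the function name and the toy numbers are illustrative and are not taken from the paper.

```python
def brier_score(y_true, p_pos):
    """Binary Brier score: mean squared gap between P(class 1) and the 0/1 label.

    Lower is better; 0.0 means every prediction was both correct and certain.
    """
    if len(y_true) != len(p_pos) or len(y_true) == 0:
        raise ValueError("y_true and p_pos must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pos)) / len(y_true)
```

For example, `brier_score([1, 0, 1, 1, 0], [0.9, 0.2, 0.4, 0.8, 0.6])` averages the squared errors 0.01, 0.04, 0.36, 0.04, and 0.36, giving roughly 0.162. Unlike accuracy, the score penalises a confident wrong probability more than a hesitant one.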

2018 ◽  
Vol 7 (3.27) ◽  
pp. 62
Author(s):  
J Briso Becky Bell ◽  
S Maria Celestin Vigila

In gene expression profiling, identifying the genes most highly expressed with respect to a disease has recently been in focus, both to study disease types and to distinguish normal samples from diseased ones. This paper describes four gene selection approaches, namely Pearson correlation, Signal to Noise Correlation, Feature Assessment by Sliding threshold, and Feature Assessment by Information Retrieval, for retrieving genes highly relevant to a specific disease. The experiment applies these gene selection methods to various disease datasets to select the ten most relevant genes; the selected genes are then used to train Support Vector Machine, K-Nearest Neighbour, and Naïve Bayes classifiers to separate the disease-specific classes. The performance of the classifiers is also compared with techniques from previous papers using classification accuracy.
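
As an illustration of filter-style gene ranking, here is a small sketch of the widely used signal-to-noise statistic, (mean of class 1 - mean of class 0) / (std of class 1 + std of class 0), followed by top-k selection. It is an assumption that the paper's "Signal to Noise Correlation" follows this common form; the dictionary layout and function names are hypothetical.

```python
from statistics import mean, stdev


def signal_to_noise(values, labels):
    """Signal-to-noise score of one gene for binary labels (0 or 1)."""
    pos = [v for v, c in zip(values, labels) if c == 1]
    neg = [v for v, c in zip(values, labels) if c == 0]
    denom = stdev(pos) + stdev(neg)  # needs at least two samples per class
    return (mean(pos) - mean(neg)) / denom if denom > 0 else 0.0


def top_k_genes(expression, labels, k):
    """Rank genes by absolute score and return the k best gene names.

    `expression` maps gene name -> list of expression values, one per sample.
    """
    scores = {g: abs(signal_to_noise(v, labels)) for g, v in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A call such as `top_k_genes(expression, labels, 10)` would mirror the "top ten genes" setting described above; the selected genes would then be passed to whichever classifier is being evaluated.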


2021 ◽  
Vol 7 ◽  
pp. e562
Author(s):  
Muhammad Hamraz ◽  
Naz Gul ◽  
Mushtaq Raza ◽  
Dost Muhammad Khan ◽  
Umair Khalil ◽  
...  

In this paper, a novel feature selection method called Robust Proportional Overlapping Score (RPOS), for microarray gene expression datasets has been proposed, by utilizing the robust measure of dispersion, i.e., Median Absolute Deviation (MAD). This method robustly identifies the most discriminative genes by considering the overlapping scores of the gene expression values for binary class problems. Genes with a high degree of overlap between classes are discarded and the ones that discriminate between the classes are selected. The results of the proposed method are compared with five state-of-the-art gene selection methods based on classification error, Brier score, and sensitivity, by considering eleven gene expression datasets. Classification of observations for different sets of selected genes by the proposed method is carried out by three different classifiers, i.e., random forest, k-nearest neighbors (k-NN), and support vector machine (SVM). Box-plots and stability scores of the results are also shown in this paper. The results reveal that in most of the cases the proposed method outperforms the other methods.
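
The abstract does not give the RPOS formula, so the following is only a loose sketch of the underlying idea: build a robust "core interval" for each class from the median and the Median Absolute Deviation, then measure how many samples fall where the two intervals overlap. The interval width, the scoring rule, and all function names are assumptions made for illustration, not the authors' definition.

```python
from statistics import median


def mad(values):
    """Median Absolute Deviation: median of |x - median(x)|."""
    m = median(values)
    return median([abs(v - m) for v in values])


def core_interval(values, width=1.0):
    """Robust interval median +/- width * MAD (width is a hypothetical knob)."""
    m, d = median(values), mad(values)
    return m - width * d, m + width * d


def overlap_fraction(values, labels, width=1.0):
    """Share of all samples lying inside the overlap of the two class intervals.

    0.0 means the class intervals are disjoint (a discriminative gene);
    values near 1.0 mean the classes are hard to tell apart on this gene.
    """
    pos = [v for v, c in zip(values, labels) if c == 1]
    neg = [v for v, c in zip(values, labels) if c == 0]
    lo_p, hi_p = core_interval(pos, width)
    lo_n, hi_n = core_interval(neg, width)
    lo, hi = max(lo_p, lo_n), min(hi_p, hi_n)
    if lo > hi:
        return 0.0
    return sum(1 for v in values if lo <= v <= hi) / len(values)
```

The MAD is what makes the interval robust: `mad([1, 2, 3, 4, 100])` is 1, so a single outlying expression value barely widens the interval, whereas a standard deviation would be dominated by it.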


Author(s):  
B. Venkatesh ◽  
J. Anuradha

In microarray data, high classification accuracy is difficult to achieve because of high dimensionality and irrelevant, noisy data. Such data also contain many gene expression features but few samples. To increase the classification accuracy and the processing speed of the model, an optimal number of features needs to be extracted, which can be achieved by applying a feature selection method. In this paper, we propose a hybrid ensemble feature selection method. The proposed method has two phases, a filter phase and a wrapper phase. In the filter phase, an ensemble technique aggregates the feature ranks of the Relief, minimum redundancy Maximum Relevance (mRMR), and Feature Correlation (FC) filter feature selection methods. This paper uses Fuzzy Gaussian membership function ordering to aggregate the ranks. In the wrapper phase, Improved Binary Particle Swarm Optimization (IBPSO) selects the optimal features, and an RBF kernel-based Support Vector Machine (SVM) classifier is used as the evaluator. The performance of the proposed model is compared with state-of-the-art feature selection methods on five benchmark datasets. Accuracy, Recall, Precision, and F1-Score are used as evaluation metrics. The experimental results show that the proposed method outperforms the other feature selection methods.
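
The filter phase described above combines several rankings into one. The sketch below shows one plausible way to do that with a Gaussian membership function: each rank position is mapped to a membership value that decays with distance from rank 1, and a feature's memberships are summed across the filter methods. The exact aggregation rule used in the paper is not stated in the abstract, so the formula, the `sigma` parameter, and the function names are assumptions.

```python
import math


def gaussian_membership(rank, sigma):
    """Membership in 'top-ranked': 1.0 at rank 1, decaying as rank grows."""
    return math.exp(-((rank - 1) ** 2) / (2 * sigma ** 2))


def aggregate_ranks(rankings, sigma=2.0):
    """Merge several best-first feature rankings into one consensus order.

    `rankings` is a list of lists of feature names, each ordered best first
    (for example one list per filter method).
    """
    scores = {}
    for ranking in rankings:
        for position, feature in enumerate(ranking, start=1):
            scores[feature] = scores.get(feature, 0.0) + gaussian_membership(position, sigma)
    # Highest total membership first; ties broken by name for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))
```

With three hypothetical rankings `["g1", "g2", "g3"]`, `["g2", "g1", "g3"]`, and `["g1", "g3", "g2"]`, the consensus order is `g1`, `g2`, `g3`. In a hybrid scheme the head of this list would then be handed to the wrapper phase as the candidate feature pool.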


Sensors ◽  
2020 ◽  
Vol 21 (1) ◽  
pp. 194
Author(s):  
Sarah Gonzalez ◽  
Paul Stegall ◽  
Harvey Edwards ◽  
Leia Stirling ◽  
Ho Chit Siu

The field of human activity recognition (HAR) often utilizes wearable sensors and machine learning techniques in order to identify the actions of the subject. This paper considers the activity recognition of walking and running while using a support vector machine (SVM) that was trained on principal components derived from wearable sensor data. An ablation analysis is performed in order to select the subset of sensors that yields the highest classification accuracy. The paper also compares principal components across trials to inform the similarity of the trials. Five subjects were instructed to perform standing, walking, running, and sprinting on a self-paced treadmill, and the data were recorded while using surface electromyography sensors (sEMGs), inertial measurement units (IMUs), and force plates. When all of the sensors were included, the SVM had over 90% classification accuracy using only the first three principal components of the data with the classes of stand, walk, and run/sprint (combined run and sprint class). It was found that sensors placed only on the lower leg produced higher accuracies than sensors placed on the upper leg. There was a small decrease in accuracy when the force plates were ablated, but the difference may not be operationally relevant. Using only accelerometers without sEMGs was shown to decrease the accuracy of the SVM.


2020 ◽  
Vol 14 ◽  

Breast Cancer (BC) is among the most common cancers and a leading cause of death in women throughout the world. Classification and data analysis tools are now widely used in the medical field for diagnosis, prognosis, and decision making, to help lower the risk of people dying or suffering from diseases. Advanced machine learning methods offer hope for patients because they help doctors detect potentially fatal diseases such as Breast Cancer early and with accurate outcomes. However, the results depend heavily on the techniques used for feature selection and classification, which determine how strong the resulting machine learning model is. In this paper, a performance comparison is conducted using four classifiers, namely Multilayer Perceptron (MLP), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Random Forest, on the Wisconsin Breast Cancer dataset to identify the most effective predictors. The main goal is to apply the best machine learning classification methods to predict Breast Cancer as benign or malignant, evaluated in terms of accuracy, F-measure, precision, and recall. Experimental results show that Random Forest achieves the highest accuracy, 99.26%, on this dataset and feature set, while SVM and KNN show 97.78% and 97.04% accuracy, respectively. MLP shows the lowest accuracy, 94.07%. All the experiments are conducted using RStudio as the data mining platform.
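
The four evaluation terms named above all derive from the same confusion matrix. The sketch below computes them for a binary task, under the assumption that malignant is coded as 1 and benign as 0; the function name and the coding are illustrative choices, and the study itself used RStudio rather than Python.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-measure for labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # malignant, caught
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # benign, cleared
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # benign, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # malignant, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

Reporting recall alongside accuracy matters in a diagnostic setting, because a missed malignant case (a false negative) is usually costlier than a false alarm, and accuracy alone does not distinguish the two.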


2021 ◽  
Vol 12 ◽  
Author(s):  
Yuan Zhao ◽  
Zhao-Yu Fang ◽  
Cui-Xiang Lin ◽  
Chao Deng ◽  
Yun-Pei Xu ◽  
...  

In recent years, single-cell RNA-seq (scRNA-seq) has become increasingly popular in fields such as biology and medical research. Analyzing scRNA-seq data can reveal complex cell populations and infer single-cell trajectories in cell development. Clustering is one of the most important methods for analyzing scRNA-seq data. In this paper, we focus on improving scRNA-seq clustering through gene selection, which also reduces the dimensionality of scRNA-seq data. Studies have shown that gene selection for scRNA-seq data can improve clustering accuracy, so it is important to select genes with cell type specificity. Combined with clustering methods, gene selection can also improve cell type identification. Here, we propose RFCell, a supervised gene selection method based on permutation and random forest classification. We first use RFCell and three existing gene selection methods to select gene sets on 10 scRNA-seq datasets. Then, three classical clustering algorithms are used to cluster the cells using the genes chosen by each gene selection method. We found that the gene selection performance of RFCell was better than that of the other gene selection methods.
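
The abstract says RFCell is based on permutation and random forest classification but gives no further detail. As a hedged illustration of the permutation idea only, the sketch below shuffles one gene at a time and records how much a fitted classifier's accuracy drops. To stay self-contained it swaps the random forest for a simple nearest-centroid classifier; that substitution, the helper names, and the repeat count are all assumptions and do not reproduce RFCell.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid stand-in classifier: one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, c in zip(X, y) if c == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == c for x, c in zip(X, y)) / len(y)


def permutation_importance(X, y, n_repeats=20, seed=0):
    """Mean accuracy drop when each gene (column) is shuffled across cells."""
    rng = random.Random(seed)
    model = fit_centroids(X, y)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # breaks the link between gene j and the label
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A gene whose shuffling leaves accuracy unchanged gets an importance of zero, while a gene that carries cell-type signal shows a positive drop. In practice the importance would be estimated on held-out cells rather than on the training data, as it is here for brevity.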


Author(s):  
Anirudh Reddy Cingireddy ◽  
Robin Ghosh ◽  
Supratik Kar ◽  
Venkata Melapu ◽  
Sravanthi Joginipeli ◽  
...  

Frequent testing of the entire population would help to identify individuals with active COVID-19 and allow us to identify concealed carriers. Molecular tests, antigen tests, and antibody tests are being widely used to confirm COVID-19 in the population. Molecular tests such as the real-time reverse transcription-polymerase chain reaction (rRT-PCR) test take from a minimum of 3 hours to a maximum of 4 days to return results. To overcome this issue, the authors suggest using machine learning and data mining tools to filter large populations at a preliminary level. The ML tools could reduce the testing population size by 20 to 30%. In this study, they used a subset of features from the full blood profile of patients at Israelita Albert Einstein hospital in Brazil. They used classification models, namely KNN, logistic regression, XGBoost, naive Bayes, decision tree, random forest, support vector machine, and multilayer perceptron, and validated them with k-fold cross-validation. Naive Bayes, KNN, and random forest stand out as the most predictive, with 88% accuracy each.
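
The k-fold cross-validation mentioned above can be sketched in a few lines: shuffle the sample indices once, deal them into k folds, and let each fold serve as the test set exactly once. The abstract does not state the value of k or whether the folds were stratified, so the plain (unstratified) version below and its function name are assumptions.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for plain k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each candidate model would be fitted on `train` and scored on `test` for every pair, and the k scores averaged. Every sample is tested exactly once, so the average is less sensitive to a lucky or unlucky single split.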


2020 ◽  
Vol 9 (2) ◽  
pp. 25-44
Author(s):  
Usha N. ◽  
Sriraam N. ◽  
Kavya N. ◽  
Bharathi Hiremath ◽  
Anupama K Pujar ◽  
...  

Breast cancer is one of the most common cancers in women. Early detection of breast cancer reduces the risk of death. Mammography is an efficient breast imaging technique for breast cancer screening. Computer aided diagnosis (CAD) systems reduce manual errors and help radiologists analyze mammogram images. Mammogram images are typically taken in two views, cranial-caudal (CC) and medio lateral oblique (MLO). The MLO view contains the pectoral muscle (chest muscle) at the upper right or left corner of the image. In this study, the pectoral muscle was removed using a semi-automated method. All the normal and abnormal images were filtered and enhanced to improve their quality. GLCM (Gray Level Co-occurrence Matrix) texture features were extracted and analyzed by varying the number of features in the feature set. A Linear Support Vector Machine (LSVM) was used as the classifier. The classification accuracy improved as the number of features in the GLCM feature set increased. Simulation results show an overall classification accuracy of 96.7% with 19 GLCM features using SVM classifiers.
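
To make the GLCM step concrete, the sketch below builds a normalised co-occurrence matrix for one pixel offset and derives a single texture feature, contrast. The abstract does not list the 19 features or the offsets used, so the offset, the symmetric counting, and the function names are illustrative assumptions; the image is a nested list of integer gray levels in the range 0 to `levels - 1`.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Normalised gray-level co-occurrence matrix for offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                if symmetric:
                    counts[j][i] += 1  # count the pair in both directions
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("offset leaves no pixel pairs inside the image")
    return [[v / total for v in row] for row in counts]


def contrast(p):
    """GLCM contrast: sum of (i - j)^2 * p[i][j]; 0 for a flat texture."""
    return sum(
        (i - j) ** 2 * p[i][j] for i in range(len(p)) for j in range(len(p))
    )
```

For the tiny image `[[0, 1], [0, 1]]` every horizontal neighbour pair differs by one gray level, so the contrast is 1.0, while `[[0, 0], [1, 1]]` has identical horizontal neighbours and a contrast of 0.0. Feature vectors built from several such statistics would then be fed to the SVM.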


2019 ◽  
Vol 21 (3) ◽  
pp. 80-92
Author(s):  
Madhuri Gupta ◽  
Bharat Gupta

Cancer is a disease in which cells in the body grow and divide uncontrollably. Breast cancer is the second most common cancer in women after lung cancer. Major advances in health sciences and biotechnology have produced a huge amount of gene expression and clinical data. Machine learning techniques are improving the early detection of breast cancer from this data. This research work focuses on the application of machine learning methods, data analytic techniques, tools, and frameworks in breast cancer research with respect to cancer survivability, cancer recurrence, and cancer prediction and detection. Two of the most widely used machine learning techniques for the detection of breast cancer are the support vector machine and the artificial neural network. The Apache Spark data processing engine is found to be compatible with most machine learning frameworks.


RSC Advances ◽  
2014 ◽  
Vol 4 (106) ◽  
pp. 61624-61630 ◽  
Author(s):  
N. S. Hari Narayana Moorthy ◽  
Silvia A. Martins ◽  
Sergio F. Sousa ◽  
Maria J. Ramos ◽  
Pedro A. Fernandes

Classification models to predict the solvation free energies of organic molecules were developed using decision tree, random forest and support vector machine approaches and with MACCS fingerprints, MOE and PaDEL descriptors.

