A Hybrid Supervised Machine Learning Classifier System for Breast Cancer Prognosis Using Feature Selection and Data Imbalance Handling Approaches

Yogendra Singh Solanki; Prasun Chakrabarti; Michal Jasinski; Zbigniew Leonowicz; Vadim Bolshev; Alexander Vinogradov; Elzbieta Jasinska; Radomir Gono; Mohammad Nami

doi:10.3390/electronics10060699

A Hybrid Supervised Machine Learning Classifier System for Breast Cancer Prognosis Using Feature Selection and Data Imbalance Handling Approaches

Electronics ◽

10.3390/electronics10060699 ◽

2021 ◽

Vol 10 (6) ◽

pp. 699

Author(s):

Yogendra Singh Solanki ◽

Prasun Chakrabarti ◽

Michal Jasinski ◽

Zbigniew Leonowicz ◽

Vadim Bolshev ◽

...

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Decision Tree ◽

Search Algorithm ◽

Cancer Prognosis ◽

Support Vector ◽

Kappa Statistics ◽

Genetic Search ◽

Tree Classifier

Nowadays, breast cancer is the most frequent cancer among women. Early detection is a critical issue that can be effectively achieved by machine learning (ML) techniques. Thus in this article, the methods to improve the accuracy of ML classification models for the prognosis of breast cancer are investigated. Wrapper-based feature selection approach along with nature-inspired algorithms such as Particle Swarm Optimization, Genetic Search, and Greedy Stepwise has been used to identify the important features. On these selected features popular machine learning classifiers Support Vector Machine, J48 (C4.5 Decision Tree Algorithm), Multilayer-Perceptron (a feed-forward ANN) were used in the system. The methodology of the proposed system is structured into five stages which include (1) Data Pre-processing; (2) Data imbalance handling; (3) Feature Selection; (4) Machine Learning Classifiers; (5) classifier’s performance evaluation. The dataset under this research experimentation is referred from the UCI Machine Learning Repository, named Breast Cancer Wisconsin (Diagnostic) Data Set. This article indicated that the J48 decision tree classifier is the appropriate machine learning-based classifier for optimum breast cancer prognosis. Support Vector Machine with Particle Swarm Optimization algorithm for feature selection achieves the accuracy of 98.24%, MCC = 0.961, Sensitivity = 99.11%, Specificity = 96.54%, and Kappa statistics of 0.9606. It is also observed that the J48 Decision Tree classifier with the Genetic Search algorithm for feature selection achieves the accuracy of 98.83%, MCC = 0.974, Sensitivity = 98.95%, Specificity = 98.58%, and Kappa statistics of 0.9735. Furthermore, Multilayer Perceptron ANN classifier with Genetic Search algorithm for feature selection achieves the accuracy of 98.59%, MCC = 0.968, Sensitivity = 98.6%, Specificity = 98.57%, and Kappa statistics of 0.9682.

Download Full-text

Importance of Feature Selection and Data Visualization Towards Prediction of Breast Cancer

Recent Patents on Computer Science ◽

10.2174/2213275912666190101121058 ◽

2019 ◽

Vol 12 (4) ◽

pp. 317-328 ◽

Cited By ~ 2

Author(s):

Rajalakshmi Krishnamurthi ◽

Niyati Aggrawal ◽

Lokendra Sharma ◽

Diva Srivastava ◽

Shivangi Sharma

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Feature Selection ◽

Support Vector ◽

Data Visualisation ◽

Learning Models ◽

Cancer Data ◽

Comparison Results ◽

Tree Classifier ◽

Machine Learning Models

Background: Breast cancer is one of the most common forms of cancers among women and the leading cause of death among them. Countries like United States, England and Canada have reported a high number of breast cancer patients every year and this number is continuously increasing due to detection at later stages. Hence, it is very important to create awareness among women and develop such algorithms which help to detect malignant cancer. Several research studies have been conducted to analyze the breast cancer data. Objective: This paper presents an effective method in predicting breast cancer and its stage and will also analyze the performance of different supervised learning algorithms such as Random Classifier, Chi2 Square test used in order to predict. The paper focuses on the three important aspects such as the feature selection, the corresponding data visualisation and finally making a prediction call on different machine learning models. Methods: The dataset used for this work is breast cancer Wisconsin data taken from UCI library. The dataset has been used to show the different 32 features which are all important and how it can be achieved using data visualisation. Secondly, after the feature selection, different machine learning models have been applied. Conclusion: The machine learning models involved are namely Support Vector Machine (SVM), KNearest Neighbour (KNN), Random Forest, Principal Component Analysis (PCA), Neural Network using Perceptron (NNP). This has been done to check which type of model is better under what conditions. At different stages several charts have been plotted and eliminated based on relative comparison. Results have shown that Random Tree classifier along with Chi2 Square proves to be an efficient one.

Download Full-text

PENERAPAN DECISION TREE C4.5 SEBAGAI SELEKSI FITUR DAN SUPPORT VECTOR MACHINE (SVM) UNTUK DIAGNOSA KANKER PAYUDARA

Jurnal Informatika ◽

10.30873/ji.v19i1.1442 ◽

2019 ◽

Vol 19 (1) ◽

pp. 54-61

Author(s):

Pakarti Riswanto ◽

RZ. Abdul Aziz ◽

Sriyanto -

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Data Mining ◽

Decision Tree ◽

Cancer Cells ◽

Support Vector ◽

Data Set ◽

Advantages And Disadvantages ◽

New Findings ◽

Tree Classifier

In the field of medicine, the use of data mining has a quite important and evolutionary role that can change the perspective of doctors, practitioners and health researchers in the process of detecting breast cancer in a patient. There are 2 classification applications in it, namely the process of diagnosing (diagnosing) cancer cells that distinguishes between tumors (benign cancer) or malignant cancer and prognosis (prognosis) to determine the possibility of reappearance of cancer cells in patients who have been operated on in the future. Data mining aims to describe new findings in the dataset and explain a process that uses statistical, mathematical, artificial intelligence, and machine learning techniques to extract and identify useful information and related knowledge from the database.Classification with data mining can be done using several methods, namely Decision Tree, K-Nearest Neighbor, Naive Bayes, ID3, CART, Linear Discriminant Analysis, etc., which certainly have advantages and disadvantages of each. But in this study, the author focuses on the classification of data mining using the Support Vector Mechine and Deccision Tree algorithms.This study will analyze the Breast Cancer Wisconsin Original data set obtained from the UCI Machine Learning Repository (repository of research data) to classify breast cancer malignancies. This time the author correlates between the Decision Tree classifier algorithm which has good ability to process large databases as a feature selection, then with a proper and relevant SVM Method used in analyzing and diagnosing breast breast cancer patients because it has accurate results for existing problems and several bases . Keywords— Data Mining, diagnosis, Decision Tree, SVM Method

Download Full-text

A Dataset Centric Feature Selection and Stacked Model to Detect Breast Cancer

International Journal of Intelligent Systems and Applications ◽

10.5815/ijisa.2021.04.03 ◽

2021 ◽

Vol 13 (4) ◽

pp. 24-37

Author(s):

Avijit Kumar Chaudhuri ◽

◽

Dilip K. Banerjee ◽

Anirban Das

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Support Vector Machine ◽

Random Forest ◽

Decision Tree ◽

Naive Bayes ◽

Naïve Bayes ◽

World Health ◽

Support Vector ◽

Kappa Statistics

World Health Organisation declared breast cancer (BC) as the most frequent suffering among women and accounted for 15 percent of all cancer deaths. Its accurate prediction is of utmost significance as it not only prevents deaths but also stops mistreatments. The conventional way of diagnosis includes the estimation of the tumor size as a sign of plausible cancer. Machine learning (ML) techniques have shown the effectiveness of predicting disease. However, the ML methods have been method centric rather than being dataset centric. In this paper, the authors introduce a dataset centric approach(DCA) deploying a genetic algorithm (GA) method to identify the features and a learning ensemble classifier algorithm to predict using the right features. Adaboost is such an approach that trains the model assigning weights to individual records rather than experimenting on the splitting of datasets alone and perform hyper-parameter optimization. The authors simulate the results by varying base classifiers i.e, using logistic regression (LR), decision tree (DT), support vector machine (SVM), naive bayes (NB), random forest (RF), and 10-fold crossvalidations with a different split of the dataset as training and testing. The proposed DCA model with RF and 10-fold cross-validations demonstrated its potential with almost 100% performance in the classification results that no research could suggest so far. The DCA satisfies the underlying principles of data mining: the principle of parsimony, the principle of inclusion, the principle of discrimination, and the principle of optimality. This DCA is a democratic and unbiased ensemble approach as it allows all features and methods in the start to compete, but filters out the most reliable chain (of steps and combinations) that give the highest accuracy. With fewer characteristics and splits of 50-50, 66-34, and 10 fold cross-validations, the Stacked model achieves 97 % accuracy. These values and the reduction of features improve upon prior research works. Further, the proposed classifier is compared with some state-of-the-art machine-learning classifiers, namely random forest, naive Bayes, support-vector machine with radial basis function kernel, and decision tree. For testing the classifiers, different performance metrics have been employed – accuracy, detection rate, sensitivity, specificity, receiver operating characteristic, area under the curve, and some statistical tests such as the Wilcoxon signed-rank test and kappa statistics – to check the strength of the proposed DCA classifier. Various splits of training and testing data –namely, 50–50%, 66–34%, 80–20% and 10-fold cross-validation – have been incorporated in this research to test the credibility of the classification models in handling the unbalanced data. Finally, the proposed DCA model demonstrated its potential with almost 100% performance in the classification results. The output results have also been compared with other research on the same dataset where the proposed classifiers were found to be best across all the performance dimensions.

Download Full-text

Performance Evaluation of Intrusion Detection System using Selected Features and Machine Learning Classifiers

Baghdad Science Journal ◽

10.21123/bsj.2021.18.2(suppl.).0884 ◽

2021 ◽

Vol 18 (2(Suppl.)) ◽

pp. 0884

Author(s):

Raja Azlina Raja Mahmood ◽

AmirHossien Abdi ◽

Masnida Hussin

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Intrusion Detection ◽

Decision Tree ◽

Network Traffic ◽

Intrusion Detection System ◽

Detection System ◽

Support Vector ◽

Large Network ◽

Tree Classifier

Some of the main challenges in developing an effective network-based intrusion detection system (IDS) include analyzing large network traffic volumes and realizing the decision boundaries between normal and abnormal behaviors. Deploying feature selection together with efficient classifiers in the detection system can overcome these problems. Feature selection finds the most relevant features, thus reduces the dimensionality and complexity to analyze the network traffic. Moreover, using the most relevant features to build the predictive model, reduces the complexity of the developed model, thus reducing the building classifier model time and consequently improves the detection performance. In this study, two different sets of selected features have been adopted to train four machine-learning based classifiers. The two sets of selected features are based on Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) approach respectively. These evolutionary-based algorithms are known to be effective in solving optimization problems. The classifiers used in this study are Naïve Bayes, k-Nearest Neighbor, Decision Tree and Support Vector Machine that have been trained and tested using the NSL-KDD dataset. The performance of the abovementioned classifiers using different features values was evaluated. The experimental results indicate that the detection accuracy improves by approximately 1.55% when implemented using the PSO-based selected features than that of using GA-based selected features. The Decision Tree classifier that was trained with PSO-based selected features outperformed other classifiers with accuracy, precision, recall, and f-score result of 99.38%, 99.36%, 99.32%, and 99.34% respectively. The results show that using optimal features coupling with a good classifier in a detection system able to reduce the classifier model building time, reduce the computational burden to analyze data, and consequently attain high detection rate.

Download Full-text

A Contemporary Machine Learning Method for Accurate Prediction of Cervical Cancer

SHS Web of Conferences ◽

10.1051/shsconf/202110204004 ◽

2021 ◽

Vol 102 ◽

pp. 04004

Author(s):

Jesse Jeremiah Tanimu ◽

Mohamed Hamada ◽

Mohammed Hassan ◽

Saratu Yusuf Ilu

Keyword(s):

Machine Learning ◽

Cervical Cancer ◽

Feature Selection ◽

Decision Tree ◽

Sensitivity And Specificity ◽

Missing Values ◽

New Technologies ◽

Machine Learning Techniques ◽

Screening Tests ◽

Tree Classifier

With the advent of new technologies in the medical field, huge amounts of cancerous data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that can analyze datasets to extract the conceived patterns, ideas, and hidden knowledge. The mined information can be used as a support in decision making for diagnostic processes. These techniques, while being able to predict future outcomes of certain diseases effectively, can discover and identify patterns and relationships between them from complex datasets. In this research, a predictive model for predicting the outcome of patients’ cervical cancer results has been developed, given risk patterns from individual medical records and preliminary screening tests. This work presents a Decision tree (DT) classification algorithm and shows the advantage of feature selection approaches in the prediction of cervical cancer using recursive feature elimination technique for dimensionality reduction for improving the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced. Therefore, a combination of under and oversampling techniques called SMOTETomek was employed. A comparative analysis of the proposed model has been performed to show the effectiveness of feature selection and class imbalance based on the classifier’s accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek has better results with an accuracy of 98%, sensitivity of 100%, and specificity of 97%. Decision Tree classifier is shown to have excellent performance in handling classification assignment when the features are reduced, and the problem of imbalance class is addressed.

Download Full-text

An improved diagnosis technique for breast cancer using LCFS and TreeHiCARe classifier model

Sensor Review ◽

10.1108/sr-09-2017-0200 ◽

2019 ◽

Vol 39 (1) ◽

pp. 107-120 ◽

Cited By ~ 2

Author(s):

Deepika Kishor Nagthane ◽

Archana M. Rajurkar

Keyword(s):

Breast Cancer ◽

Feature Extraction ◽

Feature Selection ◽

Decision Tree ◽

Association Rule ◽

Content Type ◽

Tree Classifier ◽

Mammogram Images ◽

Diagnosis Technique

PurposeOne of the main reasons for increase in mortality rate in woman is breast cancer. Accurate early detection of breast cancer seems to be the only solution for diagnosis. In the field of breast cancer research, many new computer-aided diagnosis systems have been developed to reduce the diagnostic test false positives because of the subtle appearance of breast cancer tissues. The purpose of this study is to develop the diagnosis technique for breast cancer using LCFS and TreeHiCARe classifier model.Design/methodology/approachThe proposed diagnosis methodology initiates with the pre-processing procedure. Subsequently, feature extraction is performed. In feature extraction, the image features which preserve the characteristics of the breast tissues are extracted. Consequently, feature selection is performed by the proposed least-mean-square (LMS)-Cuckoo search feature selection (LCFS) algorithm. The feature selection from the vast range of the features extracted from the images is performed with the help of the optimal cut point provided by the LCS algorithm. Then, the image transaction database table is developed using the keywords of the training images and feature vectors. The transaction resembles the itemset and the association rules are generated from the transaction representation based ona priorialgorithm with high conviction ratio and lift. After association rule generation, the proposed TreeHiCARe classifier model emanates in the diagnosis methodology. In TreeHICARe classifier, a new feature index is developed for the selection of a central feature for the decision tree centered on which the classification of images into normal or abnormal is performed.FindingsThe performance of the proposed method is validated over existing works using accuracy, sensitivity and specificity measures. The experimentation of proposed method on Mammographic Image Analysis Society database resulted in classification of normal and abnormal cancerous mammogram images with an accuracy of 0.8289, sensitivity of 0.9333 and specificity of 0.7273.Originality/valueThis paper proposes a new approach for the breast cancer diagnosis system by using mammogram images. The proposed method uses two new algorithms: LCFS and TreeHiCARe. LCFS is used to select optimal feature split points, and TreeHiCARe is the decision tree classifier model based on association rule agreements.

Download Full-text

A Composite Hybrid Feature Selection Learning-Based Optimization of Genetic Algorithm For Breast Cancer Detection

10.20944/preprints202003.0298.v1 ◽

2020 ◽

Author(s):

Ahmed Abdullah Farid ◽

Gamal Selim ◽

Hatem Khater

Keyword(s):

Breast Cancer ◽

Genetic Algorithm ◽

Feature Selection ◽

Early Stage ◽

Fitness Function ◽

Support Vector ◽

Initial Population ◽

Tree Classifier ◽

Selection Approach ◽

Feature Selection Approach

Breast cancer is a significant health issue across the world. Breast cancer is the most widely-diagnosed cancer in women; early-stage diagnosis of disease and therapies increase patient safety. This paper proposes a synthetic model set of features focused on the optimization of the genetic algorithm (CHFS-BOGA) to forecast breast cancer. This hybrid feature selection approach combines the advantages of three filter feature selection approaches with an optimize Genetic Algorithm (OGA) to select the best features to improve the performance of the classification process and scalability. We propose OGA by improving the initial population generating and genetic operators using the results of filter approaches as some prior information with using the C4.5 decision tree classifier as a fitness function instead of probability and random selection. The authors collected available updated data from Wisconsin UCI machine learning with a total of 569 rows and 32 columns. The dataset evaluated using an explorer set of weka data mining open-source software for the analysis purpose. The results show that the proposed hybrid feature selection approach significantly outperforms the single filter approaches and principal component analysis (PCA) for optimum feature selection. These characteristics are good indicators for the return prediction. The highest accuracy achieved with the proposed system before (CHFS-BOGA) using the support vector machine (SVM) classifiers was 97.3%. The highest accuracy after (CHFS-BOGA-SVM) was 98.25% on split 70.0% train, remainder test, and 100% on the full training set. Moreover, the receiver operating characteristic (ROC) curve was equal to 1.0. The results showed that the proposed (CHFS-BOGA-SVM) system was able to accurately classify the type of breast tumor, whether malignant or benign.

Download Full-text

Diagnosis of breast cancer using machine learning algorithms based on features selected by Genetic Algorithm: Assessed on five datasets

Journal of University of Shanghai for Science and Technology ◽

10.51201/jusst/21/11963 ◽

2021 ◽

Vol 23 (11) ◽

pp. 749-758

Author(s):

Saranya N ◽

◽

Kavi Priya S ◽

Keyword(s):

Breast Cancer ◽

Machine Learning ◽

Genetic Algorithm ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Cancer Prognosis ◽

Support Vector ◽

Breast Cancer Dataset ◽

Human Beings ◽

Original Dataset

Breast Cancer is one of the chronic diseases occurred to human beings throughout the world. Early detection of this disease is the most promising way to improve patients’ chances of survival. The strategy employed in this paper is to select the best features from various breast cancer datasets using a genetic algorithm and machine learning algorithm is applied to predict the outcomes. Two machine learning algorithms such as Support Vector Machines and Decision Tree are used along with Genetic Algorithm. The proposed work is experimented on five datasets such as Wisconsin Breast Cancer-Diagnosis Dataset, Wisconsin Breast Cancer-Original Dataset, Wisconsin Breast Cancer-Prognosis Dataset, ISPY1 Clinical trial Dataset, and Breast Cancer Dataset. The results exploit that SVM-GA achieves higher accuracy of 98.16% than DT-GA of 97.44%.

Download Full-text

Supervised machine learning based liver disease prediction approach with LASSO feature selection

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i6.3242 ◽

2021 ◽

Vol 10 (6) ◽

pp. 3369-3376

Author(s):

Saima Afrin ◽

F. M. Javed Mehedi Shamrat ◽

Tafsirul Islam Nibir ◽

Mst. Fahmida Muntasim ◽

Md. Shakil Moharram ◽

...

Keyword(s):

Machine Learning ◽

Feature Selection ◽

Liver Disease ◽

Decision Tree ◽

Medical Science ◽

Supervised Machine Learning ◽

Gradient Boosting ◽

Support Vector ◽

Machine Learning Classification ◽

Prediction Approach

In this contemporary era, the uses of machine learning techniques are increasing rapidly in the field of medical science for detecting various diseases such as liver disease (LD). Around the globe, a large number of people die because of this deadly disease. By diagnosing the disease in a primary stage, early treatment can be helpful to cure the patient. In this research paper, a method is proposed to diagnose the LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting and support vector machine (SVM). We also deployed a least absolute shrinkage and selection operator (LASSO) feature selection technique on our taken dataset to suggest the most highly correlated attributes of LD. The predictions with 10 fold cross-validation (CV) made by the algorithms are tested in terms of accuracy, sensitivity, precision and f1-score values to forecast the disease. It is observed that the decision tree algorithm has the best performance score where accuracy, precision, sensitivity and f1-score values are 94.295%, 92%, 99% and 96% respectively with the inclusion of LASSO. Furthermore, a comparison with recent studies is shown to prove the significance of the proposed system.

Download Full-text

A Comparative Study to Evaluate the Performance of Classification Algorithms in Mammogram Analysis

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.6.14960 ◽

2018 ◽

Vol 7 (3.6) ◽

pp. 154

Author(s):

S K. Sajan ◽

M Germanus Alex

Keyword(s):

Breast Cancer ◽

Neural Network ◽

Decision Tree ◽

Automated System ◽

Support Vector ◽

Classification Algorithms ◽

Neural Network Classifier ◽

Decision Tree Classifier ◽

Tree Classifier ◽

Mammogram Image

Breast cancer is a major threat humans are facing irrespective of geographical limits. The awareness about breast cancer has increased during the last decade and many preventive measures were in practice to detect the breast cancer before the symptoms were felt. Mammography is a screening methodology currently in practice. In this paper the mammogram image is analyzed using automated system. The automated system is designed to be capable of distinguishing the mammogram image into a normal or malignant. This process involves image enhancement and image segmentation at preprocessing level. Histogram equalization technique is used to transform low contrast region of the mammogram into region with higher contrast and Fuzzy C Means (FCM) algorithm is used to segment the mammogram image into regions suitable for further analysis. After enhancement and segmentation at preprocessing level the classification is done using three classification algorithms like decision tree classifier, Neural Network classifier and Support Vector Machine (SVM). The performance of the classification algorithms is evaluated using the following criteria like speed, flexibility, robustness, scalability, interpretability, Time complexity and also based on accuracy, sensitivity and specificity. The results obtained in classification are compared with other classification algorithms. It is found that the neural network classifier approach produces better results compared to other classifiers.The average accuracy in diagnosis by Neural Network approach classifier is around 91%. Also it is found that the decision tree approach is much flexible and easy to use compared to other approaches.

Download Full-text