Dealing with Missing Values in a Probabilistic Decision Tree during Classification

Author(s):  
Lamis Hawarah ◽  
Ana Simonet ◽  
Michel Simonet
2021 ◽  
Vol 102 ◽  
pp. 04004
Author(s):  
Jesse Jeremiah Tanimu ◽  
Mohamed Hamada ◽  
Mohammed Hassan ◽  
Saratu Yusuf Ilu

With the advent of new technologies in the medical field, large amounts of cancer data have been collected and are readily accessible to the medical research community. Over the years, researchers have employed advanced data mining and machine learning techniques to develop better models that can analyze datasets to extract hidden patterns and knowledge. The mined information can be used to support decision making in diagnostic processes. Besides predicting future outcomes of certain diseases effectively, these techniques can discover patterns and relationships in complex datasets. In this research, a model has been developed to predict patients’ cervical cancer outcomes from risk patterns in individual medical records and preliminary screening tests. This work presents a decision tree (DT) classification algorithm and shows the advantage of feature selection in predicting cervical cancer, using recursive feature elimination for dimensionality reduction to improve the accuracy, sensitivity, and specificity of the model. The dataset employed here suffers from missing values and is highly imbalanced. Therefore, a combination of under- and oversampling techniques called SMOTETomek was employed. A comparative analysis of the proposed model has been performed to show the effect of feature selection and class-imbalance handling on the classifier’s accuracy, sensitivity, and specificity. The DT with the selected features and SMOTETomek gives the best results, with an accuracy of 98%, sensitivity of 100%, and specificity of 97%. The decision tree classifier is shown to perform very well on classification tasks when the features are reduced and the class imbalance problem is addressed.
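The three figures reported in this abstract are all derived from the counts in a binary confusion matrix. The following is a minimal sketch of that arithmetic, not the authors' code; the function name `classification_metrics` and the toy labels are assumptions made for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        # Share of all cases labelled correctly.
        "accuracy": (tp + tn) / total if total else nan,
        # Share of true positives that were detected (recall).
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of true negatives that were correctly rejected.
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Hypothetical, imbalanced toy labels (not data from the paper).
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

On an imbalanced dataset, accuracy alone can look high even when the minority class is missed, which is why sensitivity and specificity are reported alongside it.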


2017 ◽  
Vol 7 (3) ◽  
pp. 1473-1477
Author(s):  
J Jayanthi ◽  
Gurpreet Kaur ◽  
K Suresh Joseph ◽  
...  

2014 ◽  
Vol 29 (4) ◽  
pp. 372-379
Author(s):  
Masahiro Sugimoto ◽  
Masahiro Takada ◽  
Masakazu Toi

Nomograms are a standard computational tool to predict the likelihood of an outcome using multiple available patient features. We have developed a more powerful data mining methodology to predict axillary lymph node (AxLN) metastasis and response to neoadjuvant chemotherapy (NAC) in primary breast cancer patients, and we developed websites that make these tools available. The tools calculate the probability of AxLN metastasis (AxLN model) and of pathological complete response to NAC (NAC model). As the calculation algorithm, we employed a decision tree-based prediction model known as the alternating decision tree (ADTree), which is a generalization of if-then type decision trees. An ensemble technique was used to combine multiple ADTree predictions, resulting in higher generalization ability and robustness against missing values. The AxLN model was developed with training datasets (n=148) and test datasets (n=143), and validated using an independent cohort (n=174), yielding an area under the receiver operating characteristic curve (AUC) of 0.768. The NAC model was developed and validated with n=150 and n=173 datasets from a randomized controlled trial, yielding an AUC of 0.787. The AxLN and NAC models require users to input up to 17 and 16 variables, respectively. These include pathological features, such as human epidermal growth factor receptor 2 (HER2) status, and imaging findings. Each input variable has an option of “unknown,” to facilitate prediction for cases with missing values. The websites facilitate the use of these tools and serve as a database for accumulating new datasets.
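One way an alternating decision tree can tolerate an “unknown” input is to skip any decision node whose feature is missing and sum the prediction values on the paths that remain. The sketch below illustrates that idea and a simple score-averaging ensemble. It is an assumption-laden illustration, not the published AxLN or NAC model: the tree layout, the feature names (`tumor_size_cm`, `her2_positive`), the node values, and the logistic link are all hypothetical.

```python
import math


def adtree_score(node, features):
    """Sum prediction values over every path the instance can follow."""
    score = node["value"]
    for splitter in node.get("splitters", []):
        x = features.get(splitter["feature"])
        if x is None:
            # "Unknown" input: this splitter and its subtree add nothing.
            continue
        branch = "yes" if x < splitter["threshold"] else "no"
        score += adtree_score(splitter[branch], features)
    return score


def ensemble_probability(trees, features):
    """Average the tree scores, then map to (0, 1) with a logistic link
    (the link function is an assumption of this sketch)."""
    mean_score = sum(adtree_score(t, features) for t in trees) / len(trees)
    return 1.0 / (1.0 + math.exp(-mean_score))


# Hypothetical trees; all values are made up for illustration.
tree_a = {
    "value": -0.5,
    "splitters": [{
        "feature": "tumor_size_cm", "threshold": 2.0,
        "yes": {"value": -0.4},
        "no": {
            "value": 0.6,
            "splitters": [{
                "feature": "her2_positive", "threshold": 0.5,
                "yes": {"value": -0.1},
                "no": {"value": 0.3},
            }],
        },
    }],
}
tree_b = {
    "value": 0.1,
    "splitters": [{
        "feature": "her2_positive", "threshold": 0.5,
        "yes": {"value": -0.2},
        "no": {"value": 0.4},
    }],
}

# HER2 status is entered as "unknown" (None); prediction still proceeds.
patient = {"tumor_size_cm": 3.1, "her2_positive": None}
probability = ensemble_probability([tree_a, tree_b], patient)
```

Because each tree only loses the contribution of the splitters it cannot evaluate, combining several trees that rely on different features keeps the prediction usable when some inputs are unknown.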

