The Hierarchical Classifier for COVID-19 Resistance Evaluation

Data ◽  
2021 ◽  
Vol 6 (1) ◽  
pp. 6
Author(s):  
Nataliya Shakhovska ◽  
Ivan Izonin ◽  
Nataliia Melnykova

Finding dependencies in the data requires analyzing the relations between dozens of parameters of the studied process and hundreds of possible sources of influence on it. The dependencies are nondeterministic, so modeling requires statistical methods for analyzing random processes. Part of the information is often hidden from observation or not monitored, which creates many difficulties when analyzing the collected information. The paper aims to find frequent patterns and the parameters affected by COVID-19. The novelty of the paper is a hierarchical architecture that combines supervised and unsupervised methods, allowing the development of an ensemble based on k-means clustering and classification. The best classifiers in the ensemble are a random forest with 500 trees and XGBoost. Classification within the separated clusters gives accuracy 4% higher than analysis of the whole dataset. The proposed approach can also be used for personalized-medicine decision support in other domains. Feature selection identifies the features with the highest impact on COVID-19: age, sex, blood group, and prior influenza.
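The cluster-then-classify idea in this abstract can be sketched as follows: partition the data with k-means, then train a separate random-forest classifier per cluster and route each new sample to its cluster's model. The dataset here is synthetic, not the COVID-19 data used by the authors, and the per-cluster routing is an illustrative reading of the architecture rather than the paper's exact pipeline.

```python
# Sketch: k-means clustering followed by one random forest per cluster.
# Synthetic data stands in for the paper's COVID-19 dataset.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# One 500-tree forest per cluster, mirroring the "separated clusters" step.
models = {}
for c in range(3):
    mask = kmeans.labels_ == c
    models[c] = RandomForestClassifier(n_estimators=500, random_state=0)
    models[c].fit(X_tr[mask], y_tr[mask])

# Route each test point to the model of its nearest cluster.
clusters = kmeans.predict(X_te)
preds = np.array([models[c].predict(x.reshape(1, -1))[0]
                  for c, x in zip(clusters, X_te)])
accuracy = (preds == y_te).mean()
```

A global classifier trained on the full training set would give the baseline against which the abstract's 4% improvement is measured.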

2021 ◽  
Vol 2078 (1) ◽  
pp. 012027
Author(s):  
Ze yuan Liu ◽  
Xin long Li

Abstract The remarkable advances in ensemble machine learning methods, such as random forest algorithms, have enabled significant analyses of large data. However, these algorithms use only the existing features during learning, which imposes an upper limit on accuracy no matter how well the algorithms perform. Moreover, classification accuracy is low especially when one class's proportion in the training dataset is much smaller than the others'. The aim of the present study is to design a hierarchical classifier that extracts new features using ensemble machine learning regressors and statistical methods inside the overall machine learning process. In stage 1, the categorical variables are processed by a random forest algorithm to create a new variable through regression analysis, while the remaining numerical variables serve as input to a factor analysis (FA) that computes factor scores for each observation. Then, all the features are learned by a random forest classifier in stage 2. Diversified datasets consisting of categorical and numerical variables are used to evaluate the method. The experimental results show that classification accuracy increased by 8.61%. The method also significantly improves the classification accuracy of classes with low proportions in the training dataset.
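A minimal sketch of the two-stage idea, under stated assumptions: stage 1 turns the categorical block into one engineered feature via a random-forest regression and compresses the numeric block with factor analysis; stage 2 feeds the engineered features to a random-forest classifier. The data, the choice of regression target, and the number of factors are all illustrative placeholders, not the authors' settings.

```python
# Sketch of the two-stage hierarchical classifier described above.
# Stage 1: engineer features from categorical and numerical blocks.
# Stage 2: train a random forest on the engineered features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X_num, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_cat = rng.integers(0, 4, size=(500, 3)).astype(float)  # ordinal-coded categories

# Stage 1a: regress a numeric column on the categorical block to create
# one new variable (an illustrative choice of regression target).
reg = RandomForestRegressor(n_estimators=100, random_state=0)
cat_feature = reg.fit(X_cat, X_num[:, 0]).predict(X_cat).reshape(-1, 1)

# Stage 1b: factor scores for the numerical block.
factors = FactorAnalysis(n_components=2, random_state=0).fit_transform(X_num)

# Stage 2: learn on the engineered feature set.
X_new = np.hstack([cat_feature, factors])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_new, y)
train_acc = clf.score(X_new, y)
```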


2020 ◽  
Vol 22 (Supplement_2) ◽  
pp. ii135-ii136
Author(s):  
John Lin ◽  
Michelle Mai ◽  
Saba Paracha

Abstract Glioblastoma multiforme (GBM), the most common form of glioma, is a malignant tumor with a high risk of mortality. By providing accurate survival estimates, prognostic models have been identified as promising tools in clinical decision support. In this study, we produced and validated two machine learning-based models to predict survival time for GBM patients. Publicly available clinical and genomic data from The Cancer Genome Atlas (TCGA) and Broad Institute GDAC Firehose were obtained through cBioPortal. Random forest and multivariate regression models were created to predict survival. Predictive accuracy was assessed and compared through mean absolute error (MAE) and root mean square error (RMSE) calculations. 619 GBM patients were included in the dataset. There were 381 (62.9%) cases of recurrence/progression and 53 (8.7%) cases of disease-free survival. The MAE and RMSE values were 0.553 and 0.887 years respectively for the random forest regression model, and 1.756 and 2.451 years respectively for the multivariate regression model. Both models accurately predicted overall survival. Comparison of the models through MAE, RMSE, and visual analysis showed higher accuracy for random forest than for multivariate linear regression. Further investigation of feature selection and model optimization may improve predictive power. These findings suggest that using machine learning in GBM prognostic modeling will improve clinical decision support. *Co-first authors.
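The model comparison described above can be sketched generically: fit a random-forest regressor and an ordinary least-squares model on the same split, then score both with MAE and RMSE. The regression data here is synthetic, not the TCGA cohort, so no conclusion about which model wins should be read from this sketch.

```python
# Sketch: compare a random-forest regressor and a linear model by
# MAE and RMSE, the two error metrics used in the abstract.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, model in [("random_forest", RandomForestRegressor(random_state=0)),
                    ("linear", LinearRegression())]:
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "mae": mean_absolute_error(y_te, pred),
        "rmse": np.sqrt(mean_squared_error(y_te, pred)),  # RMSE >= MAE always
    }
```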


2019 ◽  
Vol 40 (Supplement_1) ◽  
Author(s):  
E Watanabe ◽  
T Yamashita ◽  
H Inoue ◽  
H Atarashi ◽  
K Okumura ◽  
...  

Abstract Background Atrial fibrillation (AF) is associated with increased mortality and morbidity. Modelling the risk of thrombosis, major bleeding, and total mortality is often limited by an inadequate number of independent predictors. Purpose We compared the predictive accuracy of a decision-support tool framework and conventional risk scores in AF patients. Methods We used data on AF patients enrolled in the nationwide AF registry. A random forest model was implemented to predict each outcome, and its predictive power was tested by 5-fold cross-validation. Results We analyzed 7,937 patients with AF (age 70±10 years, female 31%). The type of AF was paroxysmal (37%), persistent (14%), and permanent (49%). The numbers of antithrombotic treatments were as follows: warfarin only (n=5,461), antiplatelet only (n=581), both warfarin and antiplatelet (n=1,471), and no antithrombotic agents (n=424). The mean CHA2DS2-VASc score was 2.8±1.6 and the mean HAS-BLED score was 2.7±1.2. We selected 20 of 50 clinical parameters and compared the model by area under the curve with the CHA2DS2-VASc score for thromboses and the HAS-BLED score for major bleeding. During the 2-year follow-up, 126 patients (1.6%) had thromboses, 140 (1.8%) had major bleeding, and 195 (2.5%) died. The random forest model had a higher area under the curve for predicting thromboses than the CHA2DS2-VASc score (0.66 vs. 0.61, P<0.05) and a significantly higher area under the curve for major bleeding than the HAS-BLED score (0.67 vs. 0.61, P<0.05). The area under the curve for all-cause mortality was 0.77. Conclusions A random forest model predicts thromboses and major bleeding, as well as total mortality, more accurately than conventional risk schemes.
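The validation step the abstract describes, a random forest scored by area under the curve over 5-fold cross-validation, can be sketched as below. The labels are synthetic and made heavily imbalanced (roughly 3% positives) to mimic the rarity of the registry's thrombosis and bleeding events; nothing here uses the actual registry data.

```python
# Sketch: 5-fold cross-validated AUC for a random-forest classifier
# on an imbalanced synthetic outcome, mirroring the abstract's setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.97], random_state=0)  # ~3% positives

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
mean_auc = auc_scores.mean()
```

Comparing `mean_auc` against the AUC of a fixed risk score (such as CHA2DS2-VASc) on the same folds would reproduce the comparison reported in the Results.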


Author(s):  
Nabilah Alias ◽  
Cik Feresa Mohd Foozy ◽  
Sofia Najwa Ramli ◽  
Naqliyah Zainuddin

Nowadays, social media platforms (e.g., YouTube and Facebook) connect people and let them interact by posting comments or videos. Comments are part of a website's content and can attract spammers who spread phishing, malware, or advertising. Because such malicious users can spread malware or phishing through comments, this work proposes a technique for detecting spam-comment features on video-sharing sites. The first phase of the methodology is dataset collection; for this experiment, a dataset from the UCI Machine Learning Repository is used. The next phase is the development of the framework and the experimentation. The dataset is pre-processed using tokenization and lemmatization. After that, the features for detecting spam are selected, and the classification experiments are performed with six classifiers: Random Tree, Random Forest, Naïve Bayes, KStar, Decision Table, and Decision Stump. The results show that the highest accuracy is 90.57% and the lowest is 58.86%.
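A toy version of the comment-spam pipeline might look like the following: vectorise the text (the tokenization step), then fit two of the six classifiers named above. The five comments and their labels are invented, not taken from the UCI dataset, and the lemmatization step (e.g. via NLTK) is omitted for brevity.

```python
# Sketch: tokenise spam/ham comments and train two of the classifiers
# mentioned in the abstract. Comments below are invented examples.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

comments = ["check out my channel for free prizes",
            "subscribe here to win money now",
            "great video, thanks for sharing",
            "really enjoyed the explanation",
            "free money click this link now"]
labels = [1, 1, 0, 0, 1]  # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(comments)  # tokenisation + term counts

accuracies = {}
for name, clf in [("naive_bayes", MultinomialNB()),
                  ("random_forest", RandomForestClassifier(random_state=0))]:
    accuracies[name] = clf.fit(X, labels).score(X, labels)
```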


2021 ◽  
Vol 348 ◽  
pp. 01002
Author(s):  
Assia Najm ◽  
Abdelali Zakrani ◽  
Abdelaziz Marzak

Software cost prediction is a crucial element of a project's success because it helps project managers efficiently estimate the effort needed for any project. Many machine learning methods exist in the literature, such as decision trees, artificial neural networks (ANN), and support vector regressors (SVR). However, many studies confirm that accurate estimates depend greatly on hyperparameter optimization and on proper input feature selection, which strongly affects the accuracy of software cost prediction models (SCPM). In this paper, we propose an enhanced model using SVR and the Optainet algorithm. Optainet is used simultaneously (1) to select the best set of features and (2) to tune the parameters of the SVR model. The experimental evaluation was conducted using a 30% holdout over seven datasets. The performance of the suggested model is then compared to a tuned SVR model using Optainet without feature selection. The results are also compared to the Boruta and random-forest feature selection methods. The experiments show that, across all datasets, the Optainet-based method significantly improves the accuracy of the SVR model and outperforms the random-forest and Boruta feature selection methods.
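The Optainet immune algorithm itself is not reproduced here; as a stand-in, this sketch does what the paper's pipeline does in spirit: jointly search over a feature subset and SVR hyperparameters, scoring candidates on a 30% holdout (matching the abstract's evaluation) and keeping the best. The random search, iteration count, and parameter ranges are assumptions for illustration only.

```python
# Sketch: joint feature-subset and hyperparameter search for SVR,
# with random search standing in for the Optainet immune algorithm.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

best = {"score": -np.inf}
for _ in range(30):                    # candidate solutions ("antibodies")
    mask = rng.random(10) < 0.5        # random feature subset
    if not mask.any():
        continue
    C = 10.0 ** rng.uniform(-1, 3)     # random hyperparameters
    eps = 10.0 ** rng.uniform(-2, 0)
    model = SVR(C=C, epsilon=eps).fit(X_tr[:, mask], y_tr)
    score = model.score(X_te[:, mask], y_te)  # R^2 on the 30% holdout
    if score > best["score"]:
        best = {"score": score, "mask": mask, "C": C, "epsilon": eps}
```

Optainet would replace the random proposals with clonal selection and mutation of the best candidates, but the fitness evaluation per candidate is the same.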

