SELECTING KDD FEATURES AND USING RANDOM CLASSIFICATION TREE FOR PREDICTING ATTACKS

2014 ◽  
pp. 115-123
Author(s):  
Rachid Beghdad

The purpose of this study is to identify some higher-level KDD features and to train the resulting set with an appropriate machine learning technique in order to classify and predict attacks. To achieve that, a two-step approach is proposed. First, Fisher’s ANOVA technique was used to deduce the important features. Second, four types of classification trees, ID3, C4.5, classification and regression tree (CART), and random tree (RndT), were tested to classify and detect attacks. According to our tests, the RndT leads to the best results, which is why we present the classification and prediction results of this technique in detail here. Some of the remaining results will be used later to make comparisons. We used the KDD’99 data sets to evaluate the considered algorithms. For these evaluations, only the case of the four attack categories was considered. Our simulations show the efficiency of our approach and also show that it is very competitive with some similar previous works.
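The first step described above, ranking features with a one-way ANOVA F statistic, can be sketched as follows. This is a minimal illustration and not the author's implementation: the two feature columns and their values are toy data invented for the example, and the helper names `anova_f_score` and `rank_features` are hypothetical.

```python
from statistics import mean

def anova_f_score(values, labels):
    """One-way ANOVA F statistic of one feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = mean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the samples around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    if ss_within == 0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_features(rows, labels, names):
    """Sort feature names by decreasing F score."""
    scores = {
        name: anova_f_score([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy data (assumption): "count" separates the classes, "duration" does not.
rows = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (10.0, 4.0), (11.0, 5.0), (12.0, 3.0)]
labels = ["normal", "normal", "normal", "dos", "dos", "dos"]
ranking = rank_features(rows, labels, ["count", "duration"])
print(ranking)
```

Features with a high F score differ strongly between classes relative to their spread within a class, so they would be kept for the tree-training step.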

10.2196/18910 ◽  
2020 ◽  
Vol 8 (7) ◽  
pp. e18910
Author(s):  
Debbie Rankin ◽  
Michaela Black ◽  
Raymond Bond ◽  
Jonathan Wallace ◽  
Maurice Mulvenna ◽  
...  

Background The exploitation of synthetic data in health care is at an early stage. Synthetic data could unlock the potential within health care datasets that are too sensitive for release. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalizability are scarce. Objective This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data. Methods A total of 19 open health datasets were selected for experimental work. Synthetic data were generated using three synthetic data generators that apply classification and regression trees, parametric, and Bayesian network approaches. Real and synthetic data were used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest, and support vector machine. Models were tested only on real data to determine whether a model developed by training on synthetic data can be used to accurately classify new, real examples. The impact of statistical disclosure control on model performance was also assessed. Results A total of 92% of models trained on synthetic data have lower accuracy than those trained on real data. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 0.177 (18%) to 0.193 (19%), while other models have lower deviations of 0.058 (6%) to 0.072 (7%). The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26% (5/19) of cases for classification and regression tree and parametric synthetic data and in 21% (4/19) of cases for Bayesian network-generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 95% (18/19) of cases. This is not the case for models trained on synthetic data. 
When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 74% (14/19), 53% (10/19), and 68% (13/19) of cases for classification and regression tree, parametric, and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility. Conclusions The results of this study are promising with small decreases in accuracy observed in models trained with synthetic data compared with models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data, and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of their robustness. Synthetic data must ensure individual privacy and data utility are preserved in order to instill confidence in health care departments when using such data to inform policy decision-making.
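The evaluation protocol in this abstract (train once on real data, once on synthetic data, and test both on real data) can be sketched as below. This is only an illustration under stated assumptions: the classifier is a 1-nearest-neighbor rule standing in for the k-nearest neighbors model, every data point is invented, and the "synthetic" set is hand-written rather than produced by any of the generators the study used.

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Label of the training point closest to x (squared Euclidean distance)."""
    closest = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    return train_y[closest]

def accuracy(train_X, train_y, test_X, test_y):
    """Fraction of test points classified correctly by the 1-NN rule."""
    hits = sum(
        nearest_neighbor_predict(train_X, train_y, x) == y
        for x, y in zip(test_X, test_y)
    )
    return hits / len(test_y)

# Toy one-dimensional data (assumption, not taken from the study).
real_train_X, real_train_y = [(0.0,), (1.0,), (4.0,), (5.0,)], [0, 0, 1, 1]
synth_train_X, synth_train_y = [(0.5,), (3.0,), (3.5,), (5.5,)], [0, 0, 1, 1]
real_test_X, real_test_y = [(0.2,), (2.0,), (2.8,), (4.5,)], [0, 0, 1, 1]

acc_real = accuracy(real_train_X, real_train_y, real_test_X, real_test_y)
acc_synth = accuracy(synth_train_X, synth_train_y, real_test_X, real_test_y)
deviation = acc_real - acc_synth  # the quantity the abstract reports per model
print(acc_real, acc_synth, deviation)
```

The deviation is simply the gap between the two accuracies on the same real test set, which is how figures such as 0.177 or 0.058 in the abstract should be read.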


Coatings ◽  
2021 ◽  
Vol 11 (12) ◽  
pp. 1529
Author(s):  
Raj Kumar Arya ◽  
Jyoti Sharma ◽  
Rahul Shrivastava ◽  
Devyani Thapliyal ◽  
George D. Verros

In this work, a machine learning technique based on a regression tree model was used to model the surfactant-enhanced drying of poly(styrene)-p-xylene coatings. The predictions of the developed model based on regression trees are in excellent agreement with the experimental data. A total of 16,258 samples were obtained through experimentation. These samples were separated into two parts: 12,960 samples were used for the training of the regression tree, and the remaining 3,298 samples were used to test the tree’s prediction accuracy. MATLAB software was used to grow the regression tree. The mean squared error between the model-predicted values and actual outputs was calculated to be 8.8415 × 10⁻⁶. This model has good generalizing ability; predicts weight loss for given values of time, thickness, and triphenyl phosphate; and has a maximum error of 1%. It is robust and, for this system, can be used for any composition and thickness, which will drastically reduce the need for further experimentation to explain diffusion and drying.
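To make the regression-tree idea concrete, the sketch below grows a single split (a one-level tree) by minimising the squared error and then scores it with the mean squared error on held-out points. It is a Python illustration only: the study used MATLAB and a full tree, and the time and weight-loss values here are invented toy numbers.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the lowest squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left) + sum(
            (y - right_mean) ** 2 for y in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict(split, x):
    """Predict with a one-level tree: the mean of the leaf that x falls into."""
    threshold, left_mean, right_mean = split
    return left_mean if x <= threshold else right_mean

# Toy training and test data (assumption): drying time vs. weight loss.
train_time = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_loss = [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]
test_time, test_loss = [2.5, 11.5], [0.1, 0.9]

split = best_split(train_time, train_loss)
test_mse = mse(test_loss, [predict(split, t) for t in test_time])
print(split, test_mse)
```

A full regression tree repeats this split search recursively inside each leaf; the held-out MSE plays the same role as the test error reported in the abstract.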


2021 ◽  
Vol 10 (6) ◽  
pp. 3794-3801
Author(s):  
Yusuf Aliyu Adamu

Malaria is a life-threatening disease that causes deaths globally; its early prediction is necessary for preventing rapid transmission. In this work, an enhanced ensemble learning approach for predicting malaria outbreaks is suggested. Using a mean-based splitting strategy, the dataset is randomly partitioned into smaller groups. The splits are then modelled using a classification and regression tree, and an accuracy-based weighted aging classifier ensemble is used to construct a homogeneous ensemble from the several classification and regression tree models. This approach ensures higher performance is achieved. Seven different algorithms were tested, along with one ensemble method that combines all seven classifiers. The accuracy, precision, and sensitivity achieved by the proposed method are 93%, 92%, and 100%, respectively, which outperforms the machine learning classifiers and the ensemble method used in this research. The correlation between the variables used is established, along with how each factor contributes to malaria incidence. The result indicates that malaria outbreaks can be predicted successfully using the suggested technique.
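The accuracy-weighting part of such an ensemble can be sketched as a weighted vote, where each member's vote counts in proportion to its measured accuracy. This is a simplification under explicit assumptions: the aging component of the method is omitted, the three member models are stub functions rather than trained trees, and the accuracies are invented.

```python
from collections import defaultdict

def weighted_vote(models, accuracies, x):
    """Return the class whose supporters have the largest summed accuracy."""
    totals = defaultdict(float)
    for model, acc in zip(models, accuracies):
        totals[model(x)] += acc
    return max(totals, key=totals.get)

# Stub members (assumption): one accurate model and two weak ones.
models = [lambda x: 1, lambda x: 0, lambda x: 0]
accuracies = [0.9, 0.4, 0.4]

prediction = weighted_vote(models, accuracies, x=None)
majority = weighted_vote(models, [1.0, 1.0, 1.0], x=None)
print(prediction, majority)
```

With accuracy weights the single strong member (0.9) outvotes the two weak ones (0.4 + 0.4), whereas an unweighted majority vote would side with the weak pair.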


2001 ◽  
Vol 23 (3-4) ◽  
pp. 153-158 ◽  
Author(s):  
Josef Smolle ◽  
Peter Kahofer

Objective: To evaluate the feasibility of the CART (Classification and Regression Tree) procedure for the recognition of microscopic structures in tissue counter analysis. Methods: Digital microscopic images of H&E stained slides of normal human skin and of primary malignant melanoma were overlaid with regularly distributed square measuring masks (elements), and grey value, texture and colour features within each mask were recorded. In the learning set, elements were interactively labeled as representing either connective tissue of the reticular dermis, other tissue components or background. Subsequently, CART models were based on these data sets. Results: Implementation of the CART classification rules into the image analysis program showed that in an independent test set 94.1% of elements classified as connective tissue of the reticular dermis were correctly labeled. Automated measurements of the total amount of tissue and of the amount of connective tissue within a slide showed high reproducibility (r = 0.97 and r = 0.94, respectively; p < 0.001). Conclusions: The CART procedure in tissue counter analysis yields simple and reproducible classification rules for tissue elements.
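CART classification rules are commonly grown by choosing, at each node, the split that most reduces Gini impurity. The sketch below shows that criterion on a handful of labeled elements; it assumes the Gini criterion (the abstract does not state which one was used), and the label lists are invented.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini impurity of the two children of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Toy element labels (assumption).
parent = ["connective", "connective", "other", "other"]
pure_split = split_gini(["connective", "connective"], ["other", "other"])
mixed_split = split_gini(["connective", "other"], ["connective", "other"])
print(gini(parent), pure_split, mixed_split)
```

A split that separates the classes cleanly drives the weighted impurity to zero, while a split that leaves both children mixed gives no improvement over the parent node.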


2021 ◽  
pp. 1-18
Author(s):  
Shashikant Rathod ◽  
Leena Phadke ◽  
Uttam Chaskar ◽  
Chetankumar Patil

BACKGROUND: According to the World Health Organization, one in ten adults will have Type 2 Diabetes Mellitus (T2DM) in the next few years. Autonomic dysfunction is one of the significant complications of T2DM. Autonomic dysfunction is usually assessed by the standard Ewing’s test and resting Heart Rate Variability (HRV) indices. OBJECTIVE: Resting HRV has limited use in screening due to its large intra- and inter-individual variations. Therefore, a combined approach of resting and orthostatic challenge HRV measurement with a machine learning technique was used in the present study. METHODS: A total of 213 subjects of both genders between 20 and 70 years of age participated in this study from March 2018 to December 2019 at Smt. Kashibai Navale Medical College and General Hospital (SKNMCGH) in Pune, India. The volunteers were categorized according to their glycemic status as control (n = 51, euglycemic) and T2DM (n = 162). Short-term ECG signals at rest and after an orthostatic challenge were recorded. The HRV indices were extracted from the ECG signal as per HRV-Taskforce guidelines. RESULTS: We observed a significant difference in time, frequency, and non-linear resting HRV indices between the control and T2DM groups. A blunted autonomic response to an orthostatic challenge, quantified by percentage difference, was observed in T2DM compared to the control group. HRV patterns during rest and the orthostatic challenge were extracted by various machine learning algorithms. The classification and regression tree (CART) model showed the best performance among all the machine learning algorithms. It showed an accuracy of 84.04%, a sensitivity of 89.51%, and a specificity of 66.67%, with an Area Under Receiver Operating Characteristic Curve (AUC) of 0.78, compared to resting HRV alone with 75.12% accuracy, 86.42% sensitivity, and 39.22% specificity, with an AUC of 0.63, for differentiating autonomic dysfunction in non-diabetic control and T2DM. 
CONCLUSION: It was possible to develop a Classification and Regression Tree (CART) model to detect autonomic dysfunction. The technique of percentage difference between resting and orthostatic challenge HRV indicates the blunted autonomic response. The developed CART model can differentiate the autonomic dysfunction using both resting and orthostatic challenge HRV data compared to only resting HRV data in T2DM. Thus, monitoring HRV parameters using the CART model during rest and after orthostatic challenge may be a better alternative to detect autonomic dysfunction in T2DM as against only resting HRV.
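The reported accuracy, sensitivity, and specificity can be reproduced from a confusion matrix. In the sketch below the group sizes (162 T2DM, 51 control) come from the abstract, but the individual cell counts are back-calculated from the reported percentages and are an assumption, since the abstract does not list them.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity (in percent) from cell counts."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "sensitivity": 100 * tp / (tp + fn),  # share of T2DM cases detected
        "specificity": 100 * tn / (tn + fp),  # share of controls recognised
    }

# Inferred counts (assumption): T2DM is the positive class.
combined = confusion_metrics(tp=145, fn=17, tn=34, fp=17)
resting_only = confusion_metrics(tp=140, fn=22, tn=20, fp=31)

for name, result in (("rest + orthostatic", combined), ("resting only", resting_only)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Read this way, most of the gain from adding the orthostatic challenge lies in specificity: under these inferred counts, 34 of 51 controls are correctly recognised instead of 20.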


Author(s):  
Pardomuan Robinson Sihombing ◽  
Istiqomatul Fajriyah Yuliati

This study examines the application of several machine learning methods, taking imbalanced data into account, in classification modelling to determine the risk of low birth weight (LBW; BBLR in Indonesian) births, with the expectation that it can offer a solution for reducing LBW births in Indonesia. The machine learning methods used are Classification and Regression Tree (CART), Naïve Bayes, Random Forest, and Support Vector Machine (SVM). Classification modelling with a resampling technique on imbalanced data and large data sets is shown to improve classification accuracy, particularly for the minority class, as seen from the higher sensitivity values compared with the original data (without treatment). Furthermore, of the five classification models tested, the random forest model gives the best performance, with the highest sensitivity, specificity, G-mean, and AUC values. The most important and most influential variables in classifying the risk of LBW are birth spacing and birth order, antenatal check-ups, and maternal age.
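One simple form of resampling for imbalanced data, together with the G-mean metric named in the abstract, can be sketched as follows. Random oversampling of the minority class is an assumption chosen for illustration (the abstract does not say which resampling technique was applied), and the rows and labels are invented.

```python
import random
from collections import Counter
from math import sqrt

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

def g_mean(sensitivity, specificity):
    """Geometric mean of sensitivity and specificity."""
    return sqrt(sensitivity * specificity)

# Toy data (assumption): five normal-weight births and one LBW birth.
rows = [(3.1,), (3.4,), (2.9,), (3.3,), (3.0,), (2.1,)]
labels = ["normal", "normal", "normal", "normal", "normal", "lbw"]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels), g_mean(0.9, 0.4))
```

The G-mean is low whenever either class is poorly recognised, which is why it suits imbalanced problems better than plain accuracy.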


2019 ◽  
Vol 32 (3) ◽  
pp. 403-416
Author(s):  
Lazar Sladojevic ◽  
Aleksandar Janjic

This paper presents an approach for the estimation and forecasting of losses in a distribution power grid from data that are normally collected by the grid operator. The proposed approach utilizes the least squares optimization method in order to calculate the coefficients needed for the estimation of losses. Besides optimization, a machine learning technique is introduced for clustering the coefficients into several seasons. Because the electrical energy injected into the distribution grid is measured every fifteen minutes, the amount of data used in the calculations is very large, so this approach is classified as big data analysis. The data sets used are available in the Serbian distribution grid operator's report for the year 2017. The obtained results are fairly accurate and can be used for loss classification as well as future loss estimation.
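The least squares step can be illustrated with a closed-form straight-line fit. The abstract does not state the loss model, so the linear form (loss = intercept + slope × injected energy) is purely an assumption, as are the toy measurements and the function name `fit_line`.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Toy measurements (assumption): injected energy and losses in arbitrary units.
injected = [100.0, 200.0, 300.0, 400.0]
losses = [7.0, 12.0, 17.0, 22.0]

intercept, slope = fit_line(injected, losses)
forecast = intercept + slope * 500.0  # estimated loss for a new energy value
print(intercept, slope, forecast)
```

Fitting coefficients separately on different time windows would yield the sets of coefficients that the paper then clusters into seasons.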


2021 ◽  
Author(s):  
Monalisha Pattnaik ◽  
Aryan Pattnaik

COVID-19 was declared a public health emergency of global concern by the World Health Organisation (WHO), affecting a total of 201 countries across the globe during the period December 2019 to January 2021. As of January 25, 2021, it had caused a pandemic outbreak with more than 99 million confirmed cases and more than 2 million deaths worldwide. The crux of this paper is to estimate the global risk, in terms of the case fatality rate (CFR), of the COVID-19 pandemic for seventy deeply affected countries. An optimal regression tree algorithm, a machine learning technique, is applied; it identified four significant features out of fifteen input features: diabetes prevalence, total number of deaths in thousands, total number of confirmed cases in thousands, and hospital beds per 1000. This real-time estimation will provide deep insights into the early detection of CFR for the countries under study.
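The case fatality rate used as the target here is the ratio of deaths to confirmed cases. The sketch below applies that definition to the rounded worldwide figures quoted in the abstract; because both are given only as "more than" values, the result is a rough approximation and not a figure from the paper.

```python
def case_fatality_rate(deaths, confirmed_cases):
    """CFR in percent: deaths divided by confirmed cases."""
    if confirmed_cases <= 0:
        raise ValueError("confirmed_cases must be positive")
    return 100 * deaths / confirmed_cases

# Rounded worldwide figures as of January 25, 2021 (approximate).
global_cfr = case_fatality_rate(deaths=2_000_000, confirmed_cases=99_000_000)
print(f"{global_cfr:.2f}%")
```

The regression tree in the study predicts this quantity per country from features such as diabetes prevalence and hospital beds per 1000.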

