Innovative Platform for Designing Hybrid Collaborative & Context-Aware Data Mining Scenarios

Mathematics ◽  
2020 ◽  
Vol 8 (5) ◽  
pp. 684 ◽  
Author(s):  
Anca Avram ◽  
Oliviu Matei ◽  
Camelia Pintea ◽  
Carmen Anton

The process of knowledge discovery nowadays involves a large number of techniques. Context-Aware Data Mining (CADM) and Collaborative Data Mining (CDM) are among the most recent ones. The current research proposes a new hybrid and efficient tool to design prediction models, called Scenarios Platform-Collaborative & Context-Aware Data Mining (SP-CCADM). Both the CADM and CDM approaches are included in the new platform in a flexible manner; SP-CCADM allows the setting and testing of multiple configurable data mining scenarios at once. The introduced platform was successfully tested and validated on real-life scenarios, providing better results than each standalone technique, CADM and CDM. Moreover, SP-CCADM was validated with various machine learning algorithms: k-Nearest Neighbour (k-NN), Deep Learning (DL), Gradient Boosted Trees (GBT) and Decision Trees (DT). SP-CCADM makes a step forward in confronting complex data, properly handling data contexts and collaboration between data sources. Numerical experiments and statistics illustrate in detail the potential of the proposed platform.
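SP-CCADM itself is not publicly documented here, but the core idea of configuring several learner "scenarios" and benchmarking them at once can be sketched with scikit-learn (the scenario names and synthetic data are hypothetical illustrations, not the platform's actual API; the DL scenario is omitted for brevity):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Stand-in data for a real-life scenario
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# One configurable scenario per algorithm family named in the abstract
scenarios = {
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "GBT": GradientBoostingClassifier(random_state=0),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Evaluate every scenario at once with 5-fold cross-validation
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in scenarios.items()}
```

Running all scenarios against the same folds keeps the comparison fair, which is the point of testing multiple configurations in one pass.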

Author(s):  
Yingjun Shen ◽  
Zhe Song ◽  
Andrew Kusiak

Abstract Wind farms need prediction models for predictive maintenance, including the ability to predict values of non-observable parameters beyond the ranges reflected in available data. A prediction model developed for one machine may not perform well on another, similar machine, usually because data-driven models lack generalizability. To increase the generalizability of predictive models, this research integrates data mining with first-principle knowledge. Physics-based principles are combined with machine learning algorithms through feature engineering, strong rules and divide-and-conquer. The proposed synergy concept is illustrated with wind turbine blade icing prediction and achieves significant prediction accuracy across different turbines. The proposed process is widely accepted by wind energy predictive maintenance practitioners because of its simplicity and efficiency. Furthermore, the testing scores of the KNN, CART and DNN algorithms increase by 44.78%, 32.72% and 9.13%, respectively, with the proposed process. We demonstrate the importance of embedding physical principles within the machine learning process, and also highlight that the need for more complex machine learning algorithms in industrial big data mining is often much less than in other applications, making it essential to incorporate physics and follow a "Less is More" philosophy.
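The combination of strong physical rules with a data-driven score can be sketched as follows; all thresholds and variable names are hypothetical illustrations, not the paper's actual rules:

```python
def predict_icing(temp_c, rel_humidity, power_deficit, ml_score):
    """Blend physics-based strong rules with a machine learning score.

    temp_c: ambient temperature in Celsius
    rel_humidity: relative humidity in [0, 1]
    power_deficit: relative shortfall versus the theoretical power curve
    ml_score: probability of icing from a data-driven model
    """
    # Strong rule from first principles: blade icing cannot occur above freezing
    if temp_c > 0.0:
        return 0
    # Strong rule: high humidity plus a large power-curve deficit indicates icing
    if rel_humidity > 0.9 and power_deficit > 0.3:
        return 1
    # Divide-and-conquer: defer to the ML model only in the ambiguous regime
    return 1 if ml_score > 0.5 else 0
```

The physics rules carry the clear-cut cases across turbines, so the learned model only has to generalize within the narrow ambiguous regime.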


Author(s):  
Shreekanth Jogar ◽  
Pavankumar Naik ◽  
Veeramma Vyapari ◽  
Madevi Vaddar ◽  
Kavita Dambal ◽  
...  

With the growth of big data in the biomedical and healthcare communities, accurate analysis of medical data benefits early disease detection, patient care and community services. However, analysis accuracy is reduced when medical data are incomplete. Moreover, different regions exhibit unique characteristics of certain regional diseases, which may weaken the prediction of disease outbreaks. In this paper, we streamline machine-learning algorithms for effective prediction of chronic disease outbreaks in disease-frequent communities. We evaluate the modified prediction models on real-life hospital data collected from central China in 2013-2015. To overcome the difficulty of incomplete data, we use a latent factor model to reconstruct the missing data. We experiment on a regional chronic disease, cerebral infarction. To the best of our knowledge, none of the existing work has focused on both data types in the area of medical big data analytics. Compared to several typical prediction algorithms, the prediction accuracy of our proposed algorithm reaches 94.8%, with a convergence speed faster than that of the CNN-based unimodal disease risk prediction (CNN-UDRP) algorithm.
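A latent factor model reconstructs missing entries by fitting a low-rank factorization to the observed entries only. A minimal NumPy sketch of the idea, not the paper's implementation (hyperparameters are illustrative):

```python
import numpy as np

def latent_factor_impute(X, mask, k=1, lr=0.01, epochs=3000, seed=0):
    """Fill in missing entries of X (where mask == 0) with a rank-k factorization.

    Gradient descent minimizes the squared error on observed entries only,
    then the learned factors predict the missing ones.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    for _ in range(epochs):
        E = mask * (X - U @ V.T)   # residual restricted to observed entries
        U += lr * E @ V
        V += lr * E.T @ U
    return U @ V.T

# Toy example: a rank-1 "complete" record matrix with two hidden entries
X = np.outer([1.0, 2.0, 3.0], [1.0, 0.5, 2.0, 1.5])
mask = np.ones_like(X)
mask[0, 3] = mask[2, 1] = 0        # pretend these values are missing
X_hat = latent_factor_impute(X, mask, k=1)
```

Because the observed entries pin down the low-rank structure, the hidden entries are recovered close to their true values.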


2020 ◽  
Vol 30 (2) ◽  
pp. 175-200
Author(s):  
Basma Makhlouf Shabou ◽  
Julien Tièche ◽  
Julien Knafou ◽  
Arnaud Gaudinat

Purpose This paper aims to describe interdisciplinary and innovative research conducted in Switzerland, at the Geneva School of Business Administration HES-SO and supported by the State Archives of Neuchâtel (Office des archives de l'État de Neuchâtel, OAEN). The problem to be addressed is one of the most classical ones: how to extract and discriminate relevant data in a huge amount of diversified and complex data record formats and contents. The goal of this study is to provide a framework and a proof of concept for software that helps archivists take defensible decisions on the retention and disposal of records and data proposed to the OAEN. For this purpose, the authors designed two axes: the archival axis, to propose archival metrics for the appraisal of structured and unstructured data, and the data mining axis, to propose algorithmic methods as complementary and/or additional metrics for the appraisal process.

Design/methodology/approach Based on these two axes, this exploratory study designs and tests the feasibility of archival metrics that are paired with data mining metrics, to advance the digital appraisal process in a systematic or even automatic way as far as possible. Under Axis 1, the authors took three steps: first, the design of a conceptual framework for records and data appraisal with a detailed three-dimensional approach (trustworthiness, exploitability, representativeness), together with the main principles and postulates guiding the operationalization of the conceptual dimensions. Second, the operationalization step proposed metrics expressed in terms of variables, supported by a quantitative method for their measurement and scoring. Third, the authors shared this conceptual framework, with its dimensions and operationalized variables (metrics), with experienced professionals to validate them. The experts' feedback finally gave the authors an indication of the relevance and feasibility of these metrics; these two aspects may demonstrate the acceptability of such a method in real-life archival practice. In parallel, Axis 2 proposes functionalities covering not only macro analysis of data but also the algorithmic methods that enable the computation of digital archival and data mining metrics. On this basis, three use cases were proposed to imagine plausible and illustrative scenarios for the application of such a solution.

Findings The main results demonstrate the feasibility of measuring the value of data and records with a reproducible method. More specifically, for Axis 1, the authors applied the metrics in a flexible and modular way, and also defined the main principles needed to enable a computational scoring method. The results obtained through the experts' consultation on the relevance of 42 metrics indicate an acceptance rate above 80%. In addition, the results show that 60% of all metrics can be automated. Regarding Axis 2, 33 functionalities were developed and proposed under six main types: macro analysis, microanalysis, statistics, retrieval, administration and, finally, decision modeling and machine learning. The relevance of the metrics and functionalities rests on the theoretical validity and computational character of their method. These results are largely satisfactory and promising.

Originality/value This study offers a valuable aid to improve the validity and performance of archival appraisal processes and decision-making. The transferability and applicability of these archival and data mining metrics could be considered for other types of data; an adaptation of this method and its metrics could be tested on research data, medical data or banking data.
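A computational scoring method over the three conceptual dimensions could look like the following minimal sketch; the dimension weights and per-record scores are hypothetical, not the study's actual 42 metrics:

```python
def appraisal_score(metrics, weights):
    """Aggregate per-dimension metric scores (each in 0..1) into one weighted score."""
    total_w = sum(weights.values())
    return sum(weights[d] * metrics[d] for d in weights) / total_w

# Hypothetical scores for one record, along the study's three dimensions
record = {"trustworthiness": 0.9, "exploitability": 0.6, "representativeness": 0.8}
weights = {"trustworthiness": 0.5, "exploitability": 0.25, "representativeness": 0.25}

score = appraisal_score(record, weights)
retain = score >= 0.7   # hypothetical retention threshold
```

Keeping the weights explicit is what makes such an appraisal decision reproducible and defensible: the same record and the same weights always yield the same score.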


2018 ◽  
Vol 5 (1) ◽  
pp. 47-55
Author(s):  
Florensia Unggul Damayanti

Data mining helps industries make intelligent decisions on complex problems. Data mining algorithms can be applied to data to forecast, identify patterns, derive rules and recommendations, analyze sequences in complex data sets and retrieve fresh insights. The growth of technology and the variety of available data mining techniques give industries the opportunity to explore and gain valuable information from their data and to use that information to support business decision making. This paper implements classification data mining to retrieve knowledge from customer databases, supporting the marketing department in planning strategies to predict plan premium. The dataset is decomposed through conceptual analysis to identify the characteristics that can be used as input parameters of the data mining model. Business decision and application is characterized by processing step, processing characteristic and processing outcome (Seng, J.L., Chen T.C. 2010). The paper sets up data mining experiments based on the J48 and Random Forest classifiers and sheds light on the performance difference between J48 and Random Forest in the context of an insurance industry dataset. The experimental results cover the classification accuracy and efficiency of J48 and Random Forest, and also identify the attributes most useful for predicting plan premium in the context of strategic planning to support business strategy.
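J48 is Weka's implementation of C4.5; in scikit-learn a close analogue is a decision tree with the entropy criterion. A minimal sketch of the paper's comparison on hypothetical stand-in data (not the actual insurance dataset), including the most-predictive-attribute question via feature importances:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the insurance customer data
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Entropy-based tree as a J48 (C4.5) analogue, versus a Random Forest
j48_like = DecisionTreeClassifier(criterion="entropy", random_state=1).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

acc = {"J48-like": j48_like.score(X_te, y_te),
       "RandomForest": rf.score(X_te, y_te)}
top_attribute = int(np.argmax(rf.feature_importances_))  # most predictive attribute
```

The held-out accuracy comparison mirrors the paper's evaluation, and `feature_importances_` gives the ranking of attributes that the study uses to pick a premium predictor.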


2018 ◽  
Author(s):  
Liyan Pan ◽  
Guangjian Liu ◽  
Xiaojian Mao ◽  
Huixian Li ◽  
Jiexin Zhang ◽  
...  

BACKGROUND Central precocious puberty (CPP) in girls seriously affects their physical and mental development in childhood. The method of diagnosis, the gonadotropin-releasing hormone (GnRH) stimulation test or GnRH analogue (GnRHa) stimulation test, is expensive and makes patients uncomfortable due to the need for repeated blood sampling. OBJECTIVE We aimed to combine multiple CPP-related features and construct machine learning models to predict response to the GnRHa-stimulation test. METHODS In this retrospective study, we analyzed the clinical and laboratory data of 1757 girls who underwent a GnRHa test in order to develop XGBoost and random forest classifiers for predicting response to the GnRHa test. The local interpretable model-agnostic explanations (LIME) algorithm was used with the black-box classifiers to increase their interpretability. We measured the sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the models. RESULTS Both the XGBoost and random forest models achieved good performance in distinguishing between positive and negative responses, with the AUC ranging from 0.88 to 0.90, sensitivity ranging from 77.91% to 77.94%, and specificity ranging from 84.32% to 87.66%. Basal serum luteinizing hormone, follicle-stimulating hormone, and insulin-like growth factor-I levels were found to be the three most important factors. In the interpretable LIME models, the abovementioned variables made high contributions to the prediction probability. CONCLUSIONS The prediction models we developed can help diagnose CPP and may be used as a prescreening tool before the GnRHa-stimulation test.
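The sensitivity and specificity figures reported above come from each classifier's confusion matrix; as a reminder, they can be computed directly (the labels here are a small made-up example):

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# 3 of 4 positive responses detected, 4 of 5 negatives correctly rejected
sens, spec = sensitivity_specificity([1, 1, 1, 1, 0, 0, 0, 0, 0],
                                     [1, 1, 1, 0, 0, 0, 0, 0, 1])
```

For a prescreening tool, sensitivity matters most: a missed positive means a child who needed the confirmatory GnRHa test is not sent for it.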


Author(s):  
Krzysztof Jurczuk ◽  
Marcin Czajkowski ◽  
Marek Kretowski

Abstract This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and tests simultaneously, and thus in many situations improves the prediction and size of the resulting classifiers. However, this population-based, iterative approach can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It incorporates knowledge of global DT induction and evolutionary algorithm parallelization, together with efficient utilization of GPU memory and computing resources. The search for the tree structure and tests is performed on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets, and in both cases the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. When the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.
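The data-parallel decomposition can be illustrated in miniature: the dataset is split into chunks, a partial error count is computed per chunk (in the paper, on a GPU), and the partial results are reduced into one fitness value. This sketch replaces CUDA kernels with plain Python for clarity; the tree encoding is hypothetical:

```python
def tree_predict(tree, x):
    """Tiny decision tree as nested tuples: (feature, threshold, left, right) or a leaf label."""
    while isinstance(tree, tuple):
        feature, threshold, left, right = tree
        tree = left if x[feature] <= threshold else right
    return tree

def fitness_chunk(tree, chunk):
    """Per-chunk misclassification count: the unit of work shipped to one GPU."""
    return sum(1 for x, y in chunk if tree_predict(tree, x) != y)

def fitness(tree, data, n_chunks=4):
    """Data-parallel fitness: decompose, evaluate chunks, reduce the partial counts."""
    size = (len(data) + n_chunks - 1) // n_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    errors = sum(fitness_chunk(tree, c) for c in chunks)
    return 1.0 - errors / len(data)

tree = (0, 0.5, 0, 1)   # predict 0 if x[0] <= 0.5, else 1
data = [([0.2], 0), ([0.8], 1), ([0.4], 1), ([0.9], 1)]
acc = fitness(tree, data)
```

Because chunk evaluations are independent, the reduction scales with the number of devices, which matches the near-linear multi-GPU scalability reported above.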


Author(s):  
Cheng-Chien Lai ◽  
Wei-Hsin Huang ◽  
Betty Chia-Chen Chang ◽  
Lee-Ching Hwang

Predictors of success in smoking cessation have been studied, but a prediction model capable of providing a success rate for each patient attempting to quit smoking is still lacking. The aim of this study was to develop prediction models using machine learning algorithms to predict the outcome of smoking cessation. Data were acquired from patients who underwent a smoking cessation program at one medical center in Northern Taiwan; a total of 4875 enrollments fulfilled our inclusion criteria. Models with an artificial neural network (ANN), support vector machine (SVM), random forest (RF), logistic regression (LoR), k-nearest neighbor (KNN), classification and regression tree (CART), and naïve Bayes (NB) were trained to predict the final smoking status of the patients over a six-month period. Sensitivity, specificity, accuracy, and area under the receiver operating characteristic (ROC) curve (AUC or ROC value) were used to determine the performance of the models. We adopted the ANN model, which reached slightly better performance, with a sensitivity of 0.704, a specificity of 0.567, an accuracy of 0.640, and an ROC value of 0.660 (95% confidence interval (CI): 0.617–0.702), for predicting smoking cessation outcome. A predictive model for smoking cessation was thus constructed. The model could aid in providing a predicted success rate for all smokers, and it also has the potential to enable personalized and precision medicine in the treatment of smoking cessation.
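The ROC value (AUC) used to rank the seven models equals the probability that a randomly chosen quitter receives a higher predicted success score than a randomly chosen non-quitter. A minimal rank-based computation (the scores below are a made-up example):

```python
def roc_auc(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Two positives scored 0.9 and 0.4, two negatives scored 0.6 and 0.2
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

An AUC of 0.5 means the model ranks patients no better than chance, which is why the reported 95% CI (0.617 to 0.702) excluding 0.5 matters for the ANN model's usefulness.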


2021 ◽  
Vol 3 (1) ◽  
Author(s):  
Shijun Yang ◽  
Bin Wang ◽  
Xiong Han

Abstract Although antiepileptic drugs (AEDs) are the most effective treatment for epilepsy, 30–40% of patients with epilepsy develop drug-refractory epilepsy. An accurate, preliminary prediction of the efficacy of AEDs has great clinical significance for patient treatment and prognosis. Some studies have developed statistical models and machine-learning algorithms (MLAs) to predict the efficacy of AED treatment and the progression of disease after treatment withdrawal, in order to assist clinical decision making with the aim of precise, personalized treatment. The field of prediction models based on statistical models and MLAs is attracting growing interest and is developing rapidly. Moreover, more and more studies focus on the external validation of existing models. In this review, we give a brief overview of recent developments in this discipline.


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 617
Author(s):  
Umer Saeed ◽  
Young-Doo Lee ◽  
Sana Ullah Jan ◽  
Insoo Koo

Sensors’ role as a key component of cyber-physical systems makes them susceptible to failures due to complex environments, low-quality production, and aging. When defective, sensors either stop communicating or convey incorrect information. These unsteady situations threaten the safety, economy, and reliability of a system. The objective of this study is to construct a lightweight machine learning-based fault detection and diagnostic system within the limited energy resources, memory, and computation of a Wireless Sensor Network (WSN). In this paper, a Context-Aware Fault Diagnostic (CAFD) scheme is proposed based on an ensemble learning algorithm called Extra-Trees. To evaluate the performance of the proposed scheme, a realistic WSN scenario composed of humidity and temperature sensor observations is replicated with extremely low-intensity faults. Six commonly occurring types of sensor fault are considered: drift, hard-over/bias, spike, erratic/precision degradation, stuck, and data-loss. The proposed CAFD scheme reveals the ability to accurately detect and diagnose low-intensity sensor faults in a timely manner. Moreover, the efficiency of the Extra-Trees algorithm in terms of diagnostic accuracy, F1-score, ROC-AUC, and training time is demonstrated by comparison with cutting-edge machine learning algorithms: a Support Vector Machine and a Neural Network.
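The fault diagnosis pipeline can be sketched with scikit-learn's `ExtraTreesClassifier`: inject a few of the abstract's fault types into a simulated sensor signal, extract simple window features, and classify. The fault magnitudes, features, and window length are hypothetical illustrations, not the paper's setup:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_window(fault):
    """Simulate one 64-sample sensor window and inject a fault (magnitudes hypothetical)."""
    x = np.sin(np.linspace(0, 4 * np.pi, 64)) + rng.normal(0, 0.05, 64)
    if fault == "bias":
        x += 5.0                        # hard-over/bias: constant offset
    elif fault == "drift":
        x += np.linspace(0, 4.0, 64)    # drift: slowly growing offset
    elif fault == "spike":
        x[rng.integers(64)] += 8.0      # spike: single large outlier
    # Lightweight window features suitable for a constrained WSN node
    return [x.mean(), x.std(), np.abs(np.diff(x)).max()]

faults = ["normal", "bias", "drift", "spike"]
X, y = [], []
for label, fault in enumerate(faults):
    for _ in range(60):
        X.append(make_window(fault))
        y.append(label)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)
clf = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
```

Compact statistical features plus a fast tree ensemble keep both training time and inference cost low, which is the motivation for Extra-Trees in an energy-constrained WSN.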

