Data mining for microrna gene prediction: On the impact of class imbalance and feature number for microrna gene prediction

Abstract: This paper concerns the evolutionary induction of decision trees (DT) for large-scale data. Such a global approach is one of the alternatives to top-down inducers. It searches for the tree structure and the tests simultaneously, and thus improves the prediction quality and size of the resulting classifiers in many situations. However, it is a population-based, iterative approach that can be too computationally demanding to apply directly to big data mining. The paper demonstrates that this barrier can be overcome by smart distributed/parallel processing. Moreover, we ask whether the global approach can truly compete with greedy systems on large-scale data. For this purpose, we propose a novel multi-GPU approach. It combines knowledge of global DT induction and evolutionary algorithm parallelization with efficient use of GPU memory and computing resources. The searches for the tree structure and tests are performed simultaneously on a CPU, while the fitness calculations are delegated to GPUs. A data-parallel decomposition strategy and the CUDA framework are applied. Experimental validation is performed on both artificial and real-life datasets. In both cases, the obtained acceleration is very satisfactory. The solution is able to process even billions of instances in a few hours on a single workstation equipped with 4 GPUs. The impact of data characteristics (size and dimension) on the convergence and speedup of the evolutionary search is also shown. As the number of GPUs grows, nearly linear scalability is observed, which suggests that data size boundaries for evolutionary DT mining are fading.

Scalable kernel-based SVM classification algorithm on imbalance air quality data for proficient healthcare

Complex & Intelligent Systems ◽

10.1007/s40747-021-00435-5 ◽

2021 ◽

Author(s):

Shwet Ketu ◽

Pramod Kumar Mishra

Keyword(s):

Air Pollution ◽

Air Quality ◽

Class Imbalance ◽

Imbalanced Data ◽

Classification Algorithm ◽

Quality Data ◽

Pollution Level ◽

Classification Problems ◽

Chi Square ◽

The Impact

Abstract: In the last decade, we have seen drastic changes in the air pollution level, which has become a critical environmental issue. It should be handled carefully in order to develop solutions for proficient healthcare. Reducing the impact of air pollution on human health is possible only if the data is correctly classified. Many classification problems suffer from class imbalance. Learning from imbalanced data is a challenging task, and researchers have developed possible solutions over time. In this paper, we focus on handling the imbalanced class distribution in such a way that the classification algorithm does not compromise its performance. The proposed algorithm is based on the adjusting kernel scaling (AKS) method for multi-class imbalanced datasets. The selection of the kernel function has been evaluated with weighting criteria and the chi-square test. All experimental evaluation has been performed on the sensor-based Indian Central Pollution Control Board (CPCB) dataset. The proposed algorithm achieves the highest accuracy, 99.66%, among all the classification algorithms compared, i.e. Adaboost (59.72%), Multi-Layer Perceptron (95.71%), GaussianNB (80.87%), and SVM (96.92%). The results of the proposed algorithm are also better than those of existing methods in the literature. These results also show that the proposed algorithm handles class imbalance efficiently while delivering enhanced performance. Thus, accurate classification of air quality with the proposed algorithm will be useful for improving existing preventive policies and will also help to strengthen emergency response in the worst pollution situations.

Automatic Deep Learning-Based Consolidation/Collapse Classification in Lung Ultrasound Images for COVID-19 Induced Pneumonia

10.36227/techrxiv.17912387 ◽

2022 ◽

Author(s):

Nabeel Durrani ◽

Damjan Vukovic ◽

Maria Antico ◽

Jeroen van der Burgt ◽

Ruud JG van Sloun ◽

...

Keyword(s):

Deep Learning ◽

Class Imbalance ◽

Ultrasound Images ◽

Pleural Effusions ◽

Data Set ◽

Label Noise ◽

Set Size ◽

Video Frames ◽

Two Factors ◽

The Impact

Our automated deep learning-based approach identifies consolidation/collapse in LUS images to aid in the diagnosis of late stages of COVID-19 induced pneumonia, where consolidation/collapse is one of the possible associated pathologies. A common challenge in training such models is that annotating each frame of an ultrasound video requires high labelling effort. This effort in practice becomes prohibitive for large ultrasound datasets. To understand the impact of various degrees of labelling precision, we compare labelling strategies to train fully supervised models (frame-based method, higher labelling effort) and inaccurately supervised models (video-based methods, lower labelling effort), both of which yield binary predictions for LUS videos on a frame-by-frame level. We moreover introduce a novel sampled quaternary method which randomly samples only 10% of the LUS video frames and subsequently assigns (ordinal) categorical labels to all frames in the video based on the fraction of positively annotated samples. This method outperformed the inaccurately supervised video-based method of our previous work on pleural effusions. More surprisingly, this method outperformed the supervised frame-based approach with respect to metrics such as precision-recall area under curve (PR-AUC) and F1 score that are suitable for the class imbalance scenario of our dataset despite being a form of inaccurate learning. This may be due to the combination of a significantly smaller data set size compared to our previous work and the higher complexity of consolidation/collapse compared to pleural effusion, two factors which contribute to label noise and overfitting; specifically, we argue that our video-based method is more robust with respect to label noise and mitigates overfitting in a manner similar to label smoothing.
Using clinical expert feedback, separate criteria were developed to exclude data from the training and test sets respectively for our ten-fold cross-validation results, which resulted in a PR-AUC score of 73% and an accuracy of 89%. While the efficacy of our classifier using the sampled quaternary method must be verified on a larger consolidation/collapse dataset, when considering the complexity of the pathology, our proposed classifier using the sampled quaternary video-based method is clinically comparable with trained experts and improves over the video-based method of our previous work on pleural effusions.
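
The sampled quaternary labelling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `sampled_quaternary_label`, the equal-width bin boundaries used to turn the positive fraction into one of four ordinal categories, and the handling of very short videos are all assumptions made here for illustration, since the abstract does not state them.

```python
import random


def sampled_quaternary_label(frame_labels, sample_fraction=0.10, seed=0):
    """Assign one ordinal category (0..3) to every frame of a video.

    frame_labels holds one 0/1 annotation per frame. Only the randomly
    sampled indices are read, which mimics annotating 10% of the frames.
    """
    rng = random.Random(seed)
    n_frames = len(frame_labels)
    # Sample at least one frame so that very short videos still get a label.
    n_sample = max(1, round(sample_fraction * n_frames))
    sampled = rng.sample(range(n_frames), n_sample)
    positive_fraction = sum(frame_labels[i] for i in sampled) / n_sample
    # ASSUMPTION: four equal-width bins over [0, 1]; the abstract does not
    # give the real thresholds.
    category = min(3, int(positive_fraction * 4))
    # Every frame in the video inherits the video-level category.
    return category, [category] * n_frames


# Hypothetical videos of 100 frames each.
all_positive = [1] * 100
all_negative = [0] * 100
cat_pos, labels_pos = sampled_quaternary_label(all_positive)
cat_neg, labels_neg = sampled_quaternary_label(all_negative)
```

Because each frame receives a graded video-level category rather than a hard per-frame label, the targets behave somewhat like softened labels, which is consistent with the abstract's comparison to label smoothing.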

Student Academic Performance Prediction using Supervised Learning Techniques

International Journal of Emerging Technologies in Learning (iJET) ◽

10.3991/ijet.v14i14.10310 ◽

2019 ◽

Vol 14 (14) ◽

pp. 92 ◽

Cited By ~ 1

Author(s):

Muhammad Imran ◽

Shahzad Latif ◽

Danish Mehmood ◽

Muhammad Saqlain Shah

Keyword(s):

Data Mining ◽

Supervised Learning ◽

Student Performance ◽

Performance Prediction ◽

Class Imbalance ◽

Ensemble Methods ◽

Fine Tuning ◽

Classification Error ◽

Decision Tree Classifier ◽

Tree Classifier

Automatic student performance prediction is a crucial task because of the large volume of data in educational databases. This task is addressed by educational data mining (EDM). EDM develops methods for exploring data derived from educational environments. These methods are used to understand students and their learning environment. Educational institutions often want to know how many students will pass or fail so that they can make the necessary arrangements. Previous studies show that many researchers concentrate on selecting an appropriate classification algorithm and ignore problems that arise during the data mining phases, such as high data dimensionality, class imbalance, and classification error. Such problems reduce the accuracy of the model. Several well-known classification algorithms have been applied in this domain; this paper proposes a student performance prediction model based on a supervised learning decision tree classifier. In addition, an ensemble method is applied to improve the performance of the classifier. Ensemble methods are designed to solve classification and prediction problems. This study demonstrates the importance of data preprocessing and algorithm fine-tuning for resolving data quality issues. The experimental dataset used in this work comes from the Alentejo region of Portugal and was obtained from the UCI Machine Learning Repository. Three supervised learning algorithms (J48, NNge, and MLP) are employed in the experiments. The results show that J48 achieved the highest accuracy, 95.78%.

Association rule hiding using integer linear programming

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i4.pp3451-3458 ◽

2021 ◽

Vol 11 (4) ◽

pp. 3451

Author(s):

Suma B. ◽

Shobha G.

Keyword(s):

Data Mining ◽

Linear Programming ◽

Integer Linear Programming ◽

Association Rule ◽

Real Life ◽

Linear Program ◽

Integer Linear Program ◽

Optimization Approach ◽

Integer Linear Program Formulation ◽

The Impact

Privacy-preserving data mining has become the focus of attention of government statistical agencies and the database security research community, who are concerned with preventing privacy disclosure during data mining. Repositories of large datasets include sensitive rules that need to be concealed from unauthorized access. Hence, association rule hiding emerged as one of the powerful techniques for hiding sensitive knowledge that exists in data before it is published. In this paper, we present a constraint-based optimization approach for hiding a set of sensitive association rules, using a well-structured integer linear program formulation. The proposed approach reduces the database sanitization problem to an instance of the integer linear programming problem. The solution of the integer linear program determines the transactions that need to be sanitized in order to conceal the sensitive rules while minimizing the impact of sanitization on the non-sensitive rules. We also present a heuristic sanitization algorithm that performs hiding by reducing the support or the confidence of the sensitive rules. The results of the experimental evaluation of the proposed approach on real-life datasets indicate the promising performance of the approach in terms of side effects on the original database.
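
To make the notions of support and confidence concrete, the sketch below computes both for a rule over a toy transaction database and shows how removing one item from a supporting transaction lowers them. It only illustrates the quantities a sanitization procedure manipulates; it is not the integer linear program or the heuristic from the paper, and the transactions, item names, and helper functions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    sup_antecedent = support(transactions, antecedent)
    if sup_antecedent == 0:
        return 0.0
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / sup_antecedent


# Hypothetical database; the sensitive rule is {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
]
before = confidence(transactions, {"bread"}, {"milk"})

# Sanitize a copy: drop "milk" from one transaction that supports the rule.
sanitized = [set(t) for t in transactions]
sanitized[0].discard("milk")
after = confidence(sanitized, {"bread"}, {"milk"})
```

In this toy case the confidence of the rule drops from 2/3 to 1/3 after a single item is removed. A rule counts as hidden once its support or confidence falls below the mining threshold chosen by the data owner.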

Design and Implementation System to Measure the Impact of Diabetic Retinopathy Using Data Mining Techniques

International Journal of Innovative Research in Electronics and Communications ◽

10.20431/2349-4050.0401001 ◽

2017 ◽

Vol 4 (1) ◽

Keyword(s):

Data Mining ◽

Diabetic Retinopathy ◽

Data Mining Techniques ◽

Design And Implementation ◽

Using Data ◽

The Impact

Integration of synthetic minority oversampling technique for imbalanced class

Indonesian Journal of Electrical Engineering and Computer Science ◽

10.11591/ijeecs.v13.i1.pp102-108 ◽

2019 ◽

Vol 13 (1) ◽

pp. 102

Author(s):

Noviyanti Santoso ◽

Wahyu Wibowo ◽

Hilda Hikmawati

Keyword(s):

Machine Learning ◽

Data Mining ◽

Support Vector Machine ◽

Class Imbalance ◽

Original Data ◽

Support Vector ◽

Classification Methods ◽

Problematic Issue ◽

Imbalanced Class ◽

F Measure

In data mining, class imbalance is a problematic issue that calls for solutions. This is probably because machine learning algorithms are built on the assumption that each class contains a balanced number of instances, so when the classes are imbalanced the prediction results may be inappropriate. Solutions proposed for class imbalance include oversampling, undersampling, and the synthetic minority oversampling technique (SMOTE). Both oversampling and undersampling have disadvantages, so SMOTE is an alternative that overcomes them. Integrating SMOTE into data mining classification methods such as Naive Bayes, Support Vector Machine (SVM), and Random Forest (RF) is expected to improve accuracy. In this research, the SMOTE data gave better accuracy than the original data. Among the three classification methods used, RF gives the highest average AUC, F-measure, and G-means score.
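
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched as follows. This is a simplified illustration written with the standard library only, not the implementation used in the paper; the function name `smote`, the toy data, and the parameter values are assumptions, and the sketch expects at least two minority samples.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        # Pick one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Interpolate at a random position between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority class with four samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on the line segment between two existing minority samples, so the new data stays inside the region the minority class already occupies instead of duplicating samples exactly, as plain random oversampling would.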

An analysis on the impact of fluoride in human health (dental) using clustering data mining technique

International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME-2012) ◽

10.1109/icprime.2012.6208374 ◽

2012 ◽

Cited By ~ 7

Author(s):

T. Balasubramanian ◽

R. Umarani

Keyword(s):

Data Mining ◽

Human Health ◽

Data Mining Technique ◽

Mining Technique ◽

Clustering Data ◽

The Impact

The Impact of Knowledge Management and Data Mining on CRM in the Service Industry

Nanoelectronics, Circuits and Communication Systems - Lecture Notes in Electrical Engineering ◽

10.1007/978-981-13-0776-8_4 ◽

2018 ◽

pp. 37-52 ◽

Cited By ~ 1

Author(s):

Sanjiv Kumar Srivastava ◽

Bibhas Chandra ◽

Praveen Srivastava

Keyword(s):

Data Mining ◽

Knowledge Management ◽

Service Industry ◽

The Impact
