A Data Segmentation-Based Ensemble Classification Method for Power System Transient Stability Status Prediction with Imbalanced Data

Zhen Chen; Xiaoyan Han; Chengwei Fan; Zirun He; Xueneng Su; Shengwei Mei

doi:10.3390/app9204216

Applied Sciences ◽

10.3390/app9204216 ◽

2019 ◽

Vol 9 (20) ◽

pp. 4216 ◽

Cited By ~ 2

Author(s):

Zhen Chen ◽

Xiaoyan Han ◽

Chengwei Fan ◽

Zirun He ◽

Xueneng Su ◽

...

Keyword(s):

Transient Stability ◽

Imbalanced Data ◽

Classification Performance ◽

Ensemble Classification ◽

Segmentation Strategy ◽

Data Segmentation ◽

Unstable Set ◽

Data Problem ◽

AdaBoost Classifier ◽

Training Subset

In recent years, machine learning methods have shown great potential for real-time transient stability status prediction (TSSP). However, most existing studies overlook the imbalanced data problem in TSSP. To address this issue, this paper proposes a novel data segmentation-based ensemble classification (DSEC) method for TSSP. First, the effects of the imbalanced data problem on the decision boundary and classification performance of TSSP are investigated in detail. Then, a three-step DSEC method is presented. In the first step, a data segmentation strategy divides the stable samples into multiple non-overlapping stable subsets, ensuring that no stable subset contains more samples than the unstable set; each stable subset is then combined with the unstable set to form a training subset. In the second step, an AdaBoost classifier is built on each training subset. In the final step, the decision values from the AdaBoost classifiers are aggregated to determine the transient stability status. Experiments on the Northeast Power Coordinating Council 140-bus system indicate that the proposed approach can significantly improve the classification performance of TSSP with imbalanced data.
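
The segmentation and aggregation steps described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `segment_stable_samples` and `aggregate_decisions`, the random shuffling, and the "sum the decision values and take the sign" rule are all assumptions, since the abstract does not say how the subsets are drawn or how the decision values are combined. Training the per-subset AdaBoost classifiers is left out.

```python
import math
import random
from typing import List, Sequence, Tuple

Sample = Sequence[float]


def segment_stable_samples(
    stable: List[Sample], unstable: List[Sample], seed: int = 0
) -> List[Tuple[List[Sample], List[Sample]]]:
    """Build training subsets: each pairs one stable subset with the full unstable set."""
    if not unstable:
        raise ValueError("need at least one unstable sample")
    shuffled = list(stable)
    random.Random(seed).shuffle(shuffled)
    # Smallest subset count that keeps every stable subset <= len(unstable).
    n_subsets = max(1, math.ceil(len(shuffled) / len(unstable)))
    # Striding yields disjoint subsets whose sizes differ by at most one.
    stable_subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [(subset, list(unstable)) for subset in stable_subsets]


def aggregate_decisions(decision_values: Sequence[float]) -> int:
    """Assumed rule: sum the decision values; a positive total means unstable (1)."""
    return 1 if sum(decision_values) > 0 else 0


# Toy data: 10 stable samples against 3 unstable ones.
stable = [[float(i)] for i in range(10)]
unstable = [[100.0], [101.0], [102.0]]
training_subsets = segment_stable_samples(stable, unstable)
print([len(s) for s, _ in training_subsets])
```

Each returned pair would be handed to one base classifier, so every classifier sees all unstable samples but only a slice of the stable ones.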

Quality control of imbalanced mass spectra from isotopic labeling experiments

BMC Bioinformatics ◽

10.1186/s12859-019-3170-1 ◽

2019 ◽

Vol 20 (1) ◽

Cited By ~ 1

Author(s):

Tianjun Li ◽

Long Chen ◽

Min Gan

Keyword(s):

Quality Control ◽

Quality Assessment ◽

Mass Spectra ◽

Imbalanced Data ◽

Sampling Technique ◽

Isotopic Labeling ◽

Classification Performance ◽

Gradient Boosting ◽

Extreme Gradient Boosting ◽

Data Problem

Background: Mass spectra are usually acquired from Liquid Chromatography-Mass Spectrometry (LC-MS) analysis in isotope-labeled proteomics experiments. In such experiments, the mass profiles of labeled (heavy) and unlabeled (light) peptide pairs are represented by isotope clusters (2D or 3D) that provide valuable information about the studied biological samples under different conditions. The core task of quality control in a quantitative LC-MS experiment is to filter out low-quality peptides with questionable profiles. Classification approaches are commonly used for this problem. However, previous quality control methods often ignore or mishandle the data imbalance problem. In this study, we introduce a quality control framework based on the extreme gradient boosting machine (XGBoost) and carefully address the imbalanced data problem within it. Results: In the XGBoost-based framework, we suggest applying the synthetic minority over-sampling technique (SMOTE) to re-balance the data and using the balanced data to train the boosted trees as the classifier. The classifier is then applied to other data for peptide quality assessment. Experimental results show that the proposed framework significantly increases the reliability of peptide heavy-light ratio estimation. Conclusions: Our results indicate that this framework is a powerful method for peptide quality assessment. For feature extraction, the extracted ion chromatogram (XIC) based features contribute to the peptide quality assessment. For the imbalanced data problem, SMOTE yields much better classification performance. Finally, XGBoost is well suited to peptide quality control. Overall, the proposed framework provides reliable results for further proteomics studies.
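
The re-balancing step can be illustrated with a small standard-library sketch of the SMOTE idea: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its nearest minority neighbours. The function name `smote`, the default `k`, and the toy data are assumptions for illustration; this is not the framework from the paper, which presumably relies on an established SMOTE implementation.

```python
import random
from typing import List, Sequence


def smote(
    minority: List[Sequence[float]], n_new: int, k: int = 3, seed: int = 0
) -> List[List[float]]:
    """Generate n_new synthetic samples by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for point in smote(minority, n_new=3, k=2):
    print(point)
```

Because every synthetic point lies between two existing minority samples, the new samples stay inside the region the minority class already occupies.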

A two-stage clustering-based cold-start method for active learning

Intelligent Data Analysis ◽

10.3233/ida-205393 ◽

2021 ◽

Vol 25 (5) ◽

pp. 1169-1185

Author(s):

Deniu He ◽

Hong Yu ◽

Guoyin Wang ◽

Jie Li

Keyword(s):

Active Learning ◽

Class Imbalance ◽

Imbalanced Data ◽

Cold Start ◽

Classification Performance ◽

The Novel ◽

Two Stage ◽

Minority Class ◽

Novel Method ◽

Multiple Clusters

This paper considers the problem of initializing active learning. In particular, it studies the problem in an imbalanced data scenario, referred to as the class-imbalance active learning cold-start problem. The proposed method is a two-stage clustering-based active learning cold-start (ALCS) method. In the first stage, to separate the minority-class instances from the majority-class instances, a multi-center clustering is constructed based on a new inter-cluster tightness measure, so that the data are grouped into multiple clusters. In the second stage, the initial training instances are selected from each cluster using an adaptive mechanism for determining candidate representative instances and a clusters-cyclic instance query mechanism. Comprehensive experiments demonstrate the effectiveness of the proposed method in terms of class coverage, classification performance, and impact on active learning.
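
The abstract does not define the clusters-cyclic query mechanism, so the sketch below shows only one plausible reading: visit the clusters in round-robin order and take the next candidate from each until the labeling budget is spent. The function name `cyclic_select`, the assumption that candidates are already ranked within each cluster, and the toy identifiers are all hypothetical.

```python
from typing import Hashable, List, Sequence


def cyclic_select(
    clusters: Sequence[Sequence[Hashable]], budget: int
) -> List[Hashable]:
    """Pick up to `budget` instances by cycling over clusters, one instance per visit."""
    queues = [list(cluster) for cluster in clusters]  # copy so inputs stay untouched
    picked: List[Hashable] = []
    while len(picked) < budget and any(queues):
        for queue in queues:
            if queue and len(picked) < budget:
                picked.append(queue.pop(0))
    return picked


# Three clusters of different sizes; candidates are assumed pre-ranked.
clusters = [["a1", "a2", "a3"], ["b1"], ["c1", "c2"]]
print(cyclic_select(clusters, budget=5))  # ['a1', 'b1', 'c1', 'a2', 'c2']
```

Cycling this way means a small cluster, which may hold the minority class, is sampled as early as a large one.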

A novel multi-stage ensemble model for credit scoring based on synthetic sampling and feature transformation

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-211467 ◽

2021 ◽

pp. 1-16

Author(s):

Fang He ◽

Wenyu Zhang ◽

Zhijia Yan

Keyword(s):

Credit Scoring ◽

Imbalanced Data ◽

Transformation Method ◽

Classification Performance ◽

Ensemble Model ◽

Feature Transformation ◽

Learning Methods ◽

Scoring Model ◽

Multi Stage ◽

Credit Scoring Model

Credit scoring has become increasingly important for financial institutions. With the advancement of artificial intelligence, machine learning methods, especially ensemble learning methods, have become increasingly popular for credit scoring. However, the problems of imbalanced data distribution and underutilized feature information have not been adequately addressed. To make the credit scoring model more adaptable to imbalanced datasets, the original model-based synthetic sampling method is extended to balance the datasets by generating appropriate minority samples that alleviate class overlap. To enable the credit scoring model to extract inherent correlations from features, a new bagging-based feature transformation method is proposed, which transforms features using a tree-based algorithm and selects features using the chi-square statistic. Furthermore, a two-layer ensemble method that combines the advantages of dynamic ensemble selection and stacking is proposed to improve the classification performance of the multi-stage ensemble model. Finally, four standardized datasets and six evaluation metrics are used to evaluate the proposed ensemble model. The experimental results confirm that the proposed model is effective in improving classification performance and is superior to the benchmark models.
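
As an illustration of selecting features with the chi-square statistic, the sketch below computes the Pearson chi-square statistic between one categorical feature and the class label, then ranks features by it. The function names `chi_square` and `top_k_features` and the toy columns are assumptions; how the paper bins or transforms features before scoring is not stated in the abstract.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def chi_square(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    n = len(labels)
    if n == 0 or len(feature) != n:
        raise ValueError("feature and labels must be non-empty and equally long")
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    label_totals = Counter(labels)
    stat = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f_value, l_value)] - expected) ** 2 / expected
    return stat


def top_k_features(
    columns: Sequence[Sequence[Hashable]], labels: Sequence[Hashable], k: int
) -> List[int]:
    """Indices of the k feature columns with the largest chi-square statistic."""
    scores = [chi_square(column, labels) for column in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]


labels = [0, 0, 1, 1]
columns = [[0, 1, 0, 1], [0, 0, 1, 1]]  # column 1 mirrors the label exactly
print(top_k_features(columns, labels, k=1))  # [1]
```

A feature that is independent of the label scores zero, while a feature that determines the label scores highest.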

Ensemble Classification through Random Projections for single-cell RNA-seq data

10.1101/2020.06.24.169136 ◽

2020 ◽

Author(s):

Aristidis G. Vrahatis ◽

Sotiris Tasoulis ◽

Spiros Georgakopoulos ◽

Vassilis Plagianakos

Keyword(s):

Single Cell ◽

Random Projection ◽

Classification Performance ◽

Majority Voting ◽

Ensemble Classification ◽

High Dimensionality ◽

Computational Time ◽

Biomedical Data ◽

RNA-Seq ◽

Low Dimensional

Biomedical data are now being generated at an exponential rate, creating datasets of ultra-high dimensionality and complexity. This revolution, driven by recent advances in biotechnology, has led to big-data and data-driven computational approaches. An indicative example is the emerging single-cell RNA-sequencing (scRNA-seq) technology, which isolates and measures individual cells. Although scRNA-seq has revolutionized the biotechnology domain, the computational analysis of such data is a major challenge because of their ultra-high dimensionality and complexity. In this work, we study the properties, effectiveness and generalization of the recently proposed MRPV algorithm for single-cell RNA-seq data. MRPV is an ensemble classification technique that uses multiple ultra-low-dimensional randomly projected spaces. A given classifier determines the class of each sample in each independent space, and a majority voting scheme then selects the predominant class. We show that Random Projection ensembles offer a platform not only for analysis with low computational time but also for enhanced classification performance. The developed methodologies were applied to four real high-dimensional biomedical datasets from single-cell RNA-seq studies and compared against well-known and similar classification tools. Experimental results showed that simple tools can be combined into a computationally fast, simple, yet effective approach for single-cell RNA-seq data with ultra-high dimensionality.
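
The general idea of classifying in several randomly projected spaces and combining the results by majority vote can be sketched as below. This is not the MRPV implementation: the Gaussian projection matrices, the nearest-centroid base classifier, the function names, and the default parameters are all assumptions chosen to keep the example dependency-free.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Sequence

Vector = Sequence[float]


def project(matrix: List[List[float]], x: Vector) -> List[float]:
    """Map x into the low-dimensional space defined by the rows of matrix."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def fit_centroids(points: List[List[float]], labels: Sequence[Hashable]) -> Dict:
    """Per-class mean of the projected training points."""
    grouped: Dict = {}
    for point, label in zip(points, labels):
        grouped.setdefault(label, []).append(point)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict_by_vote(
    X_train: List[Vector], y_train: Sequence[Hashable], x: Vector,
    n_projections: int = 15, k: int = 2, seed: int = 0,
) -> Hashable:
    """Classify x in n_projections random k-dimensional spaces, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_projections):
        matrix = [[rng.gauss(0.0, 1.0) for _ in x] for _ in range(k)]
        centroids = fit_centroids([project(matrix, row) for row in X_train], y_train)
        z = project(matrix, x)
        votes.append(min(
            centroids,
            key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], z)),
        ))
    return Counter(votes).most_common(1)[0][0]


X_train = [[0.0] * 6, [0.2] * 6, [5.0] * 6, [5.2] * 6]
y_train = [0, 0, 1, 1]
print(predict_by_vote(X_train, y_train, [4.9] * 6))
```

Each base classifier works in only `k` dimensions, which is where the reduction in computational cost comes from; the vote compensates for individual projections that happen to separate the classes poorly.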

An Ensemble Classification Method for High-Dimensional Data Using Neighborhood Rough Set

Complexity ◽

10.1155/2021/8358921 ◽

2021 ◽

Vol 2021 ◽

pp. 1-12

Author(s):

Jing Zhang ◽

Guang Lu ◽

Jiaquan Li ◽

Chuanwen Li

Keyword(s):

Feature Selection ◽

Rough Set ◽

Small Sample Size ◽

High Dimensional Data ◽

Classification Performance ◽

Small Sample ◽

Ensemble Classification ◽

High Dimensional ◽

Sample Classification ◽

Neighborhood Rough Set

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging because of the high dimensionality and small sample size of microarray data. Feature selection is necessary during model construction to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough sets is proposed. Pathway knowledge is used to select feature subsets, and a rough set based on intersection neighborhood is then used to select the important features in each subset, since it can select features without redundancy and can handle numerical features directly. To improve the diversity among base classifiers and the efficiency of classification, only a subset of the base classifiers is selected. Classifiers are grouped into several clusters by k-means clustering using the proposed combination distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster is selected to build the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets show that the proposed method achieves better classification performance than existing ensemble models.
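
The Kappa statistic underlying a Kappa-based diversity measure can be computed from the predictions of two classifiers on the same samples, as sketched below. Lower kappa means the two classifiers agree less often than chance-corrected agreement would suggest, that is, they are more diverse. The function name `cohen_kappa` is an assumption, and the paper's combination distance (how kappa is mixed with accuracy) is not given in the abstract and is not reproduced here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(pred_a: Sequence[Hashable], pred_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(pred_a)
    if n == 0 or len(pred_b) != n:
        raise ValueError("prediction lists must be non-empty and equally long")
    observed = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    counts_a, counts_b = Counter(pred_a), Counter(pred_b)
    # Chance agreement from each classifier's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both classifiers always output the same single label
    return (observed - expected) / (1.0 - expected)


print(cohen_kappa([0, 1, 0, 1], [0, 1, 0, 1]))  # 1.0
print(cohen_kappa([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```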

Ensemble classification and segmentation for intracranial metastatic tumors on MRI images based on 2D U-nets

Scientific Reports ◽

10.1038/s41598-021-99984-5 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Cheng-Chung Li ◽

Meng-Yun Wu ◽

Ying-Chou Sun ◽

Hung-Hsun Chen ◽

Hsiu-Mei Wu ◽

...

Keyword(s):

Active Contour Model ◽

Brain MRI ◽

Gamma Knife Radiosurgery ◽

Imbalanced Data ◽

Brain Magnetic Resonance Imaging ◽

Region Of Interest ◽

Ensemble Classification ◽

Metastatic Tumors ◽

Post Contrast ◽

Magnetic Resonance Imaging MRI

The extraction of brain tumor tissues in 3D brain magnetic resonance imaging (MRI) plays an important role in diagnosis before gamma knife radiosurgery (GKRS). In this article, post-contrast T1 whole-brain MRI images were collected by Taipei Veterans General Hospital (TVGH) and stored in DICOM format (dated from 1999 to 2018). The proposed method starts with an active contour model to obtain the region of interest (ROI) automatically and enhance the image contrast. The segmentation models are trained on MRI images with tumors to avoid the imbalanced data problem during model construction. To this end, a two-step ensemble approach is used: first, classify whether the image contains any tumor, and second, segment the intracranial metastatic tumors with ensemble neural networks based on the 2D U-Net architecture. Using ensembles for both classification and segmentation also improves segmentation accuracy. Classification achieves an F1-measure of 75.64%, while segmentation achieves an IoU of 84.83% and a Dice score of 86.21%. The method significantly reduces the time for manual labeling from 30 min to 18 s per patient.
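
For reference, the two segmentation metrics reported here can be computed for a single pair of binary masks as sketched below. The function names, the flat 0/1 list representation, and the convention of returning 1.0 when both masks are empty are assumptions; the abstract does not state how the paper averages the scores across images.

```python
from typing import Sequence


def iou(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Intersection over union of two binary masks given as flat 0/1 sequences."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    return 1.0 if union == 0 else intersection / union


def dice(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice score: twice the overlap divided by the total foreground in both masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    return 1.0 if total == 0 else 2.0 * intersection / total


pred = [1, 1, 0, 0]
truth = [1, 0, 1, 0]
print(iou(pred, truth), dice(pred, truth))
```

In this toy case one pixel overlaps, three pixels are in the union, and each mask has two foreground pixels, so the IoU is 1/3 and the Dice score is 1/2.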

A Novel Model for Imbalanced Data Classification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6145 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6680-6687

Author(s):

Jian Yin ◽

Chunjing Gan ◽

Kaiqi Zhao ◽

Xuan Lin ◽

Zhe Quan ◽

...

Keyword(s):

Imbalanced Data ◽

Data Classification ◽

Classification Performance ◽

Classification Model ◽

Proposed Model ◽

Imbalanced Data Classification ◽

Public Datasets ◽

Distribution Cost ◽

Novel Model ◽

Learning Data

Recently, imbalanced data classification has received much attention due to its wide range of applications. Existing studies have attempted to improve classification performance by considering various factors such as the imbalanced distribution, cost-sensitive learning, data space improvement, and ensemble learning. Nevertheless, most existing methods focus on only some of these factors. In this work, we propose a novel imbalanced data classification model that considers all of them. To evaluate the proposed model, we conducted experiments on 14 public datasets. The results show that our model outperforms state-of-the-art methods in terms of recall, G-mean, F-measure and AUC.
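
Three of the metrics named in this abstract can be computed directly from binary confusion-matrix counts, as in the sketch below. It assumes the minority class is the positive class, uses the F1 form of the F-measure, and returns 0.0 when a denominator is zero; the function name and the example counts are illustrative, and AUC is omitted because it needs ranked scores rather than counts.

```python
import math
from typing import Dict


def imbalance_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Recall, G-mean and F1 from binary confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    g_mean = math.sqrt(recall * specificity)
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"recall": recall, "g_mean": g_mean, "f1": f1}


# 10 positives (5 found), 10 negatives (all correctly rejected).
print(imbalance_metrics(tp=5, fn=5, fp=0, tn=10))
```

Unlike plain accuracy, the G-mean drops to zero if either class is entirely misclassified, which is why it is popular for imbalanced problems.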

Cross-validation Metrics for Evaluating Classification Performance on Imbalanced Data

2019 International Conference on Computer, Control, Informatics and its Applications (IC3INA) ◽

10.1109/ic3ina48034.2019.8949568 ◽

2019 ◽

Cited By ~ 1

Author(s):

Ni Wayan Surya Wardhani ◽

Masithoh Yessi Rochayani ◽

Atiek Iriany ◽

Agus Dwi Sulistyono ◽

Prayudi Lestantyo

Keyword(s):

Cross Validation ◽

Imbalanced Data ◽

Classification Performance ◽

Validation Metrics

Improving the classification performance on imbalanced data sets via new hybrid parameterisation model

Journal of King Saud University - Computer and Information Sciences ◽

10.1016/j.jksuci.2019.04.009 ◽

2019 ◽

Author(s):

Masurah Mohamad ◽

Ali Selamat ◽

Imam Much Subroto ◽

Ondrej Krejcar

Keyword(s):

Imbalanced Data ◽

Classification Performance ◽

Data Sets ◽

Imbalanced Data Sets

Addressing Imbalanced Data Problem with Generative Adversarial Network For Intrusion Detection

2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI) ◽

10.1109/iri49571.2020.00012 ◽

2020 ◽

Author(s):

Ibrahim Yilmaz ◽

Rahat Masum ◽

Ambareen Siraj

Keyword(s):

Intrusion Detection ◽

Imbalanced Data ◽

Generative Adversarial Network ◽

Adversarial Network ◽

Data Problem
