An Ensemble Framework of Multi-ratio Undersampling-based Imbalanced Classification

2021 ◽  
Vol 2 (1) ◽  
pp. 30-46
Author(s):  
Takahiro Komamizu ◽  
Yasuhiro Ogawa ◽  
Katsuhiko Toyama

Class imbalance is commonly observed in real-world data, and it is problematic in that it degrades classification performance due to biased supervision. Undersampling is an effective resampling approach to class imbalance. Conventional undersampling-based approaches use a single, fixed sampling ratio; however, different sampling ratios have different preferences toward the classes. In this paper, an undersampling-based ensemble framework, MUEnsemble, is proposed. The framework trains weak classifiers at different sampling ratios, and it allows a flexible design for weighting the weak classifiers across those ratios. To demonstrate the principle of the design, a uniform weighting function and a Gaussian weighting function are presented. An extensive experimental evaluation shows that MUEnsemble outperforms undersampling-based and oversampling-based state-of-the-art methods in terms of recall, gmean, F-measure, and ROC-AUC metrics. The evaluation also shows that the Gaussian weighting function is superior to the uniform weighting function, indicating that the Gaussian weighting function can capture the different preferences of sampling ratios toward classes. An investigation into the effects of the parameters of the Gaussian weighting function shows that these parameters can be chosen in terms of recall, which is preferred in many real-world applications.
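The multi-ratio idea described above can be sketched in a few lines. The sketch below is an illustration only, assuming binary 0/1 labels (1 = minority), a toy nearest-centroid weak learner, and hypothetical parameter names (`ratios`, `mu`, `sigma`); none of these names or defaults come from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, ratio):
    """Keep all minority (y == 1) samples and a random subset of the
    majority class of size ratio * n_minority (capped at the majority size)."""
    mino = np.flatnonzero(y == 1)
    majo = np.flatnonzero(y == 0)
    n_keep = min(len(majo), max(1, int(round(ratio * len(mino)))))
    idx = np.concatenate([mino, rng.choice(majo, n_keep, replace=False)])
    return X[idx], y[idx]

class NearestCentroid:
    """Tiny stand-in for a weak classifier."""
    def fit(self, X, y):
        self.c = np.stack([X[y == 0].mean(0), X[y == 1].mean(0)])
        return self
    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.c[None, :, :], axis=2)
        return d.argmin(axis=1)

def gaussian_weights(ratios, mu, sigma):
    """Weight each sampling ratio by a Gaussian centered at mu."""
    w = np.exp(-((np.asarray(ratios, float) - mu) ** 2) / (2 * sigma ** 2))
    return w / w.sum()

def mu_ensemble_predict(X_train, y_train, X_test, ratios, mu=2.0, sigma=1.0):
    """Weighted vote of weak classifiers trained at different sampling ratios."""
    w = gaussian_weights(ratios, mu, sigma)
    votes = np.zeros(len(X_test))
    for wi, r in zip(w, ratios):
        Xs, ys = undersample(X_train, y_train, r)
        votes += wi * NearestCentroid().fit(Xs, ys).predict(X_test)
    return (votes >= 0.5).astype(int)
```

With a uniform weighting function, `gaussian_weights` would simply be replaced by equal weights over the ratios; the Gaussian version lets the ensemble emphasize the ratios near `mu`.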

2014 ◽  
Vol 513-517 ◽  
pp. 2510-2513 ◽  
Author(s):  
Xu Ying Liu

Nowadays there are large volumes of data in real-world applications, which poses great challenges to class-imbalance learning: a large number of majority-class examples and severe class imbalance. Previous studies on class-imbalance learning mainly focused on relatively small or moderate class imbalance. In this paper we conduct an empirical study to explore the difference between learning with small or moderate class imbalance and learning with severe class imbalance. The experimental results show that: (1) Traditional methods cannot handle severe class imbalance effectively. (2) AUC, G-mean and F-measure can be very inconsistent under severe class imbalance, which seldom happens when the imbalance is moderate; moreover, G-mean is not appropriate for severe class-imbalance learning because it is not sensitive to changes in the imbalance ratio. (3) When AUC and G-mean are the evaluation metrics, EasyEnsemble is the best method, followed by BalanceCascade and under-sampling. (4) A slightly under-full balance is better for under-sampling to handle severe class imbalance, and it is important to handle false positives when designing methods for severe class imbalance.
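The metric behavior reported above is easy to reproduce from confusion-matrix counts. The sketch below is a reminder of the definitions (not code from the paper): G-mean is the geometric mean of the two per-class recalls, so it is unchanged when the imbalance ratio grows while per-class recalls stay fixed, whereas F-measure collapses with precision.

```python
import math

def metrics(tp, fn, fp, tn):
    """Per-class recalls, G-mean, and F-measure from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # recall on the positive (minority) class
    tnr = tn / (tn + fp)            # recall on the negative (majority) class
    precision = tp / (tp + fp) if tp + fp else 0.0
    gmean = math.sqrt(tpr * tnr)
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return tpr, tnr, gmean, f1

# Same per-class recalls (0.9 and 0.9) at two different imbalance ratios:
# G-mean is identical in both cases, while F-measure drops sharply.
_, _, g_mild, f_mild = metrics(tp=90, fn=10, fp=100, tn=900)        # roughly 1:10
_, _, g_severe, f_severe = metrics(tp=90, fn=10, fp=1000, tn=9000)  # roughly 1:100
```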


2020 ◽  
Vol 34 (04) ◽  
pp. 4691-4698
Author(s):  
Shu Li ◽  
Wen-Tao Li ◽  
Wei Wang

In many real-world applications, the data have several disjoint sets of features, and each set is called a view. Researchers have developed many multi-view learning methods in the past decade. In this paper, we bring Graph Convolutional Networks (GCNs) into multi-view learning and propose a novel multi-view semi-supervised learning method, Co-GCN, which adaptively exploits the graph information from the multiple views with combined Laplacians. Experimental results on real-world data sets verify that Co-GCN can achieve better performance compared with state-of-the-art multi-view semi-supervised methods.
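The "combined Laplacians" idea can be illustrated minimally: each view contributes a normalized graph Laplacian, and the views are mixed with non-negative weights before propagation. This is my illustration of the general construction, not the authors' Co-GCN code, and the weight normalization shown is an assumption.

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def combined_laplacian(adjs, weights):
    """Convex combination of per-view Laplacians; in Co-GCN-style methods
    the view weights would be learned rather than fixed."""
    w = np.asarray(weights, float)
    w = w / w.sum()
    return sum(wi * normalized_laplacian(A) for wi, A in zip(w, adjs))
```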


2021 ◽  
Vol 11 (14) ◽  
pp. 6310
Author(s):  
Ismael Lin ◽  
Octavio Loyola-González ◽  
Raúl Monroy ◽  
Miguel Angel Medina-Pérez

The usage of imbalanced databases is a recurrent problem in real-world data such as medical diagnosis, fraud detection, and pattern recognition. In class-imbalance problems, classifiers are commonly biased toward the class with more objects (the majority class) and ignore the class with fewer objects (the minority class). There are different ways to solve the class-imbalance problem, and there has been a trend towards pattern-based and fuzzy approaches due to their favorable results. In this paper, we provide an in-depth review of popular methods for imbalanced databases related to patterns and fuzzy approaches. The reviewed papers cover classifiers, data preprocessing, and evaluation metrics. We identify different application domains and describe how the methods are used. Finally, we suggest further research directions based on our analysis of the reviewed papers and the trends in the state of the art.


Author(s):  
Chao Qian ◽  
Guiying Li ◽  
Chao Feng ◽  
Ke Tang

The subset selection problem, which selects a few items from a ground set, arises in many applications such as maximum coverage, influence maximization, sparse regression, etc. The recently proposed POSS algorithm is a powerful approximation solver for this problem. However, POSS requires centralized access to the full ground set, and thus is impractical for large-scale real-world applications, where the ground set is too large to be stored on a single machine. In this paper, we propose a distributed version of POSS (DPOSS) with a bounded approximation guarantee. DPOSS can be easily implemented in the MapReduce framework. Our extensive experiments using Spark, on various real-world data sets with sizes ranging from thousands to millions of items, show that DPOSS achieves performance competitive with the centralized POSS, and is almost always better than the state-of-the-art distributed greedy algorithm RandGreeDi.
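The distributed pattern shared by DPOSS and RandGreeDi can be sketched for maximum coverage: partition the candidate sets across machines, solve locally, then solve again over the pooled local solutions. For brevity the sketch runs plain greedy in both rounds; the actual DPOSS runs Pareto optimization locally, so this is an illustration of the two-round scheme, not of POSS itself.

```python
def greedy_max_coverage(sets, k):
    """Standard greedy: repeatedly pick the set covering the most new elements."""
    chosen, covered = [], set()
    for _ in range(k):
        remaining = [s for s in sets if s not in chosen]
        if not remaining:
            break
        best = max(remaining, key=lambda s: len(sets[s] - covered))
        if not sets[best] - covered:    # no marginal gain left
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered

def two_round_distributed(sets, k, n_machines):
    """Round 1: partition candidates across machines and run greedy locally.
    Round 2: run greedy over the union of the local picks."""
    names = list(sets)
    partitions = [names[i::n_machines] for i in range(n_machines)]
    pooled = []
    for part in partitions:
        local, _ = greedy_max_coverage({n: sets[n] for n in part}, k)
        pooled.extend(local)
    return greedy_max_coverage({n: sets[n] for n in pooled}, k)
```

Only the small pooled candidate list ever needs to reside on one machine, which is what makes the scheme suitable for MapReduce-style execution.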


2021 ◽  
Vol 54 (6) ◽  
pp. 1-35
Author(s):  
Ninareh Mehrabi ◽  
Fred Morstatter ◽  
Nripsuta Saxena ◽  
Kristina Lerman ◽  
Aram Galstyan

With the widespread use of artificial intelligence (AI) systems and applications in our everyday lives, accounting for fairness has gained significant importance in the design and engineering of such systems. AI systems can be used in many sensitive environments to make important and life-changing decisions; thus, it is crucial to ensure that these decisions do not reflect discriminatory behavior toward certain groups or populations. More recently, some work has been developed in traditional machine learning and deep learning that addresses such challenges in different subdomains. With the commercialization of these systems, researchers are becoming more aware of the biases that these applications can contain and are attempting to address them. In this survey, we investigated different real-world applications that have shown biases in various ways, and we listed different sources of bias that can affect AI applications. We then created a taxonomy of the fairness definitions that machine learning researchers have proposed to avoid existing bias in AI systems. In addition, we examined different domains and subdomains in AI, showing what researchers have observed with regard to unfair outcomes in state-of-the-art methods and the ways they have tried to address them. There are still many future directions and solutions that can be pursued to mitigate the problem of bias in AI systems. We hope that this survey will motivate researchers to tackle these issues in the near future by observing existing work in their respective fields.
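Two of the most common group-fairness definitions that appear in such taxonomies can be checked mechanically from model outputs. The sketch below is my illustration, not code from the survey: it computes the demographic-parity gap and the equal-opportunity gap given predictions, true labels, and a binary group attribute.

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """|P(yhat = 1 | group = 0) - P(yhat = 1 | group = 1)|."""
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return abs(y_pred[group == 0].mean() - y_pred[group == 1].mean())

def equal_opportunity_gap(y_true, y_pred, group):
    """|TPR(group 0) - TPR(group 1)|, computed over truly positive examples."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    pos = y_true == 1
    tpr0 = y_pred[pos & (group == 0)].mean()
    tpr1 = y_pred[pos & (group == 1)].mean()
    return abs(tpr0 - tpr1)
```

A gap of zero means the classifier satisfies the corresponding definition exactly; in practice, small non-zero thresholds are used.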


2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Yiqing Zhao ◽  
Saravut J. Weroha ◽  
Ellen L. Goode ◽  
Hongfang Liu ◽  
Chen Wang

Abstract Background Next-generation sequencing provides comprehensive information about individuals’ genetic makeup and is commonplace in oncology clinical practice. However, the utility of genetic information in the clinical decision-making process has not been examined extensively from a real-world, data-driven perspective. Through mining real-world data (RWD) from clinical notes, we could extract patients’ genetic information and further associate treatment decisions with genetic information. Methods We proposed a real-world evidence (RWE) study framework that incorporates context-based natural language processing (NLP) methods and data quality examination before final association analysis. The framework was demonstrated on a Foundation-tested women cancer cohort (N = 196). Upon retrieval of patients’ genetic information using the NLP system, we assessed the completeness of the genetic data captured in unstructured clinical notes according to a genetic data model. We examined the distribution of different topics regarding BRCA1/2 throughout patients’ treatment process, and then analyzed the association between BRCA1/2 mutation status and the discussion/prescription of targeted therapy. Results We identified seven topics in the clinical context of genetic mentions: Information, Evaluation, Insurance, Order, Negative, Positive, and Variants of unknown significance. Our rule-based system achieved a precision of 0.87, recall of 0.93, and F-measure of 0.91. Our machine learning system achieved a precision of 0.901, recall of 0.899, and F-measure of 0.9 for four-topic classification, and a precision of 0.833, recall of 0.823, and F-measure of 0.82 for seven-topic classification. We found that, in result-containing sentences, BRCA1/2 mutation information was captured 75% of the time, but detailed variant information (e.g., variant types) was largely missing. Using cleaned RWD, significant associations were found between BRCA1/2 positive mutation status and targeted therapies.
Conclusions In conclusion, we demonstrated a framework to generate RWE using RWD from different clinical sources. The rule-based NLP system achieved the best performance for resolving contextual variability when extracting RWD from unstructured clinical notes. Data quality issues such as incompleteness and discrepancies exist; thus, manual data cleaning is needed before further analysis can be performed. Finally, we were able to use cleaned RWD to evaluate the real-world utility of genetic information in initiating a prescription of targeted therapy.




2020 ◽  
Vol 68 ◽  
pp. 311-364
Author(s):  
Francesco Trovo ◽  
Stefano Paladino ◽  
Marcello Restelli ◽  
Nicola Gatti

Multi-Armed Bandit (MAB) techniques have been successfully applied to many classes of sequential decision problems in the past decades. However, non-stationary settings, which are very common in real-world applications, have received little attention so far, and theoretical guarantees on the regret are known only for some frequentist algorithms. In this paper, we propose an algorithm, namely Sliding-Window Thompson Sampling (SW-TS), for non-stationary stochastic MAB settings. Our algorithm is based on Thompson Sampling and exploits a sliding-window approach to tackle, in a unified fashion, two different forms of non-stationarity studied separately so far: abruptly changing and smoothly changing. In the former, the reward distributions are constant during sequences of rounds, and their change may be arbitrary and happen at unknown rounds, while, in the latter, the reward distributions smoothly evolve over rounds according to unknown dynamics. Under mild assumptions, we provide upper bounds on the dynamic pseudo-regret of SW-TS for the abruptly changing environment, for the smoothly changing one, and for the setting in which both forms of non-stationarity are present. Furthermore, we empirically show that SW-TS dramatically outperforms state-of-the-art algorithms even when the two forms of non-stationarity are considered separately, as previously studied in the literature.
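For Bernoulli rewards, the sliding-window idea reduces to Thompson Sampling whose Beta posteriors are computed only from the last tau rounds, so that stale observations age out after an abrupt change. The sketch below is a minimal illustration under that assumption; the window size `tau` and the Beta(1, 1) prior are my choices, not values prescribed by the paper.

```python
import numpy as np
from collections import deque

class SlidingWindowTS:
    """Thompson Sampling for Bernoulli bandits using only the last tau rounds."""
    def __init__(self, n_arms, tau, seed=0):
        self.n_arms = n_arms
        self.window = deque(maxlen=tau)   # (arm, reward) pairs; oldest fall out
        self.rng = np.random.default_rng(seed)

    def select(self):
        s = np.zeros(self.n_arms)
        f = np.zeros(self.n_arms)
        for arm, reward in self.window:   # successes/failures inside the window
            s[arm] += reward
            f[arm] += 1 - reward
        samples = self.rng.beta(1 + s, 1 + f)   # Beta(1, 1) prior per arm
        return int(samples.argmax())

    def update(self, arm, reward):
        self.window.append((arm, reward))
```

After an abrupt change, the posterior forgets the old regime within `tau` rounds; a smaller `tau` adapts faster but estimates each arm's mean from fewer samples.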


Entropy ◽  
2020 ◽  
Vol 22 (4) ◽  
pp. 407 ◽  
Author(s):  
Dominik Weikert ◽  
Sebastian Mai ◽  
Sanaz Mostaghim

In this article, we present a new algorithm called Particle Swarm Contour Search (PSCS), a Particle Swarm Optimisation inspired algorithm to find object contours in 2D environments. Currently, most contour-finding algorithms are based on image processing and require a complete overview of the search space in which the contour is to be found. However, for real-world applications this would require complete knowledge of the search space, which may not always be feasible. The proposed algorithm removes this requirement and relies only on the local information of the particles to accurately identify a contour. Particles search for the contour of an object and then traverse along it using their known information about positions inside and outside the object. Our experiments show that the proposed PSCS algorithm can deliver results comparable to the state of the art.
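The core idea, finding a contour using only local evaluations, can be illustrated on an implicit shape f(x) = 0: a standard particle swarm minimises |f(x)|, with each particle evaluating f only at its own position, so no global view of the domain is needed. This is my simplified illustration that drives the swarm to a point on the contour; the actual PSCS additionally traverses along the contour with its own update rules.

```python
import numpy as np

def contour_search(f, n_particles=30, iters=200, seed=0):
    """Drive a particle swarm toward the contour f(x) = 0 by minimising |f(x)|.
    Standard PSO update; each particle only evaluates f at its own position."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2, 2, size=(n_particles, 2))   # positions
    v = np.zeros_like(x)                            # velocities
    pbest = x.copy()                                # per-particle best position
    pcost = np.abs(np.apply_along_axis(f, 1, x))
    gbest = pbest[pcost.argmin()].copy()            # swarm-wide best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, 1))
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = x + v
        cost = np.abs(np.apply_along_axis(f, 1, x))
        better = cost < pcost
        pbest[better], pcost[better] = x[better], cost[better]
        gbest = pbest[pcost.argmin()].copy()
    return pbest, pcost
```

For example, with f(p) = p₀² + p₁² − 1 the swarm settles on the unit circle, the contour of a disc, without ever having a global image of the plane.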


2008 ◽  
Vol 8 (5-6) ◽  
pp. 545-580 ◽  
Author(s):  
WOLFGANG FABER ◽  
GERALD PFEIFER ◽  
NICOLA LEONE ◽  
TINA DELL'ARMI ◽  
GIUSEPPE IELPA

Abstract Disjunctive logic programming (DLP) is a very expressive formalism: it allows for expressing every property of finite structures that is decidable in the complexity class Σ^P_2 (= NP^NP). Despite this high expressiveness, there are some simple properties, often arising in real-world applications, which cannot be encoded in a simple and natural manner. In particular, properties that require the use of arithmetic operators (like sum, times, or count) on a set or multiset of elements satisfying some conditions cannot be naturally expressed in classic DLP. To overcome this deficiency, we extend DLP with aggregate functions in a conservative way. In particular, we avoid introducing constructs with disputed semantics by requiring aggregates to be stratified. We formally define the semantics of the extended language and illustrate how it can be profitably used for representing knowledge. Furthermore, we analyze its computational complexity, showing that the addition of aggregates does not bring a higher cost in that respect. Finally, we provide an implementation in DLV, a state-of-the-art DLP system, and report on experiments which confirm the usefulness of the proposed extension also for the efficiency of computation.

