LICIC: Less Important Components for Imbalanced Multiclass Classification

Vincenzo Dentamaro; Donato Impedovo; Giuseppe Pirlo

doi:10.3390/info9120317

Information ◽

10.3390/info9120317 ◽

2018 ◽

Vol 9 (12) ◽

pp. 317 ◽

Cited By ~ 5

Author(s):

Vincenzo Dentamaro ◽

Donato Impedovo ◽

Giuseppe Pirlo

Keyword(s):

Gene Expression ◽

Class Imbalance ◽

Imbalanced Data ◽

Multiclass Classification ◽

Cancer Diagnostics ◽

Mass Spectrometry Data ◽

High Dimensional ◽

Or Gene ◽

High Dimensional Datasets

Multiclass classification in cancer diagnostics using DNA or gene expression signatures, like the classification of bacterial species fingerprints in MALDI-TOF mass spectrometry data, is challenging because the data are imbalanced and the number of dimensions is high relative to the number of instances. This study presents a new oversampling technique, LICIC, as an instrument for countering both class imbalance and the “curse of dimensionality”. The method preserves non-linearities within the dataset while creating new instances without adding noise. It is compared with other oversampling methods, such as Random Oversampling, SMOTE, Borderline-SMOTE, and ADASYN. F1 scores show the validity of the technique on imbalanced, multiclass, high-dimensional datasets.
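
For orientation, the kind of baseline the abstract compares against can be sketched in a few lines. The snippet below is a simplified SMOTE-style oversampler, not LICIC and not the reference SMOTE implementation: the function name `smote_like_oversample`, the neighbour count `k`, and the toy data are all assumptions made for illustration. It brings every class up to the size of the largest one by interpolating between a sample and one of its nearest same-class neighbours.

```python
import random
from collections import Counter


def smote_like_oversample(X, y, k=3, seed=0):
    """Balance classes by interpolating between same-class neighbours.

    Simplified SMOTE-style sketch (hypothetical helper), not LICIC.
    X is a list of feature lists, y a list of class labels.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            base = rng.choice(members)
            if len(members) == 1:
                # No neighbour to interpolate with: fall back to duplication.
                X_out.append(list(base))
                y_out.append(label)
                continue
            # Rank the other class members by squared Euclidean distance.
            others = [m for m in members if m is not base]
            others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
            neighbour = rng.choice(others[:k])
            t = rng.random()  # position along the segment base -> neighbour
            X_out.append([a + t * (b - a) for a, b in zip(base, neighbour)])
            y_out.append(label)
    return X_out, y_out


# Toy data: four majority samples, two minority samples.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [5, 5], [6, 5]]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = smote_like_oversample(X, y)
print(Counter(y_bal))
```

Every synthetic point lies on a straight segment between two existing samples, so this sketch only interpolates linearly in the original feature space. In settings with far more dimensions than instances, nearest-neighbour distances become less informative, which is the situation the abstract says LICIC targets.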

A class imbalance-aware Relief algorithm for the classification of tumors using microarray gene expression data

Computational Biology and Chemistry ◽

10.1016/j.compbiolchem.2019.03.017 ◽

2019 ◽

Vol 80 ◽

pp. 121-127 ◽

Cited By ~ 3

Author(s):

Yuanyu He ◽

Junhai Zhou ◽

Yaping Lin ◽

Tuanfei Zhu

Keyword(s):

Gene Expression ◽

Gene Expression Data ◽

Class Imbalance ◽

Microarray Gene Expression Data ◽

Expression Data ◽

Microarray Gene Expression ◽

Relief Algorithm ◽

Classification Of Tumors ◽

Microarray Gene

Feature Selection and Classification of High Dimensional Mass Spectrometry Data: A Genetic Programming Approach

Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics - Lecture Notes in Computer Science ◽

10.1007/978-3-642-37189-9_5 ◽

2013 ◽

pp. 43-55 ◽

Cited By ~ 15

Author(s):

Soha Ahmed ◽

Mengjie Zhang ◽

Lifeng Peng

Keyword(s):

Mass Spectrometry ◽

Feature Selection ◽

Genetic Programming ◽

Mass Spectrometry Data ◽

High Dimensional ◽

Programming Approach

Incorporating Feature Ranking and Evolutionary Methods for the Classification of High-Dimensional DNA Microarray Gene Expression Data

Australasian Medical Journal ◽

10.21767/amj.2013.1641 ◽

2013 ◽

Vol 06 (05) ◽

Author(s):

Mani Abedini ◽

Michael Kirley ◽

Raymond Chiong

Keyword(s):

Gene Expression ◽

Dna Microarray ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

High Dimensional ◽

Feature Ranking ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

Incorporating feature ranking and evolutionary methods for the classification of high-dimensional DNA microarray gene expression data

Australasian Medical Journal ◽

10.4066/amj.2013.1641 ◽

2013 ◽

Vol 6 (5) ◽

pp. 272-279 ◽

Cited By ~ 6

Author(s):

Mani Abedini

Keyword(s):

Gene Expression ◽

Dna Microarray ◽

Gene Expression Data ◽

Microarray Gene Expression Data ◽

High Dimensional ◽

Feature Ranking ◽

Expression Data ◽

Microarray Gene Expression ◽

Microarray Gene

A Dynamic Ensemble Framework for Mining Textual Streams with Class Imbalance

The Scientific World JOURNAL ◽

10.1155/2014/497354 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11 ◽

Cited By ~ 2

Author(s):

Ge Song ◽

Yunming Ye

Keyword(s):

Large Scale ◽

State Of The Art ◽

Concept Drift ◽

Real Life ◽

Class Imbalance ◽

High Dimensional ◽

Adaptive Selection ◽

Stream Classification ◽

Rare Class

Textual stream classification has become a realistic and challenging issue now that large-scale, high-dimensional, and non-stationary streams with class imbalance are widely used in real-life applications. The characteristics of textual streams make their classification technically difficult, especially in an imbalanced environment. In this paper, we propose a new ensemble framework, clustering forest, for learning from imbalanced textual streams with concept drift (CFIM). CFIM is based on ensemble learning and integrates a set of clustering trees (CTs). It includes an adaptive selection method that flexibly chooses the useful CTs according to the properties of the stream. In particular, to deal with class imbalance, we collect and reuse both rare-class instances and misclassified instances from the historical chunks. Unlike most existing approaches, our approach assumes that both the majority class and the rare class may suffer from concept drift. Thus the distribution of resampled instances is similar to the current concept. The effectiveness of CFIM is examined on five real-world textual streams in an imbalanced, non-stationary environment. Experimental results demonstrate that CFIM achieves better performance than four state-of-the-art ensemble models.

HDG-select: A novel GUI based application for gene selection and classification in high dimensional datasets

PLoS ONE ◽

10.1371/journal.pone.0246039 ◽

2021 ◽

Vol 16 (1) ◽

pp. e0246039

Author(s):

Shilan S. Hameed ◽

Rohayanti Hassan ◽

Wan Haslina Hassan ◽

Fahmi F. Muhammadsharif ◽

Liza Abdul Latiff

Keyword(s):

Machine Learning ◽

User Interface ◽

Graphical User Interface ◽

Efficient Algorithm ◽

Gene Selection ◽

High Dimensional ◽

Competitive Performance ◽

User Friendly ◽

High Dimensional Datasets

The selection and classification of genes is essential for identifying the genes related to a specific disease. A user-friendly application that combines statistical rigor with machine learning functionality is of great importance to biomedical researchers and end users. In this work, a novel stand-alone application with a graphical user interface (GUI) is developed to perform the full workflow of gene selection and classification in high dimensional datasets. The application, called HDG-select, is validated on eleven high dimensional datasets in CSV and GEO SOFT formats. The proposed tool uses an efficient combined filter-GBPSO-SVM algorithm and has been made freely available to users. HDG-select was found to outperform other tools reported in the literature and to offer competitive performance, accessibility, and functionality.

Ensemble-support vector machine-random undersampling: Simulation study of multiclass classification for handling high dimensional and imbalanced data

Journal of Physics Conference Series ◽

10.1088/1742-6596/1613/1/012064 ◽

2020 ◽

Vol 1613 ◽

pp. 012064

Author(s):

Nur Silviyah Rahmi

Keyword(s):

Support Vector Machine ◽

Simulation Study ◽

Imbalanced Data ◽

Multiclass Classification ◽

High Dimensional ◽

Support Vector ◽

Random Undersampling

Sparse Proteomics Analysis – a compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data

BMC Bioinformatics ◽

10.1186/s12859-017-1565-4 ◽

2017 ◽

Vol 18 (1) ◽

Cited By ~ 10

Author(s):

Tim O. F. Conrad ◽

Martin Genzel ◽

Nada Cvetkovic ◽

Niklas Wulkow ◽

Alexander Leichtle ◽

...

Keyword(s):

Mass Spectrometry ◽

Feature Selection ◽

Compressed Sensing ◽

Mass Spectrometry Data ◽

High Dimensional ◽

Proteomics Analysis

Dissection of gene expression datasets into clinically relevant interaction signatures via high-dimensional correlation maximization

Nature Communications ◽

10.1038/s41467-019-12713-5 ◽

2019 ◽

Vol 10 (1) ◽

Author(s):

Michael Grau ◽

Georg Lenz ◽

Peter Lenz

Keyword(s):

Gene Expression ◽

B Cell Lymphoma ◽

High Dimensional ◽

Large B Cell Lymphoma ◽

Nonlinear Signal ◽

High Dimensional Datasets ◽

Technological Platforms ◽

Learning Concept ◽

Relevant Gene ◽

Relevant Interaction

Gene expression is controlled by many simultaneous interactions, frequently measured collectively in biology and medicine by high-throughput technologies. It is a highly challenging task to infer from these data the generating effects and cooperating genes. Here, we present an unsupervised hypothesis-generating learning concept termed signal dissection by correlation maximization (SDCM) that dissects large high-dimensional datasets into signatures. Each signature captures a particular signal pattern that was consistently observed for multiple genes and samples, likely caused by the same underlying interaction. A key difference to other methods is our flexible nonlinear signal superposition model, combined with a precise regression technique. Analyzing gene expression of diffuse large B-cell lymphoma, our method discovers previously unidentified signatures that reveal significant differences in patient survival. These signatures are more predictive than those from various methods used for comparison and robustly validate across technological platforms. This implies highly specific extraction of clinically relevant gene interactions.

Classification of Imbalanced Data: A Review

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001409007326 ◽

2009 ◽

Vol 23 (04) ◽

pp. 687-719 ◽

Cited By ~ 534

Author(s):

Yanmin Sun ◽

Andrew K. C. Wong ◽

Mohamed S. Kamel

Keyword(s):

Learning Algorithms ◽

Class Imbalance ◽

Imbalanced Data ◽

Class Imbalance Problem ◽

Class Distribution ◽

Imbalance Problem ◽

Misclassification Costs ◽

Imbalanced Class Distribution ◽

Classifier Learning

Data with an imbalanced class distribution significantly degrade the performance attainable by most standard classifier learning algorithms, which assume a relatively balanced class distribution and equal misclassification costs. This paper reviews the classification of imbalanced data with regard to: the application domains; the nature of the problem; the learning difficulties with standard classifier learning algorithms; the learning objectives and evaluation measures; the reported research solutions; and the class imbalance problem in the presence of multiple classes.
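
The point about evaluation measures can be made concrete with a small worked example. The sketch below is illustrative only and is not taken from the review: the helper names `accuracy`, `f1_per_class`, and `macro_f1` and the 90/10 toy labels are assumptions. It shows a classifier that always predicts the majority class scoring well on accuracy while its macro-averaged F1 exposes the ignored minority class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """One-vs-rest F1 score for every label seen in y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so each class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy imbalanced problem: 90 majority samples (0), 10 minority samples (1).
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100  # a classifier that always predicts the majority class

print("accuracy:", accuracy(y_true, y_pred))
print("per-class F1:", f1_per_class(y_true, y_pred))
print("macro F1:", macro_f1(y_true, y_pred))
```

Macro averaging weights every class equally regardless of its size, which is one reason F1-based measures are commonly reported for imbalanced multiclass problems.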
