Margin-Based Pareto Ensemble Pruning: An Ensemble Pruning Algorithm That Learns to Search Optimized Ensembles

2019 ◽  
Vol 2019 ◽  
pp. 1-12 ◽  
Author(s):  
Ruihan Hu ◽  
Songbin Zhou ◽  
Yisen Liu ◽  
Zhiri Tang

The ensemble pruning system is an effective machine learning framework that combines several learners as experts to classify a test set. Generally, ensemble pruning systems define a region of competence on a validation set in order to select the most competent ensembles from the ensemble pool with respect to the test set. However, the size of the ensemble pool is usually fixed, and the performance of an ensemble pool depends heavily on how the region of competence is defined. In this paper, a dynamic pruning framework called margin-based Pareto ensemble pruning is proposed for ensemble pruning systems. The framework explores the optimal ensemble pool size during the overproduction stage and fine-tunes the experts during the pruning stage. The Pareto optimization algorithm is used to explore overproduction ensemble pool sizes that yield better performance. Considering the information entropy of the learners in the indecision region, a margin criterion is computed for each learner in the ensemble pool and used to prune the experts with respect to the test set. The effectiveness of the proposed method is assessed on classification datasets. The results show that margin-based Pareto ensemble pruning achieves smaller ensemble sizes and better classification performance on most datasets compared with state-of-the-art models.
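The margin-based pruning step can be sketched in a few lines. This is a hedged illustration, not the authors' algorithm: the Pareto search over pool sizes and the entropy-based indecision region are omitted, and all names (`ensemble_margin`, `prune_by_margin`, `threshold`) are hypothetical.

```python
def ensemble_margin(votes, true_label):
    """Margin of an ensemble vote: fraction of learners voting for the
    true class minus the largest fraction voting for any other class."""
    n = len(votes)
    counts = {}
    for v in votes:
        counts[v] = counts.get(v, 0) + 1
    correct = counts.get(true_label, 0) / n
    wrong = max((c for lbl, c in counts.items() if lbl != true_label), default=0) / n
    return correct - wrong

def prune_by_margin(learners, val_X, val_y, threshold=0.0):
    """Keep a learner only if removing it lowers the ensemble's mean
    margin on the validation set."""
    def mean_margin(pool):
        return sum(ensemble_margin([l(x) for l in pool], y)
                   for x, y in zip(val_X, val_y)) / len(val_y)
    full = mean_margin(learners)
    return [l for i, l in enumerate(learners)
            if full - mean_margin(learners[:i] + learners[i + 1:]) > threshold]
```

For example, with two learners that always vote correctly and one that never does, the dissenting learner is the only one whose removal raises the mean margin, so it is pruned.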

2019 ◽  
Vol 66 (1) ◽  
pp. 239-246 ◽  
Author(s):  
Chao Wu ◽  
Xiaonan Zhao ◽  
Mark Welsh ◽  
Kellianne Costello ◽  
Kajia Cao ◽  
...  

BACKGROUND: Molecular profiling has become essential for tumor risk stratification and treatment selection. However, cancer genome complexity and technical artifacts make identification of real variants a challenge. Currently, clinical laboratories rely on manual screening, which is costly, subjective, and not scalable. We present a machine learning–based method to distinguish artifacts from bona fide single-nucleotide variants (SNVs) detected by next-generation sequencing of non-formalin-fixed, paraffin-embedded tumor specimens.

METHODS: A cohort of 11,278 SNVs identified through clinical sequencing of tumor specimens was collected and divided into training, validation, and test sets. Each SNV was manually inspected and labeled as either real or artifact as part of the clinical laboratory workflow. A 3-class (real, artifact, and uncertain) model was developed on the training set, fine-tuned with the validation set, and then evaluated on the test set. Prediction intervals reflecting the certainty of the classifications were derived during the process to label "uncertain" variants.

RESULTS: The optimized classifier demonstrated 100% specificity and 97% sensitivity over the 5,587 SNVs of the test set. Overall, 1,252 of 1,341 true-positive variants were identified as real and 4,143 of 4,246 false-positive calls were deemed artifacts, whereas only 192 (3.4%) SNVs were labeled "uncertain," with zero misclassification between true positives and artifacts in the test set.

CONCLUSIONS: We present a computational classifier to identify variant artifacts detected in tumor sequencing. Overall, 96.6% of the SNVs received definitive labels and were thus exempt from manual review. This framework could improve the quality and efficiency of the variant review process in clinical laboratories.
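The three-way labeling idea can be sketched minimally, assuming a calibrated probability that a call is real; the thresholds and the name `triage` are illustrative, not the paper's prediction-interval procedure:

```python
def triage(prob_real, lower=0.2, upper=0.8):
    """Map a calibrated P(variant is real) to 'real', 'artifact', or
    'uncertain'; only the middle band goes to manual review."""
    if prob_real >= upper:
        return "real"
    if prob_real <= lower:
        return "artifact"
    return "uncertain"
```

Tightening the band (raising `upper`, lowering `lower`) trades a larger manual-review workload for fewer misclassifications, which is the trade-off the paper tunes on its validation set.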


2020 ◽  
Vol 34 (04) ◽  
pp. 5867-5874
Author(s):  
Gan Sun ◽  
Yang Cong ◽  
Qianqian Wang ◽  
Jun Li ◽  
Yun Fu

Over the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering with a fixed task set and cannot incorporate a new spectral clustering task without access to previously learned tasks. In this paper, we explore spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from a knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) an orthogonal basis library, capturing latent cluster centers among the clusters in each pair of tasks; and 2) a feature embedding library, embedding the feature manifold information shared among multiple related tasks. When a new spectral clustering task arrives, L2SC first transfers knowledge from both the basis library and the feature library to obtain an encoding matrix, and further redefines the library bases over time to maximize performance across all clustering tasks. Meanwhile, a general online update formulation is derived to alternately update the basis library and the feature library. Finally, empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model effectively improves clustering performance compared with other state-of-the-art spectral clustering algorithms.
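L2SC builds on the classical spectral step, which can be illustrated as a two-way spectral cut: power iteration on a shifted Laplacian recovers the Fiedler vector, whose signs split the nodes. This pure-Python sketch covers only that background step (the lifelong library transfer is omitted, and `fiedler_partition` is an illustrative name):

```python
def fiedler_partition(n, edges, iters=300):
    """Two-way spectral cut of a small unweighted graph via power
    iteration on M = shift*I - L, restricted to the complement of the
    all-ones vector; node signs in the Fiedler vector give the clusters."""
    # Build the graph Laplacian L = D - A.
    L = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        L[a][b] -= 1.0
        L[b][a] -= 1.0
        L[a][a] += 1.0
        L[b][b] += 1.0
    shift = 2.0 * max(L[i][i] for i in range(n)) + 1.0  # exceeds L's spectrum
    M = [[(shift if i == j else 0.0) - L[i][j] for j in range(n)]
         for i in range(n)]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        mean = sum(v) / n  # project out the constant (lambda = 0) eigenvector
        v = [x - mean for x in v]
        w = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return ({i for i in range(n) if v[i] >= 0},
            {i for i in range(n) if v[i] < 0})
```

On two triangles joined by a single bridge edge, the cut recovers the two triangles.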


Author(s):  
Nacer Farajzadeh ◽  
Gang Pan ◽  
Zhaohui Wu ◽  
Min Yao

This paper proposes a new approach to improving multiclass classification performance by combining a stacked generalization structure with a one-against-one decomposition strategy. The proposed approach encodes the outputs of all pairwise classifiers by implicitly embedding two-class discriminative information in a probabilistic manner. The encoded outputs, called Meta Probability Codes (MPCs), are interpreted as projections of the original features. Compared with the original features, MPCs are observed to be better suited to clustering. Based on MPCs, we introduce a cluster-based multiclass classification algorithm called MPC-Clustering. MPC-Clustering projects the original feature space to MPCs, clusters the resulting codes, and then trains an individual multiclass classifier on each cluster to complete the multiclass classifier induction. The performance of the proposed algorithm is extensively evaluated on 20 datasets from the UCI machine learning repository. The results show that MPC-Clustering is highly effective, improving the overall classification rate by 2.4% compared with state-of-the-art multiclass classifiers.
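The MPC encoding step can be sketched as concatenating the probabilistic outputs of all one-against-one classifiers into one meta-feature vector. The interface is an assumption for illustration: `pairwise[(a, b)](x)` is taken to return the probability of class `a` in the `{a, b}` sub-problem.

```python
from itertools import combinations

def meta_probability_code(x, pairwise, classes):
    """Encode sample x as the vector of all pairwise-classifier
    probabilities, one entry per unordered class pair (an MPC sketch)."""
    return [pairwise[(a, b)](x) for a, b in combinations(classes, 2)]
```

For k classes the code has k(k-1)/2 entries; clustering then operates on these codes instead of the raw features.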


Author(s):  
Jacob Whitehill

Recent work on privacy-preserving machine learning has considered how data-mining competitions such as Kaggle could potentially be "hacked", either intentionally or inadvertently, by using information from an oracle that reports a classifier's accuracy on the test set (Blum and Hardt 2015; Hardt and Ullman 2014; Zheng 2015; Whitehill 2016). For binary classification tasks in particular, one of the most common accuracy metrics is the Area Under the ROC Curve (AUC), and in this paper we explore the mathematical structure of how the AUC is computed from an n-vector of real-valued "guesses" with respect to the ground-truth labels. Under the assumption of perfect knowledge of the test-set AUC c = p/q, we show how knowing c constrains the set W of possible ground-truth labelings, and we derive an algorithm both to compute the exact number of such labelings and to enumerate them efficiently. We also provide empirical evidence that, surprisingly, the number of compatible labelings can actually decrease as n grows, until a test-set-dependent threshold is reached. Finally, we show how W can be efficiently whittled down, through pairs of oracle queries, to infer all the ground-truth test labels with complete certainty.
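The two central quantities can be sketched directly: the AUC as an exact rational c = p/q (the fraction of correctly ranked positive-negative pairs, ties counting 1/2), and the set W found by brute force. The paper's counting/enumeration algorithm is far more efficient; this exhaustive version is only illustrative.

```python
from fractions import Fraction
from itertools import combinations

def exact_auc(guesses, labels):
    """AUC as an exact rational: fraction of (positive, negative) pairs
    where the positive gets the higher guess; ties count 1/2."""
    pos = [g for g, y in zip(guesses, labels) if y == 1]
    neg = [g for g, y in zip(guesses, labels) if y == 0]
    total = Fraction(0)
    for p in pos:
        for q in neg:
            if p > q:
                total += 1
            elif p == q:
                total += Fraction(1, 2)
    return total / (len(pos) * len(neg))

def compatible_labelings(guesses, k_pos, target):
    """Brute-force the set W: labelings with k_pos positives whose AUC
    against `guesses` equals `target` (exhaustive, for tiny n only)."""
    n = len(guesses)
    out = []
    for chosen in combinations(range(n), k_pos):
        labels = [1 if i in chosen else 0 for i in range(n)]
        if exact_auc(guesses, labels) == target:
            out.append(tuple(labels))
    return out
```

With guesses (0.9, 0.8, 0.3, 0.2) and a reported AUC of exactly 1, only one labeling with two positives is compatible, illustrating how a single oracle value can pin W down.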


2021 ◽  
Vol 15 ◽  
pp. 174830262110449
Author(s):  
Kai-Jun Hu ◽  
He-Feng Yin ◽  
Jun Sun

During the past decade, representation-based classification methods have received considerable attention in the pattern recognition community. The recently proposed non-negative representation-based classifier (NRC) achieves superb recognition results in diverse pattern classification tasks. Unfortunately, the discriminative information in the training data is not fully exploited by the NRC, which undermines its classification performance in practical applications. To address this problem, we introduce a decorrelation regularizer into the NRC formulation and propose a discriminative non-negative representation-based classifier (DNRC) for pattern classification. The decorrelation regularizer reduces the correlation among the representation results of different classes, thereby promoting competition among them. Experimental results on benchmark datasets validate the efficacy of the proposed DNRC, which can outperform some state-of-the-art deep-learning-based methods. The source code is available at https://github.com/yinhefeng/DNRC.
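The base non-negative representation rule (without the paper's decorrelation regularizer) can be sketched as: solve min ||y - Da||^2 subject to a >= 0 by projected gradient, then assign y to the class with the smallest class-wise reconstruction residual. The tiny solver and all names are illustrative, not the authors' implementation.

```python
def nn_coefficients(D, y, steps=500, lr=0.01):
    """Projected-gradient solve of min ||y - D a||^2 s.t. a >= 0.
    D is a list of column vectors (training samples)."""
    m = len(D)
    a = [0.0] * m
    for _ in range(steps):
        # Residual r = y - D a.
        r = [yi - sum(D[j][i] * a[j] for j in range(m))
             for i, yi in enumerate(y)]
        # Gradient of 0.5*||r||^2 w.r.t. a_j is -D_j . r; project onto a >= 0.
        for j in range(m):
            g = -sum(D[j][i] * r[i] for i in range(len(y)))
            a[j] = max(0.0, a[j] - lr * g)
    return a

def nrc_classify(D, labels, y):
    """Assign y to the class whose atoms reconstruct it with the
    smallest residual, using the non-negative coefficients."""
    a = nn_coefficients(D, y)
    best, best_res = None, float("inf")
    for c in set(labels):
        r = [yi - sum(D[j][i] * a[j] for j in range(len(D)) if labels[j] == c)
             for i, yi in enumerate(y)]
        res = sum(ri * ri for ri in r)
        if res < best_res:
            best, best_res = c, res
    return best
```

The DNRC adds a decorrelation term to this objective so that coefficients concentrate on a single class; that term is omitted here.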


2021 ◽  
Vol 11 (18) ◽  
pp. 8405
Author(s):  
Alfonso Monaco ◽  
Antonio Lacalamita ◽  
Nicola Amoroso ◽  
Armando D’Orta ◽  
Andrea Del Buono ◽  
...  

Heavy metals are a dangerous source of pollution due to their toxicity, persistence in the environment, and chemical nature. It is well known that long-term exposure to heavy metals is related to several chronic degenerative diseases (cardiovascular diseases, neoplasms, neurodegenerative syndromes, etc.). In this work, we propose a machine learning framework to evaluate the severity of cardiovascular disease (CVD) from human scalp hair analysis (HSHA) tests and genetic analysis, and to identify the small group of clinical features most associated with CVD risk. Using a private dataset provided by the DD Clinic foundation in Caserta, Italy, we cross-validated the classification performance of a random forest model on 90 subjects affected by CVD. The proposed model reached an AUC of 0.78 ± 0.01 on a three-class classification problem. The robustness of the predictions was assessed by comparison with different cross-validation schemes and with two state-of-the-art classifiers, an artificial neural network and a general linear model. This is the first work that studies, through a machine learning approach, the tight link between CVD severity, heavy metal concentrations, and SNPs. Moreover, the selected features are highly correlated with the CVD phenotype and could represent targets for future CVD therapies.
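The abstract does not specify the cross-validation schemes compared; as one common choice for a small three-class cohort, a stratified k-fold split can be sketched as follows (purely illustrative of the evaluation setup, not the authors' code):

```python
def stratified_folds(labels, k):
    """Deal each class's sample indices round-robin across k folds so
    every fold keeps roughly the overall class proportions."""
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idxs in by_class.values():
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds
```

Stratification matters here because with only 90 subjects across three severity classes, an unstratified split can easily leave a fold with almost no examples of the rarest class.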


Author(s):  
Han Zhao ◽  
Xu Yang ◽  
Zhenru Wang ◽  
Erkun Yang ◽  
Cheng Deng

By contrasting positive-negative counterparts, graph contrastive learning has become a prominent technique for unsupervised graph representation learning. However, existing methods fail to consider class information and introduce false-negative samples through random negative sampling, causing poor performance. To this end, we propose a graph debiased contrastive learning framework that jointly performs representation learning and clustering. Specifically, representations are optimized by aligning with clustered class information, and, simultaneously, the optimized representations promote clustering, leading to more powerful representations and better clustering results. More importantly, we randomly select negative samples from clusters different from the positive sample's cluster. In this way, the clustering results serve as supervisory signals that effectively reduce false-negative samples. Extensive experiments on five datasets demonstrate that our method achieves new state-of-the-art results on graph clustering and classification tasks.
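The debiased negative sampling step can be sketched directly: negatives are drawn only from nodes whose cluster differs from the anchor's, so same-cluster nodes can never become false negatives. Names and the seeded RNG are illustrative:

```python
import random

def debiased_negatives(anchor, cluster_of, num_neg, seed=0):
    """Sample negatives for `anchor` only from nodes assigned to a
    different cluster, using cluster labels as the supervisory signal."""
    candidates = [n for n in cluster_of if cluster_of[n] != cluster_of[anchor]]
    rng = random.Random(seed)
    return rng.sample(candidates, min(num_neg, len(candidates)))
```

In a full pipeline these samples would feed a contrastive loss (e.g. InfoNCE), and the cluster assignments would themselves be refreshed as the representations improve.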


2018 ◽  
Vol 2018 ◽  
pp. 1-16
Author(s):  
Yao Wang ◽  
Wandong Cai ◽  
Pin Lyu ◽  
Wei Shao

Ill-intentioned browser extensions pose an emergent security risk and have become one of the most common attack vectors on the Internet due to their wide popularity and high privilege. Once installed, malicious extensions execute and attempt to compromise the victim's browser. To detect malicious browser extensions, security researchers have put forward several techniques, which primarily concentrate on the API calls used by malicious extensions, restricted policies imposed on extensions, and the monitoring of extensions' activities. In this paper, we propose a machine-learning-based approach to detecting malicious extensions. We apply static and dynamic techniques to analyse an extension and extract features from its source code, including JavaScript, HTML pages, and CSS files, and from its execution activities. To guarantee the robustness of the features, a feature selection method is then applied to retain the most relevant features while discarding low-correlated ones. Detection models based on machine-learning techniques are subsequently constructed from these features. Evaluated on a dataset of over 4,600 labelled extension samples, our detection model detects malicious extensions with an accuracy of 96.52% on the validation set and 95.18% on the test set, with false positive rates of 2.38% and 3.66%, respectively.
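The abstract does not name the feature selection method; one simple instance of "retain relevant, discard low-correlated features" is a correlation filter, sketched below with illustrative names and an assumed threshold:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences
    (0.0 if either sequence is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def select_features(rows, labels, threshold=0.3):
    """Keep the column indices whose |correlation| with the label
    exceeds `threshold`; low-correlated features are discarded."""
    n_feat = len(rows[0])
    return [j for j in range(n_feat)
            if abs(pearson([r[j] for r in rows], labels)) >= threshold]
```

The surviving columns would then be fed to the downstream classifiers described above.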


2020 ◽  
Vol 34 (04) ◽  
pp. 6438-6445
Author(s):  
Yuan Wu ◽  
Yuhong Guo

With the advent of deep learning, the performance of text classification models has improved significantly. Nevertheless, training a good classification model requires a sufficient amount of labeled data, and annotating data is expensive and time-consuming. With the rapid growth of digital data, similar classification tasks typically occur in multiple domains, while the availability of labeled data varies widely across them: some domains have abundant labeled data, while others have only a limited amount, or none at all. Meanwhile, text classification tasks are highly domain-dependent: a text classifier trained in one domain may not perform well in another. To address these issues, in this paper we propose a novel dual adversarial co-learning approach for multi-domain text classification (MDTC). The approach learns shared-private networks for feature extraction and deploys dual adversarial regularizations to align features across different domains and between labeled and unlabeled data simultaneously under a discrepancy-based co-learning framework, aiming to improve the classifiers' generalization capacity with the learned features. We conduct experiments on multi-domain sentiment classification datasets. The results show the proposed approach achieves state-of-the-art MDTC performance.
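The "discrepancy" driving the co-learning framework can be illustrated in its simplest form as the disagreement rate between two classifiers on the same unlabeled batch; minimizing it on target-domain data is the co-learning signal. This scalar sketch omits the networks and adversarial terms entirely:

```python
def discrepancy(pred_a, pred_b):
    """Disagreement rate between two classifiers' hard predictions on
    the same unlabeled batch (a minimal discrepancy measure)."""
    return sum(a != b for a, b in zip(pred_a, pred_b)) / len(pred_a)
```

In the full approach, an analogous (differentiable) discrepancy over predicted distributions is minimized alongside the adversarial alignment losses.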

