Combination of Active Learning and Semi-Supervised Learning under a Self-Training Scheme

Entropy, 2019, Vol. 21 (10), pp. 988
Author(s): Fazakis, Kanas, Aridas, Karlos, Kotsiantis

One of the major factors affecting the performance of classification algorithms is the amount of labeled data available during the training phase. It is widely accepted that labeling vast amounts of data is both expensive and time-consuming, since it requires human expertise. In a wide variety of scientific fields, unlabeled examples are easy to collect but hard to exploit in a way that enriches the information contained in a given dataset. In this context, a variety of learning methods have been studied in the literature that aim to efficiently utilize large amounts of unlabeled data during the learning process. The most common approaches tackle problems of this kind by applying either active learning or semi-supervised learning in isolation. In this work, a combination of active learning and semi-supervised learning methods is proposed under a common self-training scheme, in order to efficiently utilize the available unlabeled data. Effective and robust metrics, namely the entropy and the probability distribution of the predictions on the unlabeled set, are used to select the most suitable unlabeled examples for augmenting the initial labeled set. The superiority of the proposed scheme is validated by comparing it against the baseline approaches of supervised, semi-supervised, and active learning on a wide range of fifty-five benchmark datasets.
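A minimal sketch of this kind of entropy-driven selection under a self-training scheme, assuming a generic scikit-learn base learner; the thresholds and the query_oracle helper are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.linear_model import LogisticRegression

def self_train_with_queries(X_lab, y_lab, X_unlab, query_oracle,
                            n_rounds=10, conf_thresh=0.1, query_thresh=0.6):
    """Pseudo-label low-entropy examples (semi-supervised) and send the
    highest-entropy example to a human oracle (active learning)."""
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_unlab)
        H = entropy(proba.T)                        # per-example prediction entropy
        take = list(np.where(H < conf_thresh)[0])   # confident -> pseudo-label
        new_y = list(proba[take].argmax(axis=1))
        if H.max() > query_thresh:                  # most uncertain -> ask the oracle
            i = int(np.argmax(H))
            take.append(i)
            new_y.append(query_oracle(X_unlab[i]))
        if not take:
            break
        X_lab = np.vstack([X_lab, X_unlab[take]])
        y_lab = np.concatenate([y_lab, new_y])
        X_unlab = np.delete(X_unlab, take, axis=0)
    return model.fit(X_lab, y_lab)
```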


2017, Vol. 26 (02), pp. 1750001
Author(s): Stamatis Karlos, Nikos Fazakis, Sotiris Kotsiantis, Kyriakos Sgarbas

The most important characteristic of semi-supervised learning methods is that they combine the available unlabeled data with a much smaller set of labeled examples, so as to increase learning accuracy compared with the default procedure of supervised methods, which use only the labeled data during the training phase. In this work, we have implemented a hybrid self-trained system that combines a Support Vector Machine, a Decision Tree, a Lazy Learner, and a Bayesian algorithm using a Stacking variant methodology. We performed an in-depth comparison with other well-known semi-supervised classification methods on standard benchmark datasets and found that the presented technique achieved better accuracy in most cases.
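A rough scikit-learn sketch of a stacked heterogeneous base learner wrapped in self-training; the meta-learner, hyperparameters, and the use of SelfTrainingClassifier are assumptions, not the paper's exact stacking variant:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Four heterogeneous base learners combined by a stacking meta-learner.
stack = StackingClassifier(
    estimators=[
        ("svm", SVC(probability=True)),    # Support Vector Machine
        ("tree", DecisionTreeClassifier()),
        ("knn", KNeighborsClassifier()),   # lazy learner
        ("nb", GaussianNB()),              # Bayesian algorithm
    ],
    final_estimator=LogisticRegression(),
)

# Self-training wrapper: unlabeled examples are marked with y = -1 and are
# gradually pseudo-labeled whenever the stack is sufficiently confident.
self_trained = SelfTrainingClassifier(stack, threshold=0.9)
# self_trained.fit(X, y)   # y contains -1 for the unlabeled examples
```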


Author(s): Mohamed Nadjib Boufenara, Mahmoud Boufaida, Mohamed Lamine Berkane

With the exponential growth of biological data, labeling this kind of data becomes difficult and costly. Although unlabeled data are comparatively more plentiful than labeled data, most supervised learning methods are not designed to use them. Semi-supervised learning methods are motivated by settings in which large unlabeled datasets are available but labeled examples are scarce. However, incorporating unlabeled data into learning does not guarantee an improvement in classification performance. This paper introduces an approach based on a semi-supervised learning model, namely self-training with a deep learning algorithm, to predict missing classes from labeled and unlabeled data. To assess the performance of the proposed approach, two datasets are used with four performance measures: precision, recall, F-measure, and area under the ROC curve (AUC).
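The four reported measures can be computed as below; this is a self-contained toy with made-up binary predictions, not the paper's model or data:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def report(y_true, y_pred, y_score):
    """Precision, recall, F-measure, and AUC for a binary task."""
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "F-measure": f1_score(y_true, y_pred),
        "AUC": roc_auc_score(y_true, y_score),   # y_score: positive-class probability
    }

print(report(np.array([0, 1, 1, 0, 1]),             # ground truth
             np.array([0, 1, 0, 0, 1]),             # hard predictions
             np.array([0.2, 0.9, 0.4, 0.1, 0.8])))  # predicted probabilities
```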


2014, Vol. 556-562, pp. 4765-4769
Author(s): Han Yi Li, Ming Yang, Nan Nan Kang, Lu Lu Yue

In this paper, a novel image classification method incorporating active learning and semi-supervised learning (SSL) is proposed. The method uses two classifiers: one trained on labeled data together with some unlabeled data, and one trained only on labeled data. Specifically, in each round the two classifiers iteratively select contentious examples for user query. We then compute the label changing rate of every unlabeled example under each classifier. Examples whose label changing rate is zero and whose labels agree across the two classifiers are added to the training data of the first classifier. Our experimental results show that the method significantly reduces the need for labeled examples while also reducing classification error compared with widely used image classification algorithms.
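One plausible reading of the zero-label-changing-rate criterion, sketched as a selection function over predictions recorded across rounds (the bookkeeping is an assumption, not the paper's code):

```python
import numpy as np

def stable_agreeing_examples(preds_a, preds_b):
    """preds_a, preds_b: (n_rounds, n_unlabeled) label predictions from the
    two classifiers over successive rounds. Returns indices of examples
    whose label never changed in either classifier (changing rate zero)
    and on which the two classifiers agree."""
    stable_a = (preds_a == preds_a[0]).all(axis=0)
    stable_b = (preds_b == preds_b[0]).all(axis=0)
    agree = preds_a[-1] == preds_b[-1]
    return np.where(stable_a & stable_b & agree)[0]
```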


Author(s): Yu Tian, Xi Peng, Long Zhao, Shaoting Zhang, Dimitris N. Metaxas

Generating multi-view images from a single-view input is an important yet challenging problem with broad applications in vision, graphics, and robotics. Our study indicates that the widely used generative adversarial network (GAN) may learn "incomplete" representations due to the single-pathway framework: an encoder-decoder network followed by a discriminator network. We propose CR-GAN to address this problem. In addition to the single reconstruction path, we introduce a generation sideway to maintain the completeness of the learned embedding space. The two learning paths collaborate and compete in a parameter-sharing manner, yielding largely improved generalization to "unseen" datasets. More importantly, the two-pathway framework makes it possible to combine both labeled and unlabeled data for self-supervised learning, which further enriches the embedding space for realistic generations. We evaluate our approach on a wide range of datasets. The results show that CR-GAN significantly outperforms state-of-the-art methods, especially when generating from "unseen" inputs in wild conditions.
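A toy PyTorch sketch of the two-pathway idea, with one shared generator serving both a reconstruction path and a generation sideway; the layer sizes and losses are placeholders, not the CR-GAN architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, img_dim = 64, 784   # illustrative sizes (e.g., a flattened 28x28 image)

G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
E = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))
D = nn.Sequential(nn.Linear(img_dim, 256), nn.ReLU(), nn.Linear(256, 1))

x = torch.randn(8, img_dim)      # a batch of real (possibly unlabeled) images
z = torch.randn(8, latent_dim)   # random codes for the generation sideway

x_rec = G(E(x))                  # reconstruction path: anchors real images in the embedding
x_gen = G(z)                     # generation sideway: keeps the whole latent space covered
rec_loss = F.l1_loss(x_rec, x)
adv_loss = F.binary_cross_entropy_with_logits(D(x_gen), torch.ones(8, 1))
loss = rec_loss + adv_loss       # both paths update the same shared generator G
```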


2020, Vol. 34 (04), pp. 3537-3544
Author(s): Xu Chen, Brett Wujek

Automated machine learning (AutoML) strives to establish an appropriate machine learning model for any dataset automatically with minimal human intervention. Although extensive research has been conducted on AutoML, most of it has focused on supervised learning; research on automated semi-supervised learning and active learning algorithms is still limited. Implementation becomes more challenging when the algorithm is designed for a distributed computing environment. With this as motivation, we propose a novel automated learning system for distributed active learning (AutoDAL) to address these challenges. First, automated graph-based semi-supervised learning is conducted by aggregating the proposed cost functions from different compute nodes in a distributed manner. Subsequently, automated active learning is addressed by jointly optimizing hyperparameters in both the classification and query selection stages, leveraging graph loss minimization and entropy regularization. Moreover, we propose an efficient distributed active learning algorithm that is scalable to big data: in the classification stage the unlabeled data are partitioned and the labeled data replicated across worker nodes, and in the query selection stage the results are aggregated in the controller. The proposed AutoDAL algorithm is applied to multiple benchmark datasets and a real-world electrocardiogram (ECG) dataset for classification. We demonstrate that AutoDAL achieves significantly better performance than several state-of-the-art AutoML approaches and active learning algorithms.
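A controller-side sketch of distributed, entropy-based query selection in the spirit described here; the partitioning and aggregation details are assumptions, not the AutoDAL implementation:

```python
import numpy as np

def select_queries(proba_per_worker, k):
    """Each worker returns predicted class probabilities for its own
    unlabeled partition; the controller aggregates the scores and picks
    the k globally most uncertain examples to query."""
    scores, owners, local_idx = [], [], []
    for w, proba in enumerate(proba_per_worker):
        H = -(proba * np.log(proba + 1e-12)).sum(axis=1)   # per-example entropy
        scores.append(H)
        owners.append(np.full(len(H), w))
        local_idx.append(np.arange(len(H)))
    scores = np.concatenate(scores)
    owners = np.concatenate(owners)
    local_idx = np.concatenate(local_idx)
    top = np.argsort(scores)[::-1][:k]
    return list(zip(owners[top], local_idx[top]))   # (worker, local index) pairs
```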


2021, Vol. 13 (5), pp. 909
Author(s): Bangyu Wu, Delin Meng, Haixia Zhao

Seismic impedance inversion is essential for characterizing hydrocarbon reservoirs and detecting fluids in geophysics. However, it is nonlinear and ill-posed due to the unknown seismic wavelet, the band limitation of the observed data, and noise, and it requires a forward operator that characterizes the physical relation between measured data and model parameters. Deep learning methods have recently been applied successfully to geophysical inversion problems. They can obtain results with higher resolution than traditional inversion methods, but their performance is often limited by the lack of adequate labeled data (i.e., well logs) in the training process. To alleviate this problem, we propose a semi-supervised learning workflow based on a generative adversarial network (GAN) for acoustic impedance inversion. The workflow contains three networks: a generator, a discriminator, and a forward model. The training of the generator and discriminator is guided by well logs and constrained by unlabeled data via the forward model. The benchmark models Marmousi2 and SEAM and a field dataset are used to demonstrate the performance of our method. Results show that the impedance predicted by the presented method, owing to its use of both labeled and unlabeled data, is more consistent with the ground truth than that of conventional deep learning methods.
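A compact sketch of how a physics-based forward operator can constrain unlabeled traces in such a workflow; the convolutional operator (reflectivity convolved with a wavelet) is a standard choice assumed here, not the paper's exact networks:

```python
import torch
import torch.nn.functional as F

def forward_model(impedance, wavelet):
    """Impedance -> reflectivity -> synthetic seismic via convolution."""
    r = (impedance[:, 1:] - impedance[:, :-1]) / (impedance[:, 1:] + impedance[:, :-1])
    return F.conv1d(r.unsqueeze(1), wavelet.view(1, 1, -1), padding="same").squeeze(1)

def semi_supervised_loss(pred_imp, seismic, well_imp, well_mask, wavelet, alpha=1.0):
    # Supervised term: only at the few traces that have well-log labels.
    sup = F.mse_loss(pred_imp[well_mask], well_imp[well_mask])
    # Unsupervised term: every predicted trace must reproduce the observed
    # seismic (trimmed by one sample to match the reflectivity length).
    unsup = F.mse_loss(forward_model(pred_imp, wavelet), seismic[:, 1:])
    return sup + alpha * unsup
```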


Author(s): Cheong Hee Park

In semi-supervised learning, when only a small number of data samples carry class label information, information from unlabeled data is utilized in the learning process. Many semi-supervised learning methods have been presented and have exhibited competitive performance. Active learning also aims to overcome the shortage of labeled data by obtaining class labels for selected unlabeled data from experts. However, selecting the most informative unlabeled data samples can be demanding when the search is performed over a large set of unlabeled data. In this paper, we propose a method for batch-mode active learning in graph-based semi-supervised learning. Instead of acquiring the class label for one unlabeled data sample at a time, we obtain labels for several data samples at once, reducing time complexity while preserving the beneficial effects of active learning. Experimental results demonstrate the improved performance of the proposed method.
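A minimal sketch of batch-mode uncertainty sampling on top of graph-based SSL, assuming scikit-learn's LabelSpreading as the graph model (the margin criterion and batch size are illustrative):

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def batch_active_queries(X, y, k=10):
    """y uses -1 for unlabeled points. Fit the graph-based model once,
    then return a batch of the k least-confident unlabeled points to be
    sent to the expert together, instead of one query per refit."""
    model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
    proba = np.sort(model.label_distributions_, axis=1)
    margin = proba[:, -1] - proba[:, -2]            # small margin = uncertain
    unlabeled = np.where(y == -1)[0]
    return unlabeled[np.argsort(margin[unlabeled])[:k]]
```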


Author(s): Tobias Scheffer

For many classification problems, unlabeled training data are inexpensive and readily available, whereas labeling training data imposes costs. Semi-supervised classification algorithms aim at utilizing information contained in unlabeled data in addition to the (few) labeled data. Semi-supervised learning (see, for example, Seeger, 2001) has a long tradition in statistics (Cooper & Freeman, 1970); much early work focused on Bayesian discrimination of Gaussians. The Expectation Maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is the most popular method for learning generative models from labeled and unlabeled data. Model-based, generative learning algorithms find model parameters (e.g., the parameters of a Gaussian mixture model) that best explain the available labeled and unlabeled data, and they derive the discriminating classification hypothesis from this model. In discriminative learning, unlabeled data are typically incorporated via the integration of some model assumption into the discriminative framework (Miller & Uyar, 1997; Titterington, Smith, & Makov, 1985).

The Transductive Support Vector Machine (Vapnik, 1998; Joachims, 1999) uses unlabeled data to identify a hyperplane that has a large distance not only from the labeled data but also from all unlabeled data. This results in a bias toward placing the hyperplane in regions of low density p(x). More recently, studies have covered graph-based approaches that rely on the assumption that neighboring instances are more likely to belong to the same class than remote instances (Blum & Chawla, 2001).

A distinct approach to utilizing unlabeled data has been proposed by de Sa (1994), Yarowsky (1995), and Blum and Mitchell (1998). When the available attributes can be split into independent and compatible subsets, multi-view learning algorithms can be employed. Multi-view algorithms, such as co-training (Blum & Mitchell, 1998) and co-EM (Nigam & Ghani, 2000), learn two independent hypotheses which bootstrap by providing each other with labels for the unlabeled data. An analysis of why training two independent hypotheses that provide each other with conjectured class labels for unlabeled data can be better than EM-like self-training has been provided by Dasgupta, Littman, and McAllester (2001) and simplified by Abney (2002): the disagreement rate of two independent hypotheses is an upper bound on the error rate of either hypothesis. Multi-view algorithms minimize the disagreement rate between the peer hypotheses (a situation that is most apparent in the algorithm of Collins & Singer, 1999) and thereby the error rate.

Semi-supervised learning is related to active learning. Active learning algorithms are able to actively query the class labels of unlabeled data; by contrast, semi-supervised algorithms are bound to learn from the given data.
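A toy version of the co-training loop sketched above, with two attribute views and naive Bayes hypotheses teaching each other; the view split, learner, and per-round budget are illustrative choices:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, rounds=10, per_round=5):
    """X1, X2: the two attribute views; y = -1 marks unlabeled examples.
    Each hypothesis labels the examples it is most confident about and
    hands them to its peer."""
    y = y.copy()
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        lab = y != -1
        h1.fit(X1[lab], y[lab])
        h2.fit(X2[lab], y[lab])
        for h, X in ((h1, X1), (h2, X2)):
            unlab = np.where(y == -1)[0]
            if len(unlab) == 0:
                return h1, h2, y
            conf = h.predict_proba(X[unlab]).max(axis=1)
            pick = unlab[np.argsort(conf)[::-1][:per_round]]
            y[pick] = h.predict(X[pick])   # conjectured labels for the peer
    return h1, h2, y
```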


2020, Vol. 34 (03), pp. 2509-2517
Author(s): Caleb Robinson, Anthony Ortiz, Kolya Malkin, Blake Elias, Andi Peng, ...

We propose incorporating human labelers in a model fine-tuning system that provides immediate user feedback. In our framework, human labelers can interactively query model predictions on unlabeled data, choose which data to label, and see the resulting effect on the model's predictions. This bi-directional feedback loop allows humans to learn how the model responds to new data. We implement this framework for fine-tuning high-resolution land cover segmentation models and compare human-selected points to points selected using standard active learning methods. Specifically, we fine-tune a deep neural network – trained to segment high-resolution aerial imagery into different land cover classes in Maryland, USA – to a new spatial area in New York, USA using both our human-in-the-loop method and traditional active learning methods. The tight loop in our proposed system turns the algorithm and the human operator into a hybrid system that can produce land cover maps of large areas more efficiently than the traditional workflows. Our framework has applications in machine learning settings where there is a practically limitless supply of unlabeled data, of which only a small fraction can feasibly be labeled through human efforts, such as geospatial and medical image-based applications.
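A self-contained toy of the tight label-and-refit loop described here, with a scripted stand-in for the human and an incremental scikit-learn model; the authors' system fine-tunes a segmentation network, and none of these names are their API:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(200, 5))        # stand-in for the unlabeled imagery
oracle = lambda x: int(x[0] + x[1] > 0)   # scripted stand-in for the human labeler

model = SGDClassifier()
model.partial_fit(X_pool[:10], [oracle(x) for x in X_pool[:10]], classes=[0, 1])

for _ in range(20):                        # labeling budget
    preds = model.predict(X_pool)          # immediate feedback shown to the "user"
    wrong = [i for i in range(len(X_pool)) if preds[i] != oracle(X_pool[i])]
    if not wrong:
        break
    i = wrong[0]                           # the user corrects one prediction...
    model.partial_fit(X_pool[i:i+1], [oracle(X_pool[i])])   # ...and the model updates
```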

