LoRAS: an oversampling approach for imbalanced datasets

2020 ◽  
Author(s):  
Saptarshi Bej ◽  
Narek Davtyan ◽  
Markus Wolfien ◽  
Mariam Nassar ◽  
Olaf Wolkenhauer

The Synthetic Minority Oversampling TEchnique (SMOTE) is widely used for the analysis of imbalanced datasets. It is known that SMOTE frequently over-generalizes the minority class, leading to misclassifications for the majority class and affecting the overall balance of the model. In this article, we present an approach that overcomes this limitation of SMOTE, employing Localized Random Affine Shadowsampling (LoRAS) to oversample from an approximated data manifold of the minority class. We benchmarked our algorithm on 14 publicly available imbalanced datasets using three different Machine Learning (ML) algorithms and compared the performance of LoRAS, SMOTE, and several SMOTE extensions that share with LoRAS the concept of using convex combinations of minority class data points for oversampling. We observed that LoRAS, on average, generates better ML models in terms of F1-Score and Balanced accuracy. Another key observation is that while most of the SMOTE extensions we tested improve the F1-Score relative to SMOTE on average, they compromise the Balanced accuracy of a classification model. LoRAS, by contrast, improves both F1-Score and Balanced accuracy and thus produces better classification models. Moreover, to explain the success of the algorithm, we constructed a mathematical framework proving that the LoRAS oversampling technique provides a better estimate of the mean of the underlying local data distribution of the minority class data space.
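The convex-combination idea that SMOTE and the tested extensions share can be sketched in a few lines. This is an illustrative toy (plain-Python brute-force k-NN, hypothetical function name), not the LoRAS algorithm itself, which instead draws shadow points around minority samples and samples from an approximated local manifold:

```python
import math
import random

def convex_combination_oversample(minority, n_samples, k=3, seed=None):
    """Generate synthetic minority points as convex combinations of a point
    and one of its k nearest neighbours -- the common idea behind SMOTE and
    its extensions. A sketch only; LoRAS samples differently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_samples):
        i = rng.randrange(len(minority))
        p = minority[i]
        # brute-force k nearest neighbours of p (excluding p itself)
        others = sorted((q for j, q in enumerate(minority) if j != i),
                        key=lambda q: math.dist(p, q))
        q = rng.choice(others[:k])
        lam = rng.random()  # convex weight in [0, 1)
        synthetic.append(tuple(pi + lam * (qi - pi) for pi, qi in zip(p, q)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority points, it always lies on the line segment between them, which is precisely the over-generalization surface that LoRAS's manifold approximation is designed to improve on.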

2015 ◽  
Vol 7 (2) ◽  
pp. 29-57 ◽  
Author(s):  
Nabil M. Hewahi ◽  
Ibrahim M. Elbouhissi

In data mining, the phenomenon of change in data distribution over time is known as concept drift. In this research, the authors introduce a new approach called the Concepts Seeds Gathering and Dataset Updating algorithm (CSG-DU), which gives traditional classification models the ability to adapt to and cope with concept drift as time passes. CSG-DU is concerned with discovering new concepts in a data stream and aims to increase classification accuracy with any classification model when changes occur in the underlying concepts. The proposed approach has been tested using synthetic and real datasets. The experiments conducted show that after applying the authors' approach, classification accuracy increased from low values to high and acceptable ones. Finally, a comparison study has been conducted between CSG-DU and the Set Formation for Delayed Labeling algorithm (SFDL), an approach that handles sudden and gradual concept drift; CSG-DU outperforms SFDL in terms of classification accuracy.
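The core symptom that drift-handling approaches like CSG-DU respond to is a drop in classification accuracy over time. A minimal, generic drift signal (not the CSG-DU algorithm itself, whose seed-gathering and dataset-updating steps the abstract does not detail) compares mean accuracy over two consecutive windows of the stream:

```python
def accuracy_drop_alarm(acc_history, window=5, threshold=0.1):
    """Flag possible concept drift when mean accuracy over the latest
    `window` evaluations falls more than `threshold` below the mean of
    the preceding window. Window and threshold are illustrative values."""
    if len(acc_history) < 2 * window:
        return False  # not enough history to compare two windows
    prev = sum(acc_history[-2 * window:-window]) / window
    curr = sum(acc_history[-window:]) / window
    return prev - curr > threshold
```

Once such an alarm fires, an approach in the spirit of CSG-DU would update the training dataset with examples of the newly observed concepts and retrain the classifier.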


2020 ◽  
Vol 2020 ◽  
pp. 1-6
Author(s):  
Jian-ye Yuan ◽  
Xin-yuan Nan ◽  
Cheng-rong Li ◽  
Le-le Sun

Considering that garbage classification is urgent, a 23-layer convolutional neural network (CNN) model is designed in this paper, with an emphasis on real-time garbage classification, to address the low accuracy of garbage classification and recycling and the difficulty of manual recycling. Firstly, depthwise separable convolution was used to reduce the Params of the model. Then, an attention mechanism was used to improve the accuracy of the garbage classification model. Finally, model fine-tuning was used to further improve its performance. We compared the model with classic image classification models, including AlexNet, VGG16, and ResNet18, and lightweight classification models, including MobileNetV2 and ShuffleNetV2, and found that our model, GAF_dense, has a higher accuracy rate and fewer Params and FLOPs. To further check the performance of the model, we tested it on the CIFAR-10 data set and found that the accuracy rates of GAF_dense are 0.018 and 0.03 higher than those of ResNet18 and ShuffleNetV2, respectively. On the ImageNet data set, the accuracy rates of GAF_dense are 0.225 and 0.146 higher than those of ResNet18 and ShuffleNetV2, respectively. Therefore, the garbage classification model proposed in this paper is suitable for garbage classification and other classification tasks that help protect the ecological environment, and can be applied in areas such as environmental science, children's education, and environmental protection.
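The Params savings from depthwise separable convolution can be seen with simple weight counting (the channel sizes below are illustrative, not taken from the paper's architecture):

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise separable variant: one k x k filter per input channel
    (depthwise) followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

standard = conv_params(3, 128, 128)                   # 147,456 weights
separable = depthwise_separable_params(3, 128, 128)   # 17,536 weights
print(separable / standard)  # ≈ 0.119, close to 1/k^2 + 1/c_out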


2021 ◽  
Vol 87 (4) ◽  
pp. 283-293
Author(s):  
Wei Wang ◽  
Yuan Xu ◽  
Yingchao Ren ◽  
Gang Wang

Recently, performance improvements in facade parsing from 3D point clouds have been achieved by designing more complex network structures, which cost huge computing resources and do not take full advantage of prior knowledge of facade structure. Instead, from the perspective of data distribution, we construct a new hierarchical mesh multi-view data domain based on the characteristics of facade objects to fuse deep-learning models with prior knowledge, thereby significantly improving segmentation accuracy. We comprehensively evaluate current mainstream methods on the RueMonge 2014 data set and demonstrate the superiority of our method. The mean intersection-over-union index on the facade-parsing task reached 76.41%, which is 2.75% higher than the previous best result. In addition, through comparative experiments, the reasons for the performance improvement of the proposed method are further analyzed.
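The mean intersection-over-union index reported above is computed per class and then averaged. A minimal sketch over flat label sequences (toy labels, not the RueMonge 2014 facade classes):

```python
def intersection_over_union(pred, truth, label):
    """IoU for one class: |pred ∩ truth| / |pred ∪ truth| over element labels."""
    inter = sum(1 for p, t in zip(pred, truth) if p == label and t == label)
    union = sum(1 for p, t in zip(pred, truth) if p == label or t == label)
    return inter / union if union else 0.0

def mean_iou(pred, truth, labels):
    """Average per-class IoU -- the mIoU index used to score facade parsing."""
    return sum(intersection_over_union(pred, truth, l) for l in labels) / len(labels)
```

Because mIoU averages over classes rather than elements, it penalizes a model that ignores rare facade classes (e.g. small architectural details) even if its overall per-point accuracy is high.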


2012 ◽  
Vol 2012 ◽  
pp. 1-10 ◽  
Author(s):  
Lev V. Utkin

A fuzzy classification model is studied in this paper. It is based on the contaminated (robust) model, which produces fuzzy expected risk measures characterizing classification errors. Optimal classification parameters of the model are derived by minimizing the fuzzy expected risk. It is shown that the algorithm for computing the classification parameters reduces to a set of standard support vector machine tasks with weighted data points. Experimental results with synthetic data illustrate the proposed fuzzy model.
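The building block that the fuzzy risk minimization reduces to, an SVM with per-point weights, can be sketched with weighted hinge loss and subgradient descent. This is a generic linear-SVM sketch under assumed hyperparameters, not the paper's algorithm for deriving the weights:

```python
def weighted_svm_sgd(X, y, weights, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM trained on a weighted hinge loss:
        lam/2 * ||w||^2 + sum_i c_i * max(0, 1 - y_i * (w.x_i + b)).
    X: rows of features, y: labels in {-1, +1}, weights: per-point c_i."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, yi, ci in zip(X, y, weights):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:  # margin violated: weighted hinge subgradient step
                w = [wj - lr * (lam * wj - ci * yi * xj)
                     for wj, xj in zip(w, x)]
                b += lr * ci * yi
            else:  # only the regularizer shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b
```

In the paper's setting, each standard SVM task of this form would be solved with a different set of data-point weights produced by the contaminated model.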


1995 ◽  
Vol 19 (5) ◽  
pp. 276-280 ◽  
Author(s):  
Bernard Audini ◽  
Michael Crowe ◽  
Joan Feldman ◽  
Anna Higgitt ◽  
...  

Our objective was to establish a mechanism for monitoring indicators of the state of health of inner London's mental illness services. Data were collected for a census week around 15 June 1994. Local data collection was coordinated by consultant psychiatrists working in inner London services. Twelve services participated, with a combined catchment population of 2.6 million. They included ten London services serving areas that were among the 17 most socially deprived in England. Main indicators were admission bed occupancy levels (including an estimate of the total requirement), proportion of patients detained under the Mental Health Act, number of assaults committed by inpatients, number of emergency assessments and CPN caseloads. The mean true bed occupancy (which reflects the number of patients who were receiving, or required, in-patient care on census day) was 130%. To meet all need for acute psychiatric care, including for patients who should have been admitted and those discharged prematurely because beds were full, a further 426 beds would have been required. Fifty per cent of patients were legally detained. Physical assaults were virtually a daily occurrence on the admission units. Average community psychiatric nurse caseloads were 37, suggesting that the majority were not working intensively with limited caseloads of patients with severe mental illness. These indicators, although imperfect, will allow for some measurement of the impact of local and central initiatives on the poor state of London's mental illness services.


Author(s):  
Tianhang Zheng ◽  
Changyou Chen ◽  
Kui Ren

Recent work on adversarial attacks has shown that the Projected Gradient Descent (PGD) adversary is a universal first-order adversary, and that a classifier adversarially trained with PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution p(x), typically in the form of risk maximization/minimization, e.g., max/min E_{p(x)}[L(x)], with p(x) some unknown data distribution and L(·) a loss function. However, since PGD generates attack samples independently for each data sample based on L(·), the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we address this by proposing the distributionally adversarial attack (DAA), a framework that solves for an optimal adversarial data distribution: a perturbed distribution that satisfies the L∞ constraint but deviates from the original data distribution so as to maximally increase the generalization risk. Algorithmically, DAA performs optimization over the space of potential data distributions, which introduces direct dependencies between all data points when generating adversarial samples. DAA is evaluated by attacking state-of-the-art defense models, including the adversarially trained models provided by MIT MadryLab. Notably, DAA ranks first on MadryLab's white-box leaderboards, reducing the accuracy of their secret MNIST model to 88.56% (with l∞ perturbations of ε = 0.3) and the accuracy of their secret CIFAR model to 44.71% (with l∞ perturbations of ε = 8.0). Code for the experiments is released at https://github.com/tianzheng4/Distributionally-Adversarial-Attack.
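The per-sample PGD baseline that DAA generalizes can be sketched as signed gradient ascent with an l∞ projection. This is only the PGD step on a caller-supplied loss gradient (the pixel-range clipping used for images is omitted for brevity); DAA replaces the per-sample loop with optimization over a perturbed data distribution:

```python
def pgd_attack(x, grad_loss, epsilon, alpha, steps):
    """l_inf-bounded PGD: repeatedly ascend the loss by a signed-gradient
    step of size alpha, then project the iterate back into the l_inf ball
    of radius epsilon around the clean input x."""
    x_adv = list(x)
    for _ in range(steps):
        g = grad_loss(x_adv)
        # ascend along the sign of the gradient, coordinate by coordinate
        x_adv = [xi + alpha * (1 if gi > 0 else -1 if gi < 0 else 0)
                 for xi, gi in zip(x_adv, g)]
        # projection: clip each coordinate to [x_i - epsilon, x_i + epsilon]
        x_adv = [min(max(xa, xo - epsilon), xo + epsilon)
                 for xa, xo in zip(x_adv, x)]
    return x_adv
```

Because each sample is attacked independently of the others, the resulting empirical distribution of adversarial samples need not be the risk-maximizing perturbed distribution, which is the gap DAA targets.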


2014 ◽  
Vol 54 (2) ◽  
pp. 207 ◽  
Author(s):  
D. J. Brown ◽  
D. B. Savage ◽  
G. N. Hinch

Sheep liveweight is an indicator of nutritional status, and its measure may be used as an aid to nutritional management. When walk-over weighing (WOW), a remote weighing concept for grazing sheep, is combined with radio frequency identification (RFID), the resulting 'RFID-linked WOW' data may enable the liveweight of individual sheep to be tracked over time. We investigated whether RFID-linked WOW data is sufficiently repeatable and frequent to generate individual liveweight estimates with 95% confidence intervals (95% CI) of <2 kg (a level of error sufficient to account for fluctuating gut fill) for a flock within timeframes suitable for management (1-day and 5-day timeframes). Four flocks of sheep were used to generate RFID-linked WOW datasets. RFID-linked WOW data were organised into three groups: raw (unfiltered), coarse filtered (remove all sheep-weights outside the flock's liveweight range), and fine filtered (remove all sheep-weights outside a 25% range of a recent flock average reference liveweight). The repeatability of raw (unfiltered) RFID-linked WOW data was low (0.20), while the coarse (0.46) and fine (0.76) data filters improved repeatability. The 95% CI of raw RFID-linked WOW data was 27 kg, and was decreased by the coarse (11 kg) and fine (6 kg) data filters. Increasing the number of raw, coarse and fine-filtered data points to 190, 30 and 12 sheep-weights, respectively, decreased the 95% CI to <2 kg. The mean cumulative percentage of sheep achieving >11 fine-filtered RFID-linked WOW sheep-weights within a 1-day and 5-day timeframe was 0 and 10%, respectively. The null hypothesis was accepted: RFID-linked WOW data had low repeatability and was unable to generate liveweight estimates with a 95% CI of less than 2 kg within a suitable timeframe. Therefore, at this stage, RFID-linked WOW is not recommended for on-farm decision making of individual sheep.
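The fine filter described above, and the trade-off between data-point count and CI width that drives the paper's conclusion, can both be sketched directly. The CI-sizing formula below is the standard normal-approximation one under an assumed weight standard deviation, reading the "<2 kg CI" as a ±1 kg half-width; the paper's exact CI calculation is not given in the abstract:

```python
import math

def fine_filter(weights, reference, tolerance=0.25):
    """Keep sheep-weights within +/- tolerance (25% in the paper) of a
    recent flock-average reference liveweight."""
    lo, hi = reference * (1 - tolerance), reference * (1 + tolerance)
    return [w for w in weights if lo <= w <= hi]

def points_needed(sd, half_width=1.0, z=1.96):
    """Smallest n with z * sd / sqrt(n) <= half_width, i.e. the number of
    sheep-weights needed before the 95% CI shrinks to the target width."""
    return math.ceil((z * sd / half_width) ** 2)
```

The intuition matches the reported results: filtering reduces the effective spread of sheep-weights, so fewer data points (12 fine-filtered vs 190 raw) are needed to reach the same CI width, but too few sheep cross the weigher often enough to supply even 12 usable weights in a 5-day window.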


1971 ◽  
Vol 40 ◽  
pp. 116-127
Author(s):  
Carl Sagan

Venus is the closest planet. Its surface has never been seen at optical frequencies; nevertheless we now know with at least fair reliability, and in some cases with remarkable accuracy, its surface temperature and pressure, its atmospheric structure, its period of rotation, the obliquity of its rotation axis, the mean surface dielectric constant, its ionospheric structure, and even a little about its surface topography. And yet the clouds of Venus, visible to the naked eye and known to be clouds since the time of Lomonosov, continue to elude our efforts to understand them comprehensively. Not only do we disagree on the chemical composition of the clouds, but it is not even settled whether they are condensation clouds or non-condensable aerosols. And yet there is a very wide variety of relevant data on the clouds. Indeed, the ratio of potentially diagnostic data points to mutually exclusive hypotheses is of order unity.


2020 ◽  
pp. 019459982094064
Author(s):  
Matthew Shew ◽  
Helena Wichova ◽  
Andres Bur ◽  
Devin C. Koestler ◽  
Madeleine St Peter ◽  
...  

Objective Diagnosis and treatment of Ménière’s disease remains a significant challenge because of our inability to understand what is occurring on a molecular level. MicroRNA (miRNA) perilymph profiling is a safe methodology and may serve as a “liquid biopsy” equivalent. We used machine learning (ML) to evaluate miRNA expression profiles of various inner ear pathologies to predict diagnosis of Ménière’s disease. Study Design Prospective cohort study. Setting Tertiary academic hospital. Subjects and Methods Perilymph was collected during labyrinthectomy (Ménière’s disease, n = 5), stapedotomy (otosclerosis, n = 5), and cochlear implantation (sensorineural hearing loss [SNHL], n = 9). miRNA was isolated and analyzed with the Affymetrix miRNA 4.0 array. Various ML classification models were evaluated with an 80/20 train/test split and cross-validation. Permutation feature importance was performed to identify the miRNAs that were critical to the classification models. Results For miRNA profiles of conductive hearing loss versus Ménière’s, 4 models were able to differentiate and identify the 2 disease classes with 100% accuracy. The top-performing models used the same miRNAs in their decision classification model but with different weighted values. All candidate models for SNHL versus Ménière’s performed significantly worse, with the best models achieving 66% accuracy. Ménière’s models showed unique features distinct from those of SNHL. Conclusions We can use ML to build Ménière’s-specific prediction models using miRNA profiles alone. However, ML models were less accurate in predicting SNHL versus Ménière’s, likely owing to overlapping miRNA biomarkers. The power of this technique is that it identifies biomarkers without knowledge of the pathophysiology, potentially leading to identification of novel biomarkers and diagnostic tests.
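Permutation feature importance, the technique used above to surface the critical miRNAs, scores a feature by how much accuracy drops when that feature's column is shuffled. A minimal sketch with a hypothetical scoring callable (`model_accuracy` is an assumed interface, not the authors' code or the Affymetrix pipeline):

```python
import random

def permutation_importance(model_accuracy, X, y, feature, n_repeats=10, seed=0):
    """Mean drop in accuracy when one feature column of X is shuffled.
    model_accuracy(X, y) -> float is any fitted-model scoring callable;
    X is a list of feature tuples, y the matching labels."""
    rng = random.Random(seed)
    baseline = model_accuracy(X, y)
    drops = []
    for _ in range(n_repeats):
        column = [row[feature] for row in X]
        rng.shuffle(column)  # break the feature-label association
        X_perm = [row[:feature] + (v,) + row[feature + 1:]
                  for row, v in zip(X, column)]
        drops.append(baseline - model_accuracy(X_perm, y))
    return sum(drops) / n_repeats
```

A feature the model never uses scores zero (shuffling it changes nothing), while a feature the model relies on, here a diagnostic miRNA, produces a large accuracy drop.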


2012 ◽  
Vol 11 (3) ◽  
pp. 237-251 ◽  
Author(s):  
Malgorzata Migut ◽  
Marcel Worring

In risk assessment applications, well-informed decisions need to be made based on large amounts of multi-dimensional data. In many domains, not only the risk of a wrong decision but also the trade-off between the costs of possible decisions is of utmost importance. In this paper we describe a framework to support the decision-making process that tightly integrates interactive visual exploration with machine learning. The proposed approach uses a series of interactive 2D visualizations of numerical and ordinal data combined with visualization of classification models. These visual elements are linked to the classifier’s performance, which is visualized using an interactive performance curve. This interaction allows the decision-maker to steer the classification model and instantly identify the critical, cost-changing data elements in the various linked visualizations. The critical data elements are represented as images in order to trigger associations related to the expert’s knowledge. In this way the data visualization and classification results are not only linked together but are also linked back to the classification model. Such a visual analytics framework allows users to interactively explore the costs of their decisions for different settings of the model and, accordingly, use the most suitable classification model, resulting in more informed and reliable decisions. A case study in the forensic psychiatry domain demonstrates the usefulness of the suggested approach.

