FDR2-BD: A Fast Data Reduction Recommendation Tool for Tabular Big Data Classification Problems

María José Basgall; Marcelo Naiouf; Alberto Fernández

doi:10.3390/electronics10151757

Electronics ◽

10.3390/electronics10151757 ◽

2021 ◽

Vol 10 (15) ◽

pp. 1757

Author(s):

María José Basgall ◽

Marcelo Naiouf ◽

Alberto Fernández

Keyword(s):

State Of The Art ◽

Original Data ◽

Problem Area ◽

Classification Problems ◽

Uniform Sampling ◽

Predictive Quality ◽

Big Data Classification ◽

Different Characteristics ◽

Representative Samples ◽

Analyze Data

In this paper, a methodological data condensation approach for reducing tabular big datasets in classification problems is presented, named FDR2-BD. The key to our proposal is to analyze data in a dual way (vertical and horizontal), combining feature selection, which generates dense clusters of data, with uniform sampling reduction, which keeps only a few representative samples from each problem area. Its main advantage is that the model’s predictive quality is kept within a range determined by a user-defined threshold. Its robustness is built on a hyper-parametrization process in which all data are taken into consideration by following a k-fold procedure. Another significant capability is that it is fast and scalable, as it uses fully optimized parallel operations provided by Apache Spark. An extensive experimental study is performed over 25 big datasets with different characteristics. In most cases, the obtained reduction percentages are above 95%, thus outperforming state-of-the-art solutions such as FCNN_MR, which barely reach 70%. The most promising outcome is that the representativeness of the original data is maintained, with predictive quality values within around 1% of the baseline.
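
As a rough illustration of the horizontal (uniform sampling) step and the user-threshold check described above, the following Python sketch keeps a fixed fraction of rows from each group and tests whether a score obtained on reduced data stays within a tolerance of the baseline. The function names, the grouping by class label, and the example numbers are assumptions made for illustration; this is not the FDR2-BD implementation, which runs on Apache Spark.

```python
import random
from collections import defaultdict


def uniform_reduce(rows, key, keep_fraction, seed=0):
    """Uniformly sample `keep_fraction` of the rows inside each group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    reduced = []
    for members in groups.values():
        # Keep at least one representative per group.
        n_keep = max(1, round(len(members) * keep_fraction))
        reduced.extend(rng.sample(members, n_keep))
    return reduced


def reduction_percentage(n_original, n_reduced):
    """Share of the original rows that was discarded, in percent."""
    return 100.0 * (1 - n_reduced / n_original)


def within_threshold(baseline_score, reduced_score, threshold):
    """True if the quality loss does not exceed the user's threshold."""
    return baseline_score - reduced_score <= threshold


# Hypothetical toy data: (row id, class label) pairs, grouped by label.
rows = [(i, i % 2) for i in range(1000)]
reduced = uniform_reduce(rows, key=lambda r: r[1], keep_fraction=0.05)
print(len(reduced), reduction_percentage(len(rows), len(reduced)))
```

In this toy setting the grouping key is the class label; in the approach described above, the groups would instead be the dense clusters obtained after feature selection.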

Faster Motif Counting via Succinct Color Coding and Adaptive Sampling

ACM Transactions on Knowledge Discovery from Data ◽

10.1145/3447397 ◽

2021 ◽

Vol 15 (6) ◽

pp. 1-27

Author(s):

Marco Bressan ◽

Stefano Leucci ◽

Alessandro Panconesi

Keyword(s):

Adaptive Sampling ◽

Relative Frequency ◽

State Of The Art ◽

Color Coding ◽

Input Graph ◽

Large Graphs ◽

Running Time ◽

Uniform Sampling ◽

Current State ◽

Connected Subgraphs

We address the problem of computing the distribution of induced connected subgraphs, aka graphlets or motifs, in large graphs. The current state-of-the-art algorithms estimate the motif counts via uniform sampling by leveraging the color coding technique by Alon, Yuster, and Zwick. In this work, we extend the applicability of this approach by introducing a set of algorithmic optimizations and techniques that reduce the running time and space usage of color coding and improve the accuracy of the counts. To this end, we first show how to optimize color coding to efficiently build a compact table of a representative subsample of all graphlets in the input graph. For 8-node motifs, we can build such a table in one hour for a graph with 65M nodes and 1.8B edges, which is times larger than the state of the art. We then introduce a novel adaptive sampling scheme that breaks the “additive error barrier” of uniform sampling, guaranteeing multiplicative approximations instead of just additive ones. This allows us to count not only the most frequent motifs, but also extremely rare ones. For instance, on one graph we accurately count nearly 10,000 distinct 8-node motifs whose relative frequency is so small that uniform sampling would literally take centuries to find them. Our results show that color coding is still the most promising approach to scalable motif counting.
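
The basic idea behind color coding can be sketched in a few lines of Python: every node receives one of k colors at random, and only "colorful" k-node subsets (all colors distinct) are considered. A fixed k-node subset is colorful with probability k!/k^k. The helper names below are hypothetical, and the sketch covers only this random-coloring step, not the compact table or the adaptive sampling scheme of the paper.

```python
import math
import random


def colorful_probability(k):
    """Probability that k nodes colored uniformly with k colors are all distinct."""
    return math.factorial(k) / k**k


def random_coloring(nodes, k, seed=0):
    """Assign each node one of k colors uniformly at random."""
    rng = random.Random(seed)
    return {v: rng.randrange(k) for v in nodes}


def is_colorful(coloring, subset):
    """True if no two nodes of `subset` share a color."""
    colors = [coloring[v] for v in subset]
    return len(set(colors)) == len(colors)


coloring = random_coloring(range(20), k=8)
print(colorful_probability(8))
print(is_colorful(coloring, [0, 1, 2]))
```

For 8-node motifs, only about 0.24% of the occurrences survive a single random coloring, which is why counts obtained on colorful subgraphs are rescaled by the inverse of this probability.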

Fissure Ridges: A Reappraisal of Faulting and Travertine Deposition (Travitonics)

Geosciences ◽

10.3390/geosciences11070278 ◽

2021 ◽

Vol 11 (7) ◽

pp. 278

Author(s):

Andrea Brogi ◽

Enrico Capezzuoli ◽

Volkan Karabacak ◽

Mehmet Cihat Alcicek ◽

Lianchao Luo

Keyword(s):

State Of The Art ◽

Tectonic Setting ◽

Original Data ◽

Apical Part ◽

Thermal Waters ◽

Growth Mechanisms ◽

Depositional Facies ◽

Geothermal Fluids ◽

Tectonic Features ◽

Travertine Deposition

The mechanical discontinuities in the upper crust (i.e., faults and related fractures) lead to the uprising of geothermal fluids to the Earth’s surface. If fluids are enriched in Ca²⁺ and HCO₃⁻, masses of CaCO₃ (i.e., travertine deposits) can form mainly due to the CO₂ leakage from the thermal waters. Among other things, fissure-ridge-type deposits are peculiar travertine bodies made of bedded carbonate that gently to steeply dip away from the apical part where a central fissure is located, corresponding to the fracture trace intersecting the substratum; these morpho-tectonic features are the most useful deposits for tectonic and paleoseismological investigation, as their development is contemporaneous with the activity of faults leading to the enhancement of permeability that serves to guarantee the circulation of fluids and their emergence. Therefore, the fissure ridge architecture sheds light on the interplay among fault activity, travertine deposition, and ridge evolution, providing key geo-chronologic constraints due to the fact that travertine can be dated by different radiometric methods. In recent years, studies dealing with travertine fissure ridges have been considerably improved to provide a large amount of information. In this paper, we report the state of the art of knowledge on this topic refining the literature data as well as adding original data, mainly focusing on the fissure ridge morphology, internal architecture, depositional facies, growth mechanisms, tectonic setting in which the fissure ridges develop, and advantages of using the fissure ridges for neotectonic and seismotectonic studies.

Improving Land Cover Classification Using Genetic Programming for Feature Construction

Remote Sensing ◽

10.3390/rs13091623 ◽

2021 ◽

Vol 13 (9) ◽

pp. 1623

Author(s):

João E. Batista ◽

Ana I. R. Cabral ◽

Maria J. P. Vasconcelos ◽

Leonardo Vanneschi ◽

Sara Silva

Keyword(s):

Land Cover ◽

Genetic Programming ◽

Satellite Images ◽

State Of The Art ◽

Binary Classification ◽

Feature Construction ◽

Classification Problems ◽

Construction Methods ◽

Box Models

Genetic programming (GP) is a powerful machine learning (ML) algorithm that can produce readable white-box models. Although successfully used for solving an array of problems in different scientific areas, GP is still not well known in the field of remote sensing. The M3GP algorithm, a variant of the standard GP algorithm, performs feature construction by evolving hyperfeatures from the original ones. In this work, we use the M3GP algorithm on several sets of satellite images over different countries to create hyperfeatures from satellite bands to improve the classification of land cover types. We add the evolved hyperfeatures to the reference datasets and observe a significant improvement of the performance of three state-of-the-art ML algorithms (decision trees, random forests, and XGBoost) on multiclass classifications and no significant effect on the binary classifications. We show that adding the M3GP hyperfeatures to the reference datasets brings better results than adding the well-known spectral indices NDVI, NDWI, and NBR. We also compare the performance of the M3GP hyperfeatures in the binary classification problems with those created by other feature construction methods such as FFX and EFS.
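
For context, the hand-crafted spectral indices used as a comparison point are simple normalized differences of two bands, whereas a hyperfeature is an evolved expression over the bands. The Python sketch below computes NDVI and NBR from per-pixel reflectances using their standard definitions; the band values are made up, and NDWI is left out because more than one definition is in common use and the abstract does not say which one was applied.

```python
def normalized_difference(a, b):
    """(a - b) / (a + b), returning 0.0 when both bands are zero."""
    denominator = a + b
    return (a - b) / denominator if denominator != 0 else 0.0


def ndvi(nir, red):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red)."""
    return normalized_difference(nir, red)


def nbr(nir, swir):
    """Normalized Burn Ratio: (NIR - SWIR) / (NIR + SWIR)."""
    return normalized_difference(nir, swir)


# Hypothetical reflectance values for a single pixel.
pixel = {"red": 0.1, "nir": 0.5, "swir": 0.3}
features = [ndvi(pixel["nir"], pixel["red"]), nbr(pixel["nir"], pixel["swir"])]
print(features)
```

Appending such values as extra columns of the reference dataset mirrors how both the indices and the evolved hyperfeatures are fed to the classifiers.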

Robust CNN Compression Framework for Security-Sensitive Embedded Systems

Applied Sciences ◽

10.3390/app11031093 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1093

Author(s):

Jeonghyun Lee ◽

Sangkyun Lee

Keyword(s):

Embedded Systems ◽

Optimization Problem ◽

State Of The Art ◽

Classification Problems ◽

Proximal Gradient Method ◽

Knowledge Distillation ◽

New Type ◽

Adversarial Examples ◽

Adversarial Training ◽

Memory Efficient

Convolutional neural networks (CNNs) have achieved tremendous success in solving complex classification problems. Motivated by this success, various compression methods have been proposed for downsizing CNNs so that they can be deployed on resource-constrained embedded systems. However, a new type of vulnerability of compressed CNNs, known as adversarial examples, has been discovered recently. It is critical for security-sensitive systems because adversarial examples can cause CNNs to malfunction and can be crafted easily in many cases. In this paper, we propose a compression framework that produces compressed CNNs robust against such adversarial examples. To achieve this goal, our framework uses both pruning and knowledge distillation with adversarial training. We formulate our framework as an optimization problem and provide a solution algorithm based on the proximal gradient method, which is more memory-efficient than the popular ADMM-based compression approaches. In experiments, we show that our framework can improve the trade-off between adversarial robustness and compression rate compared to the existing state-of-the-art adversarial pruning approach.
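
To make the proximal gradient method concrete, here is a minimal Python sketch of one common instance: a gradient step on a smooth loss followed by soft-thresholding, which is the proximal operator of an L1 penalty and sets small weights exactly to zero. The L1 penalty, the toy quadratic loss, and all names are assumptions for illustration; the abstract does not state which regularizer or loss the framework uses.

```python
def soft_threshold(weights, tau):
    """Proximal operator of tau * ||w||_1: shrink each weight toward zero by tau."""
    result = []
    for w in weights:
        if w > tau:
            result.append(w - tau)
        elif w < -tau:
            result.append(w + tau)
        else:
            result.append(0.0)  # small weights are pruned exactly
    return result


def proximal_gradient_step(weights, grad, step, lam):
    """Gradient step on the smooth loss, then the L1 proximal step."""
    shifted = [w - step * g for w, g in zip(weights, grad)]
    return soft_threshold(shifted, step * lam)


def sparse_fit(target, lam, step=0.5, iters=100):
    """Minimize 0.5 * ||w - target||^2 + lam * ||w||_1 (toy smooth loss)."""
    w = [0.0] * len(target)
    for _ in range(iters):
        grad = [wi - ti for wi, ti in zip(w, target)]
        w = proximal_gradient_step(w, grad, step, lam)
    return w


print(sparse_fit([2.0, 0.05, -1.0], lam=0.1))
```

Only the current weights and gradient are needed per step, which is consistent with the memory argument made above relative to ADMM-style methods that maintain additional auxiliary and dual variables.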

A Single Technically Consistent Design Formula for the Thickness of Cylindrical Sections Under Internal Pressure

Journal of Pressure Vessel Technology ◽

10.1115/1.2389035 ◽

2006 ◽

Vol 129 (1) ◽

pp. 211-215 ◽

Cited By ~ 2

Author(s):

John D. Fishburn

Keyword(s):

Experimental Data ◽

Internal Pressure ◽

State Of The Art ◽

Original Data ◽

Pressure Vessels ◽

Time Dependent ◽

Design Codes ◽

Recent Developments ◽

Current Design ◽

Single Formula

Within the current design codes for boilers, piping, and pressure vessels, there are many different equations for the thickness of a cylindrical section under internal pressure. A reassessment of these various formulations, using the original data, is described together with more recent developments in the state of the art. A single formula, which can be demonstrated to retain the same design margin in both the time-dependent and time-independent regimes, is shown to give the best correlation with the experimental data and is proposed for consideration for inclusion in the design codes.

Data-Efficient Sensor Upgrade Path Using Knowledge Distillation

Sensors ◽

10.3390/s21196523 ◽

2021 ◽

Vol 21 (19) ◽

pp. 6523

Author(s):

Pieter Van Molle ◽

Cedric De Boom ◽

Tim Verbelen ◽

Bert Vankeirsbilck ◽

Jonas De Vylder ◽

...

Keyword(s):

Deep Neural Networks ◽

State Of The Art ◽

Original Data ◽

Radar Data ◽

Teacher Supervision ◽

Multispectral Images ◽

Test Set ◽

Time To Market ◽

Speed Up ◽

Knowledge Distillation

Deep neural networks have achieved state-of-the-art performance in image classification. Due to this success, deep learning is now also being applied to other data modalities such as multispectral images, lidar and radar data. However, successfully training a deep neural network requires a large dataset. Therefore, transitioning to a new sensor modality (e.g., from regular camera images to multispectral camera images) might result in a drop in performance, due to the limited availability of data in the new modality. This might hinder the adoption rate and time to market for new sensor technologies. In this paper, we present an approach to leverage the knowledge of a teacher network that was trained using the original data modality to improve the performance of a student network on a new data modality: a technique known in the literature as knowledge distillation. By applying knowledge distillation to the problem of sensor transition, we can greatly speed up this process. We validate this approach using a multimodal version of the MNIST dataset. Especially when little data is available in the new modality (i.e., 10 images), training with additional teacher supervision results in increased performance, with the student network scoring a test set accuracy of 0.77, compared to an accuracy of 0.37 for the baseline. We also explore two extensions to the default method of knowledge distillation, which we evaluate on a multimodal version of the CIFAR-10 dataset: an annealing scheme for the hyperparameter α and selective knowledge distillation. Of these two, the first yields the best results. Choosing the optimal annealing scheme results in an increase in test set accuracy of 6%. Finally, we apply our method to the real-world use case of skin lesion classification.
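
A common formulation of knowledge distillation mixes a hard-label loss with a loss against the teacher's temperature-softened outputs, weighted by α. The Python sketch below follows that textbook form. The direction of the α weighting, the temperature value, and the linear annealing schedule are assumptions made for illustration; the abstract does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)


def distillation_loss(student_logits, teacher_logits, label, alpha, temperature=2.0):
    """alpha * hard-label loss + (1 - alpha) * loss against the soft teacher targets."""
    hard = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_loss = cross_entropy(hard, softmax(student_logits))
    soft_loss = cross_entropy(
        softmax(teacher_logits, temperature), softmax(student_logits, temperature)
    )
    return alpha * hard_loss + (1 - alpha) * soft_loss


def linear_alpha(epoch, total_epochs, start=0.0, end=1.0):
    """Hypothetical annealing scheme: move alpha linearly from start to end."""
    frac = epoch / max(1, total_epochs - 1)
    return start + frac * (end - start)


student, teacher = [1.0, 0.5, -0.2], [2.0, 0.1, -1.0]
for epoch in (0, 5, 9):
    a = linear_alpha(epoch, 10)
    print(epoch, round(distillation_loss(student, teacher, label=0, alpha=a), 4))
```

With this schedule the student relies entirely on the teacher at the start and entirely on the labels at the end; other schedules simply swap in a different `linear_alpha`.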

DANNP: an efficient artificial neural network pruning tool

PeerJ Computer Science ◽

10.7717/peerj-cs.137 ◽

2017 ◽

Vol 3 ◽

pp. e137 ◽

Cited By ~ 7

Author(s):

Mona Alshahrani ◽

Othman Soufan ◽

Arturo Magana-Mora ◽

Vladimir B. Bajic

Keyword(s):

Neural Network ◽

State Of The Art ◽

Model Performance ◽

Training Data ◽

Classification Problems ◽

Link Type ◽

On Line ◽

Pruning Algorithms ◽

Artificial Neural ◽

The Impact

Background: Artificial neural networks (ANNs) are a robust class of machine learning models and are a frequent choice for solving classification problems. However, determining the structure of the ANNs is not trivial as a large number of weights (connection links) may lead to overfitting the training data. Although several ANN pruning algorithms have been proposed for the simplification of ANNs, these algorithms are not able to efficiently cope with intricate ANN structures required for complex classification problems. Methods: We developed DANNP, a web-based tool that implements parallelized versions of several ANN pruning algorithms. The DANNP tool uses a modified version of the Fast Compressed Neural Network software implemented in C++ to considerably enhance the running time of the ANN pruning algorithms we implemented. In addition to the performance evaluation of the pruned ANNs, we systematically compared the set of features that remained in the pruned ANN with those obtained by different state-of-the-art feature selection (FS) methods. Results: Although the ANN pruning algorithms are not entirely parallelizable, DANNP was able to speed up the ANN pruning up to eight times on a 32-core machine, compared to the serial implementations. To assess the impact of the ANN pruning by the DANNP tool, we used 16 datasets from different domains. In eight out of the 16 datasets, DANNP significantly reduced the number of weights by 70%–99%, while maintaining a competitive or better model performance compared to the unpruned ANN. Finally, we used a naïve Bayes classifier derived with the features selected as a byproduct of the ANN pruning and demonstrated that its accuracy is comparable to those obtained by the classifiers trained with the features selected by several state-of-the-art FS methods. The FS ranking methodology proposed in this study allows the users to identify the most discriminant features of the problem at hand.
To the best of our knowledge, DANNP (publicly available at www.cbrc.kaust.edu.sa/dannp) is the only available and on-line accessible tool that provides multiple parallelized ANN pruning options. Datasets and DANNP code can be obtained at www.cbrc.kaust.edu.sa/dannp/data.php and https://doi.org/10.5281/zenodo.1001086.
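
The reported speed-up (up to eight times on 32 cores) can be put in perspective with Amdahl's law, which relates speed-up to the fraction of work that runs in parallel. The Python sketch below is a back-of-the-envelope reading that assumes Amdahl's law is an adequate model here; the paper itself does not report a parallel fraction.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speed-up predicted by Amdahl's law for a given parallel fraction."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def implied_parallel_fraction(speedup, cores):
    """Invert Amdahl's law: parallel fraction consistent with an observed speed-up."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores)


def parallel_efficiency(speedup, cores):
    """Speed-up per core."""
    return speedup / cores


p = implied_parallel_fraction(speedup=8, cores=32)
print(round(p, 4), parallel_efficiency(8, 32))
```

Under this model, an eightfold speed-up on 32 cores corresponds to roughly 90% of the work being parallelizable and a parallel efficiency of 25%, in line with the remark that the pruning algorithms are not entirely parallelizable.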

Generating New Space-Filling Test Instances for Continuous Black-Box Optimization

Evolutionary Computation ◽

10.1162/evco_a_00262 ◽

2020 ◽

Vol 28 (3) ◽

pp. 379-404

Author(s):

Mario A. Muñoz ◽

Kate Smith-Miles

Keyword(s):

Feature Vector ◽

State Of The Art ◽

Evolutionary Process ◽

Black Box ◽

Two Dimensional ◽

Entire Space ◽

Test Functions ◽

New Space ◽

Instance Space ◽

Different Characteristics

This article presents a method to generate diverse and challenging new test instances for continuous black-box optimization. Each instance is represented as a feature vector of exploratory landscape analysis measures. By projecting the features into a two-dimensional instance space, the location of existing test instances can be visualized, and their similarities and differences revealed. New instances are generated through genetic programming which evolves functions with controllable characteristics. Convergence to selected target points in the instance space is used to drive the evolutionary process, such that the new instances span the entire space more comprehensively. We demonstrate the method by generating two-dimensional functions to visualize its success, and ten-dimensional functions to test its scalability. We show that the method can recreate existing test functions when target points are co-located with existing functions, and can generate new functions with entirely different characteristics when target points are located in empty regions of the instance space. Moreover, we test the effectiveness of three state-of-the-art algorithms on the new set of instances. The results demonstrate that the new set is not only more diverse than a well-known benchmark set, but also more challenging for the tested algorithms. Hence, the method opens up a new avenue for developing test instances with controllable characteristics, necessary to expose the strengths and weaknesses of algorithms, and drive algorithm development.

CharTeC-Net: An Efficient and Lightweight Character-Based Convolutional Network for Text Classification

Journal of Electrical and Computer Engineering ◽

10.1155/2020/9701427 ◽

2020 ◽

Vol 2020 ◽

pp. 1-7 ◽

Cited By ~ 2

Author(s):

Aboubakar Nasser Samatin Njikam ◽

Huan Zhao

Keyword(s):

Text Classification ◽

Building Block ◽

Large Scale ◽

State Of The Art ◽

Building Blocks ◽

Training Data ◽

Superior Performance ◽

Classification Problems ◽

Computationally Efficient ◽

Convolutional Network

This paper introduces an extremely lightweight (around two hundred thousand parameters) and computationally efficient CNN architecture, named CharTeC-Net (Character-based Text Classification Network), for character-based text classification problems. This new architecture is composed of four building blocks for feature extraction. Each of these building blocks, except the last one, uses 1 × 1 pointwise convolutional layers to add more nonlinearity to the network and to increase the dimensions within each building block. In addition, shortcut connections are used in each building block to facilitate the flow of gradients over the network, but more importantly to ensure that the original signal present in the training data is shared across each building block. Experiments on eight standard large-scale text classification and sentiment analysis datasets show that CharTeC-Net outperforms baseline methods and yields competitive accuracy compared with state-of-the-art methods, although it has only between 181,427 and 225,323 parameters and weighs less than 1 megabyte.
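
The size claim can be checked with simple arithmetic. The Python sketch below counts the parameters of a 1 × 1 pointwise convolution and converts a parameter count into megabytes, assuming 32-bit floating-point weights (4 bytes each); the storage format and the example channel widths are assumptions, not figures from the paper.

```python
def pointwise_conv_params(in_channels, out_channels, bias=True):
    """Parameters of a 1 x 1 convolution: one weight per (input, output) channel pair."""
    return in_channels * out_channels + (out_channels if bias else 0)


def model_size_megabytes(n_params, bytes_per_param=4):
    """Approximate storage size, assuming a fixed number of bytes per parameter."""
    return n_params * bytes_per_param / 1_000_000


# Hypothetical channel widths, only to show how cheap a pointwise layer is.
print(pointwise_conv_params(64, 128))

# Parameter counts reported in the abstract.
for n in (181_427, 225_323):
    print(n, round(model_size_megabytes(n), 3))
```

Under the 4-byte assumption, both reported parameter counts stay below 1 megabyte (about 0.73 MB and 0.90 MB), which agrees with the abstract.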

HOPS: high-performance library for (non-)uniform sampling of convex-constrained models

Bioinformatics ◽

10.1093/bioinformatics/btaa872 ◽

2020 ◽

Author(s):

Johann F Jadebeck ◽

Axel Theorell ◽

Samuel Leweke ◽

Katharina Nöh

Keyword(s):

High Performance ◽

State Of The Art ◽

Source Code ◽

Third Party ◽

Supplementary Information ◽

Scalable Algorithms ◽

Uniform Sampling ◽

Non Uniform Sampling ◽

Constrained Models ◽

Performance Gains

Summary: The C++ library Highly Optimized Polytope Sampling (HOPS) provides implementations of efficient and scalable algorithms for sampling convex-constrained models that are equipped with arbitrary target functions. For uniform sampling, substantial performance gains were achieved compared to the state of the art. The ease of integration and utility of non-uniform sampling is showcased in a Bayesian inference setting, demonstrating how HOPS interoperates with third-party software. Availability and implementation: Source code is available at https://github.com/modsim/hops/, is tested on Linux and MS Windows, and includes unit tests, detailed documentation, example applications and a Dockerfile. Supplementary information: Supplementary data are available at Bioinformatics online.
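
To illustrate what uniform sampling of a convex-constrained model involves, the Python sketch below implements hit-and-run, a textbook Markov chain sampler for a bounded polytope {x : Ax ≤ b}. It is used here only as a generic example: the summary above does not say which algorithms HOPS implements, and this code does not use or mirror the HOPS API.

```python
import math
import random


def hit_and_run(A, b, x0, n_samples, seed=0):
    """Sample from the bounded polytope {x : A x <= b}, starting at interior point x0."""
    rng = random.Random(seed)
    x = list(x0)
    samples = []
    for _ in range(n_samples):
        # Draw a uniformly random direction on the unit sphere.
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in d))
        d = [c / norm for c in d]
        # Find the segment {x + t d} that stays inside every constraint.
        t_lo, t_hi = -math.inf, math.inf
        for row, bi in zip(A, b):
            ad = sum(r * c for r, c in zip(row, d))
            slack = bi - sum(r * c for r, c in zip(row, x))
            if ad > 1e-12:
                t_hi = min(t_hi, slack / ad)
            elif ad < -1e-12:
                t_lo = max(t_lo, slack / ad)
        # Move to a uniformly chosen point on that segment.
        t = rng.uniform(t_lo, t_hi)
        x = [xi + t * di for xi, di in zip(x, d)]
        samples.append(list(x))
    return samples


# Unit square 0 <= x, y <= 1 written as A x <= b.
A = [[1, 0], [-1, 0], [0, 1], [0, -1]]
b = [1, 0, 1, 0]
points = hit_and_run(A, b, x0=[0.5, 0.5], n_samples=200)
print(points[:3])
```

The sketch assumes the polytope is bounded and that `x0` lies strictly inside it; otherwise the feasible segment can be unbounded or empty.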
