Fast Training Logistic Regression via Adaptive Sampling

2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Yunsheng Song ◽  
Xiaohan Kong ◽  
Shuoping Huang ◽  
Chao Zhang

Logistic regression has been widely used in artificial intelligence and machine learning due to its deep theoretical basis and good practical performance. Its training process aims to solve a large-scale optimization problem characterized by a likelihood function, for which gradient descent is the most commonly used approach. However, when the data size is large, training is very time-consuming because the gradient is computed from all the training data in every iteration. Although this difficulty can be mitigated by random sampling, the appropriate sample size is difficult to predetermine and the resulting estimate may not be robust. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling. The proposed method decomposes the problem of gradient estimation into several subproblems according to its dimension; then, each subproblem is solved independently by adaptive sampling. Each element of the gradient estimate is obtained by successively sampling a fixed-size batch of training examples multiple times until a stopping criterion is satisfied. The final estimate combines the results of all the subproblems. It is proved that the obtained gradient estimate is robust and keeps the objective function value decreasing during the iterative calculation. Compared with representative algorithms using random sampling, the experimental results show that this algorithm obtains comparable classification performance with much less training time.
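The per-dimension adaptive sampling loop described above might be sketched as follows. This is a minimal illustration only: the abstract does not specify the batch size or the exact stopping criterion, so the relative standard-error rule, the function name `adaptive_gradient`, and its parameters are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaptive_gradient(X, y, w, batch=64, tol=0.05, max_rounds=50, rng=None):
    """Estimate each gradient coordinate of the logistic-regression loss
    by repeatedly drawing fixed-size batches until the standard error of
    the running mean drops below a relative tolerance (a stand-in for
    the paper's stopping criterion)."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    g = np.zeros(d)
    for j in range(d):                       # one subproblem per dimension
        samples = []
        for _ in range(max_rounds):
            idx = rng.integers(0, n, size=batch)
            resid = sigmoid(X[idx] @ w) - y[idx]   # per-example residuals
            samples.append(np.mean(resid * X[idx, j]))
            m = np.mean(samples)
            se = np.std(samples) / np.sqrt(len(samples))
            if len(samples) > 2 and se < tol * (abs(m) + 1e-8):
                break                        # this coordinate is stable enough
        g[j] = np.mean(samples)
    return g
```

A descent step is then simply `w -= lr * adaptive_gradient(X, y, w)`, combining the per-dimension estimates.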

2021 ◽  
Vol 15 (1) ◽  
pp. 127-140
Author(s):  
Muhammad Adnan ◽  
Yassaman Ebrahimzadeh Maboud ◽  
Divya Mahajan ◽  
Prashant J. Nair

Recommender models are commonly used to suggest relevant items to a user in e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural-network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper dives deep into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout uses the scarce GPU memory to store the most highly accessed embeddings, thus reducing data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces overall training time by 2.3X and 1.52X in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
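The hot/cold embedding split at the heart of such a layout could be sketched as below. This is a simplified stand-in, not FAE's actual API: the function names, the frequency-counting approach, and the fixed GPU budget are all assumptions.

```python
from collections import Counter

def split_hot_cold(access_log, gpu_budget):
    """Partition embedding IDs by access frequency: the most popular
    entries go to the scarce GPU-resident table, the rest stay in
    CPU memory. `access_log` is a sequence of embedding IDs observed
    during training."""
    freq = Counter(access_log)
    ranked = [eid for eid, _ in freq.most_common()]  # most accessed first
    hot = set(ranked[:gpu_budget])
    cold = set(ranked[gpu_budget:])
    return hot, cold

def lookup_device(eid, hot):
    """Route an embedding lookup to the device holding that entry."""
    return "gpu" if eid in hot else "cpu"
```

With skewed access patterns, even a small `gpu_budget` captures most lookups, which is the observation the framework exploits.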


2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Shyam Deshmukh ◽  
Komati Thirupathi Rao ◽  
Mohammad Shabaz

Modern big data applications tend to prefer a cluster computing approach, as they are linked to a distributed computing framework that serves users' jobs on demand. It achieves rapid processing by subdividing jobs into tasks that execute in parallel. Because of the complex environment and hardware and software issues, some tasks may run slowly, delaying job completion; such slow tasks are known as stragglers. Straggling nodes bottleneck the performance of distributed computing frameworks due to factors such as shared resources, heavy system load, or hardware issues, prolonging job execution time. Many state-of-the-art approaches use independent models per node and workload, so with more nodes and workloads the number of models grows accordingly; even then, not every node can capture stragglers, as sufficient training data on straggler patterns may not be available, yielding suboptimal straggler prediction. To alleviate these problems, we propose a novel collaborative learning-based approach for straggler prediction based on the alternating direction method of multipliers (ADMM), which is resource-efficient and learns how to efficiently mitigate stragglers without moving data to a centralized location. The proposed framework shares information among the various models, allowing us to use larger training data and to reduce training time by avoiding data transfer. We rigorously evaluate the proposed method on various datasets with high-accuracy results.
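The consensus-ADMM parameter exchange that such collaboration builds on might look like the sketch below, with least squares standing in for the actual straggler predictor. All names and the quadratic local objective are assumptions for illustration; only model parameters cross node boundaries, never raw data.

```python
import numpy as np

def consensus_admm(local_data, rho=1.0, iters=100):
    """Consensus-ADMM sketch: each node fits a model on its own
    (X_i, y_i); only parameters are exchanged, never raw data.
    Least squares stands in for the paper's straggler predictor."""
    d = local_data[0][0].shape[1]
    k = len(local_data)
    w = [np.zeros(d) for _ in range(k)]   # local models
    u = [np.zeros(d) for _ in range(k)]   # scaled dual variables
    z = np.zeros(d)                       # shared consensus model
    for _ in range(iters):
        for i, (X, y) in enumerate(local_data):
            A = X.T @ X + rho * np.eye(d)
            b = X.T @ y + rho * (z - u[i])
            w[i] = np.linalg.solve(A, b)          # local x-update
        z = np.mean([w[i] + u[i] for i in range(k)], axis=0)  # z-update (averaging)
        for i in range(k):
            u[i] = u[i] + w[i] - z                # dual update
    return z
```

The averaging step is the only communication round, which is what keeps the scheme resource-efficient compared with shipping training data to a central node.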


2020 ◽  
Vol 10 (19) ◽  
pp. 6979
Author(s):  
Minho Ryu ◽  
Kichun Lee

Support vector machines (SVMs) are well-known classifiers due to their superior classification performance. An SVM is defined by a hyperplane that separates two classes with the largest margin. Computing the hyperplane, however, requires solving a quadratic programming problem whose storage cost grows with the square of the number of training sample points and whose time complexity is, in general, proportional to the cube of that number. Thus, it is worth studying how to reduce the training time of SVMs without compromising performance, to prepare for sustainability in large-scale SVM problems. In this paper, we propose a novel data reduction method that reduces training time by combining decision trees and relative support distance. We apply a new concept, relative support distance, to select good support vector candidates in each partition generated by the decision trees. The selected support vector candidates improve the training speed for large-scale SVM problems. In experiments, we demonstrate that our approach significantly reduces training time while maintaining good classification performance in comparison with existing approaches.
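The candidate-selection idea can be sketched as follows. This is a simplified stand-in: the tree partitioning is omitted, and ranking by distance to the opposite-class centroid is only a rough proxy for the paper's relative support distance; the function name and parameters are hypothetical.

```python
import numpy as np

def select_sv_candidates(X, y, keep_frac=0.2):
    """Data-reduction sketch: within each class, rank points by distance
    to the opposite-class centroid and keep the closest fraction as
    support-vector candidates, since points near the boundary are the
    ones likely to become support vectors."""
    keep = []
    for c in np.unique(y):
        other_centroid = X[y != c].mean(axis=0)
        idx = np.where(y == c)[0]
        d = np.linalg.norm(X[idx] - other_centroid, axis=1)
        k = max(1, int(len(idx) * keep_frac))
        keep.extend(idx[np.argsort(d)[:k]])  # nearest to the boundary
    return np.array(sorted(keep))
```

Training an SVM on the reduced set shrinks the quadratic program from n points to roughly `keep_frac * n`, which is where the speed-up comes from.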


2020 ◽  
Author(s):  
Merle Behr ◽  
Karl Kumbier ◽  
Aldo Cordova-Palomera ◽  
Matthew Aguirre ◽  
Euan Ashley ◽  
...  

Abstract. Detecting epistatic drivers of human phenotypes remains a challenge. Traditional approaches use regression to sequentially test multiplicative interaction terms involving single pairs of genetic variants. For higher-order interactions and genome-wide large-scale data, this strategy is computationally intractable. Moreover, multiplicative terms used in regression modeling may not capture the form of biological interactions. Building on the Predictability, Computability, Stability (PCS) framework, we introduce the epiTree pipeline to extract higher-order interactions from genomic data using tree-based models. The epiTree pipeline first selects a set of variants derived from tissue-specific estimates of gene expression. Next, it uses iterative random forests (iRF) to search training data for candidate Boolean interactions (pairwise and higher-order). We derive significance tests for interactions by simulating Boolean tree-structured null (no epistasis) and alternative (epistasis) distributions on hold-out test data. Finally, our pipeline computes PCS epistasis p-values that evaluate the stability of improvement in prediction accuracy via bootstrap sampling on the test set. We validate the epiTree pipeline using the phenotype of red hair from the UK Biobank, where several genes are known to demonstrate epistatic interactions. epiTree recovers both previously reported and novel interactions, which represent forms of non-linearities not captured by logistic regression models. Additionally, epiTree suggests interactions between genes such as PKHD1 and XPOTP1, which are unlinked to MC1R, as novel candidate interactions associated with the red hair phenotype. Last but not least, we find that individual Boolean or tree-based epistasis models generally provide higher prediction accuracy than classical logistic regression.
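The final bootstrap step, assessing the stability of a prediction-accuracy improvement on the test set, might be sketched as below. This is not the pipeline's exact PCS procedure; the function name, the accuracy metric, and the add-one smoothing are illustrative assumptions.

```python
import numpy as np

def pcs_style_pvalue(y_true, pred_full, pred_null, n_boot=2000, rng=None):
    """Bootstrap sketch of a PCS-style stability test: estimate how often
    the interaction (epistasis) model fails to beat the no-epistasis
    model in test accuracy across bootstrap resamples of the test set."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y_true)
    worse = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap resample of test set
        acc_full = np.mean(pred_full[idx] == y_true[idx])
        acc_null = np.mean(pred_null[idx] == y_true[idx])
        if acc_full <= acc_null:
            worse += 1
    return (worse + 1) / (n_boot + 1)         # add-one smoothed p-value
```

A small value indicates the accuracy gain of the interaction model is stable across resamples rather than a test-set artifact.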


2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Jiusheng Chen ◽  
Xiaoyu Zhang ◽  
Kai Guo

A large vector-angular region and margin (LARM) approach is presented for novelty detection based on imbalanced data. The key idea is to construct the largest vector-angular region in the feature space to separate normal training patterns and, meanwhile, maximize the vector-angular margin between the surface of this optimal vector-angular region and the abnormal training patterns. In order to improve the generalization performance of LARM, the vector-angular distribution is optimized by maximizing the vector-angular mean and minimizing the vector-angular variance, which separates the normal and abnormal examples well. However, the inherent computation of the quadratic programming (QP) solver takes O(n³) training time and at least O(n²) space, which can be computationally prohibitive for large-scale problems. Using a (1+ε)- and (1−ε)-approximation algorithm, a core-set-based LARM algorithm is proposed for fast training of the LARM problem. Experimental results based on imbalanced datasets have validated the favorable efficiency of the proposed approach in novelty detection.
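The core-set device used here is the same one known from minimum-enclosing-ball (MEB) approximation, so a Badoiu-Clarkson-style sketch on the MEB problem (rather than LARM's own QP) illustrates the idea: only a tiny subset of points, grown greedily, is needed for a (1+ε)-approximation. The function name and details below are illustrative assumptions.

```python
import numpy as np

def coreset_meb(X, eps=0.1):
    """Badoiu-Clarkson-style core-set sketch for a minimum enclosing
    ball: repeatedly add the farthest point and re-centre. The core set
    stays small (size O(1/eps)) yet yields a (1+eps)-approximate ball,
    the same trick the core-set LARM algorithm uses to avoid the full QP."""
    n_iter = int(np.ceil(1.0 / eps**2))
    core = {0}
    c = X[0].astype(float).copy()
    for t in range(1, n_iter + 1):
        far = int(np.argmax(np.linalg.norm(X - c, axis=1)))
        core.add(far)                     # farthest point joins the core set
        c = c + (X[far] - c) / (t + 1)    # shift centre toward it
    return c, sorted(core)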


Author(s):  
L. Wickert ◽  
M. Bogen ◽  
M. Richter

Abstract. The various forms of humanitarian operations include operations concerning the management of migrant movements and refugees. Managing those operations is non-trivial: a large number of refugees have to be welcomed, registered, forwarded, and given supplies and accommodation. Current and sufficient information about the refugees is often lacking, which makes the planning and execution of operations challenging, expensive, and cumbersome. The earlier information about the refugees is available, the better. The method “Dwelling Detection”, conducted on satellite imagery of refugee camps, can quickly provide large-scale heads-up information, complementing information already available to operators on the ground. With “Dwelling Detection”, dwellings in a camp and their extent are detected using machine learning methods. An estimate of the camp’s inhabitants is computed using the number and the extent of the detected dwellings. Our workflow uses a Faster R-CNN, an object detection network. To train the network, we developed a fast training data annotation workflow. We use the dwellings detected by the Faster R-CNN to estimate the number of inhabitants. The quality of the analysis can be evaluated by a confidence metric computed from the results of the Faster R-CNN. The results can be used in humanitarian operations. We tested the workflow using different configurations and data, and from those tests we give recommendations on how to build a dwelling detection classifier. We propose that humanitarian operators build a dwelling detection classifier according to our recommendations and use satellite images in actual humanitarian operations. This could help to reduce stress for all people involved in a humanitarian (crisis) situation.
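The population estimate from detected dwellings could, in its simplest form, look like the sketch below. The occupant-density value is a placeholder, not a figure from the paper, and the function name and input shape are assumptions.

```python
def estimate_inhabitants(dwelling_extents, persons_per_sqm=0.25):
    """Population-estimate sketch: sum the detected dwelling footprint
    areas (width, height in metres, from the detector's bounding boxes)
    and scale by an assumed occupant density."""
    total_area = sum(w * h for (w, h) in dwelling_extents)
    return int(total_area * persons_per_sqm)
```

In practice the density factor would be calibrated per region and dwelling type; the detector's confidence scores could additionally be used to weight uncertain detections.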


2020 ◽  
Author(s):  
Rashmi P Shetty ◽  
Srinivasa Pai P

Abstract. COVID-19 is a pandemic that has caused many deaths and infections in the last six months and is showing an increasing trend not only in the number of infections and deaths but also in the recovery rate. Accurate prediction models are essential to make proper forecasts and take necessary actions. This study demonstrates the capability of a Multilayer Perceptron (MLP), an ANN model, for forecasting the number of infected cases in the state of Karnataka in India. It is trained using a fast training algorithm, the Extreme Learning Machine (ELM), to reduce the required training time. The parameters required for the forecasting model have been selected using the partial autocorrelation function (PACF), a conventional method, and its performance has been compared with parameters selected using the cuckoo search (CS) algorithm, a popular nature-inspired optimization algorithm. The forecasting model has been tested and the two parameter selection methods compared. Use of the CS algorithm resulted in better forecasting performance based on mean absolute percentage error (MAPE), with a value of 6.62% on the training data and 7.03% on the test data. To further check the efficacy of the model, the data of COVID-19 cases in Hungary from 4th March to 19th April 2020 were used, which resulted in a MAPE of 1.55%, thereby establishing the robustness of the proposed ANN model for forecasting COVID-19 cases for the state of Karnataka.
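What makes ELM training fast is that the hidden-layer weights are drawn at random and fixed, so only the output weights need to be solved, in closed form by least squares. A minimal sketch (names and sizes are illustrative, not the study's configuration):

```python
import numpy as np

def elm_train(X, y, n_hidden=50, rng=None):
    """Extreme Learning Machine sketch: hidden weights are random and
    fixed; only the output weights are obtained in closed form by least
    squares, which is what makes training fast."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.normal(size=(X.shape[1], n_hidden))   # fixed random input weights
    b = rng.normal(size=n_hidden)                 # fixed random biases
    H = np.tanh(X @ W + b)                        # random hidden-layer features
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # closed-form output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta
```

No iterative back-propagation is involved; the entire fit is one random projection plus one linear solve.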


2015 ◽  
Vol 4 (1) ◽  
pp. 61-81
Author(s):  
Mohammad Masoud Javidi

Multi-label classification is an extension of conventional classification in which a single instance can be associated with multiple labels. Problems of this type are ubiquitous in everyday life; for example, a movie can be categorized as action, crime, and thriller. Most multi-label classification algorithms are designed for balanced data and do not work well on imbalanced data, yet in real applications most datasets are imbalanced. We therefore focus on improving multi-label classification performance on imbalanced datasets. In this paper, a state-of-the-art multi-label classification algorithm called IBLR-ML, which combines the k-nearest neighbor and logistic regression algorithms, is employed. The logistic regression part of this algorithm is combined with two ensemble learning algorithms, Bagging and Boosting; the proposed approach is called IB-ELR. In this paper, for the first time, the ensemble bagging method with a stable learner as the base learner and imbalanced data sets as the training data is examined. Finally, to evaluate the proposed methods, they are implemented in the JAVA language. Experimental results show the effectiveness of the proposed methods.
Keywords: Multi-label classification, Imbalanced data set, Ensemble learning, Stable algorithm, Logistic regression, Bagging, Boosting
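The core combination, bagging with logistic regression as the (stable) base learner, can be sketched as below. This is a simplified single-label stand-in for the multi-label IBLR-ML setting, and all names and hyperparameters are illustrative assumptions.

```python
import numpy as np

def fit_logreg(X, y, lr=0.1, iters=200):
    """Plain gradient-descent logistic regression (no bias term)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def bagged_logreg(X, y, n_models=11, rng=None):
    """Bagging sketch: train each base logistic-regression model on a
    bootstrap resample of the training data."""
    rng = np.random.default_rng() if rng is None else rng
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))  # bootstrap resample
        models.append(fit_logreg(X[idx], y[idx]))
    return models

def predict_vote(X, models):
    """Majority vote over the bagged models."""
    votes = np.mean([(X @ w > 0).astype(int) for w in models], axis=0)
    return (votes > 0.5).astype(int)
```

Because logistic regression is a stable learner, the bootstrap models differ only modestly, which is exactly the interaction the paper sets out to examine on imbalanced data.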


2003 ◽  
Vol 15 (4) ◽  
pp. 369-376 ◽  
Author(s):  
Bancha Charumporn ◽  
Michifumi Yoshioka ◽  
Toru Fujinaka ◽  
Sigeru Omatu

An electronic nose developed from metal oxide gas sensors is applied to test the smoke of three general household burning materials under different environments. Generally, training data are randomly selected for a layered neural network with error back-propagation (BP). Randomly selected training data always contain redundant examples that lengthen training time without improving classification performance. This paper proposes an effective method to select training data based on a similarity index (SI). The SI ensures that only the most valuable training data are included in the training data set. The proposed method is applied to remove redundant data from the training data set before it is fed to the layered neural network based on BP. Results verify high classification performance using a small amount of training data selected by the proposed method.
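The redundancy-removal step could be sketched as below, using cosine similarity as a simplified stand-in for the paper's SI (the threshold, the greedy pass, and the function name are assumptions):

```python
import numpy as np

def select_by_similarity(X, threshold=0.95):
    """Similarity-index sketch: keep a training example only if its
    cosine similarity to every example already kept is below the
    threshold, so near-duplicate (redundant) data are dropped."""
    kept = [0]                      # always keep the first example
    for i in range(1, len(X)):
        sims = [np.dot(X[i], X[j]) /
                (np.linalg.norm(X[i]) * np.linalg.norm(X[j]) + 1e-12)
                for j in kept]
        if max(sims) < threshold:   # sufficiently novel -> keep it
            kept.append(i)
    return kept
```

Only the kept indices are then fed to the BP network, shrinking the training set without discarding informative patterns.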


Author(s):  
Yousra Hamrouni ◽  
Éric Paillassa ◽  
Véronique Chéret ◽  
Claude Monteil ◽  
David Sheeren

Reliable estimates of poplar plantation area are not available at the French national scale due to the unsuitability and low update rate of existing forest databases for this short-rotation species. While supervised classification methods have been shown to be highly accurate in mapping forest cover from remotely sensed images, their performance depends to a great extent on the labelled samples used to build the models. In addition to their high acquisition cost, such samples are often scarce and not fully representative of the variability in class distributions. Consequently, when classification models are applied to large areas with high intra-class variance, they generally yield poor accuracies. In this paper, we propose the use of active learning (AL) to efficiently adapt a classifier trained on a source image to spatially distinct target images with minimal labelling effort and without sacrificing classification performance. The adaptation consists of actively adding to the initial local model new relevant training samples from other areas, in a cascade that iteratively improves the generalisation capabilities of the classifier, leading to a global model tailored to different areas. This active selection relies on uncertainty sampling to focus directly on the most informative pixels, those for which the algorithm is least certain of the class labels. Experiments conducted on Sentinel-2 time series showed that, for the same number of training samples, active learning outperformed passive learning (random sampling) by up to 5% in overall accuracy and up to 12% in class F-score. In addition, depending on the class considered, random sampling required up to 50% more samples to achieve the same performance as an active learning-based model.
Moreover, the results demonstrate the suitability of the derived global model to accurately map poplar plantations among other tree species with overall accuracy values up to 14% higher than those obtained with local models. The proposed approach paves the way for national-scale mapping in an operational context.
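The uncertainty-sampling selection at the heart of this active learning loop can be sketched as follows, using the margin between the two highest class probabilities as the uncertainty measure (the abstract does not name the exact criterion, so the margin rule and function name are assumptions):

```python
import numpy as np

def uncertainty_sampling(proba, n_select):
    """Uncertainty-sampling sketch: pick the unlabelled pixels whose
    predicted class probabilities are least confident, measured by the
    margin between the best and second-best class."""
    part = np.sort(proba, axis=1)
    margin = part[:, -1] - part[:, -2]   # best minus second-best probability
    return np.argsort(margin)[:n_select]  # smallest margins first
```

Each AL iteration labels the selected pixels, retrains the classifier, and repeats, which is how the local model is cascaded toward a global one.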

