Differentiable Subset Pruning of Transformer Heads

2021 ◽  
Vol 9 ◽  
pp. 1442-1459
Author(s):  
Jiaoda Li ◽  
Ryan Cotterell ◽  
Mrinmaya Sachan

Abstract: Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer’s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
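The hard sparsity constraint described above can be illustrated with a top-k mask over learned head-importance scores. This is a simplified sketch of the selection step only: the paper's contribution is making this selection differentiable, which is omitted here, and the helper name and scores are illustrative.

```python
import numpy as np

def topk_head_mask(importance, k):
    """Return a binary mask keeping exactly k heads with the highest
    learned importance scores (hypothetical helper; the paper's full
    differentiable relaxation is not reproduced here)."""
    mask = np.zeros_like(importance)
    keep = np.argsort(importance)[-k:]   # indices of the k largest scores
    mask[keep] = 1.0
    return mask

# Example: 12 attention heads, user-specified constraint of 4 unpruned heads.
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.05, 0.6,
                   0.4, 0.15, 0.55, 0.25])
mask = topk_head_mask(scores, k=4)
print(int(mask.sum()))  # exactly 4 heads survive
```

Because the constraint is enforced as an exact top-k selection rather than a soft penalty, the resulting sparsity level matches the user's budget precisely.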

2020 ◽  
Vol 34 (04) ◽  
pp. 6861-6868 ◽  
Author(s):  
Yikai Zhang ◽  
Hui Qu ◽  
Dimitris Metaxas ◽  
Chao Chen

Regularization plays an important role in generalization of deep learning. In this paper, we study the generalization power of an unbiased regularizor for training algorithms in deep learning. We focus on training methods called Locally Regularized Stochastic Gradient Descent (LRSGD). An LRSGD leverages a proximal type penalty in gradient descent steps to regularize SGD in training. We show that by carefully choosing relevant parameters, LRSGD generalizes better than SGD. Our thorough theoretical analysis is supported by experimental evidence. It advances our theoretical understanding of deep learning and provides new perspectives on designing training algorithms. The code is available at https://github.com/huiqu18/LRSGD.
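The proximal-type penalty in an LRSGD step can be sketched as follows. This is a minimal illustration assuming the penalty takes the quadratic form (lam/2)·||w − w_anchor||², with an anchor refreshed periodically; the parameter names and the anchor-update schedule are assumptions, not the paper's exact algorithm.

```python
import numpy as np

def lrsgd_step(w, grad, w_anchor, lr=0.1, lam=0.5):
    # Loss gradient plus the gradient of the proximal penalty term.
    return w - lr * (grad + lam * (w - w_anchor))

# Toy quadratic loss f(w) = 0.5 * ||w||^2, so grad f(w) = w.
w = np.array([4.0, -2.0])
for _ in range(50):            # outer loop: refresh the anchor
    anchor = w.copy()
    for _ in range(5):         # inner loop: penalized gradient steps
        w = lrsgd_step(w, grad=w, w_anchor=anchor)
print(np.allclose(w, 0.0, atol=1e-3))  # True: converges to the minimum
```

The penalty pulls each inner step back toward the anchor, which regularizes the trajectory without biasing the final solution on this toy objective.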


2021 ◽  
Vol 21 (1) ◽  
pp. 14-22
Author(s):  
Hary Sabita ◽  
Fitria Fitria ◽  
Riko Herwanto

This research was conducted using data provided by Kaggle. The data contains features that describe job vacancies. This study used location-based data from the US, which covers 60% of all the data. Posted job vacancies are categorized as real or fake. The research followed five stages: defining the problem, collecting data, cleaning data (exploration and pre-processing), modeling, and evaluation/validation. The evaluation uses Naïve Bayes as a baseline model and Stochastic Gradient Descent (SGD) as the end model. The Naïve Bayes model obtains an accuracy of 0.971 and an F1-score of 0.743, while Stochastic Gradient Descent obtains an accuracy of 0.977 and an F1-score of 0.81. These results indicate that SGD performs slightly better than Naïve Bayes.
Keywords: NLP, Machine Learning, Naïve Bayes, SGD, Fake Jobs
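The F1-score reported alongside accuracy above is the harmonic mean of precision and recall; a minimal sketch of its computation from confusion counts (the counts below are illustrative, not the study's data):

```python
def f1_score_from_counts(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative counts: 75 fake postings caught (TP),
# 25 real postings flagged (FP), 20 fake postings missed (FN).
print(round(f1_score_from_counts(75, 25, 20), 3))  # 0.769
```

On imbalanced data such as fake-job detection, F1 is more informative than raw accuracy, which is why both figures are quoted.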


CONVERTER ◽  
2021 ◽  
pp. 108-121
Author(s):  
Huijin Han, et al.

Temperature prediction is significant for precise control of the greenhouse environment. Traditional machine learning methods usually rely on a large amount of data. Therefore, it is difficult to make a stable and accurate prediction based on a small amount of data. This paper proposes a temperature prediction method for greenhouses. With the prediction target transformed to the logarithmic difference of temperature inside and outside the greenhouse, the method first uses the XGBoost algorithm to make a preliminary prediction. Second, a linear model is used to predict the residuals of the predicted target. The predicted temperature is obtained by combining the preliminary prediction and the residuals. Based on the 20-day greenhouse data, the results show that the target transformation applied in our method is better than the others presented in the paper. The MSE (Mean Squared Error) of our method is 0.0844, which is respectively 20.7%, 76.0%, 10.2%, and 95.3% of the MSE of the LR (Logistic Regression), SGD (Stochastic Gradient Descent), SVM (Support Vector Machines), and XGBoost algorithms. The results indicate that our method significantly improves prediction accuracy on small-scale data.
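The target transformation and the two-stage residual scheme can be sketched as follows. This assumes "logarithmic difference" means log(T_in) − log(T_out) and uses stand-in predictors (a constant model in place of XGBoost, a polynomial fit for the residual model); the paper's exact definitions may differ.

```python
import numpy as np

def to_target(t_in, t_out):
    # Assumed form of the transform: log-difference of the two temperatures.
    return np.log(t_in) - np.log(t_out)

def from_target(y, t_out):
    # Invert the transform to recover the predicted inside temperature.
    return t_out * np.exp(y)

t_out = np.array([280.0, 285.0, 290.0])   # outside temperatures (Kelvin)
t_in = np.array([295.0, 298.0, 301.0])    # inside temperatures (Kelvin)

y = to_target(t_in, t_out)
recovered = from_target(y, t_out)
print(np.allclose(recovered, t_in))       # True: the transform is invertible

# Two-stage idea: a preliminary model predicts y, then a linear model
# fits its residuals; the final prediction sums the two parts.
prelim = np.full_like(y, y.mean())        # stand-in for the XGBoost step
resid_fit = np.poly1d(np.polyfit(t_out, y - prelim, 1))
final = prelim + resid_fit(t_out)
```

The residual correction lets a simple linear model absorb systematic error the preliminary predictor leaves behind, which is useful when data is too scarce to train one large model well.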


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Chenghao Cai ◽  
Yanyan Xu ◽  
Dengfeng Ke ◽  
Kaile Su

We propose multistate activation functions (MSAFs) for deep neural networks (DNNs). These MSAFs are new kinds of activation functions which are capable of representing more than two states, including the N-order MSAFs and the symmetrical MSAF. DNNs with these MSAFs can be trained via conventional Stochastic Gradient Descent (SGD) as well as mean-normalised SGD. We also discuss how these MSAFs perform when used to resolve classification problems. Experimental results on the TIMIT corpus reveal that, on speech recognition tasks, DNNs with MSAFs perform better than conventional DNNs, achieving a relative improvement of 5.60% on phoneme error rates. Further experiments also reveal that mean-normalised SGD facilitates the training processes of DNNs with MSAFs, especially with large training sets. The models can also be directly trained without pretraining when the training set is sufficiently large, which results in a considerable relative improvement of 5.82% on word error rates.
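One plausible reading of an N-order multistate activation is a sum of shifted logistic functions, giving the output several roughly flat "states" between 0 and N. This is an illustrative sketch under that assumption; the paper's exact definition and shift placement may differ.

```python
import math

def msaf(x, n):
    """Illustrative N-order multistate activation: a sum of n shifted
    sigmoids (assumed form), saturating between 0 and n states."""
    return sum(1.0 / (1.0 + math.exp(-(x - i))) for i in range(n))

# The output is monotone and ranges over (0, n).
print(round(msaf(100.0, 3)))   # saturates near 3
print(round(msaf(-100.0, 3)))  # saturates near 0
```

Compared with a plain sigmoid, such a function can represent more than two saturated states per unit, which is the property the abstract highlights.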


Author(s):  
Ayesha Rafique ◽  
Kamran Malik ◽  
Zubair Nawaz ◽  
Faisal Bukhari ◽  
Akhtar Hussain Jalbani

The majority of online comments/opinions are written in free-text format. Sentiment analysis can be used as a measure to express the polarity (positive/negative) of comments/opinions. These comments/opinions can be in different languages, i.e., English, Urdu, Roman Urdu, Hindi, Arabic, etc. Most prior work has addressed sentiment analysis of the English language; very limited research has been done on Urdu or Roman Urdu, even though Hindi/Urdu is the third most widely spoken language in the world. In this paper, we focus on the sentiment analysis of comments/opinions in Roman Urdu. Since no Roman Urdu opinion dataset is publicly available, we prepare one by collecting comments/opinions in Roman Urdu from different websites. Three supervised machine learning algorithms, namely NB (Naive Bayes), LRSGD (Logistic Regression with Stochastic Gradient Descent), and SVM (Support Vector Machine), have been applied to this dataset. The experimental results show that SVM performs better than NB and LRSGD in terms of accuracy, achieving an accuracy of 87.22%.
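Before any of the three classifiers can be applied, the free-text comments must be turned into feature vectors. A minimal bag-of-words sketch (the Roman Urdu tokens and the pipeline below are illustrative, not the paper's dataset or code):

```python
def build_vocab(comments):
    """Map each distinct lower-cased token to a column index."""
    vocab = sorted({tok for c in comments for tok in c.lower().split()})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(comment, vocab):
    """Count-of-words feature vector over the fixed vocabulary."""
    vec = [0] * len(vocab)
    for tok in comment.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

comments = ["bohat acha", "bohat bura", "acha nahi"]
vocab = build_vocab(comments)
print(vectorize("bohat bohat acha", vocab))  # [1, 2, 0, 0]
```

Such sparse count vectors are the standard input representation for NB, LRSGD, and SVM text classifiers.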


2018 ◽  
Author(s):  
Nikolay Bogoychev ◽  
Kenneth Heafield ◽  
Alham Fikri Aji ◽  
Marcin Junczys-Dowmunt

2020 ◽  
Vol 4 (2) ◽  
pp. 329-335
Author(s):  
Rusydi Umar ◽  
Imam Riadi ◽  
Purwono

The failure of most startups in Indonesia is caused by team performance that is not solid and competent. Programmers are an integral profession in a startup team. The growth of social media can be used as a strategic tool for recruiting the best programmer candidates in a company. This strategic tool takes the form of an automatic classification system for social media posts from prospective programmers. The classification results are expected to predict the performance pattern of each candidate with a predicate of good or bad performance. The classification method with the best accuracy needs to be chosen in order to get an effective strategic tool, so a comparison of several methods is needed. This study compares classification methods including the Support Vector Machine (SVM), Random Forest (RF), and Stochastic Gradient Descent (SGD) algorithms. The classification results show accuracy with k = 10 cross-validation of 81.3% for the SVM algorithm, 74.4% for RF, and 80.1% for SGD, so the SVM method is chosen as the model for classifying programmer performance from social media activity.
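The k = 10 cross-validation used to compare the three classifiers partitions the data into ten folds, each serving once as the held-out test set. A generic index-splitting sketch in pure Python (an illustration, not the study's actual evaluation code):

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)  # spread the remainder
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(25, 10)
print(len(folds))                   # 10 folds
print(sum(len(f) for f in folds))   # 25: every sample is used exactly once
```

Averaging accuracy over all ten held-out folds gives each algorithm's quoted score a lower variance than a single train/test split would.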


2019 ◽  
Vol 12 (2) ◽  
pp. 120-127 ◽  
Author(s):  
Wael Farag

Background: In this paper, a Convolutional Neural Network (CNN) that learns safe driving behavior and smooth steering manoeuvring is proposed as an empowerment of autonomous driving technologies. The training data is collected from a front-facing camera and the steering commands issued by an experienced driver driving in traffic as well as on urban roads. Methods: This data is then used to train the proposed CNN to facilitate what is called “Behavioral Cloning”. The proposed behavior cloning CNN is named “BCNet”, and its deep seventeen-layer architecture has been selected after extensive trials. BCNet is trained using the Adam optimization algorithm, a variant of the Stochastic Gradient Descent (SGD) technique. Results: The paper goes through the development and training process in detail and shows the image processing pipeline harnessed in the development. Conclusion: The proposed approach proved successful in cloning the driving behavior embedded in the training data set after extensive simulations.
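The Adam update named above maintains bias-corrected running estimates of the gradient's first and second moments. A minimal single-step sketch applied to a toy quadratic (hyperparameters are the common defaults, not values from the paper):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad               # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2          # second-moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias corrections
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy objective f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([3.0, -1.5])
m = v = np.zeros_like(w)
for t in range(1, 301):
    w, m, v = adam_step(w, grad=w, m=m, v=v, t=t)
print(float(np.linalg.norm(w)) < 0.5)  # True: w has shrunk toward the minimum
```

The per-parameter step scaling by the second-moment estimate is what makes Adam less sensitive to gradient magnitude than plain SGD, a useful property when regressing steering angles from raw camera frames.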

