Training Method and Device of Chemical Industry Chinese Language Model Based on Knowledge Distillation

Scientific Programming ◽ 10.1155/2021/5753693 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-9

Author(s): Wen-Ting Li ◽ Shang-Bing Gao ◽ Jun-Qiang Zhang ◽ Shu-Xing Guo

Keyword(s): Language Processing ◽ Chemical Industry ◽ Language Model ◽ Middle Layer ◽ Language Models ◽ Learning Ability ◽ Simplified Model ◽ Student Model ◽ Model Based ◽ Knowledge Distillation

Recent advances in pretrained language models have achieved state-of-the-art results in various natural language processing tasks. However, these huge pretrained language models are difficult to use in practical settings such as mobile and embedded devices. Moreover, there is no pretrained language model for the chemical industry. In this work, we propose a method to pretrain a smaller language representation model for the chemical industry domain. First, a large number of chemical industry texts are used as the pretraining corpus, and nontraditional knowledge distillation is used to build a simplified model that learns the knowledge in the BERT model. By learning the embedding layer, the middle layer, and the prediction layer at different stages, the simplified model learns not only the probability distribution of the prediction layer but also the embedding layer and the middle layer, and so acquires the learning ability of the BERT model. Finally, it is applied to downstream tasks. Experiments show that, compared with current BERT distillation methods, our method makes full use of the rich feature knowledge in the middle layer of the teacher model while building a student model based on the BiLSTM architecture, which effectively solves the problem that traditional student models based on the transformer architecture are too large and improves the accuracy of the language model in the chemical domain.
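
The abstract above describes a student that matches both the teacher's prediction layer and its middle layer. The following is a minimal sketch of that idea, not the authors' implementation: the abstract gives no loss formulas, so the temperature-softened KL term, the mean-squared-error layer term, the single weighted sum (the paper trains in stages), and the names `alpha` and `temperature` are all assumptions. It also assumes teacher and student features have the same length; in practice a projection would be needed when they differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def prediction_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) between softened output distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def layer_loss(teacher_hidden, student_hidden):
    """Mean squared error between two equal-length feature vectors."""
    n = len(teacher_hidden)
    return sum((t - s) ** 2 for t, s in zip(teacher_hidden, student_hidden)) / n

def distillation_loss(teacher, student, alpha=0.5, temperature=2.0):
    """Weighted sum of the middle-layer and prediction-layer terms (assumed form)."""
    mid = layer_loss(teacher["hidden"], student["hidden"])
    pred = prediction_loss(teacher["logits"], student["logits"], temperature)
    return alpha * mid + (1 - alpha) * pred
```

The loss is zero when the student reproduces the teacher's features and logits exactly, and grows as either the middle-layer features or the output distribution drift apart.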

Automated Source Code Generation and Auto-Completion Using Deep Learning: Comparing and Discussing Current Language Model-Related Approaches

AI ◽ 10.3390/ai2010001 ◽ 2021 ◽ Vol 2 (1) ◽ pp. 1-16

Author(s): Juan Cruz-Benito ◽ Sanjay Vishwakarma ◽ Francisco Martin-Fernandez ◽ Ismael Faro

Keyword(s): Deep Learning ◽ Learning Community ◽ Programming Languages ◽ Language Processing ◽ Code Generation ◽ Language Model ◽ Language Models ◽ Stochastic Gradient Descent ◽ Network Architectures ◽ Learning Architectures

In recent years, the use of deep learning in language models has gained much attention. Some research projects claim that they can generate text that can be interpreted as human writing, enabling new possibilities in many application areas. Among the different areas related to language processing, one of the most notable in applying this type of modeling is programming languages. For years, the machine learning community has been researching this software engineering area, pursuing goals like applying different approaches to auto-complete, generate, fix, or evaluate code programmed by humans. Considering the increasing popularity of the deep learning-enabled language models approach, we found a lack of empirical papers that compare different deep learning architectures to create and use language models based on programming code. This paper compares different neural network architectures like Average Stochastic Gradient Descent (ASGD) Weight-Dropped LSTMs (AWD-LSTMs), AWD-Quasi-Recurrent Neural Networks (QRNNs), and Transformer while using transfer learning and different forms of tokenization to see how they behave in building language models using a Python dataset for code generation and filling mask tasks. Considering the results, we discuss each approach’s different strengths and weaknesses and what gaps we found to evaluate the language models or to apply them in a real programming context.

Astrid

Proceedings of the VLDB Endowment ◽ 10.14778/3436905.3436907 ◽ 2020 ◽ Vol 14 (4) ◽ pp. 471-484

Author(s): Suraj Shetiya ◽ Saravanan Thirumuruganathan ◽ Nick Koudas ◽ Gautam Das

Keyword(s): Deep Learning ◽ Objective Function ◽ Pattern Matching ◽ Language Processing ◽ Language Model ◽ Language Models ◽ Selectivity Estimation ◽ Statistical Correlations ◽ Benchmark Datasets ◽ Traditional Approaches

Accurate selectivity estimation for string predicates is a long-standing research challenge in databases. Supporting pattern matching on strings (such as prefix, substring, and suffix) makes this problem much more challenging, thereby necessitating a dedicated study. Traditional approaches often build pruned summary data structures such as tries, followed by selectivity estimation using statistical correlations. However, this produces insufficiently accurate cardinality estimates, resulting in the selection of sub-optimal plans by the query optimizer. Recently proposed deep learning based approaches leverage techniques from natural language processing, such as embeddings, to encode the strings and use them to train a model. While this is an improvement over traditional approaches, there remains considerable room for improvement. We propose Astrid, a framework for string selectivity estimation that synthesizes ideas from traditional and deep learning based approaches. We make two complementary contributions. First, we propose an embedding algorithm that is query-type (prefix, substring, and suffix) and selectivity aware. Consider three strings 'ab', 'abc' and 'abd' whose prefix frequencies are 1000, 800 and 100 respectively. Our approach would ensure that the embedding for 'ab' is closer to 'abc' than 'abd'. Second, we describe how neural language models could be used for selectivity estimation. While they work well for prefix queries, their performance for substring queries is sub-optimal. We modify the objective function of the neural language model so that it could be used for estimating selectivities of pattern matching queries. We also propose a novel and efficient algorithm for optimizing the new objective function. We conduct extensive experiments over benchmark datasets and show that our proposed approaches achieve state-of-the-art results.
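
To make the quantity being estimated concrete, the sketch below computes exact prefix frequencies and prefix selectivities by brute-force counting. This is the ground truth that an estimator approximates, not Astrid's learned model; the function names and the toy data are hypothetical.

```python
from collections import Counter

def prefix_counts(strings):
    """Count, for every prefix, how many strings start with it."""
    counts = Counter()
    for s in strings:
        for end in range(1, len(s) + 1):
            counts[s[:end]] += 1
    return counts

def prefix_selectivity(counts, total, prefix):
    """Fraction of strings that match the prefix predicate."""
    return counts[prefix] / total

data = ["abc", "abc", "abd", "xyz"]  # toy column of string values
counts = prefix_counts(data)
```

Here `prefix_selectivity(counts, len(data), "ab")` is the share of rows starting with 'ab'. Storing every prefix does not scale to large columns, which is why pruned summaries or learned models are used instead.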

Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences ◽ 10.3390/app9183648 ◽ 2019 ◽ Vol 9 (18) ◽ pp. 3648

Author(s): Casper S. Shikali ◽ Zhou Sijie ◽ Liu Qihe ◽ Refuoe Mokhosi

Keyword(s): Language Processing ◽ Critical Role ◽ Language Model ◽ Central Africa ◽ Spoken Language ◽ Language Models ◽ Word Embeddings ◽ Word Representation

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource language widely spoken in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose the word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllabic-aware language model and a word analogy dataset for Swahili.

Comparing gated and simple recurrent neural network architectures as models of human sentence processing

10.31234/osf.io/wec74 ◽ 2018

Author(s): Christoph Aurnhammer ◽ Stefan L. Frank

Keyword(s): Language Processing ◽ Sentence Processing ◽ Language Model ◽ Cell Types ◽ Recurrent Network ◽ Cognitive Models ◽ Language Models ◽ Model Quality ◽ Sentence Reading ◽ Human Sentence Processing

The Simple Recurrent Network (SRN) has a long tradition in cognitive models of language processing. More recently, gated recurrent networks have been proposed that often outperform the SRN on natural language processing tasks. Here, we investigate whether two types of gated networks perform better as cognitive models of sentence reading than SRNs, beyond their advantage as language models. This will reveal whether the filtering mechanism implemented in gated networks corresponds to an aspect of human sentence processing. We train a series of language models differing only in the cell types of their recurrent layers. We then compute word surprisal values for stimuli used in self-paced reading, eye-tracking, and electroencephalography experiments, and quantify the surprisal values' fit to experimental measures that indicate human sentence reading effort. While the gated networks provide better language models, they do not outperform their SRN counterpart as cognitive models when language model quality is equal across network types. Our results suggest that the different architectures are equally valid as models of human sentence processing.
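
Word surprisal, the quantity computed above, is the negative log-probability of a word given its preceding context. The sketch below illustrates the calculation with a smoothed bigram model standing in for the recurrent networks used in the study; the bigram model, the add-one smoothing, and the function names are assumptions made only to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Collect bigram counts and the vocabulary from tokenised sentences."""
    counts = defaultdict(Counter)
    vocab = set()
    for sent in sentences:
        tokens = ["<s>"] + sent  # sentence-start marker as initial context
        vocab.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return counts, vocab

def surprisal(counts, vocab, prev, word):
    """Surprisal in bits, -log2 P(word | prev), with add-one smoothing."""
    numerator = counts[prev][word] + 1
    denominator = sum(counts[prev].values()) + len(vocab)
    return -math.log2(numerator / denominator)

corpus = [["the", "dog", "runs"], ["the", "dog", "sleeps"], ["the", "cat", "runs"]]
counts, vocab = train_bigram(corpus)
```

A continuation seen more often after the same context receives lower surprisal, so in this toy corpus "dog" after "the" is less surprising than "cat" after "the".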

BERTtoCNN: Similarity-preserving enhanced knowledge distillation for stance detection

PLoS ONE ◽ 10.1371/journal.pone.0257130 ◽ 2021 ◽ Vol 16 (9) ◽ pp. e0257130

Author(s): Yang Li ◽ Yuqing Sun ◽ Nana Zhu

Keyword(s): Language Model ◽ Language Models ◽ Limited Resources ◽ Teacher Language ◽ Proposed Model ◽ Knowledge Distillation ◽ Similarity Preserving ◽ The One ◽ Chinese And English ◽ Text Sentiment Analysis

In recent years, text sentiment analysis has attracted wide attention and has promoted the rise and development of stance detection research. The purpose of stance detection is to determine the author’s stance (favor or against) towards a specific target or proposition in the text. Pre-trained language models like BERT have been proven to perform well in this task. However, in many real-world scenarios they are computationally very expensive, which makes such heavy models difficult to deploy with limited resources. To improve efficiency while preserving performance, we propose a knowledge distillation model, BERTtoCNN, which combines the classic distillation loss and a similarity-preserving loss in a joint knowledge distillation framework. On the one hand, BERTtoCNN provides an efficient distillation process to train a novel ‘student’ CNN structure from a much larger ‘teacher’ language model, BERT. On the other hand, based on the similarity-preserving loss function, BERTtoCNN guides the training of a student network so that input pairs with similar (dissimilar) activation in the teacher network have similar (dissimilar) activation in the student network. We conduct experiments and test the proposed model on open Chinese and English stance detection datasets. The experimental results show that our model clearly outperforms competitive baseline methods.
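
The similarity-preserving idea described above can be sketched as follows. This follows a formulation commonly used in the knowledge distillation literature (batch similarity matrices, row-normalised, compared by mean squared difference); whether BERTtoCNN uses exactly this form is an assumption, and the function names are hypothetical.

```python
import math

def gram(rows):
    """Pairwise dot products between the activation vectors of a batch."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]

def row_normalise(matrix):
    """Scale each row to unit L2 norm (all-zero rows are left unchanged)."""
    out = []
    for row in matrix:
        norm = math.sqrt(sum(x * x for x in row))
        out.append([x / norm for x in row] if norm > 0 else list(row))
    return out

def similarity_preserving_loss(teacher_acts, student_acts):
    """Mean squared difference between normalised batch similarity matrices."""
    b = len(teacher_acts)  # batch size; both inputs hold one vector per example
    g_t = row_normalise(gram(teacher_acts))
    g_s = row_normalise(gram(student_acts))
    return sum((g_t[i][j] - g_s[i][j]) ** 2
               for i in range(b) for j in range(b)) / (b * b)
```

Because only batch-by-batch similarity matrices are compared, the teacher and student activations may have different widths, which suits a small CNN student paired with a much larger teacher.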

Automatic Mixed-Precision Quantization Search of BERT

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/472 ◽ 2021

Author(s): Changsheng Zhao ◽ Ting Hua ◽ Yilin Shen ◽ Qian Lou ◽ Hongxia Jin

Keyword(s): Natural Language Processing ◽ Natural Language ◽ Language Processing ◽ Language Models ◽ Model Compression ◽ Mixed Precision ◽ Knowledge Distillation ◽ Model Size ◽ Orthogonal Methods ◽ Weight Pruning

Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are known to be the main directions in model compression. However, compact models obtained through knowledge distillation may suffer from a significant accuracy drop even for a relatively small compression ratio. On the other hand, there are only a few quantization-based attempts designed for natural language processing tasks, and they usually require manual setting of hyper-parameters. In this paper, we propose an automatic mixed-precision quantization framework designed for BERT that can conduct quantization and pruning simultaneously. Specifically, our proposed method leverages Differentiable Neural Architecture Search to assign scale and precision to the parameters in each sub-group automatically, while at the same time pruning out redundant groups of parameters. Extensive evaluations on BERT downstream tasks reveal that our proposed method beats baselines by providing the same performance with a much smaller model size. We also show the possibility of obtaining an extremely lightweight model by combining our solution with orthogonal methods such as DistilBERT.
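
As a rough illustration of what mixed precision means here, the sketch below applies symmetric uniform quantization with a different bit width to each parameter group. The bit widths are fixed by hand and the group names are hypothetical; the paper instead searches for scale and precision automatically and also prunes groups, neither of which is shown.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of a weight group to signed integers."""
    qmax = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]

def max_error(weights, bits):
    """Largest absolute reconstruction error for a given bit width."""
    codes, scale = quantize(weights, bits)
    return max(abs(w - r) for w, r in zip(weights, dequantize(codes, scale)))

# Hypothetical parameter groups and hand-picked per-group bit widths.
groups = {"attention": [0.5, -1.0, 0.25, 0.75], "ffn": [0.02, -0.01, 0.03, 0.0]}
precision = {"attention": 8, "ffn": 4}
quantized = {name: quantize(w, precision[name]) for name, w in groups.items()}
```

Fewer bits shrink storage but raise the reconstruction error, which is the trade-off a per-group precision assignment tries to balance.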

HIDING CRITICAL INFORMATION WHEN TRAINING LANGUAGE MODELS

EurasianUnionScientists ◽ 10.31618/esu.2413-9335.2021.1.86.1349 ◽ 2021 ◽ pp. 15-18

Author(s): A. Evtushenko

Keyword(s): Natural Language ◽ Language Processing ◽ Text Processing ◽ Language Model ◽ Personal Data ◽ Language Models ◽ Training Dataset ◽ Critical Information ◽ Research Company ◽ Learning Language

Machine learning language models are combinations of algorithms and neural networks designed for processing text written in natural language (Natural Language Processing, NLP). In 2020, GPT-3, the largest language model from the artificial intelligence research company OpenAI, was released, with up to 175 billion parameters. Increasing the number of parameters by more than 100 times made it possible to improve the quality of generated texts to a level that is hard to distinguish from human-written texts. Notably, this model was trained on a dataset collected mainly from open sources on the Internet, with a volume estimated at 570 GB. This article discusses the problem of memorizing critical information, in particular personal data of individuals, at the stage of training large language models (GPT-2/3 and derivatives). It also describes an algorithmic approach to solving this problem, which consists of additional preprocessing of the training dataset and refinement of model inference, in the context of generating pseudo-personal data and embedding it into the results of summarization, text generation, question answering, and other seq2seq tasks.

A word language model based contextual language processing on Chinese character recognition

10.1117/12.838718 ◽ 2010

Author(s): Chen Huang ◽ Xiaoqing Ding ◽ Yan Chen

Keyword(s): Language Processing ◽ Character Recognition ◽ Chinese Character ◽ Language Model ◽ Chinese Character Recognition ◽ Model Based

Enhancing argumentation component classification using contextual language model

Journal Of Big Data ◽ 10.1186/s40537-021-00490-2 ◽ 2021 ◽ Vol 8 (1)

Author(s): Hidayaturrahman ◽ Emmanuel Dave ◽ Derwin Suhartono ◽ Aniati Murni Arymurthy

Keyword(s): Language Processing ◽ Language Model ◽ Language Models ◽ Context Sensitive ◽ Argumentation Mining ◽ Prediction Time ◽ Semantic Label ◽ Core Idea ◽ Increase In Accuracy ◽ Context Free

Arguments help humans deliver their ideas. The outcome of a discussion relies heavily on the validity of the argument. If an argument is well composed, it is easier to grasp the core idea behind it. Machines can be used to grade an argument by decomposing it into semantic label components. In natural language processing, multiple language models are available to perform this task. They are divided into context-free and contextual models. The majority of previous studies used hand-crafted features to perform argument component classification, while state-of-the-art language models utilize machine learning. The majority of these language models ignore the context in an argument. This research paper aims to analyze whether including the context in the classification process improves the accuracy of the language model, which would enhance the argumentation mining process as well. The same document corpus is fed into several language models. Word2Vec and GloVe represent the context-free models, while BERT and ELMo represent the context-sensitive language models. Accuracy and time from each model are then compared to determine the importance of context. The results show that contextual language models boost classification accuracy by approximately 20%. However, this comes at a cost in time: contextual models require longer training and prediction time. The benefit from the increase in accuracy outweighs the burden of time. Thus, as a contextual task, argumentation mining is advised to use contextual models, in which context is included, to achieve promising results.

Neural Language Models Capture Some, But Not All, Agreement Attraction Effects

10.31234/osf.io/97qcg ◽ 2020

Author(s): Suhas Arehalli ◽ Tal Linzen

Keyword(s): Real Time ◽ Language Processing ◽ Language Production ◽ Prediction Models ◽ Language Model ◽ Cognitive Model ◽ Language Models ◽ Word Prediction ◽ Wide Range ◽ Attraction Effects

The number of the subject in English must match the number of the corresponding verb (dog runs but dogs run). Yet in real-time language production and comprehension, speakers often mistakenly compute agreement between the verb and a grammatically irrelevant non-subject noun phrase instead. This phenomenon, referred to as agreement attraction, is modulated by a wide range of factors; any complete computational model of grammatical planning and comprehension would be expected to derive this rich empirical picture. Recent developments in Natural Language Processing have shown that neural networks trained only on word prediction over large corpora are capable of capturing subject-verb agreement dependencies to a significant extent, but with occasional errors. The goal of this paper is to evaluate the potential of such neural word prediction models as a foundation for a cognitive model of real-time grammatical processing. We simulate six experiments taken from the agreement attraction literature with LSTMs, one common type of neural language model. The LSTMs captured the critical human behavior in three of them, indicating that (1) some agreement attraction phenomena can be captured by a generic sequence processing model, but (2) capturing the other phenomena may require models with more language-specific mechanisms.
