Low-Rank RNN Adaptation for Context-Aware Language Modeling

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00035 ◽

2018 ◽

Vol 6 ◽

pp. 497-510 ◽

Cited By ~ 3

Author(s):

Aaron Jaech ◽

Mari Ostendorf

Keyword(s):

Language Model ◽

Language Modeling ◽

Language Models ◽

Low Rank ◽

Model Parameters ◽

Context Aware ◽

Context Vector ◽

Additional Input ◽

Different Types ◽

Powerful Mechanism

A context-aware language model uses location, user and/or domain metadata (context) to adapt its predictions. In neural language models, context information is typically represented as an embedding and it is given to the RNN as an additional input, which has been shown to be useful in many applications. We introduce a more powerful mechanism for using context to adapt an RNN by letting the context vector control a low-rank transformation of the recurrent layer weight matrix. Experiments show that allowing a greater fraction of the model parameters to be adjusted has benefits in terms of perplexity and classification for several different types of context.
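The low-rank adaptation idea in this abstract can be illustrated with a minimal sketch. It is an assumption-laden illustration, not the authors' implementation: the names `adapted_weights`, `Z_left`, and `Z_right`, and the exact factorization (a context-weighted sum of left and right factors whose product is added to the base recurrent matrix), are hypothetical choices made here to show how a context vector could control a rank-limited update.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mix(context, factors):
    """Context-weighted sum of k equally shaped matrices (one per context dimension)."""
    rows, cols = len(factors[0]), len(factors[0][0])
    return [[sum(c * F[i][j] for c, F in zip(context, factors))
             for j in range(cols)] for i in range(rows)]

def adapted_weights(W, Z_left, Z_right, context):
    """Return W plus a low-rank update controlled by the context vector.

    Assumed shapes (hypothetical): W is n_in x n_out, Z_left holds k matrices
    of size n_in x r, Z_right holds k matrices of size r x n_out, and context
    has k entries. The update has rank at most r.
    """
    update = matmul(mix(context, Z_left), mix(context, Z_right))
    return [[w + u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, update)]

# Toy example with k = 2 context dimensions and a rank-1 update.
W = [[1, 0], [0, 1], [0, 0]]
Z_left = [[[1], [0], [0]], [[0], [1], [0]]]
Z_right = [[[1, 0]], [[0, 1]]]
W_adapted = adapted_weights(W, Z_left, Z_right, [2.0, 3.0])
W_unchanged = adapted_weights(W, Z_left, Z_right, [0.0, 0.0])
```

In this sketch a zero context vector leaves the base weights untouched, while a nonzero context shifts every entry of the matrix through only `r` directions, which is what keeps the number of context-dependent parameters small.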

Generating Sentences by Editing Prototypes

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00030 ◽

2018 ◽

Vol 6 ◽

pp. 437-450 ◽

Cited By ~ 10

Author(s):

Kelvin Guu ◽

Tatsunori B. Hashimoto ◽

Yonatan Oren ◽

Percy Liang

Keyword(s):

Language Model ◽

Language Modeling ◽

Language Models ◽

Training Corpus ◽

Human Evaluation ◽

Sentence Level ◽

Sentence Similarity ◽

Traditional Language ◽

Generative Language

We propose a new generative language model for sentences that first samples a prototype sentence from the training corpus and then edits it into a new sentence. Compared to traditional language models that generate from scratch either left-to-right or by first sampling a latent sentence vector, our prototype-then-edit model improves perplexity on language modeling and generates higher quality outputs according to human evaluation. Furthermore, the model gives rise to a latent edit vector that captures interpretable semantics such as sentence similarity and sentence-level analogies.

MSA Transformer

10.1101/2021.02.12.430858 ◽

2021 ◽

Author(s):

Roshan Rao ◽

Jason Liu ◽

Robert Verkuil ◽

Joshua Meier ◽

John F. Canny ◽

...

Keyword(s):

Structure Learning ◽

State Of The Art ◽

Language Model ◽

Language Modeling ◽

Language Models ◽

Multiple Sequence ◽

Wide Margin ◽

Current State ◽

Individual Sequences ◽

And Function

Unsupervised protein language models trained across millions of diverse sequences learn structure and function of proteins. Protein language models studied to date have been trained to perform inference from individual sequences. The longstanding approach in computational biology has been to make inferences from a family of evolutionarily related sequences by fitting a model to each family independently. In this work we combine the two paradigms. We introduce a protein language model which takes as input a set of sequences in the form of a multiple sequence alignment. The model interleaves row and column attention across the input sequences and is trained with a variant of the masked language modeling objective across many protein families. The performance of the model surpasses current state-of-the-art unsupervised structure learning methods by a wide margin, with far greater parameter efficiency than prior state-of-the-art protein language models.

Dynamic Language Models for Streaming Text

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00175 ◽

2014 ◽

Vol 2 ◽

pp. 181-192 ◽

Cited By ~ 6

Author(s):

Dani Yogatama ◽

Chong Wang ◽

Bryan R. Routledge ◽

Noah A. Smith ◽

Eric P. Xing

Keyword(s):

Social Media ◽

Temporal Dynamics ◽

Language Model ◽

Language Modeling ◽

Streaming Data ◽

Language Models ◽

Linguistic Context ◽

Text Data ◽

Competing Models ◽

Context Features

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.

Which Sentence Embeddings and Which Layers Encode Syntactic Structure?

10.31234/osf.io/9jsnz ◽

2020 ◽

Author(s):

M. Alex Kelly ◽

Yang Xu ◽

Jesús Calvillo ◽

David Reitter

Keyword(s):

Syntactic Structure ◽

Dimensional Space ◽

Language Model ◽

Empirical Support ◽

Language Models ◽

Double Object ◽

Different Types ◽

Prepositional Object ◽

High Degree ◽

Shed Light

Recent models of language have eliminated syntactic-semantic dividing lines. We explore the psycholinguistic implications of this development by comparing different types of sentence embeddings in their ability to encode syntactic constructions. Our study uses contrasting sentence structures known to cause syntactic priming effects, that is, the tendency in humans to repeat sentence structures after recent exposure. We compare how syntactic alternatives are captured by sentence embeddings produced by a neural language model (BERT) or by the composition of word embeddings (BEAGLE, HHM, GloVe). Dative double object vs. prepositional object and active vs. passive sentences are separable in the high-dimensional space of the sentence embeddings and can be classified with a high degree of accuracy. The results lend empirical support to the modern, computational, integrated accounts of semantics and syntax, and they shed light on the information stored at different layers in deep language models such as BERT.

Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models

Applied Sciences ◽

10.3390/app11051974 ◽

2021 ◽

Vol 11 (5) ◽

pp. 1974 ◽

Cited By ~ 1

Author(s):

Chanhee Lee ◽

Kisu Yang ◽

Taesun Whang ◽

Chanjun Park ◽

Andrew Matteson ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Language Model ◽

Language Modeling ◽

Language Models ◽

Low Resource ◽

High Resource ◽

Cross Lingual ◽

Data Efficiency

Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data efficient training of pretrained language models. It is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.

Rare Words: A Major Problem for Contextualized Embeddings and How to Fix it by Attentive Mimicking

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6403 ◽

2020 ◽

Vol 34 (05) ◽

pp. 8766-8774 ◽

Cited By ~ 1

Author(s):

Timo Schick ◽

Hinrich Schütze

Keyword(s):

Neural Network ◽

Natural Language Processing ◽

Language Processing ◽

Deep Neural Network ◽

Language Model ◽

Language Modeling ◽

Fine Tuning ◽

Language Models ◽

Network Architectures ◽

Semantic Properties

Pretraining deep neural network architectures with a language modeling objective has brought large improvements for many natural language processing tasks. Exemplified by BERT, a recently proposed such architecture, we demonstrate that despite being trained on huge amounts of data, deep language models still struggle to understand rare words. To fix this problem, we adapt Attentive Mimicking, a method that was designed to explicitly learn embeddings for rare words, to deep language models. In order to make this possible, we introduce one-token approximation, a procedure that enables us to use Attentive Mimicking even when the underlying language model uses subword-based tokenization, i.e., it does not assign embeddings to all words. To evaluate our method, we create a novel dataset that tests the ability of language models to capture semantic properties of words without any task-specific fine-tuning. Using this dataset, we show that adding our adapted version of Attentive Mimicking to BERT does substantially improve its understanding of rare words.

Neural Lattice Language Models

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00036 ◽

2018 ◽

Vol 6 ◽

pp. 529-541 ◽

Cited By ~ 3

Author(s):

Jacob Buckman ◽

Graham Neubig

Keyword(s):

Information Flow ◽

Language Model ◽

Language Modeling ◽

Language Models ◽

Model Experiments ◽

Chinese Model ◽

Word Level ◽

Modeling Paradigm ◽

Multiple Granularities ◽

Linguistic Intuitions

In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions — including polysemy and the existence of multiword lexical items — into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.

Character n-Gram Embeddings to Improve RNN Language Models

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015074 ◽

2019 ◽

Vol 33 ◽

pp. 5074-5082 ◽

Cited By ~ 2

Author(s):

Sho Takase ◽

Jun Suzuki ◽

Masaaki Nagata

Keyword(s):

Neural Network ◽

Machine Translation ◽

Recurrent Neural Network ◽

Language Model ◽

Language Modeling ◽

Word Embedding ◽

Experimental Results ◽

Language Models ◽

Word Embeddings ◽

N Gram

This paper proposes a novel Recurrent Neural Network (RNN) language model that takes advantage of character information. We focus on character n-grams based on research in the field of word embedding construction (Wieting et al. 2016). Our proposed method constructs word embeddings from character n-gram embeddings and combines them with ordinary word embeddings. We demonstrate that the proposed method achieves the best perplexities on the language modeling datasets: Penn Treebank, WikiText-2, and WikiText-103. Moreover, we conduct experiments on application tasks: machine translation and headline generation. The experimental results indicate that our proposed method also positively affects these tasks.

Using Morphological Data in Language Modeling for Serbian Large Vocabulary Speech Recognition

Computational Intelligence and Neuroscience ◽

10.1155/2019/5072918 ◽

2019 ◽

Vol 2019 ◽

pp. 1-8 ◽

Cited By ~ 4

Author(s):

Edvin Pakoci ◽

Branislav Popović ◽

Darko Pekar

Keyword(s):

Speech Recognition ◽

Language Model ◽

Recognition System ◽

Language Modeling ◽

Error Rates ◽

Language Models ◽

Morphological Data ◽

Semantic Features ◽

Automatic Speech Recognition System ◽

Large Vocabulary

Serbian is in a group of highly inflective and morphologically rich languages that use a lot of different word suffixes to express different grammatical, syntactic, or semantic features. This kind of behaviour usually produces a lot of recognition errors, especially in large vocabulary systems—even when, due to good acoustical matching, the correct lemma is predicted by the automatic speech recognition system, often a wrong word ending occurs, which is nevertheless counted as an error. This effect is larger for contexts not present in the language model training corpus. In this manuscript, an approach which takes into account different morphological categories of words for language modeling is examined, and the benefits in terms of word error rates and perplexities are presented. These categories include word type, word case, grammatical number, and gender, and they were all assigned to words in the system vocabulary, where applicable. These additional word features helped to produce significant improvements in relation to the baseline system, both for n-gram-based and neural network-based language models. The proposed system can help overcome a lot of tedious errors in a large vocabulary system, for example, for dictation, both for Serbian and for other languages with similar characteristics.

Surface Chemical Heterogeneity of Low Rank Coal Characterized by Micro-FTIR and Its Correlation with Hydrophobicity

Minerals ◽

10.3390/min11030239 ◽

2021 ◽

Vol 11 (3) ◽

pp. 239

Author(s):

Wei Wang ◽

Long Liang ◽

Yaoli Peng ◽

Maria Holuszko

Keyword(s):

Surface Chemistry ◽

Functional Group ◽

Group Composition ◽

Contact Angles ◽

Low Rank ◽

Surface Chemical ◽

Carbonyl Groups ◽

Low Rank Coal ◽

Different Types ◽

Aliphatic Carbon

Micro-Fourier transform infrared (micro-FTIR) spectroscopy was used to correlate the surface chemistry of low rank coal with hydrophobicity. Six square areas without mineral impurities on low rank coal surfaces were selected as testing areas. A specially-designed methodology was applied to conduct micro-FTIR measurements and contact angle tests on the same testing area. A series of semi-quantitative functional group ratios derived from micro-FTIR spectra were correlated with contact angles, and the determination coefficients of linear regression were calculated and compared in order to identify the structure of the functional group ratios. Finally, two semi-quantitative ratios composed of aliphatic carbon hydrogen, aromatic carbon hydrogen and two different types of carbonyl groups were proposed as indicators of low rank coal hydrophobicity. This work provided a rapid way to predict low rank coal hydrophobicity through its functional group composition and helped us understand the hydrophobicity heterogeneity of low rank coal from the perspective of its surface chemistry.
