Using knowledge to improve N-Gram language modelling through the MGGI methodology

Author(s):  
Enrique Vidal ◽  
David Llorens

2016 ◽
Author(s):  
Louis Onrust ◽  
Antal van den Bosch ◽  
Hugo Van hamme

2012 ◽  
pp. 84-88
Author(s):  
James O’Sullivan

While not quite a neologism at this point, the term “digital humanities” for some still bears a significant measure of ambiguity. What separates the digital humanities from the humanities? Throughout this article, I will attempt to offer some clarity on this separation, outlining what it is that makes the digital humanities digital. The field of scholarship now recognised as the digital humanities has not always held this particular mantle. Initially, this emerging discipline was referred to as “humanities computing”, a term that gathered momentum as early as the late ’70s, evidence for which can be found in a quick n-gram of Google Books. N-grams offer an approach to probabilistic language modelling that can be used for a variety of purposes; in this case, to identify the frequency of a sequence of words in a set of texts. Google Ngram Viewer is not a scholarly tool appropriate for research, but it is ...
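To make the n-gram frequency idea concrete, here is a minimal sketch of the counting operation behind tools like the Google Ngram Viewer. The corpus and the phrase being counted are illustrative placeholders, not data from the article:

```python
# Minimal n-gram frequency counting over a small collection of texts.
from collections import Counter

def ngram_counts(texts, n):
    """Count every n-word sequence across a collection of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, standing in for a set of scanned books.
corpus = [
    "humanities computing gave way to the digital humanities",
    "the digital humanities grew out of humanities computing",
]
bigrams = ngram_counts(corpus, 2)
print(bigrams[("humanities", "computing")])  # -> 2
```

Plotting such counts per publication year is all an n-gram viewer does; the modelling itself involves no probability estimation until the counts are normalised.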


2020 ◽  
Author(s):  
Jose Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Ole Winther ◽  
Henrik Nielsen

Abstract
Motivation: Language modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that can be applied to improve performance on different protein prediction tasks. However, little effort has been directed towards analyzing the properties of the datasets used to train language models. Additionally, only the performance on cherry-picked downstream tasks is used to assess the capacity of LMs.
Results: We analyze the entire UniProt database and investigate the different properties that can bias or hinder the performance of LMs, such as homology, domain of origin, quality of the data, and completeness of the sequence. We evaluate n-gram and Recurrent Neural Network (RNN) LMs to assess the impact of these properties on performance. To our knowledge, this is the first protein dataset with an emphasis on language modelling. Our inclusion of properties specific to proteins gives a detailed analysis of how well natural language processing methods work on biological sequences. We find that organism domain and quality of data have an impact on the performance, while the completeness of the proteins has little influence. The RNN-based LM can learn to model Bacteria, Eukarya, and Archaea, but struggles with Viruses. By using the LM we can also generate novel proteins that are shown to be similar to real proteins.
Availability and implementation: https://github.com/alrojo/UniLanguage
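As a hedged sketch of the kind of n-gram baseline the authors evaluate, the following builds a character-level trigram model over amino-acid sequences and scores a sequence by perplexity. The add-k smoothing, padding scheme, and toy sequences are assumptions for the demo; the paper's exact smoothing, data splits, and UniProt preprocessing are not reproduced here:

```python
# Character-level trigram LM over amino-acid alphabets with add-k smoothing.
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_trigram(sequences, k=0.1):
    tri, bi = Counter(), Counter()
    for seq in sequences:
        padded = "^^" + seq + "$"          # start padding and end symbol
        for i in range(len(padded) - 2):
            tri[padded[i:i + 3]] += 1      # trigram counts
            bi[padded[i:i + 2]] += 1       # context (bigram) counts
    vocab = len(AMINO_ACIDS) + 1           # 20 residues plus end symbol
    def prob(ctx, ch):
        return (tri[ctx + ch] + k) / (bi[ctx] + k * vocab)
    return prob

def perplexity(prob, seq):
    padded = "^^" + seq + "$"
    logp = sum(math.log(prob(padded[i:i + 2], padded[i + 2]))
               for i in range(len(padded) - 2))
    return math.exp(-logp / (len(padded) - 2))

model = train_trigram(["MKTAYIAKQR", "MKLVINGKTL"])  # toy "training set"
print(perplexity(model, "MKTVING"))
```

A lower perplexity on held-out sequences indicates a better fit, which is how properties such as organism domain or data quality can be compared across models.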


2014 ◽  
Vol 102 (1) ◽  
pp. 81-92 ◽  
Author(s):  
Baltescu Paul ◽  
Blunsom Phil ◽  
Hoang Hieu

Abstract
This paper presents an open source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word's context into the same space as the word representations and by assigning probabilities according to the similarity between the word representations and the context's projection. Neural language models are notoriously slow to train and test. Our framework is designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation. Our models may be extended to incorporate direct n-gram features to learn weights for every n-gram in the training data. Our framework comes with wrappers for the cdec and Moses translation toolkits, allowing our language models to be incorporated as normalized features in their decoders (inside the beam search).
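The core scoring idea can be illustrated with a log-bilinear-style sketch (not the authors' OxLM code): each context word's vector is projected into the word-embedding space, the projections are summed, and the next-word distribution is a softmax over similarity scores. All dimensions, the random initialisation, and the variable names are assumptions for the demo:

```python
# Log-bilinear-style next-word scoring: project context, score by similarity.
import numpy as np

rng = np.random.default_rng(0)
V, D, N = 1000, 64, 3            # vocab size, embedding dim, context length

R = rng.normal(size=(V, D))      # output word representations
Q = rng.normal(size=(V, D))      # context word representations
C = rng.normal(size=(N, D, D))   # one projection matrix per context position
b = np.zeros(V)                  # per-word bias

def next_word_probs(context_ids):
    # Predicted representation: sum of position-specific projections.
    p = sum(C[i] @ Q[w] for i, w in enumerate(context_ids))
    scores = R @ p + b           # similarity of every word to the projection
    scores -= scores.max()       # numerical stability before exponentiation
    e = np.exp(scores)
    return e / e.sum()

probs = next_word_probs([12, 7, 42])
print(probs.shape, probs.sum())  # (1000,) 1.0
```

The softmax over the full vocabulary is the expensive step; the class decomposition and noise contrastive estimation mentioned in the abstract are two ways of avoiding that full normalisation during training.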


Author(s):  
Md. Riazur Rahman ◽  
Md. Tarek Habib ◽  
Md. Sadekur Rahman ◽  
Gazi Zahirul Islam ◽  
Md. Abbas Ali Khan

N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results, outperforming previous methods. In this work, we propose an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and perform a comparative performance analysis between the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.
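The following is a minimal sketch of interpolated Kneser-Ney for bigrams, the smoothing the paper favours, together with a simple threshold-based grammaticality check. The toy corpus, the discount d, and the threshold value are illustrative assumptions; the authors' optimum-threshold computation is not reproduced here:

```python
# Interpolated Kneser-Ney bigram probabilities with a fixed discount d.
from collections import Counter

def kneser_ney_bigram(sentences, d=0.75):
    bigrams = Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        bigrams.update(zip(toks, toks[1:]))
    unigram_ctx = Counter()      # c(v): how often v occurs as a context
    continuations = Counter()    # N1+(. w): distinct left contexts of w
    followers = Counter()        # N1+(v .): distinct right contexts of v
    for (v, w), c in bigrams.items():
        unigram_ctx[v] += c
        continuations[w] += 1
        followers[v] += 1
    total_types = len(bigrams)   # N1+(. .): number of distinct bigram types
    def prob(v, w):
        p_cont = continuations[w] / total_types   # continuation probability
        if unigram_ctx[v] == 0:
            return p_cont                          # unseen context: back off
        discounted = max(bigrams[(v, w)] - d, 0) / unigram_ctx[v]
        backoff = d * followers[v] / unigram_ctx[v]
        return discounted + backoff * p_cont
    return prob

p = kneser_ney_bigram(["ami bhat khai", "ami school e jai"])  # toy corpus
# Flag a sentence as ungrammatical if any bigram falls below a threshold.
threshold = 0.05                 # assumed value; the paper tunes this
toks = ["<s>", "ami", "bhat", "khai", "</s>"]
print(all(p(v, w) > threshold for v, w in zip(toks, toks[1:])))  # True
```

The continuation term is what distinguishes Kneser-Ney from Witten-Bell: a word is rewarded for appearing after many distinct contexts, not merely for being frequent, which tends to give better low-count estimates in sparse settings such as Bangla text.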


Author(s):  
Vitaly Kuznetsov ◽  
Hank Liao ◽  
Mehryar Mohri ◽  
Michael Riley ◽  
Brian Roark

Author(s):  
Saad Irtza ◽  
Vidhyasaharan Sethu ◽  
Sarith Fernando ◽  
Eliathamby Ambikairajah ◽  
Haizhou Li

2020 ◽  
Author(s):  
Grant P. Strimel ◽  
Ariya Rastrow ◽  
Gautam Tiwari ◽  
Adrien Piérard ◽  
Jon Webb

PLoS ONE ◽  
2020 ◽  
Vol 15 (3) ◽  
pp. e0229963 ◽  
Author(s):  
Ignat Drozdov ◽  
Daniel Forbes ◽  
Benjamin Szubert ◽  
Mark Hall ◽  
Chris Carlin ◽  
...  
Keyword(s):  
X Ray ◽  
