Using knowledge to improve N-Gram language modelling through the MGGI methodology

Author(s):  
Enrique Vidal ◽  
David Llorens

2016 ◽
Author(s):  
Louis Onrust ◽  
Antal van den Bosch ◽  
Hugo Van hamme

2012 ◽  
pp. 84-88
Author(s):  
James O’Sullivan

While not quite a neologism at this point, the term “digital humanities” for some still bears a significant measure of ambiguity. What separates the digital humanities from the humanities? Throughout this article, I will attempt to offer some clarity on this separation, outlining what it is that makes the digital humanities digital. The field of scholarship now recognised as the digital humanities has not always held this particular mantle. Initially, this emerging discipline was referred to as “humanities computing”, a term that gathered momentum as early as the late ’70s, evidence for which can be found in a quick n-gram of Google Books. N-grams offer an approach to probabilistic language modelling that can be used for a variety of purposes; in this case, to identify the frequency of a sequence of words in a set of texts. Google Ngram Viewer is not a scholarly tool appropriate for research, but it is ...
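To make the n-gram frequency idea concrete, here is a minimal sketch of the counting operation behind tools like the Google Ngram Viewer. The corpus and the phrase being counted are illustrative placeholders, not data from the article:

```python
# Minimal n-gram frequency counting over a small collection of texts.
from collections import Counter

def ngram_counts(texts, n):
    """Count every n-word sequence across a collection of texts."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy corpus, standing in for a set of scanned books.
corpus = [
    "humanities computing gave way to the digital humanities",
    "the digital humanities grew out of humanities computing",
]
bigrams = ngram_counts(corpus, 2)
print(bigrams[("humanities", "computing")])  # -> 2
```

Plotting such counts per publication year is all an n-gram viewer does; the modelling itself involves no probability estimation until the counts are normalised.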


2020 ◽  
Author(s):  
Jose Juan Almagro Armenteros ◽  
Alexander Rosenberg Johansen ◽  
Ole Winther ◽  
Henrik Nielsen

Abstract
Motivation: Language modelling (LM) on biological sequences is an emergent topic in the field of bioinformatics. Current research has shown that language modelling of proteins can create context-dependent representations that can be applied to improve performance on different protein prediction tasks. However, little effort has been directed towards analyzing the properties of the datasets used to train language models. Additionally, only the performance on cherry-picked downstream tasks is used to assess the capacity of LMs.
Results: We analyze the entire UniProt database and investigate the different properties that can bias or hinder the performance of LMs, such as homology, domain of origin, quality of the data, and completeness of the sequence. We evaluate n-gram and Recurrent Neural Network (RNN) LMs to assess the impact of these properties on performance. To our knowledge, this is the first protein dataset with an emphasis on language modelling. Our inclusion of properties specific to proteins gives a detailed analysis of how well natural language processing methods work on biological sequences. We find that organism domain and quality of data have an impact on the performance, while the completeness of the proteins has little influence. The RNN-based LM can learn to model Bacteria, Eukarya, and Archaea, but struggles with Viruses. By using the LM we can also generate novel proteins that are shown to be similar to real proteins.
Availability and implementation: https://github.com/alrojo/UniLanguage
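As a hedged sketch of the kind of n-gram baseline the authors evaluate, the following builds a character-level trigram model over amino-acid sequences and scores a sequence by perplexity. The add-k smoothing, padding scheme, and toy sequences are assumptions for the demo; the paper's exact smoothing, data splits, and UniProt preprocessing are not reproduced here:

```python
# Character-level trigram LM over amino-acid alphabets with add-k smoothing.
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_trigram(sequences, k=0.1):
    tri, bi = Counter(), Counter()
    for seq in sequences:
        padded = "^^" + seq + "$"          # start padding and end symbol
        for i in range(len(padded) - 2):
            tri[padded[i:i + 3]] += 1      # trigram counts
            bi[padded[i:i + 2]] += 1       # context (bigram) counts
    vocab = len(AMINO_ACIDS) + 1           # 20 residues plus end symbol
    def prob(ctx, ch):
        return (tri[ctx + ch] + k) / (bi[ctx] + k * vocab)
    return prob

def perplexity(prob, seq):
    padded = "^^" + seq + "$"
    logp = sum(math.log(prob(padded[i:i + 2], padded[i + 2]))
               for i in range(len(padded) - 2))
    return math.exp(-logp / (len(padded) - 2))

model = train_trigram(["MKTAYIAKQR", "MKLVINGKTL"])  # toy "training set"
print(perplexity(model, "MKTVING"))
```

A lower perplexity on held-out sequences indicates a better fit, which is how properties such as organism domain or data quality can be compared across models.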


2014 ◽  
Vol 102 (1) ◽  
pp. 81-92 ◽  
Author(s):  
Baltescu Paul ◽  
Blunsom Phil ◽  
Hoang Hieu

Abstract
This paper presents an open source implementation of a neural language model for machine translation. Neural language models deal with the problem of data sparsity by learning distributed representations for words in a continuous vector space. The language modelling probabilities are estimated by projecting a word's context into the same space as the word representations and by assigning probabilities according to the similarity between the word representations and the context's projection. Neural language models are notoriously slow to train and test. Our framework is designed with scalability in mind and provides two optional techniques for reducing the computational cost: the so-called class decomposition trick and a training algorithm based on noise contrastive estimation. Our models may be extended to incorporate direct n-gram features to learn weights for every n-gram in the training data. Our framework comes with wrappers for the cdec and Moses translation toolkits, allowing our language models to be incorporated as normalized features in their decoders (inside the beam search).
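The core scoring idea can be illustrated with a log-bilinear-style sketch (not the authors' OxLM code): each context word's vector is projected into the word-embedding space, the projections are summed, and the next-word distribution is a softmax over similarity scores. All dimensions, the random initialisation, and the variable names are assumptions for the demo:

```python
# Log-bilinear-style next-word scoring: project context, score by similarity.
import numpy as np

rng = np.random.default_rng(0)
V, D, N = 1000, 64, 3            # vocab size, embedding dim, context length

R = rng.normal(size=(V, D))      # output word representations
Q = rng.normal(size=(V, D))      # context word representations
C = rng.normal(size=(N, D, D))   # one projection matrix per context position
b = np.zeros(V)                  # per-word bias

def next_word_probs(context_ids):
    # Predicted representation: sum of position-specific projections.
    p = sum(C[i] @ Q[w] for i, w in enumerate(context_ids))
    scores = R @ p + b           # similarity of every word to the projection
    scores -= scores.max()       # numerical stability before exponentiation
    e = np.exp(scores)
    return e / e.sum()

probs = next_word_probs([12, 7, 42])
print(probs.shape, probs.sum())  # (1000,) 1.0
```

The softmax over the full vocabulary is the expensive step; the class decomposition and noise contrastive estimation mentioned in the abstract are two ways of avoiding that full normalisation during training.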


Author(s):  
Md. Riazur Rahman ◽  
Md. Tarek Habib ◽  
Md. Sadekur Rahman ◽  
Gazi Zahirul Islam ◽  
Md. Abbas Ali Khan

N-gram based language models are very popular and extensively used statistical methods for solving various natural language processing problems, including grammar checking. Smoothing is one of the most effective techniques used in building a language model to deal with the data sparsity problem. Kneser-Ney is one of the most prominent and successful smoothing techniques for language modelling. In our previous work, we presented a Witten-Bell smoothing based language modelling technique for checking the grammatical correctness of Bangla sentences, which showed promising results, outperforming previous methods. In this work, we propose an improved method using a Kneser-Ney smoothing based n-gram language model for grammar checking and perform a comparative performance analysis between the Kneser-Ney and Witten-Bell smoothing techniques for the same purpose. We also provide an improved technique for calculating the optimum threshold, which further enhances the results. Our experimental results show that Kneser-Ney outperforms Witten-Bell as a smoothing technique when used with n-gram LMs for checking the grammatical correctness of Bangla sentences.
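The following is a minimal sketch of interpolated Kneser-Ney for bigrams, the smoothing the paper favours, together with a simple threshold-based grammaticality check. The toy corpus, the discount d, and the threshold value are illustrative assumptions; the authors' optimum-threshold computation is not reproduced here:

```python
# Interpolated Kneser-Ney bigram probabilities with a fixed discount d.
from collections import Counter

def kneser_ney_bigram(sentences, d=0.75):
    bigrams = Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        bigrams.update(zip(toks, toks[1:]))
    unigram_ctx = Counter()      # c(v): how often v occurs as a context
    continuations = Counter()    # N1+(. w): distinct left contexts of w
    followers = Counter()        # N1+(v .): distinct right contexts of v
    for (v, w), c in bigrams.items():
        unigram_ctx[v] += c
        continuations[w] += 1
        followers[v] += 1
    total_types = len(bigrams)   # N1+(. .): number of distinct bigram types
    def prob(v, w):
        p_cont = continuations[w] / total_types   # continuation probability
        if unigram_ctx[v] == 0:
            return p_cont                          # unseen context: back off
        discounted = max(bigrams[(v, w)] - d, 0) / unigram_ctx[v]
        backoff = d * followers[v] / unigram_ctx[v]
        return discounted + backoff * p_cont
    return prob

p = kneser_ney_bigram(["ami bhat khai", "ami school e jai"])  # toy corpus
# Flag a sentence as ungrammatical if any bigram falls below a threshold.
threshold = 0.05                 # assumed value; the paper tunes this
toks = ["<s>", "ami", "bhat", "khai", "</s>"]
print(all(p(v, w) > threshold for v, w in zip(toks, toks[1:])))  # True
```

The continuation term is what distinguishes Kneser-Ney from Witten-Bell: a word is rewarded for appearing after many distinct contexts, not merely for being frequent, which tends to give better low-count estimates in sparse settings such as Bangla text.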


Author(s):  
Vitaly Kuznetsov ◽  
Hank Liao ◽  
Mehryar Mohri ◽  
Michael Riley ◽  
Brian Roark

Author(s):  
Saad Irtza ◽  
Vidhyasaharan Sethu ◽  
Sarith Fernando ◽  
Eliathamby Ambikairajah ◽  
Haizhou Li

2020 ◽  
Author(s):  
Grant P. Strimel ◽  
Ariya Rastrow ◽  
Gautam Tiwari ◽  
Adrien Piérard ◽  
Jon Webb

PLoS ONE ◽  
2020 ◽  
Vol 15 (3) ◽  
pp. e0229963 ◽  
Author(s):  
Ignat Drozdov ◽  
Daniel Forbes ◽  
Benjamin Szubert ◽  
Mark Hall ◽  
Chris Carlin ◽  
...  
Keyword(s):  
X Ray ◽  
