Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

Mike Kestemont; Jeroen De Gussem

doi:10.46298/jdmdh.1398

Integrated Sequence Tagging for Medieval Latin Using Deep Representation Learning

Journal of Data Mining & Digital Humanities ◽

10.46298/jdmdh.1398 ◽

2017 ◽

Vol Special Issue on... (Towards a Digital Ecosystem:...) ◽

Author(s):

Mike Kestemont ◽

Jeroen De Gussem

Keyword(s):

Network Architecture ◽

Integrated Approach ◽

Representation Learning ◽

Context Aware ◽

Neural Network Architecture ◽

Medieval Latin ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Pos Tagger ◽

Speech Tagging

In this paper we consider two sequence tagging tasks for medieval Latin: part-of-speech tagging and lemmatization. Both are basic yet foundational preprocessing steps in applications such as text re-use detection. Nevertheless, they are generally complicated by the considerable orthographic variation that is typical of medieval Latin. In Digital Classics, these tasks are traditionally solved in a (i) cascaded and (ii) lexicon-dependent fashion. For example, a lexicon is used to generate all the potential lemma-tag pairs for a token, and a context-aware PoS tagger is then used to select the most appropriate lemma-tag pair. Apart from the problems with out-of-lexicon items, error percolation is a major downside of such approaches. In this paper we explore the possibility of solving these tasks elegantly using a single, integrated approach. For this, we make use of a layered neural network architecture from the field of deep representation learning.
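The cascaded, lexicon-dependent setup described in this abstract can be illustrated with a minimal sketch. This is not the authors' system: the toy lexicon, the tag-bigram scores, and the function name `cascaded_tag` are all hypothetical and serve only to show where out-of-lexicon items cause trouble.

```python
# Hypothetical toy lexicon: token -> candidate (lemma, tag) pairs.
LEXICON = {
    "amor": [("amor", "NOUN"), ("amo", "VERB")],
    "vincit": [("vinco", "VERB")],
    "omnia": [("omnis", "ADJ")],
    "mihi": [("ego", "PRON")],
}

# Hypothetical tag-bigram preferences standing in for a context-aware tagger.
BIGRAM_SCORE = {
    ("<s>", "NOUN"): 2.0,
    ("<s>", "VERB"): 1.0,
    ("NOUN", "VERB"): 2.0,
    ("VERB", "ADJ"): 1.0,
}


def cascaded_tag(tokens):
    """Step 1: look up candidate pairs. Step 2: pick one using the previous tag."""
    previous_tag = "<s>"
    output = []
    for token in tokens:
        candidates = LEXICON.get(token.lower())
        if not candidates:
            # Out-of-lexicon item: the second step has nothing to choose from.
            output.append((token, None, None))
            previous_tag = "UNK"
            continue
        lemma, tag = max(
            candidates,
            key=lambda pair: BIGRAM_SCORE.get((previous_tag, pair[1]), 0.0),
        )
        output.append((token, lemma, tag))
        previous_tag = tag
    return output


in_lexicon = cascaded_tag(["Amor", "vincit", "omnia"])
# "michi" is a medieval spelling of "mihi" and is missing from the toy lexicon.
variant = cascaded_tag(["michi"])
```

In this sketch the spelling variant receives no lemma or tag at all, and the unknown previous tag also weakens the choice for the following token, which is one simple way to picture the error percolation the abstract mentions.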


Discontinuous Constituent Parsing with Pointer Networks

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6275 ◽

2020 ◽

Vol 34 (05) ◽

pp. 7724-7731

Author(s):

Daniel Fernández-González ◽

Carlos Gómez-Rodríguez

Keyword(s):

Computational Linguistics ◽

Network Architecture ◽

Neural Network Architecture ◽

Wide Margin ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Dependency Trees ◽

Syntactic Relations ◽

Dependency Structures ◽

Speech Tagging

Discontinuous constituent trees are among the most complex syntactic representations used in computational linguistics and NLP, and they are crucial for representing all grammatical phenomena of languages such as German. Recent advances in dependency parsing have shown that Pointer Networks excel at efficiently parsing syntactic relations between words in a sentence. This kind of sequence-to-sequence model achieves outstanding accuracy in building non-projective dependency trees, but its potential has not yet been proven on a more difficult task. We propose a novel neural network architecture that, by means of Pointer Networks, is able to generate the most accurate discontinuous constituent representations to date, even without the need for part-of-speech tagging information. To do so, we internally model discontinuous constituent structures as augmented non-projective dependency structures. The proposed approach achieves state-of-the-art results on the two widely used NEGRA and TIGER benchmarks, outperforming previous work by a wide margin.


Deep Neural Network Architecture for Part-of-Speech Tagging for Turkish Language

2018 3rd International Conference on Computer Science and Engineering (UBMK) ◽

10.1109/ubmk.2018.8566272 ◽

2018 ◽

Cited By ~ 1

Author(s):

Cenk Anil Bahcevan ◽

Emirhan Kutlu ◽

Tugba Yildiz

Keyword(s):

Neural Network ◽

Network Architecture ◽

Deep Neural Network ◽

Neural Network Architecture ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Turkish Language ◽

Speech Tagging


The Transformer Neural Network Architecture for Part-of-Speech Tagging

2021 IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (ElConRus) ◽

10.1109/elconrus51938.2021.9396231 ◽

2021 ◽

Author(s):

Artem A. Maksutov ◽

Vladimir I. Zamyatovskiy ◽

Viacheslav O. Morozov ◽

Sviatoslav O. Dmitriev

Keyword(s):

Neural Network ◽

Network Architecture ◽

Neural Network Architecture ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging


An Integrated Approach to Chinese Word Segmentation and Part-of-Speech Tagging

Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead - Lecture Notes in Computer Science ◽

10.1007/11940098_31 ◽

2006 ◽

pp. 299-309

Author(s):

Maosong Sun ◽

Dongliang Xu ◽

Benjamin K. Tsou ◽

Huaming Lu

Keyword(s):

Integrated Approach ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging


A Template-Based Approach for Tagging Non-Vocalized Arabic Nouns

Academic Journal of Research and Scientific Publishing ◽

10.52132/ajrsp.e.2021.32.1 ◽

2021 ◽

Vol 3 (32) ◽

pp. 05-35

Author(s):

Hashem Alsharif

Keyword(s):

Linear Part ◽

Arabic Language ◽

Arabic Text ◽

Rule Based ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Pos Tagger ◽

Log Linear ◽

Speech Tagging

There exist no corpora of Arabic nouns. Furthermore, in any Arabic text, nouns can be found in different forms. Tagging the nouns in an Arabic text also makes it possible to determine whether each sentence starts with a noun or a verb. Part-of-speech (POS) tagging is the task of labeling each word in a sentence with its appropriate category, which is called a tag (Noun, Verb, and Article). In this thesis, we attempt to tag non-vocalized Arabic text. The proposed POS tagger for Arabic text is based on searching for each word of the text in our lists of verbs and articles. Nouns are found by eliminating verbs and articles: our hypothesis states that if a word in the text is not found in our lists, then it is a noun. These comparisons are made for each word in the text until all of them have been tagged. To apply our method, we prepared a list of articles and verbs in the Arabic language with a total of 112 million verbs and articles combined, which is used in our comparisons to prove our hypothesis. To evaluate the proposed method, we used 78,245 pre-tagged words from "The Quranic Arabic Corpus" and compared our template-based tagging approach with AraMorph, a rule-based tagging approach, and the Stanford Log-linear Part-Of-Speech Tagger. AraMorph produced 40% correctly tagged words and the Stanford Log-linear Part-Of-Speech Tagger produced 68%, while our method produced 68,501 correctly tagged words (88%).
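The elimination rule in this abstract (a word found in neither list is tagged as a noun) can be sketched in a few lines. The word lists below are tiny hypothetical stand-ins written in Latin transliteration, not the lists prepared by the author, and the function names are illustrative only.

```python
# Hypothetical miniature lists; the abstract's real lists are far larger.
VERBS = {"kataba", "qala"}
ARTICLES = {"fi", "min"}


def tag_by_elimination(words):
    """Tag each word as Verb or Article by list lookup, otherwise as Noun."""
    tagged = []
    for word in words:
        if word in VERBS:
            tagged.append((word, "Verb"))
        elif word in ARTICLES:
            tagged.append((word, "Article"))
        else:
            # The hypothesis from the abstract: not in any list means Noun.
            tagged.append((word, "Noun"))
    return tagged


def accuracy(correct, total):
    """Share of correctly tagged words."""
    return correct / total


example = tag_by_elimination(["kataba", "al-walad", "fi", "al-kitab"])
# Figures reported in the abstract: 68,501 correct out of 78,245 words.
reported = accuracy(68501, 78245)
```

With the reported counts, `reported` is about 0.875, which matches the 88% in the abstract after rounding.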


Building Balinese Part-of-Speech Tagger Using Hidden Markov Model (HMM)

JELIKU (Jurnal Elektronik Ilmu Komputer Udayana) ◽

10.24843/jlk.2020.v09.i02.p18 ◽

2020 ◽

Vol 9 (2) ◽

pp. 303

Author(s):

I Gde Made Hendra Pradiptha ◽

Ngurah Agus Sanjaya ER

Keyword(s):

Markov Model ◽

Hidden Markov Model ◽

Hidden Markov ◽

Probabilistic Approach ◽

Word Class ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Fast Processing ◽

Pos Tagger ◽

Speech Tagging

Part-of-speech (POS) tagging, or word class labeling, is the process of labeling each word in a sentence with its word class. Previous research on POS taggers, especially for Indonesian, has used various approaches and obtained high accuracy values. However, few researchers have built a POS tagger for Balinese. In this article, we build a POS tagger for Balinese using a probabilistic approach, specifically the Hidden Markov Model (HMM). The HMM was selected to deal with ambiguity because it gives higher accuracy and fast processing time. We used k-fold cross-validation (with k = 10) and a tagged corpus of around 3,669 tokens with 21 tags. Based on the experiments conducted, the HMM method obtained an accuracy of 68.56%.
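As a rough illustration of HMM tagging in general, the sketch below estimates transition and emission counts from a tagged corpus and decodes with the Viterbi algorithm. It is not the authors' implementation: the add-one smoothing, the English toy sentences, and all function names are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_hmm(tagged_sentences):
    """Collect transition and emission counts from sentences of (word, tag) pairs."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    vocabulary = set()
    for sentence in tagged_sentences:
        previous = "<s>"
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocabulary.add(word)
            previous = tag
    return transitions, emissions, vocabulary


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (the smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, transitions, emissions, vocabulary):
    """Return the most probable tag sequence for the given words."""
    if not words:
        return []
    tags = sorted(emissions)
    n_words = len(vocabulary) + 1  # one extra outcome for unseen words
    # best[i][tag] = (log score of the best path ending in tag, backpointer)
    best = [{}]
    for tag in tags:
        score = log_prob(transitions["<s>"], tag, len(tags))
        score += log_prob(emissions[tag], words[0], n_words)
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            prev_tag, score = max(
                ((p, best[i - 1][p][0] + log_prob(transitions[p], tag, len(tags)))
                 for p in tags),
                key=lambda item: item[1],
            )
            score += log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = (score, prev_tag)
    # Follow the backpointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


# Hypothetical toy corpus (English placeholders, not Balinese data).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
model = train_hmm(corpus)
predicted = viterbi(["the", "cat", "barks"], *model)
```

Working in log space avoids numerical underflow on longer sentences, which is why the scores are summed rather than multiplied.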


Part of speech tagging for Arabic

Natural Language Engineering ◽

10.1017/s1351324911000325 ◽

2011 ◽

Vol 18 (4) ◽

pp. 521-548 ◽

Cited By ~ 8

Author(s):

Sandra Kübler ◽

Emad Mohamed

Keyword(s):

Computational Linguistics ◽

Automatic Segmentation ◽

Data Sparseness ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Novel Approach ◽

Pos Tagger ◽

Whole Word ◽

Speech Tagging

This paper presents an investigation of part of speech (POS) tagging for Arabic as it occurs naturally, i.e. unvocalized text (without diacritics). We also do not assume any prior tokenization, although this was used previously as a basis for POS tagging. Arabic is a morphologically complex language, i.e. there is a high number of inflections per word; and the tagset is larger than the typical tagset for English. Both factors, the second one being partly dependent on the first, increase the number of word/tag combinations for which the POS tagger needs to find estimates, and thus they contribute to data sparseness. We present a novel approach to Arabic POS tagging that does not require any pre-processing, such as segmentation or tokenization: whole word tagging. In this approach, the complete word is assigned a complex POS tag, which includes morphological information. A competing approach investigates the effect of segmentation and vocalization on POS tagging to alleviate data sparseness and ambiguity. In the segmentation-based approach, we first automatically segment words and then POS tag the segments. The complex tagset encompasses 993 POS tags, whereas the segment-based tagset encompasses only 139 tags. However, segments are also more ambiguous, thus there are more possible combinations of segment tags. In realistic situations, in which we have no information about segmentation or vocalization, whole word tagging reaches the highest accuracy of 94.74%. If gold standard segmentation or vocalization is available, including this information improves POS tagging accuracy. However, while our automatic segmentation and vocalization modules reach state-of-the-art performance, their performance is not reliable enough for POS tagging and actually impairs POS tagging performance. Finally, we investigate whether a reduction of the complex tagset to the Extra-Reduced Tagset as suggested by Habash and Rambow (Habash, N., and Rambow, O. 2005. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), Ann Arbor, MI, USA, pp. 573–80) will alleviate the data sparseness problem. While the POS tagging accuracy increases due to the smaller tagset, a closer look shows that using a complex tagset for POS tagging and then converting the resulting annotation to the smaller tagset results in a higher accuracy than tagging using the smaller tagset directly.
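The final step described in this abstract, tagging with a complex tagset and then converting the output to a smaller tagset before scoring, can be sketched as follows. The tag format (core category followed by `+`-separated features), the example tags, and the helper names are hypothetical; they do not reproduce the actual tagsets used in the paper.

```python
def reduce_tag(complex_tag):
    """Map a complex tag to a reduced tag by keeping only the core category.

    Assumes a hypothetical format such as "NOUN+NOM+DEF".
    """
    return complex_tag.split("+")[0]


def accuracy(predicted, gold):
    """Share of positions where the predicted tag equals the gold tag."""
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


# Hypothetical gold and predicted complex tags for a four-word sentence.
gold_complex = ["NOUN+NOM+DEF", "VERB+PERF+3MS", "PREP", "NOUN+GEN+INDEF"]
pred_complex = ["NOUN+ACC+DEF", "VERB+PERF+3MS", "PREP", "NOUN+GEN+INDEF"]

complex_accuracy = accuracy(pred_complex, gold_complex)
reduced_accuracy = accuracy(
    [reduce_tag(tag) for tag in pred_complex],
    [reduce_tag(tag) for tag in gold_complex],
)
```

In this toy case the only error is in a morphological feature, so it disappears after conversion. The sketch shows the conversion step only; it does not compare against a tagger trained directly on the smaller tagset, which is the comparison the abstract reports.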


Punjabi Pos Tagger: Rule Based and HMM

International Journal of Advanced Research in Computer Science and Software Engineering ◽

10.23956/ijarcsse/v7i7/0106 ◽

2017 ◽

Vol 7 (7) ◽

pp. 193

Author(s):

Umrinderpal Singh ◽

Vishal Goyal

Keyword(s):

Information Retrieval ◽

Language Processing ◽

State Of The Art ◽

Input Word ◽

Rule Based ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Unseen Data ◽

Pos Tagger ◽

Speech Tagging

A part-of-speech tagger assigns a tag to every input word in a given sentence. The tags may include the different parts of speech of a particular language, such as noun, pronoun, verb, adjective, and conjunction, and each of these may have subcategories. Part-of-speech tagging is a basic preprocessing task in most Natural Language Processing (NLP) applications, such as Information Retrieval, Machine Translation, and Grammar Checking. The task belongs to a larger set of problems, namely sequence labeling problems. Part-of-speech tagging for Punjabi is not widely explored territory. We discuss rule-based and HMM-based part-of-speech taggers for Punjabi and compare the accuracies of the two approaches. The system is developed using 35 different standard part-of-speech tags. We evaluate our system on unseen data, with a state-of-the-art accuracy of 93.3%.


Learning syntactic tagging of Macedonian language

Computer Science and Information Systems ◽

10.2298/csis180310027b ◽

2018 ◽

Vol 15 (3) ◽

pp. 799-820

Author(s):

Martin Bonchanoski ◽

Katerina Zdravkova

Keyword(s):

Dynamic Features ◽

Slavic Languages ◽

Learning Framework ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Guided Learning ◽

Initial Stage ◽

Dependency Network ◽

Pos Tagger ◽

Speech Tagging

This paper presents the creation of machine-learning-based systems for part-of-speech tagging of the Macedonian language. Four well-known PoS tagger systems implemented for English and Slavic languages were trained: TnT, cyclic dependency network, guided learning framework for bidirectional sequence classification, and dynamic features induction. Orwell's novel "1984" was manually tagged by the authors and split into training and test sets. After training, the models were compared. In the end, a PoS tagger with an accuracy that reaches 97.5% was achieved, making it very appropriate for the future grammatical tagging of the National corpus of Macedonian language, which is currently in its initial stage. The part-of-speech tagger that was created is published online and is free to use.


Part of Speech Tagging for Ancient Greek

Open Linguistics ◽

10.1515/opli-2016-0020 ◽

2016 ◽

Vol 2 (1) ◽

Cited By ~ 2

Author(s):

Giuseppe G. A. Celano ◽

Gregory Crane ◽

Saeed Majidi

Keyword(s):

Cross Validation ◽

T Test ◽

The Other ◽

Accuracy Score ◽

Ancient Greek ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Pos Tagger ◽

Speech Tagging ◽

Fold Cross Validation

In this article we report the results for five POS taggers, i.e., the Mate tagger, the Hunpos tagger, RFTagger, the OpenNLP tagger, and the NLTK Unigram tagger, tested on the data of the Ancient Greek Dependency Treebank. This is done in order to find the most efficient POS tagger to use for pre-annotation of new treebank data. A corrected 1-run 10-fold cross validation t test shows that the Mate tagger outperforms all the other taggers, with an accuracy score of 88%.
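The evaluation protocol mentioned here, 10-fold cross-validation of a tagger, can be pictured with a pure-Python unigram baseline. This is a stand-in written for illustration, not the NLTK tagger used in the article; it splits folds at the token level rather than the sentence level (an assumption), uses hypothetical toy data, and leaves out the corrected t test.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_tokens):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def k_fold_accuracy(tagged_tokens, k=10, default_tag=None):
    """Return one accuracy score per fold; unseen words get default_tag."""
    scores = []
    for fold in range(k):
        # Every k-th token, starting at index `fold`, forms the test fold.
        test = tagged_tokens[fold::k]
        train = [item for i, item in enumerate(tagged_tokens) if i % k != fold]
        model = train_unigram(train)
        correct = sum(model.get(word, default_tag) == tag for word, tag in test)
        scores.append(correct / len(test))
    return scores


# Hypothetical toy data: 20 tokens, so each of the 10 folds tests 2 tokens.
toy_tokens = [("a", "X"), ("b", "Y")] * 10
fold_scores = k_fold_accuracy(toy_tokens, k=10)
mean_accuracy = sum(fold_scores) / len(fold_scores)
```

Keeping the per-fold scores, rather than only their mean, is what makes a paired significance test across taggers possible afterwards.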
