Improving Basic Natural Language Processing Tools for the Ainu Language

Karol Nowakowski; Michal Ptaszynski; Fumito Masui; Yoshio Momouchi

doi:10.3390/info10110329

Information ◽

10.3390/info10110329 ◽

2019 ◽

Vol 10 (11) ◽

pp. 329

Author(s):

Karol Nowakowski ◽

Michal Ptaszynski ◽

Fumito Masui ◽

Yoshio Momouchi

Keyword(s):

Language Processing ◽

Positive Impact ◽

Word Segmentation ◽

Data Set ◽

Part Of Speech Tagging ◽

Endangered Language ◽

Part Of Speech ◽

Overall Performance ◽

Speech Tagging ◽

Northern Japan

Ainu is a critically endangered language spoken by the native inhabitants of northern Japan. This paper describes our research aimed at the development of technology for automatic processing of text in Ainu. In particular, we improved the existing tools for normalizing old transcriptions, word segmentation, and part-of-speech tagging. In the experiments we applied two Ainu language dictionaries from different domains (literary and colloquial) and created a new data set by combining them. The experiments revealed that expanding the lexicon had a positive impact on the overall performance of our tools, especially with test data unrelated to any of the training sets used.
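
The effect of lexicon size on word segmentation can be illustrated with a minimal sketch. It assumes a generic greedy longest-match segmenter, which is not necessarily the algorithm used by the authors' tools, and the function name `segment` and the toy English lexicons are hypothetical stand-ins for real dictionary data.

```python
def segment(text, lexicon, max_len=None):
    """Greedy longest-match segmentation of an unspaced string.

    Illustrative only: at each position, take the longest lexicon entry
    that matches, or fall back to a single character if none does.
    """
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink towards one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


# Hypothetical toy lexicons: the second one extends the first.
small_lexicon = {"the", "cat"}
combined_lexicon = small_lexicon | {"sat"}

print(segment("thecatsat", small_lexicon))     # ['the', 'cat', 's', 'a', 't']
print(segment("thecatsat", combined_lexicon))  # ['the', 'cat', 'sat']
```

With the smaller lexicon the unknown word is broken into single characters, while the combined lexicon recovers it, which is the kind of gain one would expect from merging dictionaries.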

Aspect Based Sentiments from Tweets using Co-Ranking Multi-Modal Natural Language Processing Methodologies

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.e6305.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 1061-1068

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Detection System ◽

Word Segmentation ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Sentiment Detection ◽

Analysis System ◽

Speech Tagging

Nowadays, people spend much of their time on social sites, especially Twitter, posting many tweets every day. Many users rely on these tweets to learn about particular applications, products, and other search engine queries. The emotions and sentiments derived from posted tweets are used to gauge opinion about a particular event. Many traditional sentiment detection systems have been developed, but they fail to analyze huge volumes of tweets, and online content with temporal patterns is also difficult to analyze. To overcome these issues, a co-ranking multi-modal natural language processing based sentiment analysis system was developed to detect emotions from posted tweets. Initially, tweets about different events are collected from social sites and processed with natural language procedures such as stemming, lemmatization, part-of-speech tagging, word segmentation, and parsing to obtain the words used for deriving sentiments. A co-ranking process is then applied to the extracted emotions to obtain the opinion related to a particular event. The efficiency of the system is then examined through experimental results and discussion. The introduced system recognizes sentiments from tweets with 98.80% accuracy.

Part of Speech Tagging for Arabic Long Sentence

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i3.27.17671 ◽

2018 ◽

Vol 7 (3.27) ◽

pp. 125

Author(s):

Ahmed H. Aliwy ◽

Duaa A. Al_Raza

Keyword(s):

Language Processing ◽

Arabic Language ◽

Data Set ◽

English Sentence ◽

Suggested Approach ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

N Gram ◽

Speech Tagging

Part-of-speech (POS) tagging of Arabic words is a difficult and non-trivial task. It has been studied in detail over the last twenty years, and its performance affects many applications and tasks in natural language processing (NLP). Sentences in Arabic are very long compared with English sentences. This affects the tagging process for any approach that deals with a complete sentence at once, as the Hidden Markov Model (HMM) tagger does. In this paper, a new approach is suggested for using HMM and n-gram taggers to tag Arabic words in long sentences. The suggested approach is very simple and easy to implement. It is applied to a manually annotated data set of 1000 documents containing 526321 tokens (including punctuation). The results show that the suggested approach has higher accuracy than the HMM and n-gram taggers. The F-measures were 0.888, 0.925, and 0.957 for n-grams, HMM, and the suggested approach, respectively.
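
To make concrete what tagging "a complete sentence at once" means for an HMM tagger, here is a minimal Viterbi decoding sketch. It shows only the plain HMM baseline, not the suggested approach, whose details the abstract does not give. The tag set, the probabilities, and the names `viterbi`, `TAGS`, `START_P`, `TRANS_P`, and `EMIT_P` are all hypothetical toy values in English.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable tag sequence for a non-empty sentence (first-order HMM)."""

    def lp(p):
        # Log-probability with a small floor for unseen events (an assumption).
        return math.log(p if p > 0 else floor)

    # Each column maps tag -> (best log-score ending in that tag, backpointer).
    best = [{t: (lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)), None)
             for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[-1][p][0] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            column[t] = (score + lp(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Follow backpointers from the best final tag to the start.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Hypothetical toy model.
TAGS = ["N", "V"]
START_P = {"N": 0.8, "V": 0.2}
TRANS_P = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
EMIT_P = {"N": {"dogs": 0.6, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.7}}

print(viterbi(["dogs", "bark"], TAGS, START_P, TRANS_P, EMIT_P))  # ['N', 'V']
```

Because the decoder scores the whole sequence jointly, the table grows with sentence length, which is one reason very long sentences are a concern for this kind of tagger.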

A Feature-Enriched Neural Model for Joint Chinese Word Segmentation and Part-of-Speech Tagging

Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2017/553 ◽

2017 ◽

Cited By ~ 3

Author(s):

Xinchi Chen ◽

Xipeng Qiu ◽

Xuanjing Huang

Keyword(s):

Language Processing ◽

Neural Model ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Neural Network Models ◽

Long Distance ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Recently, neural network models for natural language processing tasks have attracted increasing attention for their ability to alleviate the burden of manual feature engineering. However, previous neural models cannot extract complicated feature compositions the way traditional methods with discrete features do. In this work, we propose a feature-enriched neural model for the joint Chinese word segmentation and part-of-speech tagging task. Specifically, to simulate the feature templates of traditional discrete feature based models, we use different filters to model complex compositional features with convolutional and pooling layers, and then utilize long distance dependency information with a recurrent layer. Experimental results on five different datasets show the effectiveness of our proposed model.

Learning from Disagreement: A Survey

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.12752 ◽

2021 ◽

Vol 72 ◽

pp. 1385-1470

Author(s):

Alexandra N. Uma ◽

Tommaso Fornaciari ◽

Dirk Hovy ◽

Silviu Paun ◽

Barbara Plank ◽

...

Keyword(s):

Language Processing ◽

Gold Standard ◽

Training Methods ◽

High Quality ◽

Training Models ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Growing Body ◽

Research Questions ◽

Speech Tagging

Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. All datasets and models employed in this paper are freely available as supplementary materials.
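
The difference between soft labels and aggregated labels can be shown with a small worked example. It is a sketch under stated assumptions, not the survey's own code: the helper names `soft_label`, `majority_label`, and `cross_entropy`, the class list, and the annotator votes are all hypothetical.

```python
import math


def soft_label(votes, classes):
    """Turn annotator votes into a probability distribution over classes."""
    return [votes.count(c) / len(votes) for c in classes]


def majority_label(votes, classes):
    """Aggregate votes into a one-hot vector for the most frequent class."""
    winner = max(classes, key=votes.count)
    return [1.0 if c == winner else 0.0 for c in classes]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Hypothetical item: five annotators disagree on a part-of-speech tag.
CLASSES = ["NOUN", "VERB", "ADJ"]
VOTES = ["NOUN", "NOUN", "VERB", "NOUN", "VERB"]

soft = soft_label(VOTES, CLASSES)      # roughly [0.6, 0.4, 0.0]
hard = majority_label(VOTES, CLASSES)  # [1.0, 0.0, 0.0]

calibrated = [0.6, 0.4, 0.0]           # prediction that mirrors the votes
overconfident = [0.99, 0.005, 0.005]   # prediction that ignores the disagreement

for name, pred in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(name,
          "soft loss:", round(cross_entropy(soft, pred), 3),
          "hard loss:", round(cross_entropy(hard, pred), 3))
```

Under the soft target the calibrated prediction has the lower loss, whereas the one-hot target rewards the overconfident prediction, so the two training signals push a model in different directions on items where annotators disagree.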

Part-of-Speech Tagging

The Oxford Handbook of Computational Linguistics 2nd edition ◽

10.1093/oxfordhb/9780199573691.013.51 ◽

2017 ◽

Author(s):

Dan Tufiș ◽

Radu Ion

Keyword(s):

Language Processing ◽

Machine Learning Techniques ◽

Computing Power ◽

Data Sparseness ◽

Part Of Speech Tagging ◽

Web Environment ◽

Part Of Speech ◽

Learning Techniques ◽

Lexical Disambiguation ◽

Speech Tagging

One of the fundamental tasks in natural-language processing is the morpho-lexical disambiguation of words occurring in text. Over the last twenty years or so, approaches to part-of-speech tagging based on machine learning techniques have been developed or ported to provide high-accuracy morpho-lexical annotation for an increasing number of languages. Due to recent increases in computing power, together with improvements in tagging technology and the extension of language typologies, part-of-speech tags have become significantly more complex. The need to address multilinguality more directly in the web environment has created a demand for interoperable, harmonized morpho-lexical descriptions across languages. Given the large number of morpho-lexical descriptors for a morphologically complex language, one has to consider ways to avoid the data sparseness threat in standard statistical tagging, yet ensure that full lexicon information is available for each word form in the output. The chapter overviews the current major approaches to part-of-speech tagging.

An Integrated Approach to Chinese Word Segmentation and Part-of-Speech Tagging

Computer Processing of Oriental Languages. Beyond the Orient: The Research Challenges Ahead - Lecture Notes in Computer Science ◽

10.1007/11940098_31 ◽

2006 ◽

pp. 299-309

Author(s):

Maosong Sun ◽

Dongliang Xu ◽

Benjamin K. Tsou ◽

Huaming Lu

Keyword(s):

Integrated Approach ◽

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Repairing errors for Chinese word segmentation and part-of-speech tagging

Proceedings. International Conference on Machine Learning and Cybernetics ◽

10.1109/icmlc.2002.1175365 ◽

2003 ◽

Author(s):

Tian-Fang Yao ◽

Wei Ding ◽

G. Erbach

Keyword(s):

Word Segmentation ◽

Chinese Word ◽

Chinese Word Segmentation ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation

Journal of the American Medical Informatics Association ◽

10.1136/amiajnl-2012-001453 ◽

2013 ◽

Vol 20 (5) ◽

pp. 931-939 ◽

Cited By ~ 16

Author(s):

Jeffrey P Ferraro ◽

Hal Daumé ◽

Scott L DuVall ◽

Wendy W Chapman ◽

Henk Harkema ◽

...

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Domain Adaptation ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Speech Tagging

PENENTUAN KELAS KATA PADA PART OF SPEECH TAGGING KATA AMBIGU BAHASA INDONESIA [Determining Word Class in Part-of-Speech Tagging of Ambiguous Indonesian Words]

JISKA (Jurnal Informatika Sunan Kalijaga) ◽

10.14421/jiska.2018.23-05 ◽

2018 ◽

Vol 2 (3) ◽

pp. 157

Author(s):

Ahmad Subhan Yazid ◽

Agung Fatwanto

Keyword(s):

Language Processing ◽

Word Class ◽

Rule Based ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Ambiguous Words ◽

Computer Science Faculty ◽

Speech Tagging ◽

Bahasa Indonesia

Indonesian plays a fundamental role in communication. Ambiguity is a problem in its machine learning implementation. In Natural Language Processing, Part of Speech (POS) tagging helps reduce this problem. This study uses a rule-based method to determine the best word class for ambiguous words in Indonesian. The research follows these stages: knowledge inventory, algorithm design, implementation, testing, analysis, and conclusions. The initial data is an Indonesian corpus developed by the Language department of the Computer Science Faculty, Indonesia University. The data is then processed and presented descriptively according to certain rules and specifications. The result is a POS tagging algorithm comprising 71 rules, expressed in flowchart and descriptive sentence notation. In testing, the algorithm successfully labeled 92 of 100 tested words (92%). The results of the implementation are influenced by the availability of rules, word class tagsets, and corpus data.

Improving Brill's tagger lexical and transformation rule for Afaan Oromo language

10.7287/peerj.preprints.1225v1 ◽

2015 ◽

Author(s):

Abraham G Ayana

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Transformation Rule ◽

Initial State ◽

Training Corpus ◽

Part Of Speech Tagging ◽

Pos Tagging ◽

Part Of Speech ◽

Speech Tagging

Natural Language Processing (NLP) refers to human-like language processing, which makes it a discipline within the field of Artificial Intelligence (AI). The ultimate goal of NLP research is to parse and understand language, which has not yet been fully achieved. For this reason, much research in NLP has focused on intermediate tasks that make sense of some of the structure inherent in language without requiring complete understanding. One such task is part-of-speech tagging, or simply tagging. The lack of a standard part-of-speech tagger for Afaan Oromo is a main obstacle for researchers in machine translation, spell checking, dictionary compilation, and automatic sentence parsing and construction. Even though several works have addressed POS tagging for Afaan Oromo, the performance of the tagger has not yet improved sufficiently. Hence, the aim of this thesis is to improve the lexical and transformation rules of Brill's tagger for Afaan Oromo POS tagging with a sufficiently large training corpus. Accordingly, Afaan Oromo literature on grammar and morphology is reviewed to understand the nature of the language and to identify possible tagsets. As a result, 26 broad tagsets were identified, and 17,473 words from around 1100 sentences, containing 6750 distinct words, were tagged for training and testing. Of these, 258 sentences are taken from previous work. Since only a few ready-made standard corpora exist, manually tagging the corpus for this work was challenging, and it is therefore recommended that a standard corpus be prepared. Transformation-based error-driven learning is adapted for Afaan Oromo part-of-speech tagging. Different experiments are conducted for the rule-based approach, taking 20% of the whole data for testing, and a comparison is made with the previously adapted Brill's tagger. The previously adapted Brill's tagger shows an accuracy of 80.08%, whereas the improved Brill's tagger shows an accuracy of 95.6%, an improvement of 15.52 percentage points. Hence, it is found that the size of the training corpus, the rule-generating system in the lexical rule learner, and, moreover, the use of an Afaan Oromo HMM tagger as the initial-state tagger have a significant effect on the improvement of the tagger.
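
The idea behind transformation-based tagging can be sketched as follows: an initial-state tagger assigns tags, and learned rules then rewrite tags in context. This is a minimal illustration under stated assumptions, not the thesis's implementation. The `Rule` template (previous-tag context only), the function names, and the English toy tags are hypothetical, and the rule is written by hand rather than learned.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    """Change from_tag to to_tag when the preceding tag is prev_tag."""
    from_tag: str
    to_tag: str
    prev_tag: str


def apply_rules(initial_tags, rules):
    """Apply each rule in order across the whole tag sequence."""
    tags = list(initial_tags)
    for rule in rules:
        # Changes take effect immediately, so a later position in the same
        # pass already sees tags rewritten earlier in that pass.
        for i in range(1, len(tags)):
            if tags[i] == rule.from_tag and tags[i - 1] == rule.prev_tag:
                tags[i] = rule.to_tag
    return tags


def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag matches the gold tag."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical example: tags for "they want to run".
initial = ["PRP", "VBP", "TO", "NN"]   # output of an initial-state tagger
gold = ["PRP", "VBP", "TO", "VB"]
rules = [Rule(from_tag="NN", to_tag="VB", prev_tag="TO")]

corrected = apply_rules(initial, rules)
print(corrected)                                           # ['PRP', 'VBP', 'TO', 'VB']
print(accuracy(initial, gold), accuracy(corrected, gold))  # 0.75 1.0
```

In error-driven learning, candidate rules of this kind are scored by how many tagging errors they remove on the training corpus, so a better initial-state tagger leaves fewer errors for the rules to fix.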
