Resource Building and Parts-of-Speech (POS) Tagging for the Mizo Language

Author(s):  
Partha Pakray ◽  
Arunagshu Pal ◽  
Goutam Majumder ◽  
Alexander Gelbukh
Author(s):  
Carlos Eduardo Silva ◽  
Lincoln Fernandes

This paper describes COPA-TRAD Version 2.0, a parallel corpus-based system developed at the Universidade Federal de Santa Catarina (UFSC) for translation research, teaching and practice. COPA-TRAD enables the user to investigate the practices of professional translators by identifying translational patterns related to a particular element or linguistic pattern. In addition, the system allows for the comparison between human translation and automatic translation provided by three well-known machine translation systems available on the Internet (Google Translate, Microsoft Translator and Yandex). Currently, COPA-TRAD incorporates five subcorpora (Children's Literature, Literary Texts, Meta-Discourse in Translation, Subtitles and Legal Texts) and provides the following tools: parallel concordancer, monolingual concordancer, wordlist and a DIY Tool that enables the user to create his own parallel disposable corpus. The system also provides a POS-tagging tool interface to analyze and classify the parts of speech of a text.


Author(s):  
FATEMA N. JULIA ◽  
KHAN M. IFTEKHARUDDIN ◽  
ATIQ U. ISLAM

Dialog act (DA) classification is useful for understanding the intentions of a human speaker. Effective DA classification can be exploited for realistic implementation of expert systems. In this work, we investigate DA classification using both acoustic and discourse information for HCRC MapTask data. We extract several acoustic features and exploit them using a Hidden Markov Model (HMM) network to classify acoustic information. For discourse feature extraction, we propose a novel parts-of-speech (POS) tagging technique that effectively reduces the dimensionality of discourse features. To classify discourse information, we use two classifiers, an HMM and a Support Vector Machine (SVM). We further fuse the HMM and SVM classifiers to improve discourse classification. Finally, we perform an efficient decision-level classifier fusion of acoustic and discourse information to classify 12 different DAs in MapTask data. We obtain 65.2% and 55.4% DA classification rates using acoustic and discourse information, respectively. Furthermore, we obtain a combined accuracy of 68.6% for DA classification using both acoustic and discourse information. These accuracy rates are comparable to or better than previously reported results for the same data set. For average precision and recall, we obtain rates of 74.89% and 69.83%, respectively. We therefore obtain much better precision and recall rates for most of the classified DAs than existing works on the same HCRC MapTask data set.
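The abstract does not state which rule is used for decision-level fusion, so the following is only a minimal sketch under an assumed rule: each classifier emits a score per dialog act, and the fused decision is the label with the highest weighted average. The function name `fuse_decisions`, the weight parameter, the labels and the scores are all illustrative assumptions, not values from the paper.

```python
from typing import Dict


def fuse_decisions(
    acoustic: Dict[str, float],
    discourse: Dict[str, float],
    acoustic_weight: float = 0.5,
) -> str:
    """Return the label with the highest weighted-average score.

    Assumed fusion rule (not taken from the paper): a convex
    combination of the two classifiers' per-label scores.
    """
    if not 0.0 <= acoustic_weight <= 1.0:
        raise ValueError("acoustic_weight must lie in [0, 1]")
    labels = set(acoustic) | set(discourse)
    fused = {
        label: acoustic_weight * acoustic.get(label, 0.0)
        + (1.0 - acoustic_weight) * discourse.get(label, 0.0)
        for label in labels
    }
    # Sort first so that ties are broken deterministically by label name.
    return max(sorted(fused), key=lambda label: fused[label])


# Hypothetical per-label scores for one utterance.
acoustic_scores = {"acknowledge": 0.5, "instruct": 0.3, "query_yn": 0.2}
discourse_scores = {"acknowledge": 0.2, "instruct": 0.6, "query_yn": 0.2}

equal_weight_label = fuse_decisions(acoustic_scores, discourse_scores)
acoustic_heavy_label = fuse_decisions(acoustic_scores, discourse_scores, 0.9)
```

With equal weights the discourse evidence wins in this toy case; shifting most of the weight to the acoustic classifier flips the decision, which is the kind of trade-off a fusion weight would control.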


2019 ◽  
Vol 8 (2S11) ◽  
pp. 2468-2471

Sentiment analysis is one of the leading areas of research. This paper proposes a model for the description of verbs that provides a structure for developing sentiment analysis. Verbs are very significant language elements and receive the attention of linguistic researchers. The text is processed with parts-of-speech (POS) tagging. With the help of a POS tagger, the verbs from each sentence are extracted to show the difference in sentiment analysis values. The work includes performing POS tagging to obtain verbs and applying TextBlob and VADER to find the semantic orientation and mine the opinion from movie reviews. We achieved interesting results, which were assessed for accuracy with and without verb forms. The findings show that accuracy increases when verbs are considered along with emotion words. This introduces a new strategy for classifying online reviews using components of algorithms for parts of speech.
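The with-verbs versus without-verbs comparison can be illustrated with a small sketch. It does not use TextBlob or VADER: it swaps in a hypothetical four-word lexicon and a hand-tagged sentence so that it runs with the standard library alone. The lexicon values, the tags and the function name `polarity` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

TaggedToken = Tuple[str, str]  # (word, Penn Treebank-style POS tag)

# Hypothetical toy lexicon; real tools ship their own scored vocabularies.
TOY_LEXICON: Dict[str, float] = {
    "love": 0.8,
    "brilliant": 0.9,
    "dull": -0.6,
    "hate": -0.8,
}


def polarity(tokens: List[TaggedToken], keep_verbs: bool = True) -> float:
    """Mean lexicon score over the scored tokens, optionally skipping verbs."""
    scores = [
        TOY_LEXICON[word.lower()]
        for word, tag in tokens
        if word.lower() in TOY_LEXICON
        and (keep_verbs or not tag.startswith("VB"))
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hand-tagged example sentence (tags assigned manually, not by a tagger).
review = [
    ("I", "PRP"),
    ("love", "VBP"),
    ("this", "DT"),
    ("dull", "JJ"),
    ("movie", "NN"),
]

with_verbs = polarity(review, keep_verbs=True)
without_verbs = polarity(review, keep_verbs=False)
```

In this toy sentence the verb carries the only positive cue, so dropping verbs turns a mildly positive score into a negative one.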


Author(s):  
Raul D. Karimov

This article dwells upon automatic PoS-tagging of Old Norse by computational means, including machine learning. It analyzes the available language material in diachrony from the standpoint of how language evolution might have affected the quality of automatic PoS-tagging. This article further describes the phonetic traits that have presumably led to classification errors. The research material is an Old Norwegian educational text titled Konungs skuggsjá, or “King’s Mirror”, vectorized by the moving average method and then used to train an Ada-Boosted random forest model. The resulting classification accuracy is about 97%. However, being non-contextual, this vectorization method enables no complete differentiation of morphologically similar parts of speech: verbs, nouns, adjectives, and adverbs. This becomes evident when digging into the identified high-weight classification features, each being a vectoral dimension corresponding to a specific alphabet character; another indicative factor comprises Morfessor-identified high-rank morphs, analyzing which reveals the morphogrammatic units that cause the most classification errors. Historical consideration of these morphs shows that their collision is due to them being inherited from Proto-Germanic (PG) while undergoing rhotacism, or conversion from PG /z/ to ON /r/. However, the same process effectively prevents the collision of rhotacized finite verbal forms with the genitive case that inherits the PG suffix -s. The key finding is that, such morphological collision being unavoidable, character-based vectorization might not suffice when using a small training set or when trying to classify not only by parts of speech, but also by specific forms in the paradigm.


2021 ◽  
Vol 11 (4) ◽  
pp. 1-13
Author(s):  
Arpitha Swamy ◽  
Srinath S.

Parts-of-speech (POS) tagging is a method used to assign a POS tag to every word in a text, and named entity recognition (NER) is a process that identifies the proper nouns in a text and classifies them into predefined categories. A POS tagger and an NER system for Kannada text have been proposed utilizing conditional random fields (CRFs). The dataset used for POS tagging consists of 147K tokens, of which 103K tokens are used for training and the remaining tokens are used for testing. The proposed CRF model for POS tagging of Kannada text obtained 91.3% precision, 91.6% recall, and a 91.4% F-score. To develop the NER system for Kannada, the required data was created manually using a modified tag set containing 40 labels. The dataset used for the NER system consists of 16.5K tokens, of which 70% are used for training the model and the remaining 30% are used for testing. The developed NER model obtained 94% precision, 93.9% recall, and a 93.9% F1-measure.
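As a quick consistency check on the figures above, the F-score can be recomputed as the harmonic mean of precision and recall. This is a sketch that assumes the reported scores are computed from the aggregate precision and recall; the paper may instead average per-tag scores, in which case the numbers need not match exactly. The function name `f_score` is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, in percent.
pos_f = round(f_score(91.3, 91.6), 1)   # POS tagger
ner_f = round(f_score(94.0, 93.9), 1)   # NER system

# Size of the POS test split implied by the abstract, in thousands of tokens,
# and the share of tokens used for training, in percent.
pos_test_tokens_k = 147 - 103
pos_train_share = round(103 / 147 * 100)

print(pos_f, ner_f, pos_test_tokens_k, pos_train_share)
```

Under that assumption the recomputed values agree with the reported 91.4% and 93.9%, and the POS split works out to about 44K test tokens, or roughly a 70/30 division like the one stated for the NER data.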


2021 ◽  
Vol 11 (1) ◽  
pp. 110
Author(s):  
Anastasia Hannas Putri ◽  
Usmi Usmi

This paper discusses the syntactic function of the word jeongmal (정말) analyzed through part of speech (품사). In the Korean language, a word can be classified into more than one part of speech, which is called pumsa tongyong (품사통용) or conversion. An example of pumsa tongyong is the word jeongmal, which can function as an adverb, noun, and interjection. This research shows that such classification gives the word jeongmal different functions in a syntactic unit: main component (predicate), attributive component (adverb), and independent component. In addition, limitations of the word jeongmal as a noun and a POS-tagging error in classifying the word jeongmal as an interjection in the corpus were found. The syntactic function of the word jeongmal is important to understand because jeongmal is a basic vocabulary item with a high frequency of use, both spoken and written. This research uses a quantitative-qualitative method, analyzing a corpus (corpus-based analysis), the 21st Century Sejong Corpora (21세기 세종 말뭉치). POS-tagged written and spoken corpus data is used as the data source of this research.


2022 ◽  
Vol 14 (1) ◽  
pp. 0-0

POS (Parts of Speech) tagging, a vital step in diverse Natural Language Processing (NLP) tasks, has not drawn much attention in the case of Odia, a computationally under-developed language. The proposed hybrid method suggests a robust POS tagger for Odia. Given the rich morphology of the language and the unavailability of a sufficient annotated text corpus, a combination of machine learning and linguistic rules is adopted in building the tagger. The tagger is trained on a tagged text corpus from the tourism domain and obtains a perceptible improvement in results. Appreciable performance is also observed for news article texts from varied domains. In experiments on Odia, the proposed algorithm outperforms existing methods such as rule-based, hidden Markov model (HMM), maximum entropy (ME) and conditional random field (CRF) taggers.


2013 ◽  
Vol 347-350 ◽  
pp. 2836-2840 ◽  
Author(s):  
Shao Hong Yin ◽  
Gui Dan Fan

Part of speech carries important grammatical information, so marking the words in a sentence with their parts of speech is of great significance for natural language understanding. POS tagging rules can be mined effectively with statistical methods and with rule-based methods, but the tagging accuracy needs to be improved. This paper presents a POS tagging rule mining algorithm that combines statistical methods and rules in order to improve tagging accuracy.


2018 ◽  
Vol 8 (7) ◽  
pp. 1
Author(s):  
Ruzana Omar ◽  
Sarah Yusoff ◽  
Radzuwan Ab Rashid ◽  
Azweed Mohamad ◽  
Kamariah Yunus

One of the components of learning English is grammar, and an intrinsic part of it is Parts of Speech (PoS), whose use in sentences the majority of Malaysian students in higher institutions still struggle to understand. This study compares the effectiveness of the conventional method and the e-learning method in the teaching and learning of PoS. Stanford PoS tagging was applied to analyze the PoS of every word in sentences extracted from articles in The New Straits Times Online (NST Online). This quantitative study adopted a comparative analysis of its findings. The results were analyzed using the Statistical Package for the Social Sciences (SPSS). The findings reveal a significant difference between the scores of students using E-paper and the scores of students not using E-paper in learning grammar. An independent t-test was carried out to compare the means of the two groups. The result shows a significant difference (p-value = 0.007, t = -2.774) between the two groups of students' scores. The mean performance of the students using E-paper is higher than that of those not using E-paper. As students nowadays spend most of their time with electronic gadgets, this is an innovative way to capture their interest in spending more time on quality reading materials via an electronic newspaper while simultaneously learning grammar by identifying the PoS of each word in sentences using the new pedagogical strategy of PoS tagging.
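For readers unfamiliar with the independent t-test mentioned above, the statistic can be sketched in a few lines. The abstract does not say whether equal variances were assumed, so this sketch assumes the classic pooled-variance (Student) form; the function name `pooled_t_test` and the two score lists are hypothetical and are not the study's data.

```python
import math
import statistics
from typing import Sequence, Tuple


def pooled_t_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t statistic, degrees of freedom) for two independent samples.

    Assumes equal population variances (Student's t-test); the sign of t
    depends on the order in which the groups are passed.
    """
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    df = n_a + n_b - 2
    # Pooled sample variance, weighted by each group's degrees of freedom.
    pooled_var = (
        (n_a - 1) * statistics.variance(a) + (n_b - 1) * statistics.variance(b)
    ) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t_stat = (statistics.mean(a) - statistics.mean(b)) / std_err
    return t_stat, df


# Hypothetical scores, invented purely to exercise the function.
scores_without_epaper = [1, 2, 3, 4, 5]
scores_with_epaper = [2, 3, 4, 5, 6]

t_stat, df = pooled_t_test(scores_without_epaper, scores_with_epaper)
```

A negative t here simply means the first group's mean is lower than the second's. Obtaining the p-value additionally requires the t distribution with `df` degrees of freedom, which statistical packages such as SPSS compute directly.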

