Unsupervised Quality Estimation Model for English to German Translation and Its Application in Extensive Supervised Evaluation

2014 ◽  
Vol 2014 ◽  
pp. 1-12 ◽  
Author(s):  
Aaron L.-F. Han ◽  
Derek F. Wong ◽  
Lidia S. Chao ◽  
Liangye He ◽  
Yi Lu

With the rapid development of machine translation (MT), MT evaluation has become increasingly important for telling us in a timely manner whether an MT system is making progress. Conventional MT evaluation methods calculate the similarity between hypothesis translations produced by automatic translation systems and reference translations produced by professional translators. Existing evaluation metrics have several weaknesses. First, incomprehensive design factors lead to a language-bias problem: the metrics perform well on some language pairs but poorly on others. Second, they tend to use either no linguistic features or too many; using no linguistic features draws criticism from linguists, while using too many makes the model hard to reproduce. Third, reference translations are expensive to produce and sometimes unavailable in practice. In this paper, the authors propose an unsupervised MT evaluation metric that uses a universal part-of-speech tagset and does not rely on reference translations. The authors also explore the performance of the designed metric on traditional supervised evaluation tasks. Both the supervised and unsupervised experiments show that the designed methods yield higher correlation scores with human judgments.
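
The abstract does not spell out the scoring formula, but the core idea, comparing universal POS tag sequences of the source and the hypothesis so that no reference translation is needed, can be sketched as follows. The POS n-gram F-measure below is an illustrative assumption, not the paper's exact metric.

```python
# Hedged sketch: score a hypothesis against the *source* sentence by comparing
# universal POS n-gram overlap, so no reference translation is required.
# The exact scoring formula is not given in the abstract; this F-measure
# over POS n-grams is an illustrative assumption.
from collections import Counter

def pos_ngrams(tags, n):
    """All n-grams of a universal POS tag sequence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def pos_ngram_f1(src_tags, hyp_tags, n=2):
    """Harmonic mean of POS n-gram precision and recall between source and hypothesis."""
    src, hyp = pos_ngrams(src_tags, n), pos_ngrams(hyp_tags, n)
    overlap = sum((src & hyp).values())
    if not overlap:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(src.values())
    return 2 * precision * recall / (precision + recall)

# Example: German source vs. English hypothesis, both in the universal tagset.
src = ["DET", "NOUN", "VERB", "DET", "NOUN"]   # e.g. "Der Hund jagt die Katze"
hyp = ["DET", "NOUN", "VERB", "DET", "NOUN"]   # e.g. "The dog chases the cat"
print(pos_ngram_f1(src, hyp))  # 1.0 for identical tag sequences
```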

2008 ◽  
Vol 13 (4) ◽  
pp. 519-549 ◽  
Author(s):  
Paul Rayson

This paper reports the extension of the key words method for the comparison of corpora. Using automatic tagging software that assigns part-of-speech and semantic field (domain) tags, a method is described that permits the extraction of key domains by applying the keyness calculation to tag frequency lists. The combination of the key words and key domains methods is shown to allow macroscopic analysis (the study of the characteristics of whole texts or varieties of language) to inform the microscopic level (focussing on the use of a particular linguistic feature), thereby suggesting which linguistic features should be investigated further. The resulting ‘data-driven’ approach presented here combines elements of both the ‘corpus-based’ and ‘corpus-driven’ paradigms in corpus linguistics. A web-based tool, Wmatrix, implementing the proposed method is applied in a case study: the comparison of the UK 2001 general election manifestos of the Labour and Liberal Democrat parties.
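
The keyness statistic standardly used in this line of work is the log-likelihood ratio (Rayson & Garside, 2000); applied to a POS or semantic-tag frequency list rather than a word list, it yields key domains. A minimal sketch:

```python
# Log-likelihood keyness for one tag: flags tags over- or under-used in
# corpus 1 relative to corpus 2 when applied across a tag frequency list.
import math

def log_likelihood(o1, n1, o2, n2):
    """o1/o2: observed tag counts; n1/n2: total corpus sizes."""
    e1 = n1 * (o1 + o2) / (n1 + n2)   # expected count in corpus 1
    e2 = n2 * (o1 + o2) / (n1 + n2)   # expected count in corpus 2
    ll = 0.0
    if o1:
        ll += o1 * math.log(o1 / e1)
    if o2:
        ll += o2 * math.log(o2 / e2)
    return 2 * ll

# Example: a semantic domain tag occurring 120 times in a 50,000-word manifesto
# and 40 times in a 60,000-word one. LL > 6.63 is significant at p < 0.01.
print(round(log_likelihood(120, 50_000, 40, 60_000), 2))
```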


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Thien Nguyen ◽  
Huu Nguyen ◽  
Phuoc Tran

A powerful deep learning approach frees us from feature engineering in many artificial intelligence tasks: it can extract efficient representations from the input data, provided the data are large enough. Unfortunately, it is not always possible to collect large, high-quality data. For tasks in low-resource contexts, such as Russian ⟶ Vietnamese machine translation, insights into the data can compensate for their modest size. In this study of modelling Russian ⟶ Vietnamese translation, we leverage the input Russian words by decomposing them into not only features but also subfeatures. First, we break down a Russian word into a set of linguistic features: part-of-speech, morphology, dependency labels, and lemma. Second, the lemma feature is further divided into subfeatures labelled with tags corresponding to their positions in the lemma. Consistent with the source side, Vietnamese target sentences are represented as sequences of subtokens. Sublemma-based neural machine translation proves itself in our experiments on Russian-Vietnamese bilingual data collected from TED talks. The experimental results reveal that the proposed model outperforms the best available Russian ⟶ Vietnamese model by 0.97 BLEU. In addition, the automatic machine judgment of the experimental results is verified by human judgment. The proposed sublemma-based model provides an alternative to existing models when building translation systems from an inflectionally rich language, such as Russian, Czech, or Bulgarian, in low-resource contexts.
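
A hedged sketch of the input representation described above: each Russian word becomes a bundle of linguistic features, and its lemma is further split into position-tagged sublemmas. The segmentation scheme and the B/I/E positional tag inventory below are assumptions; the abstract specifies only "tags corresponding to their positions in the lemma".

```python
# Hedged sketch of the sublemma decomposition; the subword split and the
# B/I/E position tags are illustrative assumptions.
def sublemmas(lemma_pieces):
    """Tag each lemma piece with B (begin), I (inside), or E (end)."""
    if len(lemma_pieces) == 1:
        return [(lemma_pieces[0], "B")]
    tags = ["B"] + ["I"] * (len(lemma_pieces) - 2) + ["E"]
    return list(zip(lemma_pieces, tags))

# A source word as a feature bundle (all values are illustrative):
word = {
    "pos": "NOUN",
    "morph": "Case=Gen|Number=Sing",
    "dep": "nmod",
    "lemma_pieces": ["кни", "га"],   # hypothetical subword split of "книга"
}
features = [word["pos"], word["morph"], word["dep"]] + [
    f"{piece}/{tag}" for piece, tag in sublemmas(word["lemma_pieces"])
]
print(features)  # ['NOUN', 'Case=Gen|Number=Sing', 'nmod', 'кни/B', 'га/E']
```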


10.29007/gxv3 ◽  
2018 ◽  
Author(s):  
Shuyuan Cao ◽  
Iria Da-Cunha ◽  
Mikel Iruskieta

Spanish and Chinese are two very different languages at every linguistic level. Therefore, translation (both human and machine) between them, as well as learning one of them as a foreign language, are challenging tasks. Some automatic translation systems exist for this language pair, but there is still considerable room to improve the translation quality between Spanish and Chinese. In addition, accessible resources for studying and understanding this language pair, such as parallel corpora, remain scarce. In this paper, we present how we created a Spanish-Chinese parallel corpus designed for language learning and translation tasks at the discourse level. The corpus has been enriched automatically with part-of-speech (POS) tags, so that queries based on morpho-syntactic information can be carried out. We have made the parallel corpus available to the academic community.
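
To illustrate the kind of morpho-syntactic query such a corpus enables, here is a hedged sketch. The storage format is an assumption: each sentence is a list of (token, POS) pairs, aligned Spanish-Chinese at the sentence level.

```python
# Hedged sketch of a morpho-syntactic query over a POS-annotated parallel
# corpus; the data layout is an assumption, not the corpus's actual format.
def match_pos_pattern(tagged_sentence, pattern):
    """True if the sentence's POS sequence contains `pattern` as a contiguous run."""
    tags = [pos for _, pos in tagged_sentence]
    n = len(pattern)
    return any(tags[i:i + n] == pattern for i in range(len(tags) - n + 1))

corpus = [
    ([("la", "DET"), ("traducción", "NOUN"), ("automática", "ADJ")],
     [("机器", "NOUN"), ("翻译", "NOUN")]),
]
# Query: Spanish sentences containing a DET NOUN ADJ sequence, with their Chinese pairs.
hits = [(es, zh) for es, zh in corpus if match_pos_pattern(es, ["DET", "NOUN", "ADJ"])]
print(len(hits))  # 1
```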


2020 ◽  
Vol 2020 ◽  
pp. 1-10
Author(s):  
Thien Nguyen ◽  
Hoai Le ◽  
Van-Huy Pham

End-to-end neural machine translation does not require specialized knowledge of the language pair under investigation to build an effective system. On the other hand, feature engineering has proved vital in other artificial intelligence fields, such as speech recognition and computer vision. Inspired by work in those fields, in this paper we propose a novel feature-based translation model by modifying the state-of-the-art transformer model. Specifically, the encoder of the modified transformer takes as input combinations of linguistic features, comprising lemma, dependency label, part-of-speech tag, and morphological label, instead of source words. The experimental results for the Russian-Vietnamese language pair show that the proposed feature-based transformer model improves over the strongest baseline transformer translation model by an impressive 4.83 BLEU. In addition, the analysis reveals that human judgment of the translation results strongly confirms the machine judgment. Our model could be useful for building translation systems that translate from a highly inflectional language into a non-inflectional language.
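
A hedged sketch of the feature-based encoder input: each source token is a tuple (lemma, dependency label, POS, morphological label), each feature gets its own embedding table, and the embeddings are combined into one vector per token. Whether the combination is a sum or a concatenation is an assumption here; the sketch sums.

```python
# Hedged sketch in PyTorch: per-feature embeddings summed into a single
# encoder input vector, replacing the usual source-word embedding.
import torch
import torch.nn as nn

class FeatureEmbedding(nn.Module):
    def __init__(self, vocab_sizes, d_model=512):
        super().__init__()
        # One embedding table per linguistic feature.
        self.tables = nn.ModuleList(nn.Embedding(v, d_model) for v in vocab_sizes)

    def forward(self, feats):
        # feats: (batch, seq_len, n_features) integer ids
        return sum(table(feats[..., i]) for i, table in enumerate(self.tables))

# Vocabulary sizes for lemma, dependency label, POS tag, morphological label
# (all illustrative).
emb = FeatureEmbedding([30_000, 50, 20, 200])
batch = torch.randint(0, 20, (2, 7, 4))  # toy ids; real ids come from feature vocabularies
print(emb(batch).shape)  # torch.Size([2, 7, 512])
```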


Author(s):  
Una Stojnić

On the received view, the resolution of context-sensitivity is at least partly determined by non-linguistic features of utterance situation. If I say ‘He’s happy’, what ‘he’ picks out is underspecified by its linguistic meaning, and is only fixed through extra-linguistic supplementation: the speaker’s intention, and/or some objective, non-linguistic feature of the utterance situation. This underspecification is exhibited by most context-sensitive expressions, with the exception of pure indexicals, like ‘I.’ While this received view is prima facie appealing, I argue it is deeply mistaken. I defend an account according to which context-sensitivity resolution is governed by linguistic mechanisms determining prominence of candidate resolutions of context-sensitive items. Thus, on this account, the linguistic meaning of a context-sensitive expression fully specifies its resolution in a context, automatically selecting the resolution antecedently set by the prominence-governing linguistic mechanisms.


Author(s):  
Vittoria Cuteri ◽  
Giulia Minori ◽  
Gloria Gagliardi ◽  
Fabio Tamburini ◽  
Elisabetta Malaspina ◽  
...  

Abstract
Purpose: Attention has recently been paid to Clinical Linguistics for the detection and support of clinical conditions. Many works have been published on the “linguistic profile” of various clinical populations, but very few papers have been devoted to linguistic changes in patients with eating disorders. Patients with Anorexia Nervosa (AN) share similar psychological features, such as disturbances in self-perceived body image, inflexible and obsessive thinking, and anxious or depressive traits. We hypothesize that these characteristics can result in altered linguistic patterns that can be detected using Natural Language Processing tools.
Methods: We enrolled 51 young participants from December 2019 to February 2020 (age range: 14–18): 17 girls with a clinical diagnosis of AN and 34 normal-weight peers, matched by gender, age, and educational level. Participants in each group were asked to produce three written texts (around 10–15 lines long). A rich set of linguistic features was extracted from the text samples, and the statistical significance of each in pinpointing the pathological process was measured.
Results: Comparison between the two groups showed several linguistic indexes to be statistically significant, with syntactic reduction as the most relevant trait of AN productions. In particular, the following features emerge as statistically significant in distinguishing AN girls from their normal-weight peers: the length of the sentences, the complexity of the noun phrase, and the global syntactic complexity. This peculiar pattern of linguistic erosion may be due to the severe metabolic impairment that also affects the central nervous system in AN.
Conclusion: These preliminary data show the existence of linguistic parameters as probable linguistic markers of AN. However, the analysis of a bigger cohort, still ongoing, is needed to consolidate this assumption.
Level of evidence: III, evidence obtained from case–control analytic studies.
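
As a hedged sketch of one step in such a pipeline, the snippet below extracts a single feature (mean sentence length) and compares the two groups with a nonparametric test. The paper's exact feature extractor and statistical test are not stated in the abstract; spaCy and the Mann-Whitney U test are stand-ins, and the Italian model it_core_news_sm plus the tiny example texts are assumptions.

```python
# Hedged sketch: one linguistic feature (mean sentence length) compared
# across groups. spaCy + Mann-Whitney U are stand-ins for the study's tools.
import spacy
from scipy.stats import mannwhitneyu

nlp = spacy.load("it_core_news_sm")  # assumed Italian pipeline; install separately

def mean_sentence_length(text):
    """Average number of non-punctuation tokens per sentence."""
    doc = nlp(text)
    lengths = [len([t for t in sent if not t.is_punct]) for sent in doc.sents]
    return sum(lengths) / len(lengths)

# Toy example texts (invented, purely illustrative):
an_texts = ["Oggi non ho fame. Sono stanca.",
            "Il cibo mi spaventa. Non dormo bene."]
ctrl_texts = ["Oggi sono andata a scuola con le mie amiche e poi abbiamo studiato insieme.",
              "Questo fine settimana vorrei andare al cinema a vedere un bel film."]

an_scores = [mean_sentence_length(t) for t in an_texts]
ctrl_scores = [mean_sentence_length(t) for t in ctrl_texts]
stat, p = mannwhitneyu(an_scores, ctrl_scores)
print(f"U = {stat}, p = {p:.3f}")  # a small p would flag the feature as discriminative
```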


2014 ◽  
Vol 7 (2) ◽  
pp. 205
Author(s):  
Ulin Nuha

In this study, the researcher analyzed the transactional and interpersonal conversation texts found in the grade VIII English textbook entitled “EOS English on Sky 2”, as well as the linguistic features of the transactional and interpersonal conversations in the textbook. The study focuses on a structural-functional approach, which analyzes speech function, and a structural approach, which analyzes linguistic features. This is a qualitative study. In calculating the data and the final percentages, quantification was used to support the study. The units of analysis are moves and clauses. The conversation texts are presented in 8 units. The moves were analyzed functionally and the clauses were analyzed structurally. The results show that 54.5% of the speech functions of the transactional conversation texts match the standard of content, while only 2.1% of the speech functions of the interpersonal conversation texts do. The linguistic features applied in the transactional and interpersonal conversation texts are at the functional literacy level. The speech functions of the conversation texts introduced in EOS English on Sky 2 for junior high school grade VIII are thus less compatible with the standard of content based on the compatibility levels.

Keywords: transactional and interpersonal conversation texts; speech function; linguistic feature.


2018 ◽  
Vol 6 (2) ◽  
pp. 83
Author(s):  
Refat Aljumily

The aim of this paper is to evaluate the efficiency of automated linguistic features, testing their capacity or discriminating power as style markers for author identification in short text messages of the Facebook genre. The corpus used to evaluate the automated linguistic features was compiled from 221 Facebook texts written in English (each about 2 to 3 lines, or 35-40 words), all in the same genre and on the same topic, posted within the same year group, and totalling 7,530 words. To compose the evaluation dataset, frequency values were collected for 16 linguistic feature types: parts of speech, function words, word bigrams, character trigrams, average sentence length in words, average sentence length in characters, Yule’s K measure, Simpson’s D measure, average word length, function-word/content-word (FW/CW) ratio, average characters, content-specific keywords, type/token ratio, total number of short words (fewer than four characters), contractions, and total number of characters in words, selected from five corpora, for a total of 328 test features. The evaluation of the 16 linguistic feature types differs from other analyses in that the study used several variable selection methods: feature type frequency, variance, term frequency/inverse document frequency (TF-IDF), signal-to-noise ratio, and Poisson term distribution. The relationships between known and anonymous text messages were examined using hierarchical linear and non-hierarchical nonlinear clustering methods, taking into account the nonlinear patterns in the data. Based on the TF-IDF technique, there were similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of function word and part-of-speech usage, with an efficiency of 60% for function word usage and 50% for part-of-speech frequencies. There were no similarities between the anonymous text messages and the authors of the non-anonymous text messages in terms of the other features under the feature type frequency and variance techniques, the efficiency of these features in the corpus being below 40%. There was a positive effect on identification performance when using part-of-speech and function word frequencies with the TF-IDF technique as the length of the text messages increased (N ≥ 100). In this way, the performance and efficiency of syntactic features and function word usage in identifying anonymous authors or text messages improves with longer text messages under the TF-IDF variable selection technique, but decreases when the feature type frequency and variance techniques are applied in the selection process.
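
A hedged sketch of the TF-IDF plus hierarchical-clustering step: represent each short message by TF-IDF weights over function words, cluster, and see which known author's cluster the anonymous message joins. This is a heavily simplified stand-in for the paper's 328-feature setup; the word list and messages are invented for illustration.

```python
# Hedged sketch: TF-IDF over function words + Ward hierarchical clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

FUNCTION_WORDS = ["the", "a", "of", "and", "to", "in", "that", "it", "is", "was"]

messages = [
    "the party was great and it was so much fun",     # known author A
    "it is a pity that the weather was bad",           # known author A
    "going to town in a bit, text me",                 # known author B
    "it was the best of the trips and it was cheap",   # anonymous message
]
vec = TfidfVectorizer(vocabulary=FUNCTION_WORDS)
X = vec.fit_transform(messages).toarray()

# Ward linkage on the TF-IDF vectors; cut the dendrogram into two clusters.
labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
print(labels)  # the anonymous message's label shows which known cluster it joins
```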


2020 ◽  
Vol 36 (4) ◽  
pp. 305-323
Author(s):  
Quan Hoang Nguyen ◽  
Ly Vu ◽  
Quang Uy Nguyen

Sentiment classification (SC) aims to determine whether a document conveys a positive or negative opinion. Due to the rapid development of the digital world, SC has become an important research topic that affects many aspects of our lives. In machine-learning-based SC, the representation of the document strongly influences accuracy. Word Embedding (WE) techniques such as Word2vec have proved beneficial for the SC problem. However, Word2vec alone is often not enough to represent the semantics of Vietnamese documents with complex sentences. In this paper, we propose a new representation learning model, called a two-channel vector, to learn a higher-level feature of a document for SC. Our model uses two neural networks to learn the semantic feature, i.e., Word2vec, and the syntactic feature, i.e., the part-of-speech (POS) tag. The two features are then combined and input to a softmax function to make the final classification. We carry out intensive experiments on 4 recent Vietnamese sentiment datasets to evaluate the performance of the proposed architecture. The experimental results demonstrate that the proposed model significantly enhances the accuracy of SC compared to the two single models and a state-of-the-art ensemble method.
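
A hedged sketch of the two-channel idea: one channel encodes Word2vec vectors, the other encodes POS-tag embeddings; the two document features are concatenated and fed to a softmax classifier. The paper's exact channel architectures are not given in the abstract; mean-pooled linear encoders stand in for them here, and all dimensions are illustrative.

```python
# Hedged sketch in PyTorch: semantic (Word2vec) and syntactic (POS) channels,
# concatenated and classified with a softmax output layer.
import torch
import torch.nn as nn

class TwoChannelClassifier(nn.Module):
    def __init__(self, w2v_dim=300, n_pos=20, pos_dim=32, hidden=64, n_classes=2):
        super().__init__()
        self.pos_emb = nn.Embedding(n_pos, pos_dim)
        self.sem = nn.Linear(w2v_dim, hidden)   # semantic channel (Word2vec)
        self.syn = nn.Linear(pos_dim, hidden)   # syntactic channel (POS)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, w2v, pos_ids):
        # Mean-pool each channel over the token dimension, then concatenate.
        sem = torch.relu(self.sem(w2v)).mean(dim=1)
        syn = torch.relu(self.syn(self.pos_emb(pos_ids))).mean(dim=1)
        return torch.log_softmax(self.out(torch.cat([sem, syn], dim=-1)), dim=-1)

model = TwoChannelClassifier()
w2v = torch.randn(8, 30, 300)         # batch of 8 docs, 30 tokens, Word2vec vectors
pos = torch.randint(0, 20, (8, 30))   # matching POS-tag ids
print(model(w2v, pos).shape)          # torch.Size([8, 2])
```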


MATEMATIKA ◽  
2020 ◽  
Vol 36 (1) ◽  
pp. 15-30
Author(s):  
'Aaishah Radziah Jamaludin ◽  
Fadhilah Yusof ◽  
Suhartono Suhartono

Johor Bahru is undergoing rapid development, and pollution is an issue that needs to be considered because it has contributed to the number of asthma cases in the area. The goal of this study is therefore to investigate the behaviour of asthma in Johor Bahru using a count-analysis approach, namely Poisson Integer-valued Generalized Autoregressive Conditional Heteroscedasticity (Poisson-INGARCH) and Negative Binomial INGARCH (NB-INGARCH) models with identity and log link functions. Intervention analysis was conducted because of the outbreak in the asthma data for the period July 2012 to July 2013, which probably occurred due to the extremely bad haze in Johor Bahru caused by Indonesian fires. Parameter estimation was done by quasi-maximum likelihood estimation. Model adequacy was assessed via the Pearson residuals, the cumulative periodogram, the probability integral transform (PIT) histogram, the log-likelihood value, Akaike’s Information Criterion (AIC), and the Bayesian Information Criterion (BIC). Our results show that NB-INGARCH with identity and log link functions is adequate for representing the asthma data, with uncorrelated Pearson residuals, a higher log-likelihood, a PIT histogram exhibiting normality, and the lowest AIC and BIC. In terms of forecasting accuracy, however, NB-INGARCH with the identity link function performed better, with the smaller RMSE (8.54) on the sample data. Therefore, NB-INGARCH with the identity link function can be applied as a prediction model for asthma in Johor Bahru. Ideally, this outcome can assist the Department of Health in taking preventive action and early planning to curb asthma in Johor Bahru.
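
For readers unfamiliar with the model class, here is a hedged sketch of a Poisson INGARCH(1,1) fit by quasi-maximum likelihood with an identity link, where the conditional mean follows lambda_t = b0 + b1*y_{t-1} + a1*lambda_{t-1}. This is a generic illustration with simulated counts, not the paper's code or data; the study additionally considers NB-INGARCH and a log link.

```python
# Hedged sketch: Poisson INGARCH(1,1), identity link, quasi-maximum likelihood.
import numpy as np
from scipy.optimize import minimize

def neg_quasi_loglik(params, y):
    b0, b1, a1 = params
    lam = np.empty_like(y, dtype=float)
    lam[0] = y.mean()                                    # initialise at the sample mean
    for t in range(1, len(y)):
        lam[t] = b0 + b1 * y[t - 1] + a1 * lam[t - 1]    # identity-link recursion
    return -np.sum(y * np.log(lam) - lam)                # Poisson quasi-log-likelihood
                                                         # (up to an additive constant)

rng = np.random.default_rng(0)
y = rng.poisson(lam=8, size=120).astype(float)  # stand-in monthly asthma counts

res = minimize(neg_quasi_loglik, x0=[1.0, 0.3, 0.3], args=(y,),
               bounds=[(1e-6, None), (0, 1), (0, 1)])
print(res.x)  # estimated (b0, b1, a1)
```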

