scholarly journals String Kernels for Native Language Identification: Insights from Behind the Curtains

2016 ◽  
Vol 42 (3) ◽  
pp. 491-525 ◽  
Author(s):  
Radu Tudor Ionescu ◽  
Marius Popescu ◽  
Aoife Cahill

The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Recently, an approach that uses only character p-grams as features has been proposed for the task of native language identification (NLI). The approach obtained state-of-the-art results by combining several string kernels using multiple kernel learning. Despite the fact that the approach based on string kernels performs so well, several questions about this method remain unanswered. First, it is not clear why such a simple approach can compete with far more complex approaches that take words, lemmas, syntactic information, or even semantics into account. Second, although the approach is designed to be language independent, all experiments to date have been on English. This work is an extensive study that aims to systematically present the string kernel approach and to clarify the open questions mentioned above. A broad set of native language identification experiments were conducted to compare the string kernels approach with other state-of-the-art methods. The empirical results obtained in all of the experiments conducted in this work indicate that the proposed approach achieves state-of-the-art performance in NLI, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the results obtained on both the Arabic and the Norwegian corpora demonstrate that the proposed approach is language independent. In the Arabic native language identification task, string kernels show an increase of more than 17% over the best accuracy reported so far. The results of string kernels on Norwegian native language identification are also significantly better than the state-of-the-art approach. In addition, in a cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state-of-the-art system by 32.3%. To gain additional insights about the string kernels approach, the features selected by the classifier as being more discriminating are analyzed in this work. The analysis also offers information about localized language transfer effects, since the features used by the proposed model are p-grams of various lengths. The features captured by the model typically include stems, function words, and word prefixes and suffixes, which have the potential to generalize over purely word-based features. By analyzing the discriminating features, this article offers insights into two kinds of language transfer effects, namely, word choice (lexical transfer) and morphological differences. The goal of the current study is to give a full view of the string kernels approach and shed some light on why this approach works so well.

2020 ◽  
pp. 1-31
Author(s):  
Ilia Markov ◽  
Vivi Nastase ◽  
Carlo Strapparava

Abstract Native language identification (NLI)—the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2)—is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena potentially involved in native language interference in the context of the NLI task: the languages’ structuring of information through punctuation usage, emotion expression in language, and similarities of form with the L1 vocabulary through the use of anglicized words, cognates, and other misspellings. The results of experiments with different combinations of features in a variety of settings allow us to quantify the native language interference value of these linguistic phenomena and show how robust they are in cross-corpus experiments and with respect to proficiency in L2. These experiments provide a deeper insight into the NLI task, showing how native language interference explains the gap between baseline, corpus-independent features, and the state of the art that relies on features/representations that cover (indiscriminately) a variety of linguistic phenomena.


2018 ◽  
Vol 44 (3) ◽  
pp. 403-446 ◽  
Author(s):  
Shervin Malmasi ◽  
Mark Dras

Ensemble methods using multiple classifiers have proven to be among the most successful approaches for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of experiments using three ensemble-based models, testing each with multiple configurations and algorithms. This includes a rigorous application of meta-classification models for NLI, achieving state-of-the-art results on several large data sets, evaluated in both intra-corpus and cross-corpus modes.


Algorithms ◽  
2021 ◽  
Vol 14 (1) ◽  
pp. 12
Author(s):  
Abdelraouf Ishtaiwi ◽  
Feda Alshahwan ◽  
Naser Jamal ◽  
Wael Hadi ◽  
Muhammad AbuArqoub

For decades, the use of weights has proven its superior ability to improve dynamic local search weighting algorithms’ overall performance. This paper proposes a new mechanism where the initial clause’s weights are dynamically allocated based on the problem’s structure. The new mechanism starts by examining each clause in terms of its size and the extent of its link, and its proximity to other clauses. Based on our examination, we categorized the clauses into four categories: (1) clauses small in size and linked with a small neighborhood, (2) clauses small in size and linked with a large neighborhood, (3) clauses large in size and linked with a small neighborhood, and (4) clauses large in size and linked with a large neighborhood. Then, the initial weights are dynamically allocated according to each clause category. To examine the efficacy of the dynamic initial weight assignment, we conducted an extensive study of our new technique on many problems. The study concluded that the dynamic allocation of initial weights contributes significantly to improving the search process’s performance and quality. To further investigate the new mechanism’s effect, we compared the new mechanism with the state-of-the-art algorithms belonging to the same family in terms of using weights, and it was clear that the new mechanism outperformed the state-of-the-art clause weighting algorithms. We also show that the new mechanism could be generalized with minor changes to be utilized within the general-purpose stochastic local search state-of-the-art weighting algorithms.


Author(s):  
T. A. Welton

Various authors have emphasized the spatial information resident in an electron micrograph taken with adequately coherent radiation. In view of the completion of at least one such instrument, this opportunity is taken to summarize the state of the art of processing such micrographs. We use the usual symbols for the aberration coefficients, and supplement these with £ and 6 for the transverse coherence length and the fractional energy spread respectively. He also assume a weak, biologically interesting sample, with principal interest lying in the molecular skeleton remaining after obvious hydrogen loss and other radiation damage has occurred.


2003 ◽  
Vol 48 (6) ◽  
pp. 826-829 ◽  
Author(s):  
Eric Amsel
Keyword(s):  

1968 ◽  
Vol 13 (9) ◽  
pp. 479-480
Author(s):  
LEWIS PETRINOVICH
Keyword(s):  

1984 ◽  
Vol 29 (5) ◽  
pp. 426-428
Author(s):  
Anthony R. D'Augelli

1991 ◽  
Vol 36 (2) ◽  
pp. 140-140
Author(s):  
John A. Corson
Keyword(s):  

Sign in / Sign up

Export Citation Format

Share Document