String Kernels for Native Language Identification: Insights from Behind the Curtains

Radu Tudor Ionescu; Marius Popescu; Aoife Cahill

doi:10.1162/coli_a_00256

Computational Linguistics ◽

10.1162/coli_a_00256 ◽

2016 ◽

Vol 42 (3) ◽

pp. 491-525 ◽

Cited By ~ 9

Author(s):

Radu Tudor Ionescu ◽

Marius Popescu ◽

Aoife Cahill

Keyword(s):

State Of The Art ◽

Native Language ◽

Extensive Study ◽

The State ◽

Language Identification ◽

Language Transfer ◽

Word Choice ◽

Transfer Effects ◽

String Kernels ◽

Kernel Approach

The most common approach in text mining classification tasks is to rely on features like words, part-of-speech tags, stems, or some other high-level linguistic features. Recently, an approach that uses only character p-grams as features has been proposed for the task of native language identification (NLI). The approach obtained state-of-the-art results by combining several string kernels using multiple kernel learning. Although the string kernel approach performs well, several questions about it remain unanswered. First, it is not clear why such a simple approach can compete with far more complex approaches that take words, lemmas, syntactic information, or even semantics into account. Second, although the approach is designed to be language independent, all experiments to date have been on English. This work is an extensive study that aims to systematically present the string kernel approach and to clarify the open questions mentioned above. A broad set of native language identification experiments was conducted to compare the string kernels approach with other state-of-the-art methods. The empirical results obtained in all of the experiments conducted in this work indicate that the proposed approach achieves state-of-the-art performance in NLI, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the results obtained on both the Arabic and the Norwegian corpora demonstrate that the proposed approach is language independent. In the Arabic native language identification task, string kernels show an increase of more than 17% over the best accuracy reported so far. The results of string kernels on Norwegian native language identification are also significantly better than the state-of-the-art approach. In addition, in a cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving the state-of-the-art system by 32.3%.
To gain additional insights about the string kernels approach, the features selected by the classifier as being more discriminating are analyzed in this work. The analysis also offers information about localized language transfer effects, since the features used by the proposed model are p-grams of various lengths. The features captured by the model typically include stems, function words, and word prefixes and suffixes, which have the potential to generalize over purely word-based features. By analyzing the discriminating features, this article offers insights into two kinds of language transfer effects, namely, word choice (lexical transfer) and morphological differences. The goal of the current study is to give a full view of the string kernels approach and shed some light on why this approach works so well.
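
As a rough illustration of what "character p-grams as features" can mean in practice, the sketch below computes a plain p-spectrum kernel (the inner product of character p-gram counts) and a blended version summed over a range of p-gram lengths. This is a minimal sketch, not the authors' implementation: the abstract does not name the exact kernels or normalization used, and the function names `pgram_counts`, `spectrum_kernel`, `blended_kernel`, and `normalized` are hypothetical.

```python
from collections import Counter
from math import sqrt


def pgram_counts(text, p):
    """Count every contiguous character p-gram in `text`."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p):
    """p-spectrum kernel: inner product of the two p-gram count vectors."""
    cs, ct = pgram_counts(s, p), pgram_counts(t, p)
    # Iterate over the smaller table; Counter returns 0 for missing p-grams.
    if len(cs) > len(ct):
        cs, ct = ct, cs
    return sum(n * ct[gram] for gram, n in cs.items())


def blended_kernel(s, t, p_min, p_max):
    """Sum of spectrum kernels over p-gram lengths p_min..p_max (assumed blend)."""
    return sum(spectrum_kernel(s, t, p) for p in range(p_min, p_max + 1))


def normalized(kernel, s, t, *args):
    """Cosine-style normalization so that k(s, s) == 1 for non-empty overlap."""
    denom = sqrt(kernel(s, s, *args) * kernel(t, t, *args))
    return kernel(s, t, *args) / denom if denom else 0.0


if __name__ == "__main__":
    a = "the informations are usefull"
    b = "these informations were usefull"
    print(blended_kernel(a, b, 3, 5))
    print(normalized(blended_kernel, a, b, 3, 5))
```

Because p-grams cut across word boundaries, shared substrings such as stems, suffixes, and recurring misspellings contribute to the similarity even when whole words differ, which is consistent with the kinds of features the abstract reports as discriminating.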

Exploiting native language interference for native language identification

Natural Language Engineering ◽

10.1017/s1351324920000595 ◽

2020 ◽

pp. 1-31

Author(s):

Ilia Markov ◽

Vivi Nastase ◽

Carlo Strapparava

Keyword(s):

Second Language ◽

State Of The Art ◽

Native Language ◽

Emotion Expression ◽

The State ◽

Language Identification ◽

Independent Features ◽

Insight Into

Native language identification (NLI), the task of automatically identifying the native language (L1) of persons based on their writings in the second language (L2), is based on the hypothesis that characteristics of L1 will surface and interfere in the production of texts in L2 to the extent that L1 is identifiable. We present an in-depth investigation of features that model a variety of linguistic phenomena potentially involved in native language interference in the context of the NLI task: the languages’ structuring of information through punctuation usage, emotion expression in language, and similarities of form with the L1 vocabulary through the use of anglicized words, cognates, and other misspellings. The results of experiments with different combinations of features in a variety of settings allow us to quantify the native language interference value of these linguistic phenomena and show how robust they are in cross-corpus experiments and with respect to proficiency in L2. These experiments provide a deeper insight into the NLI task, showing how native language interference explains the gap between baseline, corpus-independent features, and the state of the art that relies on features/representations that cover (indiscriminately) a variety of linguistic phenomena.

Native Language Identification With Classifier Stacking and Ensembles

Computational Linguistics ◽

10.1162/coli_a_00323 ◽

2018 ◽

Vol 44 (3) ◽

pp. 403-446 ◽

Cited By ~ 7

Author(s):

Shervin Malmasi ◽

Mark Dras

Keyword(s):

State Of The Art ◽

Native Language ◽

Ensemble Methods ◽

Large Data ◽

Language Identification ◽

Large Data Sets ◽

Data Sets ◽

Classification Models ◽

Multiple Classifiers ◽

Current State

Ensemble methods using multiple classifiers have proven to be among the most successful approaches for the task of Native Language Identification (NLI), achieving the current state of the art. However, a systematic examination of ensemble methods for NLI has yet to be conducted. Additionally, deeper ensemble architectures such as classifier stacking have not been closely evaluated. We present a set of experiments using three ensemble-based models, testing each with multiple configurations and algorithms. This includes a rigorous application of meta-classification models for NLI, achieving state-of-the-art results on several large data sets, evaluated in both intra-corpus and cross-corpus modes.

A Dynamic Clause Specific Initial Weight Assignment for Solving Satisfiability Problems Using Local Search

Algorithms ◽

10.3390/a14010012 ◽

2021 ◽

Vol 14 (1) ◽

pp. 12

Author(s):

Abdelraouf Ishtaiwi ◽

Feda Alshahwan ◽

Naser Jamal ◽

Wael Hadi ◽

Muhammad AbuArqoub

Keyword(s):

Local Search ◽

State Of The Art ◽

Small Neighborhood ◽

Extensive Study ◽

The State ◽

Stochastic Local Search ◽

Dynamic Allocation ◽

Initial Weight ◽

Large Neighborhood ◽

Weight Assignment

For decades, clause weights have proven effective at improving the overall performance of dynamic local search weighting algorithms. This paper proposes a new mechanism in which the clauses’ initial weights are dynamically allocated based on the problem’s structure. The new mechanism starts by examining each clause in terms of its size, the extent of its links to other clauses, and its proximity to them. Based on this examination, we categorized the clauses into four categories: (1) clauses small in size and linked with a small neighborhood, (2) clauses small in size and linked with a large neighborhood, (3) clauses large in size and linked with a small neighborhood, and (4) clauses large in size and linked with a large neighborhood. Then, the initial weights are dynamically allocated according to each clause category. To examine the efficacy of the dynamic initial weight assignment, we conducted an extensive study of our new technique on many problems. The study concluded that the dynamic allocation of initial weights contributes significantly to improving the search process’s performance and quality. To further investigate the new mechanism’s effect, we compared it with state-of-the-art algorithms from the same family of weighting approaches, and it was clear that the new mechanism outperformed the state-of-the-art clause weighting algorithms. We also show that the new mechanism could be generalized with minor changes to be utilized within general-purpose state-of-the-art stochastic local search weighting algorithms.
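
The four-way categorization described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not say how "small" and "large" are decided or which weights each category receives, so the sketch assumes a clause's neighborhood is the set of other clauses sharing a variable with it, assumes the median as the cut-off for both size and neighborhood, and takes the per-category weights as a caller-supplied table. All function names are hypothetical.

```python
from statistics import median


def neighborhood_sizes(clauses):
    """For each clause, count the other clauses that share a variable with it."""
    var_to_clauses = {}
    for idx, clause in enumerate(clauses):
        for lit in clause:
            var_to_clauses.setdefault(abs(lit), set()).add(idx)
    sizes = []
    for idx, clause in enumerate(clauses):
        neighbors = set()
        for lit in clause:
            neighbors |= var_to_clauses[abs(lit)]
        neighbors.discard(idx)  # a clause is not its own neighbor
        sizes.append(len(neighbors))
    return sizes


def categorize(clauses):
    """Map each clause to category 1..4 using the numbering from the text.

    Assumption: "large" means strictly above the median over all clauses.
    """
    lengths = [len(c) for c in clauses]
    hoods = neighborhood_sizes(clauses)
    len_cut, hood_cut = median(lengths), median(hoods)
    categories = []
    for n, h in zip(lengths, hoods):
        large_clause = n > len_cut
        large_hood = h > hood_cut
        # (small, small)=1, (small, large)=2, (large, small)=3, (large, large)=4
        categories.append(1 + 2 * large_clause + large_hood)
    return categories


def initial_weights(clauses, weight_by_category):
    """Assign each clause the initial weight configured for its category."""
    return [weight_by_category[c] for c in categorize(clauses)]


if __name__ == "__main__":
    # CNF formula as lists of signed integers (DIMACS-style literals).
    cnf = [[1, 2], [-1, 3], [4], [2, 3, 4, 5]]
    print(categorize(cnf))
    print(initial_weights(cnf, {1: 1, 2: 2, 3: 3, 4: 4}))
```

A local search solver would then start from these per-clause weights instead of a uniform initial value; how the weights evolve afterwards is outside the scope of this sketch.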

Native Language Identification with String Kernels

Advances in Computer Vision and Pattern Recognition - Knowledge Transfer between Computer Vision and Text Mining ◽

10.1007/978-3-319-30367-3_8 ◽

2016 ◽

pp. 193-227 ◽

Cited By ~ 1

Author(s):

Radu Tudor Ionescu ◽

Marius Popescu

Keyword(s):

Native Language ◽

Language Identification ◽

String Kernels

Can string kernels pass the test of time in Native Language Identification?

10.18653/v1/w17-5024 ◽

2017 ◽

Cited By ~ 2

Author(s):

Radu Tudor Ionescu ◽

Marius Popescu

Keyword(s):

Native Language ◽

Language Identification ◽

String Kernels

Practical Picture Processing

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100051700 ◽

1974 ◽

Vol 32 ◽

pp. 338-339

Author(s):

T. A. Welton

Keyword(s):

Radiation Damage ◽

Coherence Length ◽

Spatial Information ◽

State Of The Art ◽

Coherent Radiation ◽

The State ◽

Energy Spread ◽

Electron Micrograph ◽

Picture Processing ◽

Molecular Skeleton

Various authors have emphasized the spatial information resident in an electron micrograph taken with adequately coherent radiation. In view of the completion of at least one such instrument, this opportunity is taken to summarize the state of the art of processing such micrographs. We use the usual symbols for the aberration coefficients, and supplement these with £ and 6 for the transverse coherence length and the fractional energy spread respectively. We also assume a weak, biologically interesting sample, with principal interest lying in the molecular skeleton remaining after obvious hydrogen loss and other radiation damage have occurred.
