On the Longest Common Subsequence of Conjugation Invariant Random Permutations

10.37236/8669 ◽ 2020 ◽ Vol 27 (4)
Author(s):  
Mohamed Slim Kammoun

Bukh and Zhou conjectured that the expectation of the length of the longest common subsequence of two i.i.d. random permutations of size $n$ is greater than $\sqrt{n}$. We prove in this paper that there exists a universal constant $n_1$ such that their conjecture holds for any pair of i.i.d. random permutations of size greater than $n_1$ whose distribution is invariant under conjugation. More generally, in the case where the laws of the two permutations are not necessarily the same, we give a lower bound for the expectation. In particular, we prove that if one of the permutations is invariant under conjugation and the expectation of the number of its cycles is well controlled, the limiting fluctuations of the length of the longest common subsequence are of Tracy-Widom type. This result holds independently of the law of the second permutation.
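As a rough illustration of the quantity discussed in this abstract, the following Python sketch computes the length of the longest common subsequence of two permutations and estimates its expectation by simulation. It is not taken from the paper. It assumes uniform random permutations (one example of a conjugation-invariant law), and the function names `lcs_of_permutations` and `estimate_expected_lcs` are hypothetical. The computation relabels the second permutation by positions in the first, which reduces the problem to a longest increasing subsequence.

```python
import bisect
import math
import random


def lcs_of_permutations(sigma, tau):
    """Length of the longest common subsequence of two permutations of the same set."""
    # Relabel each value of tau by its position in sigma; a common subsequence
    # of sigma and tau then corresponds to an increasing subsequence of the relabeling.
    position_in_sigma = {value: index for index, value in enumerate(sigma)}
    relabeled = [position_in_sigma[value] for value in tau]

    # Patience-sorting computation of the longest increasing subsequence length.
    tails = []
    for x in relabeled:
        i = bisect.bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def estimate_expected_lcs(n, trials, seed=0):
    """Monte Carlo estimate of E[LCS] for two independent uniform permutations (an assumption)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        sigma = list(range(n))
        tau = list(range(n))
        rng.shuffle(sigma)
        rng.shuffle(tau)
        total += lcs_of_permutations(sigma, tau)
    return total / trials


if __name__ == "__main__":
    n = 400
    print("estimated E[LCS]:", estimate_expected_lcs(n, trials=200))
    print("sqrt(n):         ", math.sqrt(n))
```

Comparing the two printed values for a few sizes $n$ gives an empirical feel for the conjectured bound in the uniform case only. It says nothing about other conjugation-invariant laws.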

Author(s):  
Iqra Muneer ◽  
Rao Muhammad Adeel Nawab

Cross-Lingual Text Reuse Detection (CLTRD) has recently attracted the attention of the research community because a large amount of digital text is readily available for reuse in multiple languages through online digital repositories. In addition, efficient machine translation systems are freely and readily available to translate text from one language into another, which makes it easy to reuse text across languages and, consequently, difficult to detect such reuse. In the literature, the most prominent and widely used approach for CLTRD is Translation plus Monolingual Analysis (T+MA). To detect CLTR for the English-Urdu language pair, T+MA has been used with lexical approaches, namely N-gram Overlap, Longest Common Subsequence, and Greedy String Tiling. This shows that T+MA has not been thoroughly explored for the English-Urdu language pair. To fill this gap, this study presents an in-depth and detailed comparison of 26 approaches that are based on T+MA. These approaches include semantic similarity approaches (semantic tagger-based approaches, WordNet-based approaches), a probabilistic approach (Kullback-Leibler distance approach), monolingual word embedding-based approaches, a Siamese recurrent architecture, and monolingual sentence transformer-based approaches for the English-Urdu language pair. The evaluation was carried out using the CLEU benchmark corpus, for both the binary and the ternary classification tasks. Our extensive experimentation shows that our proposed approach, a combination of 26 approaches, obtained F1 scores of 0.77 and 0.61 for the binary and ternary classification tasks, respectively, and outperformed the previously reported approaches [41] (F1 = 0.73 for the binary and F1 = 0.55 for the ternary classification task) on the CLEU corpus.
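Two of the lexical measures named in this abstract, N-gram Overlap and Longest Common Subsequence, can be sketched on already-translated token lists as follows. This is a minimal illustration and not the authors' implementation. The function names, the whitespace tokenization, the containment-style normalization for n-grams, and the division by the shorter length for LCS are all assumptions made here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(source_tokens, suspicious_tokens):
    """LCS length normalized by the shorter sequence (normalization is an assumption)."""
    if not source_tokens or not suspicious_tokens:
        return 0.0
    shorter = min(len(source_tokens), len(suspicious_tokens))
    return lcs_length(source_tokens, suspicious_tokens) / shorter


def ngram_containment(source_tokens, suspicious_tokens, n=2):
    """Share of the suspicious text's distinct n-grams that also occur in the source."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    suspicious = ngrams(suspicious_tokens)
    if not suspicious:
        return 0.0
    return len(suspicious & ngrams(source_tokens)) / len(suspicious)


if __name__ == "__main__":
    source = "the cat sat on the mat".split()
    suspicious = "the cat lay on the mat".split()
    print(lcs_similarity(source, suspicious))
    print(ngram_containment(source, suspicious, n=2))
```

In a T+MA setting, one text would first be machine-translated into the language of the other, and scores such as these would then serve as features or thresholds for the reuse classifier.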

