A Fast and Simple Online Synchronous Context Free Grammar Extractor

Baltescu Paul; Blunsom Phil

doi:10.2478/pralin-2014-0010

Prague Bulletin of Mathematical Linguistics ◽

10.2478/pralin-2014-0010 ◽

2014 ◽

Vol 102 (1) ◽

pp. 17-26 ◽

Cited By ~ 1

Author(s):

Baltescu Paul ◽

Blunsom Phil

Keyword(s):

Traditional Approach ◽

Training Data ◽

Context Free Grammar ◽

Parallel Corpora ◽

Suffix Arrays ◽

Parallel Data ◽

Efficient Data ◽

A New Technique ◽

Translation Systems ◽

Context Free

Hierarchical phrase-based machine translation systems rely on the synchronous context-free grammar formalism to learn and use translation rules containing gaps. The grammars learned by such systems become unmanageably large even for medium-sized parallel corpora. The traditional approach of preprocessing the training data and loading all possible translation rules into memory does not scale well for hierarchical phrase-based systems. Online grammar extractors address this problem by constructing memory-efficient data structures on top of the source side of the parallel data (often based on suffix arrays), which are used to efficiently match phrases in the corpus and to extract translation rules on the fly during decoding. This paper describes an open source implementation of an online synchronous context-free grammar extractor. Our approach builds on the work of Lopez (2008a) and introduces a new technique for extending the lists of phrase matches for phrases containing gaps that reduces the extraction time by a factor of 4. Our extractor is available as part of the cdec toolkit (Dyer et al., 2010).
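As a rough illustration of the suffix-array lookup mentioned in the abstract, the following Python sketch indexes a tokenized source-side corpus and finds every position where a contiguous phrase occurs. It is a minimal sketch under stated assumptions, not the cdec implementation: the names `build_suffix_array` and `find_phrase` are hypothetical, the naive construction is chosen for brevity rather than speed, and phrases containing gaps (the case the paper accelerates) are not handled.

```python
from typing import List


def build_suffix_array(tokens: List[str]) -> List[int]:
    """Return the start positions of all suffixes, sorted lexicographically.

    Naive construction (sorts full suffixes); fine for a toy corpus only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_phrase(tokens: List[str], suffix_array: List[int],
                phrase: List[str]) -> List[int]:
    """Return all corpus positions where the contiguous `phrase` starts."""
    n = len(phrase)

    def prefix(rank: int) -> List[str]:
        # First n tokens of the suffix stored at this rank.
        start = suffix_array[rank]
        return tokens[start:start + n]

    # Lower bound: first rank whose prefix is >= phrase.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < phrase:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first rank whose prefix is > phrase.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= phrase:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


# Hypothetical toy corpus (source side only).
corpus = "the cat sat on the mat".split()
index = build_suffix_array(corpus)
print(find_phrase(corpus, index, ["the"]))         # [0, 4]
print(find_phrase(corpus, index, ["the", "mat"]))  # [4]
```

Because all suffixes sharing a prefix sit next to each other in the sorted array, two binary searches are enough to recover every occurrence of a phrase without scanning the corpus.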

Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

Applied Sciences ◽

10.3390/app9102036 ◽

2019 ◽

Vol 9 (10) ◽

pp. 2036

Author(s):

Jinyi Zhang ◽

Tadahiro Matsumoto

Keyword(s):

Machine Translation ◽

Scientific Paper ◽

Training Data ◽

Word Alignment ◽

Sentence Pair ◽

Neural Machine Translation ◽

Parallel Corpora ◽

Translation Quality ◽

Parallel Data ◽

Source Sentence

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the training data size. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method, which has two variations: one is for all language pairs, and the other is for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. The experiment results of the Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply the code (see Supplementary Materials) that can reproduce our proposed method.
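The three steps listed in the abstract can be sketched in Python as follows. This is only an illustrative reading of the abstract, not the authors' released code: `augment_pair` is a hypothetical name, the sentence pair is assumed to be already split into aligned partial sentences (the paper derives the split points from word alignments, which is not reproduced here), and `back_translate` stands in for whatever target-to-source translation system is available. The sketch also assumes that one partial sentence is replaced at a time, yielding one pseudo-source sentence per partial sentence.

```python
from typing import Callable, List, Tuple


def augment_pair(src_parts: List[str],
                 tgt_parts: List[str],
                 back_translate: Callable[[str], str],
                 sep: str = " ") -> List[Tuple[str, str]]:
    """Build pseudo-parallel pairs from one pre-split sentence pair.

    src_parts[i] and tgt_parts[i] are assumed to be aligned partial sentences.
    `sep` joins partial sentences; use "" for unsegmented scripts.
    """
    if len(src_parts) != len(tgt_parts):
        raise ValueError("source and target need the same number of parts")

    target = sep.join(tgt_parts)
    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Replace the i-th source part with the back-translated target part.
        pseudo_parts = list(src_parts)
        pseudo_parts[i] = back_translate(tgt_part)
        pairs.append((sep.join(pseudo_parts), target))
    return pairs


# Toy stand-in for a back-translation system (hypothetical data).
toy_table = {"T1,": "I got home,", "T2.": "and rain fell."}
pairs = augment_pair(["I came home,", "and it rained."],
                     ["T1,", "T2."],
                     toy_table.__getitem__)
for pseudo_source, target in pairs:
    print(pseudo_source, "|||", target)
```

Each pseudo-source sentence is paired with the unchanged target sentence, so a pair split into k partial sentences contributes k additional training pairs under this reading.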

An Efficient Data Security for SSL Using Reverse Context Free Grammar Productions

Journal of Advanced Research in Dynamical and Control Systems ◽

10.5373/jardcs/v11sp12/20193304 ◽

2019 ◽

Vol 11 (12-SPECIAL ISSUE) ◽

pp. 984-993

Author(s):

T. Mohan Raj

Keyword(s):

Data Security ◽

Context Free Grammar ◽

Efficient Data ◽

Context Free

Mining Parallel Corpora from Sina Weibo and Twitter

Computational Linguistics ◽

10.1162/coli_a_00249 ◽

2016 ◽

Vol 42 (2) ◽

pp. 307-343 ◽

Cited By ~ 2

Author(s):

Wang Ling ◽

Luís Marujo ◽

Chris Dyer ◽

Alan W. Black ◽

Isabel Trancoso

Keyword(s):

Dynamic Programming Algorithm ◽

Training Data ◽

Programming Algorithm ◽

Translation System ◽

Sina Weibo ◽

Parallel Corpora ◽

Parallel Data ◽

Alignment Problem ◽

Machine Translation System ◽

Multiple Languages

Microblogs such as Twitter, Facebook, and Sina Weibo (China's equivalent of Twitter) are a remarkable linguistic resource. In contrast to content from edited genres such as newswire, microblogs contain discussions of virtually every topic by numerous individuals in different languages and dialects and in different styles. In this work, we show that some microblog users post “self-translated” messages targeting audiences who speak different languages, either by writing the same message in multiple languages or by retweeting translations of their original posts in a second language. We introduce a method for finding and extracting this naturally occurring parallel data. Identifying the parallel content requires solving an alignment problem, and we give an optimally efficient dynamic programming algorithm for this. Using our method, we extract nearly 3M Chinese–English parallel segments from Sina Weibo using a targeted crawl of Weibo users who post in multiple languages. Additionally, from a random sample of Twitter, we obtain substantial amounts of parallel data in multiple language pairs. Evaluation is performed by assessing the accuracy of our extraction approach relative to a manual annotation as well as in terms of utility as training data for a Chinese–English machine translation system. Relative to traditional parallel data resources, the automatically extracted parallel data yield substantial translation quality improvements in translating microblog text and modest improvements in translating edited news content.

Framework for rare event detection using Artificial Neural Network based context free grammar

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-189164 ◽

2020 ◽

Vol 39 (6) ◽

pp. 8463-8475

Author(s):

Palanivel Srinivasan ◽

Manivannan Doraipandian

Keyword(s):

Neural Network ◽

Artificial Neural Network ◽

Event Detection ◽

Performance Metrics ◽

Rare Events ◽

Rare Event ◽

Video Stream ◽

Context Free Grammar ◽

Artificial Neural ◽

Context Free

Rare event detection is performed using spatial-domain and frequency-domain procedures. The volume of footage from omnipresent surveillance cameras is increasing exponentially over time. Monitoring all events manually is impractical and time-consuming. Therefore, an automated rare event detection mechanism is required to make this process manageable. In this work, a Context-Free Grammar (CFG) is developed for detecting rare events from a video stream, and an Artificial Neural Network (ANN) is used to train the CFG. A set of dedicated algorithms performs frame splitting, edge detection and background subtraction, and converts the processed data into a CFG. The developed CFG is converted into nodes and edges to form a graph. The graph is given to the input layer of an ANN to classify normal and rare event classes. The graph derived from the CFG using the input video stream is used to train the ANN. Further, the performance of the developed Artificial Neural Network Based Context-Free Grammar – Rare Event Detection (ACFG-RED) is compared with other existing techniques, using performance metrics such as accuracy, precision, sensitivity, recall, average processing time and average processing power. Better performance metric values were observed for the ANN-CFG model than for the other techniques. The developed model will provide a better solution in detecting rare events using video streams.

Context Free Grammar (CFG) for MODI Script

International Institute of Engineers April 20-21, 2015 Bangkok (Thailand) ◽

10.15242/iie.e0415011 ◽

2015 ◽

Keyword(s):

Context Free Grammar ◽

Context Free

Searching for universal model of amyloid signaling motifs using probabilistic context-free grammars

BMC Bioinformatics ◽

10.1186/s12859-021-04139-y ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Witold Dyrka ◽

Marlena Gąsior-Głogowska ◽

Monika Szefczyk ◽

Natalia Szulc

Keyword(s):

Functional Relationship ◽

High Sensitivity ◽

Alternative Methods ◽

Discriminative Power ◽

Context Free Grammar ◽

Protein Motifs ◽

Functional Features ◽

Universal Grammars ◽

Context Free ◽

Probabilistic Context

Background: Amyloid signaling motifs are a class of protein motifs which share basic structural and functional features despite the lack of clear sequence homology. They are hard to detect in large sequence databases either with the alignment-based profile methods (due to short length and diversity) or with generic amyloid- and prion-finding tools (due to insufficient discriminative power). We propose to address the challenge with a machine learning grammatical model capable of generalizing over diverse collections of unaligned yet related motifs.

Results: First, we introduce and test improvements to our probabilistic context-free grammar framework for protein sequences that allow for inferring more sophisticated models achieving high sensitivity at low false positive rates. Then, we infer universal grammars for a collection of recently identified bacterial amyloid signaling motifs and demonstrate that the method is capable of generalizing by successfully searching for related motifs in fungi. The results are compared to available alternative methods. Finally, we conduct spectroscopy and staining analyses of selected peptides to verify their structural and functional relationship.

Conclusions: While the profile HMMs remain the method of choice for modeling homologous sets of sequences, PCFGs seem more suitable for building meta-family descriptors and extrapolating beyond the seed sample.

A generalization of the concept of a context-free grammar

Cybernetics ◽

10.1007/bf01068989 ◽

1974 ◽

Vol 8 (3) ◽

pp. 349-351

Author(s):

A. A. Letichevskii

Keyword(s):

Context Free Grammar ◽

Context Free

Knowledge Sources for Constituent Parsing of German, a Morphologically Rich and Less-Configurational Language

Computational Linguistics ◽

10.1162/coli_a_00135 ◽

2013 ◽

Vol 39 (1) ◽

pp. 57-85 ◽

Cited By ~ 2

Author(s):

Alexander Fraser ◽

Helmut Schmid ◽

Richárd Farkas ◽

Renjing Wang ◽

Hinrich Schütze

Keyword(s):

State Of The Art ◽

Lessons Learned ◽

Knowledge Sources ◽

Lexical Knowledge ◽

Context Free Grammar ◽

The Impact ◽

Context Free ◽

Probabilistic Context

We study constituent parsing of German, a morphologically rich and less-configurational language. We use a probabilistic context-free grammar treebank grammar that has been adapted to the morphologically rich properties of German by markovization and special features added to its productions. We evaluate the impact of adding lexical knowledge. Then we examine both monolingual and bilingual approaches to parse reranking. Our reranking parser is the new state of the art in constituency parsing of the TIGER Treebank. We perform an analysis, concluding with lessons learned, which apply to parsing other morphologically rich and less-configurational languages.
