A Fast and Simple Online Synchronous Context Free Grammar Extractor
Abstract Hierarchical phrase-based machine translation systems rely on the synchronous context free grammar formalism to learn and use translation rules containing gaps. The grammars learned by such systems become unmanageably large even for medium sized parallel corpora. The traditional approach of preprocessing the training data and loading all possible translation rules into memory does not scale well for hierarchical phrase-based systems. Online grammar extractors address this problem by constructing memory efficient data structures on top of the source sideof the parallel data (often based on suffix arrays), which are usedto efficiently match phrases in the corpus and to extract translation rules on the fly during decoding. This paper describes an open source implementation of an online synchronous context free grammar extractor. Our approach builds on the work of Lopez (2008a) and introduces a new technique for extending the lists of phrase matches for phrases containing gaps that reduces the extraction time by a factor of 4. Our extractor is available as part of the cdec toolkit1 (Dyer et al., 2010).