Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

Information ◽  
2019 ◽  
Vol 11 (1) ◽  
pp. 24
Author(s):  
Yang Yuan ◽  
Xiao Li ◽  
Ya-Ting Yang

To overcome data sparseness when training word embeddings for low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves the performance of word embeddings for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
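As a rough illustration of what a punctuation-based distance attenuation could look like, the sketch below builds weighted word-pair co-occurrence counts in which the usual 1/distance weight is multiplied by a penalty for every punctuation mark lying between the two words. The penalty value, the punctuation set, and the function name `cooccurrence` are assumptions made for illustration only; the abstract does not specify the paper's actual attenuation function.

```python
from collections import defaultdict

# Assumed punctuation set; the paper's actual set is not given in the abstract.
PUNCTUATION = {",", ".", ";", ":", "!", "?"}


def cooccurrence(tokens, window=5, punct_penalty=0.5):
    """Weighted co-occurrence counts with a hypothetical punctuation penalty.

    Each word pair within `window` words of each other receives the weight
    (1 / distance) * punct_penalty ** (number of punctuation marks between).
    Distance counts words only; punctuation tokens are skipped.
    """
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        if center in PUNCTUATION:
            continue
        distance = 0  # number of words between center and context, inclusive
        crossed = 0   # punctuation marks crossed so far
        for j in range(i + 1, len(tokens)):
            if tokens[j] in PUNCTUATION:
                crossed += 1
                continue
            distance += 1
            if distance > window:
                break
            weight = (1.0 / distance) * (punct_penalty ** crossed)
            # Keep the matrix symmetric, as in GloVe-style counting.
            counts[(center, tokens[j])] += weight
            counts[(tokens[j], center)] += weight
    return dict(counts)


tokens = "the cat sat , the dog ran".split()
matrix = cooccurrence(tokens, window=2)
```

With this weighting, words separated by a comma contribute less to each other's counts than words at the same distance inside one clause, which is the intuition the abstract describes.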

Author(s):  
Zi Lin ◽  
Yang Liu

Previously, researchers paid no attention to the creation of unambiguous morpheme embeddings independent from the corpus, while such information plays an important role in expressing the exact meanings of words for parataxis languages like Chinese. In this paper, after constructing the Chinese lexical and semantic ontology based on word-formation, we propose a novel approach to implanting the structured rational knowledge into distributed representation at morpheme level, naturally avoiding heavy disambiguation in the corpus. We design a template to create the instances as pseudo-sentences merely from the pieces of knowledge of morphemes built in the lexicon. To exploit hierarchical information and tackle the data sparseness problem, the instance proliferation technique is applied based on similarity to expand the collection of pseudo-sentences. The distributed representation for morphemes can then be trained on these pseudo-sentences using word2vec. For evaluation, we validate the paradigmatic and syntagmatic relations of morpheme embeddings, and apply the obtained embeddings to word similarity measurement, achieving significant improvements over the classical models by more than 5 Spearman scores or 8 percentage points, which shows very promising prospects for adoption of the new source of knowledge.


2021 ◽  
Vol 11 (4) ◽  
pp. 3769-3783
Author(s):  
Merin Cherian ◽  
Kannan Balakrishnan

An evaluation of static word embedding models for Malayalam is conducted in this paper. In this work, we have created a well-documented and pre-processed corpus for Malayalam. Word vectors were created for this corpus using three different word embedding models and they were evaluated using intrinsic evaluators. Quality of word representation is tested using word analogy, word similarity and concept categorization. The testing is independent of the downstream language processing tasks. Experimental results on Malayalam word representations of GloVe, FastText and Word2Vec are reported in this work. It is shown that higher-dimensional word representation and larger window size gave better results on intrinsic evaluators.
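One common way to run the word analogy test mentioned above is vector offset with cosine similarity: for "a is to b as c is to ?", pick the word whose vector is closest to b - a + c. The sketch below is a minimal version of that procedure; the toy vectors and English words are made up for illustration and are not taken from the paper's Malayalam data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Return the word d that maximizes cos(d, b - a + c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in {a, b, c})
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical 3-dimensional embeddings, chosen only to make the example work.
toy_vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king": [3.0, 0.0, 1.0],
    "queen": [3.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 3.0],
}

answer = solve_analogy(toy_vectors, "man", "woman", "king")
```

Analogy accuracy is then the fraction of test questions for which the returned word matches the expected one.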


Author(s):  
Segen F. Estefen ◽  
Paulo Roberto da Costa ◽  
Eliab Ricarte ◽  
Marcelo M. Pinheiro

Wave energy is a renewable and non-polluting source, and its use is being studied in different countries. The paper presents an overview of the harnessing of energy from waves and of the activities associated with setting up a plant for extracting wave energy at the Port of Pecem, on the coast of Ceara State, Brazil. The technology employed is based on storing water under pressure in a hyperbaric chamber, from which a controlled jet of water drives a standard turbine. The wave resource at the proposed location is presented in terms of statistical data obtained from previous monitoring. The device components are described, and a small-scale model is tested under regular waves representative of the installation region. Based on the experimental results, values of prescribed pressures are identified in order to optimize power generation.


Author(s):  
Rashmini Naranpanawa ◽  
Ravinga Perera ◽  
Thilakshi Fonseka ◽  
Uthayasanker Thayasivam

Neural machine translation (NMT) is a remarkable approach that performs much better than statistical machine translation (SMT) models when parallel corpora are abundant. However, vanilla NMT operates primarily at the word level with a fixed vocabulary. Therefore, low-resource, morphologically rich languages such as Sinhala are heavily affected by the out-of-vocabulary (OOV) and rare word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, regardless of whether a large parallel corpus is available.
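Byte pair encoding (BPE) is one widely used subword technique; the abstract does not say which techniques the authors compared, so the following is only a generic sketch of how BPE merges are learned. The function name `learn_bpe` and the toy English word frequencies are assumptions for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary.

    Words start as sequences of characters plus an end-of-word marker.
    Each step merges the most frequent adjacent symbol pair.
    """
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing occurrences of the best pair.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


# Toy corpus statistics, invented for this example.
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
```

Because rare or unseen words can still be assembled from learned subword units, a model trained on such segments never has to emit an unknown-word token, which is what makes the vocabulary open.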


2021 ◽  
Vol 8 (4) ◽  
pp. 729
Author(s):  
Ema Rachmawati ◽  
Nur Azizah Agustina ◽  
Febryanti Sthevanie

Race can be used to categorize humans in populations or large groups. Therefore, race recognition can make it easier to identify a person and help narrow the scope of a search. The use of faces as a basis for race recognition directs research toward identifying the facial parts that significantly influence recognition performance. In this study, the nose and mouth were identified as the basis for recognizing the Mongoloid, Caucasoid, and Negroid races. Gray Level Co-occurrence Matrix (GLCM) features are extracted from the nose and mouth and then classified using Random Forest. The experimental results show that combining features of the nose and mouth produces the best system performance compared to using the nose or mouth alone.
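A gray level co-occurrence matrix counts how often a pixel with gray level i has a neighbor with gray level j at a fixed offset; texture features such as contrast are then computed from those counts. The sketch below shows the basic computation on a tiny image. The 4x4 image, the horizontal offset, and the choice of contrast as the feature are assumptions for illustration; the abstract does not state which offsets or GLCM statistics the study used.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count gray-level pairs (image[r][c], image[r + dy][c + dx])."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """GLCM contrast: sum of (i - j)^2 weighted by normalized counts."""
    total = sum(sum(row) for row in matrix)
    weighted = sum((i - j) ** 2 * count
                   for i, row in enumerate(matrix)
                   for j, count in enumerate(row))
    return weighted / total


# Toy 4x4 image with 4 gray levels (0..3), invented for this example.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
counts = glcm(image, levels=4)
texture_contrast = contrast(counts)
```

In a pipeline like the one described, feature values of this kind from the nose and mouth regions would be concatenated into one vector per face before classification.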


2021 ◽  
Vol 2021 ◽  
pp. 1-10
Author(s):  
Michael Adjeisah ◽  
Guohua Liu ◽  
Douglas Omwenga Nyabuga ◽  
Richard Nuetey Nortey ◽  
Jinling Song

Scaling natural language processing (NLP) to low-resourced languages to improve machine translation (MT) performance remains enigmatic. This research contributes to the domain with a low-resource English-Twi translation system based on filtered synthetic-parallel corpora. It is often perplexing to learn and understand what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve the MT performance in such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we performed unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we extensively use three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel data demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach exhibits substantial improvements in BLEU and TER scores.
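The squared Mahalanobis distance measures how far a point lies from the mean of a distribution, scaled by its covariance, so unusual sentence pairs receive large values and can be filtered out. The sketch below computes it for two features per sentence pair. The two-feature setup and all function names are assumptions for illustration; the abstract does not say which features the authors measured.

```python
def mean_and_covariance(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def squared_mahalanobis(point, mean, cov):
    """(p - mean)^T cov^{-1} (p - mean) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # Inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (dx * (d * dx - b * dy) + dy * (-c * dx + a * dy)) / det


# Hypothetical features per sentence pair, e.g. two alignment-related scores.
features = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
mean, cov = mean_and_covariance(features)
distance = squared_mahalanobis((3.0, 1.0), mean, cov)
```

A filter built on this would keep only sentence pairs whose distance falls below a chosen threshold.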


Author(s):  
Nidhi Agrawal ◽  
Bhuvnesh Kumar ◽  
Bhanu Verma ◽  
Harald Mehling ◽  
Bharti Arora

2014 ◽  
Vol 54 (1) ◽  
pp. 74-78
Author(s):  
Pavel Skopec ◽  
Jan Hrdlička ◽  
Michal Kaválek

This paper deals with determining the specific emissions from the combustion of two kinds of biomass fuels in a small-scale boiler. The tested fuels were pellets made of wood and pellets made of rape plant straw. In order to evaluate the specific emissions, several combustion experiments were carried out using a commercial 25 kW pellet-fired boiler. The specific emissions of CO, SO₂ and NOₓ were evaluated in relation to a unit of burned fuel, a unit of calorific value and a unit of produced heat. The specific emissions were compared with some data acquired from the reference literature, with relatively different results. The differences depend mainly on the procedure used for determining the values, and references provide no information about this. Although some of our experimental results may fit with one of the reference sources, they do not fit with the other. The reliability of the references is therefore disputable.
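The three reference bases named above are related by simple arithmetic: emissions per unit of fuel divided by the fuel's heating value give emissions per unit of calorific value, and dividing again by boiler efficiency gives emissions per unit of produced heat. The sketch below shows one straightforward way to do the conversion. The function name and every number in the example are hypothetical; they are not measurements from the paper, and the paper's own procedure may differ.

```python
def specific_emissions(mass_emitted_mg, fuel_burned_kg,
                       lhv_mj_per_kg, efficiency):
    """Express one pollutant's emissions on three reference bases.

    Returns (mg per kg of fuel, mg per MJ of fuel energy, mg per MJ of heat).
    `lhv_mj_per_kg` is the fuel's lower heating value and `efficiency` is
    the boiler's thermal efficiency as a fraction between 0 and 1.
    """
    per_kg_fuel = mass_emitted_mg / fuel_burned_kg
    per_mj_fuel = per_kg_fuel / lhv_mj_per_kg
    per_mj_heat = per_mj_fuel / efficiency
    return per_kg_fuel, per_mj_fuel, per_mj_heat


# Hypothetical inputs: 9000 mg of a pollutant from 2 kg of pellets,
# a lower heating value of 18 MJ/kg, and 80 % boiler efficiency.
per_kg, per_mj_fuel, per_mj_heat = specific_emissions(9000.0, 2.0, 18.0, 0.8)
```

Because the result depends on which heating value and efficiency are used, two sources reporting the same measurement on different bases can disagree noticeably, which is consistent with the discrepancy the abstract points out.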

