Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

Yang Yuan; Xiao Li; Ya-Ting Yang

doi:10.3390/info11010024

Information ◽

10.3390/info11010024 ◽

2019 ◽

Vol 11 (1) ◽

pp. 24

Author(s):

Yang Yuan ◽

Xiao Li ◽

Ya-Ting Yang

Keyword(s):

Word Pair ◽

Word Embedding ◽

Experimental Results ◽

Small Scale ◽

Word Similarity ◽

Parallel Corpus ◽

Low Resource ◽

Data Sparseness ◽

Percentage Points ◽

Occurrence Matrix

To overcome data sparseness when training word embeddings for low-resource languages, we propose a punctuation and parallel corpus based word embedding model. In particular, we generate the global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train the word embeddings. Experimental results show that, compared with several widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding quality for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
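
The abstract does not give the attenuation function itself, so the following is only an illustrative sketch of the general idea. It assumes GloVe-style 1/d distance weighting and adds a hypothetical extra penalty each time a punctuation mark lies between the two words. The names `cooccurrence`, `PUNCT` and `punct_penalty` are assumptions made for this example, not the authors' definitions.

```python
from collections import defaultdict

PUNCT = {",", ".", ";", ":", "!", "?"}  # assumed punctuation set


def cooccurrence(tokens, window=5, punct_penalty=0.5):
    """Weighted, symmetric word-pair co-occurrence counts.

    Each pair within `window` word positions contributes 1/d (as in GloVe),
    multiplied by `punct_penalty` once for every punctuation mark that lies
    between the two words. This weighting is a hypothetical stand-in for the
    paper's punctuation-based attenuation function.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        if word in PUNCT:
            continue
        distance = 0  # distance in words; punctuation is not counted
        crossed = 0   # punctuation marks seen between the two words
        for context in tokens[i + 1:]:
            if context in PUNCT:
                crossed += 1
                continue
            distance += 1
            if distance > window:
                break
            weight = (punct_penalty ** crossed) / distance
            counts[(word, context)] += weight
            counts[(context, word)] += weight
    return dict(counts)


tokens = "the cat sat , the dog ran .".split()
matrix = cooccurrence(tokens, window=2)
```

In this toy sentence, "sat" and "dog" are two words apart and separated by a comma, so their weight is 0.5 / 2 = 0.25, whereas the adjacent pair "cat" and "sat" keeps the full weight of 1.0.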

Implanting Rational Knowledge into Distributed Representation at Morpheme Level

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33012954 ◽

2019 ◽

Vol 33 ◽

pp. 2954-2961

Author(s):

Zi Lin ◽

Yang Liu

Keyword(s):

Word Formation ◽

Similarity Measurement ◽

Distributed Representation ◽

Word Similarity ◽

Data Sparseness ◽

Percentage Points ◽

Novel Approach ◽

Classical Models ◽

Semantic Ontology ◽

Rational Knowledge

Previously, researchers paid no attention to the creation of unambiguous morpheme embeddings independent from the corpus, while such information plays an important role in expressing the exact meanings of words for parataxis languages like Chinese. In this paper, after constructing the Chinese lexical and semantic ontology based on word-formation, we propose a novel approach to implanting the structured rational knowledge into distributed representation at morpheme level, naturally avoiding heavy disambiguation in the corpus. We design a template to create the instances as pseudo-sentences merely from the pieces of knowledge of morphemes built in the lexicon. To exploit hierarchical information and tackle the data sparseness problem, the instance proliferation technique is applied based on similarity to expand the collection of pseudo-sentences. The distributed representation for morphemes can then be trained on these pseudo-sentences using word2vec. For evaluation, we validate the paradigmatic and syntagmatic relations of morpheme embeddings, and apply the obtained embeddings to word similarity measurement, achieving significant improvements over the classical models by more than 5 Spearman scores or 8 percentage points, which shows very promising prospects for adoption of the new source of knowledge.

Evaluating Word Embedding Models for Malayalam

Revista Gestão Inovação e Tecnologias ◽

10.47059/revistageintec.v11i4.2406 ◽

2021 ◽

Vol 11 (4) ◽

pp. 3769-3783

Author(s):

Merin Cherian ◽

Kannan Balakrishnan

Keyword(s):

Language Processing ◽

Window Size ◽

Word Embedding ◽

Experimental Results ◽

Word Similarity ◽

Higher Dimensional ◽

Word Representation

This paper evaluates static word embedding models for Malayalam. We created a well-documented, pre-processed corpus for Malayalam, trained word vectors on it using three different word embedding models, and evaluated them with intrinsic evaluators. The quality of the word representations is tested using word analogy, word similarity and concept categorization, independently of any downstream language processing task. Experimental results for Malayalam word representations from GloVe, FastText and Word2Vec are reported. Higher-dimensional word representations and larger window sizes gave better results on the intrinsic evaluators.
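
As a rough illustration of the word analogy evaluator mentioned above, the sketch below answers "a is to b as c is to ?" by picking the word whose vector is closest in cosine similarity to b - a + c. The toy English vectors and the function names are assumptions made for this example; the paper's Malayalam vectors and test sets are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(vectors, a, b, c):
    """Return the word d that maximizes cos(d, b - a + c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hand-made toy vectors (hypothetical, for illustration only).
toy = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
```

Analogy accuracy is then the fraction of test quadruples for which the returned word matches the expected one.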

Wave Energy Hyperbaric Device for Electricity Production

Volume 5: Ocean Space Utilization; Polar and Arctic Sciences and Technology; The Robert Dean Symposium on Coastal and Ocean Engineering; Special Symposium on Offshore Renewable Energy ◽

10.1115/omae2007-29743 ◽

2007 ◽

Cited By ~ 7

Author(s):

Segen F. Estefen ◽

Paulo Roberto da Costa ◽

Eliab Ricarte ◽

Marcelo M. Pinheiro

Keyword(s):

Power Generation ◽

Wave Energy ◽

Experimental Results ◽

Electricity Production ◽

Scale Model ◽

Small Scale ◽

Regular Waves ◽

Hyperbaric Chamber ◽

Small Scale Model

Wave energy is a renewable and non-polluting source, and its use is being studied in different countries. The paper presents an overview of the harnessing of energy from waves and of the activities associated with setting up a plant for extracting wave energy in the Port of Pecem, on the coast of Ceara State, Brazil. The technology employed is based on storing water under pressure in a hyperbaric chamber, from which a controlled jet of water drives a standard turbine. The wave resource at the proposed location is presented in terms of statistical data obtained from previous monitoring. The device components are described, and a small-scale model is tested under regular waves representative of the installation region. Based on the experimental results, values of prescribed pressures are identified in order to optimize the power generation.

Analyzing Subword Techniques to Improve English to Sinhala Neural Machine Translation

International Journal of Asian Language Processing ◽

10.1142/s2717554520500174 ◽

2021 ◽

pp. 2050017

Author(s):

Rashmini Naranpanawa ◽

Ravinga Perera ◽

Thilakshi Fonseka ◽

Uthayasanker Thayasivam

Keyword(s):

Machine Translation ◽

State Of The Art ◽

Statistical Machine Translation ◽

Translation System ◽

Rare Word ◽

Neural Machine Translation ◽

Parallel Corpus ◽

Low Resource ◽

Word Level ◽

Morphologically Rich Languages

Neural machine translation (NMT) performs much better than statistical machine translation (SMT) models when parallel corpora are abundant. However, vanilla NMT operates primarily at the word level with a fixed vocabulary. Low-resource, morphologically rich languages such as Sinhala are therefore strongly affected by the out-of-vocabulary (OOV) and rare-word problems. Recent advancements in subword techniques have opened up opportunities for low-resource communities by enabling open-vocabulary translation. In this paper, we extend our recently published state-of-the-art EN-SI translation system using the transformer and explore standard subword techniques on top of it to identify which subword approach has a greater effect on the English-Sinhala language pair. Our models demonstrate that subword segmentation strategies combined with state-of-the-art NMT can perform remarkably well when translating English sentences into a morphologically rich language, even without a large parallel corpus.
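
The abstract does not say which subword techniques were compared. As one widely used example, the sketch below learns byte-pair encoding (BPE) merges from a word-frequency table; the function name `learn_bpe`, the `</w>` end-of-word marker and the toy English frequencies are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """Learn BPE merge operations from a {word: frequency} dictionary."""
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs, weighted by frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to the first seen
        merges.append(best)
        # Rewrite every word, replacing the best pair with one merged symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges, vocab


word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
```

After three merges, the frequent suffix "est" has become a single symbol, which is how a rare or unseen word can still be segmented into known units rather than mapped to an OOV token.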

Pengenalan Ras Berdasarkan Hidung Dan Mulut Menggunakan Gray Level Co-Occurrence Matrix (Race Recognition Based on the Nose and Mouth Using Gray Level Co-Occurrence Matrix)

Jurnal Teknologi Informasi dan Ilmu Komputer ◽

10.25126/jtiik.2021844366 ◽

2021 ◽

Vol 8 (4) ◽

pp. 729

Author(s):

Ema Rachmawati ◽

Nur Azizah Agustina ◽

Febryanti Sthevanie

Keyword(s):

Random Forest ◽

System Performance ◽

Experimental Results ◽

Gray Level ◽

The Face ◽

Combined Features ◽

Occurrence Matrix ◽

Large Groups ◽

Performance Of Race

Race can be used to categorize humans into populations or large groups. Race recognition can therefore make it easier to identify a person and help narrow the scope of a search. Using the face as the basis for race recognition directs research toward identifying which facial parts significantly influence recognition performance. In this study, the nose and mouth regions are used as the basis for recognizing the Mongoloid, Caucasoid, and Negroid races. Gray Level Co-occurrence Matrix (GLCM) features are extracted from the nose and mouth and then classified using Random Forest. The experimental results show that combined nose and mouth features yield the best system performance compared with using the nose or the mouth alone.
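
To make the GLCM feature concrete, here is a minimal pure-Python sketch that counts gray-level co-occurrences for one pixel offset and derives the contrast statistic from them. The offset, the number of gray levels and the choice of contrast as the feature are assumptions made for this example; the abstract does not state which GLCM settings or statistics the study used, and the Random Forest step is not shown.

```python
def glcm(image, levels, dx=1, dy=0, symmetric=True):
    """Count co-occurrences of gray levels for pixel pairs at offset (dy, dx)."""
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                matrix[i][j] += 1
                if symmetric:
                    matrix[j][i] += 1  # also count the pair in reverse order
    return matrix


def contrast(matrix):
    """GLCM contrast: sum of p(i, j) * (i - j)^2 over the normalized matrix."""
    total = sum(map(sum, matrix))
    return sum(((i - j) ** 2) * value / total
               for i, row in enumerate(matrix)
               for j, value in enumerate(row))


# Tiny 4-level test image (hypothetical data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
horizontal = glcm(image, levels=4, symmetric=False)
feature = contrast(glcm(image, levels=4))
```

In a pipeline like the one described, statistics such as this would be computed separately for the nose and mouth regions and concatenated into one feature vector for the classifier.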

Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Computational Intelligence and Neuroscience ◽

10.1155/2021/6682385 ◽

2021 ◽

Vol 2021 ◽

pp. 1-10

Author(s):

Michael Adjeisah ◽

Guohua Liu ◽

Douglas Omwenga Nyabuga ◽

Richard Nuetey Nortey ◽

Jinling Song

Keyword(s):

Machine Translation ◽

Language Processing ◽

Training Data ◽

Target Language ◽

Similarity Metrics ◽

Mahalanobis Distances ◽

Parallel Corpora ◽

Parallel Corpus ◽

Low Resource ◽

Sentence Level

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains an open problem. This research contributes to the domain with low-resource English-Twi translation based on filtered synthetic-parallel corpora. It is often unclear what a good-quality corpus looks like in low-resource conditions, mainly where the target corpus is the only sample text of the parallel language. To improve MT performance for such low-resource language pairs, we propose to expand the training data by injecting a synthetic-parallel corpus obtained by translating a monolingual corpus from the target language, based on bootstrapping with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpus demonstrate that injecting a pseudoparallel corpus and filtering extensively with sentence-level similarity metrics significantly improves the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach shows substantial improvements in BLEU and TER scores.
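
The abstract does not specify which sentence-pair features feed the Mahalanobis filter, so the sketch below is only a generic illustration. It assumes each sentence pair has already been reduced to two numeric features (hypothetical here) and computes each pair's squared Mahalanobis distance to the sample mean using the closed-form inverse of a 2x2 covariance matrix. Pairs with unusually large distances would be candidates for removal.

```python
def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Unbiased sample covariance entries.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for the 2x2 case.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical features per sentence pair; the last pair is a clear outlier.
features = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9), (1.0, 1.0), (3.0, 0.2)]
distances = squared_mahalanobis_2d(features)
worst = max(range(len(distances)), key=distances.__getitem__)
```

Where the cut-off is set (a fixed threshold, a percentile, or a chi-squared quantile) is a design choice this sketch leaves open.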

Experimental results of a solar space heating system with PCM: Small-scale setup and real-scale setup

Materials Today Proceedings ◽

10.1016/j.matpr.2021.12.002 ◽

2021 ◽

Author(s):

Nidhi Agrawal ◽

Bhuvnesh Kumar ◽

Bhanu Verma ◽

Harald Mehling ◽

Bharti Arora

Keyword(s):

Heating System ◽

Experimental Results ◽

Space Heating ◽

Small Scale ◽

Real Scale

Specific Emissions from Biomass Combustion

Acta Polytechnica ◽

10.14311/ap.2014.54.0074 ◽

2014 ◽

Vol 54 (1) ◽

pp. 74-78 ◽

Cited By ~ 2

Author(s):

Pavel Skopec ◽

Jan Hrdlička ◽

Michal Kaválek

Keyword(s):

Calorific Value ◽

Experimental Results ◽

The Other ◽

Biomass Combustion ◽

Small Scale ◽

Biomass Fuels ◽

Rape Plant ◽

Specific Emissions

This paper deals with determining the specific emissions from the combustion of two kinds of biomass fuels in a small-scale boiler. The tested fuels were pellets made of wood and pellets made of rape plant straw. In order to evaluate the specific emissions, several combustion experiments were carried out using a commercial 25 kW pellet-fired boiler. The specific emissions of CO, SO2 and NOx were evaluated in relation to a unit of burned fuel, a unit of calorific value and a unit of produced heat. The specific emissions were compared with some data acquired from the reference literature, with relatively different results. The differences depend mainly on the procedure used for determining the values, and references provide no information about this. Although some of our experimental results may fit with one of the reference sources, they do not fit with the other. The reliability of the references is therefore disputable.

Tilting plane tests on a small-scale masonry cross vault: Experimental results and numerical simulations through a heterogeneous approach

Engineering Structures ◽

10.1016/j.engstruct.2016.05.017 ◽

2016 ◽

Vol 123 ◽

pp. 300-312 ◽

Cited By ~ 19

Author(s):

G. Milani ◽

M. Rossi ◽

C. Calderini ◽

S. Lagomarsino

Keyword(s):

Numerical Simulations ◽

Experimental Results ◽

Small Scale

Combining Word Embedding and Semantic Lexicon for Chinese Word Similarity Computation

Natural Language Understanding and Intelligent Applications - Lecture Notes in Computer Science ◽

10.1007/978-3-319-50496-4_69 ◽

2016 ◽

pp. 766-777 ◽

Cited By ~ 6

Author(s):

Jiahuan Pei ◽

Cong Zhang ◽

Degen Huang ◽

Jianjun Ma

Keyword(s):

Word Embedding ◽

Chinese Word ◽

Word Similarity ◽

Similarity Computation
