A Comprehensive Survey on Word Representation Models: From Classical to State-of-the-Art Word Representation Language Models

ACM Transactions on Asian and Low-Resource Language Information Processing ◽

10.1145/3434237 ◽

2021 ◽

Vol 20 (5) ◽

pp. 1-35

Author(s):

Usman Naseem ◽

Imran Razzak ◽

Shah Khalid Khan ◽

Mukesh Prasad

Keyword(s):

Language Processing ◽

State Of The Art ◽

Research Area ◽

Language Models ◽

Word Representation ◽

Comprehensive Survey ◽

History Of ◽

Important Research Area ◽

Representation Language ◽

Vector Representations

Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and their power of expression, from the classical to modern-day state-of-the-art (SOTA) word representation language models (LMs). We describe a variety of text representation methods and model designs that have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations capturing the same semantic information. Further, such representations can be utilized by various machine learning (ML) algorithms for a variety of NLP-related tasks. Finally, this survey briefly discusses the commonly used ML- and deep learning (DL)-based classifiers, evaluation metrics, and the applications of these word embeddings in different NLP tasks.

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English, while for Indonesian the related resources (both datasets and technology) are not yet well developed. In addition, affixes are an important part of word composition in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by the current top-performing models are insufficient. To address the Indonesian NER task, in this paper we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.

Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Applied Sciences ◽

10.3390/app9183648 ◽

2019 ◽

Vol 9 (18) ◽

pp. 3648

Author(s):

Casper S. Shikali ◽

Zhou Sijie ◽

Liu Qihe ◽

Refuoe Mokhosi

Keyword(s):

Language Processing ◽

Critical Role ◽

Language Model ◽

Central Africa ◽

Spoken Language ◽

Language Models ◽

Word Embeddings ◽

Word Representation

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, which is a low-resource and widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the concern of word representation for agglutinative and syllabic-based languages. Inspired by the learning methodology of Swahili in beginner classes, we encoded respective syllables instead of characters, character n-grams or morphemes of words and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by the state-of-the-art results in the syllable-aware language model on both the small dataset (31.229 perplexity value) and the medium dataset (45.859 perplexity value), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not been used to compose word representation vectors. Therefore, the main contributions of the study are a syllabic alphabet, WEFSE, a syllable-aware language model and a word analogy dataset for Swahili.
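
For readers unfamiliar with the perplexity values quoted in this abstract, the following is a minimal sketch of the standard definition (the exponential of the mean negative log-probability per token). It is not the authors' evaluation code; the function name `perplexity` and the toy probabilities are assumptions made for illustration only.

```python
import math


def perplexity(token_probs):
    """Return exp(mean negative log-probability) over the given tokens.

    token_probs: probabilities a language model assigned to each
    observed token (hypothetical inputs, each in the range (0, 1]).
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that spreads its belief evenly over four choices at every step
# has a perplexity of about 4: it is "as uncertain as" a four-way guess.
uniform_probs = [0.25] * 8
print(perplexity(uniform_probs))
```

Lower values mean the model assigns higher probability to the held-out text, which is why the abstract treats lower perplexity as the better result.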

Text: An R-package for Analyzing and Visualizing Human Language Using Natural Language Processing and Deep Learning

10.31234/osf.io/293kt ◽

2021 ◽

Author(s):

Oscar Nils Erik Kjell ◽

H. Andrew Schwartz ◽

Salvatore Giorgi

Keyword(s):

Deep Learning ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Rating Scale ◽

State Of The Art ◽

R Package ◽

Language Models ◽

Categorical Variables ◽

Human Language

The language that individuals use for expressing themselves contains rich psychological information. Recent significant advances in Natural Language Processing (NLP) and Deep Learning (DL), namely transformers, have resulted in large performance gains in tasks related to understanding natural language such as machine translation. However, these state-of-the-art methods have not yet been made easily accessible for psychology researchers, nor designed to be optimal for human-level analyses. This tutorial introduces text (www.r-text.org), a new R-package for analyzing and visualizing human language using transformers, the latest techniques from NLP and DL. Text is both a modular solution for accessing state-of-the-art language models and an end-to-end solution catered for human-level analyses. Hence, text provides user-friendly functions tailored to test hypotheses in social sciences for both relatively small and large datasets. This tutorial describes useful methods for analyzing text, providing functions with reliable defaults that can be used off-the-shelf as well as a framework for advanced users to build on for novel techniques and analysis pipelines. The reader learns about six methods: 1) textEmbed: to transform text to traditional or modern transformer-based word embeddings (i.e., numeric representations of words); 2) textTrain: to examine the relationships between text and numeric/categorical variables; 3) textSimilarity and 4) textSimilarityTest: to compute semantic similarity scores between texts and to test the significance of the difference in meaning between two sets of texts; and 5) textProjection and 6) textProjectionPlot: to examine and visualize text within the embedding space according to latent or specified construct dimensions (e.g., low to high rating scale scores).
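
As a rough illustration of what a semantic similarity score between two embeddings can look like, here is a minimal sketch using cosine similarity. It is written in Python rather than R, it does not call the text package, and the choice of cosine similarity, the function name `cosine_similarity`, and the toy vectors are all assumptions for illustration; the abstract does not state which measure textSimilarity uses.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional "embeddings" of two short texts.
text_a = [0.2, 0.7, 0.1]
text_b = [0.25, 0.6, 0.05]
print(cosine_similarity(text_a, text_b))
```

Scores close to 1 indicate vectors pointing in nearly the same direction, scores near 0 indicate unrelated directions.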

Audio Captioning with Composition of Acoustic and Semantic Information

International Journal of Semantic Computing ◽

10.1142/s1793351x21400018 ◽

2021 ◽

Vol 15 (02) ◽

pp. 143-160

Author(s):

Ayşegül Özkaya Eren ◽

Mustafa Sert

Keyword(s):

Language Processing ◽

Semantic Information ◽

State Of The Art ◽

Research Area ◽

Audio Features ◽

Audio Clip ◽

Proposed Model ◽

Decoder Architecture ◽

Gated Recurrent Units ◽

New Research

Generating audio captions is a new research area that combines audio and natural language processing to create meaningful textual descriptions for audio clips. To address this problem, previous studies mostly use encoder–decoder-based models without considering semantic information. To fill this gap, we present a novel encoder–decoder architecture using bi-directional Gated Recurrent Units (BiGRU) with audio and semantic embeddings. We extract semantic embeddings by obtaining subjects and verbs from the audio clip captions and combine these embeddings with audio embeddings to feed the BiGRU-based encoder–decoder model. To enable semantic embeddings for the test audios, we introduce a Multilayer Perceptron classifier to predict the semantic embeddings of those clips. We also present exhaustive experiments to show the efficiency of different features and datasets for our proposed model on the audio captioning task. To extract audio features, we use log Mel energy features, VGGish embeddings, and pretrained audio neural network (PANN) embeddings. Extensive experiments on two audio captioning datasets, Clotho and AudioCaps, show that our proposed model outperforms state-of-the-art audio captioning models across different evaluation metrics and that using the semantic information improves the captioning performance.

History of Bioelectrical Study and the Electrophysiology of the Primo Vascular System

Evidence-based Complementary and Alternative Medicine ◽

10.1155/2013/486823 ◽

2013 ◽

Vol 2013 ◽

pp. 1-14 ◽

Cited By ~ 2

Author(s):

Sang Hyun Park ◽

Eung Hwi Kim ◽

Ho Jong Chang ◽

Seung Zhoo Yoon ◽

Ji Woong Yoon ◽

...

Keyword(s):

Vascular System ◽

Electrophysiological Study ◽

Research Result ◽

Circulatory System ◽

Research Area ◽

Anatomical Structure ◽

Important Research ◽

Research Results ◽

History Of ◽

Important Research Area

Background. The primo vascular system is a new anatomical structure whose research results have reported the possibility of a new circulatory system similar to the blood vascular system and cells. Electrophysiology, which measures and analyzes the bioelectrical signals of tissues and cells, is an important research area for investigating the function of tissues and cells. Bioelectrical studies of the primo vascular system using modern techniques have been reported since the early 1960s, beginning with Bonghan Kim. This paper reviews the research results of electrophysiological studies of the primo vascular system in order to discuss its circulatory function. We hope it will help researchers study the electrophysiology of the primo vascular system. This paper will use the following exchangeable expressions: Kyungrak system = Bonghan system = Bonghan circulatory system = primo vascular system = primo system; Bonghan corpuscle = primo node; Bonghan duct = primo vessel. We think that objective descriptions of reviewed papers are more important than unified expressions when citing the papers. That said, this paper will unify the expressions of the primo vascular system.

A comprehensive survey on data provenance: State-of-the-art approaches and their deployments for IoT security enforcement

Journal of Computer Security ◽

10.3233/jcs-200108 ◽

2021 ◽

pp. 1-24

Author(s):

Md Morshed Alam ◽

Weichao Wang

Keyword(s):

State Of The Art ◽

Data Provenance ◽

Attack Behavior ◽

Malicious Attack ◽

Survey Paper ◽

Comprehensive Information ◽

Comprehensive Survey ◽

History Of ◽

System Data ◽

Iot Devices

Data provenance collects comprehensive information about the events and operations in a computer system at both application and kernel levels. It provides a detailed and accurate history of transactions that helps delineate the data flow scenario across the whole system. Data provenance helps achieve system resilience by uncovering several malicious attack traces after a system compromise that are leveraged by the analyzer to understand the attack behavior and discover the level of damage. Existing literature demonstrates a number of research efforts on information capture, management, and analysis of data provenance. In recent years, provenance in IoT devices has attracted several research efforts because of the proliferation of commodity IoT devices. In this survey paper, we present a comparative study of the state-of-the-art approaches to provenance by classifying them based on frameworks, deployed techniques, and subjects of interest. We also discuss the emergence and scope of data provenance in IoT networks. Finally, we present the urgency in several directions that data provenance needs to pursue, including data management and analysis.

Concept Recognition as a Machine Translation Problem

10.1101/2020.12.03.410829 ◽

2020 ◽

Author(s):

Mayla R Boguslav ◽

Negacy D Hailu ◽

Michael Bada ◽

William A Baumgartner ◽

Lawrence E Hunter

Keyword(s):

Machine Learning ◽

Machine Translation ◽

Language Processing ◽

State Of The Art ◽

Training Data ◽

Language Models ◽

Alternative Methods ◽

Automated Assignment ◽

Concept Recognition ◽

Alternative Approaches

Background. Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data have impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models had the potential to outperform multi-class classification approaches. Here we systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning. Results. We report on our extensive studies of alternative methods and hyperparameter selections. The results not only identify the best-performing systems and parameters across a wide variety of ontologies but also illuminate the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggests promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) for span detection (as previously found) along with the Open-source Toolkit for Neural Machine Translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies in the CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time, than several alternative approaches. Conclusions. Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT Shared Task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.

A Comprehensive Exploration of Pre-training Language Models

10.36227/techrxiv.14820348 ◽

2021 ◽

Author(s):

Tong Guo

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Contextual Information ◽

Experimental Results ◽

Language Models

Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to a new state of the art. In this paper we explore the efficiency of various pre-trained language models. We pre-train a list of transformer-based models with the same amount of text and the same number of training steps. The experimental results show that the largest improvement over the original BERT comes from adding an RNN layer to capture more contextual information for the transformer-encoder layers.

Concept recognition as a machine translation problem

BMC Bioinformatics ◽

10.1186/s12859-021-04141-4 ◽

2021 ◽

Vol 22 (S1) ◽

Author(s):

Mayla R. Boguslav ◽

Negacy D. Hailu ◽

Michael Bada ◽

William A. Baumgartner ◽

Lawrence E. Hunter

Keyword(s):

Machine Learning ◽

Machine Translation ◽

Language Processing ◽

State Of The Art ◽

Training Data ◽

Language Models ◽

Alternative Methods ◽

Automated Assignment ◽

Concept Recognition ◽

Alternative Approaches

Background. Automated assignment of specific ontology concepts to mentions in text is a critical task in biomedical natural language processing, and the subject of many open shared tasks. Although the current state of the art involves the use of neural network language models as a post-processing step, the very large number of ontology classes to be recognized and the limited amount of gold-standard training data have impeded the creation of end-to-end systems based entirely on machine learning. Recently, Hailu et al. recast the concept recognition problem as a type of machine translation and demonstrated that sequence-to-sequence machine learning models have the potential to outperform multi-class classification approaches. Methods. We systematically characterize the factors that contribute to the accuracy and efficiency of several approaches to sequence-to-sequence machine learning through extensive studies of alternative methods and hyperparameter selections. We not only identify the best-performing systems and parameters across a wide variety of ontologies but also provide insights into the widely varying resource requirements and hyperparameter robustness of alternative approaches. Analysis of the strengths and weaknesses of such systems suggests promising avenues for future improvements as well as design choices that can increase computational efficiency with small costs in performance. Results. Bidirectional encoder representations from transformers for biomedical text mining (BioBERT) for span detection along with the open-source toolkit for neural machine translation (OpenNMT) for concept normalization achieve state-of-the-art performance for most ontologies annotated in the CRAFT Corpus. This approach uses substantially fewer computational resources, including hardware, memory, and time, than several alternative approaches. Conclusions. Machine translation is a promising avenue for fully machine-learning-based concept recognition that achieves state-of-the-art results on the CRAFT Corpus, evaluated via a direct comparison to previous results from the 2019 CRAFT shared task. Experiments illuminating the reasons for the surprisingly good performance of sequence-to-sequence methods targeting ontology identifiers suggest that further progress may be possible by mapping to alternative target concept representations. All code and models can be found at: https://github.com/UCDenver-ccp/Concept-Recognition-as-Translation.

Comparative Analysis of Transformer based Language Models

10.5121/csit.2021.110111 ◽

2021 ◽

Author(s):

Aman Pathak

Keyword(s):

Comparative Analysis ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Language Models ◽

Learning Models ◽

Contributing Factors ◽

Language Abilities ◽

The Past ◽

Power Standard

Natural language processing (NLP) has witnessed many substantial advancements in the past three years. With the introduction of the Transformer and the self-attention mechanism, language models are now able to learn better representations of natural language. These attention-based models have achieved exceptional state-of-the-art results on various NLP benchmarks. One of the contributing factors is the growing use of transfer learning. Models are pre-trained on unsupervised objectives using rich datasets, developing fundamental natural language abilities that are then fine-tuned on supervised data for downstream tasks. Surprisingly, recent research has led to a novel era of powerful models that no longer require fine-tuning. The objective of this paper is to present a comparative analysis of some of the most influential language models. The benchmarks of the study are problem-solving methodologies, model architecture, compute power, standard NLP benchmark accuracies and shortcomings.
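
The self-attention mechanism mentioned in this abstract can be sketched as single-head scaled dot-product attention. The code below is an illustrative, dependency-free sketch and not code from the paper; the function names and the toy query, key, and value vectors are assumptions, and real Transformer implementations add learned projections, multiple heads, and masking.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    Weights are softmax(q . k / sqrt(d_k)) over all keys, where d_k is
    the key dimension. Inputs are lists of equal-length float lists.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: two identical keys receive equal weight, so the output
# is the plain average of the two value vectors.
out = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(out)
```

Each output row mixes information from every position in the sequence, which is what lets attention-based models relate distant tokens directly.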
