MenuNER: Domain-Adapted BERT Based NER Approach for a Domain with Limited Dataset and Its Application to Food Menu Domain

Muzamil Hussain Syed; Sun-Tae Chung

doi:10.3390/app11136007

Applied Sciences ◽

10.3390/app11136007 ◽

2021 ◽

Vol 11 (13) ◽

pp. 6007

Author(s):

Muzamil Hussain Syed ◽

Sun-Tae Chung

Keyword(s):

Domain Adaptation ◽

Language Model ◽

Named Entity Recognition ◽

Word Embedding ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Feature Vectors ◽

Named Entity ◽

Domain Specific

Entity-based information extraction is one of the main applications of Natural Language Processing (NLP). Recently, deep transfer learning that uses contextualized word embeddings from pre-trained language models has shown remarkable results on many NLP tasks, including named entity recognition (NER). Among contextualized word embedding models, BERT (Bidirectional Encoder Representations from Transformers) has attracted particular attention as a state-of-the-art pre-trained language model. Training a BERT model from scratch for a new application domain is expensive, since it requires a huge dataset and enormous computing time. In this paper, we focus on menu entity extraction from online user reviews of restaurants and propose a simple but effective approach to NER in a new domain where a large dataset is rarely available or difficult to prepare, such as the food menu domain. The approach is based on domain adaptation of the word embedding and on fine-tuning the popular NER network model ‘Bi-LSTM+CRF’ with extended feature vectors. The proposed NER approach (named ‘MenuNER’) consists of two steps: (1) domain adaptation for the target domain, that is, further pre-training of the off-the-shelf BERT language model (BERT-base) in a semi-supervised fashion on a domain-specific dataset; and (2) supervised fine-tuning of the popular Bi-LSTM+CRF network for the downstream task with extended feature vectors obtained by concatenating the word embedding from the domain-adapted pre-trained BERT model of the first step, a character embedding, and POS tag feature information. Experimental results on a handcrafted food menu corpus built from a customer review dataset show that the proposed approach to the domain-specific NER task, food menu named entity recognition, performs significantly better than one based on the baseline off-the-shelf BERT-base model. The proposed approach achieves a 92.5% F1 score on the YELP dataset for the MenuNER task.
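The "extended feature vector" described in step (2) can be pictured as a plain concatenation of three per-token features. The following is a minimal sketch, not the authors' implementation: the tag set `POS_TAGS`, the helper names `pos_one_hot` and `extended_feature`, the one-hot POS encoding, and the tiny vector sizes are all assumptions made for illustration (a real BERT-base word vector has 768 dimensions).

```python
from typing import List

# Hypothetical, deliberately tiny POS tag set (an assumption, not from the paper).
POS_TAGS = ["NOUN", "VERB", "ADJ", "OTHER"]


def pos_one_hot(tag: str) -> List[float]:
    """One-hot encode a POS tag; unknown tags fall back to OTHER."""
    index = POS_TAGS.index(tag) if tag in POS_TAGS else POS_TAGS.index("OTHER")
    return [1.0 if i == index else 0.0 for i in range(len(POS_TAGS))]


def extended_feature(word_vec: List[float], char_vec: List[float], pos_tag: str) -> List[float]:
    """Concatenate word embedding, character embedding and POS feature."""
    return list(word_vec) + list(char_vec) + pos_one_hot(pos_tag)


# Stand-in values: in practice word_vec would come from the domain-adapted
# BERT model and char_vec from a character-level encoder.
word_vec = [0.1, 0.2, 0.3]
char_vec = [0.5, 0.6]
feature = extended_feature(word_vec, char_vec, "NOUN")
print(len(feature))  # 3 + 2 + 4 = 9
```

One such vector per token would then form the input sequence to the Bi-LSTM+CRF tagger.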

Low-Resource Named Entity Recognition via the Pre-Training Model

Symmetry ◽

10.3390/sym13050786 ◽

2021 ◽

Vol 13 (5) ◽

pp. 786

Author(s):

Siqi Chen ◽

Yijie Pei ◽

Zunwang Ke ◽

Wushour Silamu

Keyword(s):

Data Augmentation ◽

Language Model ◽

Named Entity Recognition ◽

Name Entity Recognition ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Low Resource ◽

Named Entity ◽

High Resource

Named entity recognition (NER) is an important task in natural language processing that requires determining entity boundaries and classifying the entities into pre-defined categories. Most state-of-the-art systems require tens of thousands of annotated sentences to obtain high performance, which is a problem for low-resource languages. In particular, minimal annotated data is available for Uyghur and Hungarian (UH languages) NER tasks. Each task also has its own specificities: differences in words and word order across languages make it a challenging problem. In this paper, we present an effective solution for providing a meaningful and easy-to-use feature extractor for NER tasks: fine-tuning the pre-trained language model. We propose a fine-tuning method for a low-resource language model that constructs a fine-tuning dataset through data augmentation, then adds the dataset of a high-resource language, and finally fine-tunes the cross-language pre-trained model on the combined dataset. In addition, we propose an attention-based fine-tuning strategy that uses symmetry to better select relevant semantic and syntactic information from pre-trained language models and applies these symmetry features to named entity recognition tasks. We evaluated our approach on Uyghur and Hungarian datasets, where it performed well compared to several strong baselines. We close with an overview of the available resources for named entity recognition and some of the open research questions.
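The two sub-problems named at the start of this abstract, finding entity boundaries and assigning a category, are commonly encoded with BIO tags. The sketch below shows how boundaries and types can be read back from such a tag sequence. The BIO scheme, the function name `bio_to_spans`, and the lenient handling of a stray `I-` tag are assumptions for illustration; the abstract does not state which tagging scheme the authors use.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, entity_type) spans."""
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    label: Optional[str] = None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            # Lenient choice: a stray I- tag opens a new span.
            start, label = i, tag[2:]
    return spans


tokens = ["Anna", "Kovacs", "visited", "Budapest"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
for begin, end, kind in bio_to_spans(tags):
    print(kind, tokens[begin:end])
```

Each returned span gives the boundary (start and end token index) and the category of one entity.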

BioALBERT: A Simple and Effective Pre-trained Language Model for Biomedical Named Entity Recognition

10.21203/rs.3.rs-90025/v1 ◽

2020 ◽

Author(s):

Usman Naseem ◽

Matloob Khushi ◽

Vinay Reddy ◽

Sakthivel Rajendran ◽

Imran Razzak ◽

...

Keyword(s):

State Of The Art ◽

Language Model ◽

Named Entity Recognition ◽

Training Data ◽

Entity Recognition ◽

Future Research ◽

Named Entity ◽

Domain Specific ◽

Context Dependent ◽

Biomedical Named Entity Recognition

Background: In recent years, the growing number of biomedical documents, coupled with advances in natural language processing algorithms, has led to an exponential increase in research on biomedical named entity recognition (BioNER). However, BioNER research is challenging because NER in the biomedical domain (i) is often restricted by the limited amount of training data, (ii) must handle entities that can refer to multiple types and concepts depending on context, and (iii) relies heavily on acronyms that are sub-domain specific. Existing BioNER approaches often neglect these issues and directly adopt state-of-the-art (SOTA) models trained on general corpora, which often yields unsatisfactory results. Results: We propose biomedical ALBERT (A Lite Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), BioALBERT, an effective domain-specific pre-trained language model trained on a huge biomedical corpus and designed to capture biomedical context-dependent NER. We adopted the self-supervised loss function used in ALBERT, which targets modelling inter-sentence coherence, to better learn context-dependent representations, and incorporated parameter reduction strategies to minimise memory usage and improve training time in BioNER. In our experiments, BioALBERT outperformed comparative SOTA BioNER models on eight biomedical NER benchmark datasets with four different entity types. Performance increased for (i) disease type corpora by 7.47% (NCBI-disease) and 10.63% (BC5CDR-disease); (ii) drug-chem type corpora by 4.61% (BC5CDR-Chem) and 3.89% (BC4CHEMD); (iii) gene-protein type corpora by 12.25% (BC2GM) and 6.42% (JNLPBA); and (iv) species type corpora by 6.19% (LINNAEUS) and 23.71% (Species-800), leading to state-of-the-art results. Conclusions: The performance of the proposed model on four different biomedical entity types shows that our model is robust and generalizable in recognizing biomedical entities in text. We trained four different variants of BioALBERT models, which are available to the research community for use in future research.

Few-Sample Named Entity Recognition for Security Vulnerability Reports by Fine-Tuning Pre-trained Language Models

10.1007/978-3-030-87839-9_3 ◽

2021 ◽

pp. 55-78

Author(s):

Guanqun Yang ◽

Shay Dineen ◽

Zhipeng Lin ◽

Xueqing Liu

Keyword(s):

Named Entity Recognition ◽

Fine Tuning ◽

Entity Recognition ◽

Language Models ◽

Security Vulnerability ◽

Named Entity

Medical Named Entity Recognition from Un-labelled Medical Records based on Pre-trained Language Models and Domain Dictionary

Data Intelligence ◽

10.1162/dint_a_00105 ◽

2021 ◽

pp. 1-13

Author(s):

Chaojie Wen ◽

Tao Chen ◽

Xudong Jia ◽

Jiang Zhu

Keyword(s):

Medical Records ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Medical Texts ◽

Named Entities ◽

Named Entity ◽

Domain Specific ◽

Medical Entity ◽

Medical Documents

Medical named entity recognition (NER) is the task of recognizing medical named entities, such as diseases, drugs, surgery reports, anatomical parts, and examination documents, in medical texts. Conventional medical NER methods do not make full use of the un-labelled medical texts embedded in medical documents. To address this issue, we propose a medical NER approach based on pre-trained language models and a domain dictionary. First, we construct a medical entity dictionary by extracting medical entities from labelled medical texts and collecting medical entities from other resources, such as the Yidu-N4K dataset. Second, we employ this dictionary to train domain-specific pre-trained language models using un-labelled medical texts. Third, we employ a pseudo labelling mechanism on un-labelled medical texts to automatically annotate them and create pseudo labels. Fourth, the BiLSTM-CRF sequence tagging model is used to fine-tune the pre-trained language models. Our experiments on un-labelled medical texts extracted from Chinese electronic medical records show that the proposed NER approach achieves strict and relaxed F1 scores of 88.7% and 95.3%, respectively.
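One simple way to picture dictionary-driven pseudo labelling is forward maximum matching over characters, which suits unsegmented Chinese text. This is only a sketch under stated assumptions: the abstract does not say how its pseudo labelling mechanism works, and the function name `pseudo_label`, the BIO tag format, and the two dictionary entries are hypothetical.

```python
from typing import Dict, List


def pseudo_label(text: str, dictionary: Dict[str, str]) -> List[str]:
    """Tag each character with BIO labels using longest dictionary match."""
    tags = ["O"] * len(text)
    max_len = max((len(term) for term in dictionary), default=0)
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so longer terms win over prefixes.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            i += 1
            continue
        label = dictionary[match]
        tags[i] = "B-" + label
        for j in range(i + 1, i + len(match)):
            tags[j] = "I-" + label
        i += len(match)
    return tags


# Hypothetical two-entry dictionary: a disease name and a drug name.
entity_dictionary = {"糖尿病": "DISEASE", "胰岛素": "DRUG"}
sentence = "糖尿病用胰岛素"
for char, tag in zip(sentence, pseudo_label(sentence, entity_dictionary)):
    print(char, tag)
```

Labels produced this way are noisy, since exact matching misses unseen entities and ignores context, which is one reason a trained tagger is still needed on top of them.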

Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-202286 ◽

2021 ◽

pp. 1-12

Author(s):

Yingwen Fu ◽

Nankai Lin ◽

Xiaotian Lin ◽

Shengyi Jiang

Keyword(s):

Language Processing ◽

State Of The Art ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Neural Models ◽

Performance Models ◽

Named Entity ◽

High Resource ◽

Benchmark Datasets

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English. For Indonesian, related resources (both datasets and technology) are not yet well developed. Moreover, affixes are an important part of word formation in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks. However, the features extracted by current top-performing models are insufficient. In this paper, we build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.

Study of Pre-trained Language Models for Named Entity Recognition in Clinical Trial Eligibility Criteria from Multiple Corpora

10.1109/ichi52183.2021.00095 ◽

2021 ◽

Author(s):

Jianfu Li ◽

Qiang Wei ◽

Omid Ghiasvand ◽

Miao Chen ◽

Victor Lobanov ◽

...

Keyword(s):

Clinical Trial ◽

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Eligibility Criteria ◽

Named Entity

Joint Pre-trained Chinese Named Entity Recognition based on Bi-directional Language Model

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001421530037 ◽

2021 ◽

Author(s):

Ma Changxia ◽

Zhang Chen

Keyword(s):

Language Model ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity

A Study on the Impact of Intradomain Finetuning of Deep Language Models for Legal Named Entity Recognition in Portuguese

Intelligent Systems - Lecture Notes in Computer Science ◽

10.1007/978-3-030-61377-8_46 ◽

2020 ◽

pp. 648-662

Author(s):

Luiz Henrique Bonifacio ◽

Paulo Arantes Vilela ◽

Gustavo Rocha Lobato ◽

Eraldo Rezende Fernandes

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Language Models ◽

Named Entity ◽

The Impact

Overview of CCKS 2020 Task 3: Named Entity Recognition and Event Extraction in Chinese Electronic Medical Records

Data Intelligence ◽

10.1162/dint_a_00093 ◽

2021 ◽

pp. 1-13

Author(s):

Xia Li ◽

Qinghua Wen ◽

Zengtao Jiao ◽

Jiangtao Zhang

Keyword(s):

Electronic Medical Records ◽

Medical Records ◽

Named Entity Recognition ◽

Event Extraction ◽

Entity Recognition ◽

Language Models ◽

Data Sets ◽

External Resources ◽

Named Entity ◽

Evaluation Task

The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3 presented clinical named entity recognition and event extraction for Chinese electronic medical records. Two annotated data sets and some additional resources for these two subtasks were provided to participants. The evaluation competition attracted 354 teams, 46 of which successfully submitted valid results. Pre-trained language models were widely applied in this evaluation task. Data augmentation and external resources were also helpful.

Crowdsourcing Learning as Domain Adaptation: A Case Study on Named Entity Recognition

10.18653/v1/2021.acl-long.432 ◽

2021 ◽

Author(s):

Xin Zhang ◽

Guangwei Xu ◽

Yueheng Sun ◽

Meishan Zhang ◽

Pengjun Xie

Keyword(s):

Domain Adaptation ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity
