An Enhanced Malay Named Entity Recognition using Combination Approach for Crime Textual Data Analysis

Siti Azirah Asmai; Muhammad Sharilazlan; Halizah Basiron; Sabrina Ahmad

doi:10.14569/ijacsa.2018.090960

Named Entity Recognition and Relation Extraction

ACM Computing Surveys ◽

10.1145/3445965 ◽

2021 ◽

Vol 54 (1) ◽

pp. 1-39

Author(s):

Zara Nasar ◽

Syed Waqar Jaffry ◽

Muhammad Kamran Malik

Keyword(s):

Deep Learning ◽

State Of The Art ◽

Named Entity Recognition ◽

Relation Extraction ◽

The State ◽

Entity Recognition ◽

Joint Models ◽

Named Entity ◽

Textual Data ◽

Benchmark Datasets

With the advent of Web 2.0, many online platforms have emerged that produce textual data at massive scale. With ever-increasing textual data at hand, it is of immense importance to extract information nuggets from this data. One approach towards effectively harnessing this unstructured textual data is to transform it into structured text. Hence, this study presents an overview of approaches that can be applied to extract key insights from textual data in a structured way. To this end, Named Entity Recognition and Relation Extraction are the main focus of this review. The former deals with the identification of named entities, and the latter deals with the problem of extracting relations between sets of entities. This study covers early approaches as well as the developments made to date using machine learning models. The survey findings indicate that deep-learning-based hybrid and joint models currently define the state of the art. It is also observed that annotated benchmark datasets for various textual-data generators such as Twitter and other social forums are not available. This scarcity of datasets has resulted in relatively little progress in these domains. Additionally, the majority of state-of-the-art techniques are offline and computationally expensive. Lastly, with the increasing focus on deep-learning frameworks, there is a need to understand and explain the underlying processes in deep architectures.

Dutch Named Entity Recognition and De-Identification Methods for the Human Resource Domain

International Journal on Natural Language Computing ◽

10.5121/ijnlc.2020.9602 ◽

2020 ◽

Vol 9 (6) ◽

pp. 23-34

Author(s):

Chaïm van Toledo ◽

Friso van Dijk ◽

Marco Spruit

Keyword(s):

Performance Appraisal ◽

Human Resource ◽

Dutch Text ◽

Named Entity Recognition ◽

Entity Recognition ◽

Identification Methods ◽

Named Entity ◽

Job Titles ◽

Textual Data ◽

And Performance

The human resource (HR) domain contains various types of privacy-sensitive textual data, such as e-mail correspondence and performance appraisals. Doing research on these documents brings several challenges, one of them being anonymisation. In this paper, we evaluate the current Dutch text de-identification methods for the HR domain in four steps. First, we update one of these methods with the latest named entity recognition (NER) models. The result is that the NER model based on the CoNLL 2002 corpus in combination with the BERTje transformer gives the best results for suppressing persons (recall 0.94) and locations (recall 0.82). For suppressing gender, DEDUCE performs best (recall 0.53). Second, the NER evaluation is based on strict de-identification of entities (a person must be suppressed as a person) and, third, on a loose sense of de-identification (no matter how a person is suppressed, as long as it is suppressed). In the fourth and last step, a new kind of NER dataset is tested for recognising job titles in texts.
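
The difference between the strict and loose evaluations described in this abstract can be illustrated with a small recall computation. This is a minimal sketch, not the authors' evaluation code: the function name `deid_recall`, the dictionary-based data layout, and the toy entities are all assumptions made for illustration.

```python
from typing import Dict, Optional


def deid_recall(gold: Dict[str, str],
                predicted: Dict[str, Optional[str]],
                strict: bool) -> float:
    """Fraction of gold entities that were suppressed.

    strict=True  -> the predicted label must equal the gold label.
    strict=False -> any predicted label counts, as long as the
                    entity was suppressed at all.
    """
    if not gold:
        return 0.0
    hits = 0
    for span, gold_label in gold.items():
        pred_label = predicted.get(span)
        if pred_label is None:
            continue  # entity was not suppressed at all
        if not strict or pred_label == gold_label:
            hits += 1
    return hits / len(gold)


# Hypothetical toy data: gold annotations and system output.
gold = {
    "Jan de Vries": "PERSON",
    "Utrecht": "LOCATION",
    "Anna": "PERSON",
    "Delft": "LOCATION",
}
predicted = {
    "Jan de Vries": "PERSON",  # suppressed with the correct label
    "Utrecht": "PERSON",       # suppressed, but with the wrong label
    "Anna": None,              # missed
}

strict_recall = deid_recall(gold, predicted, strict=True)
loose_recall = deid_recall(gold, predicted, strict=False)
print(strict_recall, loose_recall)
```

In this toy case the location suppressed as a person counts only under the loose criterion, so loose recall is never lower than strict recall.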

Occupational profiling driven by online job advertisements: Taking the data analysis and processing engineering technicians as an example

PLoS ONE ◽

10.1371/journal.pone.0253308 ◽

2021 ◽

Vol 16 (6) ◽

pp. e0253308

Author(s):

Lina Cao ◽

Jian Zhang ◽

Xinquan Ge ◽

Jindong Chen

Keyword(s):

Data Analysis ◽

Named Entity Recognition ◽

Entity Recognition ◽

Survey Method ◽

Named Entities ◽

Named Entity ◽

Occupational Information ◽

Multiple Dimensions ◽

Job Advertisements ◽

Similarity Algorithm

The occupational profiling system driven by the traditional survey method has shortcomings such as lagging updates, high time consumption and laborious revision. It is necessary to refine and improve the traditional occupational portrait system through dynamic occupational information. In the context of big data, this paper showed the feasibility of vocational portraits driven by job advertisements, with data analysis and processing engineering technicians (DAPET) as an example. First, according to the description of the occupation in the Chinese Occupation Classification Grand Dictionary, a text similarity algorithm was used to preliminarily choose recruitment data with high similarity. Second, Convolutional Neural Networks for Sentence Classification (TextCNN) was used to further classify the preliminary corpus to obtain a precise occupational dataset. Third, specialties and skills were taken as named entities and automatically extracted using named entity recognition. Finally, by putting the extracted entities into the occupational dataset, the occupation characteristics of multiple dimensions were depicted to form a profile of the vocation.
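
The first step above, preselecting job advertisements by text similarity to an occupation description, can be sketched as follows. The abstract does not name the similarity algorithm, so this sketch assumes bag-of-words cosine similarity with whitespace tokenisation on English toy strings; the function names, the threshold value and the example texts are hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def preselect(description: str, ads: List[str], threshold: float) -> List[str]:
    """Keep only the ads whose similarity to the description reaches the threshold."""
    return [ad for ad in ads if cosine_similarity(description, ad) >= threshold]


# Hypothetical occupation description and job advertisements.
occupation = "data analysis and data processing with statistical tools"
ads = [
    "seeking engineer for data analysis and data processing",
    "restaurant chef wanted for evening shifts",
]

selected = preselect(occupation, ads, threshold=0.3)
print(selected)
```

A real pipeline for Chinese text would need a word segmenter in place of `split()`, and the preselected ads would then go to the classifier for the second, finer filtering step.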

Evaluating Dutch Named Entity Recognition and De-Identification Methods in the Human Resource Domain

10.5121/csit.2020.101520 ◽

2020 ◽

Author(s):

Chaïm van Toledo ◽

Friso van Dijk ◽

Marco Spruit

Keyword(s):

Performance Appraisal ◽

Human Resource ◽

Dutch Text ◽

Named Entity Recognition ◽

Entity Recognition ◽

Identification Methods ◽

Named Entity ◽

Textual Data ◽

And Performance ◽

E Mail

The human resource (HR) domain contains various types of privacy-sensitive textual data, such as e-mail correspondence and performance appraisals. Doing research on these documents brings several challenges, one of them being anonymisation. In this paper, we evaluate the current Dutch text de-identification methods for the HR domain in three steps. First, we update one of these methods with the latest named entity recognition (NER) models. The result is that the NER model based on the CoNLL 2002 corpus in combination with the BERTje transformer gives the best results for suppressing persons (recall 0.94) and locations (recall 0.82). For suppressing gender, DEDUCE performs best (recall 0.53). Second, the NER evaluation is based on strict de-identification of entities (a person must be suppressed as a person) and, third, on a loose sense of de-identification (no matter how a person is suppressed, as long as it is suppressed).

ABioNER: A BERT-Based Model for Arabic Biomedical Named-Entity Recognition

Complexity ◽

10.1155/2021/6633213 ◽

2021 ◽

Vol 2021 ◽

pp. 1-6

Author(s):

Nada Boudjellal ◽

Huaping Zhang ◽

Asif Khan ◽

Arshad Ahmad ◽

Rashid Naseem ◽

...

Keyword(s):

Named Entity Recognition ◽

Model Performance ◽

Recognition Task ◽

Entity Recognition ◽

Small Scale ◽

Text Data ◽

Named Entities ◽

Named Entity ◽

Textual Data ◽

Biomedical Named Entity Recognition

The web is being loaded daily with a huge volume of data, mainly unstructured textual data, which significantly increases the need for information extraction and NLP systems. The named-entity recognition task is a key step towards efficiently understanding text data and saving time and effort. Being a widely used language globally, English is taking over most of the research conducted in this field, especially in the biomedical domain. Unlike other languages, Arabic suffers from a lack of resources. This work presents a BERT-based model to identify biomedical named entities in Arabic text data (specifically disease and treatment named entities) and investigates the effectiveness of pretraining a monolingual BERT model with a small-scale biomedical dataset on enhancing the model's understanding of Arabic biomedical text. The model performance was compared with two state-of-the-art models (namely, AraBERT and multilingual BERT cased), and it outperformed both models with an 85% F1-score.

Developing and Deploying Algorithms for Information Extraction using Classification Measures for Named Entity Recognition

International Journal of Computer Sciences and Engineering ◽

10.26438/ijcse/v6i10.235248 ◽

2018 ◽

Vol 6 (10) ◽

pp. 235-248

Author(s):

Rehan Khan ◽

A.J. Singh

Keyword(s):

Information Extraction ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity

Arabic named entity recognition using optimized feature sets

10.3115/1613715.1613755 ◽

2008 ◽

Cited By ~ 38

Author(s):

Yassine Benajiba ◽

Mona Diab ◽

Paolo Rosso

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Feature Sets ◽

Named Entity

Open-Source Tools for Morphology, Lemmatization, POS Tagging and Named Entity Recognition

10.3115/v1/p14-5003 ◽

2014 ◽

Cited By ~ 25

Author(s):

Jana Straková ◽

Milan Straka ◽

Jan Hajič

Keyword(s):

Open Source ◽

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity ◽

Pos Tagging

Semantische Suche nach wissenschaftlichen Videos – Automatische Verschlagwortung durch Named Entity Recognition [Semantic search for scientific videos: automatic subject indexing using named entity recognition]

Zeitschrift für Bibliothekswesen und Bibliographie ◽

10.3196/18642950146145154 ◽

2014 ◽

Vol 61 (4-5) ◽

pp. 254-258 ◽

Cited By ~ 2

Author(s):

Margret Plank ◽

Sven Strobel

Keyword(s):

Named Entity Recognition ◽

Entity Recognition ◽

Named Entity

Developing a RadLex-based Named Entity Recognition Tool for Mining Textual Radiology Reports (Preprint)

10.2196/preprints.25378 ◽

2020 ◽

Author(s):

Shintaro Tsuji ◽

Andrew Wen ◽

Naoki Takahashi ◽

Hongjian Zhang ◽

Katsuhiko Ogasawara ◽

...

Keyword(s):

Named Entity Recognition ◽

Noun Phrases ◽

General Purpose ◽

Entity Recognition ◽

Free Text ◽

Clinical Text ◽

Named Entity ◽

Radiology Reports ◽

Two Measures ◽

F Measure

BACKGROUND Named entity recognition (NER) plays an important role in extracting the features of descriptions for mining free-text radiology reports. However, the performance of existing NER tools is limited because the number of entities they recognise depends on dictionary lookup. In particular, the recognition of compound terms is complicated because they follow a variety of patterns.
OBJECTIVE The objective of the study is to develop and evaluate an NER tool concerned with compound terms using RadLex for mining free-text radiology reports.
METHODS We leveraged the clinical Text Analysis and Knowledge Extraction System (cTAKES) to develop customized pipelines using both RadLex and SentiWordNet (a general-purpose dictionary, GPD). We manually annotated 400 radiology reports for compound terms (Cts) in noun phrases and used them as the gold standard for the performance evaluation (precision, recall, and F-measure). Additionally, we created a compound-term-enhanced dictionary (CtED) by analyzing false negatives (FNs) and false positives (FPs), and applied it to another 100 radiology reports for validation. We also evaluated the stem terms of compound terms by defining two measures: an occurrence ratio (OR) and a matching ratio (MR).
RESULTS The F-measure of cTAKES+RadLex+GPD was 32.2% (Precision 92.1%, Recall 19.6%) and that of the pipeline combined with the CtED was 67.1% (Precision 98.1%, Recall 51.0%). The OR indicated that the stem terms "effusion", "node", "tube", and "disease" were used frequently, but the pipeline still fell short of capturing Cts. The MR showed that 71.9% of stem terms matched those of the ontologies, and RadLex improved the MR by about 22% over the cTAKES default dictionary. The OR and MR revealed that the characteristics of stem terms have the potential to help generate synonymous phrases using ontologies.
CONCLUSIONS We developed a RadLex-based customized pipeline for parsing radiology reports and demonstrated that CtED and stem term analysis have the potential to improve dictionary-based NER performance toward expanding vocabularies.
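
The relationship between the precision, recall and F-measure figures reported in this abstract can be checked with a short computation. This sketch assumes the balanced F-measure (the harmonic mean of precision and recall, often called F1), which the abstract does not state explicitly. Because the inputs are the rounded percentages from the abstract, the recomputed values may differ from the published ones in the last digit.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values taken from the abstract.
baseline = f_measure(0.921, 0.196)   # cTAKES + RadLex + GPD
with_cted = f_measure(0.981, 0.510)  # pipeline combined with the CtED

print(f"baseline:  {baseline:.3f}")
print(f"with CtED: {with_cted:.3f}")
```

The harmonic mean is dominated by the smaller of the two values, which is why a pipeline with very high precision but low recall still ends up with a low F-measure.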
