scholarly journals Enhancing the Performance of Telugu Named Entity Recognition Using Gazetteer Features

Information ◽  
2020 ◽  
Vol 11 (2) ◽  
pp. 82
Author(s):  
SaiKiranmai Gorla ◽  
Lalita Bhanu Murthy Neti ◽  
Aruna Malapati

Named entity recognition (NER) is a fundamental step for many natural language processing tasks and hence enhancing the performance of NER models is always appreciated. With limited resources being available, NER for South-East Asian languages like Telugu is quite a challenging problem. This paper attempts to improve the NER performance for Telugu using gazetteer-related features, which are automatically generated using Wikipedia pages. We make use of these gazetteer features along with other well-known features like contextual, word-level, and corpus features to build NER models. NER models are developed using three well-known classifiers—conditional random field (CRF), support vector machine (SVM), and margin infused relaxed algorithms (MIRA). The gazetteer features are shown to improve the performance, and theMIRA-based NER model fared better than its counterparts SVM and CRF.

2021 ◽  
Vol 11 (19) ◽  
pp. 9038
Author(s):  
Wazir Ali ◽  
Jay Kumar ◽  
Zenglin Xu ◽  
Rajesh Kumar ◽  
Yazhou Ren

Named entity recognition (NER) is a fundamental task in many natural language processing (NLP) applications, such as text summarization and semantic information retrieval. Recently, deep neural networks (NNs) with the attention mechanism yield excellent performance in NER by taking advantage of character-level and word-level representation learning. In this paper, we propose a deep context-aware bidirectional long short-term memory (CaBiLSTM) model for the Sindhi NER task. The model relies upon contextual representation learning (CRL), bidirectional encoder, self-attention, and sequential conditional random field (CRF). The CaBiLSTM model incorporates task-oriented CRL based on joint character-level and word-level representations. It takes character-level input to learn the character representations. Afterwards, the character representations are transformed into word features, and the bidirectional encoder learns the word representations. The output of the final encoder is fed into the self-attention through a hidden layer before decoding. Finally, we employ the CRF for the prediction of label sequences. The baselines and the proposed CaBiLSTM model are compared by exploiting pretrained Sindhi GloVe (SdGloVe), Sindhi fastText (SdfastText), task-oriented, and CRL-based word representations on the recently proposed SiNER dataset. Our proposed CaBiLSTM model achieved a high F1-score of 91.25% on the SiNER dataset with CRL without relying on additional handmade features, such as hand-crafted rules, gazetteers, or dictionaries.


2010 ◽  
Vol 1 ◽  
pp. 26-58 ◽  
Author(s):  
Asif Ekbal ◽  
Sivaji Bandyopadhyay

This paper reports about a multi-engine approach for the development of a Named Entity Recognition (NER) system in Bengali by combining the classifiers such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM) with the help of weighted voting techniques. The training set consists of approximately 272K wordforms, out of which 150K wordforms have been manually annotated with the four major named entity (NE) tags, namely Person name, Location name, Organization name and Miscellaneous name. An appropriate tag conversion routine has been defined in order to convert the 122K wordforms of the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL)1 data into the desired forms. The individual classifiers make use of the different contextual information of the words along with the variety of features that are helpful to predict the various NE classes. Lexical context patterns, generated from an unlabeled corpus of 3 million wordforms in a semi-automatic way, have been used as the features of the classifiers in order to improve their performance. In addition, we propose a number of techniques to post-process the output of each classifier in order to reduce the errors and to improve the performance further. Finally, we use three weighted voting techniques to combine the individual models. Experimental results show the effectiveness of the proposed multi-engine approach with the overall Recall, Precision and F-Score values of 93.98%, 90.63% and 92.28%, respectively, which shows an improvement of 14.92% in F-Score over the best performing baseline SVM based system and an improvement of 18.36% in F-Score over the least performing baseline ME based system. Comparative evaluation results also show that the proposed system outperforms the three other existing Bengali NER systems.


2021 ◽  
Vol 75 (3) ◽  
pp. 94-99
Author(s):  
A.M. Yelenov ◽  
◽  
A.B. Jaxylykova ◽  

This research focuses on a comparative study of the Named Entity Recognition task for scientific article texts. Natural language processing could be considered as one of the cornerstones in the machine learning area which devotes its attention to the problems connected with the understanding of different natural languages and linguistic analysis. It was already shown that current deep learning techniques have a good performance and accuracy in such areas as image recognition, pattern recognition, computer vision, that could mean that such technology probably would be successful in the neuro-linguistic programming area too and lead to a dramatic increase on the research interest on this topic. For a very long time, quite trivial algorithms have been used in this area, such as support vector machines or various types of regression, basic encoding on text data was also used, which did not provide high results. The following dataset was used to process the experiment models: Dataset Scientific Entity Relation Core. The algorithms used were Long short-term memory, Random Forest Classifier with Conditional Random Fields, and Named-entity recognition with Bidirectional Encoder Representations from Transformers. In the findings, the metrics scores of all models were compared to each other to make a comparison. This research is devoted to the processing of scientific articles, concerning the machine learning area, because the subject is not investigated on enough properly level.The consideration of this task can help machines to understand natural languages better, so that they can solve other neuro-linguistic programming tasks better, enhancing scores in common sense.


Information ◽  
2020 ◽  
Vol 11 (1) ◽  
pp. 45 ◽  
Author(s):  
Shardrom Johnson ◽  
Sherlock Shen ◽  
Yuanchen Liu

Usually taken as linguistic features by Part-Of-Speech (POS) tagging, Named Entity Recognition (NER) is a major task in Natural Language Processing (NLP). In this paper, we put forward a new comprehensive-embedding, considering three aspects, namely character-embedding, word-embedding, and pos-embedding stitched in the order we give, and thus get their dependencies, based on which we propose a new Character–Word–Position Combined BiLSTM-Attention (CWPC_BiAtt) for the Chinese NER task. Comprehensive-embedding via the Bidirectional Llong Short-Term Memory (BiLSTM) layer can get the connection between the historical and future information, and then employ the attention mechanism to capture the connection between the content of the sentence at the current position and that at any location. Finally, we utilize Conditional Random Field (CRF) to decode the entire tagging sequence. Experiments show that CWPC_BiAtt model we proposed is well qualified for the NER task on Microsoft Research Asia (MSRA) dataset and Weibo NER corpus. A high precision and recall were obtained, which verified the stability of the model. Position-embedding in comprehensive-embedding can compensate for attention-mechanism to provide position information for the disordered sequence, which shows that comprehensive-embedding has completeness. Looking at the entire model, our proposed CWPC_BiAtt has three distinct characteristics: completeness, simplicity, and stability. Our proposed CWPC_BiAtt model achieved the highest F-score, achieving the state-of-the-art performance in the MSRA dataset and Weibo NER corpus.


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Han Huang ◽  
Hongyu Wang ◽  
Dawei Jin

Named entity recognition (NER) is an indispensable and very important part of many natural language processing technologies, such as information extraction, information retrieval, and intelligent Q & A. This paper describes the development of the AL-CRF model, which is a NER approach based on active learning (AL). The algorithmic sequence of the processes performed by the AL-CRF model is the following: first, the samples are clustered using the k-means approach. Then, stratified sampling is performed on the produced clusters in order to obtain initial samples, which are used to train the basic conditional random field (CRF) classifier. The next step includes the initiation of the selection process which uses the criterion of entropy. More specifically, samples having the highest entropy values are added to the training set. Afterwards, the learning process is repeated, and the CRF classifier is retrained based on the obtained training set. The learning and the selection process of the AL is running iteratively until the harmonic mean F stabilizes and the final NER model is obtained. Several NER experiments are performed on legislative and medical cases in order to validate the AL-CRF performance. The testing data include Chinese judicial documents and Chinese electronic medical records (EMRs). Testing indicates that our proposed algorithm has better recognition accuracy and recall rate compared to the conventional CRF model. Moreover, the main advantage of our approach is that it requires fewer manually labelled training samples, and at the same time, it is more effective. This can result in a more cost effective and more reliable process.


2020 ◽  
Vol 8 ◽  
pp. 605-620 ◽  
Author(s):  
Takashi Shibuya ◽  
Eduard Hovy

When an entity name contains other names within it, the identification of all combinations of names can become difficult and expensive. We propose a new method to recognize not only outermost named entities but also inner nested ones. We design an objective function for training a neural model that treats the tag sequence for nested entities as the second best path within the span of their parent entity. In addition, we provide the decoding method for inference that extracts entities iteratively from outermost ones to inner ones in an outside-to-inside way. Our method has no additional hyperparameters to the conditional random field based model widely used for flat named entity recognition tasks. Experiments demonstrate that our method performs better than or at least as well as existing methods capable of handling nested entities, achieving F1-scores of 85.82%, 84.34%, and 77.36% on ACE-2004, ACE-2005, and GENIA datasets, respectively.


2014 ◽  
Vol 37 (1) ◽  
pp. 1-22 ◽  
Author(s):  
Asif Ekbal ◽  
Sivaji Bandyopadhyay

This paper reports a voted Named Entity Recognition (NER) system that exploits appropriate unlabeled data. Initially, we develop NER systems using the supervised machine learning algorithms such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM). Each of these models makes use of the language independent features in the form of different contextual and orthographic word-level features along with the language dependent features extracted from the Part-of-Speech (POS) tagger and gazetteers. Context patterns generated from the unlabeled data using an active learning method are also used as the features in each of the classifiers. A semi-supervised method is proposed to describe the measures to automatically select effective unlabeled documents as well as sentences from the unlabeled data. Finally, the supervised models are combined together into a final system by defining appropriate weighted voting technique. Experimental results for a resource-poor language like Bengali show the effectiveness of the proposed approach with the overall recall, precision and F-measure values of 93.81%, 92.18% and 92.98%, respectively.


Named Entity Recognition (NER) is a significant errand in Natural Language Processing (NLP) applications like Information Extraction, Question Answering and so on. In this paper, factual way to deal with perceive Kannada named substances like individual name, area name, association name, number, estimation and time is proposed. We have achieved higher accuracy in CRF approach than the in HMM approach. The accuracy of classification is more accurate in CRF approach due to flexibility of adding more features unlike joint probability alone as in HMM. In HMM it is not practical to represent multiple overlapping features and long term dependencies. CRF ++ Tool Kit is used for experimentation. The consequences of acknowledgment are empowering and the approach has the exactness around 86%.


Author(s):  
Shohei Higashiyama ◽  
Blondel Mathieu ◽  
Kazuhiro Seki ◽  
Kuniaki Uehara

Named Entity Recognition (NER) is a fundamental natural language processing task for the identifi cation and classifi cation of expressions into predefi ned categories, such as person and organization. Existing NER systems usually target about 10 categories and do not incorporate analysis of category relations. However, categories often belong naturally to some predefi ned hierarchy. In such cases, the distance between categories in the hierarchy becomes a rich source of information that can be exploited. This is intuitively useful particularly when the categories are numerous. On that account, this paper proposes an NER approach that can leverage category hierarchy information by introducing, in the structured perceptron framework, a cost function more strongly penalizing category predictions that are more distant from the correct category in the hierarchy. Experimental results on the GENIA biomedical text corpus indicate the effectiveness of the proposed approach as compared with the case where no cost function is utilized. In addition, the proposed approach demonstrates the superior performance over a representative work using multi-class support vector machines on the same corpus. A possible direction to further improve the proposed approach is to investigate more elaborate cost functions than a simple additive cost adopted in this work.  


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Xuming Han ◽  
Feng Zhou ◽  
Zhiyuan Hao ◽  
Qiaoming Liu ◽  
Yong Li ◽  
...  

Named entity recognition (NER) is a subtask in natural language processing, and its accuracy greatly affects the effectiveness of downstream tasks. Aiming at the problem of insufficient expression of potential Chinese features in named entity recognition tasks, this paper proposes a multifeature adaptive fusion Chinese named entity recognition (MAF-CNER) model. The model uses bidirectional long short-term memory (BiLSTM) neural network to extract stroke and radical features and adopts a weighted concatenation method to fuse two sets of features adaptively. This method can better integrate the two sets of features, thereby improving the model entity recognition ability. In order to fully test the entity recognition performance of this model, we compared the basic model and other mainstream models on Microsoft Research Asia (MSRA) and “China People’s Daily” dataset from January to June 1998. Experimental results show that this model is better than other models, with F1 values of 97.01% and 96.78%, respectively.


Sign in / Sign up

Export Citation Format

Share Document