Chemical Entity Recognition and Resolution to ChEBI

2012 ◽  
Vol 2012 ◽  
pp. 1-9 ◽  
Author(s):  
Tiago Grego ◽  
Catia Pesquita ◽  
Hugo P. Bastos ◽  
Francisco M. Couto

Chemical entities are ubiquitous throughout the biomedical literature, and the development of text-mining systems that can efficiently identify those entities is required. Due to the lack of available corpora and data resources, the community has focused its efforts on the development of gene and protein named entity recognition systems, but with the release of ChEBI and the availability of an annotated corpus, this task can now be addressed. We developed a machine-learning-based method for chemical entity recognition and a lexical-similarity-based method for chemical entity resolution and compared them with Whatizit, a popular dictionary-based method. Our methods outperformed the dictionary-based method in all tasks, yielding an improvement in F-measure of 20% for the entity recognition task, 2–5% for the entity resolution task, and 15% for the combined entity recognition and resolution task.
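The abstract does not say which lexical similarity measure is used for resolution. As a minimal sketch only, assuming character-trigram Jaccard similarity and a tiny hypothetical lexicon with placeholder identifiers (the names `char_ngrams`, `jaccard`, `resolve`, the `EX:` ids, and the 0.5 threshold are all illustrative assumptions, not the authors' method or real ChEBI data), resolving a recognized mention to its closest dictionary entry could look like this:

```python
def char_ngrams(text, n=3):
    """Return the set of lowercase character n-grams of a string."""
    s = text.lower()
    if len(s) < n:
        return {s}
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def resolve(mention, lexicon, threshold=0.5):
    """Map a mention to the identifier of the most similar lexicon name.

    Returns (identifier, score); identifier is None when no name
    reaches the (assumed) similarity threshold.
    """
    best_id, best_score = None, 0.0
    mention_grams = char_ngrams(mention)
    for name, identifier in lexicon.items():
        score = jaccard(mention_grams, char_ngrams(name))
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Hypothetical lexicon: placeholder ids, not real ChEBI identifiers.
LEXICON = {"ethanol": "EX:1", "methanol": "EX:2", "glucose": "EX:3"}

print(resolve("Ethanols", LEXICON))  # closest entry is "ethanol"
print(resolve("benzene", LEXICON))   # nothing similar enough
```

A threshold of this kind trades recall for precision: lowering it resolves more mentions but admits more wrong identifiers.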

2021 ◽  
Vol 75 (3) ◽  
pp. 94-99
Author(s):  
A.M. Yelenov ◽  
A.B. Jaxylykova ◽  

This research presents a comparative study of the Named Entity Recognition task for scientific article texts. Natural language processing can be considered one of the cornerstones of the machine learning area; it is devoted to problems connected with the understanding of natural languages and linguistic analysis. Current deep learning techniques have already shown good performance and accuracy in areas such as image recognition, pattern recognition, and computer vision, which suggests that such technology would probably be successful in natural language processing too and has led to a dramatic increase in research interest in this topic. For a very long time, fairly simple algorithms were used in this area, such as support vector machines or various types of regression, together with basic encodings of text data, and these did not provide high results. The following dataset was used in the experiments: Scientific Entity Relation Core. The algorithms used were Long Short-Term Memory, Random Forest Classifier with Conditional Random Fields, and Named Entity Recognition with Bidirectional Encoder Representations from Transformers. In the findings, the metric scores of all models are compared with each other. This research is devoted to the processing of scientific articles in the machine learning area, because the subject has not been investigated thoroughly enough. Addressing this task can help machines understand natural languages better, so that they can solve other natural language processing tasks better, improving scores in general.


Data ◽  
2021 ◽  
Vol 6 (7) ◽  
pp. 71
Author(s):  
Gonçalo Carnaz ◽  
Mário Antunes ◽  
Vitor Beires Nogueira

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. It is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
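The reported figures are means, which may explain why the F1-score (0.733) is lower than the harmonic mean of the mean precision and mean recall (about 0.763). The abstract does not say how the averaging was done; the sketch below simply assumes a macro average over entity classes, with hypothetical class names and counts, to show that the two quantities generally differ:

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_average(counts):
    """Average per-class precision, recall and F1 with equal class weights."""
    scores = [prf(*c) for c in counts.values()]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))


# Hypothetical (tp, fp, fn) counts per entity class, for illustration only.
COUNTS = {"PERSON": (8, 2, 2), "PLACE": (3, 1, 7)}

macro_p, macro_r, macro_f1 = macro_average(COUNTS)
# F1 recomputed from the averaged precision and recall.
f1_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)

print(round(macro_f1, 3), round(f1_of_means, 3))
```

With these made-up counts the mean of the per-class F1 values is lower than the F1 computed from the mean precision and recall, the same direction as in the reported figures.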


2019 ◽  
pp. 1-8 ◽  
Author(s):  
Tomasz Oliwa ◽  
Steven B. Maron ◽  
Leah M. Chase ◽  
Samantha Lomnicki ◽  
Daniel V.T. Catenacci ◽  
...  

PURPOSE Robust institutional tumor banks depend on continuous sample curation or else subsequent biopsy or resection specimens are overlooked after initial enrollment. Curation automation is hindered by semistructured free-text clinical pathology notes, which complicate data abstraction. Our motivation is to develop a natural language processing method that dynamically identifies existing pathology specimen elements necessary for locating specimens for future use in a manner that can be re-implemented by other institutions. PATIENTS AND METHODS Pathology reports from patients with gastroesophageal cancer enrolled in The University of Chicago GI oncology tumor bank were used to train and validate a novel composite natural language processing-based pipeline with a supervised machine learning classification step to separate notes into internal (primary review) and external (consultation) reports; a named-entity recognition step to obtain label (accession number), location, date, and sublabels (block identifiers); and a results proofreading step. RESULTS We analyzed 188 pathology reports, including 82 internal reports and 106 external consult reports, and successfully extracted named entities grouped as sample information (label, date, location). Our approach identified up to 24 additional unique samples in external consult notes that could have been overlooked. Our classification model obtained 100% accuracy on the basis of 10-fold cross-validation. Precision, recall, and F1 for class-specific named-entity recognition models show strong performance. CONCLUSION Through a combination of natural language processing and machine learning, we devised a re-implementable and automated approach that can accurately extract specimen attributes from semistructured pathology notes to dynamically populate a tumor registry.


2016 ◽  
Vol 12 (4) ◽  
pp. 21-44 ◽  
Author(s):  
R. Hema ◽  
T. V. Geetha

The two main challenges in chemical entity recognition are: (i) new chemical compounds are constantly being synthesized; (ii) chemical representation is highly ambiguous, since a single chemical entity can be described by different nomenclatures. Therefore, the identification and maintenance of chemical terminologies is a tough task. Since most existing text mining methods follow term-based approaches, the problems of polysemy and synonymy arise. A Named Entity Recognition (NER) system based on pattern matching in the chemical domain is therefore developed to extract chemical entities from chemical documents. The Tf-idf and PMI association measures are used to filter out non-chemical terms. An F-score of 92.19% is achieved for chemical NER. The proposed method is compared with the baseline method and other existing approaches. As the final step, the filtered chemical entities are classified into sixteen functional groups. The classification is done using a one-against-all multiclass SVM and achieves an accuracy of 87%. One-way ANOVA is used to compare the quality of the pattern-matching method with that of other existing chemical NER methods.
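The abstract does not describe how the PMI filter is computed. As a sketch under stated assumptions, pointwise mutual information, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), can be estimated from document-level co-occurrence between a candidate term and a chemical context word. The function name `pmi`, the toy documents, and the idea of discarding candidates with PMI at or below zero are illustrative assumptions, not the authors' procedure:

```python
import math


def pmi(docs, x, y):
    """Pointwise mutual information of terms x and y over a list of token sets.

    Probabilities are estimated as document frequencies. Returns -inf when
    the two terms never co-occur.
    """
    n = len(docs)
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_xy == 0:
        return float("-inf")
    # log2( (n_xy / n) / ((n_x / n) * (n_y / n)) )
    return math.log2((n_xy * n) / (n_x * n_y))


# Toy corpus: each document is a set of tokens (hypothetical data).
DOCS = [
    {"benzene", "solvent"},
    {"benzene", "solvent"},
    {"table", "solvent"},
    {"table", "result"},
]

# Keep candidates that co-occur with the context word more than chance predicts.
candidates = ["benzene", "table"]
kept = [c for c in candidates if pmi(DOCS, c, "solvent") > 0]
print(kept)
```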


2019 ◽  
Vol 26 (2) ◽  
pp. 163-182 ◽  
Author(s):  
Serge Sharoff

Some languages have very few NLP resources, while many of them are closely related to better-resourced languages. This paper explores how the similarity between the languages can be utilised by porting resources from better- to lesser-resourced languages. The paper introduces a way of building a representation shared across related languages by combining cross-lingual embedding methods with a lexical similarity measure which is based on the weighted Levenshtein distance. One of the outcomes of the experiments is a Panslavonic embedding space for nine Balto-Slavonic languages. The paper demonstrates that the resulting embedding space helps in such applications as morphological prediction, named-entity recognition and genre classification.
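A weighted Levenshtein distance generalises the usual edit distance by letting some character substitutions cost less than others, so that regular sound or spelling correspondences between related languages count as small differences. The sketch below is a generic dynamic-programming implementation; the cost table, the 0.2 weight, and the example strings are hypothetical and are not the weights used in the paper:

```python
def weighted_levenshtein(a, b, sub_cost, indel=1.0):
    """Edit distance where substituting ca by cb costs sub_cost(ca, cb).

    Insertions and deletions cost `indel`; matching characters cost 0.
    """
    prev = [j * indel for j in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        cur = [i * indel]
        for j, cb in enumerate(b, 1):
            substitution = prev[j - 1] + (0.0 if ca == cb else sub_cost(ca, cb))
            deletion = prev[j] + indel
            insertion = cur[j - 1] + indel
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1]


# Hypothetical cheap substitutions (illustrative only).
CHEAP_PAIRS = {frozenset("oa"), frozenset("iy")}


def example_cost(ca, cb):
    """Return 0.2 for an assumed regular correspondence, 1.0 otherwise."""
    return 0.2 if frozenset((ca, cb)) in CHEAP_PAIRS else 1.0


def uniform_cost(ca, cb):
    """Plain Levenshtein substitution cost."""
    return 1.0


print(weighted_levenshtein("moloko", "malako", example_cost))
print(weighted_levenshtein("moloko", "malako", uniform_cost))
```

With the assumed weights the two strings come out much closer than under the unweighted distance, which is the property a cross-lingual similarity measure of this kind relies on.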


2019 ◽  
Vol 9 (18) ◽  
pp. 3658 ◽  
Author(s):  
Jianliang Yang ◽  
Yuenan Liu ◽  
Minghui Qian ◽  
Chenghua Guan ◽  
Xiangfei Yuan

Clinical named entity recognition is an essential task for humans to analyze large-scale electronic medical records efficiently. Traditional rule-based solutions need considerable human effort to build rules and dictionaries; machine learning-based solutions need laborious feature engineering. Currently, deep learning solutions like Long Short-term Memory with Conditional Random Field (LSTM–CRF) have achieved strong performance on many datasets. In this paper, we developed a multitask attention-based bidirectional LSTM–CRF (Att-biLSTM–CRF) model with pretrained Embeddings from Language Models (ELMo) in order to achieve better performance. In the multitask system, an additional task named entity discovery was designed to enhance the model's perception of unknown entities. Experiments were conducted on the 2010 Informatics for Integrating Biology & the Bedside/Veterans Affairs (I2B2/VA) dataset. Experimental results show that our model outperforms the state-of-the-art solution both as a single model and as an ensemble model. Our work proposes an approach to improve the recall in the clinical named entity recognition task based on the multitask mechanism.
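In an LSTM–CRF tagger, the LSTM produces a score for each tag at each token and the CRF layer adds tag-to-tag transition scores, so the best tag sequence is found jointly rather than token by token. The following is a minimal Viterbi decoding sketch in plain Python; the tag set, emission scores, and transition scores are hypothetical, and nothing here reproduces the Att-biLSTM–CRF model itself:

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[i][t]   : score of tag t at token i (e.g. from an LSTM)
    transitions[p][t] : score of moving from tag p to tag t (CRF layer)
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for t in range(n_tags):
            # Best previous tag for ending in tag t at this token.
            best_prev = max(range(n_tags), key=lambda p: score[p] + transitions[p][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emission[t])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag.
    path = [max(range(n_tags), key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["O", "B", "I"]  # hypothetical BIO tag set
# Transition scores: heavily penalise the invalid move O -> I.
TRANSITIONS = [
    [0.0, 0.0, -10.0],  # from O
    [0.0, 0.0, 0.0],    # from B
    [0.0, 0.0, 0.0],    # from I
]
# Made-up emission scores for a two-token sentence.
EMISSIONS = [
    [1.0, 0.5, 0.0],
    [0.0, 0.2, 1.0],
]

decoded = [TAGS[t] for t in viterbi(EMISSIONS, TRANSITIONS)]
print(decoded)
```

Choosing the best tag per token independently would give O then I, an invalid sequence; the transition penalty makes the decoder prefer B followed by I.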

