Detection and normalization of medical terms using domain-specific term frequency and adaptive ranking

Author(s):  
Mi-Young Kim ◽  
Randy Goebel

2021 ◽  
Vol 21 (1) ◽  
Author(s):  
Pilar López-Úbeda ◽  
Alexandra Pomares-Quimbaya ◽  
Manuel Carlos Díaz-Galiano ◽  
Stefan Schulz

Abstract

Background: Controlled vocabularies are fundamental resources for information extraction from clinical texts using natural language processing (NLP). Standard language resources in the healthcare domain, such as the UMLS Metathesaurus or SNOMED CT, are widely used for this purpose, but they have limitations such as the lexical ambiguity of clinical terms. However, most such terms are unambiguous within text restricted to a given clinical specialty, which is one rationale, among others, for classifying clinical texts by the specialty to which they belong.

Results: This paper addresses this limitation by proposing and applying a method that automatically extracts Spanish medical terms, classified and weighted per sub-domain, using Spanish MEDLINE titles and abstracts as input. The hypothesis is that biomedical NLP tasks benefit from collections of domain terms that are specific to clinical sub-domains. We use PubMed queries to generate sub-domain-specific corpora from Spanish titles and abstracts, from which token n-grams are collected and metrics of relevance, discriminatory power, and broadness per sub-domain are computed. The generated term set, called the Spanish core vocabulary about clinical specialties (SCOVACLIS), was made available to the scientific community and used in a text classification problem, obtaining an improvement of 6 percentage points in F-measure over the baseline using a Multilayer Perceptron, thus supporting the hypothesis that a specialized term set improves NLP tasks.

Conclusion: The creation and validation of SCOVACLIS support the hypothesis that specialty-specific term sets reduce the level of ambiguity when compared to a specialty-independent, broad-scope vocabulary.
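
To make the pipeline concrete, the following minimal sketch counts token n-grams per sub-domain corpus and scores each term. The abstract names relevance, discriminatory power, and broadness but does not define them, so the formulas used here (within-domain normalized frequency, that frequency's share relative to other sub-domains, and the fraction of sub-domains containing the term) are illustrative stand-ins, not the authors' definitions, as are the sub-domain names.

```python
# A minimal sketch, assuming simple stand-in metrics: the abstract names
# relevance, discriminatory power, and broadness per sub-domain but does not
# define them, so the formulas below are illustrative, not the authors'.
from collections import Counter

def ngrams(tokens, n_max=3):
    """Yield all token n-grams of length 1..n_max."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def subdomain_term_metrics(corpora):
    """corpora: dict mapping sub-domain name -> list of tokenized documents.
    Returns: sub-domain -> term -> (relevance, discrimination, broadness)."""
    counts = {d: Counter(g for doc in docs for g in ngrams(doc))
              for d, docs in corpora.items()}
    totals = {d: sum(c.values()) for d, c in counts.items()}
    n_domains = len(corpora)
    metrics = {}
    for d, c in counts.items():
        metrics[d] = {}
        for term, freq in c.items():
            # Relevance: normalized frequency within the sub-domain.
            relevance = freq / totals[d]
            # Discrimination: share of the term's mass held by this sub-domain.
            elsewhere = sum(counts[o][term] / totals[o]
                            for o in counts if o != d)
            discrimination = relevance / (relevance + elsewhere)
            # Broadness: fraction of sub-domains in which the term occurs.
            broadness = sum(term in counts[o] for o in counts) / n_domains
            metrics[d][term] = (relevance, discrimination, broadness)
    return metrics

# Toy example with tokenized Spanish titles per (hypothetical) sub-domain:
corpora = {
    "cardiologia": [["infarto", "agudo", "de", "miocardio"]],
    "neurologia":  [["accidente", "cerebrovascular", "agudo"]],
}
print(subdomain_term_metrics(corpora)["cardiologia"]["agudo"])
```

In this toy setup, "agudo" appears in both sub-domains, so it receives a low discrimination score and maximal broadness, while sub-domain-specific terms like "miocardio" score as highly discriminative.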


2021 ◽  
Author(s):  
Chuanxiao Li ◽  
Wenqiang Li ◽  
Zhong Tang ◽  
Song Li ◽  
Hai Xiang

Abstract

As a vital step in a text classification (TC) task, the assignment of term weights has a great influence on TC performance. Many term weighting schemes are available, such as term frequency-inverse document frequency (TF-IDF) and term frequency-relevance frequency (TF-RF), and they all consist of a local part (TF) and a global part (e.g., IDF, RF). Most of these schemes apply logarithmic processing to their global parts, which raises the question of whether such processing is appropriate for all of them. In fact, because the ratio between local and global weight that results from logarithmic processing differs across settings, a given term weighting scheme often yields diverse text classification results on different text sets, i.e., it shows poor robustness. To explore how logarithmic processing of the global weight influences the classification results of term weighting schemes, TF-RF is selected as the representative scheme, because it achieves better performance than other schemes that adopt logarithmic processing. Two propositions about the relation between the TF part and the RF part, together with corresponding methods, are then formulated on the basis of TF-RF, and two groups of experiments are conducted on the two methods. The first group of experiments shows that one method (denoted TF-ERF) is more beneficial than the other (denoted ETF-RF). The second group shows that TF-ERF not only outperforms TF-RF but also obtains better performance than other existing term weighting schemes.
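
For reference, the standard TF-RF weight multiplies the local term frequency by rf = log2(2 + a / max(1, c)), where a and c count the positive- and negative-class training documents containing the term (Lan et al., 2009). The sketch below implements this baseline and, since the abstract does not give the TF-ERF and ETF-RF formulas, adds only a hypothetical log-free global part to illustrate how the logarithm shifts the local/global balance.

```python
# The standard TF-RF weight (Lan et al., 2009) and a hypothetical log-free
# variant. The abstract does not define TF-ERF or ETF-RF, so only the
# baseline here is standard; tf_rf_no_log is an illustrative assumption
# showing how the logarithm compresses the global part.
import math

def rf(pos_df, neg_df):
    """Relevance frequency: log2(2 + a / max(1, c)), where a and c are the
    numbers of positive- and negative-class documents containing the term."""
    return math.log2(2 + pos_df / max(1, neg_df))

def tf_rf(tf, pos_df, neg_df):
    """Standard TF-RF term weight: local term frequency times global RF."""
    return tf * rf(pos_df, neg_df)

def tf_rf_no_log(tf, pos_df, neg_df):
    """Hypothetical variant without the logarithm on the global part."""
    return tf * (2 + pos_df / max(1, neg_df))

# A term occurring 3 times in a document, present in 50 positive and
# 2 negative training documents:
print(tf_rf(3, 50, 2))        # ~14.26: the log keeps the global part moderate
print(tf_rf_no_log(3, 50, 2)) # 81.0: without it, the global part dominates
```

The gap between the two outputs shows the robustness issue the abstract describes: with the logarithm, the global part stays on a scale comparable to typical term frequencies, while without it the global part can dwarf the local part entirely.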

