Distributional Similarity for Chinese: Exploiting Characters and Radicals

Mathematical Problems in Engineering ◽

10.1155/2012/347257 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11 ◽

Cited By ~ 2

Author(s):

Peng Jin ◽

John Carroll ◽

Yunfang Wu ◽

Diana McCarthy

Keyword(s):

Language Processing ◽

Gold Standard ◽

Chinese Language ◽

Similarity Score ◽

Content Word ◽

Word Similarity ◽

Similar Word ◽

Distributional Similarity ◽

Similarity Computation ◽

Large Corpus

Distributional similarity has attracted considerable attention in the field of natural language processing as an automatic means of countering the ubiquitous problem of sparse data. Chinese is a logographic language: words consist of characters, and each character is composed of one or more radicals. The meanings of characters are usually closely related to the meanings of the words that contain them. Likewise, radicals often make a predictable contribution to the meaning of a character: characters that share components tend to have similar or related meanings. In this paper, we utilize these properties of the Chinese language to improve Chinese word similarity computation. Given a content word, we first extract similar words from a large corpus and rank them by a similarity score. This rank is then adjusted according to the characters and components shared between the similar word and the target word. Experiments on two gold standard datasets show that the adjusted rank is superior to the original rank and closer to human judgments. In addition to quantitative evaluation, we examine the reasons behind errors, drawing on linguistic phenomena for our explanations.
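
The rank adjustment described in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual adjustment formula, so the additive bonuses, the function name `rerank_similar_words`, the parameter names, and the toy component table below are all assumptions made for illustration only.

```python
def rerank_similar_words(target, candidates, components=None,
                         char_bonus=0.1, component_bonus=0.05):
    """Re-rank (word, score) pairs by their overlap with the target word.

    `components` maps a character to the set of its components (radicals);
    characters missing from the map contribute no component overlap.
    The additive bonuses are hypothetical, not the paper's formula.
    """
    components = components or {}

    def parts(word):
        # Union of the components of every character in the word.
        return set().union(*(components.get(ch, set()) for ch in word))

    target_chars, target_parts = set(target), parts(target)
    adjusted = []
    for word, score in candidates:
        shared_chars = len(target_chars & set(word))
        shared_parts = len(target_parts & parts(word))
        adjusted.append((word, score
                         + char_bonus * shared_chars
                         + component_bonus * shared_parts))
    # Highest adjusted score first.
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)


# Toy data: the scores and the component table are made up for illustration.
toy_components = {"河": {"氵", "可"}, "湖": {"氵", "胡"}, "山": {"山"}}
ranked = rerank_similar_words("河", [("山", 0.40), ("湖", 0.38)], toy_components)
print([word for word, _ in ranked])
```

In this toy call, 湖 (lake) shares the water component 氵 with the target 河 (river), so it moves ahead of 山 (mountain) despite its lower corpus-based score.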


Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions

10.1101/103648 ◽

2017 ◽

Cited By ~ 1

Author(s):

Dat Duong ◽

Wasi Uddin Ahmad ◽

Eleazar Eskin ◽

Kai-Wei Chang ◽

Jingyi Jessica Li

Keyword(s):

Neural Network ◽

Gene Ontology ◽

Language Processing ◽

Classification Accuracy ◽

Dimensional Space ◽

Similarity Score ◽

Biological Functions ◽

Word Similarity ◽

True Protein ◽

Go Terms

The Gene Ontology (GO) database contains GO terms that describe biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. In this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this paper, we introduce two new solutions for this problem, by focusing instead on the definitions of the GO terms. We apply neural network based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second indirectly depends on the GO tree. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. The word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for the word-ordering within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely one definition entails another. We validate our methods in two ways. In the first experiment, we test the model’s ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model in identifying orthologs from randomly-matched genes in human, mouse, and fly. In both experiments, a hybrid of the NLP and GO-tree based methods achieves the best classification accuracy. Availability: github.com/datduong/NLPMethods2CompareGOterms
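
The first approach, comparing two definitions as unordered sets of words with embedding-based word similarity, can be sketched as follows. The abstract does not say how word-level similarities are aggregated, so the symmetric best-match average used here is an assumption, as are the function names and the two-dimensional toy embeddings (a real model would supply N-dimensional vectors).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def directed_similarity(words_a, words_b, embeddings):
    """Average, over words in A, of the best cosine match found in B."""
    best = [max(cosine(embeddings[a], embeddings[b]) for b in words_b)
            for a in words_a]
    return sum(best) / len(best)


def definition_similarity(def_a, def_b, embeddings):
    """Symmetric set-based similarity; word order is ignored."""
    words_a = {w for w in def_a.lower().split() if w in embeddings}
    words_b = {w for w in def_b.lower().split() if w in embeddings}
    if not words_a or not words_b:
        return 0.0
    return 0.5 * (directed_similarity(words_a, words_b, embeddings)
                  + directed_similarity(words_b, words_a, embeddings))


# Hypothetical toy embeddings, invented for illustration.
toy_embeddings = {
    "cell": [1.0, 0.0],
    "division": [0.0, 1.0],
    "mitosis": [0.1, 0.9],
    "membrane": [0.9, 0.1],
}
print(definition_similarity("cell division", "mitosis cell", toy_embeddings))
print(definition_similarity("cell division", "membrane", toy_embeddings))
```

Because the definitions are reduced to sets, reordering the words of either definition leaves the score unchanged, which is the limitation the second, sentence-encoder approach is meant to address.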


The Penn Chinese TreeBank: Phrase structure annotation of a large corpus

Natural Language Engineering ◽

10.1017/s135132490400364x ◽

2005 ◽

Vol 11 (2) ◽

pp. 207-238 ◽

Cited By ~ 139

Author(s):

NAIWEN XUE ◽

FEI XIA ◽

FU-DONG CHIOU ◽

MARTA PALMER

Keyword(s):

Language Processing ◽

Large Scale ◽

Chinese Language ◽

Phrase Structure ◽

The Public ◽

Part Of Speech ◽

The World ◽

Structure Annotation ◽

Annotation Quality ◽

Large Corpus

With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.


Proceedings of the second workshop on Chinese language processing held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics

10.3115/1117769 ◽

2000 ◽

Keyword(s):

Annual Meeting ◽

Computational Linguistics ◽

Language Processing ◽

Chinese Language


Analysis Accuracy of Similar Word Based Clustering (EWSB) Algorithm on Machine Translator Bahasa Indonesia-Minang

Kinetik Game Technology Information System Computer Network Computing Electronics and Control ◽

10.22219/kinetik.v3i3.241 ◽

2018 ◽

Vol 3 (3) ◽

Author(s):

Herry Sujaini

Keyword(s):

Machine Translation ◽

Clustering Algorithm ◽

Statistical Machine Translation ◽

Target Language ◽

Word Similarity ◽

Similar Word ◽

Word Clustering ◽

Translation Accuracy ◽

Bahasa Indonesia

Extended Word Similarity Based (EWSB) Clustering is a word clustering algorithm based on word similarity values computed from a corpus. One of the benefits of clustering with this algorithm is improved translation quality in statistical machine translation. Previous research showed that the EWSB algorithm could improve the Indonesian-English translator, where the algorithm was applied to Indonesian as the target language. This paper discusses the results of research using the EWSB algorithm on an Indonesian to Minang statistical machine translator, where the algorithm is applied to Minang as the target language. The results show that the EWSB algorithm is quite effective when Minang is the target language, improving translation accuracy by 6.36%.


A WORD-BASED CHINESE LANGUAGE UNDERSTANDING SYSTEM

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001488000042 ◽

1988 ◽

Vol 02 (01) ◽

pp. 25-35

Author(s):

TIAN-SHUN YAO

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Chinese Language ◽

Computer Programs ◽

World Knowledge ◽

Knowledge Source ◽

Language Understanding ◽

Language Analysis ◽

The World

A word-based Chinese language understanding system has been developed using the word-based theory of natural language processing. This theory is presented in the light of psychological language analysis and the features of the Chinese language, together with a description of the computer programs based on it. At the heart of the system are the definitions of a Total Information Dictionary and the World Knowledge Source. The purpose of this research is to develop a system that can understand not only individual Chinese sentences but also whole texts.


Learning from Disagreement: A Survey

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.12752 ◽

2021 ◽

Vol 72 ◽

pp. 1385-1470

Author(s):

Alexandra N. Uma ◽

Tommaso Fornaciari ◽

Dirk Hovy ◽

Silviu Paun ◽

Barbara Plank ◽

...

Keyword(s):

Language Processing ◽

Gold Standard ◽

Training Methods ◽

High Quality ◽

Training Models ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Growing Body ◽

Research Questions ◽

Speech Tagging

Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. 
All datasets and models employed in this paper are freely available as supplementary materials.
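
The difference between training on aggregated labels and training directly with soft labels can be made concrete with a small sketch. The helper names and the part-of-speech example are assumptions for illustration; the survey compares several training methods, and this shows only the basic idea of turning multiple judgments into a label distribution and scoring a prediction against it.

```python
import math
from collections import Counter


def soft_label(judgments, classes):
    """Turn annotator judgments for one item into a probability distribution."""
    counts = Counter(judgments)
    total = len(judgments)
    return [counts[c] / total for c in classes]


def majority_label(judgments):
    """Aggregate judgments into a single hard label by majority vote."""
    return Counter(judgments).most_common(1)[0][0]


def soft_cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(target_probs, predicted_probs))


# Hypothetical item: three annotators disagree on a part-of-speech tag.
classes = ["NOUN", "VERB"]
judgments = ["NOUN", "NOUN", "VERB"]

target = soft_label(judgments, classes)
print(majority_label(judgments), target)
# A prediction that mirrors the disagreement is penalised less than an
# overconfident one when the loss is computed against the soft target.
print(soft_cross_entropy(target, [2 / 3, 1 / 3]))
print(soft_cross_entropy(target, [0.99, 0.01]))
```

Majority voting discards the minority judgment entirely, whereas the soft target keeps it as probability mass that the loss can reward.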


Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing

10.18653/v1/w15-31 ◽

2015 ◽

Keyword(s):

Language Processing ◽

Chinese Language


A Deep Learning based Approach to Argument Recommendation

10.5121/csit.2021.111408 ◽

2021 ◽

Author(s):

Guangjie Li ◽

Yi Tang ◽

Biyi Yi ◽

Xiang Zhang ◽

Yan He

Keyword(s):

Neural Network ◽

Deep Learning ◽

Natural Language Processing ◽

Language Processing ◽

Context Information ◽

Data Embedding ◽

Software Developers ◽

Code Completion ◽

Software Engineers ◽

Large Corpus

Code completion is one of the most useful features provided by advanced IDEs and is widely used by software developers. However, as a kind of code completion, recommending arguments for method calls is less used. Most existing argument recommendation approaches provide a long list of syntactically correct candidate arguments, from which it is difficult for software engineers to select the correct ones. To this end, we propose a deep learning based approach to recommending arguments instantly when programmers type in the names of the methods they intend to invoke. First, we extract context information from a large corpus of open-source applications. Second, we preprocess the extracted dataset, which involves natural language processing and data embedding. Third, we feed the preprocessed dataset to a specially designed convolutional neural network to rank and recommend actual arguments. With the resulting CNN model trained with sample applications, we can sort the candidate arguments in a reasonable order and recommend the first one as the correct argument. We evaluate the proposed approach on 100 open-source Java applications. Results suggest that the proposed approach outperforms the state-of-the-art approaches in recommending arguments.


LIS4: Lesk Inspired Sense Specific Semantic Similarity using WordNet

Journal of Information & Knowledge Management ◽

10.1142/s0219649221500064 ◽

2021 ◽

pp. 2150006

Author(s):

Saravanakumar Kandasamy ◽

Aswani Kumar Cherukuri

Keyword(s):

Information Retrieval ◽

Natural Language Processing ◽

Natural Language ◽

Semantic Similarity ◽

Language Processing ◽

Gold Standard ◽

Question Answering ◽

Knowledge Based ◽

Benchmark Datasets ◽

Processing Information

Quantifying semantic similarity between concepts is an essential part of domains such as Natural Language Processing, Information Retrieval, and Question Answering, where it helps to better understand texts and their relationships. Over the last few decades, many measures have been proposed that incorporate various corpus-based and knowledge-based resources. WordNet and Wikipedia are two such knowledge-based resources. WordNet's contribution to these domains is enormous because of the richness with which it defines a word and its relationships with other words. In this paper, we propose an approach to quantifying the similarity between concepts that exploits the synsets and gloss definitions of different concepts in WordNet. Our method considers the gloss definitions, the contextual words that help define a word, the synsets of those contextual words, and the confidence of occurrence of a word in another word’s definition when calculating the similarity. Evaluation on different gold standard benchmark datasets shows the efficiency of our system in comparison with other existing taxonomical and definitional measures.
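
The Lesk-style idea underlying this method, scoring two concepts by the overlap of their gloss definitions, can be sketched in a few lines. This is only a basic gloss-overlap measure, not LIS4 itself: it omits the synsets of contextual words and the confidence-of-occurrence term, and the stopword list and glosses below are hand-written stand-ins rather than WordNet data.

```python
# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "or", "and", "in", "for", "to",
             "that", "is", "with"}


def content_words(gloss):
    """Lower-case the gloss, strip punctuation, and drop stopwords."""
    words = {w.strip(".,;()").lower() for w in gloss.split()}
    return words - STOPWORDS - {""}


def gloss_overlap(gloss_a, gloss_b):
    """Jaccard overlap between the content words of two glosses."""
    a, b = content_words(gloss_a), content_words(gloss_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hand-written glosses for illustration only (not taken from WordNet).
toy_glosses = {
    "car": "a motor vehicle with four wheels",
    "truck": "a motor vehicle for transporting loads",
    "banana": "an elongated yellow fruit",
}
print(gloss_overlap(toy_glosses["car"], toy_glosses["truck"]))
print(gloss_overlap(toy_glosses["car"], toy_glosses["banana"]))
```

Plain overlap rewards only exact shared words, which is why methods of this family typically expand each gloss with related synsets before comparing.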


Linguistic Inquiry and Word Count (LIWC)

Applied Natural Language Processing ◽

10.4018/978-1-60960-741-8.ch012 ◽

2012 ◽

pp. 206-229 ◽

Cited By ~ 24

Author(s):

Cindy K. Chung ◽

James W. Pennebaker

Keyword(s):

Group Dynamics ◽

Language Processing ◽

Behavioral Outcomes ◽

Content Word ◽

Analysis Tool ◽

Word Count ◽

Psychological States ◽

Psychological Measures ◽

Potential Applications ◽

Linguistic Inquiry

Linguistic Inquiry and Word Count (LIWC; Pennebaker, Booth, & Francis, 2007) is a word counting software program that references a dictionary of grammatical, psychological, and content word categories. LIWC has been used to efficiently classify texts along psychological dimensions and to predict behavioral outcomes, making it a text analysis tool widely used in the social sciences. LIWC can be considered a tool for applied natural language processing since, beyond classification, the relative uses of various LIWC categories can reflect the underlying psychology of demographic characteristics, honesty, health, status, relationship quality, group dynamics, or social context. By using a comparison group or longitudinal information, or validation with other psychological measures, LIWC analyses can be informative of a variety of psychological states and behaviors. Combining LIWC categories using new algorithms or using the processor to assess new categories and languages further extends the potential applications of LIWC.
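
Dictionary-based word counting of the kind described here can be sketched as follows. This is not the LIWC software or its lexicon: the two-category dictionary, the tokenizer, and the function name `category_percentages` are hypothetical, and the sketch assumes each category is reported as a percentage of all words in the text.

```python
import re
from collections import Counter


def category_percentages(text, dictionary):
    """Percentage of tokens in `text` that fall into each dictionary category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        # A token may belong to more than one category.
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    total = len(tokens)
    return {category: (100.0 * counts[category] / total if total else 0.0)
            for category in dictionary}


# Hypothetical toy dictionary, invented for illustration.
toy_dictionary = {
    "pronoun": {"i", "we", "you"},
    "positive_emotion": {"happy", "good"},
}
print(category_percentages("I am happy and we are good", toy_dictionary))
```

Normalizing by the total word count makes texts of different lengths comparable, which matters when category use is compared across groups or over time.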
