N-gram probability effects in a cloze task

2014 ◽  
Vol 9 (3) ◽  
pp. 437-472 ◽  
Author(s):  
Cyrus Shaoul ◽  
R. Harald Baayen ◽  
Chris F. Westbury

What knowledge influences our choice of words when we write or speak? Predicting which word a person will produce next is not easy, even when the linguistic context is known. One task that has been used to assess context-dependent word choice is the fill-in-the-blank task, also called the cloze task. The cloze probability of a specific context is an empirical measure obtained by asking many people to fill in the blank. In this paper we harness the power of large corpora to examine the influence of corpus-derived probabilistic information from a word’s micro-context on word choice. We asked young adults to complete short phrases called n-grams, with up to 20 responses per phrase. The probability of the response word and the conditional probability of the response given the context were predictive of the frequency with which each response was produced. Furthermore, the order in which participants generated multiple completions of the same context was also predicted by the conditional probability. These results suggest that word choice in cloze tasks taps into implicit knowledge of a person’s past experience with that word in various contexts. Moreover, the importance of n-gram conditional probabilities in our analysis is further evidence of implicit knowledge about multi-word sequences, and supports theories of language processing that involve anticipating or predicting based on context.
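The corpus-derived conditional probabilities at the heart of this analysis can be estimated with simple maximum-likelihood counts. A minimal sketch (the toy corpus and function names are ours, not the authors'):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def conditional_probability(corpus_tokens, context, word, n=3):
    """P(word | context) estimated as count(context + word) / count(context),
    where the context is an (n-1)-gram."""
    context = tuple(context)
    full = Counter(ngrams(corpus_tokens, n))
    prefix = Counter(ngrams(corpus_tokens, n - 1))
    if prefix[context] == 0:
        return 0.0
    return full[context + (word,)] / prefix[context]

corpus = "the cat sat on the mat and the cat sat on the sofa".split()
p = conditional_probability(corpus, ("cat", "sat"), "on", n=3)  # -> 1.0
```

In a real study these counts would come from a large corpus; with sparse contexts, a smoothed estimator would replace the raw ratio.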

Author(s):  
E. D. Avedyan ◽  
Le Thi Trang Linh

The article presents analytical results on decision-making by the majority voting algorithm (MVA). Particular attention is paid to the case of an even number of experts. The conditional probabilities of the MVA for two hypotheses are derived for an even number of experts, and their properties are investigated as functions of the conditional probability of a correct decision by independent, equally qualified experts and of the number of experts. An approach is proposed for calculating the probability that the MVA reaches the correct decision when the conditional probabilities of accepting each hypothesis differ across the statistically mutually independent experts. The findings are illustrated with numerical and graphical calculations.
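For equally qualified, statistically independent experts who are each correct with probability p, the MVA's probability of a correct decision is a binomial tail sum. A minimal sketch, with the even-n tie case reduced to a single flag (a simplification of the article's analysis):

```python
from math import comb

def majority_correct_probability(p, n, tie_counts_as_correct=False):
    """Probability that a majority of n independent experts, each correct
    with probability p, reaches the correct decision. For even n, an exact
    n/2-vs-n/2 tie is resolved according to the flag."""
    prob = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0 and tie_counts_as_correct:
        k = n // 2
        prob += comb(n, k) * p**k * (1 - p)**k
    return prob
```

For example, with p = 0.8 and n = 5 the majority is correct with probability about 0.942; for even n the tie-breaking rule visibly changes the result, which is the regime the article studies.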


2021 ◽  
pp. 103048
Author(s):  
Nidal Nasser ◽  
Lutful Karim ◽  
Ahmed El Ouadrhiri ◽  
Asmaa Ali ◽  
Nargis Khan

2020 ◽  
Vol 30 (1) ◽  
pp. 192-208 ◽  
Author(s):  
Hamza Aldabbas ◽  
Abdullah Bajahzar ◽  
Meshrif Alruily ◽  
Ali Adil Qureshi ◽  
Rana M. Amir Latif ◽  
...  

Abstract To maintain a competitive edge in the mobile application market, evaluating the quality needs of an app is essential. User feedback on these applications plays an essential role in the mobile application development industry. The rapid growth of web technology has given people the opportunity to interact and to express their reviews, ratings, and feedback about applications. In this paper we scraped 506,259 user reviews and application ratings from the Google Play Store, across 14 different categories. The statistical information was measured using several common machine learning algorithms: logistic regression, random forest, and multinomial naïve Bayes. Parameters including accuracy, precision, recall, and F1 score were used to evaluate bigram, trigram, and n-gram features, and the statistical results of these algorithms were compared. Each algorithm was analyzed in turn and its results evaluated. It is concluded that logistic regression is the best algorithm for review analysis of Google Play Store applications, achieving the highest accuracy when classifying reviews into three classes: positive, negative, and neutral.
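The kind of n-gram classification pipeline the abstract describes can be sketched with a pure-Python multinomial naïve Bayes, one of the three algorithms compared (the toy reviews and class labels are ours, not the study's data):

```python
from collections import Counter, defaultdict
import math

def ngram_features(text, n_max=2):
    """Unigram and bigram features of a review, as in the paper's
    bigram/trigram/n-gram comparisons."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        feats += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return feats

class MultinomialNB:
    """Multinomial naive Bayes over n-gram counts, with add-one smoothing."""
    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        vocab = set()
        for text, label in zip(texts, labels):
            for f in ngram_features(text):
                self.feature_counts[label][f] += 1
                vocab.add(f)
        self.vocab_size = len(vocab)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # class prior
            denom = sum(self.feature_counts[label].values()) + self.vocab_size
            for f in ngram_features(text):
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

reviews = ["love this app", "great app love it", "crashes all the time",
           "terrible crashes", "it is okay", "okay but fine"]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
clf = MultinomialNB().fit(reviews, labels)
```

The study's preferred logistic regression classifier would consume the same n-gram count features; only the final model differs.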


2020 ◽  
Vol 34 (05) ◽  
pp. 7391-7398
Author(s):  
Muhammad Asif Ali ◽  
Yifang Sun ◽  
Bing Li ◽  
Wei Wang

Fine-Grained Named Entity Typing (FG-NET) is a key component in Natural Language Processing (NLP). It aims at classifying an entity mention into a wide range of entity types. Owing to the large number of entity types, distant supervision is used to collect training data for this task, which noisily assigns type labels to entity mentions irrespective of context. To alleviate these noisy labels, existing approaches to FG-NET analyze entity mentions entirely independently of each other and assign type labels based solely on the mention's sentence-specific context. This is inadequate for highly overlapping and/or noisy type labels, as it hinders information passing across sentence boundaries. To address this, we propose an edge-weighted attentive graph convolution network that refines the noisy mention representations by attending over corpus-level contextual clues prior to the final classification. Experimental evaluation shows that the proposed model outperforms existing research by a relative score of up to 10.2% and 8.3% for macro-F1 and micro-F1, respectively.
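The core operation, attention-weighted aggregation over an edge-weighted graph, can be sketched in a few lines. This is a conceptual illustration only; the layer shape, attention form, and variable names are our assumptions, not the authors' architecture:

```python
import numpy as np

def edge_weighted_attentive_gcn_layer(H, A, W, a):
    """One edge-weighted, attention-based graph convolution layer:
    neighbor messages are scaled by learned attention scores and by the
    given edge weights before aggregation. Assumes A includes self-loops.
    H: (n, d) node features, A: (n, n) nonnegative edge weights,
    W: (d, d_out) projection, a: (2 * d_out,) attention vector."""
    Z = H @ W                                  # project node features
    n = Z.shape[0]
    # Raw attention logits for every pair (i attends to j).
    logits = np.array([[np.dot(a, np.concatenate([Z[i], Z[j]]))
                        for j in range(n)] for i in range(n)])
    logits = np.where(A > 0, logits, -np.inf)  # mask non-edges
    # Softmax over each node's neighbors, rescaled by edge weights.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    att = e * A
    att = att / att.sum(axis=1, keepdims=True)
    return np.maximum(att @ Z, 0.0)            # ReLU after aggregation

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                    # 4 mentions, 3-dim features
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1.0]])                 # weighted adjacency + self-loops
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
out = edge_weighted_attentive_gcn_layer(H, A, W, a)
```

In the paper, the graph would connect mentions across sentence boundaries so that corpus-level clues flow between them before classification.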


Author(s):  
Saugata Bose ◽  
Ritambhra Korpal

In this chapter, an initiative is proposed in which natural language processing (NLP) techniques and supervised machine learning algorithms are combined to detect external plagiarism. The major emphasis is on constructing a framework to detect plagiarism in monolingual texts by implementing an n-gram frequency comparison approach. The framework is based on 120 characteristics extracted during pre-processing using simple NLP approaches. Afterward, filter metrics are applied to select the most relevant features, and a supervised classification algorithm is then used to classify the documents into four levels of plagiarism. A confusion matrix is built to estimate the false positives and false negatives. Finally, the authors show the suitability of a C4.5 decision-tree classifier, which achieves higher accuracy than naïve Bayes. The framework achieved 89% accuracy with low false positive and false negative rates, and shows higher precision and recall than the passage similarity, sentence similarity, and search space reduction methods.
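The n-gram frequency comparison underlying the framework can be illustrated with a containment score (the 120-feature extraction and the four-level classification are not reproduced here; the helper names are ours):

```python
def word_ngrams(tokens, n):
    """Set of contiguous word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def containment(suspicious, source, n=3):
    """Fraction of the suspicious document's word n-grams that also occur
    in the source document, the containment measure commonly used for
    external plagiarism detection."""
    s = word_ngrams(suspicious.lower().split(), n)
    t = word_ngrams(source.lower().split(), n)
    return len(s & t) / len(s) if s else 0.0
```

Scores like this, computed at several n, would feed the feature vector that the C4.5 classifier maps to a plagiarism level.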


Information ◽  
2019 ◽  
Vol 10 (10) ◽  
pp. 317 ◽  
Author(s):  
Karol Nowakowski ◽  
Michal Ptaszynski ◽  
Fumito Masui

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
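The reduction described, finding the shortest sequence of lexicon entries covering the input, is a classic dynamic program. A minimal sketch with single words standing in for lexical n-grams (the toy lexicon is ours, not drawn from the Ainu corpus):

```python
import math

def mingmatch_segment(text, lexicon):
    """Sketch of the idea behind the MiNgMatch Segmenter: find the
    shortest sequence of lexicon entries that exactly covers the
    unsegmented text. Dynamic programming; not the authors' implementation."""
    n = len(text)
    best = [math.inf] * (n + 1)   # fewest entries covering text[:i]
    back = [0] * (n + 1)          # split point achieving best[i]
    best[0] = 0
    for i in range(1, n + 1):
        for j in range(i):
            if text[j:i] in lexicon and best[j] + 1 < best[i]:
                best[i] = best[j] + 1
                back[i] = j
    if math.isinf(best[n]):
        return None               # no full cover exists
    out, i = [], n
    while i > 0:                  # recover the segmentation
        out.append(text[back[i]:i])
        i = back[i]
    return out[::-1]

lexicon = {"irankarapte", "iran", "karapte", "e", "iwanke", "ya"}
segmented = mingmatch_segment("eiwankeya", lexicon)  # -> ["e", "iwanke", "ya"]
```

With entries that are themselves multi-word n-grams, the same objective prefers longer matched chunks, which is what makes the approach fast and data-light.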


2016 ◽  
Vol 10 (2) ◽  
pp. 284-300 ◽  
Author(s):  
MARK J. SCHERVISH ◽  
TEDDY SEIDENFELD ◽  
JOSEPH B. KADANE

Abstract Let κ be an uncountable cardinal. Using the theory of conditional probability associated with de Finetti (1974) and Dubins (1975), subject to several structural assumptions for creating sufficiently many measurable sets, and assuming that κ is not a weakly inaccessible cardinal, we show that each probability that is not κ-additive has conditional probabilities that fail to be conglomerable in a partition of cardinality no greater than κ. This generalizes a result of Schervish, Seidenfeld, & Kadane (1984), which established that each finite but not countably additive probability has conditional probabilities that fail to be conglomerable in some countable partition.
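For reference, conglomerability in a partition can be stated as follows (notation ours, not the paper's):

```latex
% P is conglomerable in a partition \pi = \{h_\alpha\} if, for every
% event E, the unconditional probability is bracketed by the extremes
% of the conditional probabilities taken across the partition:
\inf_{\alpha} P(E \mid h_\alpha) \;\le\; P(E) \;\le\; \sup_{\alpha} P(E \mid h_\alpha)
```

The paper exhibits probabilities for which this bracketing fails in some partition of cardinality at most κ.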


Author(s):  
Kenny Easwaran

Conditional probability has been put to many uses in philosophy, and several proposals have been made regarding its relation to unconditional probability, especially in cases involving infinitely many alternatives that may have probability 0. This chapter briefly summarizes some of the literature connecting conditional probabilities to probabilities of conditionals and to Humphreys' Paradox for chances, and then investigates in greater depth the issues around probability 0. Approaches due to Popper, Rényi, and Kolmogorov are considered. Some of the limitations and alternative formulations of each are discussed, in particular the issues arising around the property of “conglomerability” and the idea that conditional probabilities may depend on a conditioning algebra rather than just an event.


2019 ◽  
Vol 29 (7) ◽  
pp. 938-971 ◽  
Author(s):  
Kenta Cho ◽  
Bart Jacobs

Abstract The notions of disintegration and Bayesian inversion are fundamental in conditional probability theory. They produce channels, as conditional probabilities, from a joint state or from an already given channel (in the opposite direction). These notions exist in the literature in concrete situations, but are presented here in abstract graphical formulations. The resulting abstract descriptions are used for proving basic results in conditional probability theory. The existence of disintegration and Bayesian inversion is discussed for discrete probability, and also for measure-theoretic probability, via standard Borel spaces and via likelihoods. Finally, the usefulness of disintegration and Bayesian inversion is illustrated in several examples.
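For discrete probability, Bayesian inversion of a channel along a prior is just Bayes' rule computed pointwise. A concrete sketch with dictionaries standing in for distributions (the weather example is ours, not from the paper):

```python
def bayesian_inversion(prior, channel):
    """Invert a discrete channel c : X -> Dist(Y) along a prior on X,
    returning the inverted channel Y -> Dist(X) together with the
    pushforward distribution on Y."""
    # Pushforward: p(y) = sum_x prior(x) * channel(x)(y)
    pushforward = {}
    for x, px in prior.items():
        for y, pyx in channel[x].items():
            pushforward[y] = pushforward.get(y, 0.0) + px * pyx
    # Inversion: p(x | y) = prior(x) * channel(x)(y) / p(y)
    inverted = {y: {x: prior[x] * channel[x].get(y, 0.0) / py
                    for x in prior}
                for y, py in pushforward.items()}
    return inverted, pushforward

prior = {"rain": 0.3, "dry": 0.7}
channel = {"rain": {"wet": 0.9, "not_wet": 0.1},
           "dry":  {"wet": 0.2, "not_wet": 0.8}}
inverted, marginal = bayesian_inversion(prior, channel)
```

The measure-theoretic versions discussed in the paper generalize exactly this computation, replacing sums by integrals against a likelihood.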


2020 ◽  
pp. 1-25
Author(s):  
Kamila POLIŠENSKÁ ◽  
Shula CHIAT ◽  
Jakub SZEWCZYK ◽  
Katherine E. TWOMEY

Abstract Theories of language processing differ with respect to the role of abstract syntax and semantics vs surface-level lexical co-occurrence (n-gram) frequency. The contribution of each of these factors has been demonstrated in previous studies of children and adults, but none have investigated them jointly. This study evaluated the role of all three factors in a sentence repetition task performed by children aged 4–7 and 11–12 years. It was found that semantic plausibility benefitted performance in both age groups; syntactic complexity disadvantaged the younger group but benefitted the older group; while contrary to previous findings, n-gram frequency did not facilitate, and in a post-hoc analysis even hampered, performance. This new evidence suggests that n-gram frequency effects might be restricted to the highly constrained and frequent n-grams used in previous investigations, and that semantics and morphosyntax play a more powerful role than n-gram frequency, supporting the role of abstract linguistic knowledge in children's sentence processing.

