A Psycholinguistic Model for the Marking of Discourse Relations

2017 ◽  
Vol 8 (1) ◽  
pp. 106-131
Author(s):  
Frances Yung ◽  
Kevin Duh ◽  
Taku Komura ◽  
Yuji Matsumoto

Discourse relations can either be explicitly marked by discourse connectives (DCs), such as therefore and but, or implicitly conveyed in natural language utterances. How speakers choose between the two options is a question that is not well understood. In this study, we propose a psycholinguistic model that predicts whether or not speakers will produce an explicit marker given the discourse relation they wish to express. Our model is based on two information-theoretic frameworks: (1) the Rational Speech Acts model, which models the pragmatic interaction between language production and interpretation by Bayesian inference, and (2) the Uniform Information Density theory, which holds that speakers adjust linguistic redundancy to maintain a uniform rate of information transmission. Specifically, our model quantifies the utility of using or omitting a DC based on the expected surprisal of comprehension, the cost of production, and the availability of other signals in the rest of the utterance. Experiments based on the Penn Discourse Treebank show that our approach outperforms the state of the art at predicting the presence of DCs (Patterson and Kehler, 2013), while also giving an explanatory account of the speaker’s choice.
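The trade-off the abstract describes can be illustrated with a toy utility calculation (a minimal sketch with hypothetical probabilities and costs, not the paper's actual model or parameters): producing an explicit connective lowers the listener's expected surprisal about the intended relation but incurs a production cost.

```python
import math

def surprisal(p):
    """Surprisal in bits of an event with probability p."""
    return -math.log2(p)

def utility(p_intended, production_cost):
    """Toy speaker utility: negative expected comprehension surprisal
    of the intended relation, minus the cost of producing the signal."""
    return -surprisal(p_intended) - production_cost

# Hypothetical numbers: an explicit connective makes the intended relation
# nearly certain for the listener but costs production effort; omitting it
# is free, but the listener must infer the relation from weaker cues.
u_explicit = utility(p_intended=0.95, production_cost=0.5)
u_implicit = utility(p_intended=0.60, production_cost=0.0)

choice = "explicit" if u_explicit > u_implicit else "implicit"
```

With these made-up numbers the reduction in expected surprisal outweighs the production cost, so the sketch predicts an explicit marker; lowering the cost of the implicit option's interpretability shifts the choice the other way.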

2018 ◽  
Vol 4 (s2) ◽  
Author(s):  
Uriel Cohen Priva ◽  
T. Florian Jaeger

Abstract: It has long been noted that language production seems to reflect a correlation between message redundancy and signal reduction. More frequent words and contextually predictable instances of words, for example, tend to be produced with shorter and less clear signals. The same tendency is observed in the language code (e.g. the phonological lexicon), where more frequent words and words that are typically contextually predictable tend to have fewer segments or syllables. Average predictability in context (informativity) also seems to be an important factor in understanding phonological alternations. What has received little attention so far is the relation between the various information-theoretic indices – frequency, contextual predictability, and informativity. Although each of these indices has been associated with different theories about the source of the redundancy-reduction link, the indices tend to be highly correlated in natural language, making it difficult to tease apart their effects. We present a computational approach to this problem. We assess the correlations between frequency, predictability, and informativity, and determine when these correlations are likely to create spurious (null or non-null) effects depending on, for example, the amount of data available to the researcher.
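For readers unfamiliar with the three indices, the following sketch shows how frequency, contextual predictability, and informativity are typically computed (illustrative definitions over a tiny toy corpus, not the authors' computational approach), and why they tend to correlate: a rare word is often also unpredictable in context, and hence informative.

```python
import math
from collections import Counter

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Frequency: raw token counts.
freq = Counter(corpus)

# Contextual predictability: P(word | previous word), from bigram counts.
bigrams = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])

def predictability(word, prev):
    return bigrams[(prev, word)] / context_counts[prev]

def informativity(word):
    """Average surprisal of `word` over all contexts in which it occurs."""
    surprisals = [-math.log2(predictability(w2, w1))
                  for (w1, w2) in zip(corpus, corpus[1:]) if w2 == word]
    return sum(surprisals) / len(surprisals)
```

Even in this toy corpus the correlation is visible: "cat" is infrequent, unpredictable after "the", and thus carries 2 bits of informativity, while "sat" is perfectly predicted by its contexts and carries none.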


2012 ◽  
Vol 20 (2) ◽  
pp. 151-184 ◽  
Author(s):  
ZIHENG LIN ◽  
HWEE TOU NG ◽  
MIN-YEN KAN

Abstract: Since the release of the large discourse-level annotation of the Penn Discourse Treebank (PDTB), research has been carried out on certain subtasks of this annotation, such as disambiguating discourse connectives and classifying Explicit or Implicit relations. We see a need to construct a full parser on top of these subtasks and propose a way to evaluate it. In this work, we have designed and developed an end-to-end discourse parser to parse free texts in the PDTB style in a fully data-driven approach. The parser consists of multiple components joined in a sequential pipeline architecture, which includes a connective classifier, argument labeler, explicit classifier, non-explicit classifier, and attribution span labeler. Our trained parser first identifies all discourse and non-discourse relations, locates and labels their arguments, and then classifies the sense of the relation between each pair of arguments. For the identified relations, the parser also determines the attribution spans, if any, associated with them. We introduce novel approaches to locate and label arguments and to identify attribution spans, and we significantly improve on the current state-of-the-art connective classifier. We propose and present a comprehensive evaluation from both component-wise and error-cascading perspectives, illustrating how each component performs in isolation as well as how the pipeline performs with errors propagated forward. The parser gives an overall system F1 score of 46.80 percent for partial matching utilizing gold standard parses, and 38.18 percent with full automation.


2019 ◽  
Vol 53 (2) ◽  
pp. 3-10
Author(s):  
Muthu Kumar Chandrasekaran ◽  
Philipp Mayr

The 4th joint BIRNDL workshop was held at the 42nd ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2019) in Paris, France. BIRNDL 2019 aimed to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, and recommendation techniques that can advance the state of the art in scholarly document understanding, analysis, and retrieval at scale. The workshop incorporated several paper sessions and the 5th edition of the CL-SciSumm Shared Task.


Author(s):  
Siva Reddy ◽  
Mirella Lapata ◽  
Mark Steedman

In this paper we introduce a novel semantic parsing approach to query Freebase in natural language without requiring manual annotations or question-answer pairs. Our key insight is to represent natural language via semantic graphs whose topology shares many commonalities with Freebase. Given this representation, we conceptualize semantic parsing as a graph matching problem. Our model converts sentences to semantic graphs using CCG and subsequently grounds them to Freebase guided by denotations as a form of weak supervision. Evaluation experiments on a subset of the Free917 and WebQuestions benchmark datasets show our semantic parser improves over the state of the art.


2015 ◽  
Vol 20 (1) ◽  
pp. 98-113 ◽  
Author(s):  
ELENA TRIBUSHININA ◽  
WILLEM M. MAK ◽  
ELIZAVETA ANDREIUSHINA ◽  
ELENA DUBINKINA ◽  
TED SANDERS

Differences between monolinguals and bilinguals are often attributed to crosslinguistic influence. This paper compares production of discourse connectives by Dutch–Russian bilinguals (Dutch-dominant), typically-developing Dutch/Russian monolinguals and Russian-speaking children with SLI. If non-target-like production in bilinguals is due to crosslinguistic influence, bilinguals should perform differently from both impaired and unimpaired monolinguals. However, if differences between bilinguals and monolinguals are due to other factors (e.g., input quantity, processing capacities), bilinguals’ language production might be similar to that of children with SLI. The results demonstrate that language dominance determines the direction of crosslinguistic influence. In terms of frequency distributions of Russian connectives across pragmatic contexts, the bilingual group performed differently from both monolingual groups, and the differences were compatible with the structural properties of Dutch. However, based on error rates and types, bilinguals could not be distinguished from the SLI group, suggesting that factors other than crosslinguistic influence may also be at play.


2010 ◽  
Vol 22 (8) ◽  
pp. 2031-2058 ◽  
Author(s):  
Angelo Arleo ◽  
Thierry Nieus ◽  
Michele Bezzi ◽  
Anna D'Errico ◽  
Egidio D'Angelo ◽  
...  

A nerve cell receives multiple inputs from upstream neurons by way of its synapses. Neuron processing functions are thus influenced by changes in the biophysical properties of the synapse, such as long-term potentiation (LTP) or depression (LTD). This observation has opened new perspectives on the biophysical basis of learning and memory, but its quantitative impact on the information transmission of a neuron remains only partially understood. One major obstacle is the high dimensionality of the neuronal input-output space, which makes it infeasible to perform a thorough computational analysis of a neuron with multiple synaptic inputs. In this work, information theory was employed to characterize the information transmission of a cerebellar granule cell over a region of its excitatory input space following synaptic changes. Granule cells have a small dendritic tree (on average, they receive only four mossy fiber afferents), which greatly bounds the input combinatorial space and reduces the complexity of information-theoretic calculations. Numerical simulations and LTP experiments quantified how changes in neurotransmitter release probability (p) modulated the information transmission of a cerebellar granule cell. Numerical simulations showed that p shaped the neurotransmission landscape in unexpected ways. As p increased, the optimality of the information transmission of most stimuli did not increase strictly monotonically; instead, it reached a plateau at intermediate p levels. Furthermore, our results showed that the spatiotemporal characteristics of the inputs determine the effect of p on neurotransmission, thus permitting the selection of distinctive preferred stimuli for different p values. These selective mechanisms may have important consequences for the encoding of cerebellar mossy fiber inputs and for plasticity and computation at the next circuit stage, including the parallel fiber–Purkinje cell synapses.
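The core quantity in information-theoretic analyses of this kind, the mutual information between stimulus and response, can be computed directly from a joint distribution. The sketch below uses an illustrative hand-made distribution over two binary variables, not data or code from the study:

```python
import math

def mutual_information(joint):
    """I(S;R) in bits from a joint table P(s, r):
    sum over s, r of p(s,r) * log2( p(s,r) / (p(s) * p(r)) )."""
    p_s = [sum(row) for row in joint]            # marginal over stimuli
    p_r = [sum(col) for col in zip(*joint)]      # marginal over responses
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                mi += p * math.log2(p / (p_s[i] * p_r[j]))
    return mi

# Illustrative joint distribution: a response that mostly, but not
# always, follows the stimulus carries a fraction of a bit.
mi = mutual_information([[0.4, 0.1],
                         [0.1, 0.4]])
```

The two sanity checks are worth keeping in mind: an independent joint gives 0 bits, and a deterministic one-to-one mapping between two equiprobable binary variables gives exactly 1 bit.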


Author(s):  
Di Wu ◽  
Xiao-Yuan Jing ◽  
Haowen Chen ◽  
Xiaohui Kong ◽  
Jifeng Xuan

Application Programming Interface (API) tutorials are an important API learning resource. To help developers learn APIs, an API tutorial is often split into a number of consecutive units that describe the same topic (i.e. tutorial fragments). We regard a tutorial fragment explaining an API as a relevant fragment of that API. Automatically recommending relevant tutorial fragments can help developers learn how to use an API. However, existing approaches typically recommend relevant fragments in a supervised or unsupervised manner, which suffers from heavy manual annotation effort or inaccurate recommendations. Furthermore, these approaches only support exact API names as input. In practice, developers often do not know which APIs to use and are more likely to describe API-related questions in natural language. In this paper, we propose a novel approach, called Tutorial Fragment Recommendation (TuFraRec), to effectively recommend relevant tutorial fragments for API-related natural language questions without much manual annotation effort. For each API tutorial, we split it into fragments and extract APIs from each fragment to build API-fragment pairs. Given a question, TuFraRec first generates several clarification APIs that are related to the question. We use the clarification APIs and API-fragment pairs to construct candidate API-fragment pairs. Then, we design a semi-supervised metric learning (SML)-based model to find relevant API-fragment pairs in the candidate list, which works well with a few labeled API-fragment pairs and a large number of unlabeled ones. In this way, the manual effort for labeling the relevance of API-fragment pairs is reduced. Finally, we rank and recommend relevant API-fragment pairs. We evaluate TuFraRec on 200 API-related natural language questions and two public tutorial datasets (Java and Android). The results demonstrate that, on average, TuFraRec improves NDCG@5 by 0.06 and 0.09 and MRR by 0.07 and 0.09 on the two tutorial datasets compared with the state-of-the-art approach.
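NDCG@k and MRR, the two ranking metrics reported above, have standard definitions that can be sketched in a few lines (a generic implementation of the metrics, not the authors' evaluation code):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked items."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def mrr(rankings):
    """Mean reciprocal rank; each ranking is a list of 0/1 relevance flags
    in ranked order, and only the first relevant hit counts."""
    total = 0.0
    for ranks in rankings:
        for i, rel in enumerate(ranks):
            if rel:
                total += 1.0 / (i + 1)
                break
    return total / len(rankings)
```

Both metrics reward placing relevant fragments near the top: NDCG@5 is 1.0 only when the top-5 ordering is ideal, and MRR averages the reciprocal position of the first relevant fragment across questions.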


2003 ◽  
Vol 9 (4) ◽  
pp. 325-363
Author(s):  
SYLVAIN DELISLE ◽  
BERNARD MOULIN ◽  
TERRY COPECK

Most information systems that deal with natural language texts do not tolerate much deviation from their idealized and simplified model of language. Spoken dialog, however, is notoriously ungrammatical. Because the MAREDI project focuses in particular on the automatic analysis of scripted dialogs, we needed to develop a robust capacity to analyze transcribed spoken language. This paper summarizes the current state of our work. It presents the main elements of our approach, which is based on exploiting surface markers as the best route to the semantics of the conversation modelled. We highlight the foundations of our particular conversational model and give an overview of the MAREDI system. We then discuss its three key modules: a connectionist network to recognise speech acts, a robust syntactic analyzer, and a semantic analyzer.


Author(s):  
Yixin Nie ◽  
Yicheng Wang ◽  
Mohit Bansal

Success in natural language inference (NLI) should require a model to understand both lexical and compositional semantics. However, through adversarial evaluation, we find that several state-of-the-art models with diverse architectures are over-relying on the former and fail to use the latter. Further, this compositionality unawareness is not reflected via standard evaluation on current datasets. We show that removing RNNs in existing models or shuffling input words during training does not induce large performance loss despite the explicit removal of compositional information. Therefore, we propose a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone (i.e., on which a bag-of-words model gives a high probability to one wrong label), hence revealing the models’ actual compositionality awareness. We show that this setup not only highlights the limited compositional ability of current NLI models, but also differentiates model performance based on design, e.g., separating shallow bag-of-words models from deeper, linguistically-grounded tree-based models. Our evaluation setup is an important analysis tool: complementing currently existing adversarial and linguistically driven diagnostic evaluations, and exposing opportunities for future work on evaluating models’ compositional understanding.
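The selection criterion behind the proposed testing setup can be sketched as a simple filter (a hypothetical interface with made-up probabilities; the paper's actual construction may differ): keep only the examples on which a bag-of-words model confidently assigns a wrong label, since those cannot be solved by lexical features alone.

```python
def compositionality_test_subset(examples, threshold=0.8):
    """Select examples a bag-of-words baseline gets confidently wrong.

    Each example is (gold_label, bow_probs), where bow_probs maps each
    candidate label to the probability the bag-of-words model assigns it.
    """
    hard = []
    for gold, bow_probs in examples:
        predicted = max(bow_probs, key=bow_probs.get)
        if predicted != gold and bow_probs[predicted] >= threshold:
            hard.append((gold, bow_probs))
    return hard

# Made-up NLI examples: only the second one is confidently misclassified
# by the lexical baseline, so only it probes compositional ability.
examples = [
    ("entailment",    {"entailment": 0.90, "contradiction": 0.05, "neutral": 0.05}),
    ("contradiction", {"entailment": 0.85, "contradiction": 0.10, "neutral": 0.05}),
    ("neutral",       {"entailment": 0.40, "contradiction": 0.30, "neutral": 0.30}),
]

hard = compositionality_test_subset(examples)
```

A model evaluated only on this filtered subset can no longer score well by exploiting word-level cues, which is what exposes the gap between shallow and compositional architectures.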


2021 ◽  
Author(s):  
Xian Yang ◽  
Shuo Wang ◽  
Yuting Xing ◽  
Ling Li ◽  
Richard Yi Da Xu ◽  
...  

Abstract: In epidemiological modelling, the instantaneous reproduction number, Rt, is important for understanding the transmission dynamics of infectious diseases. Current Rt estimates often suffer from problems such as lagging, averaging, and uncertainty, which diminish the usefulness of Rt. To address these problems, we propose a new method in the framework of sequential Bayesian inference, taking a Data Assimilation approach to Rt estimation and resulting in the state-of-the-art ‘DARt’ system. With DARt, the problem of time misalignment caused by lagging observations is tackled by incorporating observation delays into the joint inference of infections and Rt; the drawback of averaging is mitigated by instantaneous updating upon new observations and a model selection mechanism capturing abrupt changes caused by interventions; and the uncertainty is quantified and reduced by Bayesian smoothing. We validate the performance of DARt through simulations and demonstrate its power in revealing the transmission dynamics of COVID-19.
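The basic sequential Bayesian updating idea underlying such estimators can be sketched with a grid approximation (a minimal illustration assuming a simple Poisson renewal likelihood; DARt itself additionally handles observation delays, model selection, and Bayesian smoothing):

```python
import math

def poisson_logpmf(k, lam):
    """Log of the Poisson pmf; log-space avoids overflow for large counts."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def update_rt_posterior(prior, rt_grid, new_cases, prev_cases):
    """One sequential Bayesian step over a grid of candidate Rt values:
    posterior(Rt) proportional to prior(Rt) * Poisson(new_cases; Rt * prev_cases)."""
    log_post = [math.log(p) + poisson_logpmf(new_cases, rt * prev_cases)
                for p, rt in zip(prior, rt_grid)]
    m = max(log_post)                      # subtract max for numerical stability
    post = [math.exp(lp - m) for lp in log_post]
    total = sum(post)
    return [p / total for p in post]

rt_grid = [0.1 * i for i in range(1, 41)]      # candidate Rt values 0.1 .. 4.0
prior = [1.0 / len(rt_grid)] * len(rt_grid)    # flat prior

# Observing 120 new cases after 100 should concentrate mass near Rt = 1.2.
posterior = update_rt_posterior(prior, rt_grid, new_cases=120, prev_cases=100)
rt_map = rt_grid[posterior.index(max(posterior))]
```

Feeding each day's posterior back in as the next day's prior gives the instantaneous, per-observation updating the abstract contrasts with window-averaged estimates.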

