Combining Lexical and Syntactic Features for Detecting Content-Dense Texts in News

2017 ◽  
Vol 60 ◽  
pp. 179-219 ◽  
Author(s):  
Yinfei Yang ◽  
Ani Nenkova

Content-dense news reports convey important factual information about an event in a direct, succinct manner. Information-seeking applications such as information extraction, question answering, and summarization normally assume that all text they deal with is content-dense. Here we empirically test this assumption on news articles from the business, U.S. international relations, sports, and science journalism domains. Our findings clearly indicate that about half of the news texts in our study are in fact not content-dense, and they motivate the development of a supervised content-density detector. We heuristically label a large training corpus for the task and train a two-layer classification model based on lexical and unlexicalized syntactic features. On manually annotated data, we compare the performance of domain-specific classifiers, trained on data from a single news domain, with that of a general classifier trained on data pooled from all four domains. Our annotation and prediction experiments demonstrate that the notion of content density varies across domains and that naive annotators provide judgements biased toward the stereotypical domain label. Domain-specific classifiers are more accurate for domains in which content-dense texts are typically fewer; domain-independent classifiers better reproduce naive crowdsourced judgements. Classification accuracy is high across all conditions, around 80%.
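
As a rough illustration of the kind of lexical classifier the abstract describes, the sketch below trains a bag-of-words content-density model with scikit-learn. It is a minimal stand-in, not the paper's two-layer model: the unlexicalized syntactic features are omitted and the two heuristically labeled examples are invented for the demonstration.

```python
# Minimal lexical content-density sketch (illustrative only; the paper's
# model also uses unlexicalized syntactic features and a two-layer design).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical heuristically labeled texts: 1 = content-dense, 0 = not.
texts = [
    "The company reported quarterly revenue of $4.2 billion, up 8 percent.",
    "It was a sunny afternoon, and the stadium buzzed with anticipation.",
]
labels = [1, 0]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # lexical (word and bigram) features
    LogisticRegression(max_iter=1000),
)
clf.fit(texts, labels)

print(clf.predict(["Shares fell 3 percent after the earnings call."]))
```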

AI Magazine ◽  
2010 ◽  
Vol 31 (3) ◽  
pp. 93 ◽  
Author(s):  
Stephen Soderland ◽  
Brendan Roof ◽  
Bo Qin ◽  
Shi Xu ◽  
Mausam ◽  
...  

Information extraction (IE) can identify a set of relations from free text to support question answering (QA). Until recently, IE systems were domain-specific and needed a combination of manual engineering and supervised learning to adapt to each target domain. A new paradigm, Open IE, operates on large text corpora without any manual tagging of relations, and indeed without any pre-specified relations. Due to its open-domain and open-relation nature, Open IE is purely textual and is unable to relate surface forms to an ontology, even if one is known in advance. We explore the steps needed to adapt Open IE to a domain-specific ontology and demonstrate our approach of mapping domain-independent tuples to an ontology using domains from DARPA’s Machine Reading Project. Our system achieves precision over 0.90 from as few as 8 training examples for an NFL-scoring domain.
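
To make the tuple-to-ontology mapping concrete, the following sketch shows the general shape of the step the abstract describes: an open-domain (arg1, relation, arg2) tuple is checked against a small set of domain relations. The relation names and keyword tests are hypothetical; the system in the paper learns such mappings from a handful of examples rather than hand-coding them.

```python
# Hypothetical mapping from open-domain extraction tuples to a small
# NFL-scoring ontology; the keyword rules stand in for a learned mapping.
from typing import Optional

def map_tuple(arg1: str, rel: str, arg2: str) -> Optional[str]:
    rel = rel.lower()
    if "scored" in rel or "kicked" in rel:
        return f"gameScore(team={arg1!r}, points={arg2!r})"
    if "defeated" in rel or "beat" in rel:
        return f"gameWinner(winner={arg1!r}, loser={arg2!r})"
    return None  # the tuple does not map into the ontology

print(map_tuple("The Packers", "defeated", "the Bears"))
```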


Author(s):  
Uga Sproģis ◽  
Matīss Rikters

We present the Latvian Twitter Eater Corpus, a set of tweets in the narrow domain of food, drinks, eating, and drinking. The corpus has been collected over a time span of more than 8 years and includes over 2 million tweets, accompanied by additional useful data. We also separate out two sub-corpora: question-and-answer tweets and sentiment-annotated tweets. We analyse the contents of the corpus and demonstrate use cases for the sub-corpora by training domain-specific question-answering and sentiment-analysis models on data from the corpus.
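
As an illustration of the sentiment-analysis use case, the sketch below trains a tiny classifier on sentiment-labeled tweets with scikit-learn. The two Latvian examples and their labels are invented; the actual sub-corpus and its format are documented with the corpus release.

```python
# Minimal sentiment-classification sketch for a tweet sub-corpus
# (hypothetical examples; not the released corpus format).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

tweets = ["šī kūka ir lieliska", "kafija šodien bija briesmīga"]
labels = ["positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(tweets, labels)
print(model.predict(["garšīga kūka"]))
```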


2006 ◽  
Vol 25 ◽  
pp. 17-74 ◽  
Author(s):  
S. Thiebaux ◽  
C. Gretton ◽  
J. Slaney ◽  
D. Price ◽  
F. Kabanza

A decision process in which rewards depend on history rather than merely on the current state is called a decision process with non-Markovian rewards (NMRDP). In decision-theoretic planning, where many desirable behaviours are more naturally expressed as properties of execution sequences than as properties of states, NMRDPs form a more natural model than the commonly adopted fully Markovian decision process (MDP) model. While the more tractable solution methods developed for MDPs do not directly apply in the presence of non-Markovian rewards, a number of solution methods for NMRDPs have been proposed in the literature. These all exploit a compact specification of the non-Markovian reward function in temporal logic to automatically translate the NMRDP into an equivalent MDP, which is then solved using efficient MDP solution methods. This paper presents NMRDPP (Non-Markovian Reward Decision Process Planner), a software platform for developing and experimenting with methods for decision-theoretic planning with non-Markovian rewards. The current version of NMRDPP implements, under a single interface, a family of methods based on existing as well as new approaches, which we describe in detail. These include dynamic programming, heuristic search, and structured methods. Using NMRDPP, we compare the methods and identify certain problem features that affect their performance. NMRDPP's treatment of non-Markovian rewards is inspired by the treatment of domain-specific search control knowledge in the TLPlan planner, which it incorporates as a special case. In the First International Probabilistic Planning Competition, NMRDPP was able to compete and perform well in both the domain-independent and hand-coded tracks, using search control knowledge in the latter.
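
The core translation idea can be shown on a toy example: a history-dependent reward ("reach g, but only after visiting a") becomes Markovian once the state is augmented with a small amount of memory, after which standard value iteration applies. The three-state domain below is hypothetical and far simpler than the temporal-logic machinery NMRDPP implements.

```python
# Toy illustration: augment the state with a memory bit so a non-Markovian
# reward becomes Markovian, then run ordinary value iteration.
from itertools import product

states = ["s", "a", "g"]
actions = {"s": ["to_a", "to_g"], "a": ["to_g"], "g": ["stay"]}
succ = {("s", "to_a"): "a", ("s", "to_g"): "g",
        ("a", "to_g"): "g", ("g", "stay"): "g"}

def reward(loc, seen_a):
    # 1 for being in g, but only if a was visited earlier in the history
    return 1.0 if loc == "g" and seen_a else 0.0

gamma = 0.9
V = {(loc, m): 0.0 for loc, m in product(states, (False, True))}
for _ in range(100):                      # value iteration on the augmented MDP
    new_V = {}
    for (loc, m) in V:
        vals = []
        for act in actions[loc]:
            nxt = succ[(loc, act)]
            nm = m or nxt == "a"          # memory bit: have we visited a yet?
            vals.append(reward(nxt, nm) + gamma * V[(nxt, nm)])
        new_V[(loc, m)] = max(vals)
    V = new_V

print(round(V[("s", False)], 2))          # the optimal plan goes through a first
```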


2012 ◽  
Vol 21 (04) ◽  
pp. 383-403 ◽  
Author(s):  
ELENA FILATOVA

Wikipedia is used as a training corpus for many information selection tasks: summarization, question answering, etc. The information presented in Wikipedia articles, as well as the order in which it is presented, is treated as the gold standard and is used to improve the quality of information selection systems. However, the Wikipedia articles corresponding to the same entry (person, location, event, etc.) written in different languages differ substantially in what information they include. In this paper we analyze the regularities of information overlap among articles about the same Wikipedia entry written in different languages: some facts are covered in the Wikipedia articles of many languages, while others are covered in only a few. We hypothesize that the structure of this information overlap is similar to the information overlap structure (pyramid model) used in summarization evaluation, as well as to the overlap and repetition structure used to identify important information in multidocument summarization. We verify this hypothesis by building a summarization system based on the presented information overlap structure. The system summarizes English Wikipedia articles given the articles about the same Wikipedia entries written in other languages. To evaluate the quality of the created summaries, we use Amazon Mechanical Turk as the source of human subjects who can reliably judge the quality of the created text. We also compare the summaries generated according to the information overlap hypothesis against a lead-line baseline, which is considered the most reliable way to generate summaries of Wikipedia articles. The summarization experiment confirms the introduced multilingual Wikipedia information overlap hypothesis.
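
The overlap hypothesis lends itself to a very simple scoring scheme: rank the sentences of the English article by how many other-language versions of the same entry cover similar content. The sketch below assumes, purely for illustration, that the other-language articles have already been translated to English and measures coverage by shared content words; the paper's system is considerably more careful.

```python
# Toy overlap-based sentence ranking (assumes pre-translated articles and
# crude word overlap; illustrative only).
def content_words(text):
    stop = {"the", "a", "an", "of", "in", "is", "was", "and", "to"}
    return {w.strip(".,()").lower() for w in text.split()} - stop

def overlap_score(sentence, other_language_articles):
    words = content_words(sentence)
    return sum(1 for art in other_language_articles
               if len(words & content_words(art)) >= 3)

english_sentences = [
    "The bridge was completed in 1932 and spans the harbour.",
    "A souvenir shop operates near the southern pylon.",
]
other_versions = [
    "Completed in 1932, the bridge spans the harbour.",    # e.g. German, translated
    "The bridge, finished in 1932, crosses the harbour.",  # e.g. French, translated
]

ranked = sorted(english_sentences,
                key=lambda s: overlap_score(s, other_versions), reverse=True)
print(ranked[0])  # the sentence covered by the most language versions
```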


2021 ◽  
Author(s):  
Huseyin Denli ◽  
Hassan A Chughtai ◽  
Brian Hughes ◽  
Robert Gistri ◽  
Peng Xu

Deep learning, particularly with transformer models, has recently provided step-change capabilities for natural language processing applications such as question answering, query-based summarization, and language translation in general-purpose contexts. We have developed a geoscience-specific language processing solution using such models to enable geoscientists to perform rapid, fully quantitative and automated analysis of large corpora of data and gain insights. One of the key transformer-based models is BERT (Bidirectional Encoder Representations from Transformers). It is trained on a large amount of general-purpose text (e.g., Common Crawl). Using such a model for geoscience applications faces a number of challenges. One is the insignificant presence of geoscience-specific vocabulary in general-purpose text (e.g., everyday language); the other is geoscience jargon (domain-specific meanings of words). For example, salt is more likely to be associated with table salt in everyday language, but it refers to a subsurface entity in the geosciences. To alleviate these challenges, we retrained a pre-trained BERT model on our 20M internal geoscientific records. We refer to the retrained model as GeoBERT. We fine-tuned GeoBERT for a number of tasks, including geoscience question answering and query-based summarization. BERT models are very large; for example, BERT-Large has 340M trained parameters. Geoscience language processing with these models, including GeoBERT, can incur substantial latency when the entire database is processed at every call of the model. To address this challenge, we developed a retriever-reader engine that uses an embedding-based similarity search as a context-retrieval step, narrowing the context for a given query before it is processed with GeoBERT. We built a solution integrating the context-retrieval and GeoBERT models. Benchmarks show that it effectively helps geologists identify answers and context for given questions. The prototype also produces summaries at different granularities for a given set of documents. We have also demonstrated that the domain-specific GeoBERT outperforms general-purpose BERT for geoscience applications.
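
The retriever-reader pattern described above can be sketched with openly available models; GeoBERT itself is not public, so the models and passages below are stand-ins used only to show the two-step flow of embedding-based retrieval followed by an extractive reader.

```python
# Retriever-reader sketch with public models standing in for GeoBERT.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

passages = [
    "The reservoir interval consists of turbidite sandstones sealed by shale.",
    "Salt diapirs in the basin create structural traps along their flanks.",
]

retriever = SentenceTransformer("all-MiniLM-L6-v2")       # embedding model
reader = pipeline("question-answering",
                  model="deepset/roberta-base-squad2")    # extractive QA reader

query = "What creates traps in the basin?"
p_emb = retriever.encode(passages, normalize_embeddings=True)
q_emb = retriever.encode([query], normalize_embeddings=True)[0]
context = passages[int(np.argmax(p_emb @ q_emb))]         # cosine-similarity retrieval

print(reader(question=query, context=context))
```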


Author(s):  
Weijian Ni ◽  
Tong Liu ◽  
Qingtian Zeng ◽  
Nengfu Xie

Domain terminologies are a basic resource for various natural language processing tasks. To automatically discover terminologies for a domain of interest, most traditional approaches rely on a domain-specific corpus given in advance; thus, their performance can only be guaranteed when a high-quality domain-specific corpus is collected, which requires extensive human involvement and domain expertise. In this article, we propose a novel approach that automatically mines domain terminologies from a search engine's query log, a type of domain-independent corpus with higher availability, coverage, and timeliness than a manually collected domain-specific corpus. In particular, we represent the query log as a heterogeneous network and formulate the task of mining domain terminology as transductive learning on this network. In the proposed approach, the manifold structure of domain specificity inherent in the query log is captured by a novel network embedding algorithm and further exploited to reduce the manual annotation effort needed for domain terminology classification. We select agriculture and healthcare as the target domains and experiment with a real query log from a commercial search engine. Experimental results show that the proposed approach outperforms several state-of-the-art approaches.
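
The transductive setup can be illustrated with a toy bipartite graph: terms link to the queries they occur in, a few terms carry seed domain labels, and scores are propagated back and forth until unlabeled terms acquire domain scores. The queries and seeds below are invented, and plain score propagation stands in for the paper's network embedding algorithm.

```python
# Toy transductive label propagation over a query-term bipartite graph
# (hypothetical queries and seeds; a stand-in for the learned embedding).
from collections import defaultdict

queries = [
    "wheat rust treatment",
    "wheat planting season",
    "diabetes insulin dosage",
    "insulin pump price",
]
seeds = {"wheat": "agriculture", "insulin": "healthcare"}

term_to_q, q_to_term = defaultdict(set), defaultdict(set)
for q in queries:
    for t in q.split():
        term_to_q[t].add(q)
        q_to_term[q].add(t)

score = defaultdict(lambda: defaultdict(float))
for t, lab in seeds.items():
    score[t][lab] = 1.0

for _ in range(10):                       # term -> query -> term propagation
    q_score = defaultdict(lambda: defaultdict(float))
    for q, terms in q_to_term.items():
        for t in terms:
            for lab, s in score[t].items():
                q_score[q][lab] += s / len(terms)
    new_score = defaultdict(lambda: defaultdict(float))
    for t, qs in term_to_q.items():
        for q in qs:
            for lab, s in q_score[q].items():
                new_score[t][lab] += s / len(qs)
    for t, lab in seeds.items():          # keep the seed labels clamped
        new_score[t] = defaultdict(float, {lab: 1.0})
    score = new_score

print({t: max(score[t], key=score[t].get) for t in ["rust", "dosage"]})
```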

