A Text Mining Approach for Definition Question Answering

Author(s): Claudia Denicia-Carral, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, René García Hernández

2020, pp. 1686-1704
Author(s): Emna Hkiri, Souheyl Mallat, Mounir Zrigui

Event extraction consists of detecting and classifying events within open-domain text. The task is still very new for Arabic, whereas it has reached maturity for languages such as English and French. Event extraction has also been shown to improve the performance of Natural Language Processing tasks such as Information Retrieval, Question Answering, text mining, and machine translation. In this article, we present an ongoing effort to build a system for event extraction from Arabic texts using the GATE platform and other tools.
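
The abstract does not detail the extraction rules, but the sketch below illustrates the general rule-based style of event extraction that platforms such as GATE support: trigger words are matched against sentences to yield typed events. The trigger lexicon and event types here are invented for the example, not taken from the paper.

```python
import re

# Hypothetical trigger lexicon: event type -> trigger words (illustrative only).
TRIGGERS = {
    "ATTACK": ["attacked", "bombed", "struck"],
    "MEETING": ["met", "convened", "gathered"],
}

def extract_events(text):
    """Return (event_type, trigger, sentence) tuples found by keyword matching."""
    events = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for event_type, triggers in TRIGGERS.items():
            for trigger in triggers:
                if re.search(rf"\b{re.escape(trigger)}\b", sentence, re.IGNORECASE):
                    events.append((event_type, trigger, sentence))
    return events

print(extract_events("The delegations met in Tunis. Rebels attacked the convoy."))
```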


Author(s): Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Pérez-Quiñones, Edward A. Fox, ...

Starting with a vast number of unstructured or semistructured documents, text mining tools analyze and sift through them to present users with the information most relevant to their needs. Text mining technologies include information extraction, topic tracking, summarization, categorization/classification, clustering, concept linkage, information visualization, and question answering [Fan, Wallace, Rich, & Zhang, 2006]. In this chapter, we share our hands-on experience with one specific text mining task: text classification [Sebastiani, 2002]. Information occurs in various formats, and some formats have a specific structure or carry specific content; we refer to these as 'genres'. Examples of information genres include news items, reports, academic articles, etc. In this chapter, we deal with one specific genre, the course syllabus. A course syllabus commonly contains the following fields: title, description, instructor's name, textbook details, class schedule, etc. In essence, a course syllabus is the skeleton of a course. Free and fast access to a collection of syllabi in a structured format could have a significant impact on education, especially for educators and life-long learners. Educators can borrow ideas from others' syllabi to organize their own classes, and life-long learners can easily find popular textbooks, and even the important chapters, when they want to learn a subject on their own. Unfortunately, searching for a syllabus on the Web using the Information Retrieval [Baeza-Yates & Ribeiro-Neto, 1999] techniques employed by a generic search engine often yields many non-relevant result pages (i.e., noise): some only provide guidelines on syllabus creation, some only give the schedule for a course event, and some merely link out to syllabi (e.g., a course list page of an academic department). Therefore, a well-designed classifier for the search results is needed, one that helps not only to filter out the noise but also to identify the more relevant and useful syllabi.
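
A minimal sketch of the kind of genre classifier described here, assuming scikit-learn; the example pages and labels are placeholders rather than the authors' data, and the pipeline (TF-IDF features with a linear model) is a common baseline, not their exact method.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training pages labeled as true syllabi (1) or noise (0).
pages = [
    "CS 101 Introduction to Programming. Instructor, textbook, class schedule, grading.",
    "How to write a good syllabus: guidelines for new faculty.",
    "Department course list with links to individual syllabi.",
    "ENG 220 Syllabus. Course description, instructor, weekly readings, exam dates.",
]
labels = [1, 0, 0, 1]

# TF-IDF features plus a linear classifier, a standard baseline for genre classification.
classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
classifier.fit(pages, labels)

print(classifier.predict(["PHYS 140 Syllabus: instructor, textbook, exam schedule"]))
```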


Author(s): Antonio Juárez-González, Alberto Téllez-Valero, Claudia Denicia-Carral, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda

2016, Vol 29 (2), pp. 255-275
Author(s): Arash Joorabchi, Michael English, Abdulhussain E. Mahdi

Purpose – The use of social media, and in particular community Question Answering (Q&A) websites, by learners has increased significantly in recent years. The vast amounts of data posted on these sites provide an opportunity to investigate the topics under discussion and those receiving the most attention. The purpose of this paper is to automatically analyse the content of a popular computer programming Q&A website, StackOverflow (SO), determine the exact topics of the posted Q&As, and narrow down their categories to help determine the subject difficulties of learners. By doing so, the authors have been able to rank the identified topics and categories according to their frequencies, mark the most asked-about subjects, and hence identify the most difficult and challenging topics commonly faced by learners of computer programming and software development.

Design/methodology/approach – In this work the authors have adopted a heuristic research approach combined with a text mining approach to investigate the topics and categories of Q&A posts on the SO website. Almost 186,000 Q&A posts were analysed and their categories refined using Wikipedia as a crowd-sourced classification system. After identifying and counting the occurrence frequency of all the topics and categories, their semantic relationships were established. These data were then presented as a rich graph which can be visualized using graph visualization software such as Gephi.

Findings – The reported results and the corresponding discussion indicate that the insight gained from the process can be further refined and potentially used by instructors, teachers, and educators to pay more attention to, and focus on, the commonly occurring topics/subjects when designing their course material, delivery, and teaching methods.

Research limitations/implications – The proposed approach limits the scope of the analysis to the subset of Q&As which contain one or more links to Wikipedia. Therefore, developing more sophisticated text mining methods capable of analysing a larger portion of the available data would improve the accuracy and generalizability of the results.

Originality/value – The application of text mining and data analytics technologies in education has created a new interdisciplinary field of research between the education and information sciences, called Educational Data Mining (EDM). The work presented in this paper falls under this field of research; it is an early attempt at investigating the practical applications of text mining technologies in the area of computer science (CS) education.
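
The sketch below shows one plausible way to count topic frequencies and export a topic co-occurrence graph in a format Gephi can open (GEXF), assuming networkx; the post/topic data are fabricated stand-ins for the Wikipedia-derived categories used in the paper, and the co-occurrence weighting is an illustrative choice rather than the authors' exact procedure.

```python
from collections import Counter
from itertools import combinations
import networkx as nx

# Placeholder: topics (e.g., Wikipedia concepts) attached to each Q&A post.
post_topics = [
    ["Java", "Regular expression"],
    ["Java", "Multithreading"],
    ["Regular expression", "String (computer science)"],
]

# Frequency of each topic across all posts.
frequency = Counter(topic for topics in post_topics for topic in topics)

graph = nx.Graph()
for topic, count in frequency.items():
    graph.add_node(topic, weight=count)
for topics in post_topics:
    for a, b in combinations(sorted(set(topics)), 2):
        # Increment edge weight for topics co-occurring in the same post.
        weight = graph[a][b]["weight"] + 1 if graph.has_edge(a, b) else 1
        graph.add_edge(a, b, weight=weight)

nx.write_gexf(graph, "topics.gexf")  # open the resulting file in Gephi for visualization
```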


Author(s): José Ignacio Serrano

Owing to the growing amount of digital information stored in natural language, systems that automatically process text are of crucial importance and extremely useful. There is currently a considerable amount of research (Sebastiani, 2002; Crammer et al., 2003) applying a large variety of machine learning algorithms and other Knowledge Discovery in Databases (KDD) methods to Text Categorization (automatic labeling of texts according to category), Information Retrieval (retrieval of texts similar to a given cue), Information Extraction (identification of pieces of text that carry certain meanings), and Question Answering (automatic answering of user questions about a certain topic). The texts or documents used can be stored either in ad hoc databases or on the World Wide Web. Data mining in texts, better known as Text Mining, is a case of KDD with some particular issues. On the one hand, the features are obtained from the words contained in the texts, or are the words themselves, so text mining systems face a huge number of attributes. On the other hand, the features are highly correlated in forming meanings, so it is necessary to take the relationships among words into account, which implies considering syntax and semantics as human beings do. KDD techniques require input texts to be represented as a set of attributes in order to work with them. The text-to-representation process is called text or document indexing, and the attributes are called indexes. Indexing is therefore a crucial step in text mining, because an indexed representation must capture, with only a set of indexes, most of the information expressed in natural language in the texts, with minimal loss of semantics, in order to perform as well as possible.
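
A minimal sketch of the indexing step described above, assuming scikit-learn: each document is reduced to a vector over the vocabulary, with TF-IDF weights as the index values. This is only one common indexing scheme, not the specific representation advocated in the chapter.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Text categorization assigns labels to documents.",
    "Information retrieval finds documents similar to a query.",
]

# Indexing: each document becomes a sparse vector over the vocabulary (the indexes).
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the indexes (vocabulary terms)
print(matrix.toarray().round(2))           # one weighted vector per document
```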


Question Answering is one of the most common applications for data acquisition. Although most text-mining applications strive to improve the user experience and the tools used to find appropriate answers, problems persist because web content is constantly increasing. Question Classification (QC), i.e., classifying the type of each question, is one of the main tasks in improving such systems. A large number of QC methods have been introduced to help resolve the classification problem, most of them bag-of-words approaches. In this project, we propose a QC system that uses the Part-of-Speech (POS) tagger and Named Entity Recognition (NER) tagger from Stanford CoreNLP to classify questions correctly. We start by cleaning the data, removing the labels already present in the questions. We then split each question into words and tag every word with the POS tagger, converting the question into a pattern without changing its structure. Next, we tag the question with the NER tagger. Finally, a question-type confirmation module verifies certain question types so that the system works efficiently.
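
A rough sketch of the tagging stage, assuming the stanza package (from the Stanford NLP Group) as a stand-in for the CoreNLP pipeline named above: each word is replaced by its POS tag to form a pattern, and named entities are collected for the later confirmation step. The pipeline configuration and example question are assumptions for illustration.

```python
import stanza  # used here in place of the Stanford CoreNLP POS and NER taggers

# stanza.download("en")  # one-time model download
nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,ner")

def question_pattern(question):
    """Replace each word with its POS tag, keeping the question structure intact."""
    doc = nlp(question)
    tags = [word.upos for sentence in doc.sentences for word in sentence.words]
    entities = [(ent.text, ent.type) for ent in doc.ents]
    return tags, entities

tags, entities = question_pattern("Who wrote the novel Moby Dick?")
print(tags)      # e.g. ['PRON', 'VERB', 'DET', 'NOUN', 'PROPN', 'PROPN', 'PUNCT']
print(entities)  # e.g. [('Moby Dick', 'WORK_OF_ART')]
```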


2014, Vol 08 (04), pp. 461-489
Author(s): Hamid Mousavi, Shi Gao, Deirdre Kerr, Markus Iseli, Carlo Zaniolo

The Web is making possible many advanced text-mining applications, such as news summarization, essay grading, question answering, semantic search, and structured queries on corpora of Web documents. For many such applications, statistical text-mining techniques are of limited effectiveness since they do not utilize the morphological structure of the text. Many other approaches use NLP-based techniques that parse the text into parse trees and then use patterns to mine and analyze those trees, a process that is often unnecessarily complex. To reduce this complexity and ease the entire text mining process, we propose a weighted-graph representation of text, called TextGraphs, which captures the grammatical and semantic relations between words and terms in the text. TextGraphs are generated with a new text mining framework, SemScape, which is the main focus of this paper. SemScape uses a statistical parser to generate a few of the most probable parse trees for each sentence and employs a novel two-step pattern-based technique to extract candidate terms and their grammatical relations from the parse trees. Moreover, SemScape resolves coreferences with a novel technique, generates domain-specific TextGraphs by consulting ontologies, and provides a SPARQL-like query language and an optimized engine for semantically querying and mining TextGraphs.
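
This is not the SemScape framework itself, but a small sketch of the underlying idea of a word-relation graph, assuming spaCy for parsing and networkx for the graph; edges here carry plain dependency labels rather than the relations produced by SemScape's pattern-based extraction.

```python
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def text_graph(text):
    """Build a directed graph whose edges are grammatical relations between words."""
    graph = nx.DiGraph()
    for token in nlp(text):
        if token.dep_ != "ROOT":
            graph.add_edge(token.head.text, token.text, relation=token.dep_)
    return graph

graph = text_graph("The statistical parser generates the most probable trees.")
for head, word, data in graph.edges(data=True):
    print(head, "->", word, data["relation"])
```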


2013, Vol 46, pp. 165-201
Author(s): V. Qazvinian, D. R. Radev, S. M. Mohammad, B. Dorr, D. Zajic, ...

Researchers and scientists increasingly find themselves having to quickly digest large amounts of technical material. Our goal is to serve this need by using bibliometric text mining and summarization techniques to generate summaries of the scientific literature. We show how citations can be used to produce automatically generated, readily consumable, technical extractive summaries. We first propose C-LexRank, a model for summarizing single scientific articles based on their citations, which employs community detection and extracts salient, information-rich sentences. Next, we extend our experiments to summarize a set of papers that cover the same scientific topic. We generate extractive summaries of a set of Question Answering (QA) and Dependency Parsing (DP) papers, their abstracts, and their citation sentences, and show that citations carry unique information amenable to creating a summary.
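
A rough sketch in the spirit of the C-LexRank pipeline (not the authors' implementation): build a similarity graph over citation sentences, detect communities, and pick the most central sentence in each as a summary sentence. It assumes scikit-learn and networkx, with fabricated citation sentences and an arbitrary similarity threshold.

```python
import networkx as nx
from networkx.algorithms import community
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder citation sentences about a target paper.
sentences = [
    "Smith et al. introduced a QA system based on predicate-argument structures.",
    "The QA system of Smith et al. relies on predicate-argument structures.",
    "Their parser achieves high accuracy on dependency benchmarks.",
    "Dependency parsing accuracy was the main evaluation in their work.",
]

similarity = cosine_similarity(TfidfVectorizer().fit_transform(sentences))

# Similarity graph: connect sentence pairs above a threshold.
graph = nx.Graph()
graph.add_nodes_from(range(len(sentences)))
for i in range(len(sentences)):
    for j in range(i + 1, len(sentences)):
        if similarity[i, j] > 0.1:
            graph.add_edge(i, j, weight=similarity[i, j])

# One representative (most central) sentence per detected community.
for group in community.greedy_modularity_communities(graph):
    centrality = nx.degree_centrality(graph.subgraph(group))
    best = max(group, key=centrality.get)
    print(sentences[best])
```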


Author(s): Ganesh Ramakrishnan, Pushpak Bhattacharyya

Text mining systems of the first generation, such as categorizers and query retrievers, hinged largely on word-level statistics and provided a wonderful first-cut approach. However, systems based on simple word-level statistics quickly saturate in performance, despite the best data mining and machine learning algorithms. This problem can be traced to the fact that naive, word-based feature representations are typically used in text applications, and these prove insufficient for bridging two types of chasms within and across documents, viz. the lexical chasm and the syntactic chasm. The latest wave in text mining technology has been marked by research that makes the extraction of subtleties from the underlying meaning of text a possibility. In the following two chapters, we pose the problem of extracting underlying meaning from text documents, coupled with world knowledge, as a problem of bridging these chasms by exploiting associations between entities. The entities are words or word collocations from documents. We utilize two types of entity associations, viz. paradigmatic (PA) and syntagmatic (SA). We present first-tier algorithms that use these two word associations to bridge the syntactic and lexical chasms. We also propose second-tier algorithms for two sample applications, viz. question answering and text classification, which build on the first-tier algorithms. Our contribution lies in the specific methods we introduce for exploiting the entity association information present in WordNet, dictionaries, corpora, and parse trees for improved performance in text mining applications.
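
A toy sketch of the two association types, assuming NLTK: paradigmatic associations are approximated by WordNet lemmas sharing a synset, and syntagmatic associations by co-occurrence counts over a tiny corpus. Both are simplifications of the first-tier algorithms described in the chapter, included only to make the distinction concrete.

```python
from collections import Counter
from itertools import combinations

from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

def paradigmatic(word):
    """Words substitutable for `word`: WordNet lemmas sharing a synset with it."""
    return {lemma.name() for synset in wn.synsets(word) for lemma in synset.lemmas()} - {word}

def syntagmatic(sentences):
    """Word pairs that tend to occur together in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        counts.update(combinations(sorted(set(sentence.lower().split())), 2))
    return counts.most_common(5)

print(paradigmatic("car"))
print(syntagmatic(["the judge asked a question", "the witness answered the question"]))
```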


Author(s): Annie Louis, Mirella Lapata

Online discussion forums and community question-answering websites provide one of the primary avenues for online users to share information. In this paper, we propose text mining techniques that help users navigate troubleshooting-oriented data such as questions asked on forums and their suggested solutions. We introduce Bayesian generative models of the troubleshooting data and apply them to two interrelated tasks: (a) predicting the complexity of the solutions (e.g., plugging a keyboard into the computer is easier than installing a special driver) and (b) presenting them in ranked order from least to most complex. Experimental results show that our models are on par with human performance on these tasks, while outperforming baselines based on solution length or readability.
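
The sketch below is not the Bayesian generative model from the paper, only a simple supervised stand-in: a Naive Bayes classifier over solution text predicts a complexity probability, which is then used to rank solutions from least to most complex. The solution texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Placeholder solutions labeled 0 (easy) or 1 (complex).
solutions = [
    "Plug the keyboard into a different USB port.",
    "Restart the computer and check the cable.",
    "Edit the registry and reinstall the chipset driver in safe mode.",
    "Recompile the kernel module with debugging symbols enabled.",
]
labels = [0, 0, 1, 1]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(solutions, labels)

new_solutions = [
    "Reinstall the driver from the vendor website.",
    "Unplug and replug the device.",
]
# Rank from least to most complex by the predicted probability of the 'complex' class.
scores = model.predict_proba(new_solutions)[:, 1]
for score, text in sorted(zip(scores, new_solutions)):
    print(round(score, 2), text)
```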

