MKQA: A Linguistically Diverse Benchmark for Multilingual Open Domain Question Answering

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00433 ◽

2021 ◽

Vol 9 ◽

pp. 1389-1406

Author(s):

Shayne Longpre ◽

Yi Lu ◽

Joachim Daiber

Keyword(s):

Question Answering ◽

State Of The Art ◽

Linguistically Diverse ◽

Data Representation ◽

Independent Data ◽

Open Domain ◽

Low Resource ◽

Art Methods ◽

Questions And Answers ◽

Cross Lingual

Progress in cross-lingual modeling depends on challenging, realistic, and diverse evaluation sets. We introduce Multilingual Knowledge Questions and Answers (MKQA), an open-domain question answering evaluation set comprising 10k question-answer pairs aligned across 26 typologically diverse languages (260k question-answer pairs in total). Answers are based on a heavily curated, language-independent data representation, making results comparable across languages and independent of language-specific passages. With 26 languages, this dataset supplies the widest range of languages to date for evaluating question answering. We benchmark a variety of state-of-the-art methods and baselines for generative and extractive question answering, trained on Natural Questions, in zero-shot and translation settings. Results indicate this dataset is challenging even in English, but especially in low-resource languages.

Relevance-guided Supervision for OpenQA with ColBERT

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00405 ◽

2021 ◽

Vol 9 ◽

pp. 929-944

Author(s):

Omar Khattab ◽

Christopher Potts ◽

Matei Zaharia

Keyword(s):

Question Answering ◽

State Of The Art ◽

Training Data ◽

Coarse Grained ◽

Retrieval Model ◽

Open Domain ◽

Weak Supervision ◽

Fine Grained ◽

Vector Representations ◽

Large Corpus

Systems for Open-Domain Question Answering (OpenQA) generally depend on a retriever for finding candidate passages in a large corpus and a reader for extracting answers from those passages. In much recent work, the retriever is a learned component that uses coarse-grained vector representations of questions and passages. We argue that this modeling choice is insufficiently expressive for dealing with the complexity of natural language questions. To address this, we define ColBERT-QA, which adapts the scalable neural retrieval model ColBERT to OpenQA. ColBERT creates fine-grained interactions between questions and passages. We propose an efficient weak supervision strategy that iteratively uses ColBERT to create its own training data. This greatly improves OpenQA retrieval on Natural Questions, SQuAD, and TriviaQA, and the resulting system attains state-of-the-art extractive OpenQA performance on all three datasets.
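
The "fine-grained interactions" mentioned in this abstract can be illustrated with a small sketch of late-interaction scoring as described in the original ColBERT paper: each query token embedding is matched to its most similar passage token embedding, and the per-token maxima are summed. The function names and the toy two-dimensional vectors are hypothetical, and this is not the ColBERT-QA implementation; real systems use learned BERT token embeddings and an index for fast search.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_vecs, passage_vecs):
    """Sum, over query tokens, of the best cosine match among passage tokens.

    query_vecs and passage_vecs are lists of token embeddings
    (hypothetical inputs; one vector per token).
    """
    if not query_vecs or not passage_vecs:
        raise ValueError("both token lists must be non-empty")
    q = [normalize(v) for v in query_vecs]
    p = [normalize(v) for v in passage_vecs]
    return sum(max(dot(qv, pv) for pv in p) for qv in q)


# Toy example: a two-token query against two candidate passages.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query tokens
passage_b = [[2.0, 0.0], [3.0, 0.0]]  # covers only the first query token
ranked = sorted(
    [("a", passage_a), ("b", passage_b)],
    key=lambda item: late_interaction_score(query, item[1]),
    reverse=True,
)
```

Because every query token is scored separately, a passage that matches only part of the question is ranked below one that matches all of it, which a single pooled vector per text may not capture.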

XQA: A Cross-lingual Open-domain Question Answering Dataset

10.18653/v1/p19-1227 ◽

2019 ◽

Cited By ~ 1

Author(s):

Jiahua Liu ◽

Yankai Lin ◽

Zhiyuan Liu ◽

Maosong Sun

Keyword(s):

Question Answering ◽

Open Domain ◽

Cross Lingual

Capturing Greater Context for Question Generation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i05.6440 ◽

2020 ◽

Vol 34 (05) ◽

pp. 9065-9072

Author(s):

Luu Anh Tuan ◽

Darsh Shah ◽

Regina Barzilay

Keyword(s):

Reading Comprehension ◽

Question Answering ◽

State Of The Art ◽

Attention Mechanism ◽

Dialogue Systems ◽

Question Generation ◽

Multi Stage ◽

Art Methods ◽

Relevant Context ◽

Automatic Question Generation

Automatic question generation can benefit many applications ranging from dialogue systems to reading comprehension. While questions are often asked with respect to long documents, there are many challenges with modeling such long documents. Many existing techniques generate questions by effectively looking at one sentence at a time, leading to questions that are easy and not reflective of the human process of question generation. Our goal is to incorporate interactions across multiple sentences to generate realistic questions for long documents. In order to link a broad document context to the target answer, we represent the relevant context via a multi-stage attention mechanism, which forms the foundation of a sequence-to-sequence model. We outperform state-of-the-art methods on question generation on three question-answering datasets: SQuAD, MS MARCO, and NewsQA.

Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00039 ◽

2018 ◽

Vol 6 ◽

pp. 557-570 ◽

Cited By ~ 23

Author(s):

Xilun Chen ◽

Yu Sun ◽

Ben Athiwaratkun ◽

Claire Cardie ◽

Kilian Weinberger

Keyword(s):

State Of The Art ◽

Classification Problem ◽

Sentiment Classification ◽

Great Success ◽

Source Language ◽

Shared Feature ◽

Low Resource ◽

Feature Extractor ◽

Cross Lingual ◽

Averaging Network

In recent years, great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to low-resource languages where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor to learn hidden representations that are simultaneously indicative of the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.

Improving Candidate Generation for Low-resource Cross-lingual Entity Linking

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00303 ◽

2020 ◽

Vol 8 ◽

pp. 109-124

Author(s):

Shuyan Zhou ◽

Shruti Rijhwani ◽

John Wieting ◽

Jaime Carbonell ◽

Graham Neubig

Keyword(s):

State Of The Art ◽

Target Language ◽

Entity Linking ◽

Average Gain ◽

Source Language ◽

Low Resource ◽

High Resource ◽

Language Knowledge ◽

Cross Lingual ◽

Improved Model

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.
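
The Top-30 gold candidate recall reported in this abstract can be read as the fraction of mentions whose gold KB entity appears among the first 30 generated candidates. A minimal sketch under that assumed definition follows; the function name, the entity identifiers, and the toy data are hypothetical and are not taken from the paper.

```python
def top_k_recall(candidates_per_mention, gold_entities, k=30):
    """Fraction of mentions whose gold entity is among the top-k candidates.

    candidates_per_mention: one ranked candidate list per mention.
    gold_entities: the gold KB entity for each mention, in the same order.
    """
    if len(candidates_per_mention) != len(gold_entities):
        raise ValueError("need exactly one gold entity per mention")
    if not gold_entities:
        return 0.0
    hits = sum(
        1
        for candidates, gold in zip(candidates_per_mention, gold_entities)
        if gold in candidates[:k]
    )
    return hits / len(gold_entities)


# Toy example with made-up entity identifiers.
candidates = [["E1", "E7", "E3"], ["E9", "E2"], ["E5"]]
gold = ["E3", "E4", "E5"]
recall_at_30 = top_k_recall(candidates, gold, k=30)
recall_at_2 = top_k_recall(candidates, gold, k=2)
```

Candidate recall bounds end-to-end linking accuracy from above: a mention whose gold entity never enters the candidate list cannot be linked correctly by any later ranking step.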

Location-Aware Graph Convolutional Networks for Video Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6737 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11021-11028 ◽

Cited By ~ 5

Author(s):

Deng Huang ◽

Peihao Chen ◽

Runhao Zeng ◽

Qing Du ◽

Mingkui Tan ◽

...

Keyword(s):

Question Answering ◽

State Of The Art ◽

Attention Mechanism ◽

Video Frame ◽

Location Information ◽

Object Interaction ◽

Location Aware ◽

Art Methods ◽

Final Answer ◽

Video Question Answering

We address the challenging task of video question answering, which requires machines to answer questions about videos in natural language. Previous state-of-the-art methods apply spatio-temporal attention mechanisms to video frame features without explicitly modeling the locations of, and relations among, the object interactions occurring in videos. However, the relations between object interactions and their location information are critical for both action recognition and question reasoning. In this work, we propose to represent the contents of the video as a location-aware graph by incorporating the location information of an object into the graph construction. Here, each node is associated with an object represented by its appearance and location features. Based on the constructed graph, we propose to use graph convolution to infer both the category and temporal locations of an action. As the graph is built on objects, our method is able to focus on the foreground action contents for better video question answering. Lastly, we leverage an attention mechanism to combine the output of graph convolution and encoded question features for final answer reasoning. Extensive experiments demonstrate the effectiveness of the proposed methods. Specifically, our method significantly outperforms state-of-the-art methods on the TGIF-QA, Youtube2Text-QA, and MSVD-QA datasets.

Narrative Question Answering with Cutting-Edge Open-Domain QA Techniques: A Comprehensive Study

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00411 ◽

2021 ◽

Vol 9 ◽

pp. 1032-1046

Author(s):

Xiangyang Mou ◽

Chenghao Yang ◽

Mo Yu ◽

Bingsheng Yao ◽

Xiaoxiao Guo ◽

...

Keyword(s):

Quantitative Analysis ◽

Question Answering ◽

State Of The Art ◽

Cutting Edge ◽

Human Studies ◽

Open Domain ◽

Level Performance ◽

Similar Task ◽

Comprehensive Study

Recent advancements in open-domain question answering (ODQA), that is, finding answers from a large open-domain corpus such as Wikipedia, have led to human-level performance on many datasets. However, progress in QA over book stories (Book QA) lags despite its similar task formulation to ODQA. This work provides a comprehensive and quantitative analysis of the difficulty of Book QA: (1) We benchmark the research on the NarrativeQA dataset with extensive experiments with cutting-edge ODQA techniques. This quantifies the challenges Book QA poses, as well as advances the published state-of-the-art with a ∼7% absolute improvement on ROUGE-L. (2) We further analyze the detailed challenges in Book QA through human studies. Our findings indicate that event-centric questions dominate this task, which exemplifies the inability of existing QA models to handle event-oriented scenarios.

A State-of-the-Art Review of Nigerian Languages Natural Language Processing Research

Advances in IT Standards and Standardization Research - Developing Countries and Technology Inclusion in the 21st Century Information Society ◽

10.4018/978-1-7998-3468-7.ch008 ◽

2021 ◽

pp. 147-167

Author(s):

Toluwase Victor Asubiaro ◽

Ebelechukwu Gloria Igwe

Keyword(s):

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Optical Character Recognition ◽

State Of The Art ◽

Resource Development ◽

African Languages ◽

Low Resource ◽

Research Areas ◽

Cross Lingual

African languages, including those that are native to Nigeria, are low-resource languages because they lack basic computing resources such as language-dependent hardware keyboards. Speakers of these low-resource languages are therefore unfairly deprived of information access on the internet. There is no information about the level of progress that has been made on the computation of Nigerian languages. Hence, this chapter presents a state-of-the-art review of natural language processing for Nigerian languages. The review reveals that only four Nigerian languages (Hausa, Ibibio, Igbo, and Yoruba) have been significantly studied in published NLP papers. Creating alternatives to hardware keyboards is one of the most popular research areas, and means such as automatic diacritics restoration, virtual keyboards, and optical character recognition have been explored. There was also an inclination towards speech and computational morphological analysis. Resource development and knowledge representation modeling of the languages using rapid resource development and cross-lingual methods are recommended.

A Yes/No Answer Generator Based on Sentiment-Word Scores in Biomedical Question Answering

Data Analytics in Medicine ◽

10.4018/978-1-7998-1204-3.ch005 ◽

2020 ◽

pp. 103-116

Author(s):

Mourad Sarrouti ◽

Said Ouatik El Alaoui

Keyword(s):

Question Answering ◽

State Of The Art ◽

Biomedical Domain ◽

Open Domain ◽

Part Of Speech Tagging ◽

Part Of Speech ◽

Current State ◽

Sentiment Score ◽

Speech Tagging ◽

Sentiment Word

Background and Objective: Yes/no question answering (QA) in the open domain is a longstanding challenge that has been widely studied over the last decades. However, it still requires further effort in the biomedical domain. Yes/no QA aims at answering yes/no questions, which seek a clear “yes” or “no” answer. In this paper, we present a novel yes/no answer generator based on sentiment-word scores in biomedical QA. Methods: In the proposed method, we first use Stanford CoreNLP for tokenization and part-of-speech tagging of all passages relevant to a given yes/no question. We then assign a sentiment score based on SentiWordNet to each word of the passages. Finally, the decision between the answers “yes” and “no” is based on the obtained sentiment-passages score: “yes” for a positive final sentiment-passages score and “no” for a negative one. Results: Experimental evaluations performed on BioASQ collections show that the proposed method is more effective than the current state-of-the-art method, significantly outperforming it by an average of 15.68% in terms of accuracy.
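
The decision rule described in this abstract can be sketched as follows. This is an illustration only, not the authors' system: a regular-expression tokenizer stands in for Stanford CoreNLP, part-of-speech tagging is omitted, and the small `TOY_LEXICON` with made-up scores stands in for SentiWordNet. The abstract does not say what happens when the final score is exactly zero, so the sketch returns `None` in that case.

```python
import re

# Hypothetical word scores standing in for SentiWordNet lookups.
TOY_LEXICON = {
    "effective": 0.5,
    "beneficial": 0.6,
    "improves": 0.5,
    "ineffective": -0.6,
    "fails": -0.5,
    "harmful": -0.6,
}


def tokenize(text):
    """Lowercase word tokenizer (a stand-in for Stanford CoreNLP)."""
    return re.findall(r"[a-z]+", text.lower())


def passages_sentiment_score(passages, lexicon=TOY_LEXICON):
    """Sum the sentiment scores of all words in all relevant passages."""
    return sum(
        lexicon.get(token, 0.0)
        for passage in passages
        for token in tokenize(passage)
    )


def answer_yes_no(passages, lexicon=TOY_LEXICON):
    """Return "yes" for a positive score and "no" for a negative one.

    A score of exactly zero is not covered by the abstract, so this
    sketch returns None rather than guessing.
    """
    score = passages_sentiment_score(passages, lexicon)
    if score > 0:
        return "yes"
    if score < 0:
        return "no"
    return None


answer_positive = answer_yes_no(["The drug is effective and beneficial."])
answer_negative = answer_yes_no(["The treatment fails and is ineffective."])
```

A plain sum treats negation naively ("not effective" would still score positive here), which is one reason the real pipeline relies on proper linguistic preprocessing.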

A Cross-Lingual German-English Framework for Open-Domain Question Answering

Evaluation of Multilingual and Multi-modal Information Retrieval - Lecture Notes in Computer Science ◽

10.1007/978-3-540-74999-8_40 ◽

2007 ◽

pp. 328-338

Author(s):

Bogdan Sacaleanu ◽

Günter Neumann

Keyword(s):

Question Answering ◽

Open Domain ◽

Cross Lingual
