Multi-View Visual Question Answering with Active Viewpoint Selection

Sensors ◽  
2020 ◽  
Vol 20 (8) ◽  
pp. 2281 ◽  
Author(s):  
Yue Qiu ◽  
Yutaka Satoh ◽  
Ryota Suzuki ◽  
Kenji Iwata ◽  
Hirokatsu Kataoka

This paper proposes a framework that observes a scene iteratively in order to answer a given question about it. Conventional visual question answering (VQA) methods are designed to answer questions based on single-view images. However, in real-world applications such as human–robot interaction (HRI), in which camera angles and occlusions must be considered, answering questions from a single-view image can be difficult. Since HRI applications make it possible to observe a scene from multiple viewpoints, it is reasonable to study the VQA task in multi-view settings. In addition, because it is usually challenging to observe a scene from arbitrary viewpoints, we designed a framework that actively observes a scene until the information needed to answer a given question is obtained. The proposed framework achieves performance comparable to a state-of-the-art method in question answering while significantly decreasing the number of required observation viewpoints. Additionally, we found that our framework plausibly learned to choose better viewpoints for answering questions, lowering the required number of camera movements. Moreover, we built a multi-view VQA dataset based on real images. The proposed framework shows high accuracy (94.01%) on this unseen real-image dataset.

Author(s):  
Xinmeng Li ◽  
Mamoun Alazab ◽  
Qian Li ◽  
Keping Yu ◽  
Quanjun Yin

Knowledge graph question answering is an important technology in intelligent human–robot interaction; it aims to automatically answer natural language questions posed by humans using a given knowledge graph. For multi-relation questions, which have higher variety and complexity, the tokens of the question have different priorities for triple selection across the reasoning steps. Most existing models treat the question as a whole and ignore the priority information in it. To solve this problem, we propose a question-aware memory network for multi-hop question answering, named QA2MN, which updates the attention on the question in a timely manner during the reasoning process. In addition, we incorporate graph context information into the knowledge graph embedding model to increase its ability to represent entities and relations. We use it to initialize the QA2MN model and fine-tune it in the training process. We evaluate QA2MN on PathQuestion and WorldCup2014, two representative datasets for complex multi-hop question answering. The results demonstrate that QA2MN achieves state-of-the-art Hits@1 accuracy on the two datasets, which validates the effectiveness of our model.
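For readers unfamiliar with the Hits@1 metric mentioned above, the following is a minimal sketch of how it is commonly computed: the fraction of questions whose top-ranked predicted answer is among the gold answers. The function name `hits_at_1` and the toy predictions are illustrative assumptions, not taken from the paper or its evaluation code.

```python
def hits_at_1(ranked_predictions, gold_answers):
    """Return the fraction of questions whose top-ranked prediction is correct.

    ranked_predictions: one list per question, best candidate first.
    gold_answers: one set of acceptable answers per question.
    """
    if len(ranked_predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align one-to-one")
    if not ranked_predictions:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_answers)
        if ranked and ranked[0] in gold  # only the first candidate counts
    )
    return hits / len(ranked_predictions)


# Hypothetical toy data: three questions, only the first is answered correctly at rank 1.
predictions = [["Paris", "Lyon"], ["Berlin"], ["Rome", "Milan"]]
gold = [{"Paris"}, {"Munich"}, {"Milan"}]
score = hits_at_1(predictions, gold)
print(f"Hits@1 = {score:.3f}")
```

Note that the third question has the correct answer at rank 2, which does not count toward Hits@1.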


Author(s):  
Sanket Shah ◽  
Anand Mishra ◽  
Naganand Yadati ◽  
Partha Pratim Talukdar

Visual Question Answering (VQA) has emerged as an important problem spanning Computer Vision, Natural Language Processing and Artificial Intelligence (AI). In conventional VQA, one may ask questions about an image which can be answered purely based on its content. For example, given an image with people in it, a typical VQA question may inquire about the number of people in the image. More recently, there is growing interest in answering questions which require commonsense knowledge involving common nouns (e.g., cats, dogs, microphones) present in the image. In spite of this progress, the important problem of answering questions requiring world knowledge about named entities (e.g., Barack Obama, White House, United Nations) in the image has not been addressed in prior research. We address this gap in this paper, and introduce KVQA – the first dataset for the task of (world) knowledge-aware VQA. KVQA consists of 183K question-answer pairs involving more than 18K named entities and 24K images. Questions in this dataset require multi-entity, multi-relation, and multi-hop reasoning over large Knowledge Graphs (KG) to arrive at an answer. To the best of our knowledge, KVQA is the largest dataset for exploring VQA over KG. Further, we also provide baseline performances using state-of-the-art methods on KVQA.


Author(s):  
Fei Liu ◽  
Jing Liu ◽  
Zhiwei Fang ◽  
Richang Hong ◽  
Hanqing Lu

Learning effective interactions between multi-modal features is at the heart of visual question answering (VQA). A common defect of existing VQA approaches is that they consider only a very limited number of interactions, which may not be enough to model the latent, complex image-question relations that are necessary for accurately answering questions. Therefore, in this paper, we propose a novel DCAF (Densely Connected Attention Flow) framework for modeling dense interactions. It densely connects all pairwise layers of the network via Attention Connectors, capturing fine-grained interplay between image and question across all hierarchical levels. The proposed Attention Connector efficiently connects the multi-modal features at any two layers with symmetric co-attention and produces interaction-aware attention features. Experimental results on three publicly available datasets show that the proposed method achieves state-of-the-art performance.
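As a rough illustration of what symmetric co-attention between image and question features can look like, here is a generic dot-product sketch in plain Python. It is an assumption-laden toy, not the authors' Attention Connector: the function names, the unscaled dot-product affinity, and the tiny feature vectors are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    """Combine vectors using the given attention weights."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def symmetric_co_attention(image_feats, question_feats):
    """Attend in both directions using one shared affinity matrix."""
    # affinity[i][j]: similarity between image region i and question word j.
    affinity = [[dot(r, w) for w in question_feats] for r in image_feats]
    # Image -> question: each region attends over the words (row-wise softmax).
    img_attended = [weighted_sum(softmax(row), question_feats) for row in affinity]
    # Question -> image: each word attends over the regions (column-wise softmax).
    q_attended = [
        weighted_sum(softmax(list(col)), image_feats) for col in zip(*affinity)
    ]
    return img_attended, q_attended


# Hypothetical features: 3 image regions and 2 question words, each 2-dimensional.
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
question_feats = [[1.0, 0.0], [0.0, 1.0]]
img_attended, q_attended = symmetric_co_attention(image_feats, question_feats)
print(img_attended)
print(q_attended)
```

The key point is that a single affinity matrix is normalized along both axes, so each modality receives features summarizing the other.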


2021 ◽  
Vol 12 ◽  
Author(s):  
Gregoire Pointeau ◽  
Solène Mirliaz ◽  
Anne-Laure Mealier ◽  
Peter Ford Dominey

How do people learn to talk about the causal and temporal relations between events, and the motivation behind why people do what they do? The narrative practice hypothesis of Hutto and Gallagher holds that children are exposed to narratives that provide training for understanding and expressing reasons for why people behave as they do. In this context, we have recently developed a model of narrative processing where a structured model of the developing situation (the situation model) is built up from experienced events, and enriched by sentences in a narrative that describe event meanings. The main interest is to develop a proof of concept for how narrative can be used to structure, organize and describe experience. Narrative sentences describe events, and they also define temporal and causal relations between events. These relations are specified by a class of narrative function words, including “because, before, after, first, finally.” The current research develops a proof of concept that by observing how people describe social events, a developmental robotic system can begin to acquire early knowledge of how to explain the reasons for events. We collect data from naïve subjects who use narrative function words to describe simple scenes of human-robot interaction, and then employ algorithms for extracting the statistical structure of how narrative function words link events in the situation model. By using these statistical regularities, the robot can thus learn from human experience how to properly employ these words in question-answering dialogues with the human, and in generating canonical narratives for new experiences. The behavior of the system is demonstrated over several behavioral interactions, and associated narrative interaction sessions, while a more formal extended evaluation and user study will be the subject of future research. Clearly this is far removed from the power of the full-blown narrative practice capability, but it provides a first step in the development of an experimental infrastructure for the study of socially situated narrative practice in human-robot interaction.


2009 ◽  
Author(s):  
Matthew S. Prewett ◽  
Kristin N. Saboe ◽  
Ryan C. Johnson ◽  
Michael D. Coovert ◽  
Linda R. Elliott

2010 ◽  
Author(s):  
Eleanore Edson ◽  
Judith Lytle ◽  
Thomas McKenna

2020 ◽  
Author(s):  
Agnieszka Wykowska ◽  
Jairo Pérez-Osorio ◽  
Stefan Kopp

This booklet is a collection of the position statements accepted for the HRI’20 conference workshop “Social Cognition for HRI: Exploring the relationship between mindreading and social attunement in human-robot interaction” (Wykowska, Perez-Osorio & Kopp, 2020). Unfortunately, due to the rapid unfolding of the novel coronavirus at the beginning of the present year, the conference, and consequently our workshop, were canceled. In light of these events, we decided to put together the position statements accepted for the workshop. The contributions collected in these pages highlight the role of attribution of mental states to artificial agents in human-robot interaction, and specifically the quality and presence of social attunement mechanisms that are known to make human interaction smooth, efficient, and robust. These papers also accentuate the importance of a multidisciplinary approach to advancing the understanding of the factors and consequences of social interactions with artificial agents.


2019 ◽  
Author(s):  
Cinzia Di Dio ◽  
Federico Manzi ◽  
Giulia Peretti ◽  
Angelo Cangelosi ◽  
Paul L. Harris ◽  
...  

Studying trust within human-robot interaction is of great importance given the social relevance of robotic agents in a variety of contexts. We investigated the acquisition, loss and restoration of trust when preschool and school-age children played with either a human or a humanoid robot in vivo. The relationship between trust and the quality of attachment relationships, Theory of Mind, and executive function skills was also investigated. No differences were found in children’s trust in the play-partner as a function of agency (human or robot). Nevertheless, 3-year-olds showed a trend toward trusting the human more than the robot, while 7-year-olds displayed the reverse behavioral pattern, thus highlighting the developing interplay between affective and cognitive correlates of trust.

