Joint embedding VQA model based on dynamic word vector

PeerJ Computer Science ◽

10.7717/peerj-cs.353 ◽

2021 ◽

Vol 7 ◽

pp. e353

Author(s):

Zhiyang Ma ◽

Wenfeng Zheng ◽

Xiaobing Chen ◽

Lirong Yin

Keyword(s):

Question Answering ◽

Feature Fusion ◽

Image Feature ◽

Text And Image ◽

Visual Question Answering ◽

Model Based ◽

Joint Embedding ◽

Real Language ◽

Language Environment ◽

Better Than

The existing joint embedding Visual Question Answering models use different combinations of image characterization, text characterization and feature fusion method, but all the existing models use static word vectors for text characterization. However, in the real language environment, the same word may represent different meanings in different contexts, and may also be used as different grammatical components. These differences cannot be effectively expressed by static word vectors, so there may be semantic and grammatical deviations. In order to solve this problem, our article constructs a joint embedding model based on dynamic word vector—none KB-Specific network (N-KBSN) model which is different from commonly used Visual Question Answering models based on static word vectors. The N-KBSN model consists of three main parts: question text and image feature extraction module, self attention and guided attention module, feature fusion and classifier module. Among them, the key parts of N-KBSN model are: image characterization based on Faster R-CNN, text characterization based on ELMo and feature enhancement based on multi-head attention mechanism. The experimental results show that the N-KBSN constructed in our experiment is better than the other 2017—winner (glove) model and 2019—winner (glove) model. The introduction of dynamic word vector improves the accuracy of the overall results.

Download Full-text

Visual question answering model based on graph neural network and contextual attention

Image and Vision Computing ◽

10.1016/j.imavis.2021.104165 ◽

2021 ◽

pp. 104165

Author(s):

Himanshu Sharma ◽

Anand Singh Jalal

Keyword(s):

Neural Network ◽

Question Answering ◽

Visual Question Answering ◽

Model Based

Download Full-text

Multi-modal Feature Fusion Based on Variational Autoencoder for Visual Question Answering

Pattern Recognition and Computer Vision - Lecture Notes in Computer Science ◽

10.1007/978-3-030-31723-2_56 ◽

2019 ◽

pp. 657-669

Author(s):

Liqing Chen ◽

Yifan Zhuo ◽

Yingjie Wu ◽

Yilei Wang ◽

Xianghan Zheng

Keyword(s):

Question Answering ◽

Feature Fusion ◽

Visual Question Answering ◽

Variational Autoencoder

Download Full-text

Generating and Evaluating Explanations of Attended and Error-Inducing Input Regions for VQA Models

10.22541/au.162464902.28050142/v1 ◽

2021 ◽

Author(s):

Arijit Ray ◽

Michael Cogswell ◽

Xiao Lin ◽

Kamran Alipour ◽

Ajay Divakaran ◽

...

Keyword(s):

Question Answering ◽

User Studies ◽

Incorrect Answer ◽

Visual Question Answering ◽

Predict Model ◽

Better Than

Attention maps, a popular heatmap-based explanation method for Visual Question Answering (VQA), are supposed to help users understand the model by highlighting portions of the image/question used by the model to infer answers. However, we see that users are often misled by current attention map visualizations that point to relevant regions despite the model producing an incorrect answer. Hence, we propose Error Maps that clarify the error by highlighting image regions where the model is prone to err. Error maps can indicate when a correctly attended region may be processed incorrectly leading to an incorrect answer, and hence, improve users’ understanding of those cases. To evaluate our new explanations, we further introduce a metric that simulates users’ interpretation of explanations to evaluate their potential helpfulness to understand model correctness. We finally conduct user studies to see that our new explanations help users understand model correctness better than baselines by an expected 30% and that our proxy helpfulness metrics correlate strongly (rho>0.97) with how well users can predict model correctness.

Download Full-text

Differential Networks for Visual Question Answering

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33018997 ◽

2019 ◽

Vol 33 ◽

pp. 8997-9004 ◽

Cited By ~ 1

Author(s):

Chenfei Wu ◽

Jinlai Liu ◽

Xiaojie Wang ◽

Ruifan Li

Keyword(s):

Question Answering ◽

Image Feature ◽

Visual Question Answering ◽

Key Points ◽

Potential Applications ◽

Observation Noise ◽

The Difference ◽

Difference Operation ◽

Novel Model ◽

Differential Networks

The task of Visual Question Answering (VQA) has emerged in recent years for its potential applications. To address the VQA task, the model should fuse feature elements from both images and questions efficiently. Existing models fuse image feature element vi and question feature element qi directly, such as an element product viqi. Those solutions largely ignore the following two key points: 1) Whether vi and qi are in the same space. 2) How to reduce the observation noises in vi and qi. We argue that two differences between those two feature elements themselves, like (vi − vj) and (qi −qj), are more probably in the same space. And the difference operation would be beneficial to reduce observation noise. To achieve this, we first propose Differential Networks (DN), a novel plug-and-play module which enables differences between pair-wise feature elements. With the tool of DN, we then propose DN based Fusion (DF), a novel model for VQA task. We achieve state-of-the-art results on four publicly available datasets. Ablation studies also show the effectiveness of difference operations in DF model.

Download Full-text

Multimodal feature fusion by relational reasoning and attention for visual question answering

Information Fusion ◽

10.1016/j.inffus.2019.08.009 ◽

2020 ◽

Vol 55 ◽

pp. 116-126 ◽

Cited By ~ 5

Author(s):

Weifeng Zhang ◽

Jing Yu ◽

Hua Hu ◽

Haiyang Hu ◽

Zengchang Qin

Keyword(s):

Question Answering ◽

Feature Fusion ◽

Relational Reasoning ◽

Visual Question Answering

Download Full-text

Visual question answering model based on visual relationship detection

Signal Processing Image Communication ◽

10.1016/j.image.2019.115648 ◽

2020 ◽

Vol 80 ◽

pp. 115648 ◽

Cited By ~ 10

Author(s):

Yuling Xi ◽

Yanning Zhang ◽

Songtao Ding ◽

Shaohua Wan

Keyword(s):

Question Answering ◽

Visual Question Answering ◽

Model Based

Download Full-text

Visual Question Answering Combining Multi-modal Feature Fusion and Multi-Attention Mechanism

2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE) ◽

10.1109/icbaie52039.2021.9389877 ◽

2021 ◽

Author(s):

Cai Linqin ◽

Liao Zhongxu ◽

Zhou Sitong ◽

Chen Kejia

Keyword(s):

Question Answering ◽

Feature Fusion ◽

Attention Mechanism ◽

Visual Question Answering

Download Full-text

RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer

Mobile Information Systems ◽

10.1155/2021/2662064 ◽

2021 ◽

Vol 2021 ◽

pp. 1-9

Author(s):

Xu Zhang ◽

DeZhi Han ◽

Chin-Chen Chang

Keyword(s):

Question Answering ◽

Language Model ◽

Visual Images ◽

Text And Image ◽

Data Set ◽

Fine Grained ◽

Natural Language Question ◽

The Relationship ◽

Language Question ◽

Better Than

Visual question answering (VQA) is the natural language question-answering of visual images. The model of VQA needs to make corresponding answers according to specific questions based on understanding images, the most important of which is to understand the relationship between images and language. Therefore, this paper proposes a new model, Representation of Dense Multimodality Fusion Encoder Based on Transformer, for short, RDMMFET, which can learn the related knowledge between vision and language. The RDMMFET model consists of three parts: dense language encoder, image encoder, and multimodality fusion encoder. In addition, we designed three types of pretraining tasks: masked language model, masked image model, and multimodality fusion task. These pretraining tasks can help to understand the fine-grained alignment between text and image regions. Simulation results on the VQA v2.0 data set show that the RDMMFET model can work better than the previous model. Finally, we conducted detailed ablation studies on the RDMMFET model and provided the results of attention visualization, which proves that the RDMMFET model can significantly improve the effect of VQA.

Download Full-text