DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

2021 ◽  
Vol 16 (1) ◽  
pp. 1-19
Author(s):  
Fenglin Liu ◽  
Xian Wu ◽  
Shen Ge ◽  
Xuancheng Ren ◽  
Wei Fan ◽  
...  

Vision-and-language (V-L) tasks require the system to understand both visual content and natural language, so learning fine-grained joint representations of vision and language (a.k.a. V-L representations) is of paramount importance. Recently, various pre-trained V-L models have been proposed to learn V-L representations and achieve improved results on many tasks. However, the mainstream models process both vision and language inputs with the same set of attention matrices; as a result, the generated V-L representations are entangled in one common latent space. To tackle this problem, we propose DiMBERT (short for Disentangled Multimodal-Attention BERT), a novel framework that applies separate attention spaces to vision and language, so that the representations of the two modalities can be disentangled explicitly. To enhance the correlation between vision and language in the disentangled spaces, we introduce visual concepts into DiMBERT, which represent visual information in textual form. In this manner, visual concepts help to bridge the gap between the two modalities. We pre-train DiMBERT on a large number of image–sentence pairs with two tasks: bidirectional language modeling and sequence-to-sequence language modeling. After pre-training, DiMBERT is further fine-tuned for the downstream tasks. Experiments show that DiMBERT sets new state-of-the-art performance on three tasks (over four datasets), including both generation tasks (image captioning and visual storytelling) and classification tasks (referring expressions). The proposed DiM (short for Disentangled Multimodal-Attention) module can be easily incorporated into existing pre-trained V-L models to boost their performance, with up to a 5% increase on the representative task. Finally, we conduct a systematic analysis and demonstrate the effectiveness of our DiM module and the introduced visual concepts.
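
To make the disentangling idea concrete, below is a minimal sketch, assuming a PyTorch setting, of attention in which vision and language tokens receive separate projection matrices before attending over the joint sequence. The class name DisentangledAttention, the dimensions, and the exact wiring are illustrative assumptions, not the authors' DiM module.

```python
import torch
import torch.nn as nn

class DisentangledAttention(nn.Module):
    """Sketch of attention with separate projection spaces per modality.

    Vision tokens and language tokens each get their own query/key/value
    projections, so their representations live in separate latent spaces,
    while attention over the concatenated sequence still lets the two
    modalities interact.
    """

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.vis_proj = nn.ModuleDict({k: nn.Linear(d_model, d_model) for k in ("q", "k", "v")})
        self.lang_proj = nn.ModuleDict({k: nn.Linear(d_model, d_model) for k in ("q", "k", "v")})
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vis_tokens: torch.Tensor, lang_tokens: torch.Tensor):
        # Project each modality with its own matrices (disentangled spaces).
        q = torch.cat([self.vis_proj["q"](vis_tokens), self.lang_proj["q"](lang_tokens)], dim=1)
        k = torch.cat([self.vis_proj["k"](vis_tokens), self.lang_proj["k"](lang_tokens)], dim=1)
        v = torch.cat([self.vis_proj["v"](vis_tokens), self.lang_proj["v"](lang_tokens)], dim=1)
        out, _ = self.attn(q, k, v)
        n_vis = vis_tokens.size(1)
        return out[:, :n_vis], out[:, n_vis:]   # split back into the two modalities

# Usage: 36 region features and 20 word embeddings, both 768-d.
vis, lang = torch.randn(2, 36, 768), torch.randn(2, 20, 768)
vis_out, lang_out = DisentangledAttention()(vis, lang)
```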

Author(s):  
Fenglin Liu ◽  
Xuancheng Ren ◽  
Yuanxin Liu ◽  
Kai Lei ◽  
Xu Sun

Recently, attention-based encoder-decoder models have been used extensively in image captioning. Yet current methods still find it difficult to achieve deep image understanding. In this work, we argue that such understanding requires visual attention to correlated image regions and semantic attention to coherent attributes of interest. To perform effective attention, we explore image captioning from a cross-modal perspective and propose the Global-and-Local Information Exploring-and-Distilling approach, which explores and distills the source information in vision and language. It globally provides the aspect vector, a spatial and relational representation of images based on caption contexts, through the extraction of salient region groupings and attribute collocations, and locally extracts fine-grained regions and attributes with reference to the aspect vector for word selection. Our fully-attentive model achieves a CIDEr score of 129.3 in offline COCO evaluation with remarkable efficiency in terms of accuracy, speed, and parameter budget.
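
The global-then-local attention flow can be sketched roughly as follows: a global aspect vector is pooled from all regions given the caption context, and local attention then re-weights regions with respect to that aspect vector. The module name GlobalLocalAttention, the scoring functions, and the dimensions are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAttention(nn.Module):
    """Illustrative sketch of a global-then-local attention step."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.global_score = nn.Linear(2 * d, 1)   # scores regions against the caption context
        self.local_score = nn.Linear(2 * d, 1)    # scores regions against the aspect vector

    def forward(self, regions: torch.Tensor, context: torch.Tensor):
        # regions: (B, R, d) region features; context: (B, d) decoder/caption state.
        ctx = context.unsqueeze(1).expand(-1, regions.size(1), -1)
        g = F.softmax(self.global_score(torch.cat([regions, ctx], -1)), dim=1)
        aspect = (g * regions).sum(dim=1)                      # global aspect vector
        asp = aspect.unsqueeze(1).expand(-1, regions.size(1), -1)
        l = F.softmax(self.local_score(torch.cat([regions, asp], -1)), dim=1)
        distilled = (l * regions).sum(dim=1)                   # locally distilled fine-grained feature
        return aspect, distilled

# Usage: 36 region features of 512-d with a 512-d caption context.
aspect, distilled = GlobalLocalAttention()(torch.randn(2, 36, 512), torch.randn(2, 512))
```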


Author(s):  
Xiangteng He ◽  
Yuxin Peng ◽  
Junjie Zhao

Fine-grained visual categorization (FGVC) is the discrimination of similar subcategories, and its main challenge is to localize the quite subtle visual distinctions between them. There are two pivotal problems: discovering which region is discriminative and representative, and determining how many discriminative regions are necessary to achieve the best performance. Existing methods generally solve these two problems by relying on prior knowledge or experimental validation, which severely restricts the usability and scalability of FGVC. To address the "which" and "how many" problems adaptively and intelligently, this paper proposes a stacked deep reinforcement learning approach (StackDRL). It adopts a two-stage learning architecture driven by a semantic reward function. The two-stage learning localizes the object and its parts in sequence ("which") and determines the number of discriminative regions adaptively ("how many"), which is quite appealing in FGVC. The semantic reward function drives StackDRL to fully learn discriminative and conceptual visual information by jointly combining an attention-based reward and a category-based reward. Furthermore, unsupervised discriminative localization avoids the heavy labor of labeling and greatly strengthens the usability and scalability of our StackDRL approach. Compared with ten state-of-the-art methods on the CUB-200-2011 dataset, our StackDRL approach achieves the best categorization accuracy.
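
A toy version of such a combined reward might look like the following; the weighting scheme, the saliency-coverage attention term, and the parameter alpha are assumptions for illustration, not StackDRL's actual reward definition.

```python
import numpy as np

def semantic_reward(attention_map: np.ndarray,
                    region_mask: np.ndarray,
                    class_probs: np.ndarray,
                    true_label: int,
                    alpha: float = 0.5) -> float:
    """Toy combined reward: an attention-based term (how much saliency the
    selected region covers) mixed with a category-based term (classifier
    confidence on the ground-truth class for that region)."""
    attention_term = float((attention_map * region_mask).sum() / (attention_map.sum() + 1e-8))
    category_term = float(class_probs[true_label])
    return alpha * attention_term + (1.0 - alpha) * category_term

# Usage: reward for a region covering the salient upper-left quadrant.
saliency = np.random.rand(14, 14)
mask = np.zeros((14, 14)); mask[:7, :7] = 1.0
probs = np.array([0.1, 0.7, 0.2])
r = semantic_reward(saliency, mask, probs, true_label=1)
```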


Complexity ◽  
2021 ◽  
Vol 2021 ◽  
pp. 1-9
Author(s):  
Yang He ◽  
Ling Tian ◽  
Lizong Zhang ◽  
Xi Zeng

Autonomous object detection powered by cutting-edge artificial intelligence techniques has been an essential component for sustaining complex smart city systems. Fine-grained image classification focuses on recognizing subcategories at a specific level of granularity. Because images from different subcategories of the same category can look highly similar while images within the same subcategory can vary considerably, it has always been a challenging problem in computer vision. Traditional approaches usually rely on exploring only the visual information in images. Therefore, this paper proposes a novel Knowledge Graph Representation Fusion (KGRF) framework to introduce prior knowledge into the fine-grained image classification task. Specifically, a Graph Attention Network (GAT) is employed to learn the knowledge representation from a constructed knowledge graph that models the category–subcategory and subcategory–attribute associations. By introducing the Multimodal Compact Bilinear (MCB) module, the framework can fully integrate the knowledge representation and visual features to learn high-level image features. Extensive experiments on the Caltech-UCSD Birds-200-2011 dataset verify the superiority of our proposed framework over several existing state-of-the-art methods.
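
The MCB fusion step approximates a bilinear (outer-product) interaction between the visual feature and the knowledge embedding via count sketches and a frequency-domain product. A minimal sketch is given below; the input dimensions and the 16,000-d output size are illustrative choices, and the GAT knowledge encoder is assumed to be available upstream.

```python
import torch

def count_sketch(x: torch.Tensor, h: torch.Tensor, s: torch.Tensor, out_dim: int) -> torch.Tensor:
    # x: (B, d); h: (d,) random bucket indices in [0, out_dim); s: (d,) random signs in {-1, +1}.
    sketch = x.new_zeros(x.size(0), out_dim)
    sketch.index_add_(1, h, x * s)          # duplicate buckets accumulate, as the sketch requires
    return sketch

def mcb_fusion(visual: torch.Tensor, knowledge: torch.Tensor, out_dim: int = 16000, seed: int = 0):
    """Compact bilinear pooling: count-sketch each modality, multiply the
    sketches in the frequency domain, and transform back."""
    g = torch.Generator().manual_seed(seed)
    d_v, d_k = visual.size(1), knowledge.size(1)
    h_v = torch.randint(0, out_dim, (d_v,), generator=g)
    h_k = torch.randint(0, out_dim, (d_k,), generator=g)
    s_v = torch.randint(0, 2, (d_v,), generator=g).float() * 2 - 1
    s_k = torch.randint(0, 2, (d_k,), generator=g).float() * 2 - 1
    fft_v = torch.fft.rfft(count_sketch(visual, h_v, s_v, out_dim))
    fft_k = torch.fft.rfft(count_sketch(knowledge, h_k, s_k, out_dim))
    return torch.fft.irfft(fft_v * fft_k, n=out_dim)

# Usage: fuse a 2048-d visual feature with a 200-d knowledge embedding.
fused = mcb_fusion(torch.randn(4, 2048), torch.randn(4, 200))
```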


Author(s):  
Masakazu Iwamura ◽  
Yoshihiko Inoue ◽  
Kazunori Minatani ◽  
Koichi Kise

For people with visual impairment, smartphone apps that use computer vision techniques to provide visual information have played important roles in supporting their daily lives. However, such apps can be used only under a specific condition: the user must already know where the object of interest is. In this paper, we first make this limitation explicit by categorizing the tasks that obtain visual information using computer vision techniques. Then, taking looking for something as a representative task in one category, we discuss suitable camera systems and rotation navigation methods. For the latter, we propose novel voice navigation methods. In a user study with seven people with visual impairment, we found that (1) a camera with a wide field of view, such as an omnidirectional camera, was preferred, and (2) users have different preferences regarding navigation methods.


Author(s):  
S. Bauer ◽  
F. Bagusat ◽  
E. Strassburger ◽  
M. Sauer ◽  
S. Hiermaier

A systematic study has been performed to gather more detailed experimental information on the equation of state and the Hugoniot elastic limit (HEL) of soda-lime glass, as well as on the failure front phenomenon. The key innovations of this study comprise experimental as well as analytical aspects. On the one hand, an extensive planar plate impact (PPI) test series has been carried out over a wide range of shock loading stress levels, instrumented with two high-speed cameras and laser interferometers (PDV and VISAR). On the other hand, a systematic analysis concept has been developed and evaluated, including a combination of Lagrange diagrams with velocity profile data and a derivation of the equation of state together with an error estimation. Impact velocities ranged from 500 to 3000 m/s, resulting in loadings of the soda-lime glass targets between 3.5 and 20.8 GPa. For stress levels between 3.5 and 6.7 GPa, two high-speed cameras at 5 Mfps, positioned at the side and rear of the specimens, enabled the observation of shock waves and two different kinds of failure fronts. Thus, visual information could be gathered not only in the purely elastic regime, but also in the transition region above 4 GPa and at stress levels beyond the HEL. The HEL of the soda-lime glass is determined to be (5.0 ± 0.2) GPa. For the onset of an internal failure front, a minimum longitudinal stress between 3.8 and 3.9 GPa is identified. The evaluated failure front velocities range from 800 to 2100 m/s. From the observed release response, a minimum spall strength of 6.7 GPa and release wave velocities between 5740 and 9500 m/s are deduced.
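
As background, the Hugoniot equation of state in planar plate impact experiments is commonly extracted from measured shock and particle velocities through the standard Rankine–Hugoniot jump conditions. The relations below are this generic textbook form, given only as a reminder and not necessarily the exact expressions evaluated in this study:

$$ U_S = c_0 + s\,u_p, \qquad \sigma = \rho_0\, U_S\, u_p, \qquad \frac{\rho}{\rho_0} = \frac{U_S}{U_S - u_p}, $$

where $U_S$ is the shock velocity, $u_p$ the particle velocity, $\rho_0$ the initial density, and $c_0$ and $s$ the parameters of the linear shock-velocity fit.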


2020 ◽  
Vol 34 (07) ◽  
pp. 11572-11579 ◽  
Author(s):  
Fenglin Liu ◽  
Xian Wu ◽  
Shen Ge ◽  
Wei Fan ◽  
Yuexian Zou

Recently, vision-and-language grounding problems, e.g., image captioning and visual question answering (VQA), have attracted extensive interest from both the academic and industrial worlds. However, given the similarity of these tasks, efforts to obtain better results by combining the merits of their algorithms have not been well studied. Inspired by the recent success of federated learning, we propose a federated learning framework to obtain various types of image representations from different tasks, which are then fused together to form fine-grained image representations. The representations merge useful features from different vision-and-language grounding problems and are thus much more powerful than the original representations used alone in individual tasks. To learn such image representations, we propose the Aligning, Integrating and Mapping Network (aimNet). aimNet is validated in three federated learning settings, which include horizontal federated learning, vertical federated learning, and federated transfer learning. Experiments with the aimNet-based federated learning framework on two representative tasks, i.e., image captioning and VQA, demonstrate effective and universal improvements on all metrics over the baselines. In image captioning, we obtain relative gains of 14% and 13% on the task-specific metrics CIDEr and SPICE, respectively. In VQA, we also boost the performance of strong baselines by up to 3%.
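
A very rough sketch of fusing task-specific image representations into one shared representation is given below; the gating scheme, module names, and dimensions are assumptions for illustration, and aimNet's aligning, integrating, and mapping stages are considerably more involved.

```python
import torch
import torch.nn as nn

class SimpleRepresentationFusion(nn.Module):
    """Illustrative fusion of image features learned on different V-L tasks
    (e.g., a captioning encoder and a VQA encoder) into one representation."""

    def __init__(self, d: int = 1024):
        super().__init__()
        self.align = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])  # per-task alignment
        self.gate = nn.Linear(2 * d, 2)                                  # soft weighting of the two sources
        self.map_out = nn.Linear(d, d)                                   # mapping into the shared space

    def forward(self, feat_captioning: torch.Tensor, feat_vqa: torch.Tensor) -> torch.Tensor:
        a = self.align[0](feat_captioning)
        b = self.align[1](feat_vqa)
        w = torch.softmax(self.gate(torch.cat([a, b], dim=-1)), dim=-1)
        fused = w[..., :1] * a + w[..., 1:] * b          # gated integration of the two features
        return self.map_out(fused)

# Usage: fuse 1024-d global image features from two task-specific encoders.
fused = SimpleRepresentationFusion()(torch.randn(8, 1024), torch.randn(8, 1024))
```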


2021 ◽  
pp. 014920632098728
Author(s):  
Elise Marescaux ◽  
Sophie De Winne ◽  
Lieven Brebels

Inspired by a pursuit of higher returns on human resource management (HRM) investments as well as a trend towards the individualization of HRM, several scholars have focused on the phenomenon of HR differentiation, that is, the differential allocation of resources across employees through the use of HRM practices. Yet, different definitions and angles have been used to study HR differentiation. As a result, these ambiguities make it difficult to compare research findings and draw meaningful conclusions about HR differentiation and its consequences. Based on a systematic analysis of 164 articles from five different research streams (i.e., the strategic HRM, talent management, i-deals, pay dispersion, and diversity management literatures), we identify four properties of HR differentiation (its basis, formalization, resource, and purpose) and propose a more fine-grained definition of the construct. Next, drawing from optimal distinctiveness–based inclusion theory, we develop an integrated multilevel model with propositions that helps explain the social psychological consequences of HR differentiation at three integrated levels of analysis (employee, workgroup, and organization). Subsequently, we derive an agenda for future research. In doing so, we contribute by developing a common language for scholars with different disciplinary backgrounds and by inspiring future research on HR differentiation.

