Cross-modal Semantic Correlation Learning by Bi-CNN Network

2021
Author(s): Chaoyi Wang, Liang Li, Chenggang Yan, Zhan Wang, Yaoqi Sun, ...

2021, pp. 200-208
Author(s): Ping Xiong, Lin Liang, Yunli Zhu, Tianqing Zhu

2020, Vol 34 (07), pp. 12289-12296
Author(s): Zhuhui Wang, Shijie Wang, Haojie Li, Zhi Dou, Jianjun Li

The key to Weakly Supervised Fine-grained Image Classification (WFGIC) is picking out discriminative regions and learning discriminative features from them. However, most recent WFGIC methods pick out discriminative regions independently and use their features directly, neglecting the facts that regions' features are mutually semantically correlated and that region groups can be more discriminative. To address these issues, we propose an end-to-end Graph-propagation based Correlation Learning (GCL) model to fully mine and exploit the discriminative potential of region correlations for WFGIC. Specifically, in the discriminative region localization phase, a Criss-cross Graph Propagation (CGP) sub-network is proposed to learn region correlations: it establishes correlations between regions and then enhances each region by aggregating the weighted features of other regions in a criss-cross pattern. In this way, each region's representation encodes the global image-level context and the local spatial context simultaneously, guiding the network to implicitly discover more powerful discriminative region groups for WFGIC. In the discriminative feature representation phase, a Correlation Feature Strengthening (CFS) sub-network is proposed to explore the internal semantic correlation among discriminative patches' feature vectors, improving their discriminative power by iteratively enhancing informative elements while suppressing useless ones. Extensive experiments demonstrate the effectiveness of the proposed CGP and CFS sub-networks and show that the GCL model achieves better performance in both accuracy and efficiency.
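The criss-cross enhancement described above can be sketched in a few lines of NumPy. This is a minimal illustration of the general idea (attention-weighted aggregation over each position's row and column), not the authors' actual CGP sub-network; the function name and the plain dot-product affinity are assumptions.

```python
import numpy as np

def criss_cross_aggregate(feat):
    """Enhance each spatial position by attention-weighted aggregation
    over the positions in its own row and column (a criss-cross set).

    feat: (H, W, C) array of region features.
    Returns an enhanced array of the same shape.
    """
    H, W, C = feat.shape
    out = np.empty_like(feat)
    for i in range(H):
        for j in range(W):
            # Gather features in the same row and the same column.
            cross = np.concatenate([feat[i, :, :], feat[:, j, :]], axis=0)  # (W+H, C)
            # Affinity between the centre feature and each criss-cross feature.
            logits = cross @ feat[i, j]                  # (W+H,)
            weights = np.exp(logits - logits.max())
            weights /= weights.sum()                     # softmax attention weights
            # Residual update: centre feature plus weighted context.
            out[i, j] = feat[i, j] + weights @ cross
    return out
```

Stacking such a pass twice lets every position indirectly attend to the full image, which is how each region comes to encode both global and local context.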


Author(s):  
Dan Guo ◽  
Hui Wang ◽  
Meng Wang

Visual dialog is a challenging task that involves multi-round semantic transformations between vision and language. This paper addresses cross-modal semantic correlation for visual dialog. Motivated by the observation that Vg (global vision), Vl (local vision), Q (question) and H (history) are inseparably related, the paper proposes a novel Dual Visual Attention Network (DVAN) to realize the mapping (Vg, Vl, Q, H) → A. DVAN is a three-stage query-adaptive attention model. To acquire an accurate A (answer), it first applies textual attention, imposing the question on the history to pick out the related context H'. Then, based on Q and H', it performs separate visual attentions to discover related global image visual hints Vg' and local object-based visual hints Vl'. Next, a dual crossing visual attention is proposed: Vg' and Vl' are mutually embedded to learn complementary visual semantics. Finally, the attended textual and visual features are combined to infer the answer. Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach.
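The first stage, attending the question over the dialog history to pick out H', can be illustrated with a minimal dot-product attention sketch. The function name and the plain dot-product scoring are assumptions for illustration, not DVAN's actual formulation.

```python
import numpy as np

def attend_history(question, history):
    """Pick out question-relevant context H' from the dialog history.

    question: (d,) query embedding.
    history:  (T, d) embeddings of T previous dialog rounds.
    Returns (context, weights): the attended context vector H' and
    the softmax attention weights over rounds.
    """
    logits = history @ question              # relevance of each round to the question
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                 # softmax over the T rounds
    context = weights @ history              # convex combination of round embeddings
    return context, weights
```

The same scoring pattern, with image-region features in place of history rounds, gives the visual attentions that produce Vg' and Vl' in the later stages.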


Author(s):  
Lei Zhu ◽  
Jiayu Song ◽  
Xiangxiang Wei ◽  
Long Jun

With the rapid development of the Internet and the widespread use of smart devices, massive amounts of multimedia data are generated, collected, stored and shared online. This trend has made cross-modal retrieval a hot issue in recent years. Many existing works focus on correlation learning to generate a common subspace for cross-modal correlation measurement, while others use adversarial learning to reduce the heterogeneity of multi-modal data. However, very few works combine correlation learning and adversarial learning to bridge the inter-modal semantic gap and diminish cross-modal heterogeneity. This paper proposes a novel cross-modal retrieval method, named ALSCOR, an end-to-end framework that integrates cross-modal representation learning, correlation learning and adversarial learning. A CCA model, accompanied by two representation models, VisNet and TxtNet, is proposed to capture non-linear correlation. Besides, an intra-modal classifier and a modality classifier are used to learn intra-modal discrimination and minimize inter-modal heterogeneity. Comprehensive experiments are conducted on three benchmark datasets. The results demonstrate that the proposed ALSCOR outperforms the state-of-the-art methods.
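The correlation-learning component rests on canonical correlation analysis. A minimal linear CCA stand-in can be sketched as follows; ALSCOR's actual correlation is non-linear (learned through VisNet and TxtNet) and trained end-to-end, so this only illustrates what CCA measures between two views. The function name is hypothetical.

```python
import numpy as np

def linear_cca_corr(X, Y, eps=1e-8):
    """Top canonical correlation between two mean-centred views.

    X: (n, dx) samples of one modality; Y: (n, dy) of the other.
    The singular values of the whitened cross-covariance matrix are
    the canonical correlations; the largest is returned.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = len(X)
    Sxx = X.T @ X / n + eps * np.eye(X.shape[1])   # regularised covariances
    Syy = Y.T @ Y / n + eps * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance: Lx^{-1} Sxy Ly^{-T}.
    M = np.linalg.inv(Lx) @ Sxy @ np.linalg.inv(Ly).T
    return np.linalg.svd(M, compute_uv=False)[0]
```

Two views related by an invertible linear map have a top canonical correlation of 1, while unrelated views score near 0, which is the signal a correlation-learning objective maximises.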


Author(s):  
Kanetoshi Hattori ◽  
Ritsuko Hattori

Abstract Aichi Prefecture, Japan, is predicted to be hit by a mega-earthquake. The Aichi Prefectural Association of Midwives has been working to improve disaster preparedness for pregnant women. This project aims to acquire area-level data on pregnant women for simulation studies of rescue activities. The number of pregnant women in census survey areas of Nagoya City was estimated from nationwide data on pregnant women by machine learning (the Cascade-Correlation Learning Architecture). Quite high correlation coefficients between the actual data and the estimated data were observed. Rescue simulations have been carried out based on the data acquired in this study.
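The reported evaluation, correlating actual counts with model estimates, amounts to a Pearson correlation coefficient. A minimal sketch with entirely hypothetical per-area counts (not the study's data):

```python
import numpy as np

# Hypothetical counts of pregnant women per census survey area.
actual = np.array([120.0, 85.0, 60.0, 200.0, 45.0])
# Hypothetical machine-learning estimates for the same areas.
estimated = np.array([115.0, 90.0, 55.0, 190.0, 50.0])

# Pearson correlation coefficient between actual and estimated counts.
r = np.corrcoef(actual, estimated)[0, 1]
```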

