Variational Autoencoder-Based Multiple Image Captioning Using a Caption Attention Map

2019 · Vol 9 (13) · pp. 2699
Author(s): Boeun Kim, Saim Shin, Hyedong Jung

Image captioning is a promising research topic with applications such as searching for desired content in large volumes of video data and explaining scenes to visually impaired people. Previous research on image captioning has focused on generating one caption per image. However, to increase usability in applications, it is necessary to generate several different captions containing various representations of an image. We propose a method to generate multiple captions using a variational autoencoder, a type of generative model. Because image features play an important role in caption generation, we propose a method to extract a Caption Attention Map (CAM) from the image, and the CAMs are projected onto a latent distribution. In addition, we propose evaluation methods for multiple image captioning, a task that has not yet been actively researched. The proposed model outperforms the base model in terms of diversity while achieving comparable accuracy. Moreover, we verify that the model using CAM generates detailed captions describing various content in the image.
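As a rough, hedged illustration of the CAM-to-latent projection described above, the following PyTorch sketch encodes a flattened attention-map feature into a Gaussian latent distribution using the standard reparameterization trick. The module name and dimensions (CAMEncoder, cam_dim, latent_dim) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CAMEncoder(nn.Module):
    """Illustrative module: projects a caption attention map (CAM)
    onto a Gaussian latent distribution, VAE-style."""
    def __init__(self, cam_dim=196, latent_dim=64):
        super().__init__()
        self.fc_mu = nn.Linear(cam_dim, latent_dim)      # mean of q(z | CAM)
        self.fc_logvar = nn.Linear(cam_dim, latent_dim)  # log-variance of q(z | CAM)

    def forward(self, cam):
        mu, logvar = self.fc_mu(cam), self.fc_logvar(cam)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
        # so sampling stays differentiable and multiple captions can be
        # generated by drawing several z for the same image.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```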

Sensors · 2021 · Vol 21 (4) · pp. 1270
Author(s): Kiyohiko Iwamura, Jun Younes Louhi Kasahara, Alessandro Moro, Atsushi Yamashita, Hajime Asama

Automatic image captioning has many important applications, such as describing visual content for visually impaired people or indexing images on the internet. Recently, deep learning-based image captioning models have been researched extensively. For caption generation, they learn the relation between image features and the words included in the captions. However, image features might not be relevant for certain words, such as verbs. Our earlier reported method therefore used motion features along with image features to generate captions that include verbs. However, that method used all motion features; because not all of them contribute positively to the captioning process, the unnecessary motion features decreased captioning accuracy. Here, we conduct experiments with motion features to analyze thoroughly the reasons for this decline in accuracy, and we propose a novel, end-to-end trainable method for image caption generation that alleviates it. Our proposed model was evaluated using three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. Results demonstrate that our proposed method improves caption generation performance.
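One generic way to keep unhelpful motion features from degrading captions is a learned gate over the fused features. The PyTorch sketch below illustrates that general idea only; it is not the authors' architecture, and all names and dimensions (GatedFeatureFusion, img_dim, mot_dim) are assumptions.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Illustrative fusion of image and motion features; a learned gate
    can down-weight motion features that do not help captioning."""
    def __init__(self, img_dim=2048, mot_dim=1024, out_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.mot_proj = nn.Linear(mot_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, img_feat, mot_feat):
        i, m = self.img_proj(img_feat), self.mot_proj(mot_feat)
        g = self.gate(torch.cat([i, m], dim=-1))  # per-dimension gate in [0, 1]
        return i + g * m  # motion contributes only where the gate opens
```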


Author(s): Masoumeh Zareapoor, Jie Yang

Image-to-image translation aims to learn a mapping from a source domain to a target domain. Three main challenges are associated with this problem and need to be addressed: the lack of paired datasets, multimodality, and diversity. Convolutional neural networks (CNNs), despite their strong performance in many computer vision tasks, fail to capture the hierarchy of spatial relationships between different parts of an object and thus do not form the ideal representative model we seek. This article presents a new variation of generative models that aims to remedy this problem. We use a trainable transformer module, which explicitly allows the spatial manipulation of data during training. This differentiable module can be inserted into the convolutional layers of the generative model, and it allows the generated distributions to be altered freely for image-to-image translation. To reap the benefits of the proposed module, our architecture incorporates a new loss function that facilitates effective end-to-end generative learning for image-to-image translation. The proposed model is evaluated through comprehensive experiments on image synthesis and image-to-image translation, along with comparisons with several state-of-the-art algorithms.
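The trainable, differentiable module described here matches the standard spatial transformer design: a localization network predicts an affine transform, which is then applied through a differentiable sampling grid. A minimal PyTorch sketch under that assumption (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Illustrative spatial transformer: predicts an affine transform
    from the feature map and resamples the map accordingly."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 6),  # 6 parameters of a 2x3 affine matrix
        )
        # Initialize to the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```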


Author(s): Santosh Kumar Mishra, Rijul Dhir, Sriparna Saha, Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image, aiming to describe the salient parts of the given image. It is an important problem that involves both computer vision, used for understanding images, and natural language processing, used for language modeling. Much work has been done on image captioning for the English language. In this article, we develop a model for image captioning in the Hindi language. Hindi is the official language of India and the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in Hindi. A dataset is manually created by translating the well-known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in Hindi. These attention mechanisms are new for Hindi, as they have never before been applied to the language. The results of the proposed model are compared with several baselines in terms of BLEU scores and show that our model performs better than the others. Manual evaluation of the obtained captions in terms of adequacy and fluency also confirms the effectiveness of our proposed approach. Availability of resources: the code for the article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language; the dataset will be made available at http://www.iitp.ac.in/∼ai-nlp-ml/resources.html.
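For reference, the attention mechanisms applied in such caption decoders are typically variants of additive (soft) attention over image regions. The PyTorch sketch below shows the generic mechanism, not the paper's exact architectures; all names are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Illustrative additive (soft) attention over image regions,
    as commonly used in attention-based caption decoders."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_f = nn.Linear(feat_dim, attn_dim)   # projects region features
        self.W_h = nn.Linear(hidden_dim, attn_dim) # projects decoder state
        self.v = nn.Linear(attn_dim, 1)            # scores each region

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.W_f(feats) + self.W_h(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)   # attention weight per region
        context = (alpha * feats).sum(dim=1)   # weighted image context vector
        return context, alpha.squeeze(-1)
```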


Author(s): Huimin Lu, Rui Yang, Zhenrong Deng, Yonglin Zhang, Guangwei Gao, ...

Chinese image captioning typically faces challenges such as single-feature extraction, a lack of global information, and insufficiently detailed descriptions of the image content. To address these limitations, we propose a fuzzy attention-based DenseNet-BiLSTM Chinese image captioning method in this article. In the proposed method, we first improve the densely connected network to extract features of the image at different scales and to enhance the model’s ability to capture weak features. At the same time, a bidirectional LSTM is used as the decoder to enhance the use of context information. An improved fuzzy attention mechanism is introduced to better align image features with contextual information. We conduct experiments on the AI Challenger dataset to evaluate the performance of the model. The results show that, compared with other models, our proposed model achieves higher scores on objective quantitative evaluation metrics, including BLEU, METEOR, ROUGE-L, and CIDEr. The generated description sentences accurately express the image content.
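To illustrate tapping a densely connected network at several scales, the sketch below registers forward hooks on intermediate DenseNet blocks. The use of torchvision's densenet121 and the choice of tapped blocks are assumptions for illustration; the paper's improved network differs.

```python
import torch
import torchvision.models as models

# Collect DenseNet features at several scales via forward hooks; the
# chosen denseblock taps are illustrative assumptions.
backbone = models.densenet121(weights=None).features
taps, feats = ["denseblock2", "denseblock3", "denseblock4"], {}

def save(name):
    def hook(_module, _inp, out):
        feats[name] = out  # stash the intermediate feature map
    return hook

for name, module in backbone.named_children():
    if name in taps:
        module.register_forward_hook(save(name))

x = torch.randn(1, 3, 224, 224)
_ = backbone(x)  # one forward pass fills `feats`
for name in taps:
    print(name, tuple(feats[name].shape))  # feature maps at different scales
```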


2021 · Vol 2 (4)
Author(s): Andrea Asperti, Davide Evangelista, Elena Loli Piccolomini

Variational Autoencoders (VAEs) are powerful generative models that merge elements of statistics and information theory with the flexibility offered by deep neural networks to efficiently solve the generation problem for high-dimensional data. The key insight of VAEs is to learn the latent distribution of the data in such a way that new, meaningful samples can be generated from it. This approach has spurred extensive research and many variations in the architectural design of VAEs, nourishing the recent research field known as unsupervised representation learning. In this article, we provide a comparative evaluation of some of the most successful recent variations of VAEs. We focus the analysis in particular on the energy efficiency of the different models, in the spirit of so-called Green AI, aiming to reduce both the carbon footprint and the financial cost of generative techniques. For each architecture, we provide its mathematical formulation, the ideas underlying its design, a detailed model description, a running implementation, and quantitative results.
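The variants surveyed here start from the standard VAE objective, the evidence lower bound (ELBO), which balances reconstruction quality against the divergence of the approximate posterior from the prior:

$$\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\big\|\,p(z)\big)$$

where $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the latent prior, typically a standard Gaussian.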


2021
Author(s): Chris Onof, Yuting Chen, Li-Pen Wang, Amy Jones, Susana Ochoa Rodriguez

In this work, a two-stage (rainfall nowcasting + flood prediction) analogue model for real-time urban flood forecasting is presented. The proposed approach accounts for the complexities of urban rainfall nowcasting while avoiding the expensive computational requirements of real-time urban flood forecasting.

The model has two consecutive stages:

(1) Rainfall nowcasting: 0-6 h lead-time ensemble rainfall nowcasting is achieved by means of an analogue method, based on the assumption that similar climate conditions will define similar patterns of temporal evolution of the rainfall. The framework uses the NORA analogue-based forecasting tool (Panziera et al., 2011), consisting of two layers. In the first layer, the 120 historical atmospheric (forcing) conditions most similar to the current atmospheric conditions are extracted, with the historical database consisting of ERA5 reanalysis data from the ECMWF and the current conditions derived from the US Global Forecasting System (GFS). In the second layer, the twelve historical radar images most similar to the current one are extracted from amongst the historical radar images linked to the aforementioned 120 forcing analogues. Lastly, for each of the twelve analogues, the rainfall fields (at a resolution of 1 km/5 min) observed after the present time are taken as one ensemble member. Note that principal component analysis (PCA) and uncorrelated multilinear PCA methods were tested for image feature extraction prior to applying the nearest-neighbour technique for analogue selection (a minimal sketch of this step follows this abstract).

(2) Flood prediction: we predict flood extent using the high-resolution rainfall forecast from Stage 1, along with a database of pre-run flood maps at 1×1 km² resolution from 157 catalogued historical flood events. A deterministic flood prediction is obtained by using the averaged response from the twelve flood maps associated with the twelve ensemble rainfall nowcasts, where for each gridded area the median value is adopted (assuming the flood maps are equiprobable). A probabilistic flood prediction is obtained by generating a quantile-based flood map. Note that the flood maps were generated through rolling-ball-based mapping of the flood volumes predicted at each node of the InfoWorks ICM sewer model of the pilot area.

The Minworth catchment in the UK (~400 km²) was used to demonstrate the proposed model. Cross-assessment was undertaken for each of the 157 flooding events by leaving one event out of training in each iteration and using it for evaluation. With a focus on the spatial replication of flood/non-flood patterns, the predicted flood maps were converted to binary (flood/non-flood) maps. Quantitative assessment was undertaken by means of a contingency table. An average accuracy rate (i.e., the proportion of correct predictions out of all test events) of 71.4% was achieved, with individual accuracy rates ranging from 57.1% to 78.6%. Further testing is needed to confirm the initial findings, and flood-mapping refinement will be pursued.

The proposed model is fast, easy, and relatively inexpensive to operate, making it suitable for direct use by local authorities, who often lack the expertise and/or capabilities for flood modelling and forecasting.

References: Panziera et al. 2011. NORA–Nowcasting of Orographic Rainfall by means of Analogues. Quarterly Journal of the Royal Meteorological Society. 137, 2106-2123.
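The analogue-selection step in Stage 1 (feature reduction followed by a nearest-neighbour search) can be sketched as follows with scikit-learn; the data here is a random stand-in, and the shapes merely mirror the counts quoted above (120 candidates, twelve analogues).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Illustrative analogue selection: reduce flattened radar images with PCA,
# then pick the 12 historical images nearest to the current one.
rng = np.random.default_rng(0)
historical = rng.random((120, 64 * 64))  # radar images of the 120 forcing analogues
current = rng.random((1, 64 * 64))       # current radar image, flattened

pca = PCA(n_components=20).fit(historical)
Z_hist, z_now = pca.transform(historical), pca.transform(current)

nn = NearestNeighbors(n_neighbors=12).fit(Z_hist)
_, idx = nn.kneighbors(z_now)
print("indices of the 12 analogue ensemble members:", idx[0])
```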


2020 · Vol 34 (01) · pp. 979-988
Author(s): Wenlin Wang, Hongteng Xu, Zhe Gan, Bai Li, Guoyin Wang, ...

We propose a novel graph-driven generative model that unifies multiple heterogeneous learning tasks within the same framework. The proposed model is based on the observation that heterogeneous learning tasks, which correspond to different generative processes, often rely on data with a shared graph structure. Accordingly, our model combines a graph convolutional network (GCN) with multiple variational autoencoders, embedding the nodes of the graph (i.e., samples for the tasks) in a uniform manner while specializing their organization and usage for different tasks. With a focus on healthcare applications (tasks), including clinical topic modeling, procedure recommendation, and admission-type prediction, we demonstrate that our method successfully leverages information across tasks, boosting performance in all of them and outperforming existing state-of-the-art approaches.
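The GCN component mentioned above conventionally follows the propagation rule H' = act(D^-1/2 (A + I) D^-1/2 H W). A minimal PyTorch sketch of one such layer, as an illustration rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Illustrative graph-convolution layer:
    H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        # H: (nodes, in_dim) node features; A: (nodes, nodes) adjacency matrix
        A_hat = A + torch.eye(A.size(0), device=A.device)  # add self-loops
        d = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(d.pow(-0.5))               # symmetric normalization
        return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.lin(H))
```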


2020 · Vol 10 (1)
Author(s): Yoshihiro Nagano, Ryo Karakida, Masato Okada

Deep neural networks are good at extracting low-dimensional subspaces (latent spaces) that represent the essential features of a high-dimensional dataset. Deep generative models, represented by variational autoencoders (VAEs), can generate and infer high-quality datasets such as images. In particular, VAEs can eliminate the noise contained in an image by repeating the mapping between latent and data space. To clarify the mechanism of such denoising, we numerically analyzed how the activity pattern of trained networks changes in the latent space during inference. We considered the time development of the activity pattern for specific data as one trajectory in the latent space and investigated the collective behavior of these inference trajectories for many data. Our study revealed that when a cluster structure exists in the dataset, a trajectory rapidly approaches the center of its cluster, behavior qualitatively consistent with the concept retrieval reported in associative memory models. Additionally, the larger the noise contained in the data, the closer the trajectory moved toward a more global cluster. We demonstrated that increasing the number of latent variables enhances this tendency to approach a cluster center and improves the generalization ability of the VAE.
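The iterative mapping analyzed above can be expressed compactly: encode a noisy sample to its posterior mean, decode it, and repeat, recording the latent point at each step. A hedged sketch, assuming trained encoder_mean and decoder modules:

```python
import torch

@torch.no_grad()
def inference_trajectory(encoder_mean, decoder, x, steps=10):
    """Repeatedly map a sample between data and latent space and record
    the latent trajectory; encoder_mean and decoder are assumed to be
    trained modules of a VAE (illustrative names)."""
    trajectory = []
    for _ in range(steps):
        z = encoder_mean(x)   # deterministic latent code (posterior mean)
        trajectory.append(z)
        x = decoder(z)        # reconstruct; noise is progressively removed
    return torch.stack(trajectory)  # (steps, batch, latent_dim)
```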


Sensors · 2020 · Vol 20 (11) · pp. 3141
Author(s): Byeong-Gyu Jeong, Taek-Young Youn, Nam-Su Jho, Sang Uk Shin

Currently, “connected cars” are being actively designed on top of smart cars and autonomous cars, to establish a two-way communication network between the vehicle and all infrastructure. Additionally, because vehicle black boxes are becoming more common, specific processes for secure and efficient data sharing and transaction over vehicle networks must be developed. In this paper, we propose a Blockchain-based vehicle-data marketplace platform model, along with a data-sharing scheme that uses Blockchain-based data-owner-based attribute-based encryption (DO-ABE). The proposed model meets basic requirements such as data confidentiality, integrity, and privacy. It securely and effectively handles large, privacy-sensitive black box video data by storing metadata on the Blockchain (on-chain) and encrypted raw data in off-chain (external) storage, and by adopting a consortium Blockchain. Furthermore, data owners in the proposed model can control their own data by applying Blockchain-based DO-ABE and owner-defined access control lists.
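A minimal sketch of the on-chain/off-chain split described above, using in-memory stand-ins for the ledger and external storage; the DO-ABE encryption itself is elided here, and all names are hypothetical.

```python
import hashlib
import json
import time

# In-memory stand-ins for the consortium ledger and off-chain storage.
chain, off_chain = [], {}

def share_blackbox_video(owner_id, ciphertext: bytes, access_policy):
    """Store encrypted raw data off-chain; record only small metadata
    (content hash, owner, owner-defined access policy) on-chain."""
    cid = hashlib.sha256(ciphertext).hexdigest()  # content address of the ciphertext
    off_chain[cid] = ciphertext                   # encrypted raw data stays off-chain
    chain.append({
        "owner": owner_id,
        "cid": cid,
        "policy": access_policy,                  # owner-defined access control list
        "ts": time.time(),
    })
    return cid

cid = share_blackbox_video("vehicle-42", b"<encrypted video bytes>", ["police", "insurer"])
print(json.dumps(chain[-1], indent=2))
```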


2014 · Vol 596 · pp. 388-393
Author(s): Guan Huang

This paper introduces a model for content-based image retrieval. The proposed model extracts image color, texture, and shape as feature vectors; the image feature space is then divided into a group of search zones. During the image-searching phase, the fractional-order distance is used to evaluate the similarity between images. As the query image vector needs to be compared only with library image vectors located in the same search zone, the time cost is greatly reduced. Furthermore, the fractional-order distance improves the vector-matching accuracy. The experimental results demonstrate that the proposed model provides more accurate retrieval results at a lower time cost than other methods.
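The fractional-order distance referred to here is commonly the Minkowski distance with exponent 0 < f < 1, which tends to discriminate better than Euclidean distance in high-dimensional feature spaces. A small NumPy sketch:

```python
import numpy as np

def fractional_distance(x, y, f=0.5):
    """Fractional-order (Minkowski with 0 < f < 1) distance between
    two feature vectors; f = 0.5 is an illustrative choice."""
    return np.power(np.sum(np.abs(x - y) ** f), 1.0 / f)

# Example: compare two random 64-dimensional feature vectors.
a, b = np.random.rand(64), np.random.rand(64)
print(fractional_distance(a, b, f=0.5))
```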

