Variational Autoencoder-Based Multiple Image Captioning Using a Caption Attention Map

2019 · Vol 9 (13) · pp. 2699
Author(s): Boeun Kim, Saim Shin, Hyedong Jung

Image captioning is a promising research topic with applications such as searching for desired content in large volumes of video data and explaining scenes to visually impaired people. Previous research on image captioning has focused on generating one caption per image. However, to increase usability in applications, it is necessary to generate several different captions containing various representations of an image. We propose a method to generate multiple captions using a variational autoencoder, a type of generative model. Because image features play an important role in caption generation, we propose a method to extract a Caption Attention Map (CAM) from the image, and the CAMs are projected onto a latent distribution. In addition, we propose evaluation methods for multiple image captioning, a task that has not yet been actively researched. The proposed model outperforms the base model in terms of diversity while achieving comparable accuracy. Moreover, we verify that the model using CAM generates detailed captions describing various content in the image.
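As a rough, hedged illustration of the CAM-to-latent projection described above, the following PyTorch sketch encodes a flattened attention-map feature into a Gaussian latent distribution using the standard reparameterization trick. The module name and dimensions (CAMEncoder, cam_dim, latent_dim) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CAMEncoder(nn.Module):
    """Illustrative module: projects a caption attention map (CAM)
    onto a Gaussian latent distribution, VAE-style."""
    def __init__(self, cam_dim=196, latent_dim=64):
        super().__init__()
        self.fc_mu = nn.Linear(cam_dim, latent_dim)      # mean of q(z | CAM)
        self.fc_logvar = nn.Linear(cam_dim, latent_dim)  # log-variance of q(z | CAM)

    def forward(self, cam):
        mu, logvar = self.fc_mu(cam), self.fc_logvar(cam)
        # Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I),
        # so sampling stays differentiable and multiple captions can be
        # generated by drawing several z for the same image.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return z, mu, logvar
```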

Sensors · 2021 · Vol 21 (4) · pp. 1270
Author(s): Kiyohiko Iwamura, Jun Younes Louhi Kasahara, Alessandro Moro, Atsushi Yamashita, Hajime Asama

Automatic image captioning has many important applications, such as describing visual content for visually impaired people or indexing images on the internet. Recently, deep learning-based image captioning models have been researched extensively. For caption generation, they learn the relation between image features and the words included in the captions. However, image features might not be relevant for certain words, such as verbs. Our earlier reported method therefore used motion features along with image features to generate captions that include verbs. However, that method used all motion features; because not all of them contribute positively to the captioning process, the unnecessary motion features decreased captioning accuracy. Here, we conduct experiments with motion features to analyze thoroughly the reasons for this decline in accuracy, and we propose a novel, end-to-end trainable method for image caption generation that alleviates it. Our proposed model was evaluated using three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. Results demonstrate that our proposed method improves caption generation performance.
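One generic way to keep unhelpful motion features from degrading captions is a learned gate over the fused features. The PyTorch sketch below illustrates that general idea only; it is not the authors' architecture, and all names and dimensions (GatedFeatureFusion, img_dim, mot_dim) are assumptions.

```python
import torch
import torch.nn as nn

class GatedFeatureFusion(nn.Module):
    """Illustrative fusion of image and motion features; a learned gate
    can down-weight motion features that do not help captioning."""
    def __init__(self, img_dim=2048, mot_dim=1024, out_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, out_dim)
        self.mot_proj = nn.Linear(mot_dim, out_dim)
        self.gate = nn.Sequential(nn.Linear(2 * out_dim, out_dim), nn.Sigmoid())

    def forward(self, img_feat, mot_feat):
        i, m = self.img_proj(img_feat), self.mot_proj(mot_feat)
        g = self.gate(torch.cat([i, m], dim=-1))  # per-dimension gate in [0, 1]
        return i + g * m  # motion contributes only where the gate opens
```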


Author(s): Masoumeh Zareapoor, Jie Yang

Image-to-image translation aims to learn a mapping from a source domain to a target domain. Three main challenges are associated with this problem and need to be addressed: the lack of paired datasets, multimodality, and diversity. Convolutional neural networks (CNNs), despite their strong performance in many computer vision tasks, fail to capture the hierarchy of spatial relationships between different parts of an object and thus do not form the ideal representative model we seek. This article presents a new variation of generative models that aims to remedy this problem. We use a trainable transformer module, which explicitly allows the spatial manipulation of data during training. This differentiable module can be inserted into the convolutional layers of the generative model, and it allows the generated distributions to be altered freely for image-to-image translation. To reap the benefits of the proposed module, our architecture incorporates a new loss function that facilitates effective end-to-end generative learning for image-to-image translation. The proposed model is evaluated through comprehensive experiments on image synthesis and image-to-image translation, along with comparisons with several state-of-the-art algorithms.
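The trainable, differentiable module described here matches the standard spatial transformer design: a localization network predicts an affine transform, which is then applied through a differentiable sampling grid. A minimal PyTorch sketch under that assumption (names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Illustrative spatial transformer: predicts an affine transform
    from the feature map and resamples the map accordingly."""
    def __init__(self, channels):
        super().__init__()
        self.loc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 6),  # 6 parameters of a 2x3 affine matrix
        )
        # Initialize to the identity transform so training starts stably.
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, x):
        theta = self.loc(x).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```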


Author(s): Santosh Kumar Mishra, Rijul Dhir, Sriparna Saha, Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image, aiming to describe the salient parts of the given image. It is an important problem that involves both computer vision, used for understanding images, and natural language processing, used for language modeling. Much work has been done on image captioning for the English language. In this article, we develop a model for image captioning in the Hindi language. Hindi is the official language of India and the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in Hindi. A dataset is manually created by translating the well-known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in Hindi. These attention mechanisms are new for Hindi, as they have never before been applied to the language. The results of the proposed model are compared with several baselines in terms of BLEU scores and show that our model performs better than the others. Manual evaluation of the obtained captions in terms of adequacy and fluency also confirms the effectiveness of our proposed approach. Availability of resources: the code for the article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language; the dataset will be made available at http://www.iitp.ac.in/∼ai-nlp-ml/resources.html.
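For reference, the attention mechanisms applied in such caption decoders are typically variants of additive (soft) attention over image regions. The PyTorch sketch below shows the generic mechanism, not the paper's exact architectures; all names are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Illustrative additive (soft) attention over image regions,
    as commonly used in attention-based caption decoders."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.W_f = nn.Linear(feat_dim, attn_dim)   # projects region features
        self.W_h = nn.Linear(hidden_dim, attn_dim) # projects decoder state
        self.v = nn.Linear(attn_dim, 1)            # scores each region

    def forward(self, feats, hidden):
        # feats: (batch, regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.v(torch.tanh(self.W_f(feats) + self.W_h(hidden).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)   # attention weight per region
        context = (alpha * feats).sum(dim=1)   # weighted image context vector
        return context, alpha.squeeze(-1)
```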


Author(s): Huimin Lu, Rui Yang, Zhenrong Deng, Yonglin Zhang, Guangwei Gao, ...

Chinese image captioning typically faces challenges such as single-feature extraction, a lack of global information, and insufficiently detailed descriptions of the image content. To address these limitations, we propose a fuzzy attention-based DenseNet-BiLSTM Chinese image captioning method in this article. In the proposed method, we first improve the densely connected network to extract features of the image at different scales and to enhance the model’s ability to capture weak features. At the same time, a bidirectional LSTM is used as the decoder to enhance the use of context information. An improved fuzzy attention mechanism is introduced to better align image features with contextual information. We conduct experiments on the AI Challenger dataset to evaluate the performance of the model. The results show that, compared with other models, our proposed model achieves higher scores on objective quantitative evaluation metrics, including BLEU, METEOR, ROUGE-L, and CIDEr. The generated description sentences accurately express the image content.
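To illustrate tapping a densely connected network at several scales, the sketch below registers forward hooks on intermediate DenseNet blocks. The use of torchvision's densenet121 and the choice of tapped blocks are assumptions for illustration; the paper's improved network differs.

```python
import torch
import torchvision.models as models

# Collect DenseNet features at several scales via forward hooks; the
# chosen denseblock taps are illustrative assumptions.
backbone = models.densenet121(weights=None).features
taps, feats = ["denseblock2", "denseblock3", "denseblock4"], {}

def save(name):
    def hook(_module, _inp, out):
        feats[name] = out  # stash the intermediate feature map
    return hook

for name, module in backbone.named_children():
    if name in taps:
        module.register_forward_hook(save(name))

x = torch.randn(1, 3, 224, 224)
_ = backbone(x)  # one forward pass fills `feats`
for name in taps:
    print(name, tuple(feats[name].shape))  # feature maps at different scales
```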


2021 · Vol 2 (4)
Author(s): Andrea Asperti, Davide Evangelista, Elena Loli Piccolomini

Variational Autoencoders (VAEs) are powerful generative models that merge elements of statistics and information theory with the flexibility offered by deep neural networks to efficiently solve the generation problem for high-dimensional data. The key insight of VAEs is to learn the latent distribution of the data in such a way that new, meaningful samples can be generated from it. This approach has spurred extensive research and many variations in the architectural design of VAEs, nourishing the recent research field known as unsupervised representation learning. In this article, we provide a comparative evaluation of some of the most successful recent variations of VAEs. We focus the analysis in particular on the energy efficiency of the different models, in the spirit of so-called Green AI, aiming to reduce both the carbon footprint and the financial cost of generative techniques. For each architecture, we provide its mathematical formulation, the ideas underlying its design, a detailed model description, a running implementation, and quantitative results.
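The variants surveyed here start from the standard VAE objective, the evidence lower bound (ELBO), which balances reconstruction quality against the divergence of the approximate posterior from the prior:

$$\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] \;-\; D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\big\|\,p(z)\big)$$

where $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ the decoder, and $p(z)$ the latent prior, typically a standard Gaussian.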


2021
Author(s): Chris Onof, Yuting Chen, Li-Pen Wang, Amy Jones, Susana Ochoa Rodriguez

In this work, a two-stage (rainfall nowcasting + flood prediction) analogue model for real-time urban flood forecasting is presented. The proposed approach accounts for the complexities of urban rainfall nowcasting while avoiding the expensive computational requirements of real-time urban flood forecasting.

The model has two consecutive stages:

(1) Rainfall nowcasting: 0-6 h lead-time ensemble rainfall nowcasting is achieved by means of an analogue method, based on the assumption that similar climate conditions will define similar patterns of temporal evolution of the rainfall. The framework uses the NORA analogue-based forecasting tool (Panziera et al., 2011), consisting of two layers. In the first layer, the 120 historical atmospheric (forcing) conditions most similar to the current atmospheric conditions are extracted, with the historical database consisting of ERA5 reanalysis data from the ECMWF and the current conditions derived from the US Global Forecasting System (GFS). In the second layer, the twelve historical radar images most similar to the current one are extracted from amongst the historical radar images linked to the aforementioned 120 forcing analogues. Lastly, for each of the twelve analogues, the rainfall fields (at a resolution of 1 km/5 min) observed after the present time are taken as one ensemble member. Note that principal component analysis (PCA) and uncorrelated multilinear PCA methods were tested for image feature extraction prior to applying the nearest-neighbour technique for analogue selection (a minimal sketch of this step follows this abstract).

(2) Flood prediction: we predict flood extent using the high-resolution rainfall forecast from Stage 1, along with a database of pre-run flood maps at 1×1 km² resolution from 157 catalogued historical flood events. A deterministic flood prediction is obtained by using the averaged response from the twelve flood maps associated with the twelve ensemble rainfall nowcasts, where for each gridded area the median value is adopted (assuming the flood maps are equiprobable). A probabilistic flood prediction is obtained by generating a quantile-based flood map. Note that the flood maps were generated through rolling-ball-based mapping of the flood volumes predicted at each node of the InfoWorks ICM sewer model of the pilot area.

The Minworth catchment in the UK (~400 km²) was used to demonstrate the proposed model. Cross-assessment was undertaken for each of the 157 flooding events by leaving one event out of training in each iteration and using it for evaluation. With a focus on the spatial replication of flood/non-flood patterns, the predicted flood maps were converted to binary (flood/non-flood) maps. Quantitative assessment was undertaken by means of a contingency table. An average accuracy rate (i.e., the proportion of correct predictions out of all test events) of 71.4% was achieved, with individual accuracy rates ranging from 57.1% to 78.6%. Further testing is needed to confirm the initial findings, and flood-mapping refinement will be pursued.

The proposed model is fast, easy, and relatively inexpensive to operate, making it suitable for direct use by local authorities, who often lack the expertise and/or capabilities for flood modelling and forecasting.

References: Panziera et al. 2011. NORA–Nowcasting of Orographic Rainfall by means of Analogues. Quarterly Journal of the Royal Meteorological Society. 137, 2106-2123.
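The analogue-selection step in Stage 1 (feature reduction followed by a nearest-neighbour search) can be sketched as follows with scikit-learn; the data here is a random stand-in, and the shapes merely mirror the counts quoted above (120 candidates, twelve analogues).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Illustrative analogue selection: reduce flattened radar images with PCA,
# then pick the 12 historical images nearest to the current one.
rng = np.random.default_rng(0)
historical = rng.random((120, 64 * 64))  # radar images of the 120 forcing analogues
current = rng.random((1, 64 * 64))       # current radar image, flattened

pca = PCA(n_components=20).fit(historical)
Z_hist, z_now = pca.transform(historical), pca.transform(current)

nn = NearestNeighbors(n_neighbors=12).fit(Z_hist)
_, idx = nn.kneighbors(z_now)
print("indices of the 12 analogue ensemble members:", idx[0])
```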


2020 · Vol 34 (01) · pp. 979-988
Author(s): Wenlin Wang, Hongteng Xu, Zhe Gan, Bai Li, Guoyin Wang, ...

We propose a novel graph-driven generative model that unifies multiple heterogeneous learning tasks within the same framework. The proposed model is based on the observation that heterogeneous learning tasks, which correspond to different generative processes, often rely on data with a shared graph structure. Accordingly, our model combines a graph convolutional network (GCN) with multiple variational autoencoders, embedding the nodes of the graph (i.e., samples for the tasks) in a uniform manner while specializing their organization and usage for different tasks. With a focus on healthcare applications (tasks), including clinical topic modeling, procedure recommendation, and admission-type prediction, we demonstrate that our method successfully leverages information across tasks, boosting performance in all of them and outperforming existing state-of-the-art approaches.
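The GCN component mentioned above conventionally follows the propagation rule H' = act(D^-1/2 (A + I) D^-1/2 H W). A minimal PyTorch sketch of one such layer, as an illustration rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Illustrative graph-convolution layer:
    H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        # H: (nodes, in_dim) node features; A: (nodes, nodes) adjacency matrix
        A_hat = A + torch.eye(A.size(0), device=A.device)  # add self-loops
        d = A_hat.sum(dim=1)
        D_inv_sqrt = torch.diag(d.pow(-0.5))               # symmetric normalization
        return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ self.lin(H))
```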


2020 · Vol 10 (1)
Author(s): Yoshihiro Nagano, Ryo Karakida, Masato Okada

Deep neural networks are good at extracting low-dimensional subspaces (latent spaces) that represent the essential features of a high-dimensional dataset. Deep generative models, represented by variational autoencoders (VAEs), can generate and infer high-quality datasets such as images. In particular, VAEs can eliminate the noise contained in an image by repeating the mapping between latent and data space. To clarify the mechanism of such denoising, we numerically analyzed how the activity pattern of trained networks changes in the latent space during inference. We considered the time development of the activity pattern for specific data as one trajectory in the latent space and investigated the collective behavior of these inference trajectories for many data. Our study revealed that when a cluster structure exists in the dataset, a trajectory rapidly approaches the center of its cluster, behavior qualitatively consistent with the concept retrieval reported in associative memory models. Additionally, the larger the noise contained in the data, the closer the trajectory moved toward a more global cluster. We demonstrated that increasing the number of latent variables enhances this tendency to approach a cluster center and improves the generalization ability of the VAE.
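The iterative mapping analyzed above can be expressed compactly: encode a noisy sample to its posterior mean, decode it, and repeat, recording the latent point at each step. A hedged sketch, assuming trained encoder_mean and decoder modules:

```python
import torch

@torch.no_grad()
def inference_trajectory(encoder_mean, decoder, x, steps=10):
    """Repeatedly map a sample between data and latent space and record
    the latent trajectory; encoder_mean and decoder are assumed to be
    trained modules of a VAE (illustrative names)."""
    trajectory = []
    for _ in range(steps):
        z = encoder_mean(x)   # deterministic latent code (posterior mean)
        trajectory.append(z)
        x = decoder(z)        # reconstruct; noise is progressively removed
    return torch.stack(trajectory)  # (steps, batch, latent_dim)
```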


Sensors · 2020 · Vol 20 (11) · pp. 3141
Author(s): Byeong-Gyu Jeong, Taek-Young Youn, Nam-Su Jho, Sang Uk Shin

Currently, “connected cars” are being actively designed on top of smart cars and autonomous cars, to establish a two-way communication network between the vehicle and all infrastructure. Additionally, because vehicle black boxes are becoming more common, specific processes for secure and efficient data sharing and transaction over vehicle networks must be developed. In this paper, we propose a Blockchain-based vehicle-data marketplace platform model, along with a data-sharing scheme that uses Blockchain-based data-owner-based attribute-based encryption (DO-ABE). The proposed model meets basic requirements such as data confidentiality, integrity, and privacy. It securely and effectively handles large, privacy-sensitive black box video data by storing metadata on the Blockchain (on-chain) and encrypted raw data in off-chain (external) storage, and by adopting a consortium Blockchain. Furthermore, data owners in the proposed model can control their own data by applying Blockchain-based DO-ABE and owner-defined access control lists.
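A minimal sketch of the on-chain/off-chain split described above, using in-memory stand-ins for the ledger and external storage; the DO-ABE encryption itself is elided here, and all names are hypothetical.

```python
import hashlib
import json
import time

# In-memory stand-ins for the consortium ledger and off-chain storage.
chain, off_chain = [], {}

def share_blackbox_video(owner_id, ciphertext: bytes, access_policy):
    """Store encrypted raw data off-chain; record only small metadata
    (content hash, owner, owner-defined access policy) on-chain."""
    cid = hashlib.sha256(ciphertext).hexdigest()  # content address of the ciphertext
    off_chain[cid] = ciphertext                   # encrypted raw data stays off-chain
    chain.append({
        "owner": owner_id,
        "cid": cid,
        "policy": access_policy,                  # owner-defined access control list
        "ts": time.time(),
    })
    return cid

cid = share_blackbox_video("vehicle-42", b"<encrypted video bytes>", ["police", "insurer"])
print(json.dumps(chain[-1], indent=2))
```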


2014 · Vol 596 · pp. 388-393
Author(s): Guan Huang

This paper introduces a model for content-based image retrieval. The proposed model extracts image color, texture, and shape as feature vectors; the image feature space is then divided into a group of search zones. During the image-searching phase, the fractional-order distance is used to evaluate the similarity between images. As the query image vector needs to be compared only with library image vectors located in the same search zone, the time cost is greatly reduced. Furthermore, the fractional-order distance improves the vector-matching accuracy. The experimental results demonstrate that the proposed model provides more accurate retrieval results at a lower time cost than other methods.
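The fractional-order distance referred to here is commonly the Minkowski distance with exponent 0 < f < 1, which tends to discriminate better than Euclidean distance in high-dimensional feature spaces. A small NumPy sketch:

```python
import numpy as np

def fractional_distance(x, y, f=0.5):
    """Fractional-order (Minkowski with 0 < f < 1) distance between
    two feature vectors; f = 0.5 is an illustrative choice."""
    return np.power(np.sum(np.abs(x - y) ** f), 1.0 / f)

# Example: compare two random 64-dimensional feature vectors.
a, b = np.random.rand(64), np.random.rand(64)
print(fractional_distance(a, b, f=0.5))
```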

