Learn and Tell: Learning Priors for Image Caption Generation

2020 ◽  
Vol 10 (19) ◽  
pp. 6942
Author(s):  
Pei Liu ◽  
Dezhong Peng ◽  
Ming Zhang

In this work, we propose a novel priors-based attention neural network (PANN) for image captioning, which aims at incorporating two kinds of priors, i.e., the probabilities of being mentioned for local region proposals (PBM priors) and part-of-speech clues for caption words (POS priors), into the visual information extraction process at each word prediction. This work was inspired by the intuitions that region proposals have different inherent probabilities of being mentioned in a caption, and that the POS clues bridge the word class (part-of-speech tag) with the categories of visual features. We propose new methods to extract these two priors: the PBM priors are obtained by computing the similarities between the caption feature vector and local feature vectors, while the POS priors are predicted at each step of word generation by taking the hidden state of the decoder as input. These two kinds of priors are then incorporated into the PANN module of the decoder to help the decoder extract more accurate visual information for the current word generation. In our experiments, we qualitatively analyzed the proposed approach and quantitatively evaluated several captioning schemes with our PANN on the MS-COCO dataset. Experimental results demonstrate that the proposed method achieves better performance and confirm the effectiveness of the proposed network for image captioning.
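The idea of PBM priors described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract says only that the priors come from similarities between a caption feature vector and local feature vectors, so the use of cosine similarity, the softmax normalization, and the way the priors are folded into the attention weights (adding log-priors to the attention logits) are all assumptions, and the function names `pbm_priors` and `prior_weighted_attention` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pbm_priors(caption_vec, region_vecs):
    """Hypothetical PBM prior: softmax-normalized similarity of each
    region feature to the caption feature (similarity choice is assumed)."""
    return softmax([cosine(caption_vec, r) for r in region_vecs])

def prior_weighted_attention(attn_logits, priors):
    """Assumed combination rule: add log-priors to the attention logits,
    then renormalize so the weights sum to one."""
    return softmax([s + math.log(p) for s, p in zip(attn_logits, priors)])

# Toy data: one caption vector and three region proposals.
caption = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
priors = pbm_priors(caption, regions)

# With uniform attention logits, the final weights reduce to the priors.
weights = prior_weighted_attention([0.0, 0.0, 0.0], priors)
```

In this toy setup the region aligned with the caption vector receives the largest prior and the orthogonal region the smallest, which is the behavior the abstract's intuition suggests.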

2019 ◽  
Vol 9 (16) ◽  
pp. 3260 ◽  
Author(s):  
Jiangyun Li ◽  
Peng Yao ◽  
Longteng Guo ◽  
Weicun Zhang

Image captioning attempts to generate a description of a given image, usually taking a Convolutional Neural Network as the encoder to extract the visual features and a sequence model as the decoder to generate descriptions; among sequence models, the self-attention mechanism has made notable progress recently. However, this predominant encoder-decoder architecture has some problems to be solved. On the encoder side, without the semantic concepts, the extracted visual features do not make full use of the image information. On the decoder side, the sequence self-attention relies only on word representations, lacks the guidance of visual information, and is easily influenced by the language prior. In this paper, we propose a novel boosted transformer model with two attention modules for the above-mentioned problems, i.e., “Concept-Guided Attention” (CGA) and “Vision-Guided Attention” (VGA). Our model utilizes CGA in the encoder to obtain the boosted visual features by integrating the instance-level concepts into the visual features. In the decoder, we stack VGA, which uses the visual information as a bridge to model internal relationships among the sequences and can serve as an auxiliary module of sequence self-attention. Quantitative and qualitative results on the Microsoft COCO dataset demonstrate that our model outperforms the state-of-the-art approaches.


Author(s):  
Chaitrali Prasanna Chaudhari ◽  
Satish Devane

“Image Captioning is the process of generating a textual description of an image.” It deploys both computer vision and natural language processing for caption generation. However, the majority of image captioning systems offer unclear depictions of objects such as “man”, “woman”, “group of people”, “building”, etc. Hence, this paper intends to develop an intelligent image captioning model. The adopted model comprises a few steps: word generation, sentence formation, and caption generation. Initially, the input image is subjected to a deep learning classifier, a Convolutional Neural Network (CNN). Since the classifier is already trained on the relevant words that are related to all images, it can easily classify the associated words of the given image. Further, a set of sentences is formed from the generated words using a Long Short-Term Memory (LSTM) model. The likelihood of the formed sentences is computed using the Maximum Likelihood (ML) function, and the sentences with higher probability are taken and used to generate the visual representation of the scene in terms of an image caption. As a major novelty, this paper aims to enhance the performance of the CNN by optimally tuning its weights and activation function. This paper introduces a new enhanced optimization algorithm, Rider with Randomized Bypass and Over-taker update (RR-BOU), for this optimal selection. The proposed RR-BOU is an enhanced version of the Rider Optimization Algorithm (ROA). Finally, the performance of the proposed captioning model is compared with that of other conventional models by means of statistical analysis.


eLife ◽  
2017 ◽  
Vol 6 ◽  
Author(s):  
Ivan Larderet ◽  
Pauline MJ Fritsch ◽  
Nanae Gendre ◽  
G Larisa Neagu-Maier ◽  
Richard D Fetter ◽  
...  

Visual systems transduce, process and transmit light-dependent environmental cues. Computation of visual features depends on photoreceptor neuron types (PR) present, organization of the eye and wiring of the underlying neural circuit. Here, we describe the circuit architecture of the visual system of Drosophila larvae by mapping the synaptic wiring diagram and neurotransmitters. By contacting different targets, the two larval PR-subtypes create two converging pathways potentially underlying the computation of ambient light intensity and temporal light changes already within this first visual processing center. Locally processed visual information then signals via dedicated projection interneurons to higher brain areas including the lateral horn and mushroom body. The stratified structure of the larval optic neuropil (LON) suggests common organizational principles with the adult fly and vertebrate visual systems. The complete synaptic wiring diagram of the LON paves the way to understanding how circuits with reduced numerical complexity control wide ranges of behaviors.


Author(s):  
Adam Csapo ◽  
Barna Resko ◽  
Morten Lind ◽  
Peter Baranyi

The computerized modeling of cognitive visual information has been a research field of great interest in the past several decades. The research field is interesting not only from a biological perspective, but also from an engineering point of view when systems are developed that aim to achieve similar goals as biological cognitive systems. This article introduces a general framework for the extraction and systematic storage of low-level visual features. The applicability of the framework is investigated in both unstructured and highly structured environments. In a first experiment, a linear categorization algorithm originally developed for the classification of text documents is used to classify natural images taken from the Caltech 101 database. In a second experiment, the framework is used to provide an automatically guided vehicle with obstacle detection and auto-positioning functionalities in highly structured environments. Results demonstrate that the model is highly applicable in structured environments, and also shows promising results in certain cases when used in unstructured environments.


Robotics ◽  
2020 ◽  
Vol 9 (2) ◽  
pp. 40
Author(s):  
Hirokazu Madokoro ◽  
Hanwool Woo ◽  
Stephanie Nix ◽  
Kazuhito Sato

This study was conducted to develop original benchmark datasets that simultaneously include indoor–outdoor visual features. Indoor visual information related to images includes outdoor features to a degree that varies extremely by time, weather, and season. We obtained time-series scene images using a wide field of view (FOV) camera mounted on a mobile robot moving along a 392-m route in an indoor environment surrounded by transparent glass walls and windows, in two directions and in three seasons. For this study, we propose a unified method for extracting, characterizing, and recognizing visual landmarks that are robust to human occlusion in a real environment in which robots coexist with people. Using our method, we conducted an evaluation experiment to recognize scenes divided into up to 64 zones at fixed intervals. The experimentally obtained results using the datasets revealed the performance and characteristics of meta-parameter optimization, mapping characteristics to category maps, and recognition accuracy. Moreover, we visualized similarities between scene images using category maps. We also identified cluster boundaries obtained from mapping weights.


1985 ◽  
Vol 37 (4) ◽  
pp. 613-625 ◽  
Author(s):  
Andrew F. Monk

Marr and Nishihara (1978) have made certain recommendations about how representations postulated in a theory of visual information processing should be specified. Using this scheme, the paper discusses representations which might be postulated in a model of visual word recognition. A representation is specified in terms of a set of primitives (e.g., word identities or visual features) in combination with a coordinate system. The coordinate systems considered are retinal, spatial (e.g., position on page), word-centred (position in word) and sentence-centred (position in sentence). Various combinations of primitives and coordinate systems are considered, along with how to decide which combinations are actually generated in the process of fluent reading. A tentative model is put forward in which a single processing stage, which starts anew after each saccade, generates a representation with word identities as its primitives and sentence-centred coordinates. Evidence to support such a model, which has no intermediate representation with spatial coordinates, is briefly reviewed.


Author(s):  
Yiyi Zhou ◽  
Rongrong Ji ◽  
Jinsong Su ◽  
Xiangming Li ◽  
Xiaoshuai Sun

In this paper, we uncover the issue of knowledge inertia in visual question answering (VQA), which commonly exists in most VQA models and forces the models to rely mainly on the question content to “guess” the answer, without regard to the visual information. Such an issue not only impairs the performance of VQA models, but also greatly reduces the credibility of the answer prediction. Simply highlighting the visual features in the model is not feasible, since the prediction is built upon the joint modeling of two modalities and is largely influenced by the data distribution. We therefore propose Pairwise Inconformity Learning (PIL) to tackle the issue of knowledge inertia. In particular, PIL takes full advantage of the similar image pairs with diverse answers to an identical question provided in the VQA2.0 dataset. It builds a multi-modal embedding space to project positive/negative feature pairs, upon which word vectors of answers are modeled as anchors. By doing so, PIL strengthens the importance of visual features in prediction with a novel dynamic-margin based triplet loss that efficiently increases the semantic discrepancies between positive/negative image pairs. To verify the proposed PIL, we plug it into a baseline VQA model as well as a set of recent VQA models, and conduct extensive experiments on two benchmark datasets, i.e., VQA1.0 and VQA2.0. Experimental results show that PIL can boost the accuracy of the existing VQA models (1.56%-2.93% gain) with a negligible increase in parameters (0.85%-5.4% parameters). Qualitative results also reveal the elimination of knowledge inertia in the existing VQA models after implementing our PIL.
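The triplet loss with a dynamic margin mentioned above can be illustrated with a small sketch. The abstract does not define how the margin is computed, so the rule used here (a margin proportional to the distance between the two competing answer vectors) is purely an assumption, as are the Euclidean distance, the `scale` parameter, and the function name `dynamic_margin_triplet_loss`. It is meant to show the general shape of such a loss, not the PIL formulation itself.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dynamic_margin_triplet_loss(anchor, positive, negative,
                                other_answer, scale=0.5):
    """Hinge-style triplet loss with a per-triplet margin.

    anchor       -- word vector of the correct answer
    positive     -- joint feature of the image that matches the anchor
    negative     -- joint feature of the paired image with a different answer
    other_answer -- word vector of the negative image's answer
    The margin rule below is an assumption made for illustration only.
    """
    margin = scale * euclidean(anchor, other_answer)
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

# Toy 2-D embeddings (hypothetical values).
answer_a = [1.0, 0.0]
answer_b = [0.0, 1.0]
feat_near_a = [1.0, 0.0]
feat_near_b = [0.0, 1.0]

# Well-separated triplet: the positive sits on the anchor, so no penalty.
loss_good = dynamic_margin_triplet_loss(answer_a, feat_near_a,
                                        feat_near_b, answer_b)

# Swapped features: the negative is closer to the anchor, so the loss is positive.
loss_bad = dynamic_margin_triplet_loss(answer_a, feat_near_b,
                                       feat_near_a, answer_b)
```

A margin that varies per triplet lets semantically distant answers be pushed further apart than near-synonyms, which is one plausible reading of "dynamic-margin" in the abstract.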

