Middle-Level Attribute-Based Language Retouching for Image Caption Generation

2018 ◽  
Vol 8 (10) ◽  
pp. 1850 ◽  
Author(s):  
Zhibin Guan ◽  
Kang Liu ◽  
Yan Ma ◽  
Xu Qian ◽  
Tongkai Ji

Image caption generation is an attractive research area that focuses on generating natural language sentences to describe the visual content of a given image. It is an interdisciplinary subject combining computer vision (CV) and natural language processing (NLP). Existing image captioning methods mainly focus on generating the final image caption directly, which may lose significant identifying information about objects contained in the raw image. Therefore, we propose a new middle-level attribute-based language retouching (MLALR) method to solve this problem. Our MLALR method uses the middle-level attributes predicted from the object regions to retouch the intermediate image description, which is generated by our language generation model. The advantage of our MLALR method is that it can correct descriptive errors in the intermediate image description and make the final image caption more accurate. Moreover, evaluation on the MSCOCO, Flickr8K, and Flickr30K benchmark datasets validated the strong performance of our MLALR method under the BLEU, METEOR, ROUGE-L, CIDEr, and SPICE evaluation metrics.
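A minimal sketch of the retouching step, assuming the rest of the pipeline (an intermediate caption and per-region attribute probabilities) is already available; the generic-word set and function names below are illustrative, not the authors' code:

```python
# Sketch: swap generic object words in the intermediate caption for the
# most confident middle-level attribute predicted from the object regions.

GENERIC_WORDS = {"animal", "vehicle", "person"}  # illustrative generic tokens

def retouch_caption(intermediate_caption, region_attributes):
    """region_attributes maps a generic word to a dict of
    attribute -> predicted probability for the matching region."""
    retouched = []
    for tok in intermediate_caption.split():
        attrs = region_attributes.get(tok)
        if tok in GENERIC_WORDS and attrs:
            retouched.append(max(attrs, key=attrs.get))  # most likely attribute
        else:
            retouched.append(tok)
    return " ".join(retouched)

print(retouch_caption(
    "a animal standing in a field",
    {"animal": {"zebra": 0.91, "horse": 0.07}},
))  # -> "a zebra standing in a field"
```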

Author(s):  
Aghasi Poghosyan ◽  
Hakob Sarukhanyan

Automated semantic information extraction from images is a difficult task. There are works that can extract an image caption, or object names and their coordinates. This work presents object detection and automated caption generation implemented in a single model. We have built an image caption generation model on top of an object detection model, adding extra layers to the object detector to increase caption generator performance. The result is a single model that can detect objects, localize them, and generate an image caption in natural language.
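A rough PyTorch sketch of the single-model idea, with the object detector abstracted to a pooled feature tensor; all layer names and sizes are assumptions for illustration:

```python
import torch
import torch.nn as nn

class DetectorCaptioner(nn.Module):
    """Caption decoder stacked on top of object-detector features."""
    def __init__(self, feat_dim=1024, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        # extra layers added on top of the detector's pooled features
        self.adapt = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, det_feats, captions):
        # det_feats: (B, feat_dim) pooled detection features
        # captions:  (B, T) teacher-forced token ids
        h0 = self.adapt(det_feats).unsqueeze(0)   # (1, B, hidden) initial state
        c0 = torch.zeros_like(h0)
        emb = self.embed(captions)                # (B, T, embed)
        hidden, _ = self.lstm(emb, (h0, c0))      # decoding conditioned on image
        return self.out(hidden)                   # (B, T, vocab) word logits
```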


2021 ◽  
pp. 42-55
Author(s):  
Shitiz Gupta ◽  
...

Image caption generation is a stimulating multimodal task. Substantial advancements have been made in the field of deep learning, notably in computer vision and natural language processing. Yet, human-generated captions are still considered better, which makes it a challenging application for interactive machine learning. In this paper, we aim to compare different transfer learning techniques and develop a novel architecture to improve image captioning accuracy. We compute image feature vectors using different state-of-the-art transfer learning models, which are fed into an Encoder-Decoder network based on Stacked LSTMs with soft attention, along with embedded text, to generate high-accuracy captions. We have compared these models on several benchmark datasets based on different evaluation metrics like BLEU and METEOR.
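A minimal sketch of the transfer-learning step, assuming a recent torchvision; ResNet-50 stands in for the various state-of-the-art models compared in the paper:

```python
import torch
import torchvision.models as models

# Reuse a pretrained CNN as a fixed feature extractor; the resulting
# vectors would be fed to the Encoder-Decoder captioning network.
cnn = models.resnet50(weights="IMAGENET1K_V1")
cnn.fc = torch.nn.Identity()   # drop the classifier, keep 2048-d features
cnn.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)   # stand-in batch of images
    feats = cnn(images)                    # (4, 2048) image feature vectors
print(feats.shape)
```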


2020 ◽  
Vol 2020 ◽  
pp. 1-13 ◽  
Author(s):  
Haoran Wang ◽  
Yue Zhang ◽  
Xiaosheng Yu

In recent years, with the rapid development of artificial intelligence, image captioning has gradually attracted the attention of many researchers in the field and has become an interesting and arduous task. Image captioning, the automatic generation of natural language descriptions according to the content observed in an image, is an important part of scene understanding, combining knowledge from computer vision and natural language processing. Its applications are extensive and significant, for example in human-computer interaction. This paper summarizes the related methods and focuses on the attention mechanism, which plays an important role in computer vision and has recently been widely used in image caption generation tasks. Furthermore, the advantages and shortcomings of these methods are discussed, and the commonly used datasets and evaluation criteria in this field are provided. Finally, this paper highlights some open challenges in the image captioning task.
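As a reference point for the surveyed methods, most soft-attention captioning models follow the canonical formulation below, where the a_i are region features, h_{t-1} is the previous decoder state, and z_t is the context vector fed to the decoder at step t:

```latex
e_{ti} = f_{\mathrm{att}}(a_i, h_{t-1}), \qquad
\alpha_{ti} = \frac{\exp(e_{ti})}{\sum_k \exp(e_{tk})}, \qquad
\hat{z}_t = \sum_i \alpha_{ti}\, a_i
```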


2020 ◽  
Vol 3 (1) ◽  
pp. 138-146
Author(s):  
Subash Pandey ◽  
Rabin Kumar Dhamala ◽  
Bikram Karki ◽  
Saroj Dahal ◽  
Rama Bastola

Automatically generating a natural language description of an image is a major challenge in the field of artificial intelligence. Describing an image brings together two fields: natural language processing and computer vision. There are two types of approaches, top-down and bottom-up. In this paper, we take the top-down approach, which starts from the image and converts it into words. The image is passed to a Convolutional Neural Network (CNN) encoder, and its output is fed to a Recurrent Neural Network (RNN) decoder that generates meaningful captions. We generated image descriptions by passing real-time images from a smartphone camera as well as test images from the dataset. To evaluate the model's performance, we used the BLEU (Bilingual Evaluation Understudy) score, matching predicted words against the original captions.
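The BLEU evaluation described above can be reproduced with NLTK; the captions below are toy examples:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Score a predicted caption against the reference captions with BLEU-4.
reference_captions = [
    "a dog is running on the grass".split(),
    "a brown dog runs across a grassy field".split(),
]
predicted_caption = "a dog runs on the grass".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
score = sentence_bleu(reference_captions, predicted_caption,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=smooth)
print(f"BLEU-4: {score:.3f}")
```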


Symmetry ◽  
2018 ◽  
Vol 10 (11) ◽  
pp. 626 ◽  
Author(s):  
Zhibin Guan ◽  
Kang Liu ◽  
Yan Ma ◽  
Xu Qian ◽  
Tongkai Ji

Image caption generation is a fundamental task that builds a bridge between an image and its description in text, and it is drawing increasing interest in artificial intelligence. Images and textual sentences are viewed as two different carriers of information, symmetric and unified in the same visual scene content. Existing image captioning methods rarely consider generating the final description sentence in a coarse-grained to fine-grained way, which is how humans understand their surrounding scenes, and the generated sentence sometimes describes only coarse-grained image content. Therefore, we propose a coarse-to-fine-grained hierarchical generation method for image captioning, named SDA-CFGHG, to address the two problems above. The core of our SDA-CFGHG method is a sequential dual attention that fuses visual information of different granularities in a sequential manner. The advantage of our SDA-CFGHG method is that it can achieve image captioning in a coarse-to-fine-grained way, and the generated textual sentence can capture details of the raw image to some degree. Moreover, we validate the strong performance of our method on the MS COCO and Flickr benchmark datasets with several popular evaluation metrics: CIDEr, SPICE, METEOR, ROUGE-L, and BLEU.
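A very rough PyTorch sketch of the sequential dual-attention idea (attend over coarse-grained features first, then let the result guide a second attention pass over fine-grained features); dimensions and the fusion scheme are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SequentialDualAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.coarse_att = nn.Linear(dim * 2, 1)
        self.fine_att = nn.Linear(dim * 2, 1)

    def attend(self, layer, query, feats):
        # feats: (B, N, dim); query: (B, dim)
        q = query.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = layer(torch.cat([feats, q], dim=-1)).squeeze(-1)  # (B, N)
        weights = F.softmax(scores, dim=-1)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)          # (B, dim)

    def forward(self, hidden, coarse_feats, fine_feats):
        coarse_ctx = self.attend(self.coarse_att, hidden, coarse_feats)
        # the coarse context guides the second, fine-grained attention pass
        fine_ctx = self.attend(self.fine_att, coarse_ctx, fine_feats)
        return coarse_ctx + fine_ctx   # fused coarse-to-fine context vector
```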


2018 ◽  
Vol 24 (3) ◽  
pp. 325-362
Author(s):  
A. BELZ ◽  
T.L. BERG ◽  
L. YU

Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).


Author(s):  
Shiru Qu ◽  
Yuling Xi ◽  
Songtao Ding

Accurately describing complex traffic scenes is a hard problem in computer vision. Traffic scenes are changeable, so image captioning is easily disturbed by light changes and object occlusion. To solve this problem, we propose an image caption generation model based on an attention mechanism, combining a convolutional neural network (CNN) and a recurrent neural network (RNN) to generate end-to-end descriptions for traffic images. To generate semantic descriptions with a distinct degree of discrimination, the attention mechanism is applied to the language model. We use the Flickr8K, Flickr30K, and MS COCO benchmark datasets to validate the effectiveness of our method. Accuracy is improved by up to 8.6%, 12.4%, 19.3%, and 21.5% on the different evaluation metrics. Experiments show that our algorithm is robust in four different complex traffic scenarios: light changes, abnormal weather environments, road-marked targets, and various kinds of transportation tools.
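A generic sketch of the end-to-end training step such a CNN-RNN captioner uses, with teacher forcing and cross-entropy over the next word; the model interface and padding index are assumptions:

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss(ignore_index=0)   # 0 = padding token (assumed)

def train_step(model, optimizer, images, captions):
    # captions: (B, T) token ids; predict token t+1 from tokens <= t
    logits = model(images, captions[:, :-1])      # (B, T-1, vocab)
    loss = criterion(logits.reshape(-1, logits.size(-1)),
                     captions[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```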


Sensors ◽  
2021 ◽  
Vol 21 (3) ◽  
pp. 1012
Author(s):  
Jisu Hwang ◽  
Incheol Kim

Due to the development of computer vision and natural language processing technologies in recent years, there has been a growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data such as images and text. Vision-and-language navigation (VLN) requires the alignment and grounding of multimodal input data to enable real-time perception of the task status from panoramic images and natural language instructions. This study proposes a novel deep neural network model (JMEBS), with joint multimodal embedding and backtracking search, for VLN tasks. The proposed JMEBS model uses a transformer-based joint multimodal embedding module that exploits both multimodal context and temporal context. It also employs backtracking-enabled greedy local search (BGLS), a novel algorithm with a backtracking feature designed to improve the task success rate and optimize the navigation path, based on the local and global scores of candidate actions. A novel global scoring method is also used to improve performance by comparing the partial trajectories searched thus far against a plurality of natural language instructions. The performance of the proposed model on various operations was then experimentally demonstrated and compared with other models using the Matterport3D Simulator and room-to-room (R2R) benchmark datasets.
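A schematic sketch of what a backtracking-enabled greedy local search can look like; the graph interface, scoring functions, and step limit are illustrative assumptions and do not reproduce the paper's exact scoring:

```python
def bgls(start, neighbors, local_score, global_score, goal_test, max_steps=30):
    """Greedy search over candidate actions that backs up to an earlier
    decision point when the global instruction-trajectory match degrades."""
    path = [start]
    stack = []   # saved (path_so_far, remaining_alternatives) decision points
    for _ in range(max_steps):
        if goal_test(path[-1]):
            return path
        candidates = sorted(neighbors(path[-1]), key=local_score, reverse=True)
        if not candidates:
            if not stack:
                return path          # dead end, nothing left to try
            path, candidates = stack.pop()
        best, rest = candidates[0], candidates[1:]
        if rest:
            stack.append((list(path), rest))   # remember the alternatives
        new_path = path + [best]
        # backtrack if extending the trajectory lowers the global match score
        if stack and global_score(new_path) < global_score(path):
            path, rest = stack.pop()
            new_path = path + [rest[0]]
            if rest[1:]:
                stack.append((list(path), rest[1:]))
        path = new_path
    return path
```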


Author(s):  
Santosh Kumar Mishra ◽  
Rijul Dhir ◽  
Sriparna Saha ◽  
Pushpak Bhattacharyya

Image captioning is the process of generating a textual description of an image that aims to describe the salient parts of the given image. It is an important problem, as it involves computer vision and natural language processing, where computer vision is used for understanding images and natural language processing is used for language modeling. Much work has been done on image captioning for the English language. In this article, we have developed a model for image captioning in the Hindi language. Hindi is the official language of India and the fourth most spoken language in the world, spoken in India and South Asia. To the best of our knowledge, this is the first attempt to generate image captions in the Hindi language. A dataset is manually created by translating the well-known MSCOCO dataset from English to Hindi. Finally, different types of attention-based architectures are developed for image captioning in the Hindi language. These attention mechanisms are new for the Hindi language, as they have never been used for Hindi before. The results of the proposed model are compared with several baselines in terms of BLEU scores, and they show that our model performs better than the others. Manual evaluation of the obtained captions in terms of adequacy and fluency also reveals the effectiveness of our proposed approach. Availability of resources: the code for this article is available at https://github.com/santosh1821cs03/Image_Captioning_Hindi_Language ; the dataset will be made available at http://www.iitp.ac.in/∼ai-nlp-ml/resources.html .
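Since BLEU operates on tokens, the comparison applies to Devanagari text unchanged; a tiny NLTK example with toy Hindi captions:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# One reference caption and one hypothesis, both pre-tokenized Hindi text.
references = [[["एक", "कुत्ता", "घास", "पर", "दौड़", "रहा", "है"]]]
hypotheses = [["एक", "कुत्ता", "घास", "में", "दौड़", "रहा", "है"]]

smooth = SmoothingFunction().method1
print(corpus_bleu(references, hypotheses, smoothing_function=smooth))
```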


Author(s):  
Chaitrali Prasanna Chaudhari ◽  
Satish Devane

“Image Captioning is the process of generating a textual description of an image.” It deploys both computer vision and natural language processing for caption generation. However, the majority of image captioning systems offer unclear depictions of objects like “man”, “woman”, “group of people”, “building”, etc. Hence, this paper develops an intelligent image captioning model. The adopted model comprises a few steps: word generation, sentence formation, and caption generation. Initially, the input image is subjected to a deep learning classifier, a Convolutional Neural Network (CNN). Since the classifier is already trained on the relevant words related to all images, it can easily classify the words associated with the given image. Further, a set of sentences is formed from the generated words using a Long Short-Term Memory (LSTM) model. The likelihood of the formed sentences is computed using the maximum likelihood (ML) function, and the sentences with higher probability are selected and used to generate the final caption describing the visual scene. As a major novelty, this paper aims to enhance the performance of the CNN by optimally tuning its weights and activation function. To this end, it introduces a new enhanced optimization algorithm, Rider with Randomized Bypass and Over-taker update (RR-BOU), an enhanced version of the Rider Optimization Algorithm (ROA). Finally, the performance of the proposed captioning model is compared with other conventional models through statistical analysis.
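A simplified sketch of the likelihood-based sentence selection step; word_prob is a hypothetical stand-in for the trained LSTM language model:

```python
import math

def sentence_log_likelihood(sentence, word_prob):
    # sum of log P(word_t | word_1..t-1) under the language model
    return sum(math.log(word_prob(sentence[:i], w))
               for i, w in enumerate(sentence))

def select_caption(candidate_sentences, word_prob):
    # keep the candidate sentence with the highest likelihood
    return max(candidate_sentences,
               key=lambda s: sentence_log_likelihood(s, word_prob))

candidates = [["a", "man", "riding", "a", "horse"],
              ["a", "person", "on", "animal"]]
uniform = lambda context, w: 0.1   # toy stand-in probability
print(select_caption(candidates, uniform))
```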

