Image Caption Generation via Unified Retrieval and Generation-Based Method

Shanshan Zhao; Lixiang Li; Haipeng Peng; Zihang Yang; Jiaxuan Zhang

doi:10.3390/app10186235

Image Caption Generation via Unified Retrieval and Generation-Based Method

Applied Sciences ◽

10.3390/app10186235 ◽

2020 ◽

Vol 10 (18) ◽

pp. 6235

Author(s):

Shanshan Zhao ◽

Lixiang Li ◽

Haipeng Peng ◽

Zihang Yang ◽

Jiaxuan Zhang

Keyword(s):

State Of The Art ◽

Target Language ◽

Visual Features ◽

Source Image ◽

Image Captioning ◽

Data Set ◽

Advantages And Disadvantages ◽

Benchmark Data ◽

Textual Features ◽

Almost All

Image captioning is a multi-modal transduction task that translates a source image into the target language. Most dominant approaches employ either a generation-based or a retrieval-based method, and each kind of framework has its advantages and disadvantages. In this work, we combine their respective advantages. We adopt the retrieval-based approach to search for visually similar images and their corresponding captions for each queried image in the MSCOCO data set. Based on the retrieved similar sequences and the visual features of the queried image, the proposed de-noising module yields a set of attended textual features, which bring additional textual information to the generation-based model. Finally, the decoder uses both the visual features and the textual features to generate the output descriptions. Additionally, the incorporated visual encoder and the de-noising module can be applied as a preprocessing component for decoder-based attention mechanisms. We evaluate the proposed method on the MSCOCO benchmark data set. Extensive experiments yield state-of-the-art performance, and the incorporated module improves the baseline models on almost all evaluation metrics.
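
The retrieval step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gallery layout, the use of cosine similarity, and the names `cosine` and `retrieve_captions` are assumptions made only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_captions(query_feat, gallery, k=2):
    """Return the captions of the k gallery images most similar to the query.

    gallery is a list of (feature_vector, [captions]) pairs. Both this layout
    and the similarity measure are hypothetical, chosen for illustration.
    """
    ranked = sorted(gallery, key=lambda item: cosine(query_feat, item[0]), reverse=True)
    captions = []
    for _, caps in ranked[:k]:
        captions.extend(caps)
    return captions

# Toy gallery with made-up three-dimensional "visual features".
gallery = [
    ([1.0, 0.0, 0.0], ["a dog runs on grass"]),
    ([0.9, 0.1, 0.0], ["a puppy plays outside"]),
    ([0.0, 0.0, 1.0], ["a plate of food on a table"]),
]
retrieved = retrieve_captions([1.0, 0.05, 0.0], gallery, k=2)
```

In a pipeline like the one the abstract outlines, the retrieved captions would then be encoded as textual features and filtered before reaching the decoder; that part is not sketched here.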

A General Approach to Multimodal Document Quality Assessment

Journal of Artificial Intelligence Research ◽

10.1613/jair.1.11647 ◽

2020 ◽

Vol 68 ◽

pp. 607-632

Author(s):

Aili Shen ◽

Bahar Salehi ◽

Jianzhong Qi ◽

Timothy Baldwin

Keyword(s):

Quality Assessment ◽

State Of The Art ◽

Feature Learning ◽

Structural Features ◽

Joint Model ◽

Visual Features ◽

General Applicability ◽

Textual Features ◽

Text Content

The perceived quality of a document is affected by various factors, including grammaticality, readability, stylistics, and expertise depth, making the task of document quality assessment a complex one. In this paper, we explore this task in the context of assessing the quality of Wikipedia articles and academic papers. Observing that the visual rendering of a document can capture implicit quality indicators that are not present in the document text (such as images, font choices, and visual layout), we propose a joint model that combines the text content with a visual rendering of the document for document quality assessment. Our joint model achieves state-of-the-art results over five datasets in two domains (Wikipedia and academic papers), which demonstrates the complementarity of textual and visual features, and the general applicability of our model. To examine what kinds of features our model has learned, we further train our model in a multi-task learning setting, where document quality assessment is the primary task and feature learning is an auxiliary task. Experimental results show that visual embeddings are better at learning structural features while textual embeddings are better at learning readability scores, which further verifies the complementarity of visual and textual features.

Boosted Transformer for Image Captioning

Applied Sciences ◽

10.3390/app9163260 ◽

2019 ◽

Vol 9 (16) ◽

pp. 3260 ◽

Cited By ~ 1

Author(s):

Jiangyun Li ◽

Peng Yao ◽

Longteng Guo ◽

Weicun Zhang

Keyword(s):

Visual Information ◽

State Of The Art ◽

The Self ◽

Visual Features ◽

Image Captioning ◽

Decoder Architecture ◽

Semantic Concepts ◽

Transformer Model ◽

Internal Relationships ◽

Auxiliary Module

Image captioning attempts to generate a description of a given image. It usually takes a Convolutional Neural Network as the encoder to extract the visual features and a sequence model as the decoder to generate descriptions; among sequence models, the self-attention mechanism has recently made notable progress. However, this predominant encoder-decoder architecture still has some problems to be solved. On the encoder side, without the semantic concepts, the extracted visual features do not make full use of the image information. On the decoder side, the sequence self-attention relies only on word representations, so it lacks the guidance of visual information and is easily influenced by the language prior. In this paper, we propose a novel boosted transformer model with two attention modules that address these problems, i.e., “Concept-Guided Attention” (CGA) and “Vision-Guided Attention” (VGA). Our model utilizes CGA in the encoder to obtain boosted visual features by integrating the instance-level concepts into the visual features. In the decoder, we stack VGA, which uses the visual information as a bridge to model internal relationships among the sequences and can serve as an auxiliary module of sequence self-attention. Quantitative and qualitative results on the Microsoft COCO dataset demonstrate that our model outperforms state-of-the-art approaches.
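
As background for the attention modules mentioned in this abstract, the following is a minimal sketch of generic scaled dot-product attention, in which a word vector attends over image-region vectors. It is not the CGA or VGA module from the paper; the function names and the toy vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector.

    Returns the attention weights and the weighted sum of the values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context

# A toy word representation attending over two toy image-region features.
word = [1.0, 0.0]
regions = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend(word, regions, regions)
```

The region that aligns with the word vector receives the larger weight, which is the basic way visual information can guide a decoder step.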

A Narrow Deep Learning Assisted Visual Tracking with Joint Features

Mathematical Problems in Engineering ◽

10.1155/2020/8659890 ◽

2020 ◽

Vol 2020 ◽

pp. 1-9

Author(s):

Xiaoyan Qian ◽

Daihao Zhang

Keyword(s):

Visual Tracking ◽

Model Updating ◽

State Of The Art ◽

Gaussian Mixture ◽

Background Information ◽

Robust Tracking ◽

Data Set ◽

Tracking Method ◽

Benchmark Data ◽

Updating Procedure

A robust tracking method is proposed for complex visual sequences. In contrast to the time-consuming offline training used in current deep tracking, we design a simple two-layer online learning network that fuses local convolution features and global handcrafted features to give a robust representation for visual tracking. The target state estimation is modeled by an adaptive Gaussian mixture. The motion information is used to direct the distribution of the candidate samples effectively. Meanwhile, adaptive scale selection is applied to avoid bringing in extra background information. A corresponding object template model updating procedure is developed to account for possible occlusion and minor changes. Our tracking method has a light structure and performs favorably against several state-of-the-art methods in challenging tracking scenarios on a recent tracking benchmark data set.

Identifying Human Behavious Using Deep Trajectory Descriptors

10.20944/preprints201905.0350.v1 ◽

2019 ◽

Author(s):

Tauseef Ali ◽

Eissa Alreshidi

Keyword(s):

State Of The Art ◽

Research Problem ◽

Structural Relationship ◽

Data Set ◽

Human Actions ◽

Benchmark Data ◽

Complex Scenes ◽

Challenging Research ◽

Spatial Coordinates ◽

Number Of Segments

Identifying human actions in complex scenes is widely considered a challenging research problem due to unpredictable behaviors and variations in appearance and posture. Trajectories provide a meaningful way to extract variations in motion and posture. However, simple trajectories are normally represented by a vector of spatial coordinates. In order to identify human actions, we must exploit the structural relationship between different trajectories. In this paper, we propose a method that divides the video into N segments and extracts trajectories for each segment. We then compute a trajectory descriptor for each segment, which captures the structural relationship among the different trajectories in the video segment. For the trajectory descriptor, we project all extracted trajectories onto a canvas. This results in a texture image that stores the relative motion and structural relationship among the trajectories. We then train a Convolutional Neural Network (CNN) to capture and learn the representation from dense trajectories. Experimental results show that our proposed method outperforms state-of-the-art methods by 90.01% on a benchmark data set.
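
The segment-and-project idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the canvas stores hit counts per pixel, trajectories are lists of (x, y) points, and the helper names `split_segments` and `rasterize` are hypothetical rather than taken from the paper.

```python
def split_segments(frames, n):
    """Split a list of frames into n contiguous segments of near-equal length."""
    size, extra = divmod(len(frames), n)
    segments, start = [], 0
    for i in range(n):
        # The first `extra` segments take one additional frame each.
        end = start + size + (1 if i < extra else 0)
        segments.append(frames[start:end])
        start = end
    return segments

def rasterize(trajectories, width, height):
    """Project trajectories onto a canvas of the given size.

    Each cell counts how many trajectory points landed on it; points that
    fall outside the canvas are ignored.
    """
    canvas = [[0] * width for _ in range(height)]
    for traj in trajectories:
        for x, y in traj:
            col, row = int(round(x)), int(round(y))
            if 0 <= col < width and 0 <= row < height:
                canvas[row][col] += 1
    return canvas

segments = split_segments(list(range(7)), 3)
canvas = rasterize([[(0, 0), (1, 1)], [(1, 1), (5, 5)]], width=3, height=3)
```

Each segment would yield one such canvas image, which could then be passed to a CNN as the abstract describes.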

Platelet monitoring for PCI

Hämostaseologie ◽

10.1055/s-0037-1617137 ◽

2009 ◽

Vol 29 (04) ◽

pp. 376-380 ◽

Cited By ~ 9

Author(s):

D. Capodanno ◽

D. J. Angiolillo

Keyword(s):

Interindividual Variability ◽

State Of The Art ◽

Coronary Intervention ◽

Functional Tests ◽

Treatment Regimens ◽

Advantages And Disadvantages ◽

Coronary Syndrome ◽

Combined Use ◽

Increased Risk ◽

Percutaneous Coronary

Despite the clinical benefit associated with the combined use of aspirin and clopidogrel in patients with acute coronary syndrome or those undergoing percutaneous coronary intervention, considerable interindividual variability in response to these drugs has been consistently reported. There is a growing interest in applying platelet functional tests with the goal of identifying patients at increased risk of recurrent ischaemic events and potentially tailoring antiplatelet treatment regimens. This manuscript will review the state of the art on the most commonly available platelet functional tests, describing their advantages and disadvantages and exploring their applicability in clinical practice.

A survey: which features are required for dynamic visual simultaneous localization and mapping?

Visual Computing for Industry Biomedicine and Art ◽

10.1186/s42492-021-00086-w ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Zewen Xu ◽

Zheng Rong ◽

Yihong Wu

Keyword(s):

Simultaneous Localization And Mapping ◽

Dynamic Environments ◽

Visual Features ◽

Advantages And Disadvantages ◽

Intelligent Robots ◽

Localization And Mapping ◽

High Level ◽

Static World ◽

Robotic Applications ◽

Significant Attention

In recent years, simultaneous localization and mapping in dynamic environments (dynamic SLAM) has attracted significant attention from both academia and industry. Some pioneering work on this technique has expanded the potential of robotic applications. Compared to standard SLAM under the static world assumption, dynamic SLAM divides features into static and dynamic categories and leverages each type of feature properly. Therefore, dynamic SLAM can provide more robust localization for intelligent robots that operate in complex dynamic environments. Additionally, to meet the demands of some high-level tasks, dynamic SLAM can be integrated with multiple object tracking. This article presents a survey on dynamic SLAM from the perspective of feature choices. A discussion of the advantages and disadvantages of different visual features is provided in this article.

Efficient Rank-Based Diffusion Process with Assured Convergence

Journal of Imaging ◽

10.3390/jimaging7030049 ◽

2021 ◽

Vol 7 (3) ◽

pp. 49

Author(s):

Daniel Carlos Guimarães Pedronette ◽

Lucas Pascotti Valem ◽

Longin Jan Latecki

Keyword(s):

Diffusion Process ◽

Learning Strategies ◽

State Of The Art ◽

Representation Learning ◽

Theoretical Background ◽

High Dimensional ◽

Visual Features ◽

Learning Approaches ◽

Previous Decade ◽

Asymptotic Complexity

Visual features and representation learning strategies experienced huge advances in the previous decade, mainly supported by deep learning approaches. However, retrieval tasks are still performed mainly based on traditional pairwise dissimilarity measures, while the learned representations lie on high dimensional manifolds. With the aim of going beyond pairwise analysis, post-processing methods have been proposed to replace pairwise measures with globally defined measures, capable of analyzing collections in terms of the underlying data manifold. The most representative approaches are diffusion and rank-based methods. While the diffusion approaches can be computationally expensive, the rank-based methods lack theoretical background. In this paper, we propose an efficient Rank-based Diffusion Process which combines both approaches and avoids the drawbacks of each one. The obtained method is capable of efficiently approximating a diffusion process by exploiting rank-based information, while assuring its convergence. The algorithm exhibits very low asymptotic complexity and can be computed regionally, making it suitable for out-of-dataset queries. An experimental evaluation conducted for image retrieval and person re-ID tasks on diverse datasets demonstrates the effectiveness of the proposed approach with results comparable to the state-of-the-art.
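
To make the contrast between pairwise and rank-based measures concrete, here is a minimal sketch of one simple rank-based similarity: the Jaccard overlap of top-k neighbour sets. It is a swapped-in illustration, not the Rank-based Diffusion Process proposed in the paper, and the function names and distance matrix are hypothetical.

```python
def ranked_lists(dist, k):
    """For each item, return the indices of its k nearest items (itself included)."""
    n = len(dist)
    return [sorted(range(n), key=lambda j: dist[i][j])[:k] for i in range(n)]

def jaccard_rank_similarity(dist, k):
    """Similarity of two items = Jaccard overlap of their top-k neighbour sets."""
    tops = [set(r) for r in ranked_lists(dist, k)]
    n = len(dist)
    return [
        [len(tops[i] & tops[j]) / len(tops[i] | tops[j]) for j in range(n)]
        for i in range(n)
    ]

# Toy pairwise distance matrix: items 0-2 form a cluster, item 3 is far away.
dist = [
    [0.0, 1.0, 1.2, 9.0],
    [1.0, 0.0, 0.8, 9.5],
    [1.2, 0.8, 0.0, 8.0],
    [9.0, 9.5, 8.0, 0.0],
]
sim = jaccard_rank_similarity(dist, k=3)
```

Items in the same cluster share their whole neighbourhood and score 1.0, while the outlier shares only part of it, so the measure reflects the structure of the collection rather than a single pairwise distance.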

A Systematic Review of Recommender Systems and Their Applications in Cybersecurity

Sensors ◽

10.3390/s21155248 ◽

2021 ◽

Vol 21 (15) ◽

pp. 5248

Author(s):

Aleksandra Pawlicka ◽

Marek Pawlicki ◽

Rafał Kozik ◽

Ryszard S. Choraś

Keyword(s):

Systematic Review ◽

Recommender Systems ◽

Recommender System ◽

State Of The Art ◽

The State ◽

Advantages And Disadvantages ◽

Comprehensive Survey ◽

Security Concerns ◽

Valuable Role

This paper discusses the valuable role recommender systems may play in cybersecurity. First, a comprehensive overview of recommender system types is given, together with their advantages and disadvantages, possible applications and security concerns. Then, the paper collects and presents the state of the art concerning the use of recommender systems in cybersecurity; both the existing solutions and future ideas are presented. The contribution of this paper is two-fold. First, to the best of our knowledge, no work to date has collected the applications of recommenders for cybersecurity. Second, this paper attempts to complete a comprehensive survey of recommender types, after noticing that other works usually mention two or three types at once and neglect the others.

A Benchmark and Evaluation of Non-Rigid Structure from Motion

International Journal of Computer Vision ◽

10.1007/s11263-020-01406-y ◽

2020 ◽

Author(s):

Sebastian Hoppe Nesgaard Jensen ◽

Mads Emil Brix Doest ◽

Henrik Aanæs ◽

Alessio Del Bue

Keyword(s):

Computer Vision ◽

Structure From Motion ◽

State Of The Art ◽

The State ◽

Quality Data ◽

Data Set ◽

Rigid Structure ◽

Public Data ◽

3D Information ◽

Further Development

Non-rigid structure from motion (nrsfm) is a long-standing and central problem in computer vision, and its solution is necessary for obtaining 3D information from multiple images when the scene is dynamic. A main issue regarding the further development of this important computer vision topic is the lack of high quality data sets. We here address this issue by presenting a data set created for this purpose, which is made publicly available and is considerably larger than the previous state of the art. To validate the applicability of this data set, and to provide an investigation into the state of the art of nrsfm, including potential directions forward, we here present a benchmark and a scrupulous evaluation using this data set. This benchmark evaluates 18 different methods with available code that reasonably span the state of the art in sparse nrsfm. This new public data set and evaluation protocol will provide benchmark tools for further development in this challenging field.

Summer Research Placements – State-of-the-Art Science by pre-University Students

MRS Advances ◽

10.1557/adv.2016.128 ◽

2016 ◽

Vol 1 (56) ◽

pp. 3715-3720 ◽

Cited By ~ 1

Author(s):

R. A. Sporea ◽

S. Lygo-Baker

Keyword(s):

University Students ◽

State Of The Art ◽

Group Learning ◽

Student Interaction ◽

Research Tool ◽

Skills Acquisition ◽

Advantages And Disadvantages ◽

Effective Training ◽

Training And Research ◽

Student Training

Summer research placements are an effective training and research tool. Over three years, our group has hosted nine pre-university students over periods of four to six weeks. Apart from student training and skills acquisition, the placements have produced several peer-reviewed technical publications. Our approach relies on careful pre-planning of activities and frequent student interaction, coupled with independent and group learning. We explore the advantages and disadvantages of this manner of running summer placements.
