An efficient deep learning‐based video captioning framework using multi‐modal features

Olympic Games Event Recognition via Transfer Learning with Photobombing Guided Data Augmentation

Journal of Imaging ◽

10.3390/jimaging7020012 ◽

2021 ◽

Vol 7 (2) ◽

pp. 12

Author(s):

Yousef I. Mohamad ◽

Samah S. Baraheem ◽

Tam V. Nguyen

Keyword(s):

Deep Learning ◽

Transfer Learning ◽

Data Augmentation ◽

Olympic Games ◽

Event Recognition ◽

Surveillance Systems ◽

Video Captioning ◽

Practical Applications ◽

Sport Events ◽

The Olympic Games

Automatic event recognition in sports photos is both an interesting and valuable research topic in the field of computer vision and deep learning. With the rapid increase and the explosive spread of data, which is being captured momentarily, the need for fast and precise access to the right information has become a challenging task with considerable importance for multiple practical applications, i.e., sports image and video search, sport data analysis, healthcare monitoring applications, monitoring and surveillance systems for indoor and outdoor activities, and video captioning. In this paper, we evaluate different deep learning models in recognizing and interpreting the sport events in the Olympic Games. To this end, we collect a dataset dubbed Olympic Games Event Image Dataset (OGED) including 10 different sport events scheduled for the Olympic Games Tokyo 2020. Then, the transfer learning is applied on three popular deep convolutional neural network architectures, namely, AlexNet, VGG-16 and ResNet-50 along with various data augmentation methods. Extensive experiments show that ResNet-50 with the proposed photobombing guided data augmentation achieves 90% in terms of accuracy.

Download Full-text

Video Captioning Based on Both Egocentric and Exocentric Views of Robot Vision for Human-Robot Interaction

International Journal of Social Robotics ◽

10.1007/s12369-021-00842-1 ◽

2021 ◽

Author(s):

Soo-Han Kang ◽

Ji-Hyeong Han

Keyword(s):

Deep Learning ◽

Natural Language ◽

Vision System ◽

Robot Vision ◽

Learning Model ◽

Human Robot Interaction ◽

Robot Interaction ◽

Video Captioning ◽

Global Action ◽

Deep Learning Model

AbstractRobot vision provides the most important information to robots so that they can read the context and interact with human partners successfully. Moreover, to allow humans recognize the robot’s visual understanding during human-robot interaction (HRI), the best way is for the robot to provide an explanation of its understanding in natural language. In this paper, we propose a new approach by which to interpret robot vision from an egocentric standpoint and generate descriptions to explain egocentric videos particularly for HRI. Because robot vision equals to egocentric video on the robot’s side, it contains as much egocentric view information as exocentric view information. Thus, we propose a new dataset, referred to as the global, action, and interaction (GAI) dataset, which consists of egocentric video clips and GAI descriptions in natural language to represent both egocentric and exocentric information. The encoder-decoder based deep learning model is trained based on the GAI dataset and its performance on description generation assessments is evaluated. We also conduct experiments in actual environments to verify whether the GAI dataset and the trained deep learning model can improve a robot vision system

Download Full-text

Exploring Video Captioning Techniques: A Comprehensive Survey on Deep Learning Methods

SN Computer Science ◽

10.1007/s42979-021-00487-x ◽

2021 ◽

Vol 2 (2) ◽

Author(s):

Saiful Islam ◽

Aurpan Dash ◽

Ashek Seum ◽

Amir Hossain Raj ◽

Tonmoy Hossain ◽

...

Keyword(s):

Deep Learning ◽

Learning Methods ◽

Video Captioning ◽

Comprehensive Survey

Download Full-text

Deep Learning Based Video Captioning in Bengali

10.23919/icac50006.2021.9594154 ◽

2021 ◽

Author(s):

Amir Hossain Raj ◽

Ashek Seum ◽

Aurpan Dash ◽

Saiful Islam ◽

Faisal Muhammad Shah

Keyword(s):

Deep Learning ◽

Video Captioning

Download Full-text

Video Captioning using Deep Learning: An Overview of Methods, Datasets and Metrics

2019 International Conference on Communication and Signal Processing (ICCSP) ◽

10.1109/iccsp.2019.8698097 ◽

2019 ◽

Cited By ~ 1

Author(s):

M. Amaresh ◽

S. Chitrakala

Keyword(s):

Deep Learning ◽

Video Captioning

Download Full-text

Deep Learning for Video Captioning: A Review

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/877 ◽

2019 ◽

Cited By ~ 1

Author(s):

Shaoxiang Chen ◽

Ting Yao ◽

Yu-Gang Jiang

Keyword(s):

Deep Learning ◽

Natural Language ◽

Language Processing ◽

State Of The Art ◽

Future Research ◽

Research Directions ◽

Video Captioning ◽

Future Research Directions ◽

Review State ◽

Vision And Language

Deep learning has achieved great successes in solving specific artificial intelligence problems recently. Substantial progresses are made on Computer Vision (CV) and Natural Language Processing (NLP). As a connection between the two worlds of vision and language, video captioning is the task of producing a natural-language utterance (usually a sentence) that describes the visual content of a video. The task is naturally decomposed into two sub-tasks. One is to encode a video via a thorough understanding and learn visual representation. The other is caption generation, which decodes the learned representation into a sequential sentence, word by word. In this survey, we first formulate the problem of video captioning, then review state-of-the-art methods categorized by their emphasis on vision or language, and followed by a summary of standard datasets and representative approaches. Finally, we highlight the challenges which are not yet fully understood in this task and present future research directions.

Download Full-text

Towards Unified Deep Learning Model for NSFW Image and Video Captioning

Lecture Notes in Electrical Engineering - Advanced Multimedia and Ubiquitous Engineering ◽

10.1007/978-981-13-1328-8_8 ◽

2018 ◽

pp. 57-63

Author(s):

Jong-Won Ko ◽

Dong-Hyun Hwang

Keyword(s):

Deep Learning ◽

Learning Model ◽

Video Captioning ◽

Deep Learning Model

Download Full-text

Video Captioning Based on Joint Image–Audio Deep Learning Techniques

2019 IEEE 9th International Conference on Consumer Electronics (ICCE-Berlin) ◽

10.1109/icce-berlin47944.2019.8966173 ◽

2019 ◽

Cited By ~ 1

Author(s):

Chien-Yao Wang ◽

Pei-Sin Liaw ◽

Kai-Wen Liang ◽

Jai-Ching Wang ◽

Pao-Chi Chang

Keyword(s):

Deep Learning ◽

Video Captioning ◽

Learning Techniques

Download Full-text

A Deep Structured Model for Video Captioning

International Journal of Gaming and Computer-Mediated Simulations ◽

10.4018/ijgcms.2020040103 ◽

2020 ◽

Vol 12 (2) ◽

pp. 44-56

Author(s):

V. Vinodhini ◽

B. Sathiyabhama ◽

S. Sankar ◽

Ramasubbareddy Somula

Keyword(s):

Deep Learning ◽

Activity Recognition ◽

Short Term Memory ◽

Gaussian Mixture ◽

Activation Function ◽

Vital Role ◽

Quality Data ◽

Impaired Hearing ◽

Structured Model ◽

Video Captioning

Video captions help people to understand in a noisy environment or when the sound is muted. It helps people having impaired hearing to understand much better. Captions not only support the content creators and translators but also boost the search engine optimization. Many advanced areas like computer vision and human-computer interaction play a vital role as there is a successful growth of deep learning techniques. Numerous surveys on deep learning models are evolved with different methods, architecture, and metrics. Working with video subtitles is still challenging in terms of activity recognition in video. This paper proposes a deep structured model that is effective towards activity recognition, automatically classifies and caption it in a single architecture. The first process includes subtracting the foreground from the background; this is done by building a 3D convolutional neural network (CNN) model. A Gaussian mixture model is used to remove the backdrop. The classification is done using long short-term memory networks (LSTM). A hidden Markov model (HMM) is used to generate the high quality data. Next, it uses the nonlinear activation function to perform the normalization process. Finally, the video captioning is achieved by using natural language.

Download Full-text