Discovering Sentimental Interaction via Graph Convolutional Network for Visual Sentiment Prediction

Lifang Wu; Heng Zhang; Sinuo Deng; Ge Shi; Xu Liu

doi:10.3390/app11041404

Discovering Sentimental Interaction via Graph Convolutional Network for Visual Sentiment Prediction

Applied Sciences ◽

10.3390/app11041404 ◽

2021 ◽

Vol 11 (4) ◽

pp. 1404

Author(s):

Lifang Wu ◽

Heng Zhang ◽

Sinuo Deng ◽

Ge Shi ◽

Xu Liu

Keyword(s):

Sentiment Analysis ◽

State Of The Art ◽

Saliency Detection ◽

Visual Features ◽

Improve Performance ◽

Convolutional Network ◽

Visual Element ◽

High Level ◽

High Level Abstraction ◽

Interaction Characteristic

With the growing popularity of expressing opinions online, automatic sentiment analysis of images has gained considerable attention. Most methods focus on effectively extracting the sentimental features of images, such as enhancing local features through saliency detection or instance segmentation tools. However, as a high-level abstraction, sentiment is difficult to capture accurately from visual elements because of the “affective gap”. Previous works have overlooked the contribution of the interaction among objects to the image sentiment. We aim to utilize the interactive characteristics of objects in the sentimental space, inspired by the human sentimental principle that each object contributes to the sentiment. To achieve this goal, we propose a framework that leverages the sentimental interaction characteristic based on a Graph Convolutional Network (GCN). We first utilize an off-the-shelf tool to recognize objects and build a graph over them. Visual features represent nodes, and the emotional distances between objects act as edges. Then, we employ GCNs to obtain the interaction features among objects, which are fused with the CNN output of the whole image to predict the final results. Experimental results show that our method exceeds the state-of-the-art algorithm, demonstrating that the rational use of interaction features can improve performance for sentiment analysis.
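
The graph step described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the widely used symmetric-normalized propagation rule for graph convolution, turns a hypothetical emotional-distance matrix into edge weights with exp(-d), and uses mean pooling plus concatenation as the fusion step. All names, sizes, and random values are hypothetical.

```python
import numpy as np

def gcn_layer(features, adjacency, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])   # add self-loops
    degree = a_hat.sum(axis=1)                       # always > 0 thanks to self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))
    propagated = d_inv_sqrt @ a_hat @ d_inv_sqrt @ features @ weight
    return np.maximum(propagated, 0.0)               # ReLU

rng = np.random.default_rng(0)
num_objects, feat_dim, hidden_dim, cnn_dim = 4, 8, 5, 6

# Hypothetical visual features of the detected objects (graph nodes).
node_features = rng.normal(size=(num_objects, feat_dim))

# Hypothetical symmetric emotional distances between objects.
raw = rng.uniform(0.0, 2.0, size=(num_objects, num_objects))
distances = (raw + raw.T) / 2.0

# Assumption: smaller emotional distance means a stronger edge.
adjacency = np.exp(-distances)
np.fill_diagonal(adjacency, 0.0)                     # self-loops are added in the layer

weight = rng.normal(size=(feat_dim, hidden_dim))     # learnable in a real model
interaction = gcn_layer(node_features, adjacency, weight)

# Assumed fusion: pool node features and concatenate with a whole-image CNN feature.
graph_feature = interaction.mean(axis=0)
cnn_feature = rng.normal(size=cnn_dim)               # stand-in for the CNN output
fused = np.concatenate([graph_feature, cnn_feature])
```

In a trained model, `weight` would be learned and `fused` would feed a classifier; here the point is only how edge weights shape the mixing of object features.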

Efficient End-to-End Sentence-Level Lipreading with Temporal Convolutional Networks

Applied Sciences ◽

10.3390/app11156975 ◽

2021 ◽

Vol 11 (15) ◽

pp. 6975

Author(s):

Tao Zhang ◽

Lun He ◽

Xudong Li ◽

Guoqing Feng

Keyword(s):

Performance Improvement ◽

State Of The Art ◽

Error Rates ◽

Convolutional Network ◽

Convolutional Networks ◽

Sentence Level ◽

End To End ◽

High Level ◽

Improved Accuracy ◽

Talking Face

Lipreading aims to recognize sentences being spoken by a talking face. In recent years, lipreading methods have achieved a high level of accuracy on large datasets and made breakthrough progress. However, lipreading is still far from being solved: existing methods tend to have high error rates on in-the-wild data and suffer from vanishing gradients during training and slow convergence. To overcome these problems, we propose an efficient end-to-end sentence-level lipreading model, using an encoder based on a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder. More importantly, the proposed architecture incorporates the TCN as a feature learner to decode features. It can partly eliminate the vanishing-gradient and insufficient-performance defects of RNNs (LSTM, GRU), which yields a notable performance improvement as well as faster convergence. Experiments show that training and convergence are 50% faster than with the state-of-the-art method, and accuracy improves by 2.4% on the GRID dataset.
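
To illustrate why a TCN avoids the recurrence that causes vanishing gradients in RNNs, here is a small sketch of stacked dilated convolutions. It assumes the causal, dilated formulation commonly associated with TCNs; the abstract does not state which variant the authors use, and the function names and toy kernel are hypothetical.

```python
def causal_dilated_conv(sequence, kernel, dilation=1):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], treating x as zero before t = 0."""
    output = []
    for t in range(len(sequence)):
        total = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:                      # never look into the future or before the start
                total += w * sequence[idx]
        output.append(total)
    return output

def tcn_stack(sequence, kernel, num_layers):
    """Stack layers with dilation 1, 2, 4, ... and a residual connection per layer."""
    hidden = list(sequence)
    for layer in range(num_layers):
        conv = causal_dilated_conv(hidden, kernel, dilation=2 ** layer)
        hidden = [h + max(c, 0.0) for h, c in zip(hidden, conv)]  # residual + ReLU
    return hidden

signal = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]      # a single impulse at t = 0
single = causal_dilated_conv(signal, kernel=[0.5, 0.5], dilation=2)
stacked = tcn_stack(signal, kernel=[0.5, 0.5], num_layers=2)
```

With a kernel of size 2 and dilations 1 and 2, the impulse at t = 0 influences the first four outputs of `stacked`, which shows how the receptive field grows with depth while every time step is still computed in parallel.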

Saliency Detection by Multilevel Deep Pyramid Model

Journal of Sensors ◽

10.1155/2018/8249180 ◽

2018 ◽

Vol 2018 ◽

pp. 1-11 ◽

Cited By ~ 2

Author(s):

Hai Wang ◽

Lei Dai ◽

Yingfeng Cai ◽

Long Chen ◽

Yong Zhang

Keyword(s):

Background Noise ◽

State Of The Art ◽

Saliency Detection ◽

Saliency Map ◽

Multiple Features ◽

Low Level ◽

Pyramid Model ◽

High Level ◽

Different Levels ◽

Better Than

Traditional salient object detection models are divided into several classes based on low-level features and contrast between pixels. In this paper, we propose a model based on a multilevel deep pyramid (MLDP), which involves fusing multiple features on different levels. Firstly, the MLDP uses the original image as the input for a VGG16 model to extract high-level features and form an initial saliency map. Next, the MLDP further extracts high-level features to form a saliency map based on a deep pyramid. Then, the MLDP obtains the saliency map fused with superpixels by extracting low-level features. After that, the MLDP applies background noise filtering to the saliency map fused with superpixels in order to filter out the interference of background noise and form a saliency map based on the foreground. Lastly, the MLDP combines the saliency map fused with superpixels with the saliency map based on the foreground, which results in the final saliency map. Because it fuses multiple features rather than relying on low-level features alone, the MLDP achieves good results when extracting salient targets. As can be seen in our experiment section, the MLDP is better than the other 7 state-of-the-art models across three different public saliency datasets. Therefore, the MLDP shows superiority and wide applicability in the extraction of salient targets.

Context Modulated Dynamic Networks for Actor and Action Video Segmentation with Language Queries

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6895 ◽

2020 ◽

Vol 34 (07) ◽

pp. 12152-12159

Author(s):

Hao Wang ◽

Cheng Deng ◽

Fan Ma ◽

Yi Yang

Keyword(s):

Video Segmentation ◽

State Of The Art ◽

Dynamic Networks ◽

Specific Region ◽

Visual Features ◽

Convolutional Network ◽

Fine Grained ◽

Convolutional Networks ◽

Benchmark Datasets ◽

Context Features

Actor and action video segmentation with language queries aims to segment the objects in a video that are referred to by a language expression. This process requires comprehensive language reasoning and fine-grained video understanding. Previous methods mainly leverage dynamic convolutional networks to match visual and semantic representations. However, dynamic convolution neglects spatial context when processing each region in the frame and therefore struggles to segment similar objects in complex scenarios. To address this limitation, we construct a context modulated dynamic convolutional network. Specifically, we propose a context modulated dynamic convolutional operation in the proposed framework. The kernels for a specific region are generated from both the language sentence and the surrounding context features. Moreover, we devise a temporal encoder to incorporate motion into the visual features to further match the query descriptions. Extensive experiments on two benchmark datasets, Actor-Action Dataset Sentences (A2D Sentences) and J-HMDB Sentences, demonstrate that our proposed approach notably outperforms state-of-the-art methods.
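
The idea of generating a region-specific kernel from both the sentence and the surrounding context can be sketched as below. This is an illustrative toy under several assumptions, not the paper's architecture: context is taken as a 3x3 neighbourhood mean, the kernel generator is a single linear map followed by tanh, and the dynamic kernel acts as a 1x1 filter. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, channels, lang_dim = 3, 4, 5, 6

visual = rng.normal(size=(height, width, channels))   # per-region visual features
sentence = rng.normal(size=lang_dim)                  # encoded language query

# Assumed context feature: mean over each region's 3x3 neighbourhood (edge-padded).
padded = np.pad(visual, ((1, 1), (1, 1), (0, 0)), mode="edge")
context = np.zeros_like(visual)
for dy in range(3):
    for dx in range(3):
        context += padded[dy:dy + height, dx:dx + width]
context /= 9.0

# Hypothetical kernel generator: maps [sentence; context] to one kernel per region.
generator = rng.normal(size=(lang_dim + channels, channels))
sentence_map = np.broadcast_to(sentence, (height, width, lang_dim))
kernels = np.tanh(np.concatenate([sentence_map, context], axis=-1) @ generator)

# Apply each region's own kernel to that region's visual feature (dynamic 1x1 filter).
response = (kernels * visual).sum(axis=-1)
```

Without the `context` term every region would receive the same sentence-derived kernel; adding it lets two visually similar regions get different kernels when their surroundings differ.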

R³Net: Recurrent Residual Refinement Network for Saliency Detection

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/95 ◽

2018 ◽

Cited By ~ 72

Author(s):

Zijun Deng ◽

Xiaowei Hu ◽

Lei Zhu ◽

Xuemiao Xu ◽

Jing Qin ◽

...

Keyword(s):

Saliency Detection ◽

Ground Truth ◽

Input Image ◽

Convolutional Network ◽

Low Level ◽

Saliency Maps ◽

Saliency Prediction ◽

Benchmark Datasets ◽

Salient Regions ◽

High Level

Saliency detection is a fundamental yet challenging task in computer vision, aiming at highlighting the most visually distinctive objects in an image. We propose a novel recurrent residual refinement network (R^3Net) equipped with residual refinement blocks (RRBs) to more accurately detect salient regions of an input image. Our RRBs learn the residual between the intermediate saliency prediction and the ground truth by alternately leveraging the low-level integrated features and the high-level integrated features of a fully convolutional network (FCN). While the low-level integrated features are capable of capturing more saliency details, the high-level integrated features can reduce non-salient regions in the intermediate prediction. Furthermore, the RRBs can obtain complementary saliency information of the intermediate prediction, and add the residual into the intermediate prediction to refine the saliency maps. We evaluate the proposed R^3Net on five widely used saliency detection benchmarks by comparing it with 16 state-of-the-art saliency detectors. Experimental results show that our network outperforms our competitors on all the benchmark datasets.
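
The refinement loop described above can be written compactly as a recurrence. The notation here is assumed for illustration and is not taken from the abstract: S_0 is an initial saliency prediction, L and H are the low-level and high-level integrated features, and \Phi_t is the residual function learned by the t-th RRB. Which feature set is used first is also an assumption.

```latex
S_t = S_{t-1} + \Phi_t\bigl(\operatorname{Cat}(S_{t-1}, F_t)\bigr),
\qquad
F_t =
\begin{cases}
L, & t \text{ odd} \\
H, & t \text{ even}
\end{cases}
```

Each step therefore only has to predict a correction to the previous map rather than the full saliency map.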

Part-Level Graph Convolutional Network for Skeleton-Based Action Recognition

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6759 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11045-11052

Author(s):

Linjiang Huang ◽

Yan Huang ◽

Wanli Ouyang ◽

Liang Wang

Keyword(s):

Action Recognition ◽

State Of The Art ◽

The State ◽

Body Parts ◽

Convolutional Network ◽

Joint Level ◽

Convolutional Networks ◽

Level Information ◽

Benchmark Datasets ◽

High Level

Recently, graph convolutional networks have achieved remarkable performance for skeleton-based action recognition. In this work, we identify a problem posed by the GCNs for skeleton-based action recognition, namely part-level action modeling. To address this problem, a novel Part-Level Graph Convolutional Network (PL-GCN) is proposed to capture part-level information of skeletons. Different from previous methods, the partition of body parts is learnable rather than manually defined. We propose two part-level blocks, namely Part Relation block (PR block) and Part Attention block (PA block), which are achieved by two differentiable operations, namely graph pooling operation and graph unpooling operation. The PR block aims at learning high-level relations between body parts while the PA block aims at highlighting the important body parts in the action. Integrating the original GCN with the two blocks, the PL-GCN can learn both part-level and joint-level information of the action. Extensive experiments on two benchmark datasets show the state-of-the-art performance on skeleton-based action recognition and demonstrate the effectiveness of the proposed method.
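
Graph pooling and unpooling with a learnable partition can be sketched as follows. This assumes a soft assignment matrix from joints to parts (in the style of differentiable pooling); the abstract does not specify how the PL-GCN parameterizes its operations, so the names, sizes, and random logits are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_joints, num_parts, feat_dim = 6, 2, 3

joint_features = rng.normal(size=(num_joints, feat_dim))

# Learnable in a real model: each row softly assigns one joint to the body parts.
assignment_logits = rng.normal(size=(num_joints, num_parts))
assignment = softmax(assignment_logits, axis=1)

# Graph pooling: aggregate joint features into part-level features.
part_features = assignment.T @ joint_features

# Graph unpooling: distribute part-level features back to the joints.
restored = assignment @ part_features
```

Because both operations are matrix products with a differentiable `assignment`, gradients can flow into the partition itself, which is what makes the grouping of joints learnable rather than hand-defined.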

Deep Learning for text in limited data settings

10.36227/techrxiv.12100692 ◽

2020 ◽

Author(s):

Pathikkumar Patel ◽

Bhargav Lad ◽

Jinan Fiaidhi

Keyword(s):

Machine Learning ◽

Time Series ◽

Deep Learning ◽

Sentiment Analysis ◽

Transfer Learning ◽

Text Classification ◽

State Of The Art ◽

Time Series Forecasting ◽

Text Data ◽

Performance Levels

During the last few years, RNN models have been extensively used and they have proven to be better for sequence and text data. RNNs have achieved state-of-the-art performance levels in several applications such as text classification, sequence to sequence modelling and time series forecasting. In this article we will review different Machine Learning and Deep Learning based approaches for text data and look at the results obtained from these methods. This work also explores the use of transfer learning in NLP and how it affects the performance of models on a specific application of sentiment analysis.

A survey: which features are required for dynamic visual simultaneous localization and mapping?

Visual Computing for Industry Biomedicine and Art ◽

10.1186/s42492-021-00086-w ◽

2021 ◽

Vol 4 (1) ◽

Author(s):

Zewen Xu ◽

Zheng Rong ◽

Yihong Wu

Keyword(s):

Simultaneous Localization And Mapping ◽

Dynamic Environments ◽

Visual Features ◽

Advantages And Disadvantages ◽

Intelligent Robots ◽

Localization And Mapping ◽

High Level ◽

Static World ◽

Robotic Applications ◽

Significant Attention

In recent years, simultaneous localization and mapping in dynamic environments (dynamic SLAM) has attracted significant attention from both academia and industry. Some pioneering work on this technique has expanded the potential of robotic applications. Compared to standard SLAM under the static world assumption, dynamic SLAM divides features into static and dynamic categories and leverages each type of feature properly. Therefore, dynamic SLAM can provide more robust localization for intelligent robots that operate in complex dynamic environments. Additionally, to meet the demands of some high-level tasks, dynamic SLAM can be integrated with multiple object tracking. This article presents a survey on dynamic SLAM from the perspective of feature choices. A discussion of the advantages and disadvantages of different visual features is provided in this article.

Semantic Relation Model and Dataset for Remote Sensing Scene Understanding

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi10070488 ◽

2021 ◽

Vol 10 (7) ◽

pp. 488

Author(s):

Peng Li ◽

Dezheng Zhang ◽

Aziguli Wulamu ◽

Xin Liu ◽

Peng Chen

Keyword(s):

Remote Sensing ◽

Scene Understanding ◽

Deep Understanding ◽

Remote Sensing Images ◽

Convolutional Network ◽

Scene Graph ◽

Multi Scale ◽

Relationship Extraction ◽

High Level ◽

Graph Generation

A deep understanding of our visual world requires more than the isolated perception of a series of objects; the relationships between them also contain rich semantic information. In satellite remote sensing images especially, the span is so large that the various objects are always of different sizes and have complex spatial compositions. Therefore, the recognition of semantic relations is conducive to strengthening the understanding of remote sensing scenes. In this paper, we propose a novel multi-scale semantic fusion network (MSFN). In this framework, dilated convolution is introduced into a graph convolutional network (GCN) based on an attentional mechanism to fuse and refine multi-scale semantic context, which is crucial to strengthening the cognitive ability of our model. Besides, based on the mapping between visual features and semantic embeddings, we design a sparse relationship extraction module to remove meaningless connections among entities and improve the efficiency of scene graph generation. Meanwhile, to further promote research on scene understanding in the remote sensing field, this paper also proposes a remote sensing scene graph dataset (RSSGD). We carry out extensive experiments, and the results show that our model significantly outperforms previous methods on scene graph generation. In addition, RSSGD effectively bridges the huge semantic gap between low-level perception and high-level cognition of remote sensing images.

Image Co-saliency Detection and Instance Co-segmentation using Attention Graph Clustering based Graph Convolutional Network

IEEE Transactions on Multimedia ◽

10.1109/tmm.2021.3054526 ◽

2021 ◽

pp. 1-1

Author(s):

Tengpeng Li ◽

Kaihua Zhang ◽

Shiwen Shen ◽

Bo Liu ◽

Qingshan Liu ◽

...

Keyword(s):

Saliency Detection ◽

Graph Clustering ◽

Convolutional Network

Learning emotional word embeddings for sentiment analysis

Journal of Intelligent & Fuzzy Systems ◽

10.3233/jifs-201993 ◽

2021 ◽

pp. 1-13

Author(s):

Qingtian Zeng ◽

Xishi Zhao ◽

Xiaohui Hu ◽

Hua Duan ◽

Zhongying Zhao ◽

...

Keyword(s):

Sentiment Analysis ◽

Language Processing ◽

State Of The Art ◽

Research Problem ◽

Emotional Word ◽

Classification Model ◽

Data Sets ◽

Word Embeddings ◽

Real World Data ◽

Text Documents

Word embeddings have been successfully applied in many natural language processing tasks due to their effectiveness. However, the state-of-the-art algorithms for learning word representations from large amounts of text documents ignore emotional information, which is a significant research problem that must be addressed. To solve this problem, we propose an emotional word embedding (EWE) model for sentiment analysis in this paper. This method first applies pre-trained word vectors to represent document features using two different linear weighting methods. Then, the resulting document vectors are input to a classification model and used to train a text sentiment classifier, which is based on a neural network. In this way, the emotional polarity of the text is propagated into the word vectors. The experimental results on three kinds of real-world data sets demonstrate that the proposed EWE model achieves superior performance on text sentiment prediction, text similarity calculation, and word emotional expression tasks compared to other state-of-the-art models.
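
Building a document vector as a linear weighting of pre-trained word vectors can be sketched as follows. The abstract does not say which two weighting methods the EWE model uses, so the uniform and emphasis-based weights below are purely hypothetical stand-ins, as are the toy vocabulary and the function name.

```python
def weighted_document_vector(tokens, word_vectors, weights):
    """Weighted average of the vectors of known tokens; unknown tokens are skipped."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in word_vectors:
            continue
        w = weights.get(token, 1.0)          # default weight 1.0 gives a plain average
        for i, value in enumerate(word_vectors[token]):
            total[i] += w * value
        weight_sum += w
    if weight_sum == 0.0:
        return total                         # no known tokens: return the zero vector
    return [value / weight_sum for value in total]

# Toy two-dimensional "pre-trained" vectors (hypothetical values).
word_vectors = {"good": [1.0, 0.0], "movie": [0.0, 1.0], "very": [0.5, 0.5]}
tokens = ["very", "good", "movie"]

uniform = weighted_document_vector(tokens, word_vectors, weights={})
emphasised = weighted_document_vector(
    tokens, word_vectors, weights={"good": 2.0, "movie": 1.0, "very": 1.0}
)
```

Here `uniform` is the plain mean of the three vectors, while `emphasised` shifts toward the vector for "good"; in the described pipeline such document vectors would then be fed to the sentiment classifier.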
