Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition

Shikha Gupta; Krishan Sharma; Dileep Aroor Dinesh; Veena Thenkanidiyoor

doi:10.1145/3436494

Visual Semantic-Based Representation Learning Using Deep CNNs for Scene Recognition

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3436494 ◽

2021 ◽

Vol 17 (2) ◽

pp. 1-24

Author(s):

Shikha Gupta ◽

Krishan Sharma ◽

Dileep Aroor Dinesh ◽

Veena Thenkanidiyoor

Keyword(s):

Visual Information ◽

Experimental Studies ◽

Image Data ◽

Ground Truth ◽

Semantic Content ◽

Representation Learning ◽

Scene Recognition ◽

Spatially Correlated ◽

Subspace Modeling ◽

Core Part

In this work, we address the task of scene recognition from image data. A scene is a spatially correlated arrangement of various visual semantic contents also known as concepts, e.g., “chair,” “car,” “sky,” etc. Representation learning using visual semantic content can be regarded as one of the most trivial ideas as it mimics the human behavior of perceiving visual information. Semantic multinomial (SMN) representation is one such representation that captures semantic information using posterior probabilities of concepts. The core part of obtaining SMN representation is the building of concept models. Therefore, it is necessary to have ground-truth (true) concept labels for every concept present in an image. Moreover, manual labeling of concepts is practically not feasible due to the large number of images in the dataset. To address this issue, we propose an approach for generating pseudo-concepts in the absence of true concept labels. We utilize the pre-trained deep CNN-based architectures where activation maps (filter responses) from convolutional layers are considered as initial cues to the pseudo-concepts. The non-significant activation maps are removed using the proposed filter-specific threshold-based approach that leads to the removal of non-prominent concepts from data. Further, we propose a grouping mechanism to group the same pseudo-concepts using subspace modeling of filter responses to achieve a non-redundant representation. Experimental studies show that generated SMN representation using pseudo-concepts achieves comparable results for scene recognition tasks on standard datasets like MIT-67 and SUN-397 even in the absence of true concept labels.

Download Full-text

Fig Plant Segmentation from Aerial Images Using a Deep Convolutional Encoder-Decoder Network

Remote Sensing ◽

10.3390/rs11101157 ◽

2019 ◽

Vol 11 (10) ◽

pp. 1157 ◽

Cited By ~ 8

Author(s):

Jorge Fuentes-Pacheco ◽

Juan Torres-Olivares ◽

Edgar Roman-Rangel ◽

Salvador Cervantes ◽

Porfirio Juarez-Lopez ◽

...

Keyword(s):

Precision Agriculture ◽

Image Data ◽

Ground Truth ◽

Aerial Images ◽

Aerial Image ◽

Data Set ◽

Visual Appearance ◽

Aerial Robots ◽

Lighting Conditions ◽

Convolutional Encoder

Crop segmentation is an important task in Precision Agriculture, where the use of aerial robots with an on-board camera has contributed to the development of new solution alternatives. We address the problem of fig plant segmentation in top-view RGB (Red-Green-Blue) images of a crop grown under open-field difficult circumstances of complex lighting conditions and non-ideal crop maintenance practices defined by local farmers. We present a Convolutional Neural Network (CNN) with an encoder-decoder architecture that classifies each pixel as crop or non-crop using only raw colour images as input. Our approach achieves a mean accuracy of 93.85% despite the complexity of the background and a highly variable visual appearance of the leaves. We make available our CNN code to the research community, as well as the aerial image data set and a hand-made ground truth segmentation with pixel precision to facilitate the comparison among different algorithms.

Download Full-text

UAV-Driven Structural Crack Detection and Location Determination Using Convolutional Neural Networks

Sensors ◽

10.3390/s21082650 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2650

Author(s):

Daegyun Choi ◽

William Bell ◽

Donghoon Kim ◽

Jichul Kim

Keyword(s):

Visual Information ◽

Crack Detection ◽

Image Data ◽

Image Processing Techniques ◽

Location Determination ◽

Aerial Vehicle ◽

Using Data ◽

Processing Techniques ◽

Deep Learning Model ◽

Structural Cracks

Structural cracks are a vital feature in evaluating the health of aging structures. Inspectors regularly monitor structures’ health using visual information because early detection of cracks on highly trafficked structures is critical for maintaining the public’s safety. In this work, a framework for detecting cracks along with their locations is proposed. Image data provided by an unmanned aerial vehicle (UAV) is stitched using image processing techniques to overcome limitations in the resolution of cameras. This stitched image is analyzed to identify cracks using a deep learning model that makes judgements regarding the presence of cracks in the image. Moreover, cracks’ locations are determined using data from UAV sensors. To validate the system, cracks forming on an actual building are captured by a UAV, and these images are analyzed to detect and locate cracks. The proposed framework is proven as an effective way to detect cracks and to represent the cracks’ locations.

Download Full-text

Fully Data-Driven Pseudohealthy Synthesis for Planning Valve-Sparing Aortic Root Reconstruction using Conditional Variational Autoencoders

Current Directions in Biomedical Engineering ◽

10.1515/cdbme-2020-3072 ◽

2020 ◽

Vol 6 (3) ◽

pp. 284-287

Author(s):

Jannis Hagenah ◽

Mohamad Mehdi ◽

Floris Ernst

Keyword(s):

Aortic Root ◽

Similarity Index ◽

Ground Truth ◽

Representation Learning ◽

Patient Specific ◽

Ultrasound Images ◽

Specific Geometry ◽

The Individual ◽

Native Root ◽

Original Information

AbstractAortic root aneurysm is treated by replacing the dilated root by a grafted prosthesis which mimics the native root morphology of the individual patient. The challenge in predicting the optimal prosthesis size rises from the highly patient-specific geometry as well as the absence of the original information on the healthy root. Therefore, the estimation is only possible based on the available pathological data. In this paper, we show that representation learning with Conditional Variational Autoencoders is capable of turning the distorted geometry of the aortic root into smoother shapes while the information on the individual anatomy is preserved. We evaluated this method using ultrasound images of the porcine aortic root alongside their labels. The observed results show highly realistic resemblance in shape and size to the ground truth images. Furthermore, the similarity index has noticeably improved compared to the pathological images. This provides a promising technique in planning individual aortic root replacement.

Download Full-text

Knowledge-aware Multi-modal Adaptive Graph Convolutional Networks for Fake News Detection

ACM Transactions on Multimedia Computing Communications and Applications ◽

10.1145/3451215 ◽

2021 ◽

Vol 17 (3) ◽

pp. 1-23

Author(s):

Shengsheng Qian ◽

Jun Hu ◽

Quan Fang ◽

Changsheng Xu

Keyword(s):

Social Media ◽

Visual Information ◽

Representation Learning ◽

Fake News ◽

Unified Framework ◽

Model Learning ◽

Convolutional Network ◽

Textual Information ◽

Convolutional Networks ◽

Real World Datasets

In this article, we focus on fake news detection task and aim to automatically identify the fake news from vast amount of social media posts. To date, many approaches have been proposed to detect fake news, which includes traditional learning methods and deep learning-based models. However, there are three existing challenges: (i) How to represent social media posts effectively, since the post content is various and highly complicated; (ii) how to propose a data-driven method to increase the flexibility of the model to deal with the samples in different contexts and news backgrounds; and (iii) how to fully utilize the additional auxiliary information (the background knowledge and multi-modal information) of posts for better representation learning. To tackle the above challenges, we propose a novel Knowledge-aware Multi-modal Adaptive Graph Convolutional Networks (KMAGCN) to capture the semantic representations by jointly modeling the textual information, knowledge concepts, and visual information into a unified framework for fake news detection. We model posts as graphs and use a knowledge-aware multi-modal adaptive graph learning principal for the effective feature learning. Compared with existing methods, the proposed KMAGCN addresses challenges from three aspects: (1) It models posts as graphs to capture the non-consecutive and long-range semantic relations; (2) it proposes a novel adaptive graph convolutional network to handle the variability of graph data; and (3) it leverages textual information, knowledge concepts and visual information jointly for model learning. We have conducted extensive experiments on three public real-world datasets and superior results demonstrate the effectiveness of KMAGCN compared with other state-of-the-art algorithms.

Download Full-text

A New View on Grasping

Motor Control ◽

10.1123/mcj.3.3.237 ◽

1999 ◽

Vol 3 (3) ◽

pp. 237-271 ◽

Cited By ~ 349

Author(s):

Jeroen B.J. Smeets ◽

Eli Brenner

Keyword(s):

Surface Roughness ◽

Visual Information ◽

Experimental Studies ◽

Movement Speed ◽

Object Size ◽

Extensive Review ◽

Minimum Jerk ◽

Alternative Description ◽

Position And Orientation ◽

Shape And Size

Reaching out for an object is often described as consisting of two components that are based on different visual information. Information about the object's position and orientation guides the hand to the object, while information about the object's shape and size determines how the fingers move relative to the thumb to grasp it. We propose an alternative description, which consists of determining suitable positions on the object—on the basis of its shape, surface roughness, and so on—and then moving one's thumb and fingers more or less independently to these positions. We modeled this description using a minimum-jerk approach, whereby the finger and thumb approach their respective target positions approximately orthogonally to the surface. Our model predicts how experimental variables such as object size, movement speed, fragility, and required accuracy will influence the timing and size of the maximum aperture of the hand. An extensive review of experimental studies on grasping showed that the predicted influences correspond to human behavior.

Download Full-text

Benchmark Dataset Based on Category Maps with Indoor–Outdoor Mixed Features for Positional Scene Recognition by a Mobile Robot

Robotics ◽

10.3390/robotics9020040 ◽

2020 ◽

Vol 9 (2) ◽

pp. 40

Author(s):

Hirokazu Madokoro ◽

Hanwool Woo ◽

Stephanie Nix ◽

Kazuhito Sato

Keyword(s):

Mobile Robot ◽

Visual Information ◽

Indoor Environment ◽

Scene Recognition ◽

Visual Features ◽

Wide Field ◽

Mixed Features ◽

Wide Field Of View ◽

Benchmark Datasets ◽

Unified Method

This study was conducted to develop original benchmark datasets that simultaneously include indoor–outdoor visual features. Indoor visual information related to images includes outdoor features to a degree that varies extremely by time, weather, and season. We obtained time-series scene images using a wide field of view (FOV) camera mounted on a mobile robot moving along a 392-m route in an indoor environment surrounded by transparent glass walls and windows for two directions in three seasons. For this study, we propose a unified method for extracting, characterizing, and recognizing visual landmarks that are robust to human occlusion in a real environment in which robots coexist with people. Using our method, we conducted an evaluation experiment to recognize scenes divided up to 64 zones with fixed intervals. The experimentally obtained results using the datasets revealed the performance and characteristics of meta-parameter optimization, mapping characteristics to category maps, and recognition accuracy. Moreover, we visualized similarities between scene images using category maps. We also identified cluster boundaries obtained from mapping weights.

Download Full-text

Multi-View Partial Multi-Label Learning with Graph-Based Disambiguation

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5761 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3553-3560 ◽

Cited By ~ 1

Author(s):

Ze-Sen Chen ◽

Xuan Wu ◽

Qing-Guo Chen ◽

Yao Hu ◽

Min-Ling Zhang

Keyword(s):

Predictive Model ◽

Clustering Analysis ◽

Experimental Studies ◽

Ground Truth ◽

Two Stage ◽

Feature Vectors ◽

Similarity Graph ◽

Training Examples ◽

Guided Clustering

In multi-view multi-label learning (MVML), each training example is represented by different feature vectors and associated with multiple labels simultaneously. Nonetheless, the labeling quality of training examples is tend to be affected by annotation noises. In this paper, the problem of multi-view partial multi-label learning (MVPML) is studied, where the set of associated labels are assumed to be candidate ones and only partially valid. To solve the MVPML problem, a two-stage graph-based disambiguation approach is proposed. Firstly, the ground-truth labels of each training example are estimated by disambiguating the candidate labels with fused similarity graph. After that, the predictive model for each label is learned from embedding features generated from disambiguation-guided clustering analysis. Extensive experimental studies clearly validate the effectiveness of the proposed approach in solving the MVPML problem.

Download Full-text

Automated single particle detection and tracking for large microscopy datasets

Royal Society Open Science ◽

10.1098/rsos.160225 ◽

2016 ◽

Vol 3 (5) ◽

pp. 160225 ◽

Cited By ~ 11

Author(s):

Rhodri S. Wilson ◽

Lei Yang ◽

Alison Dun ◽

Annya M. Smyth ◽

Rory R. Duncan ◽

...

Keyword(s):

Single Molecule ◽

Single Particle ◽

Image Data ◽

Ground Truth ◽

Detection Algorithm ◽

Large Datasets ◽

Single Particle Tracking ◽

Synthetic Image ◽

Particle Detection ◽

Very Large Datasets

Recent advances in optical microscopy have enabled the acquisition of very large datasets from living cells with unprecedented spatial and temporal resolutions. Our ability to process these datasets now plays an essential role in order to understand many biological processes. In this paper, we present an automated particle detection algorithm capable of operating in low signal-to-noise fluorescence microscopy environments and handling large datasets. When combined with our particle linking framework, it can provide hitherto intractable quantitative measurements describing the dynamics of large cohorts of cellular components from organelles to single molecules. We begin with validating the performance of our method on synthetic image data, and then extend the validation to include experiment images with ground truth. Finally, we apply the algorithm to two single-particle-tracking photo-activated localization microscopy biological datasets, acquired from living primary cells with very high temporal rates. Our analysis of the dynamics of very large cohorts of 10 000 s of membrane-associated protein molecules show that they behave as if caged in nanodomains. We show that the robustness and efficiency of our method provides a tool for the examination of single-molecule behaviour with unprecedented spatial detail and high acquisition rates.

Download Full-text

Hand-Held Stereovision System for Image Updating in Open Spine Surgery

Operative Neurosurgery ◽

10.1093/ons/opaa057 ◽

2020 ◽

Vol 19 (4) ◽

pp. 461-470

Author(s):

Xiaoyao Fan ◽

Maxwell S Durtschi ◽

Chen Li ◽

Linton T Evans ◽

Songbai Ji ◽

...

Keyword(s):

Computed Tomography ◽

Spine Surgery ◽

Surgical Navigation ◽

Image Data ◽

Ground Truth ◽

Acquisition Time ◽

Fiducial Markers ◽

Image Pair ◽

3 Dimensional ◽

Surgical Field

Abstract BACKGROUND Image guidance in open spinal surgery is compromised by changes in spinal alignment between preoperative images and surgical positioning. We evaluated registration of stereo-views of the surgical field to compensate for vertebral alignment changes. OBJECTIVE To assess accuracy and efficiency of an optically tracked hand-held stereovision (HHS) system to acquire images of the exposed spine during surgery. METHODS Standard midline posterior approach exposed L1 to L6 in 6 cadaver porcine spines. Fiducial markers were placed on each vertebra as “ground truth” locations. Spines were positioned supine with accentuated lordosis, and preoperative computed tomography (pCT) was acquired. Spines were re-positioned in a neutral prone posture, and locations of fiducials were acquired with a tracked stylus. Intraoperative stereovision (iSV) images were acquired and 3-dimensional (3D) surfaces of the exposed spine were reconstructed. HHS accuracy was assessed in terms of distances between reconstructed fiducial marker locations and their tracked counterparts. Level-wise registrations aligned pCT with iSV to account for changes in spine posture. Accuracy of updated computed tomography (uCT) was assessed using fiducial markers and other landmarks. RESULTS Acquisition time for each image pair was <1 s. Mean reconstruction time was <1 s for each image pair using batch processing, and mean accuracy was 1.2 ± 0.6 mm across 6 cases. Mean errors of uCT were 3.1 ± 0.7 and 2.0 ± 0.5 mm on the dorsal and ventral sides, respectively. CONCLUSION Results suggest that a portable HHS system offers potential to acquire accurate image data from the surgical field to facilitate surgical navigation during open spine surgery.

Download Full-text

Brain tumor classification and segmentation using sparse coding and dictionary learning

Biomedical Engineering / Biomedizinische Technik ◽

10.1515/bmt-2015-0071 ◽

2016 ◽

Vol 61 (4) ◽

pp. 413-429 ◽

Cited By ~ 8

Author(s):

Saif Dawood Salman Al-Shaikhli ◽

Michael Ying Yang ◽

Bodo Rosenhahn

Keyword(s):

Brain Tumor ◽

Dictionary Learning ◽

Sparse Coding ◽

Image Data ◽

Ground Truth ◽

Tumor Classification ◽

Training Data ◽

Tumor Segmentation ◽

Feature Dictionary ◽

The Brain

AbstractThis paper presents a novel fully automatic framework for multi-class brain tumor classification and segmentation using a sparse coding and dictionary learning method. The proposed framework consists of two steps: classification and segmentation. The classification of the brain tumors is based on brain topology and texture. The segmentation is based on voxel values of the image data. Using K-SVD, two types of dictionaries are learned from the training data and their associated ground truth segmentation: feature dictionary and voxel-wise coupled dictionaries. The feature dictionary consists of global image features (topological and texture features). The coupled dictionaries consist of coupled information: gray scale voxel values of the training image data and their associated label voxel values of the ground truth segmentation of the training data. For quantitative evaluation, the proposed framework is evaluated using different metrics. The segmentation results of the brain tumor segmentation (MICCAI-BraTS-2013) database are evaluated using five different metric scores, which are computed using the online evaluation tool provided by the BraTS-2013 challenge organizers. Experimental results demonstrate that the proposed approach achieves an accurate brain tumor classification and segmentation and outperforms the state-of-the-art methods.

Download Full-text