Deep Image Similarity Measurement Based on the Improved Triplet Network with Spatial Pyramid Pooling

Information ◽  
2019 ◽  
Vol 10 (4) ◽  
pp. 129 ◽  
Author(s):  
Xinpan Yuan ◽  
Qunfeng Liu ◽  
Jun Long ◽  
Lei Hu ◽  
Yulou Wang

Image similarity measurement is a fundamental problem in computer vision. It is widely used in image classification, object detection, image retrieval, and other fields, mostly through Siamese or triplet networks. These networks consist of two or three identical convolutional neural network (CNN) branches that share weights to obtain high-level image feature representations, so that similar images are mapped close to each other in the feature space and dissimilar images are mapped far apart. The triplet network in particular is known as the state-of-the-art method for image similarity measurement. However, a basic CNN can only handle fixed-size images. If a fixed-size image is obtained by cropping or scaling, image information is lost and recognition accuracy is reduced. To solve this problem, this paper proposes the triplet spatial pyramid pooling network (TSPP-Net), which combines the triplet convolutional neural network with spatial pyramid pooling. Additionally, we propose an improved triplet loss function, so that the network model can perform distance learning twice while taking only three samples as input at a time. Theoretical analysis and experiments show that the TSPP-Net model and the improved triplet loss function can improve the generalization ability and accuracy of the image similarity measurement algorithm.
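For orientation, the baseline that a triplet network optimizes can be sketched as follows. This is the standard hinge-style triplet loss on Euclidean distances, not the improved loss proposed in the paper, whose exact form the abstract does not give. The function names, the margin value, and the toy embeddings are assumptions made for illustration; some formulations use squared distances instead.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings, chosen so the distances are 1 and 5.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negative = [3.0, 4.0]

# Well-separated triplet: 1 - 5 + 1 < 0, so the loss is clipped to zero.
loss_easy = triplet_loss(anchor, positive, negative)
# Swapping the roles gives a violated triplet: 5 - 1 + 1 = 5.
loss_hard = triplet_loss(anchor, negative, positive)
```

Only violated triplets contribute a gradient in this form, which is why the choice of triplets matters in practice.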

2020 ◽  
Vol 2020 ◽  
pp. 1-10 ◽  
Author(s):  
Tanvir Ahmad ◽  
Yinglong Ma ◽  
Muhammad Yahya ◽  
Belal Ahmad ◽  
Shah Nazir ◽  
...  

Object detection has recently seen tremendous success, but detecting and identifying objects both accurately and quickly remains very challenging. Human beings can detect and recognize multiple objects in images or videos with ease regardless of the objects' appearance, but for computers it is challenging to identify and distinguish between things. In this paper, a modified YOLOv1-based neural network is proposed for object detection. The new neural network model is improved in the following ways. Firstly, the loss function of the YOLOv1 network is modified: the improved model replaces the margin style with a proportion style. Compared to the old loss function, the new one is more flexible and more reasonable in optimizing the network error. Secondly, a spatial pyramid pooling layer is added. Thirdly, an inception model with a 1 × 1 convolution kernel is added, which reduces the number of weight parameters of the layers. Extensive experiments on the Pascal VOC 2007/2012 datasets showed that the proposed method achieved better performance.
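The parameter saving attributed to the 1 × 1 convolution can be illustrated with a small count. The channel sizes below (256 in, 64 in the bottleneck, 256 out) are hypothetical and are not taken from the paper; biases are ignored.

```python
def conv_weights(in_channels, out_channels, kernel_size):
    """Number of weights in a square 2-D convolution (biases ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


# A 3x3 convolution applied directly to 256 input channels.
direct = conv_weights(256, 256, 3)

# The same 3x3 convolution preceded by a 1x1 convolution that first
# reduces the 256 channels to a hypothetical 64-channel bottleneck.
reduced = conv_weights(256, 64, 1) + conv_weights(64, 256, 3)

print(direct, reduced)  # 589824 163840
```

Under these assumed sizes the bottleneck variant needs well under a third of the weights, since the expensive 3 × 3 kernel now operates on far fewer input channels.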


Content-Based Image Retrieval (CBIR) is an extensively used technique for image retrieval from large image databases. However, users are not satisfied with conventional image retrieval techniques. In addition, with the advent of web development and transmission networks, the number of images available to users continues to increase, and digital images are produced continuously and in considerable volume in many areas. Quick access to images similar to a given query image in this extensive collection poses great challenges and requires proficient techniques. From query by image to retrieval of relevant images, CBIR has key phases such as feature extraction, similarity measurement, and retrieval of relevant images, of which feature extraction is one of the most important. Recently, Convolutional Neural Networks (CNNs) have shown good results in computer vision due to their ability to extract features from images. AlexNet is a classical deep CNN for image feature extraction. We have modified the AlexNet architecture with a few changes and proposed a novel framework to improve its ability for feature extraction and similarity measurement. The proposed approach optimizes AlexNet's pooling layers. In particular, average pooling is replaced by max-avg pooling, and the non-linear activation function Maxout is used after every convolution layer for better feature extraction. This paper introduces a CNN for feature extraction from images in a CBIR system and also presents Euclidean distance along with the Comprehensive Values for better results. The proposed framework goes beyond image retrieval, including the large-scale database. The performance of the proposed work is evaluated using precision. The proposed work shows better results than existing works.
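The similarity-measurement and evaluation phases described above can be sketched as a Euclidean-distance ranking followed by a precision score. The feature vectors, image names, and relevance set are invented for illustration, and the "Comprehensive Values" mentioned in the abstract are not defined there, so they are left out.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve(query, database, k):
    """Return the names of the k database images closest to the query."""
    ranked = sorted(database, key=lambda name: euclidean(query, database[name]))
    return ranked[:k]


def precision(retrieved, relevant):
    """Fraction of the retrieved images that are relevant."""
    hits = sum(1 for name in retrieved if name in relevant)
    return hits / len(retrieved)


# Hypothetical 2-D features; a real system would use CNN activations.
database = {
    "cat_1": [0.9, 0.1],
    "cat_2": [0.8, 0.2],
    "dog_1": [0.1, 0.9],
    "dog_2": [0.2, 0.8],
}
query = [1.0, 0.0]

top = retrieve(query, database, k=3)
p = precision(top, relevant={"cat_1", "cat_2"})
```

With these toy values the two cat images rank first and one dog image fills the third slot, so two of the three retrieved images are relevant.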


2018 ◽  
Vol 10 (12) ◽  
pp. 115
Author(s):  
Wanli Yang ◽  
Yimin Chen ◽  
Chen Huang ◽  
Mingke Gao

In recent years, the application of deep neural networks to human behavior recognition has become a hot topic. Although remarkable achievements have been made in the field of image recognition, there are still many problems to be solved in the area of video. It is well known that convolutional neural networks require a fixed-size image input, which not only limits the network structure but also affects the recognition accuracy. Although this problem has been solved for images, it remains unsolved for video. To address the problem of fixed-size video frame input in video recognition, we propose a three-dimensional (3D) densely connected convolutional network based on spatial pyramid pooling (3D-DenseNet-SPP). As the name implies, the network structure is mainly composed of three parts: 3DCNN, DenseNet, and SPPNet. Our models were evaluated on the KTH dataset and the UCF101 dataset separately. The experimental results showed that our model performs better in video-based behavior recognition than the existing models.


2021 ◽  
Vol 1 (1) ◽  
pp. 29-31
Author(s):  
Mahmood Haithami ◽  
Amr Ahmed ◽  
Iman Yi Liao ◽  
Hamid Jalab

In this paper, we aim to enhance the segmentation capabilities of DeeplabV3 by employing a Gated Recurrent Neural Network (GRU). A 1-by-1 convolution in DeeplabV3 was replaced by a GRU after the Atrous Spatial Pyramid Pooling (ASPP) layer to combine the input feature maps. Both the convolution and the GRU have sharable parameters, but the latter has gates that enable or disable the contribution of each input feature map. The experiments on unseen test sets demonstrate that employing a GRU instead of the convolution produces better segmentation results. The datasets used are public datasets provided by the MedAI competition.


Sensors ◽  
2020 ◽  
Vol 20 (12) ◽  
pp. 3539 ◽  
Author(s):  
Chang-Cheng Lo ◽  
Ching-Hung Lee ◽  
Wen-Cheng Huang

This study proposes a prognostic method based on a one-dimensional convolutional neural network (1-D CNN) with clustering loss by classification training. The 1-D CNN was trained on vibration signals from normal and malfunction data using a hybrid loss function (i.e., classification loss in output and clustering loss in feature space). Subsequently, the obtained feature was adopted to estimate the status for prognosis. The open bearing dataset and an established gear platform were utilized to validate the functionality and feasibility of the proposed model. Moreover, the experimental platform was used to simulate the gear mechanism of the semiconductor robot to conduct a practical experiment to verify the accuracy of the model estimation. The experimental results demonstrate the performance and effectiveness of the proposed method.


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 1058
Author(s):  
Zhanghui Liu ◽  
Yudong Zhang ◽  
Yuzhong Chen ◽  
Xinwen Fan ◽  
Chen Dong

Domain generation algorithms (DGAs) use specific parameters as random seeds to generate a large number of random domain names to prevent malicious domain name detection. This greatly increases the difficulty of detecting and defending against botnets and malware. Traditional models for detecting algorithmically generated domain names generally rely on manually extracting statistical characteristics from the domain names or network traffic and then employing classifiers to distinguish the algorithmically generated domain names. These models always require labor-intensive manual feature engineering. In contrast, most state-of-the-art models based on deep neural networks are sensitive to imbalance in the sample distribution and cannot fully exploit the discriminative class features in domain names or network traffic, leading to decreased detection accuracy. To address these issues, we employ the borderline synthetic minority over-sampling algorithm (SMOTE) to improve sample balance. We also propose a recurrent convolutional neural network with spatial pyramid pooling (RCNN-SPP) to extract discriminative and distinctive class features. The recurrent convolutional neural network combines a convolutional neural network (CNN) and a bi-directional long short-term memory network (Bi-LSTM) to extract both the semantic and contextual information from domain names. We then employ the spatial pyramid pooling strategy to refine the contextual representation by capturing multi-scale contextual information from domain names. The experimental results from different domain name datasets demonstrate that our model can achieve 92.36% accuracy, an 89.55% recall rate, a 90.46% F1-score, and 95.39% AUC in identifying DGA and legitimate domain names, and it can achieve a 92.45% accuracy rate, a 90.12% recall rate, a 90.86% F1-score, and 96.59% AUC in multi-classification problems. It achieves significant improvement over existing models in terms of accuracy and robustness.
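The over-sampling step can be illustrated by the interpolation at the core of SMOTE: a synthetic minority sample is placed at a random point on the segment between a minority sample and one of its minority-class neighbours. The sketch below shows only that step. The borderline variant additionally restricts which minority samples are over-sampled (those near the class boundary), and the neighbour search is omitted. The vectors, function name, and seed are assumptions made for illustration.

```python
import random


def smote_sample(x, neighbor, rng):
    """Create one synthetic point on the segment from x to neighbor."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbor)]


rng = random.Random(0)    # fixed seed so the sketch is reproducible
minority = [1.0, 2.0]     # hypothetical minority-class feature vector
neighbor = [3.0, 6.0]     # hypothetical minority-class nearest neighbour

synthetic = smote_sample(minority, neighbor, rng)
```

Because the same `gap` is used for every coordinate, the synthetic point always lies on the line segment between the two originals.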


Author(s):  
Guoxian Dai ◽  
Jin Xie ◽  
Yi Fang

Learning a 3D shape representation from a collection of its rendered 2D images has been extensively studied. However, existing view-based techniques have not yet fully exploited the information among all the views of projections. In this paper, by employing a recurrent neural network to efficiently capture features across different views, we propose a siamese CNN-BiLSTM network for 3D shape representation learning. The proposed method minimizes a discriminative loss function to learn a deep nonlinear transformation, mapping 3D shapes from the original space into a nonlinear feature space. In the transformed space, the distance between 3D shapes with the same label is minimized; otherwise, the distance is maximized up to a large margin. Specifically, the 3D shapes are first projected into a group of 2D images from different views. Then a convolutional neural network (CNN) is adopted to extract features from the different view images, followed by a bidirectional long short-term memory (LSTM) network to aggregate information across different views. Finally, we construct the whole CNN-BiLSTM network into a siamese structure with a contrastive loss function. Our proposed method is evaluated on two benchmarks, ModelNet40 and SHREC 2014, demonstrating superiority over the state-of-the-art methods.
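The behaviour described for the transformed space (same-label pairs pulled together, different-label pairs pushed apart up to a margin) matches the commonly used contrastive loss, sketched below. This is the textbook form and may differ in detail from the loss used in the paper. The function names, margin, and embeddings are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, same_label, margin=1.0):
    """Textbook contrastive loss for one pair of embeddings."""
    d = euclidean(u, v)
    if same_label:
        # Same class: any remaining distance is penalised.
        return 0.5 * d ** 2
    # Different class: penalised only while closer than the margin.
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings that are exactly 5 apart.
u, v = [0.0, 0.0], [3.0, 4.0]

loss_same = contrastive_loss(u, v, same_label=True)               # 0.5 * 5^2
loss_diff_far = contrastive_loss(u, v, same_label=False)          # beyond margin
loss_diff_near = contrastive_loss(u, v, same_label=False, margin=6.0)
```

Unlike a triplet loss, this form looks at one pair at a time, so the margin sets an absolute target distance rather than a relative one.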


2020 ◽  
Vol 10 (21) ◽  
pp. 7898
Author(s):  
Akm Ashiquzzaman ◽  
Hyunmin Lee ◽  
Kwangki Kim ◽  
Hye-Young Kim ◽  
Jaehyung Park ◽  
...  

Current deep learning convolutional neural network (DCNN)-based hand gesture detectors with high precision demand very high-performance computing power. Although DCNN-based detectors are capable of accurate classification, the sheer computing power needed for this form of classification makes them very difficult to run with lower computational power in remote environments. Moreover, classical DCNN architectures have a fixed number of input dimensions, which forces preprocessing and makes them impractical for real-world applications. In this research, a practical DCNN with an optimized architecture is proposed with DCNN filter/node pruning, and spatial pyramid pooling (SPP) is introduced in order to make the model input dimension-invariant. This compact SPP-DCNN module uses 65% fewer parameters than traditional classifiers and operates almost 3× faster than classical models. Moreover, the improved algorithm, which decodes gestures or sign language finger-spelling from videos, achieved the highest benchmark accuracy with the fastest processing speed. The proposed method paves the way for various practical and applied hand gesture input-based human-computer interaction (HCI) applications.
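How spatial pyramid pooling makes a model independent of input size can be seen in a minimal single-channel sketch. Each pyramid level splits the feature map into an n × n grid and max-pools every cell, so the output length depends only on the chosen levels. The levels (1, 2, 4), the bin-boundary rule, and the toy feature maps are assumptions made for illustration, not details from the paper.

```python
import math


def spp(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2-D feature map over an n x n grid for each level
    and concatenate the results into one fixed-length vector."""
    h = len(feature_map)
    w = len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i (floor start, ceil end, never empty).
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(
                    feature_map[r][c]
                    for r in range(r0, r1)
                    for c in range(c0, c1)
                ))
    return pooled


def make_map(h, w):
    """Toy feature map whose entries increase row by row."""
    return [[r * w + c for c in range(w)] for r in range(h)]


small = spp(make_map(5, 7))    # 5 x 7 input
large = spp(make_map(13, 9))   # 13 x 9 input

print(len(small), len(large))  # 21 21
```

Both inputs yield 1 + 4 + 16 = 21 values, which is what allows a fully connected layer with a fixed input width to follow convolutional layers that accept arbitrary image sizes.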

