Transformers in Pedestrian Image Retrieval and Person Re-Identification in a Multi-Camera Surveillance System

2021 ◽  
Vol 11 (19) ◽  
pp. 9197
Author(s):  
Muhammad Tahir ◽  
Saeed Anwar

Person re-identification (re-ID) is an essential task in computer vision, particularly in surveillance applications. The aim is to match a query image of a person against surveillance photographs captured in various scenarios. Most person re-ID techniques rely on Convolutional Neural Networks (CNNs); however, Vision Transformers are replacing pure CNNs in many computer vision tasks such as object recognition and classification. Vision Transformers capture information about local regions of the image, and current techniques exploit this property to improve accuracy on the task at hand. We propose to use Vision Transformers in conjunction with vanilla CNN models to investigate the true strength of transformers in person re-identification. We employ three backbones with different combinations of Vision Transformers on two benchmark datasets. The overall performance of the backbones increased, showing the importance of Vision Transformers. We provide ablation studies and show the importance of various components of the Vision Transformer in re-identification tasks.
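Whatever backbone produces the embeddings, the retrieval step such re-ID pipelines share can be sketched as ranking gallery images by cosine similarity to the query embedding. A minimal numpy illustration (the random vectors below are stand-ins for CNN/transformer features, not the paper's actual model):

```python
import numpy as np

def rank_gallery(query, gallery):
    """Rank gallery images by cosine similarity to a query embedding."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q               # cosine similarity per gallery image
    return np.argsort(-scores)   # best match first

rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, 128))               # 5 gallery embeddings (toy)
query = gallery[3] + 0.01 * rng.normal(size=128)  # near-duplicate of identity 3
ranking = rank_gallery(query, gallery)            # identity 3 ranks first
```

Metrics such as rank-1 accuracy and mAP, standard in re-ID benchmarks, are computed from exactly this kind of ranking.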

2020 ◽  
Vol 8 (6) ◽  
pp. 3992-3995

Object recognition using deep neural networks is now common in real-world applications. We propose a framework for recognizing objects in very low resolution images through the collaborative learning of two deep neural networks: an image enhancement network and an object recognition network. The image enhancement network seeks to turn low-resolution inputs into sharper, more informative images, guided by collaborative learning signals from the object recognition network. The object recognition network, whose weights are pre-trained on high-resolution images, actively participates in the training of the enhancement network, and it uses the enhancement network's output as augmented training data to reinforce its performance on very low resolution objects. We establish that the proposed method improves both image reconstruction and classification performance.
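A collaborative objective of this kind is commonly implemented as a weighted sum of a reconstruction loss on the enhancement network's output and a classification loss backpropagated from the recognizer. A toy numpy sketch with linear stand-in "networks" (the weighting lam and all shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
low_res = rng.normal(size=16)     # flattened low-resolution input (toy)
high_res = rng.normal(size=64)    # corresponding high-resolution target
label = 2                         # ground-truth class index

W_enh = rng.normal(size=(64, 16)) * 0.1   # toy "enhancement network"
W_cls = rng.normal(size=(10, 64)) * 0.1   # toy "recognition network"

enhanced = W_enh @ low_res                      # enhanced image
recon_loss = np.mean((enhanced - high_res) ** 2)

logits = W_cls @ enhanced                       # recognizer scores the result
probs = np.exp(logits - logits.max())
probs /= probs.sum()
cls_loss = -np.log(probs[label])                # cross-entropy

lam = 0.5                                       # illustrative trade-off weight
joint_loss = recon_loss + lam * cls_loss        # minimizing this trains both
                                                # networks collaboratively
```

Backpropagating `joint_loss` through both matrices is what couples the two networks: the enhancer is rewarded not just for pixel fidelity but for producing images the recognizer can classify.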


2021 ◽  
Author(s):  
Weihao Zhuang ◽  
Tristan Hascoet ◽  
Xunquan Chen ◽  
Ryoichi Takashima ◽  
Tetsuya Takiguchi ◽  
...  

Abstract Currently, deep learning plays an indispensable role in many fields, including computer vision, natural language processing, and speech recognition. Convolutional Neural Networks (CNNs) have demonstrated excellent performance in computer vision tasks thanks to their powerful feature extraction capability. However, as larger models have shown higher accuracy, recent developments have led to state-of-the-art CNN models with ever-increasing resource consumption. This paper investigates a conceptual approach to reducing the memory consumption of CNN inference. Our method processes the input image as a sequence of carefully designed tiles within the lower subnetwork of the CNN, so as to minimize its peak memory consumption, while keeping the end-to-end computation unchanged. This method introduces a trade-off between memory consumption and computation, which is particularly suitable for high-resolution inputs. Our experimental results show that MobileNetV2 memory consumption can be reduced by up to 5.3 times with our proposed method. For ResNet50, one of the most commonly used CNN models in computer vision tasks, memory consumption can be reduced by up to 2.3 times.
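The core idea, computing the same convolution tile by tile with a small halo of overlap so only one tile is resident at a time, can be demonstrated in plain numpy (a simplified single-layer sketch; the paper applies this across the CNN's lower subnetwork):

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2-D convolution (correlation) with a 3x3 kernel."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * k)
    return out

def conv2d_tiled(x, k, tile_rows=8):
    """Same result, but the input is processed in row tiles with a 2-row
    halo, so only a small slice is live at once (lower peak memory)."""
    H = x.shape[0]
    chunks = []
    for a in range(0, H - 2, tile_rows):
        b = min(a + tile_rows, H - 2)               # output rows [a, b)
        chunks.append(conv2d_valid(x[a:b + 2], k))  # needs input rows [a, b+2)
    return np.vstack(chunks)

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32))
kernel = rng.normal(size=(3, 3))
full = conv2d_valid(img, kernel)
tiled = conv2d_tiled(img, kernel)   # identical output, lower peak memory
```

The halo rows are recomputed for each tile, which is exactly the memory-for-computation trade-off the abstract describes.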


2021 ◽  
Author(s):  
Ghassan Dabane ◽  
Laurent Perrinet ◽  
Emmanuel Daucé

Convolutional Neural Networks have been considered the go-to option for object recognition in computer vision for the last couple of years. However, their invariance to object translations is still deemed a weak point, and remains limited to the small translations afforded by their max-pooling layers. One bio-inspired approach to overcoming this limitation considers the What/Where pathway separation found in mammals. This approach works as a nature-inspired attention mechanism; another classical approach is Spatial Transformers, which allow adaptive end-to-end learning of different classes of spatial transformations throughout training. In this work, we review Spatial Transformers as an attention-only mechanism and compare them with the What/Where model. We show that attention-restricted, or "Foveated", Spatial Transformer Networks, coupled with a curriculum learning training scheme and an efficient log-polar visual-space input, outperform the What/Where model, all without the need for any extra supervision whatsoever.
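The log-polar input referred to above samples the image densely near a fixation point and sparsely in the periphery, mimicking a fovea. A minimal nearest-neighbour numpy sketch of such a resampling (the grid sizes are illustrative, not the paper's configuration):

```python
import numpy as np

def log_polar_sample(img, out_shape=(32, 64), r_min=1.0):
    """Nearest-neighbour log-polar resampling around the image centre.
    Radial bins are spaced exponentially: fine near the centre (fovea),
    coarse in the periphery."""
    H, W = img.shape
    n_r, n_t = out_shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    r_max = min(cy, cx)
    out = np.zeros(out_shape)
    for i in range(n_r):
        r = r_min * (r_max / r_min) ** (i / (n_r - 1))  # log-spaced radius
        for j in range(n_t):
            t = 2 * np.pi * j / n_t
            y = int(round(cy + r * np.sin(t)))
            x = int(round(cx + r * np.cos(t)))
            out[i, j] = img[y, x]
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(64, 64))
lp = log_polar_sample(image)   # rows = log-radius, columns = angle
```

A convenient property of this representation is that rotations and scalings of the input become (approximately) translations of the log-polar map, which plays well with a CNN's translation handling.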


Algorithms ◽  
2020 ◽  
Vol 13 (7) ◽  
pp. 167 ◽  
Author(s):  
Dan Malowany ◽  
Hugo Guterman

Computer vision is currently one of the most exciting and rapidly evolving fields of science, affecting numerous industries. Research and development breakthroughs, mainly in the field of convolutional neural networks (CNNs), have opened the way to unprecedented sensitivity and precision in object detection and recognition tasks. Nevertheless, findings in recent years on the sensitivity of neural networks to additive noise, lighting conditions, and the completeness of the training dataset indicate that this technology still lacks the robustness needed for the autonomous robotic industry. In an attempt to bring computer vision algorithms closer to the capabilities of a human operator, the mechanisms of the human visual system were analyzed in this work. Recent studies show that the mechanisms behind the recognition process in the human brain include continuous generation of predictions based on prior knowledge of the world. These predictions enable rapid generation of contextual hypotheses that bias the outcome of the recognition process. This mechanism is especially advantageous in situations of uncertainty, when visual input is ambiguous. In addition, the human visual system continuously updates its knowledge about the world based on the gaps between its predictions and the visual feedback. CNNs are feed-forward in nature and lack such top-down contextual attenuation mechanisms. As a result, although they process massive amounts of visual information during their operation, the information is not transformed into knowledge that can be used to generate contextual predictions and improve their performance. In this work, an architecture was designed that aims to integrate the concepts behind the top-down prediction and learning processes of the human visual system with state-of-the-art bottom-up object recognition models, e.g., deep CNNs. The work focuses on two mechanisms of the human visual system: anticipation-driven perception and reinforcement-driven learning. Imitating these top-down mechanisms, together with state-of-the-art bottom-up feed-forward algorithms, resulted in an accurate, robust, and continuously improving target recognition model.
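The contextual biasing described above can be illustrated in simplified Bayesian form: ambiguous bottom-up class scores are multiplied by a scene-level prior and renormalized. All numbers and class names below are illustrative, not from the paper:

```python
import numpy as np

# Bottom-up recognizer output: class scores for an ambiguous object.
classes = ["laptop", "book", "toaster"]
bottom_up = np.array([0.40, 0.38, 0.22])   # near-tie: ambiguous evidence

# Top-down contextual prediction: the scene was recognized as "office",
# which makes some objects far more likely a priori.
context_prior = np.array([0.70, 0.25, 0.05])

posterior = bottom_up * context_prior      # Bayesian-style combination
posterior /= posterior.sum()               # renormalize to a distribution
decision = classes[int(np.argmax(posterior))]
```

The prior breaks the near-tie in the bottom-up scores, which is precisely the advantage in situations of uncertainty that the abstract attributes to the human visual system.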


Author(s):  
Brijesh Verma ◽  
Siddhivinayak Kulkarni

This chapter introduces neural networks for Content-Based Image Retrieval (CBIR) systems. It presents a critical literature review of both the traditional and neural network-based techniques used to retrieve images based on their content. It shows how neural networks and fuzzy logic can be used in the interpretation of queries, feature extraction, and classification of features, by describing a detailed research methodology. It investigates a neural network-based technique in conjunction with fuzzy logic to improve the overall performance of CBIR systems. The results of the investigation on a benchmark database, with a comparative analysis, are presented in this chapter. The methodologies and results presented here will allow researchers to improve and compare their methods, and will also allow system developers to understand and implement neural network and fuzzy logic based techniques for content-based image retrieval.


2018 ◽  
Author(s):  
Rodrigo A. Rebouças ◽  
Elcio H. Shiguemori ◽  
Lamartine N. F. Guimarães

The use of drones has grown alongside image processing and computer vision techniques such as autonomous image-based navigation, mosaic generation, elevation modeling, 3D reconstruction, and object recognition. In all these techniques, an important step is feature extraction, for example via interest-point methods. This work addresses the application of interest-point detectors such as BRISK, ORB, FREAK, AKAZE, and LATCH, with their parameters configured automatically by an optimization method for images with different textures. This process is part of a final software tool that uses a metaheuristic to automatically select the best parameters according to an input image.
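The detectors listed share one core idea: score each pixel by how strongly intensity varies in two independent directions. A minimal Harris-style corner response in plain numpy (not any of the named detectors, just the underlying interest-point principle; window size and k are illustrative):

```python
import numpy as np

def harris_response(img, k=0.05, win=2):
    """Harris corner response: high where the local structure tensor has
    two large eigenvalues (a corner), low on edges and flat regions."""
    Iy, Ix = np.gradient(img.astype(float))     # image gradients
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    H, W = img.shape
    R = np.zeros((H, W))
    for y in range(win, H - win):
        for x in range(win, W - win):
            s = np.s_[y - win:y + win + 1, x - win:x + win + 1]
            a, b, c = Ixx[s].sum(), Iyy[s].sum(), Ixy[s].sum()
            det, trace = a * b - c * c, a + b
            R[y, x] = det - k * trace * trace
    return R

img = np.zeros((20, 20))
img[5:15, 5:15] = 1.0          # white square: 4 corners, straight edges
R = harris_response(img)       # peaks near the square's corners
```

Parameters like `k` and the window size are exactly the kind of knobs the abstract's metaheuristic would tune per image texture.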


2021 ◽  
pp. 1143-1146
Author(s):  
A.V. Lysenko ◽  
◽  
◽  
M.S. Oznobikhin ◽  
E.A. Kireev ◽  
...  

Abstract. This study discusses the problem of phytoplankton classification using computer vision methods and convolutional neural networks. We created a system for automatic object recognition consisting of two parts: analysis and primary processing of phytoplankton images, and development of a neural network based on the information obtained about the images. We developed software that can detect particular objects in images from a light microscope. We trained a convolutional neural network via transfer learning and determined the optimal parameters of this neural network and the optimal training dataset size. To increase accuracy for certain groups of classes, we created three neural networks with the same structure. The accuracy obtained in classifying Baikal phytoplankton with these neural networks was up to 80%.
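Transfer learning here means keeping a pre-trained feature extractor frozen and training only a small classification head on the new phytoplankton classes. A toy numpy sketch of that division of labour (the "backbone" is a fixed random projection standing in for pre-trained CNN features; data and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "backbone": a fixed projection standing in for a pre-trained
# CNN feature extractor. Its weights are never updated.
backbone = rng.normal(size=(8, 2))

# Toy 2-class dataset (e.g. two plankton groups), well separated.
X = np.vstack([rng.normal(loc=-2, size=(50, 2)),
               rng.normal(loc=2, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

feats = X @ backbone.T          # frozen features; only the head trains

# Trainable head: logistic regression, a few gradient-descent steps.
w = np.zeros(8)
for _ in range(200):
    p = 1 / (1 + np.exp(-(feats @ w)))
    w -= 0.1 * feats.T @ (p - y) / len(y)

pred = (feats @ w > 0).astype(int)
accuracy = (pred == y).mean()   # high, despite training only the head
```

Because only the small head is trained, far less labelled data is needed, which is why transfer learning suits domains like microscopy where annotation is expensive.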


Author(s):  
Emmanuel Udoh

Computer vision or object recognition complements human or biological vision using techniques from machine learning, statistics, scene reconstruction, indexing, and event analysis. Object recognition is an active research area that implements artificial vision in software and hardware. Some application examples are autonomous robots, surveillance, indexing databases of pictures, and human-computer interaction. This visual aid is beneficial to users, because humans remember information with greater accuracy when it is presented visually than when it originates in writing, speech, or kinesthetic form. Linguistic indexing adds another dimension to computer vision by automatically assigning words or textual descriptions to images. This augments content-based image retrieval (CBIR), which extracts or searches for digital images in large databases. According to Li and Wang (2003), most existing CBIR projects are general-purpose image retrieval systems that search for images visually similar to a query sketch. Current CBIR systems are incapable of assigning words automatically to images due to the inherent difficulty of recognizing numerous objects at once. This situation is stimulating several research endeavors that seek to assign text to images, thereby improving image retrieval in large databases. To enhance information processing using object recognition techniques, current research has focused on automatic linguistic indexing of digital images (ALIDI). ALIDI requires a combination of mathematical, statistical, computational, and graphical backgrounds. Many researchers have focused on various aspects of linguistic processing such as CBIR (Ghosal, Ircing, & Khudanpur, 2005; Iqbal & Aggarwal, 2002; Wang, 2001), machine learning techniques (Iqbal & Aggarwal, 2002), digital libraries (Witten & Bainbridge, 2003), and statistical modeling (Li, Gray, & Olsen, 2004; Li & Wang, 2003). 
A growing approach is the utilization of statistical models, as demonstrated by Li and Wang (2003). It entails building databases of images to be used for supervised learning. A trained system is used to recognize and identify new images within a statistical error margin. This statistical modeling approach uses a hidden Markov model to extract representative information about any category of images analyzed. However, in using computers to recognize images with textual descriptions, some researchers employ solely text-based approaches. In this article, the focus is on the computational and graphical aspects of ALIDI in a system that uses Web-based access in order to enable wider usage (Ntoulas, Chao, & Cho, 2005). This system uses image composition (primary hue and saturation) in the linguistic indexing of digital images or pictures.


2021 ◽  
Vol 2021 ◽  
pp. 1-14
Author(s):  
Sandeep Kumar ◽  
Arpit Jain ◽  
Ambuj Kumar Agarwal ◽  
Shilpa Rani ◽  
Anshu Ghimire

Research communities have been focusing increasingly on digital image retrieval owing to growing internet and social media use. In this paper, a U-Net-based neural network is proposed for the segmentation process, and Haar DWT and lifting wavelet schemes are used for feature extraction in content-based image retrieval (CBIR). The Haar wavelet is preferred as it is easy to understand, very simple to compute, and fast. The U-Net-based neural network (CNN) gives more accurate results than the existing methodology because deep learning techniques extract both low-level and high-level features from the input image. For the evaluation process, two benchmark datasets are used; the accuracy of the proposed method is 93.01% and 88.39% on Corel 1K and Corel 5K, respectively. U-Net is used for segmentation, and it reduces the dimension of the feature vector and the feature extraction time by 5 seconds compared to existing methods. According to the performance analysis, the proposed work demonstrates that U-Net improves image retrieval performance in terms of accuracy, precision, and recall on both benchmark datasets.
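The Haar DWT used for feature extraction is indeed simple to compute. A single-level 2-D Haar decomposition in plain numpy (a minimal sketch, independent of the paper's exact pipeline):

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT. Returns (LL, LH, HL, HH) subbands.
    With 1/sqrt(2) normalization the transform is orthonormal, so total
    energy is preserved across the four subbands."""
    s = np.sqrt(2.0)
    # Rows: average (low-pass) and difference (high-pass) of pixel pairs.
    lo = (x[:, 0::2] + x[:, 1::2]) / s
    hi = (x[:, 0::2] - x[:, 1::2]) / s
    # Columns: the same split applied to both row outputs.
    LL = (lo[0::2] + lo[1::2]) / s
    LH = (lo[0::2] - lo[1::2]) / s
    HL = (hi[0::2] + hi[1::2]) / s
    HH = (hi[0::2] - hi[1::2]) / s
    return LL, LH, HL, HH

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 8))
LL, LH, HL, HH = haar_dwt2(img)   # each subband is half-size per axis
```

The quarter-size LL subband is a compact summary of the image, which is why wavelet coefficients make cheap, low-dimensional CBIR feature vectors.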


