What has been missed for predicting human attention in viewing driving clips?

PeerJ ◽  
2017 ◽  
Vol 5 ◽  
pp. e2946 ◽  
Author(s):  
Jiawei Xu ◽  
Shigang Yue ◽  
Federica Menchinelli ◽  
Kun Guo

Recent research progress on the topic of human visual attention allocation in scene perception and its simulation is based mainly on studies with static images. However, natural vision requires us to extract visual information that constantly changes due to egocentric movements or dynamics of the world. It is unclear to what extent spatio-temporal regularity, an inherent regularity in dynamic vision, affects human gaze distribution and saliency computation in visual attention models. In this free-viewing eye-tracking study, we manipulated the spatio-temporal regularity of traffic videos by presenting them in normal video sequence, reversed video sequence, normal frame sequence, and randomised frame sequence. The recorded human gaze allocation was then used as the ‘ground truth’ to examine the predictive ability of a number of state-of-the-art visual attention models. The analysis revealed high inter-observer agreement across individual human observers, but all the tested attention models performed significantly worse than humans. Their inferior predictability was evident in gaze predictions that were indistinguishable across stimulus presentation sequences and in a weak central fixation bias. Our findings suggest that a realistic visual attention model for processing dynamic scenes should incorporate human visual sensitivity to spatio-temporal regularity together with the central fixation bias.
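
As context for how such model-versus-human comparisons are typically scored (the paper's exact evaluation pipeline is not given in the abstract), the following minimal sketch computes the widely used Normalized Scanpath Saliency (NSS) metric and a Gaussian centre-bias baseline. The array shapes, the `sigma_frac` parameter, and the toy fixation data are illustrative assumptions.

```python
import numpy as np

def nss(saliency_map: np.ndarray, fixations: np.ndarray) -> float:
    """NSS: mean of the z-scored saliency map at human fixation points.

    saliency_map : 2-D array (H, W) of model-predicted salience.
    fixations    : (N, 2) integer array of (row, col) fixation coordinates.
    """
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    rows, cols = fixations[:, 0], fixations[:, 1]
    return float(s[rows, cols].mean())

def centre_bias_map(h: int, w: int, sigma_frac: float = 0.25) -> np.ndarray:
    """Isotropic Gaussian centred on the frame, a common fixation-bias baseline."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    sigma = sigma_frac * min(h, w)
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# Toy usage: fixations drawn near the frame centre score well
# under the centre-bias baseline.
rng = np.random.default_rng(0)
h, w = 360, 640
fix = np.clip(rng.normal([h / 2, w / 2], [40, 60], size=(50, 2)),
              0, [h - 1, w - 1]).astype(int)
print("centre-bias NSS:", nss(centre_bias_map(h, w), fix))
```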

Sensors ◽  
2021 ◽  
Vol 21 (9) ◽  
pp. 3099 ◽
Author(s):  
V. Javier Traver ◽  
Judith Zorío ◽  
Luis A. Leiva

Temporal salience considers how visual attention varies over time. Although visual salience has been widely studied from a spatial perspective, its temporal dimension has been mostly ignored, despite arguably being of utmost importance for understanding the temporal evolution of attention on dynamic contents. To address this gap, we propose Glimpse, a novel measure to compute temporal salience based on the observer-spatio-temporal consistency of raw gaze data. The measure is conceptually simple, training-free, and provides a semantically meaningful quantification of visual attention over time. As an extension, we explore scoring algorithms that estimate temporal salience from spatial salience maps predicted with existing computational models. However, these approaches generally fall short when compared with our proposed gaze-based measure. Glimpse could serve as the basis for several downstream tasks such as segmentation or summarization of videos. Glimpse’s software and data are publicly available.
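
Glimpse’s exact formulation is not reproduced in the abstract; the sketch below illustrates a gaze-consistency score in the same spirit, where a frame’s temporal salience is high when observers’ gaze points cluster tightly. The dispersion-to-score mapping and all shapes are assumptions made for illustration.

```python
import numpy as np

def temporal_salience(gaze: np.ndarray) -> np.ndarray:
    """gaze : (T, K, 2) array of (x, y) gaze points, T frames, K observers.
    Returns a (T,) consistency score in (0, 1]: higher = tighter agreement."""
    centroid = gaze.mean(axis=1, keepdims=True)                        # (T, 1, 2)
    dispersion = np.linalg.norm(gaze - centroid, axis=2).mean(axis=1)  # (T,)
    # Map mean dispersion to a bounded score relative to the clip average.
    return 1.0 / (1.0 + dispersion / dispersion.mean())

# Toy usage: frames 40-59 have tightly clustered gaze and should peak.
rng = np.random.default_rng(1)
T, K = 100, 12
gaze = rng.uniform(0, 100, size=(T, K, 2))
gaze[40:60] = 50 + rng.normal(0, 2, size=(20, K, 2))
score = temporal_salience(gaze)
print("peak frame:", int(score.argmax()))  # expected to land in 40..59
```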


2009 ◽  
Vol 101 (2) ◽  
pp. 917-925 ◽  
Author(s):  
A. T. Smith ◽  
P. L. Cotton ◽  
A. Bruno ◽  
C. Moutsiana

The pulvinar region of the thalamus has repeatedly been linked with the control of attention. However, the functions of the pulvinar remain poorly characterized in both human and nonhuman primates. In a functional MRI study, we examined the relative contributions to activity in the human posterior pulvinar made by visual drive (the presence of an unattended visual stimulus) and attention (covert spatial attention to the stimulus). In an event-related design, large optic-flow stimuli were presented to the left and/or right of a central fixation point. When unattended, the stimuli robustly activated two regions of the pulvinar, one medial and one dorsal with respect to the lateral geniculate nucleus. The activity in both regions showed a strong contralateral bias, suggesting retinotopic organization. Primate physiology suggests that the two regions could be two portions of the same double map of the visual field. In our paradigm, attending to the stimulus enhanced the response by about 20%. Thus attention is not necessary to activate the human pulvinar, and the degree of attentional enhancement matches, but does not exceed, that seen in the cortical regions with which the posterior pulvinar connects.
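
The reported ~20% enhancement corresponds to a simple percentage-modulation computation; the short sketch below illustrates it with hypothetical GLM beta values, not the study’s data.

```python
import numpy as np

# Hypothetical response estimates (e.g. GLM betas) per run for one ROI.
beta_unattended = np.array([1.00, 0.95, 1.10])
beta_attended   = np.array([1.20, 1.14, 1.32])

# Percentage enhancement of attended over unattended responses.
enhancement = 100 * (beta_attended.mean() - beta_unattended.mean()) / beta_unattended.mean()
print(f"attentional enhancement: {enhancement:.1f}%")  # ~20%, as reported
```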


Author(s):  
А. Axyonov ◽  
D. Ryumin ◽  
I. Kagirov

Abstract. This paper presents a new method for collecting multimodal sign language (SL) databases, distinguished by its use of multimodal video data. The paper also proposes a new method of multimodal sign recognition based on the analysis of spatio-temporal visual features of SL units (i.e. lexemes). In general, gesture recognition involves processing a video sequence to extract information about the movements of an articulator (a part of the human body) in time and space. With this approach, the recognition accuracy for isolated signs was 88.92%. By extracting and analysing spatio-temporal data, the proposed method identifies more informative sign features, which increases the accuracy of SL recognition.
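
The authors’ architecture is not detailed in the abstract; as a hedged illustration of extracting spatio-temporal visual features from a sign video, the sketch below uses a small 3-D CNN. The layer sizes, clip resolution, and class count are assumptions, not the paper’s model.

```python
import torch
import torch.nn as nn

class SignClassifier(nn.Module):
    def __init__(self, num_classes: int = 50):
        super().__init__()
        # 3-D convolutions see (time, height, width) jointly, so the learned
        # filters respond to motion as well as to hand shape and position.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels=3, frames, height, width)
        return self.head(self.features(clip).flatten(1))

# Toy usage: one 16-frame RGB clip at 112x112 resolution.
logits = SignClassifier()(torch.randn(1, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([1, 50])
```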


eLife ◽  
2020 ◽  
Vol 9 ◽  
Author(s):  
Christopher A Henry ◽  
Mehrdad Jazayeri ◽  
Robert M Shapley ◽  
Michael J Hawken

Complex scene perception depends upon the interaction between signals from the classical receptive field (CRF) and the extra-classical receptive field (eCRF) in primary visual cortex (V1) neurons. Although much is known about V1 eCRF properties, we do not yet know how the underlying mechanisms map onto the cortical microcircuit. We probed the spatio-temporal dynamics of eCRF modulation using a reverse correlation paradigm, and found three principal eCRF mechanisms: tuned-facilitation, untuned-suppression, and tuned-suppression. Each mechanism had a distinct timing and spatial profile. Laminar analysis showed that the timing, orientation-tuning, and strength of eCRF mechanisms had distinct signatures within magnocellular and parvocellular processing streams in the V1 microcircuit. The existence of multiple eCRF mechanisms provides new insights into how V1 responds to spatial context. Modeling revealed that the differences in timing and scale of these mechanisms predicted distinct patterns of net modulation, reconciling many previous disparate physiological and psychophysical findings.
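
Reverse correlation, the paradigm named in the abstract, can be illustrated with a spike-triggered average; the sketch below is a generic toy version (the authors’ actual stimulus protocol and analysis differ), where the stimulus values preceding each spike are averaged to recover the neuron’s effective temporal kernel.

```python
import numpy as np

def spike_triggered_average(stimulus: np.ndarray, spikes: np.ndarray,
                            lags: int = 20) -> np.ndarray:
    """stimulus : (T,) stimulus value per time bin.
    spikes   : (T,) spike counts per time bin.
    Returns the (lags,) average stimulus preceding a spike."""
    sta = np.zeros(lags)
    total = 0
    for t in np.nonzero(spikes)[0]:
        if t >= lags:
            sta += spikes[t] * stimulus[t - lags:t]  # window ending at t-1
            total += spikes[t]
    return sta / max(total, 1)

# Toy usage: a cell driven by the rectified stimulus, 5 bins earlier.
rng = np.random.default_rng(2)
stim = rng.normal(size=10_000)
rate = np.clip(stim, 0, None)              # rectified drive
spk = rng.poisson(np.roll(rate, 5) * 0.5)  # 5-bin response latency
print(spike_triggered_average(stim, spk)[-5])  # STA peaks at lag -5
```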


2020 ◽  
Vol 34 (07) ◽  
pp. 11278-11286 ◽  
Author(s):  
Soo Ye Kim ◽  
Jihyong Oh ◽  
Munchurl Kim

Super-resolution (SR) has been widely used to convert low-resolution legacy videos to high-resolution (HR) ones, to suit the increasing resolution of displays (e.g. UHD TVs). However, it becomes easier for humans to notice motion artifacts (e.g. motion judder) in HR videos rendered on larger display devices. Broadcasting standards therefore support higher frame rates for UHD (Ultra High Definition) videos (4K@60 fps, 8K@120 fps), meaning that applying SR alone is insufficient to produce genuinely high-quality videos. Hence, to up-convert legacy videos for realistic applications, not only SR but also video frame interpolation (VFI) is required. In this paper, we first propose a joint VFI-SR framework for up-scaling the spatio-temporal resolution of videos from 2K 30 fps to 4K 60 fps. For this, we propose a novel training scheme with a multi-scale temporal loss that imposes temporal regularization on the input video sequence, which can be applied to any general video-related task. The proposed structure is analyzed in depth with extensive experiments.
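
The multi-scale temporal loss is described only at a high level in the abstract; the sketch below shows one plausible reading of such temporal regularization, penalising mismatched frame-to-frame differences between predicted and ground-truth sequences at several temporal strides. The stride set and the L1 penalty are assumptions, not the paper’s exact formulation.

```python
import torch

def multiscale_temporal_loss(pred: torch.Tensor, gt: torch.Tensor,
                             strides=(1, 2, 4)) -> torch.Tensor:
    """pred, gt : (batch, frames, channels, height, width) video tensors."""
    loss = pred.new_zeros(())
    for s in strides:
        # Frame differences at stride s approximate motion at that temporal scale.
        dp = pred[:, s:] - pred[:, :-s]
        dg = gt[:, s:] - gt[:, :-s]
        loss = loss + torch.mean(torch.abs(dp - dg))
    return loss / len(strides)

# Toy usage with an 8-frame clip; gradients flow back to the prediction.
pred = torch.randn(2, 8, 3, 64, 64, requires_grad=True)
gt = torch.randn(2, 8, 3, 64, 64)
multiscale_temporal_loss(pred, gt).backward()
print(pred.grad.shape)  # torch.Size([2, 8, 3, 64, 64])
```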

