Motivation :
When humans look at a scene, they do not observe it with a steady gaze; instead, they use saccadic eye movements to direct fixation towards interesting points in the scene. Around these fixations they form a high-level representation of the scene, the foveated area, which aids visual scene memory, object recognition and classification [5].
In computer vision, several machine learning techniques, trained on datasets wholly or partially annotated by humans, achieve state-of-the-art performance in action classification, object recognition, segmentation, etc. This performance, however, still falls well short of that of humans on similar tasks.
By studying and using features based on the saccadic motion and fixation of the eyes for action classification in videos, we aim to provide a platform for bridging the gap between human and computer vision techniques.
Related Work :
In the computer vision community, several techniques have been devised to detect interest points in images/videos. A descriptive set of features is computed around these points and fed to classifiers to perform action recognition tasks; one such successful approach uses the Harris spatio-temporal cornerness operator [3][4].
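To make the interest-point idea concrete, below is a minimal sketch of a spatio-temporal Harris cornerness response in the spirit of this operator. The smoothing scales, the constant k, and the grayscale (T, H, W) video layout are illustrative assumptions, not the exact detector of [3][4].

```python
# A minimal sketch of a spatio-temporal Harris ("cornerness") response.
# Assumptions: `video` is a (T, H, W) float array of grayscale frames;
# sigma/tau/k are illustrative choices.
import numpy as np
from scipy.ndimage import gaussian_filter

def spatiotemporal_harris(video, sigma=2.0, tau=1.5, k=0.005):
    """Return a cornerness volume; high values mark candidate interest points."""
    v = gaussian_filter(video.astype(np.float64), sigma=(tau, sigma, sigma))
    # Spatio-temporal gradients (axis 0 = time, axes 1 and 2 = space).
    Lt, Ly, Lx = np.gradient(v)
    # Second-moment matrix entries, averaged over a Gaussian window.
    w = (2 * tau, 2 * sigma, 2 * sigma)
    Mxx = gaussian_filter(Lx * Lx, w); Myy = gaussian_filter(Ly * Ly, w)
    Mtt = gaussian_filter(Lt * Lt, w); Mxy = gaussian_filter(Lx * Ly, w)
    Mxt = gaussian_filter(Lx * Lt, w); Myt = gaussian_filter(Ly * Lt, w)
    # det(M) and trace(M) of the symmetric 3x3 matrix at every voxel.
    det = (Mxx * (Myy * Mtt - Myt * Myt)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace ** 3

# Usage: keep voxels whose response exceeds a threshold as interest points.
# points = np.argwhere(spatiotemporal_harris(video) > 1e-6)
```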
Mathe et al. [1] provide a large-scale dynamic human eye movement dataset captured over videos from computer vision datasets such as Hollywood-2 [6] and UCF Sports [7]. For each video, it provides gaze coordinates with timestamps, categorised as eye fixations or saccadic motion, for 16 subjects per video.
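As a sketch of how such gaze recordings could be represented in code, consider the structure below. The field layout (timestamp, coordinates, fixation/saccade label, subject id) is an assumption for illustration and not the actual file format released with [1].

```python
# Hypothetical in-memory representation of per-video gaze data; the field
# layout is assumed for illustration, not taken from the dataset release.
from dataclasses import dataclass

@dataclass
class GazeSample:
    timestamp_ms: float  # time into the video
    x: float             # gaze x-coordinate in the frame
    y: float             # gaze y-coordinate in the frame
    label: str           # "fixation" or "saccade"
    subject: int         # one of the 16 subjects per video

def fixations_only(samples):
    """Keep only fixation samples; saccades carry little scene content."""
    return [s for s in samples if s.label == "fixation"]
```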
Proposed Methodology :
Through this project, we would like to study the role that human eye-fixated regions of a video play in determining the action present in it, thereby checking the relevance of the visual information found in these regions.
Since the code/detailed implementation is not publicly available, we propose the following method for visual action recognition based on eye gaze data:
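(The full pipeline is not reproduced here. As one illustrative building block such a method might use, the sketch below turns fixation points into a soft foveation mask that retains only the fixated regions of a frame; the Gaussian fovea size and the masking strategy are hypothetical choices, not the method itself.)

```python
# A minimal, hypothetical sketch: build a soft "foveation" mask from the
# fixation points falling on one frame, and attenuate everything else.
# The fovea size (sigma) and masking strategy are illustrative assumptions.
import numpy as np

def foveation_mask(shape, fixations, sigma=40.0):
    """shape: (H, W); fixations: iterable of (x, y) pixel coordinates."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    mask = np.zeros(shape, dtype=np.float64)
    for fx, fy in fixations:
        g = np.exp(-((xs - fx) ** 2 + (ys - fy) ** 2) / (2 * sigma ** 2))
        mask = np.maximum(mask, g)  # union of all foveated regions
    return mask

def foveate(frame, fixations, sigma=40.0):
    """Keep fixated regions of a grayscale frame; suppress the rest."""
    return frame * foveation_mask(frame.shape, fixations, sigma)
```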
Experimentation :
We would like to compare the results of our proposed approach with other state-of-the-art methods, so as to explore how informative the foveated area formed by the eye-fixated regions of an entire video is for the task of action classification.