Impact of eye-fixated regions in visual action recognition


Guide : Prof. Amitabha Mukerjee

Deepak Pathak
{deepakp@iitk.ac.in}



Motivation :

When humans look at a scene, they do not take it in steadily; instead, they use saccadic eye movements to direct fixations towards the interesting points in the scene. From the foveated regions they form a high-level representation of the scene, which supports visual scene memory, object recognition and classification [5].
In computer vision, several machine learning techniques are trained on datasets (wholly or partially annotated by humans) to achieve state-of-the-art performance in action classification, object recognition, segmentation, etc. This, however, still lags well behind human performance on similar tasks.

By studying and using features based on the saccadic motion and fixations of the eyes for action classification in videos, we try to provide a platform to bridge the gap between human and computer vision techniques.

Related Work :

In the computer vision community, several techniques have been devised to detect interesting points in images and videos. A descriptive set of features is computed around these points, and action recognition is then performed with classifiers; one such successful approach uses the Harris spatio-temporal cornerness operator [3][4].

Mathe et al. [1] provide a large-scale dynamic human eye movement dataset captured over the videos of standard computer vision datasets, namely Hollywood-2 [6] and UCF Sports [7]. For each video, and for each of 16 subjects, it supplies a set of (x, y) coordinates with timestamps, each categorised as an eye fixation or a saccade.
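To make the dataset description concrete, here is a minimal sketch of how the per-video gaze data could be represented in Python. The actual file format of the dataset is an assumption on our part; the GazeSample fields and the load_gaze() helper are hypothetical.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class GazeSample:
        x: float            # horizontal image coordinate
        y: float            # vertical image coordinate
        t: float            # timestamp (frame index or seconds)
        subject: int        # one of the 16 subjects per video
        is_fixation: bool   # True for a fixation, False for a saccade

    def load_gaze(path: str) -> List[GazeSample]:
        """Assumed whitespace-separated format, one sample per line:
        x y t subject label, where label is 'F' (fixation) or 'S' (saccade)."""
        samples = []
        with open(path) as f:
            for line in f:
                x, y, t, subj, label = line.split()
                samples.append(GazeSample(float(x), float(y), float(t),
                                          int(subj), label == 'F'))
        return samples

    # Only the fixations serve as interest points (x, y, t) in the method below.
    fixations = [s for s in load_gaze("video_0001.gaze") if s.is_fixation]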


Heat maps generated from eye-fixated areas (image credit: Mathe et al. [2])


Proposed Methodology :

Through this project, we would like to study the role of human eye-fixated regions in a video in determining the action present in it, thereby checking the relevance of the visual information found in eye-fixated regions.

Since the code and detailed implementation are not publicly available, we propose the following method for visual action recognition based on eye gaze data:

  1. The eye-gaze fixation points in the videos, i.e. (x, y, t), are treated as interest points, and a standard descriptor is computed centred at each of them [4] (see the first sketch after this list).
  2. These descriptors are then mapped to a vocabulary, leading to a bag-of-words representation (see the second sketch after this list).
  3. This bag-of-words representation is further used for action classification in a video with some non-linear multiclass classifier (also covered in the second sketch).
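
Below is a minimal sketch of step 1 in Python with NumPy. It computes a simplified spatio-temporal gradient histogram in a cuboid centred on each fixation; this is a stand-in for a standard descriptor such as HOG3D [4], not its published implementation, and the cuboid size and bin count are arbitrary choices.

    import numpy as np

    def fixation_descriptor(video, x, y, t, half=8, nbins=8):
        """video: (T, H, W) grayscale array; (x, y, t): integer fixation
        coordinates. Returns an L1-normalised histogram of spatial gradient
        orientations, weighted by 3D gradient magnitude, over a cuboid
        centred on the fixation."""
        T, H, W = video.shape
        t0, t1 = max(t - half, 0), min(t + half + 1, T)
        y0, y1 = max(y - half, 0), min(y + half + 1, H)
        x0, x1 = max(x - half, 0), min(x + half + 1, W)
        cuboid = video[t0:t1, y0:y1, x0:x1].astype(np.float32)
        gt, gy, gx = np.gradient(cuboid)       # gradients along t, y, x
        mag = np.sqrt(gx**2 + gy**2 + gt**2)   # 3D gradient magnitude
        ori = np.arctan2(gy, gx)               # spatial orientation in [-pi, pi]
        hist, _ = np.histogram(ori, bins=nbins, range=(-np.pi, np.pi), weights=mag)
        return hist / (hist.sum() + 1e-8)

This produces one descriptor per fixation; a real pipeline would use finer cells and quantised 3D orientations as in [4].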
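Steps 2 and 3 could then look as follows, again as a sketch rather than a definitive implementation: a k-means vocabulary over the per-fixation descriptors, an L1-normalised word histogram per video, and a chi-squared-kernel SVM as the non-linear multiclass classifier. The vocabulary size of 500 and the kernel choice are our assumptions, though chi-squared kernels over histogram features are common in this literature.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import chi2_kernel

    def bow_histogram(descriptors, vocab):
        """Quantise each descriptor to its nearest visual word and return an
        L1-normalised word-count histogram for the video."""
        words = vocab.predict(descriptors)
        hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float64)
        return hist / (hist.sum() + 1e-8)

    def train(descriptors_per_video, labels, vocab_size=500):
        """descriptors_per_video: one (n_i, d) array per training video;
        labels: the action class of each video."""
        vocab = KMeans(n_clusters=vocab_size, n_init=1).fit(
            np.vstack(descriptors_per_video))
        X = np.array([bow_histogram(d, vocab) for d in descriptors_per_video])
        clf = SVC(kernel="precomputed").fit(chi2_kernel(X), labels)
        return vocab, clf, X    # X is kept to build kernels for test videos

At test time, a video's histogram is built with the same vocabulary and classified with clf.predict(chi2_kernel(X_test, X)).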

Experimentation :

We would like to compare the results of our proposed approach with other state-of-the-art performances, so as to explore how informative the foveated area formed by the eye-fixated regions of an entire video is for the task of action classification.
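
As a concrete sketch of the evaluation we have in mind, per-class average precision, the standard Hollywood-2 [6] metric, could be computed as below so that our gaze-based pipeline is directly comparable to published interest-point baselines; the function and variable names here are placeholders.

    import numpy as np
    from sklearn.metrics import average_precision_score

    def mean_average_precision(scores, labels, classes):
        """scores: (n_videos, n_classes) classifier decision values;
        labels: array of true class ids; classes: class ids in column order."""
        aps = [average_precision_score(labels == c, scores[:, i])
               for i, c in enumerate(classes)]
        return float(np.mean(aps))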




Dataset : The dynamic human eye movement dataset of Mathe et al. [1], described above.

References :
  1. Mathe, Stefan, and Cristian Sminchisescu. "Dynamic eye movement datasets and learnt saliency models for visual action recognition." Computer Vision-ECCV 2012. Springer Berlin Heidelberg, 2012. 842-856.
  2. Mathe, Stefan, and Cristian Sminchisescu. Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition. Technical report, Institute of Mathematics of the Romanian Academy and University of Bonn (February 2012), 2012.
  3. Laptev, Ivan. "On space-time interest points." International Journal of Computer Vision 64.2 (2005): 107-123.
  4. Klaser, Alexander, Marcin Marszalek, and Cordelia Schmid. "A spatio-temporal descriptor based on 3D-gradients." British Machine Vision Conference (BMVC), 2008.
  5. Henderson, John M., and Andrew Hollingworth. "Eye movements and visual memory: Detecting changes to saccade targets in scenes." Attention, Perception, & Psychophysics 65.1 (2003): 58-71.
  6. Marszalek, Marcin, Ivan Laptev, and Cordelia Schmid. "Actions in context." Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.
  7. Rodriguez, Mikel D., Javed Ahmed, and Mubarak Shah. "Action MACH: a spatio-temporal Maximum Average Correlation Height filter for action recognition." Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.