A computational model for top-down visual attention

Abhijit Sharang


Motivation

Visual attention is an important ability by which we reduce the information load in processing visual stimuli. Knowledge of the important elements in a scene is instrumental for this selective processing, and this knowledge is acquired through gaze fixation.
     There are two approaches to modelling visual attention. The first is bottom-up, which is data driven; the second is top-down, which focuses on the specificity of the task involved. A combination of the two models is needed to control gaze in real-time scenarios.
     In Computer Vision, bottom-up saliency has been researched in great detail. However, much of the visual input in our everyday activities is goal directed, where bottom-up attention is of little help. It is therefore imperative to develop a robust computational model of top-down visual attention. Apart from advancing our understanding of the human visual attention process, such a model is also useful in human-computer interaction tasks such as automated vehicle driving.


Related work

Top-down saliency can be modelled in a Bayesian formulation using features in the scene (both bottom-up and task driven) which can influence the gaze pattern. Torralba et al. [3] employed a discriminative model to maximise the probability P(O=1, X | L, G), where O=1 indicates the presence of the object, X is the location of the gaze in the image, and L and G denote the local and global features respectively. Zhang et al. [4] used Independent Component Analysis and Difference of Gaussians filters learned over a large number of images to estimate the probability P(L|G) appearing in the expansion of the above probability by Bayes' theorem. Navalpakkam and Itti [5] proposed to maximise the signal-to-noise ratio in the image for object detection. In all of these approaches, visual search is an important component.
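     For concreteness, the Bayes expansion referred to above can be factorised as follows (this is the factorisation used in the contextual guidance line of work; the exact conditioning varies between papers):

     P(O=1, X | L, G) = (1 / P(L|G)) P(L | O=1, X, G) P(X | O=1, G) P(O=1 | G)

     Here the first factor, 1/P(L|G), acts as a bottom-up saliency term (locally distinctive features are those that are rare under the global context), while the remaining factors carry the top-down, task-driven information.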
     Peters and Itti [6] developed a spatial model by mapping global features to eye fixations in navigation and exploration tasks. Cagli et al. [7] used a discriminative model for sensory-motor coordination in drawing tasks. Yi and Ballard [8] developed a dynamic Bayesian network for a sandwich-making task.
     The methods mentioned above are either task specific or rely on visual search. In the approaches built on the latter idea, defining reward functions and object importance is a challenging task, while the former cannot be extended to a general top-down model. The goal of this project is to experiment with features which can help overcome the restrictions imposed by these approaches.


Methodology

Modelling top-down visual attention involves predicting the next object and the next location in the scene which will be attended. Hence, we need to maximise the probabilities P(O_{t+1} | S_{t+1}) and P(X_{t+1} | S_{t+1}), where S_t denotes the state at time t.
     The current state is not directly accessible and must therefore be modelled with the help of features computed from the scene. Besides, the current gaze location is also expected to depend on the previous one (P(X_{t+1} | X_t)). Exploiting this information, a dynamic Bayesian network can be constructed with model parameters V = (m, Θ), where m represents the structure of the network (the number of nodes and the dependencies between them in the graph) and Θ comprises the state transition matrix, the observation matrix and the initial probability distribution. The network can be trained to maximise P(O|V), the likelihood of the observations given the model.
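     Under first-order Markov assumptions, one standard way to combine these dependencies is the filtering-prediction step (a sketch of the inference involved, not a claim about the final network structure): given the features F_{1:t} observed up to time t,

     P(X_{t+1} | F_{1:t}, X_t) = Σ_{S_{t+1}} P(X_{t+1} | S_{t+1}, X_t) Σ_{S_t} P(S_{t+1} | S_t) P(S_t | F_{1:t}),

     where P(S_t | F_{1:t}) is the filtered state estimate maintained by the network; the object prediction P(O_{t+1} | S_{t+1}) is obtained analogously.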
     Since hidden Markov models and Kalman filters can be viewed as special cases of dynamic Bayesian networks, these simpler models can be developed as well for comparison of accuracy.
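     As an illustration of the quantity being maximised, the likelihood P(O|V) of an observation sequence under the simplest such model, a discrete-state hidden Markov model, can be evaluated with the forward algorithm. The following is a minimal sketch (all names are ours; it assumes discretised observations and known parameters):

    import numpy as np

    def sequence_log_likelihood(obs, pi, A, B):
        # obs: sequence of discrete observation indices
        # pi:  (K,)   initial state distribution
        # A:   (K, K) transition matrix, A[i, j] = P(S_{t+1}=j | S_t=i)
        # B:   (K, M) observation matrix, B[i, m] = P(obs=m | S=i)
        alpha = pi * B[:, obs[0]]            # forward message at t = 0
        loglik = np.log(alpha.sum())
        alpha /= alpha.sum()                 # rescale to avoid underflow
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]    # propagate state, condition on o
            loglik += np.log(alpha.sum())
            alpha /= alpha.sum()
        return loglik                        # log P(O | V)

     Training then amounts to Baum-Welch (EM) re-estimation of (pi, A, B) from such forward-backward passes, which is the sense in which the model is trained to maximise P(O|V).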

Experimentation

The computational model is intended to be developed using the game dataset released by the authors of [1]. It includes eye tracking data for 5 participants who played 3 video games (Hot Dog Bush, 3D Driving School and Top Gun) for several minutes each. Since the amount of data is huge, only the data for the first game is intended to be used.
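     As a preprocessing sketch, raw gaze samples would be discretised into grid cells to serve as observation sequences for the models above. The file layout and column names below are hypothetical placeholders, since the exact format of the released dataset is not described here:

    import csv
    import numpy as np

    GRID = 8  # discretise the screen into an 8 x 8 grid of gaze cells

    def gaze_to_observations(path, width, height):
        # Assumes a hypothetical CSV with columns frame, x, y (pixels);
        # the actual layout of the dataset of [1] may differ.
        obs = []
        with open(path) as f:
            for row in csv.DictReader(f):
                cx = min(int(float(row["x"]) / width * GRID), GRID - 1)
                cy = min(int(float(row["y"]) / height * GRID), GRID - 1)
                obs.append(cy * GRID + cx)
        return np.array(obs)  # one discrete observation per gaze sample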

References

  1. Borji, Ali, Dicky N. Sihite, and Laurent Itti. "An Object-Based Bayesian Framework for Top-Down Visual Attention." Twenty-Sixth AAAI Conference on Artificial Intelligence. 2012.
  2. Borji, Ali, Dicky N. Sihite, and Laurent Itti. "Probabilistic learning of task-specific visual attention." Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
  3. Oliva, Aude, and Antonio Torralba. "The role of context in object recognition." Trends in cognitive sciences 11.12 (2007): 520-527.
  4. Hou, Xiaodi, and Liqing Zhang. "Saliency detection: A spectral residual approach." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
  5. Navalpakkam, Vidhya, and Laurent Itti. "Modeling the influence of task on attention." Vision research 45.2 (2005): 205-231.
  6. Peters, Robert J., and Laurent Itti. "Beyond bottom-up: Incorporating task-dependent influences into a computational model of spatial attention." Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007.
  7. Coen-Cagli, Ruben, et al. "Visuomotor characterization of eye movements in a drawing task." Vision research 49.8 (2009): 810-818.
  8. Yi, Weilie, and Dana Ballard. "Recognizing behavior in hand-eye coordination patterns." International Journal of Humanoid Robotics 6.03 (2009): 337-359.