Objectives

  • Implement a framework that is able to associate both deictic gestures and iconic gestures with language

  • Use this framework to generate gestures helpful in route descriptions and assembly instructions

Motivation and Past Work

People explain things to each other all the time. They use words, gestures, diagrams, props - anything to get the point across. Each method has its pros and cons, and more often than not they work in tandem.[1]

Social robots are on the rise. Many of them have human-like appearances, but none is free from the curse of the uncanny valley. Besides displaying convincing emotions, multimodal interaction with humans can go a long way towards overcoming this barrier.

Researchers are looking at ways to integrate speech and gesture production in humanoid robots and to study how multimodality affects a robot's interaction with humans.[2] Experiments have also been carried out on the design of a robot's utterance, gesture and timing.[3]

Robot giving directions [3]

Humans use gestures and diagrams instinctively and extensively in two situations: when giving route directions and when explaining how to assemble or use an object. Words by themselves are arbitrary symbols; it is the gestures and drawings, through their visual resemblance to objects and actions, that help transfer complicated thoughts about them. This redundancy of information helps both the person receiving the explanation and the person giving it. Gestures have the further advantage of being a more universal means of communication. This project will focus on the joint use of gestures and words in route-learning and assembly-learning scenarios.

There are three kinds of gestures: 1) deictic gestures, which point to or indicate things in the environment; 2) iconic gestures, which resemble what they are meant to convey; and 3) beat gestures, which accompany speech and emphasize certain words or phrases.[1] While the importance of deictic gestures in direction-giving situations has been well established and modelled[4], the importance of iconic gestures needs further study. Iconic gestures play a minor role in the route-learning task, since an iconic representation is needed only for landmarks unknown to the asker, and the direction giver can often provide the correct path without referring to any landmarks at all. In the parts-assembly scenario, however, iconic gestures are key: people routinely convey the size, shape and function of parts through them.

Associating deictic gestures with words is not very involved, because the set of available actions is small: turn right, turn left and go straight. Associating iconic gestures with words in an unsupervised manner, however, is an interesting problem, and one this project will attempt to tackle. An embodied conversational agent (ECA) will then reproduce the gestures along with instructions to a human listener, who is expected to be able to perform the task at hand. Another interesting problem is the use of gestures to give routes for indoor navigation, which also involves planning 3D routes and having the agent localize itself in the environment. The ECA can then orient itself in different directions and give routes, or point to the destination if it lies in the same room.

In the long run, robots should be able to pick up new gestures from new interactions and use these newly learnt gestures later on. Matuszek et al.[5] have successfully associated physical attributes with language. Similarly, this project aims at grounding words for actions and objects in their related gestures. Doing so would let an artificial agent learn gestures automatically and deploy them without further human supervision.

Methodology

Gesture generation pipeline [2]



This project is mostly about steps 1 and 3 in the diagram: learning gestures and conveying them. It is assumed that an ECA is available that can plan how these gestures are to be physically performed and carry them out. Robots and animated characters have previously been used as ECAs.

Plan:

  • Record gestures performed in front of a Kinect, along with the accompanying words and instructions, and represent them in a form suitable for learning (one possible representation is sketched after this list)

  • Implement an Expectation-Maximization (EM) algorithm to ground words in their gesture representations (both deictic and iconic), similar to the work by Matuszek et al.[5] (a sketch of such an EM loop follows this list)

  • Map these gestures to actions that a robot or animated character performs for human listeners

  • Record the success or failure of the human in performing the route-learning or assembly task
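How a recorded gesture is represented is still an open design choice. Below is a minimal sketch, assuming Python with NumPy, of one possible feature extraction: the right-hand trajectory in torso-centred coordinates, resampled to a fixed number of frames. The joint names (hand_right, spine_mid) and the choice of features are illustrative assumptions, not a fixed part of the plan.

```python
# A minimal sketch of turning a recorded Kinect gesture into a fixed-length
# feature vector. Joint names and the feature choice are assumptions made
# for illustration only.
import numpy as np

def gesture_features(frames, num_samples=10):
    """frames: list of dicts mapping joint name -> (x, y, z) from the Kinect skeleton.

    Returns a flat feature vector: the right-hand trajectory expressed in
    torso-centred coordinates and resampled to num_samples time steps.
    """
    # Hand position relative to the torso, one row per recorded frame.
    traj = np.array([
        np.asarray(f["hand_right"]) - np.asarray(f["spine_mid"])
        for f in frames
    ])
    # Resample the trajectory to a fixed length so gestures of different
    # durations become directly comparable.
    idx = np.linspace(0, len(traj) - 1, num_samples)
    resampled = np.stack([
        np.interp(idx, np.arange(len(traj)), traj[:, d]) for d in range(3)
    ], axis=1)
    return resampled.ravel()   # shape: (num_samples * 3,)
```

A fixed-length vector of this kind simplifies the grounding step sketched next, though richer representations (both hands, velocities, hand shape) could be substituted without changing the overall plan.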
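The grounding step could then look roughly like the following sketch: a simple EM loop whose latent variable is which word of the utterance a gesture depicts, with each word modelled by a mean gesture-feature vector. This is only loosely inspired by Matuszek et al.[5]; the names (GestureExample, ground_words), the Gaussian gesture model and the fixed sigma are assumptions made for illustration, not a definitive implementation.

```python
# A minimal EM sketch for word-gesture grounding. Each example pairs the
# words of an utterance with one gesture feature vector; the latent variable
# is which word the gesture depicts.
import numpy as np
from collections import defaultdict

class GestureExample:
    """One recorded demonstration: the spoken words plus a gesture feature vector."""
    def __init__(self, words, features):
        self.words = words                      # e.g. ["turn", "left"] or ["round", "knob"]
        self.features = np.asarray(features, dtype=float)

def ground_words(examples, num_iters=30, sigma=1.0, seed=0):
    """Learn a mean gesture-feature vector per word via EM."""
    rng = np.random.default_rng(seed)
    dim = examples[0].features.shape[0]
    vocab = sorted({w for ex in examples for w in ex.words})
    means = {w: rng.normal(scale=0.1, size=dim) for w in vocab}

    for _ in range(num_iters):
        # E-step: soft-assign each gesture to the words of its own utterance.
        resp = []
        for ex in examples:
            scores = np.array([
                np.exp(-np.sum((ex.features - means[w]) ** 2) / (2 * sigma ** 2))
                for w in ex.words
            ])
            total = scores.sum()
            if total > 0:
                scores = scores / total
            else:
                scores = np.full(len(ex.words), 1.0 / len(ex.words))
            resp.append(dict(zip(ex.words, scores)))

        # M-step: re-estimate each word's mean gesture features from the
        # weighted average of the gestures softly assigned to it.
        num = defaultdict(lambda: np.zeros(dim))
        den = defaultdict(float)
        for ex, r in zip(examples, resp):
            for w, weight in r.items():
                num[w] += weight * ex.features
                den[w] += weight
        for w in vocab:
            if den[w] > 1e-8:
                means[w] = num[w] / den[w]
    return means
```

Given demonstrations such as GestureExample(["turn", "left"], features) and GestureExample(["round", "knob"], features), the learnt means for "left" and "knob" could later be handed to the ECA's gesture planner as prototypes when reproducing instructions for the human listener.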

References

[1] Tversky, Barbara, et al. "Explanations in gesture, diagram, and word." Spatial Language and Dialogue, Coventry, KR, Tenbrink, T., and Bateman, J.(Eds.), Oxford University Press, Oxford (2009).
[2] Kopp, Stefan, Kirsten Bergmann, and Ipke Wachsmuth. "Multimodal communication from multimodal thinking—towards an integrated model of speech and gesture production." International Journal of Semantic Computing 2.01 (2008): 115-136.
[3] Okuno, Yusuke, et al. "Providing route directions: design of robot's utterance, gesture, and timing." Human-Robot Interaction (HRI), 2009 4th ACM/IEEE International Conference on. IEEE, 2009.
[4] Striegnitz, Kristina, et al. "Knowledge representation for generating locating gestures in route directions." Proceedings of Workshop on Spatial Language and Dialogue (5th Workshop on Language and Space). 2005.
[5] Matuszek, Cynthia, et al. "A Joint Model of Language and Perception for Grounded Attribute Learning." arXiv preprint arXiv:1206.6423 (2012).