Welcome
Welcome to the homepage of the Computer Vision Group of IIT Kanpur. We are a group of
faculty and students working on exciting problems in the rapidly growing areas of
Computer Vision and applied Machine Learning, as well as at their intersection with
Signal Processing and Robotics. We are primarily interested in the research problems
and general directions described below, but we are also receptive to interesting new
problems as they arise.
This webpage is under construction, so keep checking back for more information.
Vision and Language
Traditionally, progress in visual and textual processing and understanding has
happened in relatively distinct threads. More recently, vision and language methods
have been combined in applications such as image captioning and visual question
answering. For example, the image on the left might be captioned automatically as
"A dog chasing a ball", or a question could be posed about that image, such as
"What color ball is the dog chasing?", with "white" as a possible answer. We are
interested in such problems, where the complementarity of vision and language models
is exploited and novel algorithms are designed to address the relevant challenges
and applications.
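As a concrete illustration of how such a system is typically used, here is a minimal
captioning sketch built on a publicly available pretrained model (BLIP, loaded through
the Hugging Face transformers library). The image file name is a placeholder, and this
particular model is only one of many possible choices, not a method of our group.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Load a publicly available pretrained captioning model (BLIP).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Placeholder image: e.g. a photo of a dog chasing a ball.
image = Image.open("dog_and_ball.jpg").convert("RGB")

# Generate a short caption for the image.
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(out[0], skip_special_tokens=True))
```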
Face and Human Analysis
Visual data is growing at a very high rate: nearly everyone has a camera in their
pocket and an internet connection to share pictures and videos. Most of this
human-generated visual data has, in turn, humans as its main subject. Hence, the
analysis and understanding of human-centered visual data is an important part of
Computer Vision. We are interested in many problems focusing on humans, such as
(i) facial analysis: predicting identity, emotions, and intent from faces;
(ii) human attribute prediction: the kind of clothes and accessories a person is
wearing; (iii) pose estimation; and (iv) action/activity prediction.
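To give a flavor of the simplest end of this spectrum, the sketch below runs OpenCV's
classical Haar-cascade face detector, which is the usual first step before identity,
emotion, or attribute prediction; the file names are placeholders.

```python
import cv2

# Classical Haar-cascade face detector shipped with OpenCV.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("group_photo.jpg")  # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Detect faces and draw a bounding box around each one.
for (x, y, w, h) in face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5):
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("group_photo_faces.jpg", img)
```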
Human Behavior Analysis
Our research on human behavior analysis lies at the intersection of Computer Vision,
Signal Processing, and Machine Learning. Human behavior is inherently multimodal and
hence requires combining information from other modalities (speech or language, for
example) with vision. Through the confluence of these techniques, our goal is to
provide a quantitative understanding of individual, group, and social human behavior
in domains such as media, education, and health.
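One simple way to combine modalities is late fusion: average the class probabilities
produced by independent per-modality classifiers. The toy sketch below uses made-up
scores and labels purely for illustration; it is not a description of our models.

```python
import numpy as np

# Made-up class probabilities from three independent per-modality
# classifiers, e.g. for the labels (engaged, neutral, distracted).
p_vision = np.array([0.70, 0.20, 0.10])
p_speech = np.array([0.50, 0.30, 0.20])
p_text   = np.array([0.40, 0.35, 0.25])

# Late fusion: average the per-modality probabilities.
p_fused = (p_vision + p_speech + p_text) / 3
print("fused probabilities:", p_fused, "-> predicted class:", int(p_fused.argmax()))
```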
Perception/Vision for Robotics
Today, robots perform challenging tasks that were not possible a few years ago
because of limited computational and sensor resources. To perform these complex
tasks, robots need to sense and understand the environment around them. Depending
upon the task at hand, robots are equipped with different sensors to perceive their
environment. Two important categories of perception sensors mounted on a robotic
platform are (i) range sensors (3D/2D lidars, radars, sonars, etc.) and (ii) cameras
(perspective, stereo, omnidirectional, etc.). With recent advances in these sensing
technologies, the capabilities of robots to perform difficult tasks have been
greatly extended. The Computer Vision Group at IIT Kanpur is interested in research
problems related to sensing for robotics applications. One such example is
autonomous robot navigation, where computer vision techniques are used for robot
localization and for obstacle detection and classification.
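As a toy illustration of the range-sensing side, the sketch below flags nearby
returns in a simulated 2D lidar scan by simple distance thresholding. The function
and parameter names are our own placeholders, loosely modeled on common laser-scan
message conventions; a real system would do far more than threshold distances.

```python
import numpy as np

def detect_obstacles(ranges, angle_min, angle_increment, max_dist=1.0):
    """Return (bearing, distance) pairs for scan returns closer than max_dist."""
    ranges = np.asarray(ranges, dtype=float)
    angles = angle_min + angle_increment * np.arange(len(ranges))
    near = np.isfinite(ranges) & (ranges < max_dist)
    return list(zip(angles[near], ranges[near]))

# Simulated scan: 180 one-degree beams over the front half-plane,
# all clear at 5 m except one return 0.6 m away at roughly -45 degrees.
scan = np.full(180, 5.0)
scan[45] = 0.6
print(detect_obstacles(scan, angle_min=-np.pi / 2, angle_increment=np.pi / 180))
```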
Assistive Computer Vision
In this research direction, we investigate the scope of computer vision for
assisting human beings in day-to-day life, and we develop algorithms for a class of
related problems. This area is becoming increasingly practical with the popularity
of wearable cameras (e.g., Google Glass) and lightweight computing devices (e.g.,
mobile phones). Our use case centers on a wearable or portable camera capturing the
surrounding world as still images or video streams. The goal is to provide
appropriate inputs to the user to help with a specific set of tasks. For instance, a
visually challenged person can use a wearable camera to perceive the surroundings.
The automated understanding of the image/video content is then used as an alternate
input to enrich the person's interaction with the external world: for example, to
read text or signs, locate a specific object of interest, estimate an object's pose
for manipulation, or recognize the identities and facial expressions of people
nearby.
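For instance, the text-reading use case above could, at its simplest, rest on
off-the-shelf OCR. The sketch below assumes the pytesseract wrapper around the
Tesseract engine, with a placeholder file standing in for a frame from a wearable
camera.

```python
from PIL import Image
import pytesseract  # thin Python wrapper around the Tesseract OCR engine

# Placeholder: a frame grabbed from a wearable or phone camera.
frame = Image.open("street_sign.jpg")

# Run OCR on the frame; the recognized text could then be read
# aloud to the user by a text-to-speech engine.
text = pytesseract.image_to_string(frame)
print("Detected text:", text.strip())
```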