Document Clustering: Similarity Measures

Shouvik Sachdeva (11693)
Bhupendra Kastore (11204)

Abstract

Document clustering groups documents into a small number of coherent clusters using an appropriate similarity measure. It plays a vital role in document organization, topic extraction, and information retrieval. With the ever-increasing number of high-dimensional datasets on the internet, the need for efficient clustering algorithms has grown. Many of these documents share a large proportion of lexically equivalent terms. We will exploit this property by using a "bag of words" model to represent the content of each document, and we will group "similar" documents together to form coherent clusters. This "similarity" can be defined in various ways: in the vector space, it is closely related to the notion of distance, which itself has several definitions. We will test which similarity measure performs best across various domains of text articles in English and Hindi.
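
To make the link between the "bag of words" representation and distance-based similarity concrete, here is a minimal Python sketch. It is illustrative only: the abstract does not list which measures are compared, so cosine similarity and Euclidean distance are assumed here as two common examples, and the function names and whitespace tokenization are hypothetical simplifications rather than the project's actual code.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as term counts (assumed: lowercase, whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors; 0.0 if either is empty."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two term-count vectors."""
    # A Counter returns 0 for a term it has not seen, so the union is safe.
    terms = a.keys() | b.keys()
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in terms))


if __name__ == "__main__":
    doc1 = bag_of_words("the match was won by the home team")
    doc2 = bag_of_words("the home team won the match")
    doc3 = bag_of_words("stock prices fell sharply today")
    print(cosine_similarity(doc1, doc2), cosine_similarity(doc1, doc3))
    print(euclidean_distance(doc1, doc2), euclidean_distance(doc1, doc3))
```

Note that the two measures point in opposite directions: a higher cosine value means more similar documents, while a higher Euclidean distance means less similar ones. Cosine similarity also ignores document length, which is one reason the choice of measure can change the resulting clusters.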

Downloads

Project Proposal
Slides
Poster
Report
Code