Abstract
Document clustering groups documents into a small number of coherent clusters using an appropriate similarity measure. It plays a vital role in document organization, topic extraction and information retrieval. As the number of high-dimensional text datasets on the internet keeps growing, so does the need for efficient clustering algorithms. Many of these documents share a large proportion of lexically equivalent terms. We will exploit this property by using a “bag of words” model to represent the content of a document, and we will group “similar” documents together to form a coherent cluster. This “similarity” can be defined in various ways: in the vector space it is closely related to the notion of distance, which itself has several definitions. We will test which similarity measure performs best across various domains of text articles in English and Hindi.
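To make the idea of a “bag of words” and of competing similarity measures concrete, here is a minimal sketch in Python. It is only an illustration and not the project's code: the function names (`bag_of_words`, `cosine_similarity`, `euclidean_distance`) are hypothetical, tokenization is assumed to be a plain lowercase whitespace split, and cosine similarity and Euclidean distance are used only as two common examples of the measures that could be compared.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Represent a document as term counts (assumed: whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (1.0 = same direction)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two term-count vectors (0.0 = identical)."""
    # Counter returns 0 for terms missing from one of the documents.
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in a.keys() | b.keys()))


if __name__ == "__main__":
    d1 = bag_of_words("the cat sat on the mat")
    d2 = bag_of_words("the cat lay on the rug")
    d3 = bag_of_words("stock markets fell sharply")
    print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
    print(euclidean_distance(d1, d2), euclidean_distance(d1, d3))
```

Note that the two measures point in opposite directions: a higher cosine value means more similar documents, while a higher Euclidean distance means less similar ones. A whitespace split is a rough tokenizer for both English and Hindi; a real pipeline would likely add normalization such as stop-word removal or term weighting.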
Downloads
Project Proposal
Slides
Poster
Report
Code