CS671: Natural Language Processing

Department of Computer Science & Engineering, IIT Kanpur

Jul - Nov 2015


Homework 1 - Finding Syllables in Indic Languages

Problem Statement

In this homework you need to
  1. create a corpus of 1MB in your chosen Indian language. To identify sources in the corpus, keep a header at the top of each file (or above each break in a single file) indicating the source, because results can vary considerably with the type of data. Use a header of this type:
        %%%%% WEBPAGE / SOURCE
        %%%%% CS671 : "your email"  YYMMDD (date)
        [text] 
    
  2. implement an algorithm for detecting syllables in multiple languages via a flexible "grammar". If possible, use a command line switch of the type -hindi, -latinEnglish, -latinDevnagSansk, -telugu etc.
  3. apply your algorithm to
    1. your corpus in your language; list the top syllables and their frequencies. Also, please give a dense plot showing the decreasing log frequency (y-axis) vs the top 1000 syllables (x-axis)
    2. a second Unicode language. (You can try the latinDevanagari font as in bgita.txt; or, if your language in step 1 was not Hindi, you can try the Hindi in hwiki.txt.)
    3. Optional: Run it on a small English corpus.
If possible, please compare your syllable detection with the results obtained by this FSA, which was created some years ago for Hindi syllables by Nikhil Joshi. (Do not use it blindly; there can be many syllable grammars.)
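
As a starting point for step 2, here is a minimal sketch of a table-driven syllable "grammar" in Python. It is only one possible grammar, not the reference solution and not the FSA mentioned above. It splits text into orthographic syllables (aksharas) of the form (consonant + virama)* consonant [vowel sign] [modifier], or an independent vowel; it does not model phonological effects such as schwa deletion. The names make_grammar, GRAMMARS, strip_headers and syllabify are hypothetical, and the Unicode ranges cover only the core Devanagari and Telugu letters.

```python
import re

def make_grammar(consonants, vowels, matras, virama, modifiers, nukta=""):
    """Compile a regex that matches one orthographic syllable (akshara).

    Each argument is the inside of a regex character class for one script.
    """
    c = f"[{consonants}]" + (f"{nukta}?" if nukta else "")
    cluster = f"(?:{c}{virama})*{c}"      # conjunct such as the cluster in स्ते
    tail = f"(?:{virama}|[{matras}])?"    # bare halant, or a dependent vowel sign
    return re.compile(f"(?:{cluster}{tail}|[{vowels}])[{modifiers}]?")

# Hypothetical switch table; add one entry per language you support.
GRAMMARS = {
    "-hindi": make_grammar(
        consonants="\u0915-\u0939\u0958-\u095F",
        vowels="\u0904-\u0914",
        matras="\u093E-\u094C",
        virama="\u094D",
        modifiers="\u0901-\u0903",   # chandrabindu, anusvara, visarga
        nukta="\u093C",
    ),
    "-telugu": make_grammar(
        consonants="\u0C15-\u0C39",
        vowels="\u0C05-\u0C14",
        matras="\u0C3E-\u0C4C",
        virama="\u0C4D",
        modifiers="\u0C01-\u0C03",
    ),
}

def strip_headers(text):
    """Drop the '%%%%%' source-header lines so they are not counted as data."""
    lines = text.splitlines()
    return "\n".join(l for l in lines if not l.lstrip().startswith("%%%%%"))

def syllabify(text, switch="-hindi"):
    """Return the syllables of `text` in order, using the grammar for `switch`."""
    return GRAMMARS[switch].findall(strip_headers(text))
```

For example, syllabify("नमस्ते भारत", "-hindi") yields न, म, स्ते, भा, र, त. Keeping the per-script character classes in a table means a new language needs only a new GRAMMARS entry, which is one way to read the "flexible grammar" requirement.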

Submission

Please create an index.html file with all images of the plots. All submissions will be online via your home.iitk web area accessible through the URLs of the following kind: http://home.iitk.ac.in/~YOUR_USER_ID/cs671/hw1/. See the Submissions column in the Students page. The file "index.html" in this area should report
  1. list of top syllables and syllable bigrams from 3.1 and 3.2 above.
  2. a plot of the log-frequency distribution for the top 1000 syllables
  3. link to the corpus created by you
  4. code.zip should be uploaded and linked from your index.html 2 days AFTER the due date.
NOTE: Please DO NOT use global paths as in <a href="home.iitk.ac.in/~USERID/cs365/FILE"> to refer to something within your home area. Instead use local paths as in <a href="FILE">.
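
The counts and plot data requested for index.html can be derived from a flat list of syllables. The following is a rough sketch using only the Python standard library; the function names are hypothetical, base-10 logarithms are an assumption (the assignment does not fix a base), and the bigrams here are taken over the flat list, so they also span word boundaries unless you split by word first.

```python
import math
from collections import Counter

def top_counts(items, n=1000):
    """Return the n most frequent items as (item, count), most frequent first."""
    return Counter(items).most_common(n)

def bigrams(syllables):
    """Return adjacent syllable pairs from a flat list of syllables."""
    return list(zip(syllables, syllables[1:]))

def log_frequency_series(items, n=1000):
    """Return (rank, log10(count)) points for the top n items.

    Rank 1 is the most frequent item, so the series is non-increasing
    and can be passed directly to any plotting tool.
    """
    ranked = top_counts(items, n)
    return [(rank, math.log10(count))
            for rank, (_, count) in enumerate(ranked, start=1)]
```

Calling top_counts(bigrams(syllables), 20) gives the top syllable bigrams in the same (item, count) form.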

Due date: Aug 7, 2015, 10 PM

Relevant Resources