Homework 1 - Finding Syllables in Indic Languages
Problem Statement
In this homework you need to
- create a corpus of 1 MB in your chosen Indian language. To identify
sources in the corpus, keep a header at the top of each file (or above each
source boundary within a single file) naming the source, because results
can vary considerably with the type of data. Use a header of this
type:
%%%%% WEBPAGE / SOURCE %%%%% CS671 : "your email" YYMMDD (date) [text]
- implement an algorithm for detecting syllables in multiple
languages via a flexible "grammar". If possible, use a command line
switch of the type
-hindi, -latinEnglish, -latinDevnagSansk, -telugu
etc.
- apply your algorithm to
  - your corpus in your language; list the top syllables and their frequencies. Also give a dense plot of log frequency (y-axis) against the top 1000 syllables in decreasing order of frequency (x-axis).
  - a second Unicode language (you can try the latinDevanagari font as in bgita.txt; or, if the language of your own corpus was not Hindi, you can try the Hindi in hwiki.txt).
  - Optional: run it on a small English corpus.
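As a rough illustration of what a flexible "grammar" could look like, the Python sketch below describes each script by its Unicode character classes and builds one regular expression per language. All names in it (GRAMMARS, build_pattern, syllabify, top_syllables) are hypothetical and not part of the assignment. It splits text into orthographic syllables (aksharas) only and makes no attempt at phonological effects such as Hindi schwa deletion, which the papers under Relevant Resources discuss. Lines starting with %%%%% are assumed to be source headers and are skipped.

```python
import re
from collections import Counter

# Hypothetical grammar table: Unicode character classes per script.
# The ranges are an assumption covering the common letters only.
GRAMMARS = {
    "hindi": {
        "consonant": "\u0915-\u0939\u0958-\u095F",
        "vowel": "\u0904-\u0914\u0960\u0961",
        "matra": "\u093E-\u094C\u0962\u0963",
        "virama": "\u094D",
        "nukta": "\u093C",
        "modifier": "\u0900-\u0903",   # candrabindu, anusvara, visarga
    },
    "telugu": {
        "consonant": "\u0C15-\u0C39",
        "vowel": "\u0C05-\u0C14",
        "matra": "\u0C3E-\u0C4C",
        "virama": "\u0C4D",
        "nukta": "",                   # not used in this sketch
        "modifier": "\u0C00-\u0C03",
    },
}


def build_pattern(grammar):
    """Compile a regex matching one akshara for the given grammar."""
    cons = f"[{grammar['consonant']}]"
    if grammar["nukta"]:
        cons += f"[{grammar['nukta']}]?"
    virama = f"[{grammar['virama']}]"
    matra = f"[{grammar['matra']}]"
    mods = f"[{grammar['modifier']}]*"
    vowel = f"[{grammar['vowel']}]"
    # (C virama)* C (virama | matra)? modifiers*   or   V modifiers*
    return re.compile(
        f"(?:{cons}{virama})*{cons}(?:{virama}|{matra})?{mods}|{vowel}{mods}"
    )


def syllabify(text, lang):
    """Return the aksharas found in text; other characters are ignored."""
    pattern = build_pattern(GRAMMARS[lang])
    return [m.group(0) for m in pattern.finditer(text)]


def top_syllables(lines, lang, n=20):
    """Count aksharas over an iterable of lines, skipping %%%%% headers."""
    counts = Counter()
    for line in lines:
        if line.startswith("%%%%%"):
            continue
        counts.update(syllabify(line, lang))
    return counts.most_common(n)
```

A switch such as -hindi would then only need to select the matching key in GRAMMARS, and supporting another script means adding one more entry rather than changing the matching code.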
Submission
Please create an index.html file with all images of the plots.
All submissions will be online via your home.iitk web area, accessible
through a URL of the following kind:
http://home.iitk.ac.in/~YOUR_USER_ID/cs671/hw1/
See the Submissions column in the Students page.
The file "index.html" in this area should contain
- a list of the top syllables and syllable bigrams for your corpus and for the second language (items c(i) and c(ii) of the problem statement).
- a plot of the log-frequency distribution for the top 1000 syllables.
- a link to the corpus created by you.
- a link to code.zip, which should be uploaded 2 days AFTER the due date.
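The bigram list and the data behind the log-frequency plot can be derived from a flat list of syllables. The sketch below is one possible way to do it, under some assumptions: the function names are hypothetical, logarithms are taken to base 10 (the assignment does not fix a base), and bigrams are counted over the whole sequence, so pairs that straddle a word boundary are included.

```python
import math
from collections import Counter


def top_bigrams(syllables, n=20):
    """Return the n most frequent adjacent syllable pairs with counts."""
    pairs = zip(syllables, syllables[1:])
    return Counter(pairs).most_common(n)


def log_frequency_curve(syllables, top=1000):
    """Return (rank, log10(count)) points for the most frequent syllables.

    Ranks start at 1 and the log counts are non-increasing, so the
    points can be passed directly to any plotting tool.
    """
    ranked = Counter(syllables).most_common(top)
    return [(rank, math.log10(count))
            for rank, (_, count) in enumerate(ranked, start=1)]
```

The x values are ranks rather than the syllables themselves, which keeps the plot readable for 1000 points; the syllables for the first few ranks can be listed separately in index.html.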
Relevant Resources
- Monojit Choudhury; Rule-based grapheme to phoneme mapping for Hindi speech synthesis; Proceedings of 90th Indian Science Congress of the International Speech Communication Association (ISCA), Bangalore, India; 2003
- Pramod Pandey; Akshara-to-sound rules for Hindi; Writing Systems Research, 2014, Vol. 6, No. 1, pp. 54-72