CS671A - Assignment 1

For Hindi


The following is the syllable log-frequency plot for a 1 MB Hindi corpus compiled from several Wikipedia articles:
The corpus is available here.
This is the list of the top 1000 syllables in the corpus, and this is the list of the top 1000 bigrams.
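
As an illustration of how such syllable and bigram lists might be produced, here is a minimal sketch in Python. It is not the assignment's actual code. It assumes that a syllable is a Devanagari akshara (an optional consonant cluster joined by viramas, a vowel sign, and trailing modifiers such as anusvara) and that bigrams are pairs of adjacent syllables within one whitespace-separated word. The names `AKSHARA`, `syllabify`, and `count_syllables` are hypothetical.

```python
import re
from collections import Counter

# Assumed syllable model: a Devanagari akshara.
CONSONANT = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant, optional nukta
VIRAMA = "\u094D"                                   # halant, joins a cluster
MATRA = "[\u093E-\u094C]"                           # dependent vowel signs
MODIFIER = "[\u0901-\u0903]"                        # chandrabindu, anusvara, visarga
VOWEL = "[\u0904-\u0914]"                           # independent vowels

AKSHARA = re.compile(
    f"(?:(?:{CONSONANT}{VIRAMA})*{CONSONANT}(?:{MATRA}|{VIRAMA})?|{VOWEL}){MODIFIER}*"
)


def syllabify(word):
    """Split one word into aksharas; characters outside the model are skipped."""
    return AKSHARA.findall(word)


def count_syllables(text):
    """Count syllables and within-word syllable bigrams in a text."""
    unigrams, bigrams = Counter(), Counter()
    for word in text.split():
        syllables = syllabify(word)
        unigrams.update(syllables)
        bigrams.update(zip(syllables, syllables[1:]))
    return unigrams, bigrams


if __name__ == "__main__":
    sample = "नमस्ते भारत भारत"
    unigrams, bigrams = count_syllables(sample)
    # most_common(1000) would give a "top 1000" list for a full corpus.
    for syllable, count in unigrams.most_common(1000):
        print(syllable, count)
```

Since Marathi is also written in Devanagari, the same pattern should apply to both corpora under this assumption, although a more careful syllabifier might treat schwa deletion or rare vowel signs differently.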



For Marathi


The following is the syllable log-frequency plot for a small Marathi corpus compiled from several Wikipedia articles:
The corpus is available here.
This is the list of the top 1000 syllables in the corpus, and this is the list of the top 1000 bigrams.
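
A log-frequency plot like the ones described for these corpora can be derived from a table of syllable counts. The sketch below is one possible way to do it, not the method used for the plots on this page: it assumes base-10 logarithms and a rank-ordered x-axis, neither of which is stated here. The function names `log_frequency_series` and `plot_series` are hypothetical, and the plotting helper assumes matplotlib is installed.

```python
import math
from collections import Counter


def log_frequency_series(counts):
    """Return (rank, log10(count)) pairs, most frequent item first."""
    ranked = sorted(counts.values(), reverse=True)
    return [(rank, math.log10(c)) for rank, c in enumerate(ranked, start=1)]


def plot_series(series, path="syllable_log_frequency.png"):
    """Save a rank vs. log-frequency plot (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ranks = [rank for rank, _ in series]
    logs = [value for _, value in series]
    plt.plot(ranks, logs)
    plt.xlabel("Syllable rank")
    plt.ylabel("log10(frequency)")
    plt.savefig(path)


if __name__ == "__main__":
    # Toy counts standing in for real corpus statistics.
    toy_counts = Counter({"ka": 100, "ra": 10, "ma": 1})
    print(log_frequency_series(toy_counts))
```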



A list of the top 1000 syllables for hwiki.txt can be found here, and the bigram list for this corpus is available here.


The code can be downloaded from here.