CS671

Homework 1

Syllabification for Hindi in Devanagari Script

Link to code


Corpus created from Premchand’s novel “गोदान” (Godaan)

Corpus Statistics

Words: 88,005

Characters with spaces: 289,923

Characters excluding spaces: 197,901
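Counts of this kind can be computed in a few lines of Python. The following is a minimal sketch, assuming that words are whitespace-separated tokens and that every Unicode code point counts as one character (so a consonant and its vowel sign count as two). The function name `corpus_stats` is hypothetical, and the figures in this report may have been produced with a tool that uses different conventions.

```python
def corpus_stats(text):
    """Return word and character counts for a text.

    Assumptions: words are whitespace-separated tokens, and each
    Unicode code point counts as one character.
    """
    words = text.split()
    return {
        "words": len(words),
        "chars_with_spaces": len(text),
        "chars_excluding_spaces": sum(1 for ch in text if not ch.isspace()),
    }
```

To apply it to a corpus file, read the file as UTF-8 and pass the resulting string to `corpus_stats`.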

Syllabifying the corpus with Monojit Choudhury’s algorithm [1] gave the following results:

Syllabification Statistics

Number of total syllables: 139,466

Number of distinct syllables: 3,030

Number of total bigrams: 139,465

Number of distinct bigrams: 38,178
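To illustrate what splitting Devanagari text into syllable-like units involves, here is a much simpler orthographic segmentation. It is not the algorithm from [1]: it only groups each consonant cluster with its vowel sign and modifiers into one akshara, and it ignores schwa deletion, so it would not reproduce the counts reported here. The function names are hypothetical, and the input is assumed to be a single word made up only of Devanagari letters and signs.

```python
VIRAMA = "\u094D"  # halant: suppresses the inherent vowel and joins consonants


def is_consonant(ch):
    # Basic consonants (U+0915..U+0939) and the precomposed nukta
    # consonants (U+0958..U+095F).
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def is_independent_vowel(ch):
    # Independent vowel letters (U+0904..U+0914).
    return "\u0904" <= ch <= "\u0914"


def split_aksharas(word):
    """Split a Devanagari word into orthographic syllables (aksharas).

    A new unit starts at a consonant or an independent vowel, unless the
    consonant directly follows a virama, in which case it continues the
    conjunct. Every other character (vowel sign, nukta, virama, anusvara,
    candrabindu, visarga) attaches to the current unit.
    """
    units = []
    for ch in word:
        starts_unit = is_consonant(ch) or is_independent_vowel(ch)
        joins_conjunct = (
            bool(units) and units[-1].endswith(VIRAMA) and is_consonant(ch)
        )
        if units and (not starts_unit or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units
```

For example, `split_aksharas("गोदान")` yields the three units गो, दा and न. A pronunciation-based syllabifier would instead be expected to merge the final न into the preceding syllable after schwa deletion, which is the kind of rule this sketch leaves out.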

The following table lists the relevant files:

File links

1 MB Hindi corpus

Corpus with syllable boundaries marked

Unigram frequencies

Bigram frequencies
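Unigram and bigram frequency tables like the ones listed above can be built from a syllabified corpus with a short counting routine. The sketch below assumes the corpus has already been turned into one flat list of syllables; the function name `ngram_counts` is hypothetical. A sequence of n syllables contains n - 1 adjacent pairs, which is consistent with the reported 139,466 syllables and 139,465 bigrams and suggests that bigrams were counted over one continuous stream rather than separately within each word.

```python
from collections import Counter


def ngram_counts(syllables):
    """Count unigrams and adjacent-pair bigrams in one continuous sequence."""
    unigrams = Counter(syllables)
    # zip pairs each syllable with its successor, giving len(syllables) - 1 pairs.
    bigrams = Counter(zip(syllables, syllables[1:]))
    return unigrams, bigrams
```

The total counts are `sum(unigrams.values())` and `sum(bigrams.values())`, and the numbers of distinct items are `len(unigrams)` and `len(bigrams)`.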

Unigram frequency plot:

Unigram log frequency plot:


Bigram frequency plot (top 1000 bigrams):

Bigram log frequency plot (top 1000 bigrams):
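The data behind plots of this kind can be prepared as follows. This is a sketch under assumptions: it takes the plots to show frequency against rank (most frequent item first) and uses a base-10 logarithm, neither of which is stated in the report. The function name `rank_frequency` and the `top` parameter are hypothetical; `top=1000` would correspond to the top 1000 bigrams.

```python
import math
from collections import Counter


def rank_frequency(counts, top=None):
    """Return (rank, frequency, log10 frequency) rows, most frequent first.

    `counts` maps each item to a positive frequency; `top` limits the
    number of rows, and None keeps all of them.
    """
    ranked = Counter(counts).most_common(top)
    return [
        (rank, freq, math.log10(freq))
        for rank, (_, freq) in enumerate(ranked, start=1)
    ]
```

The resulting rows can be passed to any plotting tool, with rank on the horizontal axis and either the raw or the log frequency on the vertical axis.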

[1] Choudhury, Monojit. "Rule-based grapheme to phoneme mapping for hindi speech synthesis." 90th Indian Science Congress of the International Speech Communication Association (ISCA), Bangalore, India. 2003.