Syllabification for Hindi in Devanagari Script
The corpus was created from Premchand’s novel “गोदान” (Godaan).
Corpus Statistics | |
Words | 88,005 |
Characters with spaces | 289,923 |
Characters excluding spaces | 197,901 |
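As a rough illustration of how figures like those in the table could be computed, the sketch below counts whitespace-separated words and characters in a text. The function name `corpus_stats` is hypothetical, and the sketch counts Unicode code points, so a consonant followed by a vowel sign counts as two characters. The tool actually used for the table above is not stated and may count differently.

```python
def corpus_stats(text):
    """Return word and character counts for a piece of text.

    Assumption: words are separated by whitespace and characters are
    Unicode code points (not user-perceived grapheme clusters).
    """
    words = text.split()
    return {
        "words": len(words),
        "chars_with_spaces": len(text),
        "chars_without_spaces": sum(1 for ch in text if not ch.isspace()),
    }


if __name__ == "__main__":
    sample = "राम घर गया"  # toy input, not taken from the corpus
    print(corpus_stats(sample))
```

For the real corpus, the text would be read from a UTF-8 file and passed to the same function.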
Syllabifying the corpus with Monojit Choudhury’s algorithm [1] gave the following results:
Syllabification Statistics | |
Total syllables | 139,466 |
Distinct syllables | 3,030 |
Total bigrams | 139,465 |
Distinct bigrams | 38,178 |
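The total bigram count is exactly one less than the total syllable count, which is what one would expect if bigrams are adjacent syllable pairs taken over the whole corpus as a single sequence. A minimal sketch of that counting follows. It assumes the syllabifier's output is already available as a flat list of syllable strings, it does not implement the algorithm from [1], and `ngram_stats` is a hypothetical name.

```python
from collections import Counter


def ngram_stats(syllables):
    """Count syllable unigrams and adjacent-pair bigrams.

    Assumption: `syllables` is one flat list for the whole corpus, so
    bigrams also span word boundaries.
    """
    unigrams = Counter(syllables)
    # zip pairs each syllable with its successor: n syllables -> n - 1 bigrams
    bigrams = Counter(zip(syllables, syllables[1:]))
    return {
        "total_syllables": sum(unigrams.values()),
        "distinct_syllables": len(unigrams),
        "total_bigrams": sum(bigrams.values()),
        "distinct_bigrams": len(bigrams),
    }


if __name__ == "__main__":
    toy = ["गो", "दा", "न", "गो", "दा"]  # toy input, not real corpus output
    print(ngram_stats(toy))
```

If bigrams were instead counted only within words, the total would be smaller than the syllable count by the number of words rather than by one.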
The following table lists the relevant files:
File links | |
Unigram frequency plot:
Unigram log frequency plot:
Bigram frequency plot (top 1000 bigrams):
Bigram log frequency plot (top 1000 bigrams):
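The data behind frequency and log frequency plots like these can be prepared as in the sketch below, which ranks items by frequency and attaches a base-10 logarithm of each count. The function name `rank_frequency` is hypothetical, and the log base and the `top` cutoff are assumptions; the page does not say how the original plots were drawn.

```python
import math
from collections import Counter


def rank_frequency(counts, top=None):
    """Return (rank, frequency, log10 frequency) rows, most frequent first.

    `counts` is a Counter of unigrams or bigrams; `top` optionally keeps
    only the most frequent items (for example 1000 for the bigram plots).
    """
    rows = []
    for rank, (_, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, freq, math.log10(freq)))
    return rows


if __name__ == "__main__":
    toy_counts = Counter({"a": 100, "b": 10, "c": 1})  # toy data
    for row in rank_frequency(toy_counts, top=2):
        print(row)
```

The rank and frequency columns can then be passed to any plotting library, using the third column for the log frequency version.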
[1] Choudhury, Monojit. "Rule-based grapheme to phoneme mapping for Hindi speech synthesis." 90th Indian Science Congress of the International Speech Communication Association (ISCA), Bangalore, India. 2003.