Corpus:
The corpus is a Hindi-language text taken from a Hindi novel. The text file is available [here]
Plot of syllable frequency vs syllable index for the Hindi text:
Plot of bigram frequency vs syllable index for the Hindi text:
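A frequency-vs-index plot of this kind can be sketched as follows. This is only an illustration, not the code used to produce the plots: it assumes the index is the rank of each syllable when sorted by descending frequency, the function names `rank_frequency` and `plot_rank_frequency` are hypothetical, and the sample values are the 16 Hindi syllable counts listed in this section rather than the full distribution.

```python
def rank_frequency(counts):
    """Sort counts in descending order and pair them with 1-based ranks."""
    freqs = sorted(counts, reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs


def plot_rank_frequency(counts, title):
    """Plot frequency against rank (assumed meaning of 'syllable index')."""
    # Imported here so that rank_frequency works without matplotlib installed.
    import matplotlib.pyplot as plt

    ranks, freqs = rank_frequency(counts)
    plt.plot(ranks, freqs, marker="o")
    plt.xlabel("Syllable index (rank)")
    plt.ylabel("Frequency")
    plt.title(title)
    plt.show()


# Counts of the 16 most frequent Hindi syllables reported in this section.
hindi_top_counts = [3253, 2874, 2376, 2220, 1988, 1870, 1721, 1626,
                    1609, 1598, 1488, 1400, 1358, 1342, 1315, 1287]

ranks, freqs = rank_frequency(hindi_top_counts)
# To draw the figure: plot_rank_frequency(hindi_top_counts, "Hindi syllables")
```

Passing the full list of counts instead of the top 16 would give the complete curve.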
Most frequently occurring syllables:
उ 3253
अ 2874
ए 2376
आ 2220
या 1988
से 1870
है 1721
का 1626
हो 1609
मे 1598
के 1488
ने 1400
की 1358
था 1342
कि 1315
औ 1287
Most frequently occurring bigrams:
जॉन 768
देख 599
क्य 509
प्र 503
नेक 452
कुछ 445
बोल 424
फिर 386
जान 384
लेक 383
तुम 346
स्त 315
किय 301
हूव 299
किस 290
मेर 287
Results on the second language, Sanskrit:
The corpus can be found [here]
Plot of syllable frequency vs syllable index for Sanskrit Language Text:
Most popular syllables:
अ 680
त 646
या 546
वा 414
ष 378
वि 375
र 371
ना 357
ता 320
ति 316
रा 294
उ 287
पा 267
क 243
न 210
नि 198
Plot of bigram frequency vs syllable index for Sanskrit Language Text:
Most popular bigrams:
त्र 371
र्व 287
स्य 274
त्य 207
क्ष 201
न्त 195
याय 168
त्त 164
ष्ट 151
त्व 134
क्त 129
ध्य 125
श्र 112
न्य 112
वान 102
ण्ड 99
The code for finding syllables can be found [here]
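The linked code is not reproduced here. As a rough idea of how such a count might be done, the sketch below splits Devanagari text into units and counts them. It is an assumption, not the author's method: each unit is taken to be one Devanagari letter plus any combining marks that follow it (vowel signs, virama, anusvara), and the names `is_devanagari` and `split_units` are hypothetical. The original code may treat conjuncts or nasal marks differently.

```python
import unicodedata
from collections import Counter


def is_devanagari(ch):
    """True if the character lies in the Devanagari Unicode block."""
    return "\u0900" <= ch <= "\u097F"


def split_units(text):
    """Split text into units: a Devanagari letter plus its trailing marks.

    Assumed rule (not taken from the original code): a letter starts a new
    unit, combining marks attach to the current unit, and anything else
    (spaces, punctuation, digits, other scripts) ends the current unit.
    """
    units = []
    current = ""
    for ch in text:
        category = unicodedata.category(ch)
        if is_devanagari(ch) and category.startswith("L"):
            if current:
                units.append(current)
            current = ch
        elif is_devanagari(ch) and category.startswith("M") and current:
            current += ch
        else:
            if current:
                units.append(current)
            current = ""
    if current:
        units.append(current)
    return units


sample = "क्या है"
syllable_counts = Counter(split_units(sample))
print(syllable_counts.most_common(16))
```

Under this rule a consonant followed by a virama forms its own unit, which is consistent with entries such as क्य and प्र appearing in the bigram lists above.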
The code for finding bigrams can be found [here]
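Bigram counting can be sketched in the same spirit. The example below is a minimal illustration under stated assumptions, not the linked implementation: a bigram is taken to be two adjacent units within the same word, the input is assumed to be already split into units per word, and `count_bigrams` is a hypothetical name. Whether the original code lets bigrams span word boundaries is not stated in this document.

```python
from collections import Counter


def count_bigrams(words):
    """Count adjacent unit pairs inside each word.

    `words` is a list of words, each given as a list of units.
    Pairs never span two words (an assumption of this sketch).
    """
    counts = Counter()
    for units in words:
        for left, right in zip(units, units[1:]):
            counts[left + right] += 1
    return counts


# Hand-split sample words, used only for illustration.
words = [["दे", "ख"], ["दे", "ख", "ना"], ["कु", "छ"]]
bigram_counts = count_bigrams(words)
print(bigram_counts.most_common(16))
```

Single-unit words contribute no bigrams, since `zip(units, units[1:])` is empty for them.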