This page lists the results obtained by running the code on two texts: one in Bengali, the other in Hindi.
The plots use a logarithmic scale, with the y-axis showing log10(frequency). The number above each bar is the actual frequency.
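The quantities behind such a plot can be sketched in a few lines. This is only an illustration, not the code from the archive: the function names top_frequencies and bar_heights are hypothetical, and the default of 1000 simply mirrors the "Top 1000" result lists on this page.

```python
import math
from collections import Counter

def top_frequencies(items, n=1000):
    """Return the n most frequent items as (item, count) pairs, most frequent first."""
    return Counter(items).most_common(n)

def bar_heights(ranked):
    """For each (item, count) pair, add the bar height log10(count).

    The raw count is kept so it can be printed above the bar.
    """
    return [(item, count, math.log10(count)) for item, count in ranked]

if __name__ == "__main__":
    words = "a b a c a b".split()
    for item, count, height in bar_heights(top_frequencies(words, 2)):
        print(item, count, round(height, 3))
```

The same two functions would apply to words, letters or syllables, since each is just a list of strings to count.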
The algorithm for finding syllables is similar to the FSA found here.
However, the code uses its own character lists: independents (vowels and consonants), dependents (matras and some special characters), and a halant/hasant character.
This list-based algorithm is not very accurate and produces several errors.
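A list-based splitter of this kind might look like the sketch below. It is one plausible reading of the description above, not the actual code: the character sets are assumptions built from the Unicode Bengali block, and the name split_syllables is hypothetical. A syllable starts at an independent character, absorbs any dependents that follow, and a halant glues the next independent onto the current syllable.

```python
# Assumed character lists, taken from the Unicode Bengali block (U+0980-U+09FF).
HALANT = "\u09cd"
# Independents: vowels and consonants, plus khanda ta and the rra/rha/yya letters.
INDEPENDENT = {chr(c) for c in range(0x0985, 0x09BA)} | {"\u09ce", "\u09dc", "\u09dd", "\u09df"}
# Dependents: chandrabindu/anusvara/visarga, nukta, matras, au length mark.
DEPENDENT = (
    {chr(c) for c in range(0x0981, 0x0984)}
    | {"\u09bc"}
    | {chr(c) for c in range(0x09BE, 0x09CD)}
    | {"\u09d7"}
)

def split_syllables(word):
    """Split a word into orthographic syllables using the three character lists."""
    syllables = []
    current = ""
    join_next = False  # True right after a halant: the next independent joins in
    for ch in word:
        if ch in INDEPENDENT:
            if current and not join_next:
                syllables.append(current)
                current = ""
            current += ch
            join_next = False
        elif ch == HALANT:
            if current:
                current += ch
                join_next = True
        elif ch in DEPENDENT:
            if current:
                current += ch
            join_next = False
        else:
            # Any other character (punctuation, digits, other scripts) ends the syllable.
            if current:
                syllables.append(current)
                current = ""
            join_next = False
    if current:
        syllables.append(current)
    return syllables

if __name__ == "__main__":
    print(split_syllables("\u09ac\u09be\u0982\u09b2\u09be"))
```

A splitter like this works purely on spelling, so it cannot tell where an inherent vowel is silent, which is one likely source of the errors mentioned above.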
The code can also handle Latin script, but it uses a naive algorithm for finding syllables there.
The code also includes a syllable splitter based on Monojit Choudhury's schwa deletion algorithm.
Currently this splitter is implemented only for Bengali. It also skips punctuation symbols.
The results obtained using this splitter on the Bengali corpus are also provided.
Please note that the implementation may not match Choudhury's algorithm exactly.
Corpus used : Self-compiled. corpus_bengali.txt
Results (Top 1000) :
Text files : words_bengali.txt, letters_bengali.txt, syllables_bengali.txt (using independent/dependent list algorithm), syllables_bengali_new.txt (using Choudhury's schwa deletion algorithm)
Plots :
Words
Letters
Syllables (using independent/dependent list algorithm)
Syllables (using Choudhury's schwa deletion algorithm)
Corpus used : hwiki.txt
Results (Top 1000) :
Text files : words_hindi.txt, letters_hindi.txt, syllables_hindi.txt
Plots :
Words
Letters
Syllables
corpus_bengali.txt
Contains 3 of Rabindranath Tagore's novels : Rajarshi, Chaturanga, Char Adhyay. Size : 1.07 MB
NLP_HW1.zip
Contains source code and required files. Please keep all files in the same directory as the compiled program when running.
USAGE: compiled_exec input_file -o output_folder -l script -e true/false
output_folder must already exist (on Windows the format is C:\\folder\\, on Linux /home/user/folder/)
-e specifies whether to use the Choudhury splitter; the default is true.
script can be latin, bengali or devanagari