CS671 : NLP Assignment 1

Finding Syllables in Indic Languages

This page lists the results obtained by running the code on two texts: one in Bengali, the other in Hindi.
The plots use a logarithmic scale, with the y-axis showing log₁₀(frequency). The number on top of each bar is the actual frequency.
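As an illustration of what the plots show, the following Python sketch counts token frequencies and computes the log₁₀ value used for the bar heights. It is not taken from the submitted code; the function name top_frequencies and its parameters are hypothetical.

```python
import math
from collections import Counter

def top_frequencies(tokens, n=1000):
    """Return the n most frequent tokens as (token, count, log10(count)) tuples.

    Hypothetical helper for illustration; not part of the submitted code.
    """
    counts = Counter(tokens)
    return [(tok, c, math.log10(c)) for tok, c in counts.most_common(n)]

# Small demonstration on a toy token list.
for tok, c, logc in top_frequencies("a b a c a b".split(), n=2):
    print(tok, c, round(logc, 3))
# a 3 0.477
# b 2 0.301
```

The same function would apply to words, letters or syllables, since it only needs a list of tokens.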

The algorithm for finding syllables is similar to the FSA found here.
However, the code uses different character lists: independents (vowels and consonants), dependents (matras and some special characters), and a halant/hasant character.
This algorithm is not very accurate and makes several errors.
The code can also handle Latin script, but it uses a naive algorithm to find syllables there.
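A minimal Python sketch of the independent/dependent/halant idea for Devanagari is given below. It is an assumption about how such a splitter could work, not the submitted implementation: the Unicode ranges, the names INDEPENDENTS, DEPENDENTS, HALANT and split_syllables, and the exact attachment rules are all chosen for illustration.

```python
# Assumed character classes, built from Devanagari Unicode ranges.
HALANT = "\u094D"  # virama (halant/hasant)
INDEPENDENTS = {chr(c) for c in [*range(0x0904, 0x093A),    # vowels and consonants
                                 *range(0x0958, 0x0962)]}   # extra consonants and vowels
DEPENDENTS = {chr(c) for c in [*range(0x0900, 0x0904),      # candrabindu, anusvara, visarga
                               0x093A, 0x093B, 0x093C,      # vowel signs, nukta
                               *range(0x093E, 0x094D)]}     # matras

def split_syllables(text):
    """Split text into orthographic syllables (hypothetical sketch)."""
    syllables, current = [], ""
    for ch in text:
        if ch in INDEPENDENTS:
            # An independent starts a new syllable unless it follows a halant,
            # in which case it joins the conjunct being built.
            if current and not current.endswith(HALANT):
                syllables.append(current)
                current = ""
            current += ch
        elif ch in DEPENDENTS or ch == HALANT:
            # Dependents and the halant attach to the current syllable;
            # a stray one with nothing to attach to is dropped.
            if current:
                current += ch
        else:
            # Spaces, punctuation and other characters end the syllable.
            if current:
                syllables.append(current)
                current = ""
    if current:
        syllables.append(current)
    return syllables

print(split_syllables("नमस्ते"))  # ['न', 'म', 'स्ते']
```

The example output shows the kind of inaccuracy noted above: the split follows the written clusters (न, म, स्ते) rather than the spoken syllables.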

The code also includes a syllable splitter based on Monojit Choudhury's schwa deletion algorithm.
Currently only the Bengali splitter is implemented this way. It also skips punctuation symbols.
The results obtained using this splitter on the Bengali corpus are also provided.
Please note that the implementation may not be exactly Choudhury's algorithm.

Language : Bengali

Corpus used : Self-compiled. corpus_bengali.txt

Results (Top 1000) :
Text files : words_bengali.txt, letters_bengali.txt, syllables_bengali.txt (using independent/dependent list algorithm), syllables_bengali_new.txt (using Choudhury's schwa deletion algorithm)
Plots :
Words

Letters

Syllables (using independent/dependent list algorithm)

Syllables (using Choudhury's schwa deletion algorithm)

Language : Hindi

Corpus used : hwiki.txt

Results (Top 1000) :
Text files : words_hindi.txt, letters_hindi.txt, syllables_hindi.txt
Plots :
Words

Letters

Syllables

Bengali corpus

corpus_bengali.txt
Contains 3 of Rabindranath Tagore's novels : Rajarshi, Chaturanga, Char Adhyay. Size : 1.07 MB

Code

NLP_HW1.zip
Contains source code and required files. Please keep all files in the same directory as the compiled program when running.
USAGE: compiled_exec input_file -o output_folder -l script -e true/false
output_folder must already exist (on Windows the format is C:\\folder\\, on Linux /home/user/folder/)
-e specifies whether to use the Choudhury splitter; the default is true.
script can be latin, bengali or devanagari