We have defined the frequency-time vector as a two-dimensional array which contains the amplitudes of the discrete frequency components of the sound signal over short sampling windows, referred to here as hamming windows (strictly, the Hamming window is one particular tapered window shape; the term is used loosely here for any analysis window). We have taken rectangular windows for simplicity, i.e. every sample in the window is given the same weight. Other window shapes, such as Gaussian or triangular, could also be used. We have taken the window length to be 0.05 seconds, i.e. 400 samples per window, since the sampling rate was 8000 samples/second.
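The frequency-time array a[x][y] for the xth frequency component and the yth hamming window was obtained using the discrete-time Fourier transform formula
a[x][y] = abs(Summation over i (s[i] * exp(j*2*PI*f[x]*i/Sampling_rate)))
where s[i] is the ith sample of the signal, f[x] is the xth frequency, and the summation runs over the samples i lying in the yth window.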
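As an illustration of the formula above, the following is a minimal Python sketch, not the original program. It assumes that each term of the summation is the signal sample multiplied by the complex exponential, and that the windows are rectangular and non-overlapping; the function name frequency_time_array and the choice to drop a trailing partial window are assumptions made for this example only.

```python
import cmath
import math

SAMPLING_RATE = 8000   # samples/second, as stated in the text
WINDOW_LEN = 400       # 0.05 s * 8000 samples/s

def frequency_time_array(signal, freqs, window_len=WINDOW_LEN, rate=SAMPLING_RATE):
    """Return a[x][y]: magnitude of frequency freqs[x] in the y-th window.

    Rectangular window: every sample in the window has weight 1.
    """
    # Assumption: non-overlapping windows, trailing partial window dropped.
    n_windows = len(signal) // window_len
    a = [[0.0] * n_windows for _ in freqs]
    for y in range(n_windows):
        start = y * window_len
        for x, f in enumerate(freqs):
            total = 0j
            for i in range(start, start + window_len):
                # signal sample times the complex exponential at frequency f
                total += signal[i] * cmath.exp(2j * math.pi * f * i / rate)
            a[x][y] = abs(total)
    return a
```

For example, a pure 400 Hz cosine passed through this sketch with freqs = [400, 1000] gives a large value in the 400 Hz row and a value close to zero in the 1000 Hz row for every window.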
The frequency-time vector could be obtained in several ways. It could be of variable length, the length depending upon the time taken by the speaker to speak the word, or it could be of fixed length. For a fixed-length frequency vector, we sample the frequencies of the word at larger intervals of time if the word is longer, and at smaller intervals if it is shorter, so that every word yields the same number of windows.
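One possible way to realise the fixed-length scheme is to spread a fixed number of window start positions evenly over the word, so that the spacing between windows grows with the word length. The sketch below is only an assumption about how this could be done; the helper window_starts and its rounding rule are hypothetical and are not taken from the original program.

```python
def window_starts(word_len, n_windows, window_len=400):
    """Start sample of each of n_windows windows, evenly spread over the word.

    Longer words give larger gaps between consecutive windows; shorter
    words give smaller gaps (the windows may then overlap).
    """
    if word_len < window_len:
        raise ValueError("word is shorter than one window")
    if n_windows == 1:
        return [0]
    span = word_len - window_len   # last admissible start position
    return [round(k * span / (n_windows - 1)) for k in range(n_windows)]
```

With 5 windows of 400 samples, a 4000-sample word is sampled every 900 samples, while a 2000-sample word is sampled every 400 samples; both give a vector of the same length.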
We then obtain the reference frequency vectors by training the data vectors through a neural network (discussed later), and store them in a reference data file, matrix.dat. When the program is run for word recognition, we compute the frequency-time vector of the input word and take its inner product with each of the reference vectors. The output is the word corresponding to the maximum inner product value.
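The matching step can be sketched as follows. This is a minimal illustration which assumes that the frequency-time arrays have already been flattened into plain lists of equal length and that the reference vectors are held in a dictionary; the layout of matrix.dat is not described here, so no file reading is attempted, and the names flatten and recognise are hypothetical.

```python
def flatten(a):
    """Turn a[x][y] into one flat vector, row by row."""
    return [value for row in a for value in row]

def recognise(input_vec, references):
    """Return (word, score) for the reference with the largest inner product.

    references: dict mapping each word to its reference vector
    (assumed to have the same length as input_vec).
    """
    best_word, best_score = None, float("-inf")
    for word, ref in references.items():
        if len(ref) != len(input_vec):
            raise ValueError("vector lengths differ for word: " + word)
        score = sum(p * q for p, q in zip(input_vec, ref))
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score
```

The raw inner product is used here, as in the text; no normalisation of the vectors is applied.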
The results obtained in this way were very unsatisfactory, even for a single speaker ( some observations ), primarily because the initial silence period varies widely when the words are segmented, so the corresponding parts of the words did not coincide in the frequency-time vector.
So, we are taking another approach: find the frequency vectors after
aligning the positions corresponding to the peak amplitude value of the words.
In this way we are getting better results, but our analysis is still
incomplete. Moreover, for some speakers the peak amplitude in the word doesn't
correspond to the same syllable each time the word is spoken, which may
create a problem.
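The peak-alignment idea can be sketched as below. This is an assumed implementation for illustration only: the function align_to_peak, the zero padding, and the choice of a fixed target position for the peak are not taken from the original program.

```python
def align_to_peak(signal, peak_pos, out_len):
    """Return out_len samples with the largest |sample| placed at peak_pos.

    Samples shifted outside the output range are discarded and the
    remaining positions are filled with zeros (an assumption).
    """
    # Index of the sample with the largest absolute amplitude
    # (the first one is used if there is a tie).
    peak = max(range(len(signal)), key=lambda i: abs(signal[i]))
    out = [0.0] * out_len
    for i, s in enumerate(signal):
        j = i - peak + peak_pos
        if 0 <= j < out_len:
            out[j] = s
    return out
```

Two recordings of the same word with different amounts of initial silence then line up sample for sample, provided the peak falls on the same syllable in both, which is exactly the condition noted above as not always holding.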