Learning Algorithm
The learning algorithm used is a modification of the Q-learning
algorithm used for the cart-pole balancing problem (Anderson).
The following changes have been made:
- The number of input lines to the neural net has been changed from
four (cart-pole problem) to 9 (3 for ball coordinates, 3 for ball
velocity and 3 for bat coordinates) in the first approach, and to 12 in
the second approach (3 more than the first approach, for bat velocity);
see the state-encoding sketch after this list.
- The number of Q-values, that is, the number of output lines, has been
changed from 2 in the cart-pole problem to 27. The 27 Q-values correspond
to the 27 elements of the action set. Each element of the action set is
a triple of increments delta_x, delta_y, delta_z applied to the bat
coordinates x, y and z in the first approach, or delta_vx, delta_vy,
delta_vz applied to the x, y and z components of the bat velocity in the
second approach. Each increment can take three values: zero, +delta and
-delta, so we have 3*3*3 = 27 actions in the action set (see the
action-set sketch after this list). The delta value is different for the
x, y and z components; it increases from y to z to x, that is, in the
same order as the dimension of the coordinate increases, the reason
being that the greater the dimension of a coordinate (x, y or z), the
greater the increments required to reach the ball. Initially we tried an
action set of 8 actions (with no zero value), but as expected this
approach gave comparatively poorer results. The zero value turned out to
be especially important in the second approach, to avoid a large
increase in bat velocity.
- The number of nodes in the hidden layer has been changed from 40 in
the cart-pole problem to 70 in the first approach and 90 in the second
approach (see the network sketch after this list).
- The back-propagation parameters have been modified after tests on a
few generated serves.
- The presence of 27 Q-values introduces another problem: how to explore
new moves. In the cart-pole problem each move is chosen with a
probability of 0.5. In our case, at each move we choose the action with
the maximum Q-value with a probability p that increases with time (the
function used is p = 1 - 0.5*lambda^t, where lambda = 0.95), and with
probability 1-p we explore by choosing one of the remaining 26 actions
with equal probability (see the action-selection sketch after this list).
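
The sketches below are illustrative only; the names and parameter values
are assumptions, not taken from the original code. First, a minimal
sketch of the state encoding (9 inputs in the first approach, 12 in the
second):

    import numpy as np

    def encode_state(ball_pos, ball_vel, bat_pos, bat_vel=None):
        # 9 inputs in the first approach: ball position, ball velocity, bat position;
        # 12 inputs in the second approach: the same plus bat velocity.
        parts = [ball_pos, ball_vel, bat_pos]
        if bat_vel is not None:
            parts.append(bat_vel)
        return np.concatenate([np.asarray(p, dtype=float) for p in parts])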
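A sketch of the 27-element action set; the actual delta magnitudes are
not given above, so the values below are placeholders that only respect
the stated ordering delta_y < delta_z < delta_x:

    from itertools import product

    DELTA = {'x': 0.10, 'y': 0.02, 'z': 0.05}  # placeholder per-axis step sizes

    # Each action is a triple of increments; each component is -delta, 0 or
    # +delta, giving 3*3*3 = 27 actions. In the first approach the increments
    # are applied to the bat coordinates, in the second approach to the bat
    # velocity components.
    ACTIONS = [(sx * DELTA['x'], sy * DELTA['y'], sz * DELTA['z'])
               for sx, sy, sz in product((-1, 0, 1), repeat=3)]
    assert len(ACTIONS) == 27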
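A minimal sketch of the Q-network dimensions, assuming a single hidden
layer with a sigmoid activation (the activation and the weight
initialisation are assumptions):

    import numpy as np

    N_INPUTS, N_HIDDEN, N_ACTIONS = 12, 90, 27   # 9 and 70 in the first approach

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS))
    W2 = rng.normal(scale=0.1, size=(N_ACTIONS, N_HIDDEN))

    def q_values(state):
        # Forward pass: one Q-value per action for the given state vector.
        hidden = 1.0 / (1.0 + np.exp(-W1 @ state))
        return W2 @ hidden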
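A sketch of the action-selection rule: with probability
p = 1 - 0.5*lambda^t the greedy action is taken, otherwise one of the
remaining 26 actions is chosen uniformly at random:

    import random

    LAMBDA = 0.95

    def select_action(q_vals, t):
        p = 1.0 - 0.5 * LAMBDA ** t   # exploitation probability grows with t
        greedy = max(range(len(q_vals)), key=lambda i: q_vals[i])
        if random.random() < p:
            return greedy
        # Explore: pick uniformly among the other 26 actions.
        others = [i for i in range(len(q_vals)) if i != greedy]
        return random.choice(others)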