Learning Algorithm
The learning algorithm used is a modification of the Q-learning
algorithm used for the cart-pole balancing problem (Anderson).
The following changes have been made:
- The number of input lines to the neural net has been changed from
four (cart-pole problem) to 9 (3 for ball coordinates, 3 for ball
velocity and 3 for bat coordinates) in the first approach, and to 12 in
the second approach (3 more than the first approach, for bat velocity);
see the state-encoding sketch after this list.
- The number of Q-values, that is, the number of output lines, has been
changed from 2 in the cart-pole problem to 27. The 27 Q-values correspond
to the 27 elements of the action set. Each element of the action set is
a triple of increments delta_x, delta_y, delta_z applied to the bat
coordinates x, y and z in the first approach, or delta_vx, delta_vy,
delta_vz applied to the x, y and z components of the bat velocity in the
second approach. Each increment can take three values: zero, +delta and
-delta, so we have 3*3*3 = 27 actions in the action set (see the
action-set sketch after this list). The delta value is different for the
x, y and z components; it increases from y to z to x, that is, in the
same order as the dimension of the coordinate increases, the reason
being that the greater the dimension of a coordinate (x, y or z), the
greater the increments required to reach the ball. Initially we tried an
action set of 8 actions (with no zero value), but as expected this
approach gave comparatively poorer results. The zero value turned out to
be especially important in the second approach, to avoid a large
increase in bat velocity.
- The number of nodes in the hidden layer has been changed from 40 in
the cart-pole problem to 70 in the first approach and 90 in the second
approach (see the network sketch after this list).
- The back-propagation parameters have been modified after tests on a
few generated serves.
- The presence of 27 Q-values introduces another problem: how to explore
new moves. In the cart-pole problem each move is chosen with a
probability of 0.5. In our case, at each move we choose the action with
the maximum Q-value with a probability p that increases with time (the
function used is p = 1 - 0.5*lambda^t, where lambda = 0.95), and with
probability 1-p we explore by choosing one of the remaining 26 actions
with equal probability (see the action-selection sketch after this list).
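
The sketches below are illustrative only; the names and parameter values
are assumptions, not taken from the original code. First, a minimal
sketch of the state encoding (9 inputs in the first approach, 12 in the
second):

    import numpy as np

    def encode_state(ball_pos, ball_vel, bat_pos, bat_vel=None):
        # 9 inputs in the first approach: ball position, ball velocity, bat position;
        # 12 inputs in the second approach: the same plus bat velocity.
        parts = [ball_pos, ball_vel, bat_pos]
        if bat_vel is not None:
            parts.append(bat_vel)
        return np.concatenate([np.asarray(p, dtype=float) for p in parts])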
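A sketch of the 27-element action set; the actual delta magnitudes are
not given above, so the values below are placeholders that only respect
the stated ordering delta_y < delta_z < delta_x:

    from itertools import product

    DELTA = {'x': 0.10, 'y': 0.02, 'z': 0.05}  # placeholder per-axis step sizes

    # Each action is a triple of increments; each component is -delta, 0 or
    # +delta, giving 3*3*3 = 27 actions. In the first approach the increments
    # are applied to the bat coordinates, in the second approach to the bat
    # velocity components.
    ACTIONS = [(sx * DELTA['x'], sy * DELTA['y'], sz * DELTA['z'])
               for sx, sy, sz in product((-1, 0, 1), repeat=3)]
    assert len(ACTIONS) == 27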
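A minimal sketch of the Q-network dimensions, assuming a single hidden
layer with a sigmoid activation (the activation and the weight
initialisation are assumptions):

    import numpy as np

    N_INPUTS, N_HIDDEN, N_ACTIONS = 12, 90, 27   # 9 and 70 in the first approach

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS))
    W2 = rng.normal(scale=0.1, size=(N_ACTIONS, N_HIDDEN))

    def q_values(state):
        # Forward pass: one Q-value per action for the given state vector.
        hidden = 1.0 / (1.0 + np.exp(-W1 @ state))
        return W2 @ hidden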
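A sketch of the action-selection rule: with probability
p = 1 - 0.5*lambda^t the greedy action is taken, otherwise one of the
remaining 26 actions is chosen uniformly at random:

    import random

    LAMBDA = 0.95

    def select_action(q_vals, t):
        p = 1.0 - 0.5 * LAMBDA ** t   # exploitation probability grows with t
        greedy = max(range(len(q_vals)), key=lambda i: q_vals[i])
        if random.random() < p:
            return greedy
        # Explore: pick uniformly among the other 26 actions.
        others = [i for i in range(len(q_vals)) if i != greedy]
        return random.choice(others)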