This phase involves describing the environment and the static and dynamic objects present in the system, namely the ball, the table and the net. The physics of the real game is simulated to meet the requirements of a real-time virtual environment. These include:
The Trajectory of the ball
The ball has fixed dimensions and a constant weight. Its motion is governed by the equations of projectile motion under gravity, with an air-drag term characterised by AIR_DRAG_CONST. The value of AIR_DRAG_CONST for the TT ball is 0.07 to 0.15.
The simulator calculates the trajectory of the ball by updating the state of the ball (coordinates & velocity) using these equations after a fixed time interval, which we call TIMESTEP.
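A minimal sketch of this update loop is given below. It assumes a simple Euler integration step, a drag acceleration proportional to speed times velocity, and a z-up coordinate frame; the exact force law, constants and TIMESTEP value used by the simulator are not specified in the text, so these are illustrative assumptions.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # m/s^2; z taken as the vertical axis (assumption)
AIR_DRAG_CONST = 0.1                    # within the 0.07 - 0.15 range quoted above
TIMESTEP = 0.01                         # fixed update interval in seconds (assumed value)

def update_ball(position, velocity):
    """Advance the ball state (coordinates & velocity) by one TIMESTEP (Euler step)."""
    speed = np.linalg.norm(velocity)
    drag_accel = -AIR_DRAG_CONST * speed * velocity  # drag opposes the motion (assumed quadratic law)
    acceleration = GRAVITY + drag_accel
    new_position = position + velocity * TIMESTEP
    new_velocity = velocity + acceleration * TIMESTEP
    return new_position, new_velocity
```

The simulator would call such an update once per TIMESTEP until the ball reaches the table, the net, or the bat.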
Design of the Graphics Interface
Performance measure for a shot: To train the algorithm for a specific shot, for instance hitting the ball deep or along the lines, the reward function for the Q-learning algorithm can be defined over the 2-D space of the table.
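As an illustration (not necessarily the reward actually used in this work), a shot-specific reward defined over the landing point on the table could be written as follows; the table dimensions are standard, but the "deep" region boundary and the coordinate convention are assumptions.

```python
TABLE_LENGTH = 2.74   # metres (standard table-tennis table)
TABLE_WIDTH = 1.525   # metres

def deep_shot_reward(landing_x, landing_y):
    """Reward a ball that lands deep on the opponent's side.

    landing_x: distance along the table measured from the agent's end (assumption)
    landing_y: position across the width of the table (assumption)
    """
    on_table = 0.0 <= landing_x <= TABLE_LENGTH and 0.0 <= landing_y <= TABLE_WIDTH
    if not on_table:
        return -1.0                       # missed the table entirely
    if landing_x > 0.75 * TABLE_LENGTH:   # "deep" region near the opponent's baseline
        return +1.0
    return 0.0
```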
The Learning Algorithm Used
The State Decomposition Approach
Underlying Assumptions: We assume that the velocity vector and the coordinates of the bat are independent of each other. Under this assumption we can decompose the problem of determining the coordinates and the velocity of the bat into two individual, unrelated parts, i.e. the task is sub-divided into two subtasks and the state space is decomposed accordingly.
First Part : Reaching the ball
Basic Idea
- Bat velocity is not a concern in this approach
- Input at each step to the agent is the ball velocity, ball position & bat position
- Output at each step is increments in the bat coordinates
S = ( X^b, Y^b, Z^b, V^b_x, V^b_y, V^b_z, X^r, Y^r, Z^r )
A(S) = { ( ΔX^r, ΔY^r, ΔZ^r ) }
S' = ( X'^b, Y'^b, Z'^b, V'^b_x, V'^b_y, V'^b_z, X^r*, Y^r*, Z^r* )
where,
S is the problem state
A is the action set for the state S
X, Y, Z are coordinates
V is velocity
The superscripts b and r refer to the ball and racket respectively
The subscripts denote the components
Terms superscripted with ' are determined by the simulator
Terms superscripted with * are determined by the agent action
Each State is Characterised by:
- Ball position
- Ball velocity
- Bat position
Action Set
- Increments in bat coordinates (3 discrete values for each of the 3 components => 3^3 = 27 actions)
- Discrete values for the x coordinate: -0.5, 0, +0.5
- Discrete values for the y coordinate: -0.2, 0, +0.2
- Discrete values for the z coordinate: -0.3, 0, +0.3
Reward Function
R(s) = +1 for intercepting the ball
       -1 for missing the ball
        0 for other states
(a small code sketch of this action set and reward follows below)
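A small sketch of how the 27-action set and this reward could be encoded is shown here; the names are illustrative and not taken from the original implementation.

```python
from itertools import product

# 3 discrete increments per coordinate -> 3^3 = 27 actions
X_INCREMENTS = (-0.5, 0.0, +0.5)
Y_INCREMENTS = (-0.2, 0.0, +0.2)
Z_INCREMENTS = (-0.3, 0.0, +0.3)
ACTIONS = list(product(X_INCREMENTS, Y_INCREMENTS, Z_INCREMENTS))  # 27 (dx, dy, dz) tuples

def reward(intercepted, missed):
    """R(s): +1 for intercepting the ball, -1 for missing it, 0 for all other states."""
    if intercepted:
        return +1.0
    if missed:
        return -1.0
    return 0.0
```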
Second Part: Determining the Shot to Play after Intercepting the Ball
Salient Features:
- Supervised Learning using neural nets
- A Separate Program for generating a training set
- Use of back-propagation algorithm & neural net learning
Details of the MLP (Multi-Layer Perceptron) Neural Net Used
NOTE: The model has been kept simple:
- 6 input lines (3 for the coordinates of the hit point + 3 for the ball velocity)
- 3 output lines (these give the ball velocity components after the collision directly)
Actually, what should have been done was to output either
- the velocity of the bat while in contact with the ball, or
- the force applied on the bat while in contact with the ball
and then calculate the return velocity from it using the laws of dynamics.
The size of the training set is 4000 records, i.e. input-output pairs.
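A minimal sketch of such a 6-input, 3-output MLP trained with back-propagation on squared error is given below, written in plain NumPy; the hidden-layer size, activation function and learning rate are assumptions, since the text does not state them.

```python
import numpy as np

rng = np.random.default_rng(0)

N_IN, N_HIDDEN, N_OUT = 6, 20, 3   # hidden size is an assumption; the source does not give it
LEARNING_RATE = 0.01               # assumed step size

W1 = rng.normal(0.0, 0.1, (N_IN, N_HIDDEN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.1, (N_HIDDEN, N_OUT)); b2 = np.zeros(N_OUT)

def forward(x):
    h = np.tanh(x @ W1 + b1)           # hidden activations
    return h, h @ W2 + b2              # linear outputs: post-collision ball velocity

def train_step(x, target):
    """One back-propagation step on a single (input, output) training record."""
    global W1, b1, W2, b2
    h, y = forward(x)
    err = y - target                   # gradient of 0.5 * squared error at the output
    dW2, db2 = np.outer(h, err), err
    dh = (err @ W2.T) * (1.0 - h ** 2) # back-propagate through tanh
    dW1, db1 = np.outer(x, dh), dh
    W1 -= LEARNING_RATE * dW1; b1 -= LEARNING_RATE * db1
    W2 -= LEARNING_RATE * dW2; b2 -= LEARNING_RATE * db2

# Example record: hit-point coordinates + incoming ball velocity -> return ball velocity
x = np.array([1.2, 0.4, 0.3, -2.0, 0.1, 1.5])
target = np.array([2.5, -0.2, 1.0])
train_step(x, target)
```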
Each state is characterised by the coordinates of the hit point and the ball velocity (the six input lines above), using the same notation as defined earlier: superscripts b and r for the ball and racket, subscripts for the components, ' for terms determined by the simulator, and * for terms determined by the agent action.
However, the results obtained using such a complex reward function were not very satisfactory.
Q-Learning Network and Algorithm
The Neural Net (Multi-Layer Perceptron) in this case has:
- 12 input lines
- 90 nodes in the hidden layer
- 27 output lines
- Back-propagation of the TD-error, as explained (a sketch of this update follows the list below)
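A compact sketch of back-propagating the TD error through this 12-90-27 network is shown here; the discount factor, learning rate, activation function and the one-output-line-per-action convention are assumptions not fixed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)

N_IN, N_HIDDEN, N_OUT = 12, 90, 27     # 12 state inputs, 90 hidden nodes, 27 Q-values (one per action)
GAMMA, LEARNING_RATE = 0.9, 0.01       # assumed discount factor and step size

W1 = rng.normal(0.0, 0.05, (N_IN, N_HIDDEN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0.0, 0.05, (N_HIDDEN, N_OUT)); b2 = np.zeros(N_OUT)

def q_values(state):
    h = np.tanh(state @ W1 + b1)
    return h, h @ W2 + b2              # one Q-value per output line / action

def td_update(state, action, reward, next_state, terminal):
    """Back-propagate the TD error on the output line of the chosen action only."""
    global W1, b1, W2, b2
    h, q = q_values(state)
    target = reward if terminal else reward + GAMMA * np.max(q_values(next_state)[1])
    td_error = q[action] - target
    grad_q = np.zeros(N_OUT); grad_q[action] = td_error
    dW2, db2 = np.outer(h, grad_q), grad_q
    dh = (grad_q @ W2.T) * (1.0 - h ** 2)
    dW1, db1 = np.outer(state, dh), dh
    W1 -= LEARNING_RATE * dW1; b1 -= LEARNING_RATE * db1
    W2 -= LEARNING_RATE * dW2; b2 -= LEARNING_RATE * db2
```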