Different Phases in our project :

Building the 3D simulator
Learning Racquet moves and shots ( use of Reinforcement Learning and & Neural Networks )

Design a graphics interface to display the results.

Building the 3D Simulator

This phase involves describing the environment and the static and dynamic objects present in the system namely the ball , the table and the net. The physics of the real game are simulated to adapt the requirements of a real-time virtual environment. These include

The trajectory of the ball , under the effect of gravity and air friction,
The bounce on the table, which is a case of inelastic collision
The modelling of the collision between the bat and the ball

Finally deciding on how the simulator interacts with the agents - the racquets.

The Trajectory of the ball

The ball has fixed dimensions and a constant weight. The motion is governed by

the equations of motion of bodies under the effect of gravity

the effect of air friction.

ball.accl.x = - AIR_DRAG_CONST * ball.speed * ball.velocity.x

ball.accl.y = - AIR_DRAG_CONST * ball.speed * ball.velocity.y

ball.accl.z = - AIR_DRAG_CONST * ball.speed * ball.velocity.z

The value of AIR_DRAG_CONST for the TT- Ball = 0.07 - 0.15 .

       The simulator calculates the trajectory of the ball by updating the state of the ball
        ( coordinates & velocity ) using these equations after a fixed time interval which
        we call TIMESTEP .

NOTE

Value of TIMESTEP used in the simulator = 0.10 seconds
Note that we are neglecting spin.

The modelling of collision when bat hits ball

The physics involved has been kept simple .
Equations of conservation of momentum is used
The collision here is inelastic

An alternative model for the bat hitting the ball

Generating a force for the bat instead of a velocity
Calculating the resulting kinematics using Newtons's Laws of Motion

The interaction between the simulator and the agent

The simulator hides all details from the agents except the state of the ball and that of the bat.
Similarly the agent selects an action and informs the simulator , which implements the action. The agent itself does not have to worry about the implementation.
The agent can be replaced without modifying the simulator.

Generating coordinates at each time step : At present the simulator generates the coordinates - taking into account the physics involved, also to account for the errors in real-time tracking using vision, we introduce an error in coordinates at each time step. To corelate with human responses, the error function must decrease as the ball approaches the player

Design of the Graphics Interface

Created in OpenGL
Viewpoint provided from different directions
This phase is once again part of the simulator independent from agent.
The figure below shows the 3D view of the table with lighting enabled.
However the lighting has been disabled since it reduces the speed.

Dynamic components of the simulator

The bats are the agents - learning their moves. Complications like top-spin and sliced shots have been dropped from our model with one simplifying assumption - the normal to the face of the bat is the same as the velocity vector direction.
Tables, ground surfaces and the net: These components determine the performance measure of the shot. A shot which hits the net, lands on the ground or lands on the player's side after hitting the bat is a bad shot.

Learning Racquet Moves and Shots

(Use of Reinforcement Learning & Neural Nets)

We chose to try out two different approaches since initially we were not sure which would produce better results
The two approaches use the same reinforcement learning algorithm but models the problem and the state space differently as described below.

Reinforcement Learning Algorithm :

A Q-learning algorithm with error-back propagation using neural nets is used for reinforcement learning in both the approaches . The essence of Q-learning is the learning and use of a Q-function Q(x,a) that is a prediction of a weighted sum of future reinforcements given that action a is taken when the system is in the state represented by "x".

The state of the problem can be described in terms of some state variables . At each time-step the simulator generates coordinates and velocity for the ball and the problem boils down to determining an action which may/may not change state variables - so that at the point of intercept the racket has same coordinates as the ball and the desired parameters to hit a good shot. The performance measure for a good shot determines the training of the algorithm.

Performance measure for a shot : To train the algorithm for a specific shots. for instance , hitting the ball deep, along the lines - the reward function for Q-learning algorithm can be defined over the 2-d space of the table .

A Possible Reward Function R(x,z)

The Learning Algorithm Used

The State Decomposition Approach

Underlying Assumptions : We assume that the velocity vector and coordinates for the bat are independent of each other - under this assumption we can decompose the problem of determing coordinates and velocity of the bat into two individual and unrelated parts.
Sub-division of task into 2 - subtasks and hence the decomposition of state space

First Part : Reaching the ball
Basic Idea

Bat velocity not a concern in this approach

Input at each step to agent is ball velocity, ball position & bat positon

Output at each step is increments in bat coordinates .

where,
S is the problem state
A is the action set for the state S.
X ,Y,Z are coordinates
V is velocity
The superscripts b and r refer to the ball and racket respectively.
The subscripts denote the components
terms supercripted with ' are determined by the simulator
terms superscripted   by * are determined by the agent action

Each State is Characterised by :

Ball position

Ball velocity

Bat position

Action Set

Increments in bat coordinates ( 3 values for 3 components => 27 actions)

Discrete values for x coordinate -0.5 , 0 , + 0.5

Discrete values for y coordinate -0.2 , 0 , +0.2

Discrete values for z coordinate -0.3, 0 , +0.3

Reward Function

R(s) = + 1     for intercepting the ball
- 1     for missing the ball
    0     for other states

Second Part: Determining Shot to play after intersecting ball

Salient Features:

Supervised Learning using neural nets

A Separate Program for generating a training set

Use of back-propagation algorithm & neural net learning

Details of the MLP (Multi Layer Perceptron) Neural Net used

6 input lines ( 3 for coordinates of the hit point + 3 for ball velocity )

3 output lines ( these give ball velocities after the collision directly)

NOTE: The model has been kept simple :
Actually what should have been done was to output either

    velocity of bat while contact with ball    or

    force applied on bat while contact with ball

and then calculate the return velocity from it using the law of dynamics.
The size of the training sets is 4000 records ie. input output paiirs

Unified Approach

Salient Features:

No state space decomposition , modelling is not modular
The same reinforcement learning algorithm used
State-space description , action set , reward function and performance measure of shots different.

In this case at each timestep the action set comprises of increments in the velocity components, which using the coordinates of the bat in the previous state determines the next state.

where,
S is the problem state
A is the action set for the state S.
X ,Y,Z are coordinates
V is velocity
The superscripts b and r refer to the ball and racket respectively.
The subscripts denote the components
terms supercripted with ' are determined by the simulator
terms superscripted by * are determined by the agent action

Each state is characterised by -

ball position,
ball velocity,
bat position ,
bat velocity.

The action set is -

Increments in bat velocity ( 3 components) :
Discrete values : -1.0 , 0.0 , +1.0

Reward Function -

R(s) = -1 for missing the ball altogether

= 1 for intercepting ball successfully

= 2 for intercepting ball and also returning it correctly

We experimented with different reward functions where we changed the relative values 1, 2 etc as mentioned above
We also tried a reward function r(x,z) when the ball is returned and (x,z) is the point where the ball lands on the table (see fig.) A simple example is

Train to hit to corner of table -

In that case

r(x,z) = 10 for (x,z) within corner square( 1/8th table area approx

= 5 for (x,z) within rest of the table top

However results obtained using such complex reward function were not
very satisfactory.

Q-Learning Network and Algorithm

As explained previously the Neural Net is used to approximate the Q(s,a) value function. Actions with the highest Q-value are chosen with a probability that increases with time.

The Neural Net (Multi Layer Perceptron) in this case has

12 input lines

90 nodes in hidden layer

27 output lines

Back - Propagation of TD-error as explained.

Back to top
Back to index