SUPERVISED LEARNING OF OPTIMAL GOAL SCORING BEHAVIOURS

FOR A SIMULATED SOCCER ENVIRONMENT

Sreangsu Acharyya 9810540

Indian Institute of Technology - Kanpur : August 1998

Introduction
Example: Some-Sample-Task
Past Work
Methodology (etc.)
Sample Input-Output
Bibliography

Introduction

Informal

Consider that you are a foot baller. A pass has just been sent your way ,so you have to decide quickly what to do next. i.e whether to

shoot to goal
Pass to another team member
dribble toward goal

etc. But to do all these things you need knowledge about these actions and should have learnt to execute them in a hostile noisy environment.

Consider the utopian case of a static ball before an untended goal :

If the environment is strictly deterministic you can aim anywhere between the goal posts . But real worlds are noisy and hostile so you have to select an aiming point such that the probability of scoring the goal is maximum considering the uncertainity in your ability of directing the ball exactly and those defenders out their to stop you etc.

The problem is not only to get the best shooting solution because their are other action options.So one needs to consider the best of all these actions and select the one which is the most relevant. So the problem has an outer and an inner loop

The problem can be approached in several ways.

One is to have a thorough domain knowledge and reason the moves online from scratch.
Second assume no domain knowledge and learn through reinforcements.
Listen to your coach and learn from him.

The selection of the method has to be based on the relaiability and the ease of hand coding the domain knowledge. If one does have a substantial knowledge about the domain reinforcement learning will spend a lot of time reinventing the wheel .While on the other hand it is dangerous to keep faith on faulty domain knowledge or wrong coaching.

Past Work

A substantial volume of work on this field emanates from the CMUnited robosoccer team . There arhitecture is based on Peter Stone and Manuela Veloso's approach of layered learning.

They have reported using simple feed -forward neural networks for learning shooting to an un-tended goal . Where the aiming point within the goal was fixed apriori. [1].

They have also used memory based learning techniques to learn shooting behaviour in presence of a goalie that revolves on a circle of a preassigned radius with a fixed velocity in front of the goal , goal scoring accuracy obtained was of the order of 60%.[2]

Decision tree learning algorithms have also been used to estimate the probability of succes of a pass to a tem member. This information then can be fed forward for strategiic planning. [3]

The CMUnited team is currently working on a re-inforcement learning model that can be applied to hidden state or opaque markov chain models.[4]

METHODOLOGY

The approach will be following Stone & Veloso's paradigm of layered learning , in which strategies are developed in a bottom up hierarchy of increasingly complex behaviours.

At the ground level will be the most basic: like the locomotive behaviours.

Turn to \theta
Move to (x,y)

These will be used as building blocks to generate more informative behaviours forming the mid level

Shoot to goal
Pass to team member
Dribble towards goal
Block ball or opponent
Intercept ball
Occupy some strategic location

Note : It would be interesting to see if such complex behaviour can be discovered autonomusly from the building blocks .

The still higher level will include

High level planning
Cooperative division of labour
conflict resolution
Dynamically changing strategies etc.
Opponent modelling & outguessing.

etc

2 Methodologies of imparting Behaviours

Reinforcement learning from "0" domain knowledge.
Hard coding an assumed domain

Considering the present problem the implementation should be somewhere in the continuum spanned by these 2 extremes.Learning can take account of obscure domain knowledge or a changing environment but it will be slow.
Hard coded strategies assume some domain knowledge thus are not robust in a changing environment.

The solution will be to incorporate both depending upon heir applicability
To impart a substantial amount of domain knowledge that one is confident about and is not expected to change and leave the rest to learning. So as to minimise re-inventing the wheel and still get an effective system.

CONFIGURATION OR STATE SPACE

X, Y ,\THETA are the externally supplied variables internally we may need to keep internal representation as
X , Y, \THETA ,V_{x ,}V_Y for each player and ball in addition we may have a boolean variable signifying which team has posession.

ABSTRACTION QUANTIZATION AND

THE CURSE OF DIMENSIONALITY

To facillitate analysis and reduce the search space we need to quantise the search space.
If Number of position states( x, y , \theta) = P
      Number of velocity states =                       V
Asuming one position state can be occupied by a single player the state space size will be the order of
p(p-1)(p-2)...(p-2n) v²ⁿ *2     [where n is the size of each team ]
Which no doubt is a huge workspace and the curse of dimensionality is very real.
REDUCTION:
    i ) Through fuzzification of the variables like
                                        penalty , forward , midfield
                                        left flank right flank , centre.
    ii)    Consider symmetry i.e tranlation and rotation invariant variables and rules. like representing position as a relative to position of other players and goal,using abstracted informaton like direction{North, East,West , South} & accesibility window{broad,Mid , Narrow} .etc.

CHOOSING THE BEHAVIOUR

        This requires 2 tiered optimisation.i.e.

An inner level optimisation that generates the best parameters for a fixed behaviour. Like selecting the best shooting angle .This inner loop has to be run for all the available behaviours to generate their individual best parameters .

Make the most intelligent chice from among those optimised strategies,from the outer level.

PROBABILISTIC BASIS OF DECISION

Let   C = Current state or configuration.
Action i = A_i

_{i) COARSE
MODEL}

_{P(success/C)=\sum P(sucess/Ai
).P(Ai
/C)}

Where   _{P(sucess/Ai)
approximately
represents the skill of an agent as it is the probability that the action
i will be executed succesfully.}
_{On the other hand   P(Ai
/C)
approximately
represents the intelligence of the agent as it is the probability
that the action Ai will be selected in the state c. The aim of the
project would be to learn an apprppriate P(Ai
/C)}
Before presenting the objective function 2 types of failures on the soccer field needs to be distinguished.
i) Losing control to the opponent team F1
ii) Losing control but not to the opponent team i.e no man's land.F2
Obviously F1 is more damaging than F2
Thus we can define

P(win/C) = [ \alpha. _{P(success/C)
+ \beta . P(F2/C) ] -\alpha}_oppnent[_P(success_{opponent/Ai
)}
where
\alpha   is the gross probability of winning when in control.
\beta      is the probability of winning when ball goes into no mans land(this will be very small)

The above will form the objective function to be maximised with _{P(Ai /C) as
the decision variable.}
ii) More Detailed Model

Here we introduce another term
P(w/A_i ,C ) =   Probability of win if action Ai is executed succesfully in the state C.This will bring bout relative importance of different actions at a given configuration.
For instance P(w/shoot to goal,C) is expected to be very near 1
Now
P(win/C) =\sum P(sucess/Ai )_.P(w/A_i ,C ) ._P(A_{i /C)}

_{The above expression will serve as our new objective
function to be optimised with respect to P(Ai
/C)}

RE-DEFINITION OF THE PROJECT
IN A MORE TANGIBLE FORM

In this project it is   intended to train a neural network so that it returns the optimum parameters for a particular behaviour. In effect it is the inner layer optimisation. Online optimisation is ruled out with the present resources as the values have to be generated in real time.Thus an algorithm for numerical optimisation will train the network offline.It is hoped that the network will generalise sufficiently well to provide the optimal values for given configuration.
It is also intended that the network returns the value of the optimised objective function.
Thus the network will be learning P(sucess/Ai )_{function as a supervised learning
the numerival optimisation module acting as the coach or the teacher.}

THE MATHEMATICAL MODEL OF THE SOCCER FIELD

FOR SHOOTING BEHAVIOUR

The system model that is being considered adds uniformly distributed noise with exponentially decreasing magnitude to the ball and player motions. These are guided by the following equations.

W here r is the added noise.

The amount of acceleration a player can impart to the ball is limited by his stamina. The velocity of the ball is the vector summation of a decayed fraction of the past corrupted velocity with an additive noise whose range is proportional to the magnitude of the velocity. The new positiion after unit time step is determined by incrementing the past position with the current velocity.

Since the nature of the noise that is being added to the horizontal and vertical components of the velocities are known it is possible to deduce analytically the probability distribution of the angle \theta . Knowing this distribution it is possible to estimate the probability that the shot remains within the goal boundary by integrating it within appropriate limits. This can be understood from the foolowing diagram.

Real world has another complication ;that of defenders trying to stop the ball from entering the goal. This the probability of goal can be defined as

P(Goal) = P(shot remains within goal) && P(the ball un-intercepted).

From the formulation of the angle made with the x axis we get

The first part of the above equation has already been derived as.

Now one has to model the P(ball un-intercepted) term. Because of the complicated model it is not faesible to model the velocity distribution of the players because of there highly nonlinear nature and the abscence of an applicable inverse function. So for this project simulation techniques were used to model the second term.

The governing equations are : for the path along the shot

For the ball to remain unintercepted the time taken by the defender

has to be greater than the time taken by the ball.

One can combine these 2 probabilities to arrive at the probability of goal given the configuration of the field and the shooting angle.

Some of the plots obtained through this modelling are shown.

/TO VIEW SOME OF THE PLOTS GENERATED.

RADIAL BASIS FUNCTION NEURAL NETWORKS AND FUZZY LOGIC

The topology of the network is as shown.

Radial basis network formulation forms an alternative to modelling with feed-forward sigmoid neural networks. The network consits of a single hidden layer of neurons having radial activation functions . Here Gaussian function is used. The output of the units are given as

One can note this is same as a fuzzy rule base system using gaussian membership functions to model the individual components of the vector input.. Thus this property of radial basis neural networks enables us to understand the encoded rules.

Here the outputs are calculated as

The final out put is obtained as

DETERMINATION OF FREE PARAMETERS

The frre parameters that are to be determined are the

centres C
widths \sigma and
The output weights.

In terms of fuzzy logic paradigm it amounts to the optimal placement and shaping of the gaussian membership functions and the de fuzzified consequent values.

Detrmination Of centres:
The centres of the radial basis units can be obtained either by supervised techniques ( gradient decent) or by unsupervised techniques (clustering). Though supervised learning is expected to result in further minimisation
of error here only unsupervised techniques are being used as they are much faster .

Unsupervised learning can also be divided into 2 categories

Hard Competitive learning and
Soft Competitive learning.

HARD COMPETITIVE LEARNING:

Hard competitive learning essentially involves clustering of data vectors into optimum number of classes and finding a prototype to represent those clusters.

The algorithm is same as that of K-means Clustering , usually only the input vectors are clustered ;however in this implementation an extended vector containing all the input components and some viatal output components have also been included to avoid the possibility of the algorithm to return sets well clustered in the input space but poorly clustered in the output space. Moreover to account for the fact that the means of the individual components are differrnt by orders of magnitude they have been apprpriately scaled.
The algorithm can be described as.

Initialise n centres c_{i i = 1
to n}
Find the voronoi tesselation using those centres.
find the mean of those voronoi cells
update the centres to the means of the voronoi centres plus an exponentially decreasing noise.
iterate till convergence.

The added noise is initially high and allows one to escape local mimima and is similar to the simulated annealing approach.

SOFT COMPETITIVE LEARNING:

Unlike hard competitive learning here a training instance may be assigned to different centres with varying degree of membership depending on the distance from th proposed centers. The membership values can range from [0 1]

The algorithm can be summarised as

Initialise n centres
consider an input and find its distance from the proposed centres.

Find the membership of the trainiong set to each of the centres.
update the centres using a weighted mean over all the training instances.

It has been found that hard competitive learning often results in empty clusters, soft clustering is largely immune to this problem but suffers from the problem of repeated or very close centers which poses problems for determining the weight values of the output nodes.

DETERMINATION OF WIDTH SIGMA:

The sigma parameter or the width of the radial basis units may also be determined by supervised learning. But here they have been determined heuristically using two rules. Later those estimates wee improved by supervised learning or gradient decent.
The rules followed for determining the sigma values are

sigma values are fixed as some proportion of the average width of the voronoi cells.
Fixed as some multiilication factor times the nearest neighbour distance.
Updated by gradient decent.

It was found the 1st rule is better suited for hard clustering whereas the second gave better results with soft clustering.

DETERMINATION OF WEIGHTS

The weights are obtained by the minimisationof the squred error with respect to the training set. The minimisation is done by equating the derivative of the error to 0. Though gradient decent may be used for the minimisation this can be solved analytically as it results in a system of linear equations.

GENERALISATION ERROR:

For good results it is not enough to minimise the error for the training set only. The error should be llow even for cases not contained by the training set. This is the ability of the network to generalise.

This can be implemented by splitting up the data set into training and validation sets. The obvious drawback being that the entire data set cannot be utilised as some of the data has to be sacrificed for validation.

There are other methods by which one can utilise the entire training set and still ensure a good degree of generalisation. eg

leave P out cross validation : In this the entire training set is divided into N-P training data and P validation data in all possible combinations and the summed error is then minimised.
Generalised cross-validation : Since leave P out cross validation is difficult to handle computationally Generalised cross validation is more popular as an established measure of generalisation error and was used for the current term project.

Input and Output

The project consists of two parts

Generation of optimal shooting solutions from off line calculations
Using the above module to generate a training set for a Radial basis neural network whose free parameters are tob determined.

The input to the first part will consist of the current configuartion of th soccer field. Here one can have atmost 2 defenders and a golie.
The configuration vector contains the distance from the centre of the goal and the angle made with the centreline for each player. It also needs input about the shooting velocity and the velocity of the defenders and the parameters of the associated noise model.

Using these inputs the code trainx2.c can generate the probability of scoring a goal for different shooting angles; and thus identify the optimal shooting angle.

Some of the outputs generated can be seen HERE

The second part of the project is devoted to the finding of the free parameters of the radial basis network . It includes

The determination of the centres of the radial basis units
Determining the sigma or the width parameters of the radial basis units
Determination of the output weights.

The input vector to the network is of the form

Ds,Dg,Dd1,Dd2,Theta_s,Theta_g,Theta_d1,Theta_d2.

RESULTS OBTAINED ,CONCLUSIONS & FUTURE EXTENSION

It can be seen that the dimension of the input vector is eight. Thus even with two fuzzy linguistic variable attached to these variables number of rules necessary is quite high i.e 2^8 .

Till now networks with 10 ~ 25 radial basis units have been tried and the error has been reduced to the order of 0.3. Thus it can be seen that with considerably less number of rules the error over the entire training set has been reduced.

However 0.3 is still not adequate for precise control of the shooting trajectory as envisioned during prposition of the project. Thus the current trained network can be used to identify 3 broad zones within the goal post where to shoot and not the exact optimal angle.

Since only heuristic approaches where taken to locate the centres it is expected that the performance will be better if those were selected through optimisation procedures. Gentic Algorithms being an obvious choice which will allow similar centres to compete with each other and the dissimilar to co-operate thereby effectively mapping the input and the output space.

Once an aceptable error margin has been crossed this network itself can then be used to generate the optimal defensive strategies.
That will be possible because the nework model being continuous and differentiable one can consider the free parameters to be fixed and differentiate the output with respect to the goalie and defender location and equate the deriavative to zero to obtain the locations of the goalie and the defenders. Thus the network will be used as a domain knowledge. One advantage of this will be the resolution of conflict between the defenders so that each of them do not run to the same spot.

1. Peter Stone, Manuela Veloso, Patrick Riley "Learning low-level Multi agent behaviours" available at http://www.cs.cmu.edu/afs/cs/usr/pstone/public/papers/

2. Peter Stone & Manuela Veloso "A layered approach to Learning Client Behaviours in Robocup soccer" AAI vol 12 -1998
available at http://www.cs.cmu.edu/afs/cs/usr/pstone/public/papers/

3. Peter Stone & Manuela Veloso " Beating a Defender in a Robotic soccer: Memory based learning of continuous functions" Neural Information Processing Systems -1995, avilable at http://www.cs.cmu.edu/afs/cs/usr/pstone/public/papers/

4. Peter Stone & Manuela Veloo " Using Decision Tree confidence Factors for Multi-agent control" In "RoboCup-97: The First Robot World Cup Soccer Games and Conferences", H. Kitano (ed.), 1998. Springer Verlag, Berlin. available at http://www.cs.cmu.edu/afs/cs/usr/pstone/public/papers/

5. Peter Stone and Manuela Veloso " Task Decomposition and Dynamic Role Assignment for Real-Time Strategic Teamwork." Fifth International Workshop on Agent Theories, Architectures, and Languages (ATAL'98 ) avialble at http://www.cs.cmu.edu/afs/cs/usr/pstone/public/papers/

Annotated Bibliography