Robot A: makes random moves (L, R, D, U). Whenever it finds dirt, it sucks it up. A macro THRESHHOLD decides when dirt is worth picking up; its value is currently kept at 0.

Robot B: follows a greedy path, choosing the next move at random whenever more than one neighbour is sensed to have the optimal value.

Robot C: keeps a record of all the states it has seen and of its own moves, and performs dirt collection in phases: it collects dirt for K moves, then stops to think about how it should plan the next K moves. It assumes a semi-continuous distribution of dirt.

Thinking phase: Using its memory of previously seen values, the robot builds a view of the environment, noting which locations are still unknown. Following the semi-continuity assumption, it stabilises the dirt distribution to predict the unknown values, but pessimistically: it treats the known values as the maxima of their local neighbourhoods. Once it has this expected grid (the expected initial distribution of the environment), it subtracts the dirt from the squares it has already cleaned. The robot then computes the optimal sequence of actions for the next K steps.

Dirt collection: The robot follows its plan, collecting dirt while noting the new values it observes and remembering what it cleaned.

Note: with K = 1 this reduces to a robot like Robot B. We need to choose a good value of K: a small K moves us towards the greedy approach, while a large K does not make sense because we would depend too much on our predictions. K is currently kept at 3, and it would be useful to try different values of K on various inputs to arrive at a good value.

Sketches of the three robots, under assumed data structures, follow below.
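Robot A sketch: a minimal C++ reading of the random walker. The grid size N, the dirt array, and the function name robotAStep are assumptions for illustration; THRESHHOLD is the macro named above (spelling kept as in the original), and whether sucking and moving share a step is a guess.

    #include <cstdlib>

    #define THRESHHOLD 0          /* current value, as in the original */

    const int N = 10;             /* hypothetical grid size */
    int dirt[N][N];               /* dirt[r][c] = amount of dirt at (r, c) */

    /* One step of Robot A: suck if the current cell holds more dirt than
     * THRESHHOLD, otherwise move one square in a uniformly random
     * direction (L, R, D, U), staying inside the grid. */
    void robotAStep(int &r, int &c) {
        if (dirt[r][c] > THRESHHOLD) {
            dirt[r][c] = 0;                      /* suck up the dirt */
            return;
        }
        static const int dr[4] = {0, 0, 1, -1}; /* L, R, D, U row deltas */
        static const int dc[4] = {-1, 1, 0, 0}; /* L, R, D, U col deltas */
        int m = std::rand() % 4;
        int nr = r + dr[m], nc = c + dc[m];
        if (0 <= nr && nr < N && 0 <= nc && nc < N) {
            r = nr;
            c = nc;
        }
    }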
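Robot B sketch: one way to implement the greedy step with random tie-breaking. All identifiers are again hypothetical; the key point is that every neighbour tied for the best sensed value is collected first, and the move is then drawn uniformly from those ties.

    #include <cstdlib>
    #include <vector>

    #define THRESHHOLD 0

    const int N = 10;
    int dirt[N][N];

    const int dr[4] = {0, 0, 1, -1};  /* L, R, D, U */
    const int dc[4] = {-1, 1, 0, 0};

    /* One step of Robot B: scan the in-bounds neighbours, keep every move
     * tied for the highest sensed dirt value, pick one of those at random,
     * and clean the square it arrives on. */
    void robotBStep(int &r, int &c) {
        int best = -1;
        std::vector<int> ties;        /* moves tied for the best value */
        for (int m = 0; m < 4; ++m) {
            int nr = r + dr[m], nc = c + dc[m];
            if (nr < 0 || nr >= N || nc < 0 || nc >= N) continue;
            if (dirt[nr][nc] > best) {
                best = dirt[nr][nc];
                ties.clear();
            }
            if (dirt[nr][nc] == best) ties.push_back(m);
        }
        int m = ties[std::rand() % ties.size()];
        r += dr[m];
        c += dc[m];
        if (dirt[r][c] > THRESHHOLD) dirt[r][c] = 0;  /* clean on arrival */
    }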
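Robot C sketch: the original does not spell out how the distribution is "stabilised", so this reads it as a relaxation that fills each unknown cell with the smallest value among its already-valued neighbours, which keeps every known value a local maximum (the pessimistic assumption). The arrays seen and cleaned, and the functions predict and plan, are assumptions. Planning is a brute-force scan of all 4^K move sequences, which is cheap for K = 3 (64 sequences).

    #include <algorithm>
    #include <array>

    const int N = 10;
    const int K = 3;                  /* planning horizon, currently 3 */
    const int UNKNOWN = -1;

    int seen[N][N];                   /* last value observed, or UNKNOWN */
    int cleaned[N][N];                /* dirt already removed per cell */

    const int dr[4] = {0, 0, 1, -1};  /* L, R, D, U */
    const int dc[4] = {-1, 1, 0, 0};

    /* Thinking phase, steps 1-2: build the expected grid by relaxing the
     * unknown cells down to their smallest valued neighbour, then subtract
     * whatever the robot has already cleaned. */
    void predict(int grid[N][N]) {
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                grid[r][c] = seen[r][c];
        for (bool changed = true; changed; ) {
            changed = false;
            for (int r = 0; r < N; ++r)
                for (int c = 0; c < N; ++c) {
                    if (grid[r][c] != UNKNOWN) continue;
                    int lo = UNKNOWN;
                    for (int m = 0; m < 4; ++m) {
                        int nr = r + dr[m], nc = c + dc[m];
                        if (nr < 0 || nr >= N || nc < 0 || nc >= N) continue;
                        if (grid[nr][nc] == UNKNOWN) continue;
                        if (lo == UNKNOWN || grid[nr][nc] < lo) lo = grid[nr][nc];
                    }
                    if (lo != UNKNOWN) { grid[r][c] = lo; changed = true; }
                }
        }
        for (int r = 0; r < N; ++r)
            for (int c = 0; c < N; ++c)
                if (grid[r][c] != UNKNOWN)
                    grid[r][c] = std::max(0, grid[r][c] - cleaned[r][c]);
    }

    /* Thinking phase, step 3: try every K-move sequence on the expected
     * grid and keep the one that collects the most predicted dirt. */
    std::array<int, K> plan(int r, int c) {
        int expect[N][N];
        predict(expect);
        std::array<int, K> best{};
        int bestGain = -1;
        int total = 1;
        for (int i = 0; i < K; ++i) total *= 4;
        for (int code = 0; code < total; ++code) {
            int sim[N][N];
            for (int i = 0; i < N; ++i)
                for (int j = 0; j < N; ++j) sim[i][j] = expect[i][j];
            std::array<int, K> moves{};
            int pr = r, pc = c, gain = 0, x = code;
            bool ok = true;
            for (int s = 0; s < K; ++s) {
                moves[s] = x % 4;      /* decode move s from base-4 code */
                x /= 4;
                int nr = pr + dr[moves[s]], nc = pc + dc[moves[s]];
                if (nr < 0 || nr >= N || nc < 0 || nc >= N) { ok = false; break; }
                pr = nr; pc = nc;
                if (sim[pr][pc] > 0) { gain += sim[pr][pc]; sim[pr][pc] = 0; }
            }
            if (ok && gain > bestGain) { bestGain = gain; best = moves; }
        }
        return best;
    }

With K = 1 the brute-force search degenerates to picking the single best neighbour, which is why Robot C reduces to Robot B in that case.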