Project book: special project (project c)

Artificial Intelligence via Reinforcement Learning

Intellibot -

Reinforcement Learning in RoboCode

By

Avishay Livne

Instructor

Tomer Merkado

Spring semester 2006

1 Table of contents

2 Figures and tables list

3 Introduction

3.1 Abstract

3.2 Theoretical Background

4 Project goals

5 Project design

5.1 The RoboCode environment

5.2 Problem formulation

6 System architecture

7 Simulation results

8 Summary

8.1 Conclusions

8.2 Future work

9 Appendices

9.1 Result graphs

9.2 RoboCode user manual

9.2.1 Setting up the environment

9.2.2 Creating a battle

9.2.3 Adding robots to RoboCode

10 Bibliography

Introduction

Abstract

This project is a research project in the field of Machine Learning (ML), with a focus on Reinforcement Learning (RL) algorithms.

The first part of the project is the theoretical part and includes a literature review of papers and academic material on the subject.

The second part of the project is the practical part. In this part, the theoretical topics from the first part were implemented in order to solve a real problem.

The environment that was used for implementing the RL algorithms is called RoboCode. RoboCode is an environment that was developed by IBM as a tool for learning Java. In RoboCode, Java objects (called robots) battle each other under a set of given rules. Any programmer can develop their own robot using the interface the environment supplies. Most robots use movement patterns (in order to dodge bullets) and aiming patterns (predicting the enemy's position in the near future). Empirically, robots that use such patterns perform better than robots that act randomly.

In this project, robots that aim to learn such patterns were developed. These robots were designed to improve by playing against existing robots.

In order to compare different RL algorithms, several different robots were developed.

After the implementation phase, the robots were battled against existing robots in order to train their agents and improve. Several aspects were investigated during the simulations: how training against robot A affects the robot's performance against robot A itself, against a learning robot that was trained against robot B, and against other non-learning robots.

In addition, the learning times of the different robots were compared. The learning time of a given robot is the time that passes from the moment the robot starts to learn until its winning percentage against the opponent it is learning from becomes stable.

In some of the result graphs a real improvement can be seen, while in others the outcome is less conclusive.

Theoretical Background

Reinforcement learning – a survey with focus on Temporal Difference and Evolutionary Algorithms

Introduction

Reinforcement learning (RL) is an approach to automating goal-directed learning and decision-making. This approach addresses problems in which an agent interacts with an environment. Each of the agent's actions receives a reward from the environment (usually represented as a real number). RL algorithms aim to find a policy that maximizes the cumulative reward of the agent's actions.

The basic RL model consists of:

  1. A set of environment states.
  2. A set of possible agent actions.
  3. A set of real scalar rewards.

At a given time the agent perceives its state and a set of possible actions. It chooses an action and receives a reward and a new state. Based on this interaction with the environment, the agent should develop a policy that maps states to actions, such that the cumulative reward of the agent's actions is maximal.
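To make the model concrete, the following is a minimal Java sketch of this agent-environment interaction loop. The interfaces and names (Environment, Agent, InteractionLoop) are hypothetical and are not part of RoboCode or of any specific library; they only mirror the three components listed above.

// Hypothetical interfaces illustrating the basic RL model:
// states, actions and scalar rewards exchanged between agent and environment.
interface Environment {
    int currentState();                    // the state the agent perceives
    int[] possibleActions(int state);      // actions available in that state
    double performAction(int action);      // returns the immediate reward
    boolean isTerminal();                  // true when the episode ends
}

interface Agent {
    int chooseAction(int state, int[] possibleActions);           // the policy
    void observe(int state, int action, double reward, int next); // learning update
}

class InteractionLoop {
    // Runs one episode and returns the cumulative reward the policy collected.
    static double runEpisode(Environment env, Agent agent) {
        double total = 0.0;
        while (!env.isTerminal()) {
            int state = env.currentState();
            int action = agent.chooseAction(state, env.possibleActions(state));
            double reward = env.performAction(action);
            agent.observe(state, action, reward, env.currentState());
            total += reward;
        }
        return total;
    }
}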

RL has two main classes of methods: methods that search in the space of value functions and methods that search in the space of policies. The evolutionary-algorithms approach is an example of the second class, where the policy is modified as a whole. Temporal difference (TD) learning is an example of the first class, in which the attempt is to learn the value function (which determines the expected rewards). The function is learned through trial and error while exploring the environment.

Difficulties and tradeoffs in RL

RL has several main difficulties:

  1. The reward tells the agent how good its action was, but not whether it was the best action or how good it was compared to other possible actions. The agent has to evaluate the rewards over time and develop a policy based on its evaluations, which could be wrong.
  2. Exploitation vs. exploration: In a given state the agent has to decide on one possible action. Sometimes it will have a favorable action, for which it received a high reward in the past. The agent could pick this favorable action, but then it forgoes the chance to explore other actions, which could lead to better rewards in the future. Exploitation is using the accumulated knowledge to choose a greedy action; this behavior aims to maximize the reward of a single play. Exploration is choosing a random action; this way the agent learns more about the environment and its database becomes more accurate. In terms of genetic algorithms, exploitation is analogous to recombination, while exploration is analogous to mutation. Since the agent cannot explore and exploit in the same step, this tradeoff is one of the main problems of RL (a minimal ε-greedy sketch is given after this list).
  3. Large state space: in some systems the space of states is too large to be explored, and the agent cannot develop a policy that maps every state to an action. In such cases the agent should be able to apply a single rule to a set of states.
  4. Incomplete state information: the agent bases its actions on the way it perceives the environment, but sometimes it is not aware of all the information in the environment. The agent may believe it is in one state while it is actually in a different one, and it cannot distinguish between the two because of its limited perception. This situation leads the agent to select inappropriate actions.
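A standard way to balance the exploration-exploitation tradeoff is the ε-greedy rule: with probability ε the agent explores by picking a random action, and otherwise it exploits the action with the highest estimated value. The following Java sketch assumes a hypothetical array of action-value estimates for the current state; the class and names are illustrative only.

import java.util.Random;

class EpsilonGreedy {
    private final Random random = new Random();
    private final double epsilon;   // exploration probability, e.g. 0.1

    EpsilonGreedy(double epsilon) { this.epsilon = epsilon; }

    // values[a] is the current estimated value of action a in the current state.
    int chooseAction(double[] values) {
        if (random.nextDouble() < epsilon) {
            return random.nextInt(values.length);      // explore: random action
        }
        int best = 0;                                  // exploit: greedy action
        for (int a = 1; a < values.length; a++) {
            if (values[a] > values[best]) best = a;
        }
        return best;
    }
}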

Methods and approaches in RL

As mentioned above, RL has two main classes of methods.

  1. Methods that search in the policy space, such as Evolutionary Algorithms, perform the learning after a complete execution. These methods do not consider the reward given at every single step during the run, but examine only the cumulative reward after completion.
  2. Methods that search in the value-function space, such as Temporal Difference methods, consider the reward received at each step. Every approach has its own advantages and disadvantages, so the choice of approach should be based on the specific problem itself. Moreover, some methods consider both the per-step rewards and the final cumulative reward, so there is no strict distinction.

Policy space case: Evolutionary Algorithms

In order to find an optimal solution (algorithm) to a given RL problem, the EA method searches for a solution based on the idea of evolution by natural selection. In the initialization step, one or more algorithms are created, either randomly or under user control. In every iteration the next generation of algorithms is created and evaluated. Most of the new algorithms are produced by recombination of two older algorithms that received a high evaluation. A few of the algorithms' substructures are created by mutation of old substructures. After the new generation is created, every algorithm is evaluated, usually by simulating the algorithms and ranking them according to their performance.

The rationale behind recombination is taking the best from parents that perform well compared to other algorithms. The rationale behind mutation is to keep exploring new genes and policies; in the mathematical representation of the problem, this is a way to avoid local maxima.

The pseudo code for a general EA:

procedure EA
begin
    t = 0;
    initialize P(t);                  // P(t) is the population of candidates at generation t
    evaluate structures in P(t);
    while termination condition not satisfied do
    begin
        t = t + 1;
        select P(t) from P(t-1);      // parent selection
        alter structures in P(t);     // recombination and mutation
        evaluate structures in P(t);  // e.g. by simulating each candidate
    end
end

The major design decisions are how to represent the space of policies using chromosomes and how to evaluate the population of algorithms. In addition, the user has to set different control parameters such as population size, recombination rate, mutation rate and parent-selection rules.
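To make these design decisions concrete, the following Java sketch shows one generation of a simple EA over a population of real-valued chromosomes. The chromosome representation, the placeholder fitness function and the parameter values are illustrative assumptions and are not the representation used in this project.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

class SimpleEA {
    // Control parameters the user has to set.
    static final int POPULATION_SIZE = 20;
    static final int CHROMOSOME_LENGTH = 8;
    static final double MUTATION_RATE = 0.05;
    static final Random RANDOM = new Random();

    // Placeholder fitness: in practice a candidate would be ranked by simulating it.
    static double evaluate(double[] chromosome) {
        return -Arrays.stream(chromosome).map(x -> x * x).sum();
    }

    // Produces the next generation from the current one (assumes at least 2 members).
    static double[][] nextGeneration(double[][] population) {
        // Sort by fitness, best first; parents are drawn from the better half.
        double[][] sorted = population.clone();
        Arrays.sort(sorted, Comparator.<double[]>comparingDouble(SimpleEA::evaluate).reversed());
        double[][] next = new double[population.length][];
        for (int i = 0; i < next.length; i++) {
            double[] a = sorted[RANDOM.nextInt(sorted.length / 2)];
            double[] b = sorted[RANDOM.nextInt(sorted.length / 2)];
            double[] child = new double[CHROMOSOME_LENGTH];
            for (int g = 0; g < CHROMOSOME_LENGTH; g++) {
                child[g] = RANDOM.nextBoolean() ? a[g] : b[g];        // recombination
                if (RANDOM.nextDouble() < MUTATION_RATE) {
                    child[g] += RANDOM.nextGaussian();                // mutation
                }
            }
            next[i] = child;
        }
        return next;
    }
}

Here parent selection simply draws parents from the better half of the sorted population; real EAs often use fitness-proportional or tournament selection instead.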

Value function space case: Temporal Difference

TD is a method for predicting a future reward. For every step the agent reaches, a reward prediction is made, and whenever an observation is available the prediction is adjusted to better match the observation. The main difference between TD and other RL methods is that in TD the prediction changes not only according to the actual value found at the end of the time segment, but also according to the more recent predictions, which are assumed to be more accurate. A simple example is weather prediction: suppose a weatherman attempts to predict, on each day of the week, whether it will rain on Saturday. The conventional approach is to compare each prediction to the actual outcome (whether or not it rained on Saturday). In the TD approach we compare each day's prediction with the prediction of the following day. If on Monday we predicted a 50% chance of rain and on Tuesday a 75% chance was predicted, then a TD method would increase the prediction for days similar to Monday.

Because TD methods use the gathered experience more efficiently, they tend to converge faster.

The formula for updating the prediction, also known as the recursive Bellman-style update, is $V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$. We update the prediction of each step according to its previous estimate, the new temporal difference $\gamma V(s_{t+1}) - V(s_t)$ and the immediate reward $r_{t+1}$. $\alpha$ represents the learning rate and $\gamma$ the discount factor.
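As a worked instance of this update, applied to the weather example above (assuming, purely for illustration, $\alpha = 0.5$, $\gamma = 1$ and no intermediate reward):

$V(\text{Monday}) \leftarrow V(\text{Monday}) + \alpha\,[\,V(\text{Tuesday}) - V(\text{Monday})\,] = 0.5 + 0.5\,(0.75 - 0.5) = 0.625$

so the prediction for Monday-like days moves from 50% towards the more recent 75% estimate.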

In the Q-function viewpoint, where we evaluate every (state, action) pair, this can be written as $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$. The decision to take the maximal $Q(s_{t+1}, a)$ is justified by the desire to increase the convergence speed.

The case described above is known as the special case TD(0), while the general case TD($\lambda$) also weights the rewards of the following steps. In TD($\lambda$) we calculate the return using the future rewards of the following steps; the future rewards can tell us more accurately whether our decisions were correct. Therefore the return applied to a given decision is a weighted sum of the form $R_t = r_{t+1} + \lambda r_{t+2} + \lambda^2 r_{t+3} + \dots$, where $\lambda$ is the weight parameter. Alternatively it is possible to use the n-step return $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})$, so the value of the state n steps ahead is considered as well. With the new return we can re-evaluate the state value $V(s_t)$. The tradeoff of using TD($\lambda$) over simple TD(0) is more demanding computation and extensive memory consumption, against faster convergence and a more accurate prediction.

The pseudo code for the basic TD(0) algorithm:

procedure TD(0)
begin
    Initialize V(s) arbitrarily for all s in S; t = 0;
    Repeat for each step of the episode:
        Choose action a from the possible set of actions for state s.
        Take action a, observe reward r and next state s'.
        V(s) = V(s) + α [ r + γ V(s') − V(s) ];    // the temporal-difference update
        s = s';
    Until s is a terminal state
end
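The same update can be written compactly in Java over a table of state values. This is a minimal sketch that assumes states are already encoded as integer indices; the class and field names are illustrative and are not the project's implementation.

class TdZeroLearner {
    private final double[] values;      // V(s), indexed by state id
    private final double alpha;         // learning rate
    private final double gamma;         // discount factor

    TdZeroLearner(int numStates, double alpha, double gamma) {
        this.values = new double[numStates];   // arbitrary initialization (zeros)
        this.alpha = alpha;
        this.gamma = gamma;
    }

    // One TD(0) update after observing: state s, reward r, next state sNext.
    void update(int s, double reward, int sNext) {
        double tdTarget = reward + gamma * values[sNext];
        values[s] += alpha * (tdTarget - values[s]);
    }

    double value(int s) {
        return values[s];
    }
}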

Project Goals

The main goal of the project is to become familiar with ML, with a focus on RL algorithms, and with how to use them for Artificial Intelligence (AI) purposes. Below is a list of the goals that were defined for the project:

  • Literature review
    ◦ Read a set of selected papers and books on the theoretical part of the project.
    ◦ Summarize the relevant material in a report.
  • Selecting a suitable problem
    ◦ Finding a problem that could be solved via RL algorithms.
      This problem should satisfy a few requirements:
      ▪ It should be feasible to formulate the problem as an RL problem.
      ▪ The problem should be feasible to implement.
      ▪ It is preferable to find a problem that is also solvable with non-RL techniques, in order to have a benchmark for the RL solution.
  • Solving the problem
    ◦ Learning the problem's environment.
    ◦ Designing a non-learning solution.
    ◦ Formulating the problem as an RL problem.
    ◦ Designing learning solutions.
    ◦ Implementing the learning solutions.
  • Investigating performance
    ◦ Running different simulations for different scenarios.
    ◦ Collecting results for later comparison.
    ◦ Producing graphs to ease the examination.
  • Summary and conclusions
    ◦ Explaining the results.

Project design

RoboCode environment

"RoboCode is an easy-to-use robotics battle simulator that runs across all platforms supporting Java 2. You create a robot, put it onto a battlefield, and let it battle to the bitter end against opponent robots created by other developers. RoboCode comes with a set of pre-fab opponents to get you started, but once you outgrow them, you can enter your creation against the world's best in one of the leagues being formed worldwide." – From IBM's RoboCode website.

The RoboCode platform supplies programmers with an environment of tanks that battle each other on a plain, obstacle-free battlefield. The programmer writes only the code that defines the behavior of the tank. The platform provides different tank sensors and controllers. The sensors are represented as a set of getter functions (e.g. getEnergy(), getHeading()) and a set of events (e.g. BulletHitEvent, HitByBulletEvent). The programmer uses these inputs to determine an action, and then sends it to the game using action functions such as fire(), turnLeft() etc.
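For illustration, a minimal robot written against this interface could look as follows. The behavior shown (a fixed movement pattern plus firing at any scanned robot) is only a sketch in the spirit of the RoboCode sample robots; it is not one of the learning robots developed in this project.

import robocode.Robot;
import robocode.ScannedRobotEvent;
import robocode.HitByBulletEvent;

public class MinimalBot extends Robot {
    // Main loop: RoboCode calls run() once and the robot keeps acting until it dies.
    public void run() {
        while (true) {
            ahead(100);          // move forward 100 pixels
            turnGunRight(360);   // sweep the gun (and the radar mounted on it) to scan
            back(100);           // move back
        }
    }

    // Sensor event: another robot was detected by the radar.
    public void onScannedRobot(ScannedRobotEvent e) {
        fire(1);                 // fire with power 1 at the scanned robot
    }

    // Sensor event: this robot was hit by a bullet.
    public void onHitByBullet(HitByBulletEvent e) {
        turnLeft(90);            // simple evasive turn
    }
}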

A detailed manual explaining how to use RoboCode is attached in Appendix 9.2.

Problem formulation

RL aims to learn and optimize a mapping from states to actions through trial-and-error interactions with the environment.

In order to implement RL the following should be formulated:

  • States
  • Actions
  • Reward
  • Q-function

Below are two approaches to the problem formulation. The first one is detailed and intended to map the world with high accuracy, and therefore aims at a precise policy; however, as shown below, it turns out to be impractical. The second approach reduces the level of detail in order to keep the state space manageable.

First attempt – detailed function

States

The robot's state is defined according to the following parameters:

Heading – the direction of the robot. 4 possible values {up / right / down / left}.

Gun heading – the direction of the robot's gun. 4 possible values {up / right / down / left}.

Radar heading – the direction of the robot's radar. 4 possible values {up / right / down / left}.

Robot's location – the (x, y) coordinates of the robot. Under the assumption that the battlefield's dimensions are 800x600 pixels, the battlefield is divided into a grid of 40 equally sized areas. 40 possible values.

Robot's energy – the amount of energy left. Instead of 100 different values, every 10 sequential levels are grouped together, resulting in 10 possible values {>90, >80, …, >10, >0}.

Enemy's location, relative angle – The place of the enemy relative to the robot's heading. 4 possible values {up / right / down / left}.

Enemy's location, distance – the distance from the robot to the enemy. By drawing imaginary radii we can use 4 different values for this parameter {x ≤ 150, 150 < x ≤ 300, 300 < x ≤ 450, 450 < x} (values are in pixels).

Enemy's energy – the amount of energy left to the enemy. 10 possible values.

Was the robot hit by a bullet? – 2 possible values {was hit / wasn't hit} by a bullet in the last turn.

Did the robot collide with a wall? – 2 possible values {collided / didn't collide} with a wall in the last turn.

Summing up, in this approach there are 4x4x4x40x10x4x4x10x2x2 = 16,384,000 different states.

Actions

The following list contains the possible actions:

Move, direction – 2 possible values {forward / backwards}.

Move, distance – 5 possible values {0 / 10 / 20 / 40 / 80}.

Turn – change the angle of the robot. 4 possible values {up / right / down / left}.

Turn gun - change the angle of the gun. 4 possible values {up / right / down / left}.

Turn radar - change the angle of the radar. 4 possible values {up / right / down / left}.

Fire power – 5 possible values {0 / 0.5 / 1 / 2 / 3}. 0 means don't shoot.

Summing up, in this approach there are 2x5x4x4x4x5 = 3,200 different possible actions.

Reward

The objectives of the robot are defined by the reward for every action.

The reward is derived from the change in the robot's energy and the change in the enemy's energy. It can be calculated as:

$r = \Delta E_{robot} - \Delta E_{enemy}$

where $\Delta E_{robot}$ = the change in the robot's energy (current energy – last turn's energy), and $\Delta E_{enemy}$ = the change in the enemy's energy.

Problem

The problem with this formulation is the amount of space required to store all this information. The Q-function table must contain a rating for every (State, Action) pair, resulting in 16,384,000 x 3,200 = 52,428,800,000 values. Assuming every value is represented as a float variable and consumes 4 bytes, this amount of values demands about 195 gigabytes. A database this big cannot be held in memory at runtime, and therefore this level of detail is not practical.
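The memory estimate can be reproduced with a few lines of Java (the constants are the ones derived above; only the printout is new):

class TableSizeEstimate {
    public static void main(String[] args) {
        long states = 16_384_000L;            // 4*4*4*40*10*4*4*10*2*2
        long actions = 3_200L;                // 2*5*4*4*4*5
        long entries = states * actions;      // Q-table entries: 52,428,800,000
        long bytes = entries * 4L;            // 4 bytes per float value
        double gigabytes = bytes / (1024.0 * 1024.0 * 1024.0);
        System.out.printf("%d entries, about %.0f GB%n", entries, gigabytes);
    }
}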

The next attempt reduces the detail level in order to maintain a small database, small enough to be read before the battle begins under RoboCode's environment limitations (robots must finish their initialization phase within a certain time constraint).

Second attempt – reducing space

States

The robot's state is derived only from the ratio r between the enemy's energy and the robot's energy, according to the following table:

State / Priority / Ratio
Very Dangerous / 6 / r > 1.5
Dangerous / 5 / 1.5 ≥ r > 1.2
Unsafe / 4 / 1.2 ≥ r > 1
Normal / 3 / 1 ≥ r > 0.7
Safe / 2 / 0.7 ≥ r > 0.5
Very Safe / 1 / 0.5 ≥ r
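A direct translation of this table into code could look as follows. The enum and method names are illustrative only, and the ratio r is assumed, as above, to be the enemy's energy divided by the robot's energy.

enum DangerState {
    VERY_SAFE, SAFE, NORMAL, UNSAFE, DANGEROUS, VERY_DANGEROUS;

    // Maps the energy ratio r to a state, following the thresholds in the table above.
    static DangerState fromRatio(double r) {
        if (r > 1.5) return VERY_DANGEROUS;
        if (r > 1.2) return DANGEROUS;
        if (r > 1.0) return UNSAFE;
        if (r > 0.7) return NORMAL;
        if (r > 0.5) return SAFE;
        return VERY_SAFE;
    }
}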

Actions

A predefined set of actions is available for the robot and the desired action for each state is described in the following table:

Priority / Action / Appropriate for state
6 / Very defensive / Dangerous, Very Dangerous
5 / Defensive / Unsafe, Dangerous
4 / Normal / Normal
3 / Strong / Safe, Normal
2 / Offensive / Safe
1 / Very offensive / Very Safe, Safe

The robot has a basic behavior, and the chosen action determines a few parameters of that behavior in each turn.

Each turn the robot performs the following actions: move, turn, fire and scan for other robots.

In safer states the robot allows itself to take more aggressive actions, meaning firing with more power and moving more frantically.

On the other hand, when more dangerous states are encountered, the robot takes more defensive actions, including reducing its fire power and calming its movement.

When using more firepower the robot suffers a bigger penalty for misses (reflected in an energy reduction).

When moving more frantically the robot improves its bullet dodging, but increases the chance of hitting walls and other robots.
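One possible way to express this state-dependent behavior in code is to let the chosen action level set the turn's fire power and movement distance, as in the sketch below. The specific numbers and names are illustrative assumptions rather than the values used in the project.

class BehaviorParameters {
    final double firePower;       // more power means more damage but a bigger miss penalty
    final double moveDistance;    // larger moves dodge better but risk hitting walls

    private BehaviorParameters(double firePower, double moveDistance) {
        this.firePower = firePower;
        this.moveDistance = moveDistance;
    }

    // actionPriority follows the action table above: 6 = very defensive ... 1 = very offensive.
    static BehaviorParameters forAction(int actionPriority) {
        switch (actionPriority) {
            case 6:  return new BehaviorParameters(0.5, 20);   // very defensive
            case 5:  return new BehaviorParameters(1.0, 40);   // defensive
            case 4:  return new BehaviorParameters(1.5, 60);   // normal
            case 3:  return new BehaviorParameters(2.0, 80);   // strong
            case 2:  return new BehaviorParameters(2.5, 100);  // offensive
            default: return new BehaviorParameters(3.0, 120);  // very offensive
        }
    }
}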

Reward

The objectives of the robot are defined by the reward for every action.

In this design I chose to reward the robot for its actions according to the following equation:

where $\Delta E_{robot}$ = the change in the robot's energy (current energy – last turn's energy).