Project Proposal

CS 790 (MEng Project)

Stefan Witwicki

Jail Break / Prison Guard Training Game
(Cooperating AI)

The Motivation

The original intent of this project is to train agents to cooperate with one another in order to achieve a common goal. Thus, in designing the game, it seems crucial to reward the agents only if they demonstrate cooperation; this, in essence, forces them to cooperate if they are to learn and improve. The design attempts to make the agents’ goal difficult enough that some level of cooperation is necessary for the agents to win the game. On the other hand, the game environment must be simple enough that the AI can learn relatively quickly. Additionally, the game tries to be somewhat unique, interesting to watch, and fun to play against.

The Game

A lock breaks and a prisoner’s cell door swings open. An alarm goes off, alerting two prison guards to the breach. As the prisoner runs toward the exit, the guards must navigate the maze of hallways and catch him before he reaches it. The environment might look something like the picture on the previous page. In order to catch the prisoner, the guards must corner him. Since there are no corners or dead ends (only hallways), this means getting on opposite sides of him in a hallway so that he has no room to move in any direction.
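To make the capture rule concrete, here is a minimal sketch in Java (the Board interface and its isWall/isGuard methods are hypothetical names, not settled design): the prisoner is caught exactly when every adjacent square is blocked by a wall or occupied by a guard.

    import java.awt.Point;

    public class CaptureRule {

        // Hypothetical view of the board; the real game will expose
        // something equivalent, however the maze ends up being stored.
        interface Board {
            boolean isWall(int x, int y);
            boolean isGuard(int x, int y);
        }

        /** The prisoner is captured when no adjacent square is open. */
        static boolean isCaptured(Board board, Point prisoner) {
            int[][] dirs = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
            for (int[] d : dirs) {
                int nx = prisoner.x + d[0];
                int ny = prisoner.y + d[1];
                if (!board.isWall(nx, ny) && !board.isGuard(nx, ny)) {
                    return false;  // at least one escape square remains
                }
            }
            return true;  // walled or guarded on all four sides
        }
    }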

Notice that there are multiple routes the prisoner may take to the exit, making it non-trivial for the guards to catch him. To make things even more difficult, the prisoner can move twice as fast as the guards. All players move only one square at a time, but the order of play is: prisoner moves, guard #1 moves, prisoner moves, guard #2 moves, prisoner moves, and so on. Even if the prisoner’s strategy isn’t very intelligent (e.g. running around like a chicken with its head cut off), the guards will have to work together in order to catch him.
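The turn order is easy to get wrong, so here is a minimal sketch of one round of play (the Agent interface and takeTurn method are placeholders, not settled design):

    public class TurnLoop {
        interface Agent { void takeTurn(); }

        // Interleaved turn order: the prisoner moves twice per round,
        // once before each guard, giving him his 2:1 speed advantage.
        static void playRound(Agent prisoner, Agent guard1, Agent guard2) {
            prisoner.takeTurn();   // prisoner moves
            guard1.takeTurn();     // guard #1 moves
            prisoner.takeTurn();   // prisoner moves again
            guard2.takeTurn();     // guard #2 moves
        }
    }

The real game loop would repeat playRound, checking for a capture or an escape after every individual move.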

The States and Acts

The state of each agent is his location in the environment, most likely in the form of an <x, y> board coordinate. Each state will contain information about adjoining states as well as a set of percepts (described in the next section). Whenever an agent takes a turn, his set of possible acts consists of all moves to open adjacent squares; he may not, of course, move through walls or onto squares occupied by other agents. Alternatively, he can choose not to move, and if he is cornered, this is his only available act.
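A sketch of act generation under these rules (Board and isOpen are again hypothetical names): the legal acts from a square are the open orthogonal neighbors, plus the ever-present option of staying put.

    import java.awt.Point;
    import java.util.ArrayList;
    import java.util.List;

    public class Acts {
        interface Board {
            // True if (x, y) is on the board, not a wall, and not
            // occupied by another agent.
            boolean isOpen(int x, int y);
        }

        /** All legal acts from (x, y): staying put, plus open neighbors. */
        static List<Point> legalActs(Board board, int x, int y) {
            List<Point> acts = new ArrayList<>();
            acts.add(new Point(x, y));  // "don't move" is always available
            int[][] dirs = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
            for (int[] d : dirs) {
                if (board.isOpen(x + d[0], y + d[1])) {
                    acts.add(new Point(x + d[0], y + d[1]));
                }
            }
            return acts;  // when cornered, only the stay-put act remains
        }
    }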

The Percepts

Each player’s vision extends to the nearest wall in each of the four (horizontal and vertical) directions. At the beginning of the game, both guards are aware of the start location of the prisoner. But the guards have an extra percept that should give them an advantage over the prisoner. The guards have some sort of communication link (i.e. two way radios), some channel over which one guard may send information to the other and vice versa. Hopefully, in training, the guards will learn to use this channel to give each other useful information such as “he went that way”. But no structure will be pre-imposed on the information. It will be up the training algorithm. The guards will have to develop their own sort of language. And whatever they choose to communicate in that language will be based on their percepts. Each time a guard agent takes a turn, he can also send information over the communications channel.
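A sketch of the vision percept (again with a hypothetical Board interface): scan outward from the agent in each of the four directions, collecting squares until a wall is hit. A guard’s outgoing message would simply be appended to these percepts as an unstructured vector of values whose meaning the training algorithm is free to invent.

    import java.awt.Point;
    import java.util.ArrayList;
    import java.util.List;

    public class Vision {
        interface Board {
            // Should also return true for off-board coordinates, so the
            // maze is effectively bounded by walls.
            boolean isWall(int x, int y);
        }

        /** Squares visible from (x, y): everything up to the nearest
         *  wall in each horizontal and vertical direction. */
        static List<Point> visibleSquares(Board board, int x, int y) {
            List<Point> seen = new ArrayList<>();
            int[][] dirs = { {1, 0}, {-1, 0}, {0, 1}, {0, -1} };
            for (int[] d : dirs) {
                int cx = x + d[0], cy = y + d[1];
                while (!board.isWall(cx, cy)) {
                    seen.add(new Point(cx, cy));
                    cx += d[0];
                    cy += d[1];
                }
            }
            return seen;
        }
    }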

The Training

Since this project focuses on cooperative AI, much more attention will be given to the training of the guard agents than to that of the prisoner agent. Initially I plan to give the prisoner a simple rule-based strategy, and I will also allow for human control of the prisoner. If both of these methods prove insufficient to properly train the guards, I will try something more sophisticated. As for the training of the guards, one strategy I would like to apply is temporal difference (TD) learning. One possibility for the models that will be trained (and used to control the guards) is artificial neural networks (ANNs). I have quite a bit of experience with ANNs, although not in conjunction with TD learning. I plan to do some reading in these areas, as well as any others you think might be relevant.
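As an illustration only (the actual rules are still to be designed), the initial rule-based prisoner might be as simple as a greedy policy that picks whichever legal act brings him closest to the exit:

    import java.awt.Point;
    import java.util.Comparator;
    import java.util.List;

    public class GreedyPrisoner {
        /** Choose the legal act that minimizes Manhattan distance to
         *  the exit. Deliberately naive; a smarter opponent can be
         *  substituted later if this fails to challenge the guards. */
        static Point chooseAct(List<Point> legalActs, Point exit) {
            return legalActs.stream()
                .min(Comparator.comparingInt(
                    p -> Math.abs(p.x - exit.x) + Math.abs(p.y - exit.y)))
                .orElseThrow();
        }
    }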

Implementation

I have chosen to build a graphical interface for the game in Java. Training will be performed by applying temporal difference Q-learning, with an ANN serving as the function approximator.
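A minimal sketch of the intended learning step, assuming a hypothetical QNetwork that maps a (state, act) feature vector to a scalar value. Each observed transition trains the network toward the standard Q-learning target, reward + gamma * max over next acts of Q(s', a'):

    import java.util.List;

    public class TDTrainer {
        /** Hypothetical network interface: predicts Q(s, a) from a
         *  feature encoding and trains toward a scalar target. */
        interface QNetwork {
            double predict(double[] stateActFeatures);
            void train(double[] stateActFeatures, double target);
        }

        static final double GAMMA = 0.9;  // discount factor (tunable)

        /** One TD Q-learning update for the transition (s, a) -> (r, s'). */
        static void update(QNetwork net,
                           double[] stateAct,            // features of (s, a)
                           double reward,                // e.g. +1 on capture
                           List<double[]> nextStateActs, // features of (s', a')
                           boolean terminal) {
            double best = 0.0;
            if (!terminal) {
                best = Double.NEGATIVE_INFINITY;
                for (double[] sa : nextStateActs) {
                    best = Math.max(best, net.predict(sa));
                }
            }
            // Q-learning target: r + gamma * max_a' Q(s', a')
            net.train(stateAct, reward + GAMMA * best);
        }
    }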