Simple Reinforcement Learning Agents: Pareto Beats Nash in an Algorithmic Game Theory Study

Steven O. Kimbrough

Ming Lu

University of Pennsylvania

Draft: 9 November 2003

1  Background

Contexts of strategic interaction (CSIs) appear in nearly every social situation. They are characterized by interdependent decision making: two or more agents have choices to make and the rewards an individual receives in consequence of its choices depend, at least in part, on the choices made by other agents. Such contexts, when abstracted and formalized in certain ways, are the subject of game theory, which seeks to “solve”—predict and explain the outcomes of—games (i.e., of CSIs abstracted and formalized in certain stylized fashions).

Any solution theory for CSIs (or games) must make and rely upon two kinds of assumptions:

1.  SR (Strategic Regime) assumptions. There are assumptions about the representation and structure of the CSI (or game), including the rules of play and the payoffs to the players. Typically, these assumptions are expressed as games in strategic form, games in extensive form, characteristic function games, spatial games, and so on.

2.  SSR assumptions. These are assumptions about the Strategy Selection Regimes (SSRs) employed by the agents, or players, in the game. Classical game theory makes two kinds of SSR assumptions, which typically apply to all players (Luce & Raiffa, 1957; Shubik, 1982):

a.  Ideal rationality assumptions. It is normally assumed that agents are ‘rational’ and that Rational Choice Theory in some form (e.g., Savage’s Subjective Expected Utility theory) characterizes this kind of (ideal) rationality. Roughly, agents are assumed to have utilities and to be maximizers of their utilities.

b.  Knowledge assumptions. It is normally assumed that agents are omniscient with respect to the game. The agents know everything about the game, common knowledge obtains among all the players, and all agents have unlimited computational/ratiocination powers.

In what follows, we report on a series of experimental investigations that examine play in games under SSR assumptions that are non-standard, at least as judged by the classical game theory literature. The games we investigate are well recognized in the classical literature and have been extensively studied. Our game (Strategic Regime) assumptions are conventional, although we focus on repeated or iterated (aka staged) games.

It is, and has always been, recognized that the classical SSR assumptions (as we call them) are unrealistic. The original experimental work on Prisoner’s Dilemma, among other games (Flood, 1952), was motivated by such concerns. Even so, they—and the consequences they engender—are interesting. The assumptions often afford tractability, allowing games to be ‘solved’. Because they capture the notion of a certain plausible kind of ideal rationality, it is interesting to determine how well they describe actual human behavior. Even if they are inaccurate, they have value as a normative benchmark. And given the considerable powers of human cognition and institutions, it is not prima facie implausible that classical SSR assumptions will often yield accurate predictions.

This is all well and good, but the story is not over. There are certain puzzles or anomalies associated with the classical SSR assumptions. Famously in the Prisoner’s Dilemma game, and in other games, the Nash Equilibrium (NE) outcome is not Pareto efficient. Classical theory sees the NE as the solution to the game, yet many observers find it anomalous and experiments with human subjects often indicate support for these observers (Luce & Raiffa, 1957; Rapoport & Guyer, 1976). Further, the NE need not be unique, posing thereby a challenge to the classical theory, which often struggles, or has to be stretched, to predict equilibrium outcomes that seem natural and that are reached by human subjects easily. In short, the classical theory has often proved to be a poor—weak and inaccurate—predictor of human behavior (Roth & Erev, 1995).

Besides the well-known puzzles and anomalies, there is another category of reasons to study games under variations of the classical SSR assumptions. Rational Choice Theory and omniscience may be plausible assumptions for experienced humans in certain favorable institutional settings (e.g., well-established markets). They are often not plausible assumptions for games played by birds, bees, monkeys up in trees, bacteria, and other similarly less cognitively well-endowed creatures. It is, simply put, scientifically interesting to investigate the play and outcomes in games in which the SSR assumptions of classical game theory are relaxed sufficiently to be capable of describing these kinds of more limited agents. Equally so, this is interesting from a practical, applications-oriented perspective. Adaptive artificial agents, e.g. fielded for purposes of electronic commerce, will inevitably resemble the lower animals more than their creators, at least in their cognitive powers.

With these motivations principally in mind, we investigated repeated play by simple, adaptive agents in a number of well-known games. Any such investigation, however, faces an immediate and urgent theoretical problem: There are indefinitely many ways to relax the classical SSR assumptions; how does one justify a particular alternative? We chose with a number of criteria in mind.

1.  Simple. There are few ways to be ideally rational and indefinitely many ways not to be. In examining alternatives, it is wise to begin with simple models and to complexify as subsequent evidence and modeling ambition require.

2.  New. Much has been learned about non-ideally rational agents through studies of the replicator dynamic (see Gintis, 2000, for a review). These investigations, however, see populations as evolving rather than individual agents as adapting: the individuals are typically modeled as naked, unchanging strategies that proliferate or go extinct during the course of continuing play, rather than as adaptive agents. Agents in some ‘spatialized’, cellular automata-style games have been given certain powers of state change and adaptation, but these have on the whole been limited in scope (e.g., Epstein & Axtell, 1996; Grim et al., 1998). Experimenting with game-playing agents that use reinforcement learning is a comparatively under-developed area, and the kinds of experiments we report here are, we believe, original.

3.  Theoretically motivated. Reinforcement learning as it has developed as a field of computational study has been directly and intendedly modeled on learning theories from psychology, where there is an extensive supporting literature. This important class of learning model is a natural first choice for modeling agents in games, because it appears to apply broadly to other areas of learning, because its theoretical properties have been well investigated, and because it has achieved a wide scope of application in multiple domains.

4.  Adaptive. Agents should be responsive to their environments and be able to learn effective modes of play.

5.  Exploring. Agents should be able to probe their environments actively and to undertake exploration in the service of adaptation; agents face the exploration-exploitation tradeoff and should engage in both exploration and exploitation.

In addition, the SSRs should be realizable, in the sense that they specify definite procedures that simple agents could actually undertake. It is here, perhaps, that the present approach, which we label algorithmic game theory, differs most markedly from classical game theory and its assumption of ideal rationality, irrespective of realizability constraints.

We turn now to a discussion of the elements of reinforcement learning needed as background for our experiments.

2  Reinforcement learning

2.1  Simple Q-learning

Our experimental agents used a simple form of Q-learning, itself a variety of reinforcement learning. Detailed description of Q-learning is easily found in the open literature (e.g., Watkins, 1989; Watkins & Dayan, 1992; Sutton & Barto, 1998). We limit ourselves here to a minimal summary for the purposes at hand.

The Q-learning algorithm works by estimating the values of state-action pairs. The value Q(s,a) is defined to be the expected discounted sum of future payoffs obtained by taking action a in state s and following an optimal policy thereafter. Once these values have been learned, the optimal action from any state is the one with the highest Q-value. The standard procedure for Q-learning is as follows. Assume that Q(s,a) is represented by a lookup table containing a value for every possible state-action pair, and that the table entries are initialized to arbitrary values. Then the procedure for estimating Q(s,a) is to repeat the following loop until a termination criterion is met:

1. Given the current state s, choose an action a. This results in the receipt of an immediate reward r and a transition to a next state s'. (We discuss below the policy used by the agent to pick particular actions, called the exploration strategy.)

2. Update Q(s, a) according to the following equation:

$$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right] \qquad (1)$$

where α is the learning rate parameter, γ is the discount factor, and Q(s,a) on the left is the new, updated value of Q(s,a).

In the context of repeated games, a reinforcement learning (Q-learning) player explores its environment (its opponent and the game structure) by taking some risk in step 1, choosing actions that might not be optimal according to its current estimates. In step 2, an action that leads to a higher reward strengthens the Q-value for the corresponding state-action pair. The above procedure is guaranteed to converge to the correct Q-values for stationary Markov decision processes, provided the learning rate decays appropriately and every state-action pair continues to be visited.
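As a concrete illustration, here is a minimal Python sketch of the tabular update in equation (1); the parameter values, function name, and data structure are illustrative assumptions, not the settings or code used in our experiments.

```python
from collections import defaultdict

ALPHA = 0.1   # learning rate (illustrative value only)
GAMMA = 0.9   # discount factor (illustrative value only)

Q = defaultdict(float)   # lookup table Q[(state, action)], entries default to 0.0

def update_q(state, action, reward, next_state, actions):
    """Apply equation (1): move Q(s, a) toward r + gamma * max over a' of Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] = (1 - ALPHA) * Q[(state, action)] + ALPHA * (reward + GAMMA * best_next)
```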

In practice, the exploration policy in step 1 (i.e., the action-picking policy) is usually chosen so that it ensures sufficient exploration while still favoring actions with higher value estimates in a given state. A variety of methods may be used. A simple method is to behave greedily most of the time, but with a small probability, ε, to choose an available action at random from those that do not have the highest Q-value. For obvious reasons, this action selection method is called ε-greedy (see Sutton & Barto, 1998). Softmax is another commonly used action selection method. Here again, actions with higher values are more likely to be chosen in a given state. The most common form for the probability of choosing action a is

$$P(a) = \frac{e^{Q(s,a)/\tau}}{\sum_{b} e^{Q(s,b)/\tau}} \qquad (2)$$

where τ is a positive parameter that decreases over time; it is typically called the temperature, by analogy with annealing. In the limit as τ → 0, Softmax action selection becomes greedy action selection. In our experiments we investigated both ε-greedy and Softmax action selection.
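To make the two selection rules concrete, the following is a minimal Python sketch, assuming a dictionary-style Q-table keyed by (state, action) as in the sketch above; the function names and the default value of ε are illustrative assumptions.

```python
import math
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With small probability epsilon, pick at random among the actions that do
    # not have the highest Q-value; otherwise act greedily.
    greedy = max(actions, key=lambda a: Q[(state, a)])
    others = [a for a in actions if a != greedy]
    if others and random.random() < epsilon:
        return random.choice(others)
    return greedy

def softmax(Q, state, actions, tau):
    # Boltzmann selection as in equation (2); tau is the temperature.
    weights = [math.exp(Q[(state, a)] / tau) for a in actions]
    threshold = random.random() * sum(weights)
    cumulative = 0.0
    for a, w in zip(actions, weights):
        cumulative += w
        if threshold <= cumulative:
            return a
    return actions[-1]
```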

2.2  Implementation of Q-learning for 2 by 2 games

A Q-learning agent does not require a model of its environment and can be used on-line. It is therefore well suited to repeated games against an unknown co-player (especially an adaptive, exploring co-player). Here we focus on certain repeated 2 by 2 games, in which there are two players, each having two possible plays/actions at each stage of the game. It is natural to represent the state of play, for a given player, as the outcome of the previous game played. We say in this case that the player has a memory length of one. The number of states for a 2 by 2 game is thus 4, and for each state there are two actions (the pure strategies) from which the player can choose in the current game. We also conducted the experiments for the case in which players have a memory length of two (so that the number of states is 16) and obtained broadly similar results. The immediate reward a player gets is specified by the payoff matrix.
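A sketch of this state representation (our own encoding, for illustration): the state is simply the previous round's outcome, written with the row player's move first.

```python
ACTIONS = ('C', 'D')
# Memory length one: the state is the previous outcome, one of CC, CD, DC, DD.
STATES = [r + c for r in ACTIONS for c in ACTIONS]   # ['CC', 'CD', 'DC', 'DD']

# One player's Q-table: 4 states x 2 actions. With memory length two the state
# would be the last two outcomes, giving 16 states.
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
```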

For the Softmax action selection method, we set the rate at which the temperature parameter τ decreases as follows:

$$\tau_n = T \cdot Q^n \qquad (3)$$

where T is a proportionality constant and n is the number of games played so far. Q, called the annealing factor, is here a positive constant less than 1 (not to be confused with the Q-values). In the implementation, when n becomes large enough, τ is close to zero and the player effectively stops exploring. We use Softmax, but in order to avoid a complete cessation of exploration, our agents switch to ε-greedy exploration once the Softmax annealing progresses to a point (discussed below) after which exploration is minimal.
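A minimal sketch of this schedule and of the switch to ε-greedy, reusing the selection functions sketched earlier; the constants and the switch threshold below are illustrative assumptions, not the settings used in our experiments.

```python
T_CONST = 5.0       # proportionality constant T in equation (3) (assumed value)
ANNEAL = 0.999      # annealing factor, a positive constant < 1 (assumed value)
TAU_SWITCH = 0.01   # temperature below which we fall back to epsilon-greedy (assumed value)

def temperature(n):
    # Equation (3): temperature after n games.
    return T_CONST * (ANNEAL ** n)

def select_action(Q, state, actions, n, epsilon=0.05):
    tau = temperature(n)
    if tau > TAU_SWITCH:
        return softmax(Q, state, actions, tau)            # Softmax while still annealing
    return epsilon_greedy(Q, state, actions, epsilon)     # keep exploring once tau is near zero
```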

3  Experiments

3.1  Motivation

Repeated 2 by 2 games are the simplest of settings for strategic interactions and are a good starting point to investigate how outcomes arise under a regime of exploring rationality versus the ideal rationality of classical game theory. The Definitely Iterated Prisoner’s Dilemma, involving a fixed number of iterations of the underlying game, is a useful example. Classical game theory, using a backwards induction argument, predicts that both players will defect on each play (Luce & Raiffa, 1957). If, on the other hand, a player accepts the risk of cooperating, hoping perhaps to induce cooperation later from its counter-player, it is entirely possible that both players discover the benefits of mutual cooperation. Even if both players suffer losses early on, subsequent sustained mutual cooperation may well reward exploration at the early stages.

Motivated by this intuition, we selected 8 games and parameterized their payoffs. The players are modeled as Q-learners in each repeated game. In 5 of the games the Pareto optimal (socially superior, i.e., maximal in the sum of its payoffs) outcome does not coincide with a Nash Equilibrium. The remaining 3 games, which we included to address the multi-equilibrium selection issue, each have two pure-strategy NEs.

3.2  The games and the parameterization

We parameterized each of our 8 games via a single parameter, d, in their payoff matrices. In the payoff matrices below, the first number is the payoff to the row player and the second is the payoff to the column player. We mark the Nash Equilibria with # and the Pareto efficient outcomes with *. Pareto optimal (socially superior) outcomes are labeled with **. C and D are the actions or pure strategies that players can take on any single round of play. The row player always comes first in our notation. Thus, CD means that the row player chose pure strategy C and the column player chose pure strategy D. So there are four possible outcomes of one round of play: CC, CD, DC, and DD.

The first two games are versions of Prisoner’s Dilemma (PD). The value of d ranges from 0 to 3. When its value is 2 (see Table 1), it corresponds to the most common payoff matrix in the Prisoner’s Dilemma literature.


         C              D
C    (3, 3)**       (0, 3+d)*
D    (3+d, 0)*      (3-d, 3-d)#

Table 1: Prisoner’s Dilemma, Pattern 1


         C              D
C    (3, 3)**       (0, 3+d)*
D    (3+d, 0)*      (d, d)#

Table 2: Prisoner’s Dilemma, Pattern 2
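For concreteness, the two parameterized payoff matrices might be encoded as follows (a sketch in our own notation; outcomes are keyed with the row player's move first, and each entry is the (row, column) payoff pair):

```python
def pd_pattern_1(d):
    # Table 1; with d = 2 this is the classic (T, R, P, S) = (5, 3, 1, 0) Prisoner's Dilemma.
    return {'CC': (3, 3), 'CD': (0, 3 + d), 'DC': (3 + d, 0), 'DD': (3 - d, 3 - d)}

def pd_pattern_2(d):
    # Table 2; mutual defection yields (d, d) rather than (3 - d, 3 - d).
    return {'CC': (3, 3), 'CD': (0, 3 + d), 'DC': (3 + d, 0), 'DD': (d, d)}
```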

While the Prisoner’s Dilemma, in its usual form, is a symmetric game (see Tables 1 and 2), the following three games, adapted from Rapoport and Guyer (1976), are asymmetric. The value of d ranges from 0 to 3 in our experiments with these games. Note that, as in the Prisoner’s Dilemma, in Games #47, #48, and #57 (Tables 3-5) the Nash Equilibrium does not coincide with the Pareto optimal outcome.