Learning Strategies for Mid-Level Robot Control:

Some Preliminary Considerations and Experiments

Nils J. Nilsson

Robotics Laboratory

Department of Computer Science

Stanford University

Stanford, CA 94305

http://robotics.stanford.edu/users/nilsson/bio.html

Draft of May 11, 2000

ABSTRACT

Versatile robots will need to be programmed, of course. But beyond explicit programming by a programmer, they will need to be able to plan how to perform new tasks and how to perform old tasks under new circumstances. They will also need to be able to learn.

In this article, I concentrate on two types of learning, namely supervised learning and reinforcement learning of robot control programs. I also argue that it would be useful for all of these programs (those explicitly programmed, those planned, and those learned) to be expressed in a common language. I propose what I think is a good candidate for such a language, namely the formalism of teleo-reactive (T-R) programs. Most of the article deals with the matter of learning T-R programs. I assert that such programs are PAC learnable and then describe some techniques for learning them and the results of some preliminary learning experiments. The work on learning T-R programs is in a very early stage, but I think enough has been done to warrant further development and experimentation. For that reason I make this article available on the web, but I caution readers about the tentative nature of this work. I solicit comments and suggestions at: .

I. Three-Level Robot Architectures

Architectures for the control of robots and other agents are often stratified into three levels. Working up from the motors and sensors, the servo level directly controls the effectors on the basis of sensory input, using various conventional and advanced control-theory mechanisms---sometimes implemented directly in hardware circuitry. Next, what I call the teleo-reactive level organizes the sequencing of servo-level actions so that they robustly react to unforeseen and changing environmental conditions in a goal-directed manner. Control at this level is usually implemented as computer programs that attempt to satisfy sub-goals specified by the level above. The top level, the strategic level, creates plans to satisfy user-specified goals. One of the earliest examples of this three-level control architecture was that used in Shakey, the SRI robot (Nilsson, 1984). There are several other examples as well (Connell, 1992).

There are various ways of implementing control at these levels---some of which support adaptive and learning abilities. I am concerned here primarily with the middle, teleo-reactive, level and with techniques by which programs at this level can learn. Among the proposals for teleo-reactive control are conventional computer programs with interrupt and sensor-polling mechanisms, so-called “behavior-based” control programs, neural networks (usually implemented on computers), finite-state machines using explicit state tables, and production-rule-like systems, such as the so-called “teleo-reactive” programs (Nilsson, 1994).

Some sort of adaptivity or machine learning seems desirable, possibly even required, for robust performance in dynamic, unpredictable environments. Two major kinds of learning regimes have been utilized. One is supervised learning, in which each datum in a specially gathered collection of sensory input data is paired with an action response known to be appropriate for that particular datum. This set of input/response pairs is called the training set. Learning is accomplished by adjusting the control mechanism so that it produces (either exactly or approximately) the correct action for each input in the training set.
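To make this concrete, here is a minimal sketch, in Python, of a controller learned from such a training set. The feature vectors, action names, and the choice of a nearest-neighbor rule are invented for illustration; any method that fits a mapping from inputs to actions could be used instead.

# A minimal sketch of supervised learning of a control policy.
# The feature vectors and action labels below are invented for illustration.

def nearest_neighbor_policy(training_set):
    """Return a controller that outputs the action paired with the
    closest training input (squared Euclidean distance)."""
    def controller(x):
        def dist(pair):
            xi, _ = pair
            return sum((a - b) ** 2 for a, b in zip(xi, x))
        _, action = min(training_set, key=dist)
        return action
    return controller

# Training set: (input vector, appropriate action) pairs gathered beforehand.
training_set = [
    ((0.9, 0.1), "turn_left"),
    ((0.1, 0.9), "turn_right"),
    ((0.5, 0.5), "move_forward"),
]

controller = nearest_neighbor_policy(training_set)
print(controller((0.8, 0.2)))   # closest training input is (0.9, 0.1) -> "turn_left"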

The other type, reinforcement learning, involves giving occasional positive or negative “rewards” to the agent while it is actually performing a task. The learning process attempts to modify the control system in such a way that long-term rewards are maximized (without necessarily knowing for any input what is the guaranteed best action).
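The following sketch illustrates the idea with tabular Q-learning on an invented toy problem; the states, actions, reward values, and parameters are placeholders and not drawn from any particular robot task.

import random

# Tabular Q-learning on an invented toy problem: states 0, 1, 2, where the
# agent receives a reward only on reaching the goal state 2.
states, actions, goal = [0, 1, 2], ["left", "right"], 2
Q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate

def step(s, a):
    """Simulated environment: occasional reward, +1 only at the goal."""
    s_next = min(s + 1, goal) if a == "right" else max(s - 1, 0)
    return s_next, (1.0 if s_next == goal else 0.0)

for episode in range(500):
    s = 0
    while s != goal:
        if random.random() < epsilon:
            a = random.choice(actions)                      # explore
        else:
            a = max(actions, key=lambda act: Q[(s, act)])   # exploit
        s_next, r = step(s, a)
        # Move Q(s, a) toward the reward plus the discounted value of the
        # best action available in the next state.
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print(max(actions, key=lambda act: Q[(0, act)]))   # learned choice in state 0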

In one kind of supervised learning, the controller attempts to mimic the input/output behavior of a “teacher” who is skilled in the performance of the task being learned. This type is sometimes called behavioral cloning (Michie et al., 1990; Sammut et al., 1992; Urbancic & Bratko, 1994). A familiar example is the automobile-steering system called ALVINN (Pomerleau, 1993). There, a neural network connected to a television camera is trained to mimic the behavior of a human steering an automobile along various kinds of roads.

Perhaps the most compelling examples of reinforcement learning are the various versions of TD-Gammon, a program that learns to play backgammon (Tesauro, 1995). After playing several hundred thousand backgammon games, receiving rewards tied to whether each game is won or lost, TD-Gammon learns to play at or near world-championship level. Another example of reinforcement learning applied to a practical problem is a program for channel allocation in cellular telephone systems (Singh & Bertsekas, 1997).

II. The Programming, Teaching, Learning (PTL) Model

Although machine learning methods are important for adapting robot control programs to their environments, they by themselves are probably not sufficient for synthesis of effective programs from a blank slate. I believe that efforts by human programmers at various stages of the process will continue to be important---initially to produce a preliminary program and later to improve or correct programs already modified by some amount of learning. (Some of my ideas along these lines have been stimulated by discussions with Sebastian Thrun.)

The programming part of what I call the PTL model involves a human programmer attempting to program the robot to perform its suite of tasks. The teaching part involves another human, a teacher, who shows the robot what is required (perhaps by “driving” it through various tasks). This showing produces a training set, which can then be used by supervised learning methods to clone the behavior of the teacher. The learning part shapes behavior during on-the-job reinforcement learning, guided by rewards given by a human user, a human teacher, and/or by the environment itself. Although not dealt with in this article, a complete system will also need, at the strategic level, a planning part to create mid-level programs for achieving user-specified goals. [A system called TRAIL was able to learn the preconditions and effects of low-level robot actions. It then used these learned descriptions in a STRIPS-like automatic planning system to create mid-level robot control programs (Benson, 1996).]

I envision that the four methods (programming, teaching, learning, and planning) might be interspersed in arbitrary orders. It will therefore be important that the language(s) in which programs are constructed and modified be easy for humans to write and understand, and compatible with machine learning and planning methods. I believe these requirements rule out, for example, C code and neural networks, however useful they might be in other applications.

III. Perceptual Imperfections

Robot learning must cope with various perceptual imperfections. Before moving on to discuss learning methods themselves, I first describe some perceptual difficulties. Effective robot control at the teleo-reactive level requires perceptual processing of sensor data in order to determine the state of the environment. Suppose that, insofar as a given set of specific robot tasks is concerned, the robot’s world can be in any one of a set of states {Si}. Suppose the robot’s perceptual apparatus transforms a world state, S, through a mapping, P, to an input vector, x. That is, so far as the robot is concerned, its knowledge of its world is given entirely by a vector of features, x = (x1, x2, ..., xn). (I sometimes abbreviate and call x the agent input even though the actual input is first processed by P.)
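In code, such a perceptual mapping might be sketched as follows; the world-state fields, the feature computation, and the door location are all hypothetical.

from dataclasses import dataclass

# A hypothetical world state and perceptual mapping P.  The fields and
# features are invented purely for illustration.
@dataclass
class WorldState:
    robot_x: float
    robot_y: float
    door_open: bool

def perceive(s: WorldState) -> tuple:
    """P: transform a world state S into the agent's input vector x."""
    # Assume, for illustration, a door located at (5.0, 0.0).
    distance_to_door = ((s.robot_x - 5.0) ** 2 + (s.robot_y - 0.0) ** 2) ** 0.5
    return (round(distance_to_door, 1), 1.0 if s.door_open else 0.0)

x = perceive(WorldState(robot_x=1.0, robot_y=2.0, door_open=True))
print(x)   # the agent sees only this feature vector, never S itself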

Two kinds of imperfections in the perceptual mapping, P, concern us. Because of random noise, P might be a one-to-many mapping, in which case a given world state might at different times be transformed into different input vectors. Or, because of inadequate sensory apparatus, P might be a many-to-one mapping, in which case several different world states might be transformed into the same input vector. This latter imperfection is called perceptual aliasing.

(One way to mitigate perceptual aliasing is to keep a record in memory of a string of preceding input vectors; often, different world states are entered via different state sequences, and these different sequences may give rise to different perceptual histories.)
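The sketch below illustrates both the problem and this mitigation: a deliberately coarse perception function conflates two hypothetical world states, but keying behavior on a short history of recent input vectors distinguishes them. The state names and the window length are arbitrary choices made for illustration.

from collections import deque

# Perceptual aliasing: a coarse perception function that maps distinct world
# states to the same input.  The state names are invented.
def perceive(state):
    # Both "at_door_inside" and "at_door_outside" look identical to the agent.
    return "near_door" if state in ("at_door_inside", "at_door_outside") else state

# Mitigation: remember a short string of preceding inputs rather than
# relying on the current input alone.  Window length 3 is arbitrary.
history = deque(maxlen=3)

def observe(state):
    history.append(perceive(state))
    return tuple(history)   # the disambiguating "perceptual history"

# Two trajectories ending in aliased states produce different histories.
for s in ("in_hallway", "at_door_outside"):
    print(observe(s))
history.clear()
for s in ("in_room", "at_door_inside"):
    print(observe(s))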

We can distinguish six interesting cases in which noise and perceptual aliasing influence the relationship between the action desired in a given world state and the actual action taken by an agent in the state it perceives. I describe these cases with the help of some diagrams.

Case 1 (no noise; no perceptual aliasing):

Here, each world state is faithfully represented by a distinct input vector so that the actual actions to be associated with inputs can match the desired actions. This is the ideal case. Note that different world states can have the same desired actions. (Taken in different world states, the same action may achieve different effects.)

Case 2 (minor noise; no perceptual aliasing):

Here, each state is nominally perceived as a distinct input (represented by the dark arrows in the diagram), but noise sometimes causes the state to be perceived as an input only slightly different from the nominal one. We assume in this case that the noise is not so great as to cause the agent to mistake one world state for another. For such minor noise, the actual agent action can be the same as the desired action.

Cases 3 and 4 (perceptual aliasing; no noise):

In this example, perceptual aliasing conflates three different world states to produce the same agent input. In case 3, S1 and S2 have different desired actions, but since the agent cannot make this distinction it will sometimes execute an inappropriate action. In case 4, although S1 and S3 are conflated, the same action is called for, which is the action the agent correctly executes.

Cases 5 and 6 (major noise occasionally simulates perceptual aliasing):

Here, although each state is nominally differentiated by the agent’s perceptual system (the dark arrows), major noise sometimes causes one world state to be mis-recognized as another. Just as in the case of perceptual aliasing, there are two different outcomes: in one (case 5), mis-recognition of S1 as S2 evokes an inappropriate action, and in the other (case 6), mis-recognition of S1 as S3 leads to the correct action. Unlike case 3, however, if mis-recognition is infrequent, case 5 will occur only occasionally, which might be tolerable.

In a dynamic world in which the agent takes a sequence of sensor readings, several adjacent ones can be averaged to reduce the effects of noise. Some of the case 5 mis-recognitions might then be eliminated but at the expense of reduced perceptual acuity. We will see examples of the difficulties these various imperfections cause in some learning experiments to be described later.
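One simple realization of such averaging is a sliding-window filter over the most recent readings, as in the sketch below; the window size and the sample readings are arbitrary.

from collections import deque

# A sliding-window average over recent sensor readings.  Larger windows
# suppress more noise but also blur rapid changes (reduced acuity).
class MovingAverage:
    def __init__(self, window=5):
        self.readings = deque(maxlen=window)

    def update(self, reading):
        self.readings.append(reading)
        return sum(self.readings) / len(self.readings)

filt = MovingAverage(window=5)
for raw in [1.0, 1.2, 0.9, 5.0, 1.1]:   # the 5.0 stands in for a noisy spike
    print(round(filt.update(raw), 2))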

IV. Teleo-Reactive (T-R) Programs

A. The T-R Formalism

A teleo-reactive (T-R) program is an agent control program that robustly directs the agent toward a goal in a manner that continuously takes into account changing perceptions of the environment. T-R programs were introduced in two papers by Nilsson (Nilsson, 1992; Nilsson, 1994). In its simplest form, a T-R program consists of an ordered list of production rules:

K1 → a1

. . .

Ki → ai

. . .

Km → am

The Ki are conditions on perceptual inputs (and possibly also on a stored model of the world), and the ai are actions on the world (or that change the model). In typical usage, the condition K1 is a goal condition, which is what the program is designed to achieve, and the action a1 is the null action.

A T-R program is interpreted in a manner roughly similar to the way in which ordered production systems are interpreted: the list of rules is scanned from the top for the first rule whose condition part is satisfied, and the corresponding action is then executed. A T-R program is usually designed so that for each rule Ki → ai, Ki is the regression, through action ai, of some particular condition higher in the list. That is, Ki is the weakest condition such that the execution of action ai under ordinary circumstances will achieve some particular condition, say Kj, higher in the list (that is, with j < i). T-R programs designed in this way are said to have the regression property.
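For concreteness, here is a small hypothetical T-R program, written as an ordered list of (condition, action) pairs, for a robot whose goal is to be at a goal location while holding an object. The predicates and action names are invented; the point is that, reading upward, each action is intended to achieve a condition higher in the list, giving the program the regression property.

# A hypothetical T-R program as an ordered list of (condition, action) pairs.
# Conditions are predicates over the agent's input vector x (here a dict);
# actions are named by strings.  All predicates and actions are invented.
tr_program = [
    (lambda x: x["at_goal"] and x["holding"], "nil"),           # K1: the goal condition
    (lambda x: x["holding"],                  "go_to_goal"),    # achieves K1
    (lambda x: x["near_object"],              "grasp"),         # achieves K2
    (lambda x: True,                          "go_to_object"),  # achieves K3
]

Scanning this list from the top and executing the action of the first satisfied condition yields the goal-directed behavior described above.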

We assume that the set of conditions Ki covers most of the situations that might arise in the course of achieving the goal K1. (Note that we do not require that the program be a universal plan, i.e. one covering all possible situations.) If an action fails, due to an execution error, noise, or the interference of some outside agent, the program will nevertheless typically continue working toward the goal in an efficient way. This robustness of execution is one of the advantages of T-R programs.

T-R programs differ substantively from conventional production systems, however, in that actions in T-R programs can be durative rather than discrete. A durative action is one that can continue indefinitely. For example, a mobile robot might be capable of executing the durative action move, which propels the robot ahead (say at constant speed). Such an action contrasts with a discrete one, such as move forward one meter. In a T-R program, a durative action continues only so long as its corresponding condition remains the highest true condition in the list. When the highest true condition changes, the currently executing action immediately changes correspondingly. Thus, unlike ordinary production systems, the conditions must be continuously evaluated; the action associated with the currently highest true condition is always the one being executed. An action terminates when its associated condition ceases to be the highest true condition.
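A T-R interpreter can be sketched as a loop that re-evaluates the conditions on every cycle and keeps the action of the highest true condition running, switching actions only when that rule changes. This is a minimal sketch assuming the list-of-pairs representation above and hypothetical start_action and stop_action primitives for starting and stopping durative actions.

def tr_interpreter(tr_program, get_input, start_action, stop_action):
    """Continuously execute the action of the highest true condition.
    A durative action keeps running across cycles for as long as its rule
    remains the first one whose condition is satisfied."""
    current = None
    while True:
        x = get_input()                       # fresh perception on every cycle
        for condition, action in tr_program:  # scan the rules from the top
            if condition(x):
                if action != current:         # the highest true condition changed
                    if current is not None:
                        stop_action(current)
                    if action == "nil":       # goal condition reached: stop
                        return
                    start_action(action)
                    current = action
                break

Here get_input plays the role of the perceptual mapping P, while start_action and stop_action stand in for whatever servo-level interface starts and stops durative actions.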