Journal of Vision http://journalofvision.org/

Control of attention and gaze in complex environments

Jelena Jovancevic / Center for Visual Science, University of Rochester, Rochester, NY, USA
Brian Sullivan / Center for Visual Science, University of Rochester, Rochester, NY, USA
Mary Hayhoe / Center for Visual Science, University of Rochester, Rochester, NY, USA

In natural behavior, fixation patterns are tightly linked to the ongoing task. However, it is not known how task control interacts with image properties. How do task-driven systems deal with unexpected salient stimuli? We studied the effect of potential collisions with pedestrians on the distribution of gaze of subjects walking in a virtual environment. Subjects fixated pedestrians relatively frequently, revealing them as a focus of attention. However, the increase in the probability of fixating pedestrians on a collision course, relative to all pedestrians, was modest, suggesting that potential collisions do not automatically attract fixation. Collision detection performance was modulated by prior fixations on the pedestrians, and detection of potential collisions led to a change in the strategy of looking at pedestrians. This suggests that subjects rely on active monitoring, based on expectations about the environment, to acquire information about such stimuli.

Keywords: Attention, Saccades, Active monitoring, Task demand


Introduction

One of the crucial problems in research on vision is to understand the principles that guide the selection of information from visual scenes. Vision is an active process that gives us access to relevant information when it is needed. It is clear that the visual system cannot process all the information in the visual array (Ullman, 1984), so there must be mechanisms that guide this selection process (Findlay and Gilchrist, 2003). Deployment of gaze is an overt manifestation of the allocation of attention (Henderson, 2003), and it can thus help us understand visual and cognitive processes in scene perception. Where do people look in a scene? Early studies demonstrated that uninformative parts of a scene are rarely fixated (Buswell, 1935). How are ‘interesting and informative’ parts of the image determined or chosen by the visual system? One potential answer is that stimulus-based information generated from the image attracts attention, and thus gaze, as part of a bottom-up system. The other possible answer is that gaze reflects task-directed acquisition of information as part of a top-down system.

Investigation of the problem of deployment of gaze in visual scenes has taken several directions. In one approach, involving ‘scene statistics’, investigators have found that high spatial frequency content and edge density play a role in attracting fixations (Mannan et al, 1997). Furthermore, local contrast was found to be higher, and two-point correlation lower, for fixated parts of the scene (Krieger et al, 2000; Parkhurst and Niebur, 2003; Reinagel et al, 1999). In another approach, visual saliency of the image is computed using the known properties of primary visual cortex (Itti and Koch, 2000, 2001; Koch and Ullman, 1985; Torralba, 2003). This ‘saliency-map’ approach uses image dimensions such as color, intensity, contrast, and edge orientation to generate a saliency map for each dimension; these maps are then combined into a single saliency map that indicates regions of interest in an image. Saliency then serves as a predictor of the distribution of gaze in a scene and can be correlated with human data (Oliva et al, 2003; Parkhurst et al, 2002). However, both of these approaches are essentially correlational techniques in that they do not establish a causal link between fixation locations and properties of the image. Saliency models are also inflexible and typically have no way of dealing with multiple salient objects in a scene. Perhaps most importantly, however, they account for only a portion of the variance in the deployment of gaze. For example, Parkhurst et al (2002) found the correlation between the location of highest salience and the observed fixation locations to be on average 0.45 for images of complex natural scenes, and 0.55 for computer-generated fractals.
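
To make the structure of this approach concrete, the sketch below gives a deliberately minimal, single-scale caricature of a saliency map in Python (NumPy/SciPy). It is not the Itti and Koch (2000) implementation, which uses multi-scale center-surround differences and iterative normalization; the particular feature maps and smoothing parameter here are illustrative assumptions only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def saliency_map(image):
    """Toy single-scale saliency: combine normalized feature maps into one map.

    image: H x W x 3 float array (RGB) with values in [0, 1].
    Returns an H x W array in which larger values mark more 'salient' regions.
    """
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    intensity = (r + g + b) / 3.0

    # One crude feature map per dimension: local intensity contrast,
    # color opponency, and edge energy.
    contrast = np.abs(intensity - gaussian_filter(intensity, sigma=8))
    color = np.abs(r - g) + np.abs((r + g) / 2.0 - b)
    edges = np.hypot(sobel(intensity, axis=0), sobel(intensity, axis=1))

    def norm(m):
        # Rescale each map to [0, 1] so no single dimension dominates.
        rng = m.max() - m.min()
        return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

    # Sum the normalized feature maps into a single conspicuity map.
    return (norm(contrast) + norm(color) + norm(edges)) / 3.0
```

The structural point this sketch preserves is the one described above: a separate map is computed for each feature dimension, the maps are normalized, and they are summed into a single map whose peaks serve as candidate fixation targets.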

Other research has concentrated on task-related knowledge in seeking to explain the control of gaze in scene perception. For example, Yarbus’ classic experiments (Yarbus, 1967) revealed the importance of instructions on the location of fixations, suggesting that cognitive factors, in addition to stimulus factors, are important in defining the locus of gaze. However, studies such as these, involving viewing of pictures, are not easily controlled, as the experimenter often has no access to what the observer is doing from moment to moment during the viewing period. In addition, viewing a picture of a scene is very different from acting within that scene, because different information is needed to guide behavior. Wallis and Bülthoff (2000), for example, showed that drivers and passengers in a virtual environment have different sensitivity to changes in the scene.

Several advances in technology, such as mobile eye trackers that can be used in natural environments, together with progress in computer graphics, including immersive virtual environments, now allow investigation of active gaze control in natural tasks under controlled conditions (Triesch et al, 2003; Droll et al, 2005; Shinoda et al, 2001; Turano et al, 2001). Recent eye movement research has focused on extended visuo-motor tasks such as driving, walking, sports, and making tea or sandwiches (Land and Lee, 1994; Land and Furneaux, 1997; Shinoda et al, 2001; Hayhoe et al, 2003; Turano et al, 2003; Land et al, 1999; Land, 1998). These studies found that the eyes are positioned at a point that is not the most salient, but is relevant to the immediate task demands. Fixations are tightly linked in time to the evolution of the task, and very few fixations are made to regions of low interest regardless of their saliency (Hayhoe et al, 2003; Land et al, 1999; Sprague and Ballard, 2003; Sprague and Ballard, 2005). These studies also revealed that fixations appear to serve the purpose of obtaining quite specific information. For example, cricket players fixate the bounce point of the ball just ahead of its impact, because this provides critical information for estimating the desired contact point with the bat (Land and McLeod, 2000). These task-specific computations have been referred to as ‘visual routines’ (Ullman, 1985; Hayhoe, 2000). Visual routines can make use of higher-level information to limit the amount of information that needs to be analyzed to that relevant to the current task, thus reducing the computational load.

For example, in a block-copying experiment by Ballard et al (1995), observers had to copy simple colored block patterns from a model area to a work area on a computer screen, using a mouse to pick up and move blocks from a resource area. Regular, stereotyped fixation patterns were observed. In copying one block, the sequence frequently involved four fixations: the first to the block in the model, followed by a fixation to the corresponding block in the resource area. The block in the model was then re-fixated, with the fourth fixation falling on the intended location of the block in the work area. Given the task, these fixations can be reliably associated with particular aspects of the task: the first fixation is presumably used to identify the color of the block to be copied, the second to guide the movement to pick up the block, the third to acquire information about the location of the block in the model, and the final one to guide the hand with the block to the work area. In other words, the two fixations falling on the same object appear to serve the purpose of getting two different pieces of information: the first fixation of the block in the model is used to acquire information about its color, and the second about its location. Ballard et al referred to this as a ‘just in time’ strategy, because the observer gets only the specific information needed for the particular part of the task just in time for its execution. The fact that a visual routine is specific to a particular behavioral strategy led Ballard et al (1997) to suggest that visual routines promote computational efficiency.

However, if eye movements are controlled by the task, one problem is how perceptual information that is not on the current agenda is accessed. In normal ongoing behavior it is not always possible to anticipate what information will be required. How does the visual system divide attention between the current task goals and unexpected stimuli that may be important and may change the task demands? This issue has been referred to as the ‘scheduling’ problem (Hayhoe, 2000; Shinoda et al, 2001). As previously discussed, there are two possible answers to this problem. One is that attention is attracted exogenously, by the stimulus. The other possibility is that attention is attracted endogenously, according to the observer’s internal agenda. Traditionally, basic visual responses are thought to be driven ‘bottom-up’, by the properties of the stimulus. For example, temporal transients are usually thought to attract attention (Yantis, 1998; Theeuwes, 2001). However, sudden onsets of stimuli that could attract attention are relatively rare in normal situations; indeed, visual transients across the entire image are generated more or less continually by the observer’s own movements. Given the complexity of the visual scene in more natural contexts, and in particular in dynamic environments where many aspects of the visual input may be unpredictable, it is natural to assume that the distribution of attention is not static but is constantly evolving, and probably depends on both the current behavioral context and stimulus properties. However, the relative roles of top-down control and bottom-up salience in natural vision are not yet clear.

Although top-down systems have an advantage over bottom-up systems when it comes to computational load, because they deal with only limited amounts of information, the question remains: how do they deal with unexpected events? To what extent do bottom-up systems influence the allocation of attention? This problem was addressed in previous experiments with virtual driving (Shinoda et al, 2000). In these experiments, subjects’ ability to detect Stop signs that were visible for restricted periods of time was examined. Performance in detecting Stop signs was found to be heavily modulated both by the instructions and by the local visual context (whether the Stop sign was located at an intersection or midblock). The authors concluded that fixations on Stop signs were primarily controlled on the basis of active search, and thus the problem of ‘scheduling’ appeared to be solved by top-down scheduling of learned strategies in this task. It is unclear how broadly these results hold, however. The Stop sign in the Shinoda experiment was relatively small and stationary with respect to the scene. Also, its behavioral significance is mostly symbolic, in the sense that ignoring it has no direct consequences in the absence of traffic.

The goal of the present investigation was to probe the question of what controls the distribution of attention in complex scenes, using a behaviorally more salient stimulus. We devised a virtual environment in which observers walked along a virtual footpath populated with virtual pedestrians. The logic was to examine subjects’ sensitivity to an unexpected event that was chosen on the basis of behavioral relevance and saliency. The unexpected event was that some of the virtual pedestrians occasionally changed their trajectory to a collision path with the observer for a limited time period, and then returned to their original path. When an object is on a collision course, it has no lateral movement with respect to the body; when gaze is fixed, a colliding object therefore creates a stationary looming stimulus on the retina. Previous research has shown that the rate of expansion of the looming stimulus is used to compute time to collision (Regan and Gray, 2000; Tresilian et al, 1999). Further, neurons in motion-sensitive areas of visual cortex (MST) appear to be sensitive to the radial expansions generated by looming stimuli (Duffy and Wurtz, 1995). A pedestrian on a collision course is also a more salient stimulus than the Stop signs used by Shinoda et al (2000): its visual angle is larger and its retinal eccentricity smaller, thus providing a stronger test of the hypothesis. Detection of the stimulus (i.e. the colliding pedestrian) should be revealed by a fixation, by avoidance, or by both. Detection might result from continuous monitoring of the pedestrians in peripheral vision, or the new configuration of the flow field might itself attract attention bottom-up. Frequent failure to detect the stimulus, on the other hand, would favor top-down monitoring. Further, since walking is a relatively easy task, in one of the conditions subjects were asked to follow a leader pedestrian, which constrained both their path and their gaze. If attentional resources are required to detect the potential collision, the concurrent perceptual task of following a leader should reduce the probability of detection. If, on the other hand, the change in the flow field has the power to attract attention bottom-up, a colliding pedestrian should be detected almost invariably. In addition, we manipulated the saliency of the colliding pedestrians by increasing their speed during the ‘collision period’ in a separate set of trials. If observers rely on bottom-up scene analysis to initiate a particular visual computation, they should be sensitive to these manipulations. However, if subjects’ behavior is controlled by top-down factors, these manipulations should have little effect.
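
The time-to-collision computation mentioned above (the rate of expansion of a looming stimulus specifies the time remaining before contact) can be illustrated with a short numerical sketch in Python/NumPy. The pedestrian width, approach speed, and sampling values below are hypothetical choices for illustration only, not parameters of the present experiment.

```python
import numpy as np

def time_to_collision(theta, t):
    """Estimate time to collision (tau) at each sample from the angular size
    of an approaching object: tau ~= theta / (d theta / dt)."""
    return theta / np.gradient(theta, t)   # rate of expansion in the denominator

# Illustrative (hypothetical) numbers: a 0.5 m wide pedestrian, initially 6 m
# away, walking straight toward the observer at 1.5 m/s.
t = np.linspace(0.0, 2.0, 21)
distance = 6.0 - 1.5 * t
theta = 2 * np.arctan(0.25 / distance)    # angular size subtended on the retina

tau = time_to_collision(theta, t)
print(tau[10])   # ~3.0 s of travel remaining at t = 1.0 s (4.5 m away at 1.5 m/s)
```

Under the small-angle approximation θ ≈ w/d, the ratio θ/(dθ/dt) reduces to d/v, the time remaining before contact, which is why the rate of expansion alone is informative without knowledge of the object’s physical size or distance.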