
Online Supplementary Materials

re: Shih – Using the attention cascade model to probe cognitive aging

This document consists of four sections:

Section 1: Specifications of the attention cascade model. It provides full descriptions and justifications for the mathematical implementations of the model.

Section 2: Modeling procedures. It describes the procedures of finding the optimum estimates and parameter distributions, and the procedure of producing model predictions.

Section 3: A table of shared variance between parameter estimates and R²

Section 4: References

Section 1: Specifications of the Attention Cascade Model

This section describes and justifies the functional forms, equations, and parameters of the attention cascade model in the context of attentional blink (AB) experiments. All specifications follow those proposed in Shih (2008) and Shih and Sperling (2002). Wherever the time dimension is concerned, one unit of time represents 100 ms. This implementation brings the values of parameters and variables into a reasonable range; for example, 25 and 140 ms were implemented as 0.25 and 1.4 time units, respectively. For convenience and readability, however, descriptions concerning time are given in milliseconds whenever appropriate. Although the dimension of most parameters and variables is time, the data involve accuracy, not response times. Model development so far has focused on the components of the attention control mechanism and working memory.

Sensory Processor and Long-Term Memory (LTM)

Sensory mechanisms in vision are relatively well understood and are depicted in a variety of computational models, one of which can be implemented if required. Alternatively, using appropriate stimulus parameters may minimize the need to elaborate the sensory processor. The latter applies to AB research, which typically presents a single RSVP stream of supra-threshold stimuli at fixation.

Two routes. The sensory processor has two production routes: one mandatory and the other contingent on the bottom-up salience of a stimulus. The former activates LTM traces of the stimulus, whereas the latter initiates the attention window. The activated traces are preliminary representations of the stimuli. The same degree of activation may be assumed for well-learned stimuli such as digits and letters in RSVP. Using well-learned stimuli and an AB task (involving mainly working memory, or WM) minimizes the need to model the LTM components.

Notably, as in most accounts of the AB, the attention cascade model assumes that LTM is accessed before WM. This order is consistent with the results of event-related potentials measured in AB experiments, suggesting that LTM traces are automatically activated irrespective of the AB, and that the failed report is due to unsuccessful consolidation in WM (Rolke, Heil, Streb, & Henninghausen, 2001; Sergent, Baillet, & Dehaene, 2005).

A stimulus’s preliminary representation is described by a rectangular function of time whose width equals the SOA. The function takes the value 1 when the stimulus is perceptually available (i.e., including physical stimulation and visual persistence) and 0 otherwise. This simplification seems plausible for stimuli that are equally and highly familiar and are displayed briefly at fixation at high contrast.

However, initial explorations consistently over-estimated P(T2|T1) in the SN condition (i.e., the condition with a salient T1 and a non-salient T2), and the over-estimation diminished as the TOA increased. In fact, predictions for the SN condition were similar to those for the NN condition (i.e., the condition with a non-salient T1 and a non-salient T2). This is because the original model was applied to studies in which stimulus contrast/luminance remained constant for all stimuli (even when stimulus salience was manipulated) and there was no T1 Salience x T2 Salience interaction (see Figure 4). It appeared that in the present study the non-salient T2 suffered a masking effect from the salient T1 and that the effect weakened as the TOA increased. To account for these few data points, the height of the rectangular function for the non-salient T2 in the SN condition is weighted by a scalar ζ (0 < ζ < 1). After further explorations, ζ is defined by Equation 1.

ζ = m + (1 − m)(1 − e^(−mt))    (1)

where m (in [0, 1]) is the initial masking factor that describes perceptual interference from a salient item to a non-salient item when they appear simultaneously (i.e., TOA = 0 ms), and t is the TOA. Thus, m indicates both the masking impact (smaller m, greater impact) and the rate of recovery from masking (smaller m, slower recovery). Equation 1 says that ζ increases exponentially with time and asymptotes at 1 when masking is no longer effective. To reiterate, Equation 1 applies to the non-salient T2 in the SN condition only; it does not apply to T1 (salient or not), a salient T2, or the non-salient T2 in the NN condition.
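A minimal sketch of Equation 1 in Python (the function name is mine; t is the TOA in model time units, where 1 unit = 100 ms):

```python
import math

def masking_scalar(m: float, t: float) -> float:
    """Equation 1: masking scalar for the non-salient T2 in the SN condition.

    m -- initial masking factor in [0, 1]
    t -- TOA in model time units (1 unit = 100 ms)
    """
    return m + (1.0 - m) * (1.0 - math.exp(-m * t))
```

At t = 0 the scalar equals m (maximal masking); it rises toward the asymptote 1 as the TOA grows, and more quickly for larger m.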

Attention Control Mechanism

The attention control mechanism is conceptually equivalent to, for example, the central production system in ACT-R (Anderson & Lebiere, 1998) or the cognitive processor in EPIC models (e.g., Meyer & Kieras, 1997). According to task demands, it configures the target templates, the attention window, and when to output from the response buffer.

Target templates may consist of, for example, perceptual and conceptual features that define the targets. They may be used to pre-sensitize the LTM traces of potential targets, leading to a priming effect (an aspect not modeled here). The templates are used to compute the top-down salience of items to guide selection and resource allocation. If the top-down salience exceeds a pre-set criterion, then the attention window is signaled to open to transfer the preliminary representations to WM. Processing resources in WM will then be distributed in proportion to the top-down saliences of the stimuli. For simplicity, the top-down saliences for target digits and distractor letters are set at 1 and 0, respectively, in the present application.

The onset and width of the attention window determine which, and how much, information from the preliminary representations is transferred to the WM buffer. The transfer process is usually referred to as attention gating. The window is described by a rectangular function of time, which takes the value 1 over the window width (i.e., from onset to offset) and 0 otherwise. The width w determines the amount of information transferred. A narrow window may not transfer all available target information, whereas a wide window may admit distracting information, taking up resources or increasing internal noise. Thus, the width may reflect the ability to select relevant information as well as the ability to filter out irrelevant information. The window width depends on the presentation rate, task demands, and inherent limitations. For example, it may be narrower for fast than for slow presentation; it is narrower for a typical AB task than for a task requiring a report of four consecutive items from RSVP (e.g., Nieuwenstein & Potter, 2006; Weichselgartner & Sperling, 1987); and an extremely narrow window (e.g., < 10 ms) may not be possible given the finite resolution achievable by neural activities. Although the width may vary from trial to trial, all models (including the attention cascade model) assume that the width remains constant given an invariable presentation rate and task demand in a block of trials, to make calculations tractable.

The onset of the attention window is a random variable whose distribution is determined by the route through which the window is triggered. The mandatory (or controlled) route includes four processing stages (i.e., sensory processor, LTM, target templates, and attention window), whereas the “bottom-up salient” (or automatic) route includes two (i.e., sensory processor and attention window). The processing times of the two or four stages are assumed to be independently and identically distributed as an exponential probability density function (e^(−t/β))/β (i.e., a one-stage RC circuit, or first-order gamma function). The time constant β reflects the processing rate: the larger the value, the slower the rate. Although β may vary from stage to stage, it is not possible to estimate it independently for each stage given the present experimental design and data. However, assuming the same β for the four stages gave good quantitative fits in Shih (2008) and makes the model parsimonious. Thus, the same β is assumed for these four stages. Consequently, relative to the onset of the preliminary representation for Item i, the window onset (or triggering time) in the automatic and controlled modes is distributed as a second- and fourth-order gamma density function, respectively:

f(t) = t^(α−1) e^(−t/β) / (β^α (α − 1)!),  α = 2 or 4    (2)

By definition, the mean onsets (and variances) for the automatic and controlled modes are 2β (2β²) and 4β (4β²), respectively. That is, although the onset varies from trial to trial, it is generally faster and less variable in the automatic mode than in the controlled mode.
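Given the stage-wise exponential processing times described above, the window onset can be sampled as the sum of 2 or 4 exponential deviates (equivalently, a gamma deviate of order α); a sketch, with a function name of my choosing:

```python
import random

def window_onset(beta: float, automatic: bool, rng: random.Random) -> float:
    """Sample the attention-window onset (Equation 2) as the sum of alpha
    i.i.d. exponential stage times with time constant beta; alpha = 2 for
    the automatic route and 4 for the controlled route."""
    alpha = 2 if automatic else 4
    # expovariate takes the rate 1/beta, so each stage has mean beta
    return sum(rng.expovariate(1.0 / beta) for _ in range(alpha))
```

Across many samples the mean converges to 2β (automatic) or 4β (controlled), matching the gamma-distribution moments stated above.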

Working Memory (WM)

WM buffer. The area of the preliminary representation for Item i over the interval of the attention window defines the initial strength value s_i of Item i arriving at the WM buffer. Because the window onset is a random variable, s_i is also a random variable. If s_i is greater than the response threshold θ, the item is passed to the decision stage; otherwise it enters or awaits the consolidation processor in WM. θ plays a role in accounting for conditions with multiple consecutive targets (e.g., Nieuwenstein & Potter, 2006; Reeves & Sperling, 1986; Sperling & Weichselgartner, 1995) or a +1 blank (i.e., replacing the distractor trailing a target with a blank interval; e.g., Raymond et al., 1992). Because neither condition is present in the current application, θ is not estimated. While awaiting the processor, s_i is reduced. Consistent with theories involving WM (e.g., ACT* by Anderson, 1983; see Rubin & Wenzel, 1996, for a review), the attention cascade model assumes that the reduction follows an exponential decay function defined by Equation 3.

q_i = s_i e^(−d/s_i)    (3)

where q_i is the remaining strength of Item i at the end of queuing for a duration d (d ≥ 0). Equation 3 reveals two unique features of the attention cascade model: (a) the decay rate is defined by s_i (instead of by a new parameter), and (b) the greater s_i is, the slower the decay.
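Reading Equation 3 as strength-dependent exponential decay with rate 1/s_i (one reconstruction consistent with features (a) and (b) above), a sketch:

```python
import math

def queued_strength(s: float, d: float) -> float:
    """Equation 3 (as reconstructed): strength remaining after queuing for
    duration d. The decay rate 1/s is set by the initial strength s itself,
    so a greater s decays more slowly."""
    return s * math.exp(-d / s)
```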

Consolidation processor. Once the processor becomes available, it admits all representations in the WM buffer concurrently. Item strengths are then changed via three successive operations. First, strengths are weighted by the items’ top-down saliences. Second, if the sum of the weighted strengths exceeds the capacity C, all items are weakened just enough that C is not surpassed (Equation 4).

r_i = s′_i · min(1, C / Σ_j s′_j),  where s′_i is the salience-weighted strength of Item i    (4)

C is equivalent to the number of items per SOA unit that the available resources can process. Third, the weighted item strength r_i grows as a cumulative distribution function of an exponential distribution over the consolidation duration π, according to Equation 5.

v_i = 1 − e^(−r_i π)    (5)

where v_i is the resulting strength of Item i at the end of consolidation. Equation 5 reveals two further unique features of the attention cascade model: (a) the growth rate is defined by r_i (instead of by a new parameter), and (b) the greater r_i is, the faster the growth. Although π may vary from trial to trial, it is assumed constant in the present application because its variation can be absorbed into the random variable of internal noise (see below).
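The three consolidation operations can be sketched as follows; the exact forms used for Equations 4 and 5 are reconstructions from the descriptions above (proportional rescaling to the capacity C, then growth as an exponential CDF with rate r_i), and the function name is mine:

```python
import math

def consolidate(strengths, saliences, C, pi):
    """Three successive operations of the consolidation processor.

    1. Weight each strength by its item's top-down salience.
    2. If the weighted strengths sum to more than the capacity C,
       rescale them proportionally so the sum equals C (Equation 4).
    3. Grow each weighted strength r as the CDF of an exponential
       distribution, 1 - exp(-r * pi), over duration pi (Equation 5).
    """
    weighted = [s * w for s, w in zip(strengths, saliences)]
    total = sum(weighted)
    if total > C:
        weighted = [r * C / total for r in weighted]
    return [1.0 - math.exp(-r * pi) for r in weighted]
```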

The model stipulates that, at the end of consolidation, the output will be the identity of Item i if v_i is greater than the internal noise at that moment; otherwise a guess will be made. A Gaussian distribution with mean μ_n and standard deviation σ_n is assumed for the internal noise during decision. Via the noise distribution, item strengths are mapped to probabilities of report.

In sum, the current application involves seven model parameters: m is the initial masking factor; 1/β is the processing rate of the pre-WM stages; w is the width of the attention window; C is the processing capacity of WM; π is the consolidation duration; and μ_n and σ_n are, respectively, the mean and standard deviation of the Gaussian distribution of internal noise. Except for m (applicable to the SN condition only), they are assumed invariant between conditions.

Section 2: Modeling Procedures

Five phases of modeling are described in this section. Phase 1 obtained initial estimates from best-fitting (least-squares) analytic solutions for P(T1) and P(T2). These estimates were then tuned in Phase 2 using Monte Carlo simulations to best fit P(T1), P(T2), and P(T2|T1). Best fits (least squares) in both phases were obtained with the optimization program PRAXIS (Brent, 1973). Phase 3 further tuned the estimates and established their stability via an extensive grid search. Phase 4 estimated a bootstrap-based 95% confidence interval (Efron & Tibshirani, 1986) for each parameter. Phase 5 produced model predictions.

Phase 1: Quick estimation using approximate analytical solutions

When considering data aggregated over a session (i.e., not on the basis of individual trials), it is feasible to derive analytical solutions that generate predictions for P(T1) and P(T2). This is a quick way to obtain initial estimates to feed into the time-consuming approximation procedure using Monte Carlo trials. Thus, for each group, I first estimated the parameters that gave the best fit to P(T1) and P(T2). For each set of parameter values, predictions are produced in the following steps.

  1. Given the time constant β regarding the initiation of the attention window, derive the automatic and controlled triggering time functions according to Equation 2.
  2. Convolve each triggering time function with the attention window, producing the automatic or controlled attention gating function (AGF). Given the assumed rectangular function (with a width equal to the SOA) for preliminary representations, the area under the AGF over the interval during which a stimulus is perceptually available defines the average strength of the stimulus.
  3. If T2 falls under two AGFs (one triggered by T1 and another by T2), strength is computed under each AGF and the greater one is taken as the item strength.
  4. For the condition displaying salient T1 and non-salient T2, average T2 strength is weighted by the masking scalar as defined in Equation 1.
  5. T1 is always advanced to the consolidation processor without delay. T2 is advanced to the processor without delay if the TOA is 100 ms (i.e., T2 trails T1) or if the TOA is greater than the processing duration π of the processor. Any delay (π - TOA) reduces the average T2 strength according to Equation 3.
  6. Average stimulus strength is weighted by 1 if it is a target and 0 if a distractor.
  7. If T1 and T2 are advanced to the consolidation processor in one batch, strength of each is scaled according to Equation 4.
  8. The resulting strength from Step 7 grows over the duration of consolidation processing according to Equation 5.
  9. The final strength derived from Step 8 is used to compute the recall probability, given the Gaussian noise distribution with mean μ_n and standard deviation σ_n.
  10. Incorporate the guessing rate g (i.e., 1/4) into the recall probability p from Step 9. That is, the predicted observations are derived from p + (1 − p)g.
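Step 10’s guessing correction can be written as a one-liner (the function name is mine):

```python
def apply_guessing(p: float, g: float = 0.25) -> float:
    """Step 10: a report is correct if the item is recalled (probability p)
    or, failing that, guessed correctly (probability g = 1/4)."""
    return p + (1.0 - p) * g
```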

Phase 2: Improve estimates using Monte Carlo simulations

Estimates obtained in Phase 1 were tuned to predict P(T1), P(T2), and P(T2|T1) using Monte Carlo simulations. For each set of parameter values, 1000 trials were run for each condition. On each trial, whether a target is recalled is computed as follows.

  1. Decide the onset time t_a of the attention window by randomly sampling a value from the automatic or controlled triggering-time distribution (Equation 2), depending on the bottom-up salience of the target.
  2. Stimulus strength s is the area of the attention window that overlaps with the perceptual persistence of the stimulus, i.e., Maximum(0, SOA − t_a). When the perceptual persistence of a stimulus traverses two windows (e.g., T2 falls under the windows triggered by T1 and by T2, respectively), its strength is the maximum of the two areas.
  3. Multiply s by the masking scalar (Equation 1) if it corresponds to a non-salient T2 presented after a salient T1.
  4. Multiply s by 1 if it corresponds to a target and by 0 if a distractor.
  5. T1 and T2 are processed together in the consolidation processor if TOA <= 120 ms and separately otherwise. When they are processed together, strength of each is scaled according to Equation 4. When they are processed separately, T2 strength decays according to Equation 3 if queuing is required.
  6. The resulting strength from Step 5 grows over the duration of consolidation processing according to Equation 5.
  7. When reporting, the resulting strength v from Step 6 is compared against a noise value η randomly sampled, for each stimulus, from a Gaussian distribution with mean μ_n and standard deviation σ_n. If v > η, the target is identified. Otherwise, a guess is produced, which has a 1/4 probability of being correct.

Let n(X) be the number of trials on which X is reported correctly. The program keeps separate counts for n(T1), n(T2), and n(T1 and T2). Thus, P(T1) = n(T1)/1000, P(T2) = n(T2)/1000, and P(T2|T1) = n(T1 and T2)/n(T1).
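This bookkeeping amounts to the following sketch, in which each simulated trial is represented as a (T1 correct, T2 correct) pair of booleans:

```python
def summarize(trials):
    """Compute P(T1), P(T2), and P(T2|T1) from per-trial outcomes,
    where each trial is a (t1_correct, t2_correct) pair."""
    n = len(trials)
    n_t1 = sum(t1 for t1, _ in trials)
    n_t2 = sum(t2 for _, t2 in trials)
    n_both = sum(t1 and t2 for t1, t2 in trials)
    return n_t1 / n, n_t2 / n, (n_both / n_t1 if n_t1 else 0.0)
```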

Phase 3: Obtain optimum estimates via grid search

An extensive grid search was carried out to tune the estimates obtained in Phase 2 and to ascertain their stability. For each parameter, several values were chosen around its estimate. For example, given an estimate of 11 ms for the time constant β, the parameter value was varied over 9 levels, between 7 and 15 ms in steps of 1 ms. Given 7 free parameters and 9 levels per parameter, there are 9^7 combinations of parameter values. For each combination, 12 runs of 200 Monte Carlo trials per condition were performed. The statistic R² (Judd & McClelland, 1989) was computed at the end of each run to indicate the amount of variance in the data accounted for by the model, corrected for the number of free parameters; the value of R² lies between 0 and 1, with 1 denoting a perfect fit. An average R² of the 12 runs was computed for each set of parameter values; the set and its average R² were saved if the average R² was greater than a pre-set criterion (e.g., 0.79). In one grid search, for example, 22,631 sets (out of 9^7) met the criterion. The range of parameter values was then systematically narrowed down by a series of frequency analyses. For example, β = 8 occurred in only three sets (out of 22,631); these three sets were eliminated from further analysis.
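The grid search can be sketched as an exhaustive enumeration; here the score callable stands in for the 12-run Monte Carlo average R², which is not reproduced:

```python
import itertools

def grid_search(levels, score, criterion=0.79):
    """Keep every combination of parameter levels whose average fit exceeds
    the criterion (9^7 combinations for 7 parameters with 9 levels each).

    levels -- dict mapping parameter name to a list of candidate values
    score  -- callable returning the average R-squared for one combination
    """
    names = list(levels)
    kept = []
    for combo in itertools.product(*(levels[n] for n in names)):
        params = dict(zip(names, combo))
        r2 = score(params)
        if r2 > criterion:
            kept.append((params, r2))
    return kept
```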

Phase 4: Estimate parameter distribution using a bootstrap method

A bootstrap method (Efron & Tibshirani, 1986) was used to approximate the distribution of each parameter estimate. For each group, the computer generated 10,000 bootstrap data sets by sampling with replacement from the original data set; each bootstrap data set had the same number of "participants" as the original data set. For each bootstrap data set, the model parameters were estimated. Thus, for each parameter and group, there is a bootstrap sampling distribution from which the 95% confidence interval is determined using the percentile method (i.e., between 2.5 and 97.5 percentiles).
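The percentile-method interval can be sketched as follows; the estimate callable stands in for the full model-fitting procedure, which is not reproduced here:

```python
import random

def bootstrap_ci(data, estimate, n_boot=10_000, seed=0):
    """Percentile-method 95% CI (Efron & Tibshirani, 1986): resample
    'participants' with replacement, re-estimate the parameter on each
    bootstrap sample, and take the 2.5th and 97.5th percentiles of the
    resulting bootstrap distribution."""
    rng = random.Random(seed)
    stats = sorted(
        estimate([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]
```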

Phase 5: Model predictions

Using the optimum estimates, 100 rounds of Monte Carlo simulations were conducted for each group. Each round contained 1000 trials per condition and produced a set of P(T1), P(T2), and P(T2|T1), as well as an R², which indicates the amount of variance in the data accounted for by the model, corrected for the number of free parameters.