SUPPLEMENTARY MATERIAL

for

How do targets, nontargets, and context influence real-world object detection?

Harish Katti1, Marius V. Peelen2 & S. P. Arun1

1Centre for Neuroscience, Indian Institute of Science, Bangalore, India, 560012

2Center for Mind/Brain Sciences, University of Trento, 38068 Rovereto, Italy.

CONTENTS

SECTION S1: COMPUTATIONAL MODELS

SECTION S2: ANALYSIS OF LOW-LEVEL FACTORS

SECTION S3: PARTICIPANT FEEDBACK

SECTION S4: ANALYSIS OF DEEP NEURAL NETWORKS

SECTION S5: SUPPLEMENTARY REFERENCES

SECTION S1: COMPUTATIONAL MODELS

Target features: We used a histogram of oriented gradients (HOG) to learn a bag of six templates consisting of two views each of three poses of people, and six unique templates for cars (Felzenszwalb, Girshick, McAllester, & Ramanan, 2010). These templates are essentially filters; convolving them with intensity information at different locations and at multiple scales of a scene indicates whether some regions bear a strong or weak resemblance to cars or people. In this implementation, a detector score of zero indicates an exact match between a region and a template, and more negative values indicate weaker matches. We thresholded the degree of match between the person template and a scene region at two levels: a tight threshold of -0.7 that yields very few false alarms across the entire data set, and a weaker threshold of -1.2 that allows both correct detections and false alarms. A total of 31 attributes were then defined over the person detections in an image:

  1. The number of high-confidence detections (an estimate of hits).
  2. The difference between the number of high- and low-confidence detections (an estimate of false alarms).
  3. The average area of detected person boxes, weighted by the detector score of each detection box. We weighted detections by the detector score based on feedback from subjects, who indicated greater ease of target detection for larger targets and when the target appearance is more conspicuous.
  4. The average deformation of each unique part in the detected boxes. This is calculated by first normalizing each detection to a unit square and then finding the displacement of each detected part from the mean location of that part over the entire set of 1,300 scenes in the car task or 1,300 scenes in the person task, as applicable.
  5. A five-bin histogram of the eccentricity of person detections with respect to fixation.
  6. A six-bin histogram of the six person model types detected in the scene.

A similar set of 31 attributes was defined for car detections; in this manner, we represented each scene by a 62-dimensional vector capturing various attributes of partial matches to targets in the scene. The coarse and part structure information captured by the car and person HOG templates is visualized in Fig. S1. Representative examples of hits and false alarms from partial matches to the car and person HOG templates are shown in Fig. S2.
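As a rough illustration of how the detector output might be summarized into this per-scene attribute vector, the Python sketch below computes the attributes itemized above from a list of detection boxes. The field names, eccentricity normalization, and score-based weighting are illustrative assumptions rather than our exact implementation (the full model used 31 attributes per target category).

```python
import numpy as np

# Thresholds on the HOG detector score described above:
# scores near 0 are near-exact template matches, more negative = weaker.
TIGHT, WEAK = -0.7, -1.2

def summarize_detections(dets, n_models=6, n_parts=8, n_ecc_bins=5):
    """Summarize raw detections for one target category in one scene.

    `dets` is a list of dicts with hypothetical fields:
      'score' : detector score (0 = exact match)
      'box'   : (x, y, w, h) in normalized image coordinates
      'model' : index of the template/view that fired (0..n_models-1)
      'parts' : (n_parts, 2) displacement of each part from its mean location
      'ecc'   : eccentricity of the box center relative to fixation, in [0, 1]
    """
    strong = [d for d in dets if d['score'] >= TIGHT]   # high-confidence matches
    weak = [d for d in dets if d['score'] >= WEAK]      # includes weaker matches

    n_hits = len(strong)                 # estimate of hits (attribute 1)
    n_fa = len(weak) - len(strong)       # estimate of false alarms (attribute 2)

    # Attribute 3: average box area weighted by detector score.
    if weak:
        areas = np.array([d['box'][2] * d['box'][3] for d in weak])
        weights = np.array([d['score'] - WEAK for d in weak]) + 1e-6
        wt_area = float(np.average(areas, weights=weights))
    else:
        wt_area = 0.0

    # Attribute 4: average deformation of each part (x and y displacement).
    if weak:
        deform = np.mean([d['parts'] for d in weak], axis=0).ravel()
    else:
        deform = np.zeros(2 * n_parts)

    # Attributes 5 and 6: eccentricity histogram and template-type histogram.
    ecc_hist, _ = np.histogram([d['ecc'] for d in weak],
                               bins=n_ecc_bins, range=(0, 1))
    model_hist = np.bincount(np.array([d['model'] for d in weak], dtype=int),
                             minlength=n_models)

    return np.concatenate([[n_hits, n_fa, wt_area], deform, ecc_hist, model_hist])

# Concatenating the person and car summaries gives the per-scene target vector.
```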

To establish that this manner of summarizing HOG template matches is indeed useful, we also evaluated target model performance using only the average HOG histograms (Dalal & Triggs, 2005) over all partial target match boxes arising from the same source detector (Felzenszwalb et al., 2010). Model correlations for this baseline detector (r = 0.39 ± 0.026 for person detection and r = 0.37 ± 0.027 for car detection) are lower than the correlations for our HOG summary models (r = 0.45 ± 0.01 for person detection and r = 0.57 ± 0.01 for car detection).

We also evaluated whether unique aspects of response time variation can be explained by specific target attributes such as configural changes in the parts detected within a HOG template. For this purpose, we retained only the 16-dimensional part deformation information and trained models for target detection. We recomputed person model correlations after regressing out the remaining person target attributes, responses on the same scenes in car detection, distance of the nearest person to the scene center, largest person size, number of people, and predictions of a SIFT-based clutter model. Likewise, for models trained to predict detection responses in the car task, we recomputed correlations after regressing out the remaining car target attributes, responses on the same scenes in person detection, distance of the nearest car to the scene center, largest car size, number of cars, and predictions of a SIFT-based clutter model. We observed that models trained with part deformation information alone predict target detection response times (r = 0.16 ± 0.015 for person detection and r = 0.1 ± 0.036 for car detection).
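The "regressing out" step above can be implemented as a simple residualization: fit a linear model of the RTs on the nuisance predictors and correlate the model predictions with the residual RTs. A minimal sketch under that assumption is below; the variable names and simulated data are illustrative and not taken from our analysis code.

```python
import numpy as np

def regress_out(y, covariates):
    """Return the residual of y after removing a linear fit on the covariates."""
    X = np.column_stack([np.ones(len(y)), covariates])   # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Hypothetical inputs: observed person-detection RTs, predictions of the
# part-deformation-only model, and the nuisance variables listed above
# (remaining target attributes, car-task RTs, eccentricity, size, counts,
# SIFT clutter), stacked as columns of `nuisance`.
rng = np.random.default_rng(0)
rt = rng.normal(size=200)
deform_pred = 0.4 * rt + rng.normal(size=200)
nuisance = rng.normal(size=(200, 6))

residual_rt = regress_out(rt, nuisance)
r = np.corrcoef(deform_pred, residual_rt)[0, 1]
print(f"correlation after regressing out nuisance variables: {r:.2f}")
```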

Nontarget features: All 1,300 scenes used in the car detection task and 1,300 scenes used in the person detection task were annotated for the presence of objects. We avoided extracting features from each nontarget object because isolating each object is extremely cumbersome, and because nontarget objects may potentially share visual features with the target. Instead, we annotated each scene with binary labels corresponding to the presence of each particular nontarget object. Objects were included in the annotation only if they occurred close to the typical scale of objects in the data set; global scene attributes and visual concepts such as 'sky' and 'water' were not annotated. Annotations were standardized to a 67-word dictionary, and the final list of unique object labels along with their frequency of occurrence in the 1,300 scenes used in the car detection task is: text (277), sign (460), stripe (243), pole (504), window (679), entrance (77), tree (687), lamppost (308), fence (271), bush (116), colour (133), roof (225), box (84), thing (90), glass (151), manhole-cover (19), door (279), hydrant (37), dustbin (80), bench (51), snow (4), stair (79), cable (148), traffic-light (78), parking-meter (16), lamp (38), cycle (61), boat (22), rock (47), flower-pot (46), statue (20), flower (33), flag (31), wheel (10), table (24), animal (14), cloud (27), cone (15), chair (34), shadow (6), umbrella (15), bag (18), hat (2), lights (1), cannon (1), grating (1), bird (7), bright (2), cap (1), cart (1), lamp-post (2), spot (1), wall (0), light (0), branch (0), clock (0), shoe (0), vehicle (0), spectacles (0), shelter (0), gun (0), drum (0), sword (0), pumpkin (0), bottle (0), pipe (1), leaf (1).

The frequency of occurrence of each of these 67 labels in the 1,300 scenes used in the person detection task is: text (294), sign (412), stripe (134), pole (446), window (563), entrance (81), tree (598), lamppost (280), fence (227), bush (50), colour (136), roof (186), box (90), thing (118), glass (170), manhole-cover (10), door (242), hydrant (22), dustbin (87), bench (78), snow (4), stair (63), cable (87), traffic-light (71), parking-meter (10), lamp (47), cycle (105), boat (23), rock (55), flower-pot (66), statue (33), flower (35), flag (33), wheel (31), table (37), animal (21), cloud (7), cone (11), chair (59), shadow (8), umbrella (44), bag (116), hat (6), lights (1), cannon (1), grating (1), bird (8), bright (2), cap (11), cart (1), lamp-post (4), spot (1), wall (1), light (1), branch (1), clock (1), shoe (2), vehicle (1), spectacles (1), shelter (1), gun (1), drum (1), sword (1), pumpkin (1), bottle (1), pipe (1), leaf (1).

We excluded visual concepts that could be global (snow) or that are more like visual features (bright, color, shadow, lights/reflections, and bright spot).

We also excluded labels with very rare (fewer than 10) occurrences in the target-present or target-absent data sets; this was done to ensure stable regression and model fits. In this manner, we limited the models to a maximum of 36 unique nontarget labels. We also verified that there is no qualitative change in the reported results upon including these rare nontarget objects.
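As an illustration, the nontarget channel can be encoded as a binary scene-by-label matrix, with rare labels dropped before model fitting. The sketch below assumes the annotations are available as per-scene lists of label strings; the data layout is a hypothetical stand-in for our annotation files.

```python
import numpy as np

def build_nontarget_matrix(scene_labels, min_count=10):
    """Encode per-scene object annotations as a binary presence matrix.

    `scene_labels` is a list (one entry per scene) of sets/lists of label
    strings drawn from the 67-word dictionary above. Labels occurring fewer
    than `min_count` times across scenes are dropped, as described above.
    """
    counts = {}
    for labels in scene_labels:
        for lab in set(labels):
            counts[lab] = counts.get(lab, 0) + 1
    kept = sorted(lab for lab, c in counts.items() if c >= min_count)

    X = np.zeros((len(scene_labels), len(kept)), dtype=int)
    for i, labels in enumerate(scene_labels):
        for j, lab in enumerate(kept):
            X[i, j] = int(lab in labels)
    return X, kept

# Toy usage with made-up annotations:
scenes = [{'tree', 'pole', 'sign'}, {'tree', 'door'}, {'window', 'pole'}] * 10
X, kept_labels = build_nontarget_matrix(scenes, min_count=10)
print(X.shape, kept_labels)
```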

Regression models were trained to predict target rejection and detection RTs for cars and people separately. These regression models yield informative weights, shown in Fig. S3, that indicate which nontargets typically speed up or slow down target rejection or detection. Examples are 'cycle', 'dustbin', and 'hydrant', which speed up person detection and slow down person rejection (Fig. S3a, b). Similarly, nontargets such as 'cone' (traffic-cone) and 'entrance' speed up car detection and slow down car rejection.
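A minimal sketch of this kind of regression analysis is shown below, assuming scikit-learn and a binary nontarget matrix like the one built in the previous sketch. The regularization, cross-validation scheme, and simulated data here are illustrative; the exact regression setup used in our analysis may differ.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Hypothetical data: X is a binary nontarget matrix (scenes x labels),
# rt is the mean RT per scene for, e.g., person rejection.
rng = np.random.default_rng(1)
n_scenes, n_labels = 600, 36
X = rng.integers(0, 2, size=(n_scenes, n_labels)).astype(float)
rt = X @ rng.normal(scale=0.05, size=n_labels) + 0.6 + rng.normal(scale=0.1, size=n_scenes)

X_tr, X_te, rt_tr, rt_te = train_test_split(X, rt, test_size=0.5, random_state=0)

model = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, rt_tr)

# Generalization to held-out scenes (cf. the cross-validated correlations reported here).
r = np.corrcoef(model.predict(X_te), rt_te)[0, 1]
print(f"held-out correlation: {r:.2f}")

# Positive weights indicate labels whose presence slows responses;
# negative weights indicate labels that speed them up (cf. Fig. S3).
order = np.argsort(model.coef_)
print("labels that most speed up RTs (indices):", order[:3])
print("labels that most slow down RTs (indices):", order[-3:])
```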

Coarse scene structure

These features are derived from the energy at representative orientations and multiple scales in a scene, and were first proposed by Oliva and Torralba (2001). This coarse scene envelope was extracted over blurred versions of the input scene to avoid object-scale statistics contaminating the coarse scene description. This process yields a 512-dimensional feature vector for each scene. In a post hoc analysis, we verified that blurred scenes do not give rise to target matches (Felzenszwalb et al., 2010). We found that this method of modeling coarse scene envelopes outperforms other approaches such as extracting activations from a scene classification CNN with blurred scenes as input. One important reason for the better performance of the GIST operator on our set is that it captures variations arising from changes in field of view and scene depth more reliably in the first few principal components than the CNN-based descriptor.
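The general recipe for such a coarse-scene (GIST-like) descriptor is sketched below: blur the scene, filter it with Gabor filters at several scales and orientations, and pool the filter energy over a coarse spatial grid. The filter bank, blur level, and grid here are illustrative assumptions; our analysis used the standard GIST implementation of Oliva and Torralba (2001), which differs in filter design and normalization but likewise yields a 512-dimensional vector.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.filters import gabor

def gistlike_descriptor(gray_img, n_scales=4, n_orients=8, grid=4, blur_sigma=4):
    """Pool Gabor energy of a blurred grayscale image over a coarse grid.

    With 4 scales x 8 orientations x a 4x4 grid this yields 512 numbers,
    matching the dimensionality quoted above.
    """
    img = gaussian_filter(gray_img.astype(float), sigma=blur_sigma)  # blur first
    h, w = img.shape
    feats = []
    for s in range(n_scales):
        freq = 0.25 / (2 ** s)                    # coarser spatial frequency per scale
        for o in range(n_orients):
            theta = np.pi * o / n_orients
            real, imag = gabor(img, frequency=freq, theta=theta)
            energy = real ** 2 + imag ** 2
            # Average the energy within each cell of a grid x grid partition.
            for gy in range(grid):
                for gx in range(grid):
                    cell = energy[gy * h // grid:(gy + 1) * h // grid,
                                  gx * w // grid:(gx + 1) * w // grid]
                    feats.append(cell.mean())
    return np.array(feats)   # length n_scales * n_orients * grid * grid
```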

Other alternative features considered

We considered using several alternative feature representations, but their performance was generally inferior to our final feature representation for each channel. The details of these models and their performance relative to the best models are listed in Tables S1-S3.

Table S1 Description of baseline models.

Information type / Description / Notes
Target features / Whole-target HOG description trained iteratively using support vector machines and hard negative examples. / Fewer hits and less meaningful false alarms when compared to the deformable sum-of-parts HOG model (Felzenszwalb et al., 2010).
Target features / Average HOG histograms from partial matches to target appearance. / Standard HOG histograms were extracted (Dalal & Triggs, 2005) from the locations of partial template matches found using the method of Felzenszwalb et al. (2010). Models informed with these features explained less than 30% of the variance explained by models informed with the detection summaries described above.
Nontarget features / Softmax confidence scores on object categories from a deep convolutional network (object CNN) trained for 1,000-way object classification (Zhou, Khosla, Lapedriza, Oliva, & Torralba, 2014). / The network is biased and consistently gives more false alarms for some objects than for others. False alarms for some important nontarget categories indicate that the CNN has learned context more than object appearance for those categories. Regression models on softmax confidence scores predict RTs poorly compared to features from the penultimate layers of the CNN.
Coarse scene structure features / Deep convolutional features over blurred versions of input scenes. / Each scene was represented by a 4,096-dimensional real-valued vector obtained by presenting the scene as input to a pretrained deep convolutional network (CNN) that had been trained for 205-way scene classification (Zhou et al., 2014).
Coarse scene structure features / Combination of GIST and activations of a deep convolutional network to blurred scenes. / Combining GIST features with scene-classification CNN activations did not improve performance either.


Table S2 Baseline model generalization to new scenes in person and car detection

Feature type / Model name / Person detection / Model name / Car detection
rc / Best model of same type / rc / Best model of same type
Noise ceil / 0.45 ± 0.02 / Noise ceil / 0.45 ± 0.02
Avg HOG / T / 0.24 ± 0.02 / 0.41 ± 0.01 / T / 0.31 ± 0.02 / 0.50 ± 0.01
Nontarget Softmax / N / 0.12 ± 0.01 / 0.14 ± 0.02 / N / 0.04 ± 0.02 / 0.17 ± 0.02
Blur scene CNN / C / 0.25 ± 0.01 / 0.30 ± 0.01 / C / 0.28 ± 0.01 / 0.41 ± 0.01
Blur scene CNN + GIST / C / 0.23 ± 0.01 / 0.30 ± 0.01 / C / 0.35 ± 0.01 / 0.41 ± 0.01

Note. Performance of baseline models for each information channel is shown alongside the performance of the model trained with the most informative target/nontarget/coarse scene features. Conventions are as in Table 1 in the main manuscript.

Table S3 Baseline model generalization to new scenes in person and car rejection

Feature type / Model name / Person rejection / Model name / Car rejection
rc / Best model of same type / rc / Best model of same type
Noise ceil / 0.45 ± 0.02 / Noise ceil / 0.45 ± 0.02
Avg HOG / T / 0.01 ± 0.02 / 0.25 ± 0.01 / T / 0.04 ± 0.02 / 0.06 ± 0.02
Nontarget Softmax / N / 0.14 ± 0.02 / 0.20 ± 0.02 / N / 0.09 ± 0.02 / 0.34 ± 0.01
Blur scene CNN / C / 0.06 ± 0.02 / 0.14 ± 0.02 / C / 0.13 ± 0.02 / 0.15 ± 0.02
Blur scene CNN + GIST / C / 0.06 ± 0.02 / 0.14 ± 0.02 / C / 0.09 ± 0.02 / 0.15 ± 0.02

Note. Performance of baseline models for each information channel is shown alongside the performance of the model trained with the most informative target/nontarget/coarse scene features. Conventions are as in Table 1 in the main manuscript.

Fig. S1: Visualization of target templates learned for cars and people.

(a) Histogram of oriented gradient structure learned for three of the six canonical views of isolated cars. Three more views are generated by flipping the shown templates about the vertical axis.

(b) Histogram of oriented gradient structure learned for eight parts within each view of a car.

(c) Deformation penalties imposed on the eight parts defined within each template for a car; part detections in whiter regions incur a greater penalty.

(d) Histogram of oriented gradient structure learned for three of the six canonical views of isolated people. Three more views are generated by flipping the shown templates about the vertical axis.

(e) Histogram of oriented gradient structure learned for eight parts within each view of a person.

(f) Deformation penalties imposed on the eight parts defined within each template for a person; part detections in whiter regions incur a greater penalty.

Fig. S2: Representative examples of true and false detections from the HOG detectors trained for cars and people.

Illustrative examples of correct and incorrect matches to person (a-c) and car (d-f) HOG templates. The HOG representation of each scene is visualized along with the correct (hits) and incorrect (false alarms) matches.

(a) Person-like structure embedded in the intensity gradient information on doors, giving rise to person false alarms.

(b) Car false alarm due to the box-like shape and internal structure of a parking entrance.

Fig. S3: Feature weights estimated for nontarget labels using regression analysis.

Regression weights estimated for nontarget labels over car- and person-absent scenes, for the best models that either predict person rejection RTs (x axis; best model contains target + nontarget features) or car rejection RTs (y axis; best model contains nontarget and coarse scene features). A positive feature weight for labels such as 'animal', 'flowerpot', and 'door' indicates that the presence of those features slows down person rejection in target-absent scenes. Nontarget labels such as 'traffic-cone', box-like structures, and 'fence' provide greater evidence for cars than for people and hence slow down car rejection. Nontargets such as 'tree' and 'roof' either have evidence for both cars and people, or contribute to general clutter in the scene, and hence slow down both car and person rejection. Error bars indicate the standard deviation of the regression weight across 20 cross-validated regression model instances. These results remain qualitatively unchanged for choices of car rejection models containing either nontarget + coarse scene information or nontarget + target feature information.

SECTION S2: ANALYSIS OF LOW-LEVEL FACTORS

Targets that are large or that occur close to the center of the scene are detected faster by humans (Wolfe, Alvarez, Rosenholtz, Kuzmova, & Sherman, 2011). While it is nontrivial to determine target size or eccentricity without processing target features, we nonetheless investigated whether such "low-level" factors can predict the observed rapid detection data. We calculated a number of such factors, as detailed below.

1) To estimate the area of the target, we recorded the area of the largest high-confidence detection in the scene, as yielded by the HOG detector.

2) To estimate the number of targets in the scene, we manually counted the number of cars and people in target-present scenes and assigned one of the following levels: 0, 1, 2, 3, 4, 5, greater than 5 and less than 10, or greater than 10.

3) To estimate clutter in the scene, we evaluated a variety of computational metrics, including the number of corners in the scene (Harris & Stephens, 1988; Shi & Tomasi, 1994) and the scale-invariant feature transform (SIFT; Lowe, 2004), which discovers key points that afford some degree of invariance when the scene undergoes scaling, translation, and rotation transformations. SIFT key points have been used extensively to represent local image properties and for object representation and retrieval (Mikolajczyk & Schmid, 2005). We found that they also measure scene clutter well and used the number of SIFT key points as a rough estimate of the number of objects in a scene (see the sketch after this list).

4) To estimate target eccentricity, we measured the radial distance of the nearest high-confidence target detection from the center of the scene.
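The clutter and eccentricity measures in items 3 and 4 can be computed roughly as in the sketch below, using OpenCV's SIFT detector. The detection-box format, default SIFT parameters, and lack of any normalization are illustrative assumptions rather than the exact settings we used.

```python
import numpy as np
import cv2  # OpenCV with SIFT support (>= 4.4)

def clutter_and_eccentricity(gray_img, detection_boxes):
    """Estimate scene clutter (SIFT key point count) and the eccentricity of
    the nearest high-confidence target detection.

    `detection_boxes` is a list of (x, y, w, h) boxes for high-confidence
    detections (hypothetical format); an empty list returns eccentricity None.
    """
    # Clutter: number of SIFT interest points in the scene.
    sift = cv2.SIFT_create()
    keypoints = sift.detect(gray_img, None)
    clutter = len(keypoints)

    # Eccentricity: radial distance of the nearest detection center
    # from the scene center.
    h, w = gray_img.shape[:2]
    center = np.array([w / 2.0, h / 2.0])
    if detection_boxes:
        centers = np.array([[x + bw / 2.0, y + bh / 2.0]
                            for (x, y, bw, bh) in detection_boxes])
        eccentricity = float(np.min(np.linalg.norm(centers - center, axis=1)))
    else:
        eccentricity = None
    return clutter, eccentricity

# Example usage with a hypothetical image file and one detection box:
# img = cv2.imread('scene.jpg', cv2.IMREAD_GRAYSCALE)
# print(clutter_and_eccentricity(img, [(120, 80, 60, 140)]))
```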

An illustrative example scene containing both cars and people, overlaid with the information used to extract these task-independent factors, is shown in Fig. S4.

Fig. S4: Visualization of task-independent factors derived from a scene containing both cars and people. The locations of the most confidently detected person and car closest to the scene center are marked by red and green boxes, respectively. The radial distances are marked using dashed lines. SIFT (Lowe, 2004) interest points are shown using yellow circles; a randomly chosen 10% of detected points are shown here.

The correlation of each factor with the observed detection time is shown in Fig. S5. At the level of individual factors, we observe that larger target sizes and the presence of low-eccentricity targets reliably speed up detection RTs for both cars and people. To assess how the combination of all factors might explain detection performance, we trained a model including all factors. These models explained person detection times (r = 0.28 ± 0.02) and car detection times (r = 0.31 ± 0.04) to some degree, but their performance was still inferior to that of the best models based on target and coarse scene features.