
Poster

TITLE

Investigation of Discrepancy-Defined Self-Awareness in Multi-Source Feedback

ABSTRACT

This study was conducted to clarify the meaning of computational self-awareness (mathematical discrepancy between self- and other-ratings) in multi-source feedback. Through the application of Item Response Theory (IRT), the relationship between observed and latent performance domains on a 360˚ assessment was compared for high versus low computationally self-aware individuals.

PRESS PARAGRAPH

Differences in self- and other-ratings on 360 assessments have traditionally been considered important at the feedback level: a manager’s self-assessment may differ from the ratings of his/her constituent groups, and this difference can be one source of managerial conflict. Recently, the degree to which a difference exists between self- and other-ratings has become a topic of interest to organizational researchers. The thought here is that congruence of self- and other-ratings is indicative of rater “self-awareness”. This study documents the reliability of self-awareness and implements a statistical procedure to tease out possible sources of differentiation between high and low self-aware individuals.


Multi-rater feedback and appraisal systems (i.e., 360° assessments) may be implemented to evaluate manager competencies, to track employee performance levels, or to guide organizational development interventions (Lawler, 1967, 1987; Cleveland, Murphy, & Williams, 1989; Sharkey, 1999). The multi-rater process generally consists of a target person being evaluated along multiple performance dimensions by subordinates, peers, clients, and/or supervisors, as well as making his or her own self-evaluations (cf., Dunnette, 1993).

A consistent finding across multi-rater systems is a discrepancy between self- and other-ratings. This discrepancy consists of the target either over- or under-evaluating his/her performance relative to others’ aggregated ratings. The degree to which a discrepancy between self- and other-ratings exists has been conceptualized as the target rater’s self-awareness; smaller discrepancies indicate greater self-awareness, whereas larger discrepancies indicate less self-awareness. Recently, this self-awareness construct has become a stand-alone topic of investigation in industrial and organizational psychology (Atwater, Ostroff, Yammarino, & Fleenor, 1998; Church, 1997; Fletcher & Baldry, 2000). The current study explores the stability and content of this difference-score-dependent construct through the documentation of difference-score reliability as well as the application of differential item functioning methodology.

Self-Awareness

One of the initial investigations of this computational form of self-awareness in multi-source feedback situations was conducted by Atwater and Yammarino (1992). Grouping individuals into over-estimator, under-estimator, and in-agreement categories based on discrepancy scores, these researchers found that the correlations between leader behavior and naval officer performance were moderated by computational self-awareness. The researchers concluded that computational self-awareness can be conceptualized as an individual difference variable, and that this variable may serve as a moderator of predictor-criterion validities.

Fletcher and Baldry (2000) examined personality correlates of computational self-awareness, operationalizing self-awareness as the degree of congruence between self- and other-ratings along a 7-point ordinal scale. Computationally self-aware individuals were found to be more shrewd and experimenting, more likely to attempt to influence others, and also more likely to fall into the particular team role of Monitor-Evaluator than were computationally less self-aware individuals. Given the large variation in levels of computational self-awareness found in this study, the investigators concluded that computational self-awareness “might be considered as an individual difference variable in its own right” (Fletcher & Baldry, 2000, p. 314).

Different investigators have applied different mathematical algorithms in the operationalization of computational self-awareness. Atwater and Yammarino (1992) employed the aforementioned tripartite distinction. This categorization scheme was later extended to a four-group model distinguishing: a) over-estimators, b) under-estimators, c) in-agreement [good] estimators, and d) in-agreement [bad] estimators (Atwater & Yammarino, 1997). Additionally, Fleenor, McCauley, and Brutus (1996) have proposed a six-category model, slightly different from the Atwater and Yammarino (1997) categories.
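To make the computational operationalization concrete, the following minimal sketch assigns a target to one of four groups in the spirit of Atwater and Yammarino (1997); the half-point agreement band, the assumed 1-5 rating scale, and the function name are illustrative assumptions, not the cutoffs used by the cited authors.

```python
def categorize_agreement(self_rating, other_mean, band=0.5):
    """Illustrative four-group assignment in the spirit of
    Atwater and Yammarino (1997). `band` is a hypothetical
    agreement band in raw scale points; the original authors'
    cutoffs may differ."""
    diff = self_rating - other_mean
    if abs(diff) <= band:
        # In-agreement targets are split into 'good' vs. 'bad' by the
        # level of the agreed-upon rating; midpoint of a 1-5 scale assumed.
        return "in-agreement/good" if other_mean >= 3.0 else "in-agreement/bad"
    return "over-estimator" if diff > 0 else "under-estimator"


# Example: a manager who rates herself 4.5 while others average 3.2
print(categorize_agreement(4.5, 3.2))  # -> over-estimator
```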

As a concluding remark on self-awareness research in general, Atwater et al. (1998) identified two areas that have been neglected by researchers of computational self-awareness: 1) an identification of the meaning of self-other agreement, and 2) the fit of the categorization scheme to the purpose/meaning of computational self-awareness. The researchers specifically state that “researchers in the area of self-other agreement have not provided explicit definitions of the conceptual form of the relationship between self-ratings, other ratings, and outcomes, nor have they provided strong rationales for choosing one analytic strategy or measure of self-other agreement over another” (p. 580). The current investigation directly addresses the first area of empirical neglect (i.e., it investigates sources of differentiation between self-awareness categories).

Systematic sources of variability in computational self-awareness (operationalized as self-other discrepancy/congruence) from a 360° measurement perspective have been attributed to target and rater friendship (McEvoy, 1990; McEvoy & Buller, 1987), degree of direct contact between raters and the target (Pollack & Pollack, 1996), and chronic tendencies to be self-aware (Church, 1997; Fletcher & Baldry, 2000). These sources of computational self-awareness variability would traditionally be interpreted as types of systematic measurement error, contributing perhaps to attenuated intraclass (inter-rater) correlations, and likely accompanied by random sources of error that potentially contribute to measured self-other discrepancies.

Greguras and Robie (1998) have documented poor inter-rater reliabilities within rating source categories. Given this finding, one has to ask how reliable computational self-awareness estimates should be expected to be. That is, measured differences in computational self-awareness may simply be a measurement artifact resulting from measurement error. Correlational findings that are typically employed in computational self-awareness research (cf., Atwater & Yammarino, 1992; Fletcher & Baldry, 2000) may therefore not be meaningful in and of themselves. If true group differentiation does exist, it may be reflected in discrepant responses to items while holding ability level constant (i.e., bias). Analyses of differential item functioning (DIF) allow for this type of investigation.

Differential Item Functioning

The administration of a standard 360° rating form to different rater populations/groups does not necessarily elicit measurement of the same concept, as the administered materials may convey different meanings to members of the different groups (Straus, 1969). This question of relative meaningful interpretation of performance dimensions is one of measurement equivalence (Byrne et al., 1989; Drasgow & Kanfer, 1985). Essentially, measurement equivalence involves a determination of internal rating scale calibration across groups. In other words, it is concerned with how one group’s subjective interpretation of a concept is related to the objective or true state, and whether or not this relationship is similar across groups.

Differential item functioning (DIF) has been a useful analytical tool for the identification of item bias. The method is particularly useful in situations where it is unknown whether group differences reflect real, underlying ability differences between groups or item bias, in which groups interpret or respond to items/instruments in a different manner. DIF, therefore, has traditionally been applied to situations in which bias is a concern, identifying items that contribute to overall assessment bias and/or adverse impact. Subsequent to the identification of item bias, commonalities may be assessed across identified items, and the source of bias may be dealt with, either by deleting items or by further editing item content.
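As an illustration of the general logic (not the specific procedure used in the present study), the sketch below computes expected item scores under Samejima’s graded response model for two groups with hypothetical item parameters and summarizes DIF for that item as the unsigned area between the two curves; all parameter values and the area index are assumptions for demonstration only.

```python
import numpy as np


def grm_expected_score(theta, a, b_thresholds):
    """Expected item score under Samejima's graded response model,
    with discrimination `a` and ordered boundary locations
    `b_thresholds` (categories scored 0..K)."""
    theta = np.asarray(theta, dtype=float)
    p_star = [np.ones_like(theta)]                      # P*_0 = 1
    for b in b_thresholds:                              # boundary curves P*_1..P*_K
        p_star.append(1.0 / (1.0 + np.exp(-a * (theta - b))))
    p_star.append(np.zeros_like(theta))                 # P*_{K+1} = 0
    expected = np.zeros_like(theta)
    for k in range(len(b_thresholds) + 1):
        expected += k * (p_star[k] - p_star[k + 1])     # category prob = P*_k - P*_{k+1}
    return expected


# Hypothetical parameters for a single item in two self-awareness groups
theta = np.linspace(-4, 4, 401)
low_group = grm_expected_score(theta, a=1.2, b_thresholds=[-1.0, 0.2, 1.4])
high_group = grm_expected_score(theta, a=1.2, b_thresholds=[-0.6, 0.7, 1.9])

# Unsigned area between the two expected-score curves: a simple
# descriptive index of DIF for this item (larger area = more DIF)
dif_area = np.trapz(np.abs(high_group - low_group), theta)
print(round(dif_area, 3))
```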

High versus Low Computationally Self-aware Individuals

Measurement equivalence of multi-source feedback instruments has been thoroughly investigated across rating source (i.e., peer, self, superior, and subordinate groups; Goudy, 1998; Klimoski & London, 1978; Maurer et al., 1998; Prien & Liske, 1962; Thornton, 1980), but has not been investigated within rating source categories. The closest investigations identify traits/skills/qualities of individuals and then assess measurement equivalence across identified groups, such as Introverted and Extraverted groups (cf., Flanagan, 1998). It is important to assess whether or not measurement equivalence exists between high and low computationally self-aware groups if the recorded differences between these two groups are to be truly understood.

Research Questions

The present study is largely descriptive and exploratory in nature. The employed methodology uses subjective content analysis of item commonalities, and specific a priori hypotheses regarding the source or content of computational self-awareness differentiation were not made. Rather, a series of research questions were posed, with the primary questions concerning: 1) the reliability of the discrepancy-based self-awareness construct, 2) the ‘fit’ of the Samejima (1969) graded response (SGR) model to the current data set, 3) whether or not differential item functioning would occur across low and high computationally self-aware groups, and 4) if it did, the source of inter-group differentiation.

METHOD

Participants

Sixty thousand six hundred and five individuals provided 360˚ ratings. Ratings were made by self, boss, superior, peer, direct report, and other rater groups.

Materials

The 360˚ assessment used in the current study was the CCL BENCHMARKS® instrument. This instrument assesses skills and behaviors thought to be relevant for managerial effectiveness. Items are grouped into four sections; the current study focuses only on items in Section One of this survey. The 115 Section One items aggregate to 16 scales: 1) Resourcefulness, 2) Doing whatever it takes, 3) Being a quick study, 4) Decisiveness, 5) Leading employees, 6) Confronting problem employees, 7) Building and mending relationships, 8) Compassion and sensitivity, 9) Straightforwardness and composure, 10) Balance between personal life and work, 11) Self-Awareness, 12) Putting people at ease, 13) Differences matter, 14) Participative management, 15) Career management, and 16) Change management.

Procedure

The method to assess the measurement equivalence of this 360˚ data progressed through the following steps: 1) identification of the dimensionality of the 360˚ data set, 2) preparation of data for IRT analyses, 3) application of the SGR model to generate item and ability parameter estimates, 4) assessment of the adequacy of the graded response model for the current data set, 5) analysis of differential item functioning, and 6) interpretation of content commonalities across differentially functioning items.

RESULTS

Instrument Dimensionality

Based on the scree plot of eigenvalues (Figure 1), a two-factor solution (cumulative variance accounted for = 42.9%) was considered for factor retention. Oblique rotation (direct oblimin) of the two-factor solution was first implemented for preliminary scale identification. This solution resulted in no cross-loading items, but did result in the exclusion of thirteen items (item numbers 2, 16, 19, 26, 36, 56, 66, 68, 91, 97, 99, 108, and 109) from the two-factor solution using a .4 pattern-loading magnitude criterion. Examination of item content reflects the noted leadership dimensions of Consideration/Relationship-Oriented Behavior (Factor 1) and Initiating Structure/Task-Oriented Behavior (Factor 2; cf., Yukl, 1994). All subsequent analyses were conducted separately for each of these two unidimensional scales.
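A minimal sketch of this extraction and item-retention step follows, assuming the third-party Python factor_analyzer package and a DataFrame of the 115 Section One item responses; variable names and factor labels are illustrative, since in practice labels are assigned only after inspecting item content.

```python
import pandas as pd
from factor_analyzer import FactorAnalyzer  # assumed third-party package


def two_factor_oblimin(items: pd.DataFrame, cutoff: float = 0.4):
    """Extract two obliquely rotated factors and apply the |.4|
    pattern-loading criterion. `items` is assumed to hold one column
    per survey item (rows = rated targets)."""
    fa = FactorAnalyzer(n_factors=2, rotation="oblimin")
    fa.fit(items.values)

    eigenvalues, _ = fa.get_eigenvalues()  # basis for the scree plot
    loadings = pd.DataFrame(
        fa.loadings_,
        index=items.columns,
        columns=["Factor1_Consideration", "Factor2_InitiatingStructure"],
    )

    abs_load = loadings.abs()
    excluded = loadings[(abs_load < cutoff).all(axis=1)].index.tolist()  # no salient loading
    cross = loadings[(abs_load >= cutoff).all(axis=1)].index.tolist()    # loads on both factors
    return eigenvalues, loadings, excluded, cross
```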

Computational Self-Awareness Identification

In order to identify high and low computationally self-aware individuals, an absolute difference score was computed for self- and aggregated other-ratings. The frequency distribution of this difference score (retaining only those individuals with valid responses to each scale item) was next split into tertiles. This method resulted in absolute differences ranging from 0 to .19 (lower tertile) and .42 to 2.45 (upper tertile) difference scale points for the Initiating Structure scale (Mdiff = .35, s = .27), and 0 to .18 (lower tertile) and .40 to 2.41 (upper tertile) for the Consideration scale (Mdiff = .34, s = .27). Both distributions exhibit moderate positive skew, with most individuals rating themselves similarly to others’ aggregated ratings. Intraclass correlations suggested that this category-inclusive level of self-other contrast was appropriate.
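A minimal sketch of this scoring step follows, assuming scale-level self- and aggregated other-ratings already aligned by rated target; the input structure and function name are illustrative.

```python
import numpy as np
import pandas as pd


def self_awareness_tertiles(self_scale: pd.Series, other_scale: pd.Series):
    """Absolute self-other discrepancy at the scale level, split into
    tertiles; smaller discrepancies reflect higher computational
    self-awareness."""
    diff = (self_scale - other_scale).abs()
    lower_cut, upper_cut = diff.quantile([1 / 3, 2 / 3])
    group = pd.cut(
        diff,
        bins=[-np.inf, lower_cut, upper_cut, np.inf],
        labels=["high self-aware", "middle", "low self-aware"],
    )
    return diff, group
```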

Computational Self-Awareness Reliability Estimation

The reliability of the computational self-awareness construct was estimated via the formula presented in Johns (1981):
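In its standard form, the reliability of the difference score d = x − y (self-rating component x, aggregated other-rating component y) is

$$
r_{dd} = \frac{\sigma_x^{2}\, r_{xx} + \sigma_y^{2}\, r_{yy} - 2\, r_{xy}\, \sigma_x \sigma_y}{\sigma_x^{2} + \sigma_y^{2} - 2\, r_{xy}\, \sigma_x \sigma_y}
$$

where $\sigma_x^{2}$ and $\sigma_y^{2}$ are the component variances, $r_{xx}$ and $r_{yy}$ the component reliabilities, and $r_{xy}$ their intercorrelation.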

Because this formula requires the computation of a correlation between component scores, the data were first aggregated to the appropriate scale level through the computation of a mean score, and then other-ratings were matched to self-ratings through linkage with the rated individual. Measures of internal consistency were utilized for reliability estimates. Internal consistency estimates were computed separately for the ‘other’ and ‘self’ rater groups. Table 1 presents component score variances, reliability estimates, and correlations. The corresponding estimated reliability coefficients for the computational self-awareness construct were .95 for the Consideration scale and .94 for the Initiating Structure scale.
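A direct implementation of this computation, using clearly hypothetical component values rather than those reported in Table 1, might look as follows.

```python
def difference_score_reliability(var_x, var_y, r_xx, r_yy, r_xy):
    """Reliability of a difference score d = x - y from the component
    variances, component reliabilities, and their intercorrelation."""
    cross = 2 * r_xy * (var_x ** 0.5) * (var_y ** 0.5)
    return (var_x * r_xx + var_y * r_yy - cross) / (var_x + var_y - cross)


# Hypothetical component values, not those reported in Table 1
print(round(difference_score_reliability(0.30, 0.20, 0.95, 0.97, 0.40), 3))
```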

Item Response Theory Parameter Estimation

In order to assess not only DIF, but also the fit of the SGR model to the current data set, the four samples (i.e., High Self-Aware Initiating Structure, Low Self-Aware Initiating Structure, High Self-Aware Consideration, and Low Self-Aware Consideration) were further split into calibration and validation samples. These splits were conducted through rank ordering cases based on the date of completion of the BENCHMARKS® instrument, and then assigning every other case to the calibration or validation data set.
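A minimal sketch of this alternating split, assuming a pandas DataFrame per sample with an illustrative completion-date column, follows.

```python
import pandas as pd


def split_calibration_validation(sample: pd.DataFrame,
                                 date_col: str = "completion_date"):
    """Rank-order cases by instrument completion date and assign every
    other case to the calibration or validation set. The column name
    is assumed for illustration."""
    ordered = sample.sort_values(date_col).reset_index(drop=True)
    calibration = ordered.iloc[0::2]   # 1st, 3rd, 5th, ... ranked cases
    validation = ordered.iloc[1::2]    # 2nd, 4th, 6th, ... ranked cases
    return calibration, validation
```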

In order to ensure convergence of the item and ability parameter estimation procedure, it was also necessary in the current data set to collapse the three lowest response option categories due to insufficient response frequencies. Specifically, the response options of ‘strongly disagree’, ‘disagree’, and ‘neutral’ were all coded identically. Such sparse categories can result in inaccurate parameter estimation (specifically, high standard errors of estimates) and may be dealt with by combining low-frequency categories, defined here as categories with fewer than 100-200 respondents (J. Bjørner, personal communication, January 15, 2002). The necessity of this modification should not be considered entirely surprising, given the nature of self-serving response biases in self-ratings of managerial effectiveness (i.e., infrequent low ratings). The effective result is that the distribution of self-ratings was constrained to a range of “not positive” to “strongly positive” instead of “strongly negative” to “strongly positive”.
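An illustrative recode of this collapsing step is sketched below; the 1-5 numeric coding of the response options and the count threshold are assumptions.

```python
import pandas as pd

# Illustrative recode: the three lowest options of an assumed 5-point
# agreement scale (1 = 'strongly disagree', 2 = 'disagree', 3 = 'neutral')
# are collapsed into a single lowest category prior to estimation.
COLLAPSE_MAP = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3}


def collapse_sparse_categories(item_responses: pd.DataFrame,
                               min_count: int = 100) -> pd.DataFrame:
    recoded = item_responses.replace(COLLAPSE_MAP)
    # Flag any category that still falls below the (assumed) minimum
    # response-frequency threshold on any item.
    for col in recoded.columns:
        counts = recoded[col].value_counts()
        sparse = counts[counts < min_count]
        if not sparse.empty:
            print(f"{col}: sparse categories remain {sparse.to_dict()}")
    return recoded
```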

Item and ability parameter estimates were generated through a marginal maximum likelihood (MML) procedure. This estimation process was conducted separately for each of the four calibration samples (i.e., Low Self-Aware Consideration, High Self-Aware Consideration, Low Self-Aware Initiating Structure, and High Self-Aware Initiating Structure).

Model Fit

Fit of the SGR model to the current data set was assessed through use of the MODFIT program. Specifically, item parameter estimates derived from the calibration samples were compared against the validation samples to construct fit plots and compute chi-square statistics. The use of these dual criteria in the determination of SGR model fit follows the recommendations of Drasgow, Levine, Tsien, Williams, and Mead (1995). Based on inspection of the fit plots, the graded response model provided a good fit to the BENCHMARKS® data. In contrast with the fit plots, the adjusted chi-square to degrees-of-freedom ratios indicate only moderately adequate model fit. Specifically, unadjusted ratios indicate good fit, but adjusting to the suggested sample size of 3,000 indicates only adequate model fit. Tables 2 and 3 present chi-square values for both scales.
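As a hedged illustration of the sample-size adjustment referenced here, the sketch below rescales a chi-square statistic to a reference N of 3,000 before forming the chi-square/df ratio; the exact rescaling formula MODFIT applies is an assumption, as are the example numbers.

```python
def adjusted_chi_square_ratio(chi_sq: float, df: int, n: int,
                              target_n: int = 3000) -> float:
    """Rescale a chi-square statistic to a reference sample size and
    return the adjusted chi-square/df ratio. The rescaling formula is
    an assumption about the adjustment MODFIT applies."""
    adjusted = (target_n / n) * (chi_sq - df) + df
    return adjusted / df


# Hypothetical example: chi-square of 450 on 30 df from 15,000 raters
print(round(adjusted_chi_square_ratio(450.0, 30, 15_000), 2))
```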