Instructions for Use of SPSS Syntax for Inter-rater Reliability for Nominal Data

This document contains directions for six separate syntaxes, each producing least squares, jackknife, and bootstrap solutions for an inter-rater reliability statistic. The specific statistics are Bennett’s S, Fleiss’ generalized kappa, Cohen’s kappa (Gwet’s expansion), Light’s kappa, Conger’s kappa, and Gwet’s AC1. The statistic not included is Krippendorff’s alpha. Syntax for this statistic has been written by Andrew Hayes at The Ohio State University and is available at his website. Before explaining the syntax, a summary of each statistic is provided.

Disclaimer – This syntax is distributed without any warranty on the part of the authors. The syntax has been tested with datasets used by the originators of the agreement statistics and has produced the same results. The syntax has also been tested thoroughly under all conditions that were anticipated, including one in which three of eight categories were not used, and two others were used only once by different raters; however, one can never fully rule out the possibility that a circumstance will arise which has not been tested. The authors would appreciate suggestions and reports of errors encountered in the use of this syntax so that we might remedy the difficulties. Please enclose the data file that you used when you correspond with us. The address by which to reach us is:

Brian Dates

Southwest Counseling Solutions

1700 Waterman

Detroit, MI

USA

Notation Convention – The notation used throughout this document combines that of Siegel and Castellan (1988) and Gwet (2001). There are a total of $N$ subjects or items, numbered $i = 1$ through $N$; there are $q$ categories, numbered $k = 1$ through $q$; and there are $n$ raters, numbered $g = 1$ through $n$.

Bennett’s S – Bennett, Alpert, and Goldstein (1954) suggested adjusting the level of inter-rater agreement to accommodate the proportion of rater agreement that might be expected by chance. To this end, they proposed an index which provided the adjustment based on the number of categories employed. The formula is:

$S = \frac{q}{q-1}\left(P_a - \frac{1}{q}\right)$  (1)

where: $q$ = the number of categories,

$P_a = \frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{q}\frac{n_{ik}(n_{ik}-1)}{n(n-1)}$, the proportion of observed rater agreement,  (2)

$N$ = the total number of subjects, or items,

$n_{ik}$ = the number of ratings in each cell (for item $i$ and category $k$), and

$n$ = the number of raters.

The premise is that the probability of agreement by chance is equal to $1/q$. When $P_a$ is equal to $1/q$, $S = 0$, indicating that the observed agreement is no greater than that expected by chance. The range of $S$ is -1 through 1. Values less than 0 are interpreted as observed agreement less than that expected by chance, while values above 0 mean that the level of observed rater agreement is greater than that expected by chance. Because the chance term $1/q$ depends only on $q$, which is a constant for a given study, $S$ does not vary in value with changes in the distribution of responses among raters or categories when $P_a$ is held constant.

Bennett provided a formula for the variance of the statistic as well:

$\operatorname{Var}(S) = \frac{q^2}{(q-1)^2}\cdot\frac{P_a(1-P_a)}{N}$  (3)

where: $q$ = the number of categories,

$P_a$ = the proportion of rater agreement, and

$N$ = the number of subjects, or items.

Under the Central Limit Theorem, the distribution of $S$ is estimated to be close to normal. As a result, the significance of the value of $S$ can be computed. First the value of $S$ is divided by the standard deviation of $S$ to derive a value for $z$, the number of standard deviations from the mean of 0 (representing observed agreement equal to chance agreement) that the value of $S$ represents:

$z = \frac{S}{\sqrt{\operatorname{Var}(S)}}$  (4)

The probability of obtaining the value of $z$ can be determined by subtracting the cumulative distribution function below $z$ from 1, as in:

$p = 1 - \Phi(z)$  (5)

where: $p$ = the probability of the obtained value of the random variable, and

$z$ = the obtained value of the random variable from Equation (4).

Finally, the 95% confidence limits can be expressed as:

$S_U = S + 1.96\,\sigma_S$ and $S_L = S - 1.96\,\sigma_S$  (6)

where: $S_U$ = the upper limit of the confidence interval, including 2.5% of the normal curve,

$S_L$ = the lower limit of the confidence interval, including 2.5% of the normal curve,

$S$ = Bennett’s $S$, and

$\sigma_S$ = the standard deviation, $\sqrt{\operatorname{Var}(S)}$.
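
The quantities in Equations 1 through 6 are straightforward to compute outside of SPSS as well. The following Python sketch is offered only as an illustration (it is not the SPSS syntax distributed with this document); it assumes the ratings are supplied as an N x n array named ratings, with one row per item and one column per rater, and it uses SciPy for the normal cumulative distribution function.

import numpy as np
from scipy.stats import norm

def bennett_s(ratings, q=None):
    """Bennett's S, its variance, z, one-tailed p, and 95% CI (Equations 1-6)."""
    ratings = np.asarray(ratings)
    N, n = ratings.shape                          # N items, n raters
    cats = np.unique(ratings)
    q = q if q is not None else len(cats)         # number of categories (may be supplied)
    # n_ik: number of raters assigning item i to category k
    n_ik = np.array([(ratings == c).sum(axis=1) for c in cats]).T
    P_a = (n_ik * (n_ik - 1)).sum() / (N * n * (n - 1))          # Equation 2
    S = q / (q - 1) * (P_a - 1 / q)                              # Equation 1
    var_S = (q / (q - 1)) ** 2 * P_a * (1 - P_a) / N             # Equation 3
    z = S / np.sqrt(var_S)                                       # Equation 4
    p = 1 - norm.cdf(z)                                          # Equation 5
    ci = (S - 1.96 * np.sqrt(var_S), S + 1.96 * np.sqrt(var_S))  # Equation 6
    return S, var_S, z, p, ci

# Example: four items rated by three raters into categories 1-3
print(bennett_s([[1, 1, 2], [2, 2, 2], [3, 3, 1], [1, 1, 1]]))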

Fleiss’ generalized kappa – In 1955, Scott developed a measure of inter-rater agreement, which he termed pi. While he believed that a term representing the chance agreement among raters was dependent upon the use of categories, his disagreement with Bennett’s S was based on his belief that chance agreement should take into account the marginal probabilities of agreement. His formula for two raters and two categories is:

$\pi = \frac{P_a - P_e}{1 - P_e}$  (7)

where: $\pi$ = the chance corrected agreement coefficient,

$P_a$ = the calculated proportion of rater agreement, and

$P_e$ = the proportion of rater agreement expected by chance, which for two raters and two categories is expressed as:

$P_e = \left(\frac{n_{1,1} + n_{1,2}}{2N}\right)^2 + \left(\frac{n_{2,1} + n_{2,2}}{2N}\right)^2$  (8)

where: $n_{1,1}$ = the number of responses in category one for rater one,

$n_{1,2}$ = the number of responses in category one for rater two,

$n_{2,1}$ = the number of responses in category two for rater one,

$n_{2,2}$ = the number of responses in category two for rater two, and

$N$ = the total number of subjects, or items.
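
Scott’s two-rater statistic is simple enough to verify with a few lines of code. The sketch below is illustrative only (not part of the SPSS syntax); it assumes the two raters’ codes are supplied as equal-length sequences and it generalizes Equation 8 to any number of categories by summing the squared average marginal proportion over the categories actually used.

import numpy as np

def scott_pi(r1, r2):
    """Scott's pi for two raters over the same items (Equations 7-8)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    N = len(r1)
    cats = np.unique(np.concatenate([r1, r2]))
    P_a = np.mean(r1 == r2)                          # observed proportion of agreement
    # squared average marginal proportion per category, summed (Equation 8 generalized)
    P_e = sum((((r1 == c).sum() + (r2 == c).sum()) / (2 * N)) ** 2 for c in cats)
    return (P_a - P_e) / (1 - P_e)                   # Equation 7

print(scott_pi([1, 2, 1, 1, 2], [1, 2, 2, 1, 2]))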

Fleiss (1971) developed a generalized kappa coefficient, which computationally proved to be an expansion of Scott’s pi for any number of raters and categories. The general format of these statistics, termed kappa-type statistics, is:

$\kappa = \frac{P_a - P_e}{P_{\max} - P_e} = \frac{P_a - P_e}{1 - P_e}$  (9)

where: $\kappa$ = the chance corrected agreement coefficient,

$P_a$ = the observed proportion of rater agreement, calculated using Equation 2,

$P_e$ = the proportion of rater agreement expected by chance, and

$P_{\max}$ = the maximum value that rater agreement can take in the case of all raters agreeing on all cases, which is 1.

What differentiates the various kappa-type statistics is the formulation of the chance agreement proportion, $P_e$. Fleiss’ formula for the expanded chance corrected term, $P_e$, is:

$P_e = \sum_{k=1}^{q} p_k^2$  (10)

where: $k$ = the category number,

$q$ = the total number of categories, and

$p_k^2$ = the square of the proportion of responses in category $k$, which can be found by the following equation:

$p_k = \frac{1}{Nn}\sum_{i=1}^{N} n_{ik}$  (11)

where: $n_{ik}$ = the number of raters selecting category $k$ for subject/item $i$,

$N$ = the total number of subjects, or items, and

$n$ = the number of raters.

Combining these equations, the formula for $P_e$ is:

$P_e = \sum_{k=1}^{q}\left(\frac{1}{Nn}\sum_{i=1}^{N} n_{ik}\right)^2$  (12)

$P_e$, then, is the sum across all categories of the square of the proportion of rater use of each category. Following the general form of Equation 9, Fleiss’ formula for kappa is:

$\kappa_F = \frac{P_a - P_e}{1 - P_e}$  (13)

Fleiss (1971) first published a formula for the variance of kappa. Fleiss, Nee and Landis (1979) presented a corrected version:

$\operatorname{Var}(\kappa_F) = \frac{2}{Nn(n-1)}\cdot\frac{\left[\sum_{k=1}^{q} p_k(1-p_k)\right]^2 - \sum_{k=1}^{q} p_k(1-p_k)(1-2p_k)}{\left[\sum_{k=1}^{q} p_k(1-p_k)\right]^2}$  (14)

where: $\operatorname{Var}(\kappa_F)$ = the variance of kappa conditional on the rater sample,

$p_k$ = the proportion of responses in category $k$ from Equation 11,

$q$ = the total number of categories,

$N$ = the number of subjects/items, and

$n$ = the number of raters.
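
As an illustration of Equations 2, 11, 12, 13, and 14, the following Python sketch computes Fleiss’ generalized kappa and the Fleiss-Nee-Landis variance from an N x n array of nominal codes (rows are items, columns are raters). It is a sketch under those input assumptions, not the SPSS syntax itself.

import numpy as np

def fleiss_kappa(ratings):
    """Fleiss' generalized kappa and its Fleiss-Nee-Landis (1979) variance."""
    ratings = np.asarray(ratings)
    N, n = ratings.shape
    cats = np.unique(ratings)
    n_ik = np.array([(ratings == c).sum(axis=1) for c in cats]).T   # raters per item/category
    P_a = (n_ik * (n_ik - 1)).sum() / (N * n * (n - 1))             # Equation 2
    p_k = n_ik.sum(axis=0) / (N * n)                                # Equation 11
    P_e = (p_k ** 2).sum()                                          # Equation 12
    kappa = (P_a - P_e) / (1 - P_e)                                 # Equation 13
    pq = p_k * (1 - p_k)
    var = 2 / (N * n * (n - 1)) * (pq.sum() ** 2 - (pq * (1 - 2 * p_k)).sum()) / pq.sum() ** 2  # Eq. 14
    return kappa, var

print(fleiss_kappa([[1, 1, 2], [2, 2, 2], [3, 3, 1], [1, 1, 1]]))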

Kappa, and $\pi$ before it, was developed under the assumption that the characteristic or attribute being categorized was reasonably equally distributed among the subjects, or items. Under this assumption the probability of agreement by chance rests evenly on the proportion in each category. In cases which violate the assumption, with appreciable disparities among the proportions of assignments to each category, $P_e$ increases, and the value of $\kappa$ decreases even though the proportion of observed agreement, $P_a$, remains the same. This phenomenon is discussed in detail by Feinstein and Cicchetti (1990). The kappa-type statistics which display decrement as marginal heterogeneity increases are referred to as fixed marginal statistics because they depend on the proportion of responses in each category or in each rater-category interaction. This description is in opposition to the free marginal statistics, which do not depend on the proportion of responses in a category. Bennett’s $S$ is considered a free marginal statistic.

The range of $\kappa$ is -1 through 1. Values of $\kappa$ less than 0 are interpreted as observed agreement less than that expected by chance, while values above 0 indicate that the level of observed rater agreement is greater than that expected by chance.

The distribution of $\kappa_F$ is estimated to be close to normal, so the value of $z$ and the upper and lower confidence limits can be obtained by substituting $\kappa_F$ for $S$ and $\operatorname{Var}(\kappa_F)$ for $\operatorname{Var}(S)$ in Equation 4, and $\kappa_F$ for $S$ and the standard deviation of $\kappa_F$ for $\sigma_S$ in Equation 6.

Fleiss (1971) also presented formulas for category-specific kappa coefficients. The proportion of observed rater agreement in a category, $P_{a(k)}$, the counterpart of $P_a$, is calculated as follows:

$P_{a(k)} = \frac{\sum_{i=1}^{N} n_{ik}^2 - Nn\,p_k}{Nn(n-1)\,p_k}$  (15)

where: $N$ = the total number of subjects, or items,

$n_{ik}$ = the number of ratings in each cell of category $k$,

$p_k$ = the proportion of responses in category $k$ from Equation 11, and

$n$ = the number of raters.

The proportion of rater agreement expected by chance in a category, $P_{e(k)}$, is equal to $p_k$, the proportion of responses in category $k$. So the entire equation for determining the generalized kappa for a category is:

$\kappa_{F(k)} = \frac{P_{a(k)} - p_k}{1 - p_k}$  (16)

The variance of each category-specific kappa is:

$\operatorname{Var}(\kappa_{F(k)}) = \frac{2}{Nn(n-1)}$  (17)

where: $N$ = the total number of subjects, or items, and

$n$ = the number of raters.

Under this formula the estimated variance is the same for every category.

Values of $z$ and of the upper and lower confidence limits can be obtained for each category just as they were for the overall kappa by adapting Equations 4 and 6.
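
The category-specific calculations can be sketched the same way. The function below follows Equations 15 and 16 and uses the constant Fleiss-Nee-Landis value 2/[Nn(n-1)] for Equation 17, as reconstructed above; the input layout (items in rows, raters in columns) is an assumption, and the code is illustrative rather than the distributed SPSS syntax.

import numpy as np

def fleiss_category_kappas(ratings):
    """Category-specific Fleiss kappas (Equations 15-16) and their common variance (Equation 17)."""
    ratings = np.asarray(ratings)
    N, n = ratings.shape
    cats = np.unique(ratings)
    n_ik = np.array([(ratings == c).sum(axis=1) for c in cats]).T             # raters per item/category
    p_k = n_ik.sum(axis=0) / (N * n)                                          # Equation 11
    P_ak = ((n_ik ** 2).sum(axis=0) - N * n * p_k) / (N * n * (n - 1) * p_k)  # Equation 15
    kappa_k = (P_ak - p_k) / (1 - p_k)                                        # Equation 16
    var_k = 2 / (N * n * (n - 1))                                             # Equation 17
    return dict(zip(cats, kappa_k)), var_k

print(fleiss_category_kappas([[1, 1, 2], [2, 2, 2], [3, 3, 1], [1, 1, 1]]))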

Conger’s kappa – Conger (1980) conceived of $P_e$ in a manner similar to Fleiss, with the difference that the sum across all categories of the variances of the rater marginal proportions, divided by the number of raters, is subtracted from Fleiss’ $P_e$. Conger thereby created an expression for the proportion of agreement due to chance that eliminates the rater effect, which he believed was embedded in Fleiss’ expression. He called the adjusted statistic Fleiss’ exact kappa, but it is typically referred to as Conger’s kappa. The formula for Conger’s chance agreement statistic takes the form:

$P_{e(C)} = P_{e(F)} - \frac{1}{n}\sum_{k=1}^{q} s_k^2$  (18)

where $P_{e(C)}$ = Conger’s expression of chance agreement,

$P_{e(F)}$ = Fleiss’ expression for chance agreement (Equation 12),

$s_k^2 = \frac{1}{n-1}\sum_{g=1}^{n}\left(\frac{n_{gk}}{N} - p_k\right)^2$, the variance across raters of the marginal proportions for category $k$,  (19)

$n_{gk}$ = the number of subjects placed in category $k$ by rater $g$, and

$n$ = the number of raters.

Therefore:

$\kappa_C = \frac{P_a - P_{e(C)}}{1 - P_{e(C)}}$  (20)

$P_a$ is calculated, for any of the kappa statistics, by Equation (2).

The computation of the variance for Conger’s kappa depends on constructing the individual item observed agreement, $P_{a(i)}$:

$P_{a(i)} = \frac{1}{n(n-1)}\sum_{k=1}^{q} n_{ik}(n_{ik}-1)$,  (21)

where: $n_{ik}$ = the number of raters selecting category $k$ for subject/item $i$, and

$n$ = the number of raters.

The individual item kappas are then derived from the general formula for kappa:

$\kappa_i = \frac{P_{a(i)} - P_{e(C)}}{1 - P_{e(C)}}$  (22)

Finally, the difference between each $\kappa_i$ and the overall Conger’s kappa, $\kappa_C$, is squared, summed over all items, and divided by $N(N-1)$:

$\operatorname{Var}(\kappa_C) = \frac{1}{N(N-1)}\sum_{i=1}^{N}\left(\kappa_i - \kappa_C\right)^2$  (23)
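
The steps in Equations 18 through 23 can be illustrated with the short Python sketch below. It assumes an N x n array of codes (items in rows, raters in columns) and the N(N-1) divisor used in the reconstruction of Equation 23; it is not the distributed SPSS syntax.

import numpy as np

def conger_kappa(ratings):
    """Conger's (exact) kappa with an item-level variance (Equations 18-23)."""
    ratings = np.asarray(ratings)
    N, n = ratings.shape
    cats = np.unique(ratings)
    n_ik = np.array([(ratings == c).sum(axis=1) for c in cats]).T   # items x categories
    n_gk = np.array([(ratings == c).sum(axis=0) for c in cats]).T   # raters x categories
    p_k = n_ik.sum(axis=0) / (N * n)                                # Equation 11
    s2_k = ((n_gk / N - p_k) ** 2).sum(axis=0) / (n - 1)            # Equation 19
    P_e = (p_k ** 2).sum() - s2_k.sum() / n                         # Equation 18
    P_ai = (n_ik * (n_ik - 1)).sum(axis=1) / (n * (n - 1))          # Equation 21
    P_a = P_ai.mean()                                               # Equation 2
    kappa = (P_a - P_e) / (1 - P_e)                                 # Equation 20
    kappa_i = (P_ai - P_e) / (1 - P_e)                              # Equation 22
    var = ((kappa_i - kappa) ** 2).sum() / (N * (N - 1))            # Equation 23
    return kappa, var

print(conger_kappa([[1, 1, 2], [2, 2, 2], [3, 3, 1], [1, 1, 1]]))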

As with Fleiss’ kappa, Conger’s kappa is a fixed marginal statistic and will decrease in value as marginal heterogeneity increases. The range of $\kappa_C$ is -1 through 1. Values less than 0 are interpreted as observed agreement less than that expected by chance, while values above 0 indicate that the level of observed rater agreement is greater than that expected by chance. The distribution of $\kappa_C$ is estimated to be close to normal, so the value of $z$ and the upper and lower confidence limits can be obtained as they were for Fleiss’ kappa (see Equations 4 and 6).

Calculation of category-specific kappas follows a path identical to that of Fleiss. The observed proportion of rater agreement in any category $k$ for all agreement statistics can be found by using Equation 15. For Conger’s kappa, the proportion of agreement expected by chance in any category, $P_{e(k)}$, is given by:

$P_{e(k)} = \frac{p_k^2 - s_k^2/n}{p_k}$  (24)

where: $p_k$ = the proportion of responses in category $k$ from Equation 11,

$s_k^2$ = the variance across raters of the marginal proportions for category $k$ from Equation 19,

$N$ = the total number of subjects, or items,

$n_{gk}$ = the number of times category $k$ was used by rater $g$, and

$n$ = the number of raters.

Following Equation 13, the formula for Conger’s kappa for any category is therefore

$\kappa_{C(k)} = \frac{P_{a(k)} - P_{e(k)}}{1 - P_{e(k)}}$  (25)

The formula for the variance of Conger’s kappa for any category is constructed similarly to that for the overall kappa. First the individual item proportion of observed agreement, $P_{a(k,i)}$, is determined for each item $i$ in category $k$:

$P_{a(k,i)} = \frac{n_{ik}(n_{ik}-1)}{n(n-1)}$  (26)

where: $n_{ik}$ = the number of raters selecting category $k$ for subject/item $i$, and

$n$ = the number of raters.

The kappa for that item in that category is calculated according to the formula:

$\kappa_{k,i} = \frac{P_{a(k,i)} - P_{e(k)}}{1 - P_{e(k)}}$  (27)

The last step is to sum the squared differences between the individual item kappas for category $k$ and the overall kappa for category $k$, $\kappa_{C(k)}$, across all items, and divide by $N(N-1)$:

$\operatorname{Var}(\kappa_{C(k)}) = \frac{1}{N(N-1)}\sum_{i=1}^{N}\left(\kappa_{k,i} - \kappa_{C(k)}\right)^2$  (28)

where $\operatorname{Var}(\kappa_{C(k)})$ = the variance of Conger’s kappa for category $k$,

$N$ = the total number of subjects, or items,

$\kappa_{k,i}$ = the individual item kappa for category $k$, and

$\kappa_{C(k)}$ = the kappa for category $k$.

Cohen’s kappa – Cohen (1960) took a different approach regarding the nature of the estimator of chance agreement, $P_e$, than had Scott and Fleiss. He believed that the chance agreement coefficient should be specific to rater-category use rather than to category use alone. His formula for chance agreement for two raters and two categories took the form:

$P_e = \frac{n_{1,1}}{N}\cdot\frac{n_{1,2}}{N} + \frac{n_{2,1}}{N}\cdot\frac{n_{2,2}}{N}$  (29)

where: $n_{1,1}$ = the number of responses in category one for rater one,

$n_{1,2}$ = the number of responses in category one for rater two,

$n_{2,1}$ = the number of responses in category two for rater one, and

$n_{2,2}$ = the number of responses in category two for rater two.

Cohen based his chance agreement correction on the sum across categories of the product of rater proportions within each category. This is equivalent to the joint probability of agreement among raters. Based on an extension of this logic, according to Gwet (2001), the appropriate expression for the expanded Cohen estimator for any number of raters and categories, $P_e$, is:

$P_e = \sum_{k=1}^{q}\prod_{g=1}^{n} p_{gk}$  (30)

where $p_{gk} = \frac{n_{gk}}{N}$, the proportion of subjects/items that rater $g$ placed in category $k$,

$n$ = the number of raters,

$n_{gk}$ = the number of subjects/items for which rater $g$ selected category $k$, and

$N$ = the number of subjects/items.

As one equation, these are:

$P_e = \sum_{k=1}^{q}\prod_{g=1}^{n}\frac{n_{gk}}{N}$  (31)

Cohen’s kappa is therefore, by the now familiar general formula:

$\kappa = \frac{P_a - P_e}{1 - P_e}$  (32)

The variance of Cohen’s kappa can be expressed as:

$\operatorname{Var}(\kappa) = \frac{1}{N(N-1)}\sum_{i=1}^{N}\left(\kappa_i - \kappa\right)^2$  (33)

where $\operatorname{Var}(\kappa)$ = the variance of $\kappa$,

$\kappa_i = \frac{P_{a(i)} - P_e}{1 - P_e}$, with $P_{a(i)} = \frac{1}{n(n-1)}\sum_{k=1}^{q} n_{ik}(n_{ik}-1)$,  (34)

$n_{ik}$ = the number of raters selecting category $k$ for subject/item $i$,

$n$ = the number of raters,

$P_e$ = the proportion of rater agreement expected by chance,

$\kappa$ = the overall kappa, and

$N$ = the number of subjects/items.

Cohen’s kappa is a fixed marginal statistic and ranges in value from -1 through 1. The distribution of $\kappa$ is considered normal, so the value of $z$ and the upper and lower confidence limits can be obtained as they were for both Fleiss’ and Conger’s kappas (see Equations 4 and 6).
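
A Python sketch of the expanded Cohen estimator follows, under the assumptions made in the reconstruction above: the chance term of Equations 30 and 31 is taken as the summed product of the raters’ marginal proportions across all raters, and the variance follows the item-level form of Equations 33 and 34 with an N(N-1) divisor. The input layout (items in rows, raters in columns) is likewise assumed; this is an illustration, not the distributed SPSS syntax.

import numpy as np

def cohen_kappa_multi(ratings):
    """Multi-rater Cohen-type kappa (Equations 30-34 as reconstructed here)."""
    ratings = np.asarray(ratings)
    N, n = ratings.shape
    cats = np.unique(ratings)
    n_ik = np.array([(ratings == c).sum(axis=1) for c in cats]).T   # items x categories
    n_gk = np.array([(ratings == c).sum(axis=0) for c in cats]).T   # raters x categories
    P_e = np.prod(n_gk / N, axis=0).sum()                           # Equations 30-31
    P_ai = (n_ik * (n_ik - 1)).sum(axis=1) / (n * (n - 1))          # per-item agreement
    P_a = P_ai.mean()                                               # Equation 2
    kappa = (P_a - P_e) / (1 - P_e)                                 # Equation 32
    kappa_i = (P_ai - P_e) / (1 - P_e)                              # Equation 34
    var = ((kappa_i - kappa) ** 2).sum() / (N * (N - 1))            # Equation 33
    return kappa, var

print(cohen_kappa_multi([[1, 1, 2], [2, 2, 2], [3, 3, 1], [1, 1, 1]]))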

Category-specific kappas follow a computational path similar to that for the overall kappa, using joint rater agreement as the foundation of $P_{e(k)}$, the proportion of category-based rater agreement expected by chance. The formula for the category-specific kappa follows the familiar format:

$\kappa_k = \frac{P_{a(k)} - P_{e(k)}}{1 - P_{e(k)}}$  (35)

where $P_{a(k)} = \frac{\sum_{i=1}^{N} n_{ik}^2 - Nn\,p_k}{Nn(n-1)\,p_k}$, as in Equation 15,  (36)

$n_{ik}$ = the number of ratings in each cell of category $k$,

$N$ = the total number of subjects, or items,

$p_k$ = the proportion of responses in category $k$ from Equation 11, and

$n$ = the number of raters.

$P_{e(k)} = \frac{1}{p_k}\prod_{g=1}^{n}\frac{n_{gk}}{N}$, and

$n_{gk}$ = the number of subjects/items for which rater $g$ selected category $k$.

The category-specific variance for each of Cohen’s category kappas can be expressed as:

$\operatorname{Var}(\kappa_k) = \frac{1}{N(N-1)}\sum_{i=1}^{N}\left(\kappa_{k,i} - \kappa_k\right)^2$  (37)

where $\operatorname{Var}(\kappa_k)$ = the variance of Cohen’s kappa for category $k$,

$N$ = the total number of subjects, or items,

$\kappa_{k,i} = \frac{P_{a(k,i)} - P_{e(k)}}{1 - P_{e(k)}}$, with $P_{a(k,i)} = \frac{n_{ik}(n_{ik}-1)}{n(n-1)}$,

$n_{ik}$ = the number of raters selecting category $k$ for subject/item $i$,

$n$ = the number of raters, and

$\kappa_k$ = the kappa for category $k$.

Light’s kappa – Light (1971) suggested a solution for the expansion of Cohen’s kappa to multiple raters and categories that was based on the degree of disagreement among raters. He expressed kappa as:

$\kappa = 1 - \frac{D_o}{D_e}$  (38)

where: $D_o$ = the observed proportion of disagreement among raters, and

$D_e$ = the proportion of disagreement expected by chance.

Light did not pursue the actual development of a formula for multiple raters and categories; however, he did provide an example using three raters, which Conger (1980) generalized. Light’s expansion, frequently referred to as Light’s kappa, is the average of all pairwise Cohen kappas. Expressed as a function of disagreement, Light’s kappa is:

$\kappa_L = 1 - \frac{D_o}{D_e} = \frac{2}{n(n-1)}\sum_{g=1}^{n-1}\sum_{h=g+1}^{n}\kappa_{gh}$  (39)

where: $D_o$ = the observed proportion of disagreement among all raters across all categories,

$D_e$ = the proportion of disagreement expected by chance,

$\kappa_{gh}$ = Cohen’s kappa for raters $g$ and $h$, and

$n(n-1)/2$ = the number of rater pairs.

This is equivalent (Conger, 1980) to:

$\kappa_L = \frac{2}{n(n-1)}\sum_{g=1}^{n-1}\sum_{h=g+1}^{n}\frac{P_{a(gh)} - P_{e(gh)}}{1 - P_{e(gh)}}$  (40)

where $\kappa_{gh} = \frac{P_{a(gh)} - P_{e(gh)}}{1 - P_{e(gh)}}$, Cohen’s kappa for rater pair $(g,h)$,

$P_{a(gh)} = \frac{1}{N}\sum_{k=1}^{q} n_{k(gh)}$, the proportion of observed agreement between rater pair $(g,h)$,

$n_{k(gh)}$ = the number of items/subjects assigned to category $k$ by each rater of rater pair $(g,h)$, and

$P_{e(gh)} = \sum_{k=1}^{q}\frac{n_{gk}}{N}\cdot\frac{n_{hk}}{N}$, the sum across categories of the product of the proportions of items/subjects in each category for each rater of rater pair $(g,h)$.

The variance is the average of all rater pair variances using the formula provided in Equations 33 and 34 in the section on Cohen’s kappa:

$\operatorname{Var}(\kappa_L) = \frac{2}{n(n-1)}\sum_{g=1}^{n-1}\sum_{h=g+1}^{n}\operatorname{Var}(\kappa_{gh})$  (41)

where $\operatorname{Var}(\kappa_L)$ = the overall variance of Light’s kappa, and

$\operatorname{Var}(\kappa_{gh})$ = the variance for rater pair $(g,h)$.

Because Light’s kappa is the average of all the pairwise Cohen’s kappas, like Cohen’s kappa it is a fixed marginal statistic and ranges in value from -1 through 1. The distribution of $\kappa_L$ is considered normal, so the value of $z$ and the upper and lower confidence limits can be obtained using the formulas outlined in the section on Cohen’s kappa.
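
Because Light’s kappa is simply the mean of the pairwise Cohen kappas, it can be sketched directly from the classical two-rater formula. The code below is illustrative only (not the distributed SPSS syntax) and assumes the same N x n input layout used in the earlier sketches.

import numpy as np
from itertools import combinations

def cohen_kappa_pair(r1, r2):
    """Classical two-rater Cohen's kappa (Equation 29 generalized to any number of categories)."""
    cats = np.unique(np.concatenate([r1, r2]))
    N = len(r1)
    P_a = np.mean(r1 == r2)
    P_e = sum(((r1 == c).sum() / N) * ((r2 == c).sum() / N) for c in cats)
    return (P_a - P_e) / (1 - P_e)

def light_kappa(ratings):
    """Light's kappa: the mean of Cohen's kappa over all rater pairs (Equations 39-40)."""
    ratings = np.asarray(ratings)
    pairs = combinations(range(ratings.shape[1]), 2)
    return np.mean([cohen_kappa_pair(ratings[:, g], ratings[:, h]) for g, h in pairs])

print(light_kappa([[1, 1, 2], [2, 2, 2], [3, 3, 1], [1, 1, 1]]))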

The formula for Light’s kappa for each category is a variation on the configuration of Equation 40:

$\kappa_{L(k)} = \frac{2}{n(n-1)}\sum_{g=1}^{n-1}\sum_{h=g+1}^{n}\frac{P_{a(k,gh)} - P_{e(k,gh)}}{1 - P_{e(k,gh)}}$  (42)

where $\kappa_{L(k)}$ = Light’s kappa for category $k$,

$P_{a(k,gh)} = \frac{n_{k(gh)}}{N\,p_{k(gh)}}$, the observed proportion of agreement for rater pair $(g,h)$ in category $k$,

$n_{k(gh)}$ = the number of items/subjects assigned to category $k$ by each rater of rater pair $(g,h)$,

$p_{k(gh)}$ = the proportion of items/subjects assigned to category $k$ by rater pair $(g,h)$, and

$P_{e(k,gh)} = \frac{n_{gk}}{N}\cdot\frac{n_{hk}}{N}$, the product of the proportions of items/subjects in category $k$ for each rater of rater pair $(g,h)$.

The variance of Light’s kappa for each category is given by the equation:

$\operatorname{Var}(\kappa_{L(k)}) = \frac{2}{n(n-1)}\sum_{g=1}^{n-1}\sum_{h=g+1}^{n}\operatorname{Var}(\kappa_{k(gh)})$  (43)

where $\operatorname{Var}(\kappa_{L(k)})$ = the category variance of Light’s kappa, and

$\operatorname{Var}(\kappa_{k(gh)})$ = the variance in category $k$ for rater pair $(g,h)$, determined via Equation 37.

Gwet’s AC1 – Gwet (2001) has proposed an alternate “kappa type” statistic, $AC_1$ (Agreement Coefficient 1), which considers only category usage, as does Scott’s pi, but also takes into account both the number of categories and the probability of category non-use. His formula for the proportion of agreement expected by chance for two categories and two raters is:

$P_e = 2\,\pi_1\left(1 - \pi_1\right)$, with $\pi_1 = \frac{n_{1,1} + n_{1,2}}{2N}$,  (44)

where: $\pi_1$ = the proportion of responses in category one, averaged over the two raters,

$n_{1,1}$ = the number of responses in category one for rater one,

$n_{1,2}$ = the number of responses in category one for rater two, and

$N$ = the total number of subjects, or items.

Gwet’s chance correction factor is calculated based on the product of the proportion of category use and the proportion of non-use of that category. So, for example, in the case of two categories the equation above can express $P_e$ solely in terms of $\pi_1$, because $(1 - \pi_1)$ is equal to $\pi_2$. Because Gwet’s formula for $P_e$ is based on category use, its relationship to Scott’s pi can be expressed as:

$P_{e(AC_1)} = 1 - P_{e(\pi)}$  (45)

where: $P_{e(AC_1)}$ = the proportion of rater agreement expected by