Alka Indurkhya    EPI 826    9/13/99

Lecture 4: Matched Data, McNemar's Test. Rosner Section 10.6

  • Two sample test for Binomial proportions for matched pairs
  • Test for correlated proportions
  • Exact test for correlated proportions
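
As a quick companion to the matched-pair topics listed above, here is a minimal sketch of McNemar's test; the function, the counts, and the use of scipy are my own illustration, not material from Rosner or the lecture.

from scipy.stats import binomtest, chi2

def mcnemar(b, c, exact=False):
    """McNemar's test for matched pairs.
    b and c are the two discordant-pair counts from the 2x2 table of pairs."""
    if exact:
        # Exact test: under H0 the discordant pairs split as Binomial(b + c, 1/2).
        return binomtest(b, n=b + c, p=0.5).pvalue
    # Continuity-corrected chi-square statistic on 1 df.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2.sf(stat, df=1)

# Illustrative counts only:
print(mcnemar(25, 10))            # large-sample test
print(mcnemar(5, 1, exact=True))  # exact test when discordant pairs are few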

Kappa Statistic. Rosner Section 10.13

Guidelines for evaluation of Kappa

Kappa > .75 is excellent

Kappa between .4 and .75 is good

Kappa less than .4 is marginal

More on Kappa….

This is a summary of a discussion on Kappa that took place on Epidemiol, an online forum for epidemiologists to voice their questions.

1. I have a problem with the Kappa index. Why do we use Kappa? I would appreciate it if someone could help me identify the flaw in my thinking (or lack of thinking).

I understand the percentage of agreement if I want to know how well an instrument is calibrated, or how often interobserver or intraobserver results are the same.

The argument in favor of Kappa states that the percentage of agreement doesn't take into consideration the chance factor. The chance factor in Kappa is dealt with in a similar way to the test of association. It calculates the "agreement by chance" based on the "marginal totals" in a cross tabulation.

I don't understand why the chance of two measurements of two distinct concepts (a test of association between two proportions) can work the same way as the chance of two measurements of the same concept (replicability). In my understanding, the chance factor in replicability can be resolved by a reasonable sample size, or by checking that increasing the number of observations doesn't change the percentage of agreement.

Let's see the interpretation of the following table (from Abramson JH, Making Sense of Data):

                        Observer B
                        No      Yes
Observer A    No       815       85
              Yes       85       15

We have 83% agreement and a Kappa of 5.6%
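
A few lines of Python (my own sketch, not part of the forum post) reproduce the 83% agreement and the Kappa of roughly 5.6% from this table:

# Cell counts from the table above (Observer A in rows, Observer B in columns).
a, b = 815, 85    # A says No:  B says No, B says Yes
c, d = 85, 15     # A says Yes: B says No, B says Yes
n = a + b + c + d

p_observed = (a + d) / n                                      # 0.83
p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2   # 0.82
kappa = (p_observed - p_expected) / (1 - p_expected)          # about 0.056

print(p_observed, p_expected, round(kappa, 3))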

Using Kappa, I should consider these measurements between observers completely unreliable. Using the percentage of agreement, I can consider the clinical significance of a 17% misdiagnosis rate and use this result in my decision-making process. For instance, I can introduce a third observer to reclassify the disagreement cases so that every case becomes a No/No or a Yes/Yes with at least two observers in agreement. I can also retrain my observers, or review the disagreement cases with them and redo the measurements. I can make a rule that every measurement with less than 95 percent replicability needs two more measures of the same concept before any conclusion is drawn.

What I am trying to say is that the percentage of agreement makes sense in clinical practice. Why use a more complex and abstract index with a controversial way of dealing with chance (Kappa) for the same purpose?

2. I'll try to answer why we use Kappa.

Kappa is the best way we know to assess the agreement between two categorical measurements, because it is a statistic free of the effect of chance.

Let me try to explain it without deep mathematical thoughts. Imagine a clinical situation where patients can have a given disease and an experienced clinician (or a gold standard) is able to make a correct diagnosis in all cases. Now we would like to test a new (cheaper and simpler) method to diagnose the disease. We will be willing to accept it only if it has a reasonable agreement with our gold standard. The first thing we can do is apply both the gold standard and the new method to a large enough sample of patients and compute the percentage of agreement between the two methods. Of course, if the new method is as ideal as the gold standard, their percentage of agreement will be 100 percent, but what if it is not as perfect? The percentage of agreement will not then tend to 0 percent: by chance you are always going to make some correct diagnoses. How many? Here is where the percentage of agreement starts to become a mess.

Even if the new method is useless, say, it assigns patients to one diagnosis or the other absolutely at random, we are still going to obtain some agreement greater than 0 percent. In an extreme case, if the prevalence of the disease is 100 percent, our useless and absolutely random method agrees with the gold standard in around 50 percent of the cases. As another example, if the prevalence is extremely low and our useless method always assigns patients to the healthy group, the percentage of agreement will tend to 100 percent. We would not want to use a percentage so influenced by chance as an indicator of agreement.
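
A short simulation makes the point concrete (my own illustration; the prevalence and sample size are made up): a method that always answers "healthy" agrees with the gold standard almost all the time when the disease is rare, even though it has no diagnostic value at all.

import random

random.seed(0)
n = 10_000
prevalence = 0.02   # an assumed rare disease, for illustration

gold = [random.random() < prevalence for _ in range(n)]   # True = diseased
always_healthy = [False] * n                              # useless method 1
coin_flip = [random.random() < 0.5 for _ in range(n)]     # useless method 2

def agreement(xs, ys):
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)

print(agreement(gold, always_healthy))   # close to 0.98
print(agreement(gold, coin_flip))        # close to 0.50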

That is why we use Kappa. It not only provides a unified scale of agreement but also has a very straightforward interpretation: [the excess of the agreement (between the two methods) over that expected purely by chance] divided by [the maximum possible excess of agreement over that expected just by chance, that is, one minus the expected agreement]. It varies from -1 to 1. A Kappa of 0 indicates that the observed agreement equals that expected just by chance (that is why values lower than 0 are not common), and a Kappa of 1 indicates perfect agreement. Several generalizations for ordinal variables are also available.
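
In symbols (standard notation, not the poster's), with p_o the observed proportion of agreement and p_e the proportion of agreement expected by chance from the marginal totals:

\kappa \;=\; \frac{p_o - p_e}{1 - p_e}, \qquad -1 \le \kappa \le 1,
\qquad \kappa = 0 \text{ when } p_o = p_e, \qquad \kappa = 1 \text{ when } p_o = 1 .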

3. A very good example of the use of Kappa statistics has been presented by Sackett et al. in Clinical Epidemiology. I will put their ideas into the example below; I think this will make clear why we use Kappa. In Davi Rumels' example:

                        Observer B
                        No      Yes     Total
Observer A    No       815       85       900
              Yes       85       15       100
        Total          900      100      1000

Let us suppose observer B makes his observations by means of a random number table: if a zero comes up, he says "yes"; for any other digit, he says "no".

What, then, is the expected agreement? Since B says "no" 90 percent of the time and "yes" 10 percent of the time, 90 percent of observer A's "no" cases would also be "no" for observer B, 10 percent of A's "no" cases would be "yes" for B, 90 percent of A's "yes" cases would be "no" for B, and 10 percent of A's "yes" cases would also be "yes" for B.

                                  Observer B
                           No                   Yes            Total
Observer A    No     90% of 900 = 810    10% of 900 =  90        900
              Yes    90% of 100 =  90    10% of 100 =  10        100
        Total               900                 100             1000

The expected concordance would then be (810 + 10)/1000 = 82%. This high concordance is entirely due to chance. I think this is sufficient to discount the observed agreement as a clinically relevant index. We have seen in this example that the observed agreement was 83 percent and the expected agreement was 82 percent. Kappa is the way to combine these into a clinically useful index.

Observed agreement: 83 percent. Agreement expected on the basis of chance: 82 percent. Actual agreement beyond chance: 83% - 82% = 1%. Potential agreement beyond chance: 100% - 82% = 18%. Kappa = actual agreement beyond chance / potential agreement beyond chance = 1% / 18% = 0.056.

The result represents the proportion of potential agreement that was actually achieved. It should also be noted that there are tests to evaluate the significance of the Kappa statistic (see Fleiss, Statistical Methods for Rates and Proportions, 1986 ed., John Wiley).

4. Maybe there is a difference between wanting to know about reliability/replicability/observer agreement in a particular instance and wanting to make generalizable statements about the performance of a measuring instrument.

For the latter purpose, I think Kappa is preferable to percent agreement exactly because it discounts the percent agreement for chance agreement, and chance agreement would vary from population to population. Essentially this also seems to be the conclusion arrived at by Feinstein and Cicchetti in the J Clin Epi papers already cited in this discussion. Kappa may not be perfect, but for certain purposes it improves on percent agreement.

By the way, Feinstein and Cicchetti propose two new agreement indices: Ppos and Pneg.
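
For reference, a small sketch of those two indices as they are usually defined (the proportions of specific positive and specific negative agreement); the formulas below are the commonly cited versions, not a quotation from the papers:

def p_pos_neg(a, b, c, d):
    """Proportions of specific positive (Ppos) and negative (Pneg) agreement
    for a 2x2 rater-by-rater table, where a = both raters Yes,
    d = both raters No, and b, c are the discordant cells."""
    p_pos = 2 * a / (2 * a + b + c)
    p_neg = 2 * d / (2 * d + b + c)
    return p_pos, p_neg

# Abramson table from item 1: a = 15 (Yes/Yes), d = 815 (No/No), b = c = 85.
print(p_pos_neg(a=15, b=85, c=85, d=815))   # Ppos = 0.15, Pneg is about 0.91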

5. Kappa is telling you that despite the high proportion of agreement, the observers are not actually telling you anything consistent about the state of nature. An observer who said "no" all the time would be just as good as either of the current observers. In fact, if you replaced A with such an observer, the percent agreement would go up from 83 percent to 90 percent!

Assessing inter-rater agreement

The two measures commonly used to assess interobserver agreement are 1) percent agreement (p0) and 2) Kappa. Percent agreement is calculated as the percent of all observations falling on the diagonal of the C x C table. Kappa adjusts the observed agreement for that expected purely by chance. Both measures have their limitations, and both are affected by the distribution of marginal frequencies. Both percent agreement and Kappa will be calculated for the various variables in the database. Dr. A's coding will be considered the "gold standard". Agreement will be considered acceptable if the marginal frequencies of the abstractor shift, relative to the frequency distribution of the "gold standard", by no more than 1 in any category. Some possible scenarios are described below:
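
A sketch of how both measures could be computed for a general C x C table (plain Python; the function name and the list-of-lists layout are my own, not part of the protocol):

def agreement_and_kappa(table):
    """table[i][j] = number of observations coded i by the abstractor
    and j by the gold standard, for a C x C table."""
    c = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(c)) for j in range(c)]
    p_obs = sum(table[i][i] for i in range(c)) / n
    p_exp = sum(row_tot[i] * col_tot[i] for i in range(c)) / n**2
    kappa = (p_obs - p_exp) / (1 - p_exp)
    return p_obs, p_exp, kappa

# Scenario a) below, the dichotomous variable:
print(agreement_and_kappa([[9, 0], [1, 10]]))   # (0.95, 0.5, 0.9)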

a) For a dichotomous variable

                        Gold standard
                        Yes     No
Abstractor    Yes         9      0      9     Percent agreement = (9+10)/20 = .95
              No          1     10     11     Kappa = (.95 - .5)/.5 = .9
                         10     10     20

b) For variables with 3 levels

                        Gold standard
                         1      2      3
Abstractor     1         6      0      0      6     Percent agreement = 19/20 = 95%
               2         0      6      0      6     Kappa = (.95 - .335)/(1 - .335) = .92
               3         1      0      7      8
                         7      6      7     20
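
Working the three-level table through the same calculation (a sketch; the .92 above is computed from the counts shown):

# Scenario b): abstractor in rows, gold standard in columns.
table = [[6, 0, 0],
         [0, 6, 0],
         [1, 0, 7]]
n = sum(map(sum, table))                                       # 20
p_obs = sum(table[i][i] for i in range(3)) / n                 # 19/20 = 0.95
row_tot = [sum(row) for row in table]                          # [6, 6, 8]
col_tot = [sum(col) for col in zip(*table)]                    # [7, 6, 7]
p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2    # 134/400 = 0.335
print((p_obs - p_exp) / (1 - p_exp))                           # about 0.92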

The above two scenarios represent the best possible situations, since the hypothesized marginal frequencies are uniform across the levels of the variable. Kappa is highly affected by the distribution of marginal frequencies. Thus, when the distribution across the categories of a given variable is not uniform, we can still have very high agreement but a much lower Kappa.

                        Gold Standard
                        Yes     No
Abstractor    Yes         1      0      1     Percent agreement = 19/20 = 95%
              No          1     18     19     Kappa = (.95 - .86)/(1 - .86) = .64
                          2     18     20

However, this is the maximum Kappa attainable given these specific marginal frequency distributions. Both percent agreement and Kappa will be calculated for each variable compared. Our criterion for acceptable agreement will be a shift in the marginal frequencies of the abstractor, as compared to the gold standard, of no more than 1 misclassification. If more than 1 misclassification occurs, this will be identified as an area that needs review and retraining of the abstractor.
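
To see why .64 is a ceiling here, note that the largest observed agreement consistent with fixed marginals is obtained by putting min(row total, column total) in each diagonal cell; a sketch of that calculation (my own illustration, not part of the protocol):

# Marginals from the last table: abstractor rows (1, 19), gold standard columns (2, 18).
row_tot, col_tot, n = [1, 19], [2, 18], 20

p_exp = sum(r * c for r, c in zip(row_tot, col_tot)) / n**2        # 344/400 = 0.86
p_obs_max = sum(min(r, c) for r, c in zip(row_tot, col_tot)) / n   # 19/20 = 0.95
kappa_max = (p_obs_max - p_exp) / (1 - p_exp)
print(round(kappa_max, 2))   # 0.64, the maximum Kappa given these marginals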