173Sj2 Analysis of Categorical Data Rjg 2000 H/O 1

DEPARTMENT OF ACTUARIAL MATHEMATICS AND STATISTICS

SCHOOL OF MATHEMATICAL AND COMPUTER SCIENCES

Name: ……………………………

Lecturer: Roger J Gray

April 2007


CONTENTS

Preface

Aims of module

Summary

Content/structure

Assessment

Module website

Timetable

Reading

Background - computing

Review of selected R material

Examples

§1 Introduction

§2 Poisson process and associated distributions

§3 Single classifications

§4 Twoway classifications

§5 A brief introduction to generalised linear models

§6 Models for single classifications

§7 Logistic regression

§8 Models for twoway and threeway classifications

Tutorial and lab questions

A special 8-page document called R reference sheets will be given out, as will a separate handout on GLMs – theory and examples. Both documents are also available for downloading from the module website.

Preface

Aims of module

to present theory and techniques for the analysis of categorical data

to develop students’ abilities in understanding and solving practical statistical problems involving categorical data

to enable students to learn how to choose appropriate techniques, to analyse categorical data, and present results

Summary

The module is based on this workbook, which contains necessary background and theory, data sets, and worked examples. Lecture time will be used to illustrate some important points in the workbook and to guide students through worked examples. Practical applications will be emphasised throughout using R (and very occasionally Minitab).

Students will be expected to learn some of the material for themselves, both from studying the content of the workbook and from doing the practical work contained in the tutorials. Not every part of the work will be taught as such. There will be regular computer labs in weeks 2 – 6; RJG will be present, as far as possible and as required, at one or more labs in those weeks.

Content/structure

Weeks 1 – 5: there will be about 15 lectures given over the first 5 weeks of the term, plus computer labs as required. Exact details will be given during class (and by email) in advance and in good time.

§1 Introduction

§2 Poisson process and associated distributions

2.1 Bernoulli trials and related distributions

2.2 Poisson process and related distributions

2.3 Inference for the Poisson distribution

2.4 Dispersion and LR statistics and tests for Poisson data

§3 Single classifications

3.1 Binary classifications

3.2 Qualitative categories

3.3 Ordered categories

3.4 Goodness-of-fit tests for frequency distributions

3.5 Residuals

► Major illustration 1 – publish and be modelled – available from the website

► Major illustration 2 – birds in hedges – available from the website

§4 Twoway classifications

4.1 Factors and responses

4.2 Distribution theory and tests for r × s tables

4.3 The 2 × 2 table

4.4 Log odds, collapsing tables, interactions

§5 A brief introduction to generalised linear models (GLMs)

§6 Models for single classifications

6.1 Single classifications – trend models

6.2 Single classifications – including a “deterministic denominator”

► Major illustration 3 – onset of leukaemia – cyclic model – available from the website

§7 Logistic regression

§8 Models for twoway and threeway classifications

8.1 Log-linear models for twoway classifications

8.2 Twoway classifications – including a “deterministic denominator”

8.3 Log-linear models for threeway classifications

8.4 Hierarchic log-linear models

► Major illustration 4 – Project Spring 2006 – numbers and proportions of policies – available from the website

Assessment

The module is the second of a linked pair. The pair of modules is assessed on the basis of three projects, one of which relates to this module. It will be given out at the end of week 4, for return at the end of week 6.

Module website

There is a link from the Department’s main Information for Current Students page to an introductory page. There is a link from this page to the detailed module site, which you should bookmark.

The webpages will be updated as required. Data for the project, and for a few of the tutorial questions, will be accessible from the site.

Timetable

See third year timetable issued by the Department. There will also be an announcement at the first meeting of the class. More slots may be reserved than will actually be used.

Reading

The workbook (plus separate material handed out or made available on the website) is intended to be sufficiently comprehensive on its own as support material. However, if you want to read more widely now or later, the following texts are suggested.

Everitt, BS (1992) The Analysis of Contingency Tables, Chapman and Hall

Fienberg, SE (1991) The Analysis of Cross-Classified Categorical Data, MIT

Plackett, RL (1981) The Analysis of Categorical Data (2nd ed.), Griffin ¹

Venables, WN, & Ripley, BD Modern Applied Statistics with S-Plus, Springer ²

¹ This classic text is out of print, but a number of copies are held by the library.

² The main S-Plus reference book.

Background - computing

Review of R

An 8-page document called R reference sheets is available as a handout and also on the module website. The material on linear modelling is repeated here, together with the recommended code for downloading a data set from the module website.

Linear models

> model1 = lm(y ~ x)                        # normal linear regression model of y on x

> model2 = lm(y1 ~ x1 + x2, data = illus3)  # normal linear regression model of y1 on x1 and x2, data held in data frame "illus3"

Information from fitted models

> summary(mod3)displays parameter estimates and st. errors, deviance, and

correlation matrix

> summary.aov(mod5)displays the analysis of variance for the fitted model

> fitted(mod4) > resid(mod4) fitted values, residuals, in fitted model

> coef(mod4) coefficients in fitted model

> f4 = fitted(mod4) vectors containing fitted values, residuals, coefficients in

> r2 = resid(mod2) fitted model

> c3 = coef(mod3)

> plot(fitted(mod3), resid(mod3)) plot of residuals against fitted values

> abline(mod3) adds fitted line to current data plot

> abline(h=0, lty=2) adds a horizontal dashed line at y = 0 to current data plot

plot(model3) supplies 4 plots associated with the fitted model “model3”: click on the

command window (and then return) each time to get each plot;

1 residuals v fitted, 2 normal Q-Q plot,

3 scale-location plot, and 4 Cook’s distance plot
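The commands above can be tried on a small data set. The sketch below uses simulated data that are purely hypothetical (they do not come from the workbook), and the names `illus` and `mod` are assumptions chosen for illustration.

```r
# Hypothetical data, simulated purely for illustration (not from the workbook)
set.seed(1)
x = 1:20
y = 2 + 0.5 * x + rnorm(20, sd = 1)
illus = data.frame(x = x, y = y)   # "illus" is an assumed data frame name

mod = lm(y ~ x, data = illus)      # normal linear regression model of y on x

summary(mod)                       # parameter estimates and standard errors
cf = coef(mod)                     # vector of coefficients (intercept, slope)
fv = fitted(mod)                   # vector of fitted values
rs = resid(mod)                    # vector of residuals
```

The residuals from a model with an intercept sum to zero (up to rounding), which gives a quick check that `rs` holds what is expected.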

Generalised linear models

Log linear models

> model2 = glm(n ~ rc + cc, family = poisson)

> model3 = glm(n ~ age, family = poisson)

> model4 = glm(n ~ attitude + age + gender, family = poisson)

> model5 = glm(n ~ attitude*age + gender, family = poisson)

> model6 = glm(n ~ attitude*age*gender, family = poisson)
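As a sketch of how the first of these commands might be used, the code below fits the main-effects (independence) log-linear model to the 2 × 2 table of Example 15 (patients in clinical trial), which appears later in this workbook. The variable names `rc` and `cc` follow the listing above; the factor labels and the name `indep` are assumptions made for illustration.

```r
# Counts from Example 15 (clinical trial), entered cell by cell
n  = c(15, 4, 35, 46)
rc = factor(c("side-effects", "side-effects", "none", "none"))   # row classification
cc = factor(c("drug", "placebo", "drug", "placebo"))             # column classification

indep = glm(n ~ rc + cc, family = poisson)   # main effects only: independence model

fitted(indep)       # expected frequencies: (row total x column total) / grand total
deviance(indep)     # likelihood ratio statistic against the saturated model
indep$df.residual   # residual degrees of freedom, (2 - 1)(2 - 1) = 1
```

The fitted values from this model are the usual expected frequencies for a test of independence, here 9.5, 9.5, 40.5 and 40.5.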

Logistic regression models

> model7 = glm(propdead ~ dose + age, weights = groupsize, family = binomial)

> model8 = glm(propdead ~ dose*age, weights = groupsize, family = binomial)
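The same pattern can be sketched with the failures data of Example 27, which appears later in this workbook, entered by hand rather than downloaded. The names `failed`, `pressure`, `type`, `propfail` and `fit` are assumptions made for illustration; only the form of the `glm` call is taken from the listing above.

```r
# Data from Example 27 (failures of metal bars), entered by hand
failed    = c(3, 4, 10, 19, 27, 29, 32, 38, 39,   1, 3, 5, 8, 20, 26, 32, 32, 38)
groupsize = c(35, 30, 35, 40, 40, 35, 35, 40, 40, 40, 40, 35, 30, 45, 40, 40, 35, 40)
pressure  = rep(1:9, times = 2)
type      = factor(rep(1:2, each = 9))
propfail  = failed / groupsize      # observed proportion failing in each group

# Response is a proportion; the group sizes are supplied as weights
fit = glm(propfail ~ pressure + type, weights = groupsize, family = binomial)

coef(fit)          # intercept, slope for pressure, effect of type 2 relative to type 1
fit$df.residual    # 18 groups - 3 parameters = 15
```

When the response is given as a proportion, the `weights` argument must hold the group sizes so that R can recover the binomial counts.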

Downloading a data file from the module website directly into a data frame in R

> failframe = read.table(“

EXAMPLES – SINGLE CLASSIFICATIONS

Example 1 Eye colours: eye colours of males visiting an optician, in four categories

Colour / A B C D
Frequency observed / 89 66 60 85

Example 2 Prussian cavalry deaths: numbers of cavalry soldiers killed by horsekicks in each of 14 units of the Prussian army over a 20-year period (1875-1894).

(a) Numbers killed in each unit in each year: frequency table.

Number killed / 0 1 2 3 4 5 / Total
Frequency observed / 144 91 32 11 2 0 / 280

(b) Numbers killed in each unit in each year: raw data (in random order):

0 0 1 0 0 2 0 0 0 0 0 1 0 2 1 1 0 3 0 1 3 0 0 1 0 1 1 0 1 0 0 0

0 0 2 0 1 0 1 2 0 1 1 3 0 0 0 1 1 0 1 0 2 2 0 0 0 3 0 1 1 0 1 0

0 0 0 0 0 1 3 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 4 1

0 0 3 1 0 0 0 1 0 1 3 2 0 0 0 0 1 0 1 0 2 1 1 1 1 1 0 0 0 0 1 0

1 0 0 2 1 2 0 0 0 0 1 2 1 1 0 0 1 0 2 2 0 1 1 2 1 1 0 1 0 0 0 0

2 0 0 1 0 0 0 2 1 0 1 0 3 0 0 1 1 0 1 1 2 2 2 0 1 0 0 0 1 0 1 0

0 3 0 0 1 0 1 0 3 1 1 1 0 1 1 1 0 2 2 1 2 0 0 1 2 1 0 0 4 0 0 0

0 1 1 0 2 0 0 1 0 1 0 1 0 2 1 1 0 2 1 2 0 0 0 1 0 1 2 1 0 2 0 2

3 0 0 1 0 0 2 1 0 0 1 0 0 1 0 0 1 1 2 0 1 0 1 1

Year / 1875 ’76 ’77 ’78 ’79 ’80 ’81 ’82 ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ’94
Number killed / 3 5 7 9 10 18 6 14 11 9 5 11 15 6 11 17 12 15 8 4

Example 3 Fingerprints: type of print on the left first finger of fathers.

Type of print / Arch Small loop Large loop Composite Whorl / Total
Frequency observed / 41 139 53 28 68 / 329

Example 4 Political views: classified from Left to Right.

1 (very L) 2 3 4 (centre) 5 6 7 (very R) / Don’t know / Total
46 179 196 559 232 150 35 / 93 / 1490

Example 5 Leukaemia: cases of acute lymphatic leukaemia reported to the British Cancer Registration Scheme during a 15-year period, classified by month of clinical onset.

Jan Feb Mar Apr May Jun Jul Aug Sept Oct Nov Dec / Total
40 34 30 44 39 58 51 55 36 48 33 38 / 506

Example 6 Tosses of two coins: results of 1000 tosses of:

(a) normal coin:

Heads Tails / Total
473 527 / 1000

(b) biased coin:

Heads Tails / Total
679 321 / 1000

Example 7 Vehicle repair visits: frequency distribution of the number of repair visits for army vehicles.

Number of visits / 0 1 2 3 4 5 6 / Total
Frequency observed / 295 190 53 5 5 2 0 / 550

Example 8 Accidents to machinists: accidents to 414 machinists over a 3 month period.

Number of accidents / 0 1 2 3 4 5 / Total
Frequency observed / 296 74 26 8 4 6 / 414

Example 9 Gender in Swedish families of size ≥ 4: numbers of each gender among the first 4 children.

Number of girls / 0 1 2 3 4 / Total
Frequency observed / 246 875 1250 789 183 / 3343

Example 10 Gender of twins: numbers of each combination in twins born in Denmark (5 yr period)

Number of girls / 0 1 2 / Total
Frequency observed / 1620 1646 1413 / 4679

Example 11 Suicides: in France, classified by day of week.

Mon Tue Wed Thu Fri Sat Sun / Total
1001 1035 982 1033 905 737 894 / 6587

Example 12 Time of birth: times of birth (24-hr clock) in hospitals in Birmingham over a 4-week period.

Hour / 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 / Total
No. of births / 42 54 52 65 50 44 42 28 36 36 36 32 39 42 33 31 39 28 41 37 31 35 39 43 / 955

Example 13 Remembering stressful events: numbers of stressful events reported for previous 18 months, classified by ‘date’ – number of months before interview.

Date / 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 / Total
Number of cases / 15 11 14 17 5 11 10 4 8 10 7 9 11 3 6 1 1 4 / 147

EXAMPLES – TWOWAY CLASSIFICATIONS

Example 14 Mice: Numbers of mice bearing tumours in treated and control groups.

/ Treated / Control / Total
Tumours / 4 / 5 / 9
No tumours / 12 / 74 / 86
Total / 16 / 79 / 95

Example 15 Patients in clinical trial: 50 patients are given a drug (treated group) and 50 are given a placebo (control group); the table gives the numbers suffering particular side-effects.

Drug / Placebo / Total
Side-effects / 15 / 4 / 19
No side-effects / 35 / 46 / 81
Total / 50 / 50 / 100

Example 16 Tonsils: Relationship between nasal carrier status for Streptococcus pyogenes and size of tonsils among 1398 children aged 0-15 years.

/ Normal / Enlarged / Much enlarged / Total
Carriers / 19 / 29 / 24 / 72
Non-carriers / 497 / 560 / 269 / 1326
Total / 516 / 589 / 293 / 1398

Example 17 Bronze: Bronze Age tombs in four regions of Denmark: 267 tombs classified by the amount of bronze found and by geographical region.

/ 0-200g / 200-400g / >400g
The Islands / 24 / 25 / 35
NE Jutland / 22 / 13 / 19
NW Jutland / 11 / 9 / 58
S Jutland / 23 / 6 / 22

Example 18 Investors: A sample of 972 investors in the stock market are classified by age and “rate believed attainable”.

Rate believed attainable (columns) by investor age (rows)

Investor age / 0-5% / 6-10% / 11-15% / over 15% / Total
under 45 / 15 / 51 / 51 / 29 / 146
45-54 / 31 / 133 / 70 / 48 / 282
55-64 / 59 / 139 / 35 / 20 / 253
65 and over / 84 / 157 / 32 / 18 / 291
Total / 189 / 480 / 188 / 115 / 972

Example 19 Homework: Classification of 1019 children according to conditions under which homework was carried out (I), and teacher’s rating of homework (J); each scale is graded, ‘A’ being highest.

/ I = A / B / C / D / E / Total
J = A / 141 / 67 / 114 / 79 / 39 / 440
B / 131 / 66 / 143 / 72 / 35 / 447
C / 36 / 14 / 38 / 28 / 16 / 132
Total / 308 / 147 / 295 / 179 / 90 / 1019

Example 20 Education: 1125 married couples in the US, classified by the length of their education; I = category of wife, J = category of husband; 1 = up to 11 years, 2 = 12 years, 3 = 13-15 years, 4 = more than 15 years.

/ I = 1 / 2 / 3 / 4 / Total
J = 1 / 283 / 141 / 25 / 4 / 453
2 / 82 / 180 / 43 / 14 / 319
3 / 20 / 104 / 43 / 20 / 187
4 / 4 / 52 / 41 / 69 / 166
Total / 389 / 477 / 152 / 107 / 1125

Example 21 Eyesight: Unaided vision of 7477 women, aged 30-39; left (I) and right (J) eyes graded on a scale from 1 (best) to 4.

/ I = 1 / 2 / 3 / 4 / Total
J = 1 / 1520 / 266 / 124 / 66 / 1976
2 / 234 / 1512 / 432 / 78 / 2256
3 / 117 / 362 / 1772 / 205 / 2456
4 / 36 / 82 / 179 / 492 / 789
Total / 1907 / 2222 / 2507 / 841 / 7477

Example 22 Fingerprints: Fingerprints of the right hand classified by the numbers of whorls (I) and small loops (J).

/ J = 0 / 1 / 2 / 3 / 4 / 5 / Total
I = 0 / 78 / 144 / 204 / 211 / 179 / 45 / 861
1 / 106 / 153 / 126 / 80 / 32 / - / 497
2 / 130 / 92 / 55 / 15 / - / - / 292
3 / 125 / 38 / 7 / - / - / - / 170
4 / 104 / 26 / - / - / - / - / 130
5 / 50 / - / - / - / - / - / 50
Total / 593 / 453 / 392 / 306 / 211 / 45 / 2000

Example 23 Dentists: The table shows the numbers of candidates, by year and sex, passing the final examination of the Royal Danish School of Dentistry.

/ 1962 / 1963 / 1964 / 1965 / 1966 / 1967 / 1968 / 1969 / Total
Men / 57 / 51 / 66 / 54 / 62 / 54 / 68 / 80 / 492
Women / 40 / 43 / 38 / 32 / 30 / 36 / 64 / 55 / 338
Total / 97 / 94 / 104 / 86 / 92 / 90 / 132 / 135 / 830

Example 24 Social mobility: The data are from an inter-generational social mobility survey in the UK in the 1950s. The categories for father and son are:

A: professional, high administrative

B: managerial, executive, high supervisory

C: low inspectional, supervisory

D: routine non-manual, skilled manual

E: semi-skilled and unskilled manual

/ Son = A / B / C / D / E
Father = A / 50 / 45 / 8 / 18 / 8
B / 28 / 174 / 84 / 154 / 55
C / 11 / 78 / 110 / 223 / 96
D / 14 / 150 / 185 / 714 / 447
E / 0 / 42 / 72 / 320 / 441

Example 25 Grouse: Numbers of young raised by samples of female grouse, treated against worms (T) and untreated (C), in each of 3 years; the figures in brackets show the numbers of mothers in each sample.

/ 1981 / 1982 / 1983 / Total
T / 54 (8) / 30 (13) / 68 (15) / 152 (36)
C / 34 (14) / 4 (11) / 45 (22) / 83 (47)
Total / 88 (22) / 34 (24) / 113 (37) / 235 (83)

Example 26 Mortality: Numbers of deaths, nx, in a year in groups of people of the stated ages (x, years) and with the given exposure (Ex). Data available on module website in file deaths.txt

Age x / 60 / 61 / 62 / 63 / 64 / 65 / 66 / 67 / 68 / 69
Exposure Ex / 1029 / 1091 / 1075 / 996 / 963 / 1029 / 1108 / 1130 / 1147 / 1037
No. of deaths nx / 14 / 14 / 18 / 20 / 19 / 21 / 29 / 26 / 30 / 23

Example 27 Failures: Numbers of test specimens (metal bars) which failed, in groups of specimens which were subjected to the stated measures of pressure. The group sizes are as stated. The groups contained bars of one of two types, as stated. Data available on module website in file faildata.txt

No. failed / 3 / 4 / 10 / 19 / 27 / 29 / 32 / 38 / 39 / 1 / 3 / 5 / 8 / 20 / 26 / 32 / 32 / 38
Group size / 35 / 30 / 35 / 40 / 40 / 35 / 35 / 40 / 40 / 40 / 40 / 35 / 30 / 45 / 40 / 40 / 35 / 40
Pressure / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8 / 9
Type / 1 / 1 / 1 / 1 / 1 / 1 / 1 / 1 / 1 / 2 / 2 / 2 / 2 / 2 / 2 / 2 / 2 / 2

EXAMPLES – THREEWAY CLASSIFICATIONS

Example 28 Cuttings: Survival (numbers dead and alive) of root cuttings in relation to length of cutting (long/short) and time of planting (at once/in the spring). 240 cuttings were used for each combination of length of cutting and time of planting.

Time of planting / At once / At once / In spring / In spring
Length of cutting / Long / Short / Long / Short
Survival: Dead / 84 / 133 / 156 / 209
Survival: Alive / 156 / 107 / 84 / 31
Total / 240 / 240 / 240 / 240

Example 29 Lizards: Perch preferences for two species of lizard, Anolis sagrei and Anolis distichus (‘high’ is perch height > 5 feet ; ‘large’ is ‘diameter > 2.5 inches’).

Species / A. sagrei / A. sagrei / A. distichus / A. distichus
Perch diameter / small / large / small / large
Perch height: high / 32 / 11 / 61 / 41
Perch height: low / 86 / 35 / 73 / 70

Example 30 Accidents: Numbers of traffic accidents in Sweden for 18 week periods in each of two years, classified according to type of road and speed limit.

/ Year 1 / Year 1 / Year 2 / Year 2
/ Highways / Other roads / Highways / Other roads
Speed limit / 8 / 42 / 11 / 37
No speed limit / 57 / 106 / 45 / 69

Example 31 Coalminers: classified by age, breathlessness, and wheeze.

/ Breathlessness / Breathlessness / No breathlessness / No breathlessness /
Age group in years / Wheeze / No wheeze / Wheeze / No wheeze / Total

20-24 / 9 / 7 / 95 / 1841 / 1952
25-29 / 23 / 9 / 105 / 1654 / 1791
30-34 / 54 / 19 / 177 / 1863 / 2113
35-39 / 121 / 48 / 257 / 2367 / 2783
40-44 / 169 / 54 / 273 / 1778 / 2274
45-49 / 269 / 88 / 324 / 1712 / 2393
50-54 / 404 / 117 / 245 / 1324 / 2090
55-59 / 406 / 152 / 225 / 967 / 1750
60-64 / 372 / 106 / 132 / 526 / 1136
Total / 1827 / 600 / 1833 / 14022 / 18282

Example 32 Diabetics: The data arise from a random sample of 96 diabetic patients, each being classified according to (H) family history of diabetes – yes/no, (D) dependence on insulin injections – yes/no, and (A) age at onset – A1 (age < 35), A2 (35 ≤ age < 50), A3 (age ≥ 50).

Family history of diabetes / yes / yes / no / no
Dependence on insulin injections / yes / no / yes / no
Age at onset: A1 / 10 / 3 / 30 / 4
Age at onset: A2 / 6 / 32 / 8 / 40
Age at onset: A3 / 8 / 39 / 10 / 56

§1 INTRODUCTION

Categorical data arise whenever counts (as opposed to measurements) are made. Subjects (sample items) are classified as belonging to one of a set of categories and the numbers in the categories (the frequencies) are recorded. [The set of categories may be finite or countable.]

One of the simplest illustrations involves coin tossing. At each trial we observe an outcome which is in one of the two categories “head” or “tail”. At the end of a series of such trials, we count the total numbers of heads and tails, giving us the frequencies of the two categories.

See Examples 1 – 32.

We usually think of the category as the explanatory variable and the distribution of the items among the categories as the response variable (see the eye colours data – Example 1).

Sometimes data can be classified in more than one way (see the Prussian cavalry deaths data – Example 2).

Classifications can themselves be classified as

(1) qualitative (no structural relation between categories) (e.g. Examples 1, 6)

(2) ordered (e.g. Examples 4, 5)

(3) quantitative (e.g. Examples 2(a), 7, 8)

A frequency distribution is a summary of raw data on some classification variable (which may be of any type).

Classifications with only two categories (heads/tails, male/female; yes/no; on/off) are called dichotomous or binary (e.g. Example 6). In this case, the distinctions (1) – (3) above make no difference to the analysis.

A twoway classification arises when each item may be classified according to two separate criteria. We get a twoway table or crosstabulation (e.g. Examples 14, 15, 16)

► Illustration 1.1

Suppose that 100 deaths are classified by

Criterion A: smoking status of deceased (smoked, did not smoke), and

Criterion B: cause of death (lung cancer, other)

B: Cause of death (columns) by A: Smoking status (rows)

/ Cancer / Other
Smoker / 30 / 20
Not smoker / 15 / 35

Was each of the 100 deaths randomly selected from some larger group and then classified according to each of the two binary criteria?

Or were 50 “smoker” deaths and 50 “not smoker” deaths examined and then classified according to cause of death? Or was something else done?

Each classification can be either an explanatory variable or a response variable, depending on how the data have arisen. We must consider at least one of the classification variables to be a response variable. If the numbers in the categories were chosen in advance, the classification variable is an explanatory variable.

We will have to consider how the analysis and the interpretation of results will depend on which of A, B are response variables.
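A minimal sketch of how the table in Illustration 1.1 might be entered and summarised in R is given below. It assumes, for illustration only, that smoking status is treated as the explanatory variable and cause of death as the response, so that proportions are taken within each smoking group; the object names `deaths` and `within_smoking` are hypothetical.

```r
# Illustration 1.1: 100 deaths classified by smoking status (A) and cause of death (B)
deaths = matrix(c(30, 20,
                  15, 35),
                nrow = 2, byrow = TRUE,
                dimnames = list(smoking = c("Smoker", "Not smoker"),
                                cause   = c("Cancer", "Other")))

addmargins(deaths)                                # table with row and column totals

# Proportions within each smoking group (each row sums to 1):
# appropriate if the row totals (50 and 50) were fixed in advance
within_smoking = prop.table(deaths, margin = 1)
within_smoking
```

If instead cause of death had been fixed in advance, `margin = 2` would give the corresponding proportions within each cause.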

The sources of categorical data include:

• Bernoulli trials (→ binomial data)

– number of successes in n independent, identical trials ~ binomial

e.g. Example 6: NH ~ B(1000, p), where p = P(H)

• Trials with more than two possible outcomes (→ multinomial data)

– numbers in categories in n independent, identical trials ~ multinomial

e.g. Example 3: N = (N1, N2, …, N5) ~ Mn(329, p), where pi = P(type i), i = 1, 2, …, 5

• Poisson process (→ Poisson data)

This is of fundamental importance in modelling categorical data. It is the continuous-time analogue of Bernoulli trials.
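The binomial and multinomial examples just mentioned can be sketched in R as follows. The data are those of Example 6(a) and Example 3; the object names (`p_hat`, `se`, `prints`, `p_type`) are assumptions chosen for illustration, and the standard error shown is the usual binomial one rather than anything specific to this workbook.

```r
# Example 6(a): 473 heads in 1000 tosses of the normal coin
heads = 473
n = 1000
p_hat = heads / n                       # estimate of p = P(H)
se = sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error under the binomial model

# Example 3: type of fingerprint for 329 fathers
prints = c(Arch = 41, "Small loop" = 139, "Large loop" = 53,
           Composite = 28, Whorl = 68)
p_type = prints / sum(prints)           # estimates of p_i = P(type i), i = 1, ..., 5
round(p_type, 3)
```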

§2 POISSON PROCESS AND ASSOCIATED DISTRIBUTIONS

2.1 Bernoulli trials and related distributions

Bernoulli trials – a sequence of independent trials, each with two outcomes.

We consider identical trials: each has the same probability of success, say P(success) = p.

Number of successes

Sn, the number of successes in the first n trials, has the binomial distribution B(n, p). The distribution has pf (probability function)

\[ P(S_n = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \ldots, n, \qquad q = 1 - p, \]

mean np, variance npq.
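The mean and variance of the binomial distribution can be checked numerically. The sketch below uses illustrative values n = 10 and p = 0.3, which are assumptions and not taken from the workbook.

```r
n = 10; p = 0.3; q = 1 - p            # illustrative values only
x = 0:n

pf = dbinom(x, size = n, prob = p)    # P(Sn = x) for x = 0, 1, ..., n

mean_Sn = sum(x * pf)                 # mean computed directly from the pf
var_Sn  = sum((x - mean_Sn)^2 * pf)   # variance computed directly from the pf

c(mean_Sn, n * p)                     # the two values agree
c(var_Sn, n * p * q)                  # the two values agree
```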

Conditional distribution of success ‘times’

(1) Given that there is one success in the first n trials, when did it occur?

Since the trials are i.i.d., each possible time (x = 1, 2, … , n) has the same probability. The appropriate distribution is therefore uniform on {1, 2, …, n}.

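Formally, for x = 1, 2, …, n and writing q = 1 − p:

\[ P(\text{success at trial } x \mid S_n = 1) = \frac{p\,q^{n-1}}{n\,p\,q^{n-1}} = \frac{1}{n}. \]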

For example, given that there is exactly 1 success in the first 5 trials, the probability that this success occurs at the first trial (or indeed any specified trial) is 0.2.
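This conditional probability can be verified directly in R. In the sketch below the value p = 0.3 is an arbitrary assumption; the result does not depend on it.

```r
n = 5
p = 0.3                  # arbitrary illustrative value; the answer does not depend on p
q = 1 - p

# P(success at trial x and failures at the other n - 1 trials), for x = 1, ..., n
joint = rep(p * q^(n - 1), n)

# Divide by P(Sn = 1) to condition on exactly one success
cond = joint / dbinom(1, size = n, prob = p)
cond                     # each entry equals 1/n
```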

(2) Given that there are m successes in the first n trials, when did they occur?

Each possible set of values (1 ≤ t1 < t2 < … < tm ≤ n) has the same probability

\[ \frac{p^m q^{n-m}}{\binom{n}{m} p^m q^{n-m}} = \frac{1}{\binom{n}{m}}, \qquad q = 1 - p, \]

so again the (joint) distribution is uniform.

For example, given that there are 2 successes in the first 5 trials, the probability that these successes occur at the first and last trials (or indeed any two specified trials) is 0.1.
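The value 0.1 can be checked both by counting the possible pairs of success times and by simulation. The sketch below is illustrative only: the success probability 0.3, the number of simulated sequences and the object names are all assumptions.

```r
n = 5; m = 2

sets = combn(n, m)       # each column is one possible pair (t1, t2) with t1 < t2
ncol(sets)               # number of equally likely pairs
exact = 1 / ncol(sets)   # conditional probability of any one specified pair

# Simulation check: Bernoulli trials with p = 0.3 (arbitrary), 20000 sequences
set.seed(1)
sims = matrix(rbinom(n * 20000, size = 1, prob = 0.3), ncol = n)
keep = sims[rowSums(sims) == m, ]                 # sequences with exactly 2 successes
approx = mean(keep[, 1] == 1 & keep[, n] == 1)    # successes at first and last trials
c(exact, approx)
```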

2.2 The Poisson process and related distributions

The Poisson process in one dimension (time) at constant rate/intensity can be thought of as the continuous-time analogue of independent, identical Bernoulli trials.

We can think informally of a Poisson process as a series of events occurring one after another through time “at random” and at a constant rate. The rate of events occurring is also called the intensity of the process, denoted λ.

Some things you need to know about the Poisson process and related Poisson distributions

The number of events which occur in a time interval of length t, denoted Nt, has a Poisson distribution with mean λt, so

E[Nt] = V[Nt] = λt. We write Nt ~ Pn(λt).

The numbers of events which occur in non-overlapping time intervals are independent.

P(Ns+t = 0) = P(Ns = 0)P(Nt = 0) []
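These properties can be checked numerically. The sketch below writes `lambda` for the intensity of the process and uses illustrative values (intensity 2, interval lengths 0.5 and 1.5), none of which come from the workbook.

```r
lambda = 2                      # illustrative intensity
len_s = 0.5; len_t = 1.5        # illustrative interval lengths s and t

# Probability of no events in intervals of length s, t and s + t
p_s  = dpois(0, lambda * len_s)
p_t  = dpois(0, lambda * len_t)
p_st = dpois(0, lambda * (len_s + len_t))
all.equal(p_st, p_s * p_t)      # the product rule for "no events" holds

# Simulated counts in an interval of length t: mean and variance both near lambda * t
set.seed(1)
counts = rpois(10000, lambda * len_t)
c(mean(counts), var(counts), lambda * len_t)
```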