DEPARTMENT OF ACTUARIAL MATHEMATICS AND STATISTICS
SCHOOL OF MATHEMATICAL AND COMPUTER SCIENCES
Name: ……………………………
Lecturer: Roger J Gray
April 2007
CONTENTS
Preface
Aims of module
Summary
Content/structure
Assessment
Module website
Timetable
Reading
Background - computing
Review of selected R material
Examples
§1 Introduction
§2 Poisson process and associated distributions
§3 Single classifications
§4 Twoway classifications
§5 A brief introduction to generalised linear models
§6 Models for single classifications
§7 Logistic regression
§8 Models for twoway and threeway classifications
Tutorial and lab questions
A special 8-page document called R reference sheets will be given out, as will a separate handout on GLMs – theory and examples. Both documents are also available for downloading from the module website.
Preface
Aims of module
to present theory and techniques for the analysis of categorical data
to develop students’ abilities in understanding and solving practical statistical problems involving categorical data
to enable students to learn how to choose appropriate techniques, to analyse categorical data, and to present results
Summary
The module is based on this workbook, which contains necessary background and theory, data sets, and worked examples. Lecture time will be used to illustrate some important points in the workbook and to guide students through worked examples. Practical applications will be emphasised throughout using R (and very occasionally Minitab).
Students will be expected to learn some of the material for themselves, both from studying the content of the workbook and from doing the practical work contained in the tutorials. Not every part of the work will be taught as such. There will be regular computer labs in weeks 2 – 6; RJG will be present, as far as possible and as required, at one or more of these labs.
Content/structure
Weeks 1 – 5: there will be about 15 lectures over the first 5 weeks of the term, plus computer labs as required. Exact details will be given during class (and by email) in good time.
§1 Introduction
§2 Poisson process and associated distributions
2.1 Bernoulli trials and related distributions
2.2 Poisson process and related distributions
2.3 Inference for the Poisson distribution
2.4 Dispersion and LR statistics and tests for Poisson data
§3 Single classifications
3.1 Binary classifications
3.2 Qualitative categories
3.3 Ordered categories
3.4 Goodness-of-fit tests for frequency distributions
3.5 Residuals
► Major illustration 1 – publish and be modelled – available from the website
► Major illustration 2 – birds in hedges – available from the website
§4 Twoway classifications
4.1 Factors and responses
4.2 Distribution theory and tests for r × s tables
4.3 The 2 × 2 table
4.4 Log odds, collapsing tables, interactions
§5 A brief introduction to generalised linear models (GLMs)
§6 Models for single classifications
6.1 Single classifications – trend models
6.2 Single classifications – including a “deterministic denominator”
► Major illustration 3 – onset of leukaemia cyclic model – available from the website
§7 Logistic regression
§8 Models for twoway and threeway classifications
8.1 Log-linear models for twoway classifications
8.2 Twoway classifications – including a “deterministic denominator”
8.3 Log-linear models for threeway classifications
8.4 Hierarchic log-linear models
► Major illustration 4 – Project Spring 2006 numbers and proportions of policies – available from the website
Assessment
The module is the second of a linked pair. The pair of modules is assessed on the basis of three projects, one of which relates to this module. It will be given out at the end of week 4, for return at the end of week 6.
Module website
There is a link from the Department’s main Information for Current Students page to an introductory page. There is a link from this page to the detailed module site, which you should bookmark.
The webpages will be updated as required. Data for the project, and for a few of the tutorial questions, will be accessible from the site.
Timetable
See third year timetable issued by the Department. There will also be an announcement at the first meeting of the class. More slots may be reserved than will actually be used.
Reading
The workbook (plus separate material handed out or made available on the website) is intended to be sufficiently comprehensive on its own as support material. However, if you want to read more widely now or later, the following texts are suggested.
Everitt, BS (1992) The Analysis of Contingency Tables, Chapman and Hall
Fienberg, SE (1991) The Analysis of Cross-Classified Categorical Data, MIT
Plackett, RL (1981) The Analysis of Categorical Data (2nd ed.), Griffin [1]
Venables, WN, & Ripley, BD Modern Applied Statistics with S-Plus, Springer [2]
[1] This classic text is out of print, but a number of copies are held by the library.
[2] The main S-Plus reference book.
Background - computing
Review of R
An 8-page document called R reference sheets is available as a handout and also on the module website. The material on linear modelling is repeated here, together with the recommended code for downloading a data set from the module website.
Linear models
> model1 = lm(y ~ x)                          # normal linear regression model of y on x
> model2 = lm(y1 ~ x1 + x2, data = illus3)    # normal linear regression model of y1 on x1 and x2,
                                              # data held in data frame “illus3”
Information from fitted models
> summary(mod3)                       # displays parameter estimates and st. errors, deviance, and
                                      # correlation matrix
> summary.aov(mod5)                   # displays the analysis of variance for the fitted model
> fitted(mod4)                        # fitted values in fitted model
> resid(mod4)                         # residuals in fitted model
> coef(mod4)                          # coefficients in fitted model
> f4 = fitted(mod4)                   # vectors containing fitted values, residuals,
> r2 = resid(mod2)                    # coefficients in fitted model
> c3 = coef(mod3)
> plot(fitted(mod3), resid(mod3))     # plot of residuals against fitted values
> abline(mod3)                        # adds fitted line to current data plot
> abline(h=0, lty=2)                  # adds a horizontal dashed line at y = 0 to current data plot
> plot(model3)                        # supplies 4 plots associated with the fitted model “model3”: click on the
                                      # command window (and then return) each time to get each plot;
                                      # 1 residuals v fitted, 2 normal Q-Q plot,
                                      # 3 scale-location plot, and 4 Cook’s distance plot
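As a minimal sketch of how the linear-model commands fit together, the following fits a straight line to simulated data and extracts the usual quantities. The data frame name illus, the seed, and the true intercept and slope (3 and 2) are assumptions made purely for illustration; they are not data from the module.

```r
# Simulated data (an assumption for illustration only, not module data)
set.seed(1)
x = 1:20
y = 3 + 2 * x + rnorm(20)            # true line y = 3 + 2x plus N(0,1) noise
illus = data.frame(x = x, y = y)

# Fit the normal linear regression model of y on x
mod = lm(y ~ x, data = illus)

# Extract information from the fitted model
est = coef(mod)                      # intercept and slope estimates
f = fitted(mod)                      # fitted values
r = resid(mod)                       # residuals (observed minus fitted)
print(summary(mod))
```

Following this with plot(f, r) and abline(h=0, lty=2) would give the plot of residuals against fitted values with a dashed reference line at zero.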
Generalised linear models
Log linear models
> model2 = glm(n ~ rc + cc, family = poisson)
> model3 = glm(n ~ age, family = poisson)
> model4 = glm(n ~ attitude + age + gender, family = poisson)
> model5 = glm(n ~ attitude*age + gender, family = poisson)
> model6 = glm(n ~ attitude*age*gender, family = poisson)
Logistic regression models
> model7 = glm(propdead ~ dose + age, weights = groupsize, family = binomial)
> model8 = glm(propdead ~ dose*age, weights = groupsize, family = binomial)
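The two kinds of model can be tried on data printed later in this workbook. The sketch below fits a log-linear model of the form n ~ rc + cc to the counts in Example 15, and a logistic regression of the form used for model7 to the failure data in Example 27. The variable names (rc, cc, propfail, mod_loglin, mod_logit) are chosen here for illustration and are not taken from the R reference sheets.

```r
# Log-linear model: the four counts from Example 15 (patients in clinical trial)
n  = c(15, 35, 4, 46)
rc = factor(c("side-effects", "none", "side-effects", "none"))   # row classification
cc = factor(c("drug", "drug", "placebo", "placebo"))             # column classification
mod_loglin = glm(n ~ rc + cc, family = poisson)
fit_counts = fitted(mod_loglin)      # fitted counts under the main-effects model

# Logistic regression: data transcribed from Example 27 (failures)
failed    = c(3, 4, 10, 19, 27, 29, 32, 38, 39, 1, 3, 5, 8, 20, 26, 32, 32, 38)
groupsize = c(35, 30, 35, 40, 40, 35, 35, 40, 40, 40, 40, 35, 30, 45, 40, 40, 35, 40)
pressure  = rep(1:9, times = 2)
type      = factor(rep(1:2, each = 9))
propfail  = failed / groupsize       # observed proportion failing in each group

# Response is a proportion, so the group sizes are supplied as weights
mod_logit = glm(propfail ~ pressure + type, weights = groupsize, family = binomial)
print(coef(mod_logit))
```

With only main effects in the log-linear model, each fitted count equals (row total × column total) / grand total, for example 19 × 50 / 100 = 9.5 for the patients with side-effects on the drug.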
Downloading a data file from the module website directly into a data frame in R
> failframe = read.table(“
EXAMPLES – SINGLE CLASSIFICATIONS
Example 1 Eye colours: eye colours of males visiting an optician, in four categories
Colour                 A    B    C    D
Frequency observed    89   66   60   85
Example 2 Prussian cavalry deaths: numbers of cavalry soldiers killed by horsekicks in each of 14 units of the Prussian army over a 20-year period (1875-1894).
(a) Numbers killed in each unit in each year: frequency table.
Number killed           0    1    2    3    4    5   Total
Frequency observed    144   91   32   11    2    0     280
(b) Numbers killed in each unit in each year: raw data (in random order):
0 0 1 0 0 2 0 0 0 0 0 1 0 2 1 1 0 3 0 1 3 0 0 1 0 1 1 0 1 0 0 0
0 0 2 0 1 0 1 2 0 1 1 3 0 0 0 1 1 0 1 0 2 2 0 0 0 3 0 1 1 0 1 0
0 0 0 0 0 1 3 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 1 0 1 0 0 0 0 0 4 1
0 0 3 1 0 0 0 1 0 1 3 2 0 0 0 0 1 0 1 0 2 1 1 1 1 1 0 0 0 0 1 0
1 0 0 2 1 2 0 0 0 0 1 2 1 1 0 0 1 0 2 2 0 1 1 2 1 1 0 1 0 0 0 0
2 0 0 1 0 0 0 2 1 0 1 0 3 0 0 1 1 0 1 1 2 2 2 0 1 0 0 0 1 0 1 0
0 3 0 0 1 0 1 0 3 1 1 1 0 1 1 1 0 2 2 1 2 0 0 1 2 1 0 0 4 0 0 0
0 1 1 0 2 0 0 1 0 1 0 1 0 2 1 1 0 2 1 2 0 0 0 1 0 1 2 1 0 2 0 2
3 0 0 1 0 0 2 1 0 0 1 0 0 1 0 0 1 1 2 0 1 0 1 1
(c) Total numbers killed each year.
1875  ’76  ’77  ’78  ’79  ’80  ’81  ’82  ’83  ’84  ’85  ’86  ’87  ’88  ’89  ’90  ’91  ’92  ’93  ’94
   3    5    7    9   10   18    6   14   11    9    5   11   15    6   11   17   12   15    8    4
Example 3 Fingerprints: type of print on the left first finger of fathers.
Type of print         Arch   Small loop   Large loop   Composite   Whorl   Total
Frequency observed      41          139           53          28      68     329
Example 4 Political views: classified from Left to Right.
1 (very L)     2     3   4 (centre)     5     6   7 (very R)   Don’t know   Total
        46   179   196          559   232   150           35           93    1490
Example 5 Leukaemia: cases of acute lymphatic leukaemia reported to the British Cancer Registration Scheme during a 15-year period, classified by month of clinical onset.
Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sept  Oct  Nov  Dec   Total
 40   34   30   44   39   58   51   55    36   48   33   38     506
Example 6 Tosses of two coins: results of 1000 tosses of:
(a) normal coin:               (b) biased coin:
Heads   Tails   Total          Heads   Tails   Total
  473     527    1000            679     321    1000
Example 7 Vehicle repair visits: frequency distribution of the number of repair visits for army vehicles.
Number of visits        0    1    2    3    4    5    6   Total
Frequency observed    295  190   53    5    5    2    0     550
Example 8 Accidents to machinists: accidents to 414 machinists over a 3 month period.
Number of accidents     0    1    2    3    4    5   Total
Frequency observed    296   74   26    8    4    6     414
Example 9 Gender in Swedish families of size 4: numbers of each gender among the first 4 children.
Number of girls         0     1     2     3     4   Total
Frequency observed    246   875  1250   789   183    3343
Example 10 Gender of twins: numbers of each combination in twins born in Denmark (5 yr period)
Number of girls          0     1     2   Total
Frequency observed    1620  1646  1413    4679
Example 11 Suicides: in France, classified by day of week.
 Mon   Tue   Wed   Thu   Fri   Sat   Sun   Total
1001  1035   982  1033   905   737   894    6587
Example 12 Time of birth: times of birth (24-hr clock) in hospitals in Birmingham over a 4-week period
Hour             1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24   Total
No. of births   42  54  52  65  50  44  42  28  36  36  36  32  39  42  33  31  39  28  41  37  31  35  39  43     955
Example 13 Remembering stressful events: numbers of stressful events reported for previous 18 months, classified by ‘date’ – number of months before interview.
Date               1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18   Total
Number of cases   15  11  14  17   5  11  10   4   8  10   7   9  11   3   6   1   1   4     147
EXAMPLES – TWOWAY CLASSIFICATIONS
Example 14 Mice: Numbers of mice bearing tumours in treated and control groups.
              Treated   Control   Total
Tumours             4         5       9
No tumours         12        74      86
Total              16        79      95
Example 15 Patients in clinical trial: 50 patients are given a drug (treated group) and 50 are given a placebo (control group); the table gives the numbers suffering particular side-effects.
                   Drug   Placebo   Total
Side-effects         15         4      19
No side-effects      35        46      81
Total                50        50     100
Example 16 Tonsils: Relationship between nasal carrier status for Streptococcus pyogenes and size of tonsils among 1398 children aged 0-15 years.
                Normal   Enlarged   Much enlarged   Total
Carriers            19         29              24      72
Non-carriers       497        560             269    1326
Total              516        589             293    1398
Example 17 Bronze: Bronze Age tombs in four regions of Denmark: 267 tombs classified by the amount of bronze found and by geographical region.
               0-200g   200-400g   >400g
The Islands        24         25      35
NE Jutland         22         13      19
NW Jutland         11          9      58
S Jutland          23          6      22
Example 18 Investors: A sample of 972 investors in the stock market are classified by age and “rate believed attainable”.
                              Rate believed attainable
                              0-5%   6-10%   11-15%   over 15%   Total
Investor age   under 45         15      51       51         29     146
               45-54            31     133       70         48     282
               55-64            59     139       35         20     253
               65 and over      84     157       32         18     291
               Total           189     480      188        115     972
Example 19 Homework: Classification of 1019 children according to conditions under which homework was carried out (I), and teacher’s rating of homework (J); each scale is graded, ‘A’ being highest.
         I = A     B     C     D     E   Total
J = A      141    67   114    79    39     440
    B      131    66   143    72    35     447
    C       36    14    38    28    16     132
Total      308   147   295   179    90    1019
Example 20 Education: 1125 married couples in the US, classified by the length of their education; I = category of wife, J = category of husband; 1 = up to 11 years, 2 = 12 years, 3 = 13-15 years, 4 = more than 15 years.
         I = 1     2     3     4   Total
J = 1      283   141    25     4     453
    2       82   180    43    14     319
    3       20   104    43    20     187
    4        4    52    41    69     166
Total      389   477   152   107    1125
Example 21 Eyesight: Unaided vision of 7477 women, aged 30-39; left (I) and right (J) eyes graded on a scale from 1 (best) to 4.
         I = 1      2      3      4   Total
J = 1     1520    266    124     66    1976
    2      234   1512    432     78    2256
    3      117    362   1772    205    2456
    4       36     82    179    492     789
Total     1907   2222   2507    841    7477
Example 22 Fingerprints: Fingerprints of the right hand classified by the numbers of whorls (I) and small loops (J).
         J = 0     1     2     3     4     5   Total
I = 0       78   144   204   211   179    45     861
    1      106   153   126    80    32     -     497
    2      130    92    55    15     -     -     292
    3      125    38     7     -     -     -     170
    4      104    26     -     -     -     -     130
    5       50     -     -     -     -     -      50
Total      593   453   392   306   211    45    2000
Example 23 Dentists: The table shows the numbers of candidates, by year and sex, passing the final examination of the Royal Danish School of Dentistry.
         1962   1963   1964   1965   1966   1967   1968   1969   Total
Men        57     51     66     54     62     54     68     80     492
Women      40     43     38     32     30     36     64     55     338
Total      97     94    104     86     92     90    132    135     830
Example 24 Social mobility: The data are from an inter-generational social mobility survey in the UK in the 1950s. The categories for father and son are:
A: professional, high administrative
B: managerial, executive, high supervisory
C: low inspectional, supervisory
D: routine non-manual, skilled manual
E: semi-skilled and unskilled manual
               Son
               A     B     C     D     E
Father   A    50    45     8    18     8
         B    28   174    84   154    55
         C    11    78   110   223    96
         D    14   150   185   714   447
         E     0    42    72   320   441
Example 25 Grouse: Numbers of young raised by samples of female grouse, treated against worms (T) and untreated (C), in each of 3 years; the figures in brackets show the numbers of mothers in each sample.
            1981      1982       1983       Total
T         54 (8)   30 (13)    68 (15)    152 (36)
C        34 (14)    4 (11)    45 (22)     83 (47)
Total    88 (22)   34 (24)   113 (37)    235 (83)
Example 26 Mortality: Numbers of deaths, n_x, in a year in groups of people of the stated ages (x, years) and with the given exposure (E_x). Data available on module website in file deaths.txt
Age x                 60     61     62     63     64     65     66     67     68     69
Exposure E_x        1029   1091   1075    996    963   1029   1108   1130   1147   1037
No. of deaths n_x     14     14     18     20     19     21     29     26     30     23
Example 27 Failures: Numbers of test specimens (metal bars) which failed, in groups of specimens which were subjected to the stated measures of pressure. The group sizes are as stated. The groups contained bars of one of two types, as stated. Data available on module website in file faildata.txt
No. failed     3   4  10  19  27  29  32  38  39   1   3   5   8  20  26  32  32  38
Group size    35  30  35  40  40  35  35  40  40  40  40  35  30  45  40  40  35  40
Pressure       1   2   3   4   5   6   7   8   9   1   2   3   4   5   6   7   8   9
Type           1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2
EXAMPLES – THREEWAY CLASSIFICATIONS
Example 28 Cuttings: Survival (numbers dead and alive) of root cuttings in relation to length of cutting (long/short) and time of planting (at once/in the spring). 240 cuttings were used for each combination of length of cutting and time of planting.
Time of planting         At once          In spring
Length of cutting      Long   Short      Long   Short
Survival   Dead          84     133       156     209
           Alive        156     107        84      31
Total                   240     240       240     240
Example 29 Lizards: Perch preferences for two species of lizard, Anolis sagrei and Anolis distichus (‘high’ is ‘perch height > 5 feet’; ‘large’ is ‘diameter > 2.5 inches’).
                           A. sagrei        A. distichus
perch diameter           small   large     small   large
perch height   high         32      11        61      41
               low          86      35        73      70
Example 30 Accidents: Numbers of traffic accidents in Sweden for 18 week periods in each of two years, classified according to type of road and speed limit.
                         Year 1                     Year 2
                  Highways   Other roads     Highways   Other roads
Speed limit              8            42           11            37
No speed limit          57           106           45            69
Example 31 Coalminers: classified by age, breathlessness, and wheeze.
Age group     Breathlessness         No breathlessness      Total
in years      Wheeze   No wheeze     Wheeze   No wheeze
20-24              9           7         95        1841      1952
25-29             23           9        105        1654      1791
30-34             54          19        177        1863      2113
35-39            121          48        257        2367      2783
40-44            169          54        273        1778      2274
45-49            269          88        324        1712      2393
50-54            404         117        245        1324      2090
55-59            406         152        225         967      1750
60-64            372         106        132         526      1136
Total           1827         600       1833       14022     18282
Example 32 Diabetics: The data arise from a random sample of 96 diabetic patients, each being classified according to (H) family history of diabetes – yes/no, (D) dependence on insulin injections – yes/no, and (A) age at onset – A1 (age < 35), A2 (35 ≤ age < 50), A3 (age ≥ 50).
Family history of diabetes               yes             no
Dependence on insulin injections      yes    no       yes    no
Age at onset   A1                      10     3        30     4
               A2                       6    32         8    40
               A3                       8    39        10    56
§1 INTRODUCTION
Categorical data arise whenever counts (as opposed to measurements) are made. Subjects (sample items) are classified as belonging to one of a set of categories and the numbers in the categories (the frequencies) are recorded. [The set of categories may be finite or countable.]
One of the simplest illustrations involves coin tossing. At each trial we observe an outcome which is in one of the two categories “head” or “tail”. At the end of a series of such trials, we count the total numbers of heads and tails, giving us the frequencies of the two categories.
See Examples (pp 4 – 9).
We usually think of the category as the explanatory variable and the distribution of the items among the categories as the response variable (see the eye colours data – Example 1).
Sometimes data can be classified in more than one way (see the Prussian cavalry deaths data – Example 2).
Classifications can themselves be classified as
(1) qualitative (no structural relation between categories) (e.g. Examples 1, 6)
(2) ordered (e.g. Examples 4, 5)
(3) quantitative (e.g. Examples 2(a), 7, 8)
A frequency distribution is a summary of raw data on some classification variable (which may be of any type).
Classifications with only two categories (heads/tails, male/female; yes/no; on/off) are called dichotomous or binary (e.g. Example 6). In this case, the distinctions (1) – (3) above make no difference to the analysis.
A twoway classification arises when each item may be classified according to two separate criteria. We get a twoway table or crosstabulation (e.g. Examples 14, 15, 16)
► Illustration 1.1
Suppose that 100 deaths are classified by
Criterion A: smoking status of deceased (smoked, did not smoke), and
Criterion B: cause of death (lung cancer, other)
                                   B: Cause of death
                                   Cancer   Other
A: Smoking status   Smoker             30      20
                    Not smoker         15      35
Was each of the 100 deaths randomly selected from some larger group and then classified according to each of the two binary criteria?
Or were 50 “smoker” deaths and 50 “not smoker” deaths examined and then classified according to cause of death? Or was something else done?
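A twoway table such as the one in Illustration 1.1 can be entered in R as a matrix with named dimensions. The sketch below is one way of doing this; the object name deaths and the dimension labels are chosen here for illustration.

```r
# Illustration 1.1: 100 deaths classified by smoking status (rows) and cause of death (columns)
# matrix() fills column by column: first the "Cancer" column, then the "Other" column
deaths = matrix(c(30, 15, 20, 35), nrow = 2,
                dimnames = list(Smoking = c("Smoker", "Not smoker"),
                                Cause   = c("Cancer", "Other")))

row_totals = rowSums(deaths)     # totals for each smoking status
col_totals = colSums(deaths)     # totals for each cause of death
print(addmargins(deaths))        # table with row, column and grand totals added
```

Both row totals are 50. This is consistent with 50 deaths of each smoking status having been chosen in advance, but the table on its own cannot tell us how the data were collected.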
Each classification can be either an explanatory variable or a response variable, depending on how the data have arisen. We must consider at least one of the classification variables to be a response variable. If the numbers in the categories were chosen in advance, the classification variable is an explanatory variable.
We will have to consider how the analysis and the interpretation of results will depend on which of A, B are response variables.
The sources of categorical data include:
Bernoulli trials (binomial data)
number of successes in n i.i. trials ~ binomial
e.g. Example 6: $N_H \sim B(1000, p)$, where $p = P(H)$
Trials with more than two possible outcomes (multinomial data)
numbers in categories in n i.i. trials ~ multinomial
e.g. Example 3: $N = (N_1, N_2, \ldots, N_5) \sim Mn(329, p)$, where $p_i = P(\text{type } i)$, $i = 1, 2, \ldots, 5$
Poisson process (Poisson data)
This is of fundamental importance in modelling categorical data. It is the continuous-time analogue of Bernoulli trials.
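The binomial and multinomial models for Examples 6 and 3 can be explored with the R functions dbinom and dmultinom. The sketch below is illustrative only: using the observed proportions as values for the category probabilities, and p = 0.5 for a fair coin, are assumptions made here rather than results from the workbook.

```r
# Example 6(a): number of heads in 1000 tosses of the normal coin, N_H ~ B(1000, p)
n_tosses = 1000
heads    = 473
p_hat    = heads / n_tosses                                # observed proportion of heads
prob_fair = dbinom(heads, size = n_tosses, prob = 0.5)     # P(N_H = 473) if p = 0.5

# Example 3: fingerprint types, N = (N1, ..., N5) ~ Mn(329, p)
prints = c(arch = 41, small_loop = 139, large_loop = 53, composite = 28, whorl = 68)
p_types  = prints / sum(prints)                            # observed proportions, summing to 1
prob_obs = dmultinom(prints, prob = p_types)               # probability of the observed counts
                                                           # when p is set to these proportions
print(p_hat)
print(round(p_types, 3))
```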
§2 POISSON PROCESS AND ASSOCIATED DISTRIBUTIONS
2.1 Bernoulli trials and related distributions
Bernoulli trials: a sequence of independent trials, each with two outcomes.
We consider identical trials: each has the same probability of success, say P(success) = p.
Number of successes
$S_n$, the number of successes in the first n trials, has the binomial distribution $B(n, p)$. The distribution has pf (probability function)
$$P(S_n = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \ldots, n, \quad \text{where } q = 1 - p,$$
mean np, variance npq.
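The mean and variance can be checked numerically from the standard binomial probability function, choose(n, x) p^x q^(n-x). In the sketch below the values n = 5 and p = 0.3 are arbitrary choices made for illustration.

```r
# Illustrative values (assumptions, not from the workbook)
n = 5
p = 0.3
q = 1 - p

x  = 0:n
pf = choose(n, x) * p^x * q^(n - x)     # binomial probability function evaluated directly

mean_S = sum(x * pf)                    # E[S_n], to be compared with n*p
var_S  = sum((x - mean_S)^2 * pf)       # V[S_n], to be compared with n*p*q

print(c(mean = mean_S, np = n * p, variance = var_S, npq = n * p * q))
```

The vector pf agrees with the built-in dbinom(x, n, p), which is the more convenient route in practice.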
Conditional distribution of success ‘times’
(1) Given that there is one success in the first n trials, when did it occur?
Since the trials are i.i.d., each possible time (x = 1, 2, … , n) has the same probability. The appropriate distribution is therefore uniform on {1, 2, …, n}.
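Formally:
$$P(\text{success at trial } x \mid S_n = 1) = \frac{p\,q^{n-1}}{\binom{n}{1}\,p\,q^{n-1}} = \frac{1}{n}, \quad x = 1, 2, \ldots, n.$$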
For example, given that there is exactly 1 success in the first 5 trials, the probability that this success occurs at the first trial (or indeed any specified trial) is 0.2.
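The value 0.2 does not depend on p. A short numerical check, with the values of p chosen arbitrarily for illustration, divides the probability of "success at the first trial and failures at the other four" by the probability of exactly one success in five trials.

```r
n = 5

# Conditional probability that the single success is at trial 1, for a given p
cond_prob = function(p) {
  q = 1 - p
  joint    = p * q^(n - 1)                    # success at trial 1, failures at trials 2..n
  marginal = dbinom(1, size = n, prob = p)    # exactly one success in n trials
  joint / marginal
}

p_values = c(0.1, 0.3, 0.5, 0.9)              # arbitrary illustrative values of p
cond = sapply(p_values, cond_prob)
print(cond)                                   # each entry should equal 1/n
```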
(2) Given that there are m successes in the first n trials, when did they occur?
Each possible set of values ($1 \le t_1 < t_2 < \ldots < t_m \le n$) has the same probability, $1 / \binom{n}{m}$, so again the (joint) distribution is uniform.
For example, given that there are 2 successes in the first 5 trials, the probability that these successes occur at the first and last trials (or indeed any two specified trials) is 0.1.
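The figure 0.1 is 1/choose(5, 2), since there are 10 possible pairs of success times. The sketch below lists the pairs with combn and confirms the conditional probability; the value p = 0.3 is an arbitrary choice for illustration and cancels out of the answer.

```r
n = 5
m = 2
p = 0.3                    # arbitrary illustrative value; the result does not depend on it
q = 1 - p

sets   = combn(n, m)       # each column is one possible pair of success times (t1 < t2)
n_sets = ncol(sets)        # number of possible pairs, choose(5, 2)

# P(successes at two specified trials | exactly m successes in n trials)
prob_each = (p^m * q^(n - m)) / dbinom(m, size = n, prob = p)

print(sets)
print(prob_each)
```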
2.2 The Poisson process and related distributions
The Poisson process in one dimension (time) at constant rate/intensity can be thought of as the continuous-time analogue of independent, identical Bernoulli trials.
We can think informally of a Poisson process as a series of events occurring one after another through time “at random” and at a constant rate. The rate of events occurring is also called the intensity of the process, denoted $\lambda$.
Some things you need to know about the Poisson process and related Poisson distributions
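The number of events which occur in a time interval of length t, denoted $N_t$, has a Poisson distribution with mean $\lambda t$, so
$$P(N_t = x) = \frac{e^{-\lambda t} (\lambda t)^x}{x!}, \quad x = 0, 1, 2, \ldots$$
$E[N_t] = V[N_t] = \lambda t$. We write $N_t \sim Pn(\lambda t)$.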
The numbers of events which occur in non-overlapping time intervals are independent.
$P(N_{s+t} = 0) = P(N_s = 0)\,P(N_t = 0)$ $[\,e^{-\lambda(s+t)} = e^{-\lambda s}\,e^{-\lambda t}\,]$
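These properties can be checked numerically with dpois. In the sketch below the intensity (called lambda in the code) and the interval lengths s and t are arbitrary values chosen for illustration, and the sums are truncated at 200, which is far enough into the tail for these values.

```r
# Illustrative values (assumptions, not from the workbook)
lambda = 2        # intensity of the process
s = 1.5           # length of the first interval
t = 3             # length of the second interval

# Number of events in an interval of length t: Poisson with mean lambda * t
x  = 0:200
pf = dpois(x, lambda * t)
mean_N = sum(x * pf)                    # to be compared with lambda * t
var_N  = sum((x - mean_N)^2 * pf)       # to be compared with lambda * t

# No events in an interval of length s + t, compared with the product for the two parts
lhs = dpois(0, lambda * (s + t))
rhs = dpois(0, lambda * s) * dpois(0, lambda * t)

print(c(mean = mean_N, variance = var_N, lambda_t = lambda * t))
print(c(lhs = lhs, rhs = rhs))
```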