1. Introduction

1.1 Categorical response data

What is a categorical (qualitative) variable?

  • Field goal result – success or failure
  • Patient survival – yes or no
  • Criminal offense convictions – murder, robbery, assault, …
  • Highest attained education level – HS, BS, MS, PhD
  • Food for breakfast – cereal, bagel, eggs,…
  • Annual income – <15,000, 15,000–<25,000, 25,000–<40,000, ≥40,000
  • Religious affiliation

We live in a categorical world!

Types of categorical variables

1) Ordinal – categories have an ordering

Education, income

There are ways to take advantage of the ordering. Agresti is currently updating his Analysis of Ordinal Categorical Data book on the subject!

2) Nominal – categories do not have an ordering

Food for breakfast, criminal offense convictions, religious affiliation

Organization of Agresti (2007)

Chapters 1-2: “Standard methods” of analysis mostly developed prior to 1960

Chapters 3-8: Modeling approaches such as logistic regression and loglinear models

Chapters 9-10: More advanced applications – modeling correlated observations and incorporating random effects

Chapter 11: Historical overview – See the “feud” between Pearson and Yule.

For a more in-depth discussion of categorical data analysis, see Agresti’s (2002) book entitled Categorical Data Analysis.

1.2 Probability distributions for categorical data

Binomial distribution

Binomial distribution: $P(Y = y) = \binom{n}{y}\pi^y(1-\pi)^{n-y}$ for y = 0, 1, …, n

Notes:

  • $\binom{n}{y} = \frac{n!}{y!(n-y)!}$, read “n choose y”
  • Y is a random variable denoting the number of “successes” out of n trials
  • y denotes the observed value of Y
  • Y has a fixed number of possibilities – 0, 1, …, n
  • n is a fixed constant
  • π is a parameter denoting the probability of a “success”. It can take on values between 0 and 1.

Example: Field goal kicking

Suppose a field goal kicker attempts 5 field goals during a game and each field goal has the same probability of being successful (the kick is made). Also, assume each field goal is attempted under similar conditions; i.e., distance, weather, surface,….

Below are the characteristics that must be satisfied in order for the binomial distribution to be used.

1) There are n identical trials.

n = 5 field goals attempted under the exact same conditions

2) Two possible outcomes of a trial. These are typically referred to as a success or failure.

Each field goal can be made (success) or missed (failure)

3) The trials are independent of each other.

The result of one field goal does not affect the result of another field goal.

4) The probability of success, denoted by π, remains constant for each trial. The probability of a failure is 1 − π.

Suppose the probability a field goal is good is 0.6; i.e., P(success) = π = 0.6.

5) The random variable, Y, represents the number of successes.

Let Y = number of field goals that are good. Thus, Y can be 0, 1, 2, 3, 4, or 5.

Since these 5 items are satisfied, the binomial probability distribution can be used and Y is called a binomial random variable.

Bernoulli distribution: $P(Y = y) = \pi^y(1-\pi)^{1-y}$ for y = 0 or 1

This is a special case of the binomial with n =1.

Suppose Y1, Y2, …, Yn are independent Bernoulli random variables, each with probability of success π. Then $Y = \sum_{i=1}^{n} Y_i$ is a binomial random variable with n trials and π as the probability of success for each trial.
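To see this connection numerically, here is a small simulation sketch (my own illustration, not part of the course binomial.R program; the seed and number of simulated sets are arbitrary choices):

#Simulate 1,000 values of Y = Y1 + ... + Y5, where each Yi is Bernoulli(pi = 0.6)
set.seed(8881)
bern <- matrix(data = rbinom(n = 5*1000, size = 1, prob = 0.6), nrow = 1000, ncol = 5)
y <- rowSums(bern)   #Each row sum behaves like one Binomial(n = 5, pi = 0.6) observation
table(y)/1000        #Observed relative frequencies
dbinom(x = 0:5, size = 5, prob = 0.6)   #Binomial probabilities for comparison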

Mean and variance for Binomial random variable

$E(Y) = \sum_{y=0}^{n} y\, P(Y = y)$

$= \sum_{y=0}^{n} y \binom{n}{y} \pi^y (1-\pi)^{n-y}$

$= \sum_{y=1}^{n} y \frac{n!}{y!(n-y)!} \pi^y (1-\pi)^{n-y}$  since y = 0 does not contribute to the sum

$= \sum_{y=1}^{n} \frac{n!}{(y-1)!(n-y)!} \pi^y (1-\pi)^{n-y}$

$= n\pi \sum_{y=1}^{n} \frac{(n-1)!}{(y-1)!(n-y)!} \pi^{y-1} (1-\pi)^{n-y}$

$= n\pi \sum_{x=0}^{n-1} \frac{(n-1)!}{x!(n-1-x)!} \pi^{x} (1-\pi)^{n-1-x}$  where x = y − 1

$= n\pi$  since a binomial distribution with n − 1 trials is inside the sum

Var(Y) = nπ(1 − π)

Example: Field goal kicking (binomial.R)

Suppose =0.6, n=5. What is the probability distribution?

P(Y=0) =  0.0102

For Y = 0, …, 5:

Y    P(Y = y)
0    0.0102
1    0.0768
2    0.2304
3    0.3456
4    0.2592
5    0.0778

E(Y) = nπ = 5×0.6 = 3 and

Var(Y) = nπ(1 − π) = 5×0.6×(1 − 0.6) = 1.2

R code and output:

> #n=5, pi=0.6

> pdf<-dbinom(x = 0:5, size = 5, prob = 0.6)

> pdf

[1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776

> pdf <- dbinom(0:5, 5, 0.6)

> pdf

[1] 0.01024 0.07680 0.23040 0.34560 0.25920 0.07776

> #Make the printing look a little nicer

> save <- data.frame(Y = 0:5, prob = round(pdf, 4))

> save

Y prob

1 0 0.0102

2 1 0.0768

3 2 0.2304

4 3 0.3456

5 4 0.2592

6 5 0.0778

> plot(x = save$Y, y = save$prob, type = "h", xlab = "Y", ylab = "P(Y=y)",
    main = "Plot of a binomial distribution for n=5, pi=0.6",
    panel.first = grid(col = "gray", lty = "dotted"), lwd = 2)

> abline(h = 0)

Example: Simulating observations from a binomial distribution (binomial.R)

The purpose of this example is to show how one can “simulate” observing a random sample of observations from a population characterized by a binomial distribution.

Why would someone want to do this?


Use the rbinom() function in R.

> #Generate observations from a Binomial distribution

> set.seed(4848)

> bin5<-rbinom(n = 1000, size = 5, prob = 0.6)

> bin5[1:20]

[1] 3 2 4 1 3 1 3 3 3 4 3 3 3 2 3 1 2 2 5 2

> mean(bin5)

[1] 2.991

> var(bin5)

[1] 1.236155

> hist(x = bin5, main = "Binomial with n=5, pi=0.6, 1000 bin. observations", col = "blue")

Notes:

  • The bar for “0” is a little off because hist() treats the values as if they were continuous. Here’s one way to fix this problem:

> save.count<-table(bin5)

> save.count

bin5

0 1 2 3 4 5

12 84 215 362 244 83

> barplot(height = save.count, names = c("0", "1", "2", "3", "4", "5"), main = "Binomial with N=5, pi=0.6, 1000 bin. observations", xlab = "x")

The table() function counts the observations.

  • The shape of the histogram looks similar to the shape of the actual binomial distribution.
  • The mean and variance are close to what we expect them to be!

Multinomial distribution

Instead of just two possible outcomes, suppose there are more than two. When this happens, a multinomial distribution is appropriate. The binomial distribution is a special case when there are only two possible outcomes.

Set up:

  • n trials
  • k different outcomes or categories
  • ni denotes the number of observed responses for category i (count for the ith category) for i=1,…,k
  • πi denotes the probability of observing the ith category for i=1,…,k – these are parameters

The multinomial distribution is

$P(N_1 = n_1, \ldots, N_k = n_k) = \frac{n!}{n_1! n_2! \cdots n_k!} \pi_1^{n_1} \pi_2^{n_2} \cdots \pi_k^{n_k}$

Note that n = n1 + n2 + … + nk.

Example: Highest college degree received

Outcomes: None, Associate, Bachelors, Masters, Doctorate

Suppose a random sample of size n is taken from a population of interest.

n1 denotes the number of people out of n with no college degree

n2 denotes the number of people out of n with an associate degree as their highest degree

Example: Sample from a multinomial distribution (multinomial_gen.R)

Suppose k = 5, π1 = 0.25, π2 = 0.35, π3 = 0.2, π4 = 0.1, and π5 = 0.1.

> set.seed(2195)

> save<-rmultinom(n = 1, size = 1000, prob = c(0.25, 0.35, 0.2, 0.1, 0.1))

> save

[,1]

[1,] 242

[2,] 333

[3,] 188

[4,] 122

[5,] 115

> save/1000

[,1]

[1,] 0.242

[2,] 0.333

[3,] 0.188

[4,] 0.122

[5,] 0.115

> #Heights of bars correspond to the columns, so need to transpose save

> barplot(height = t(save), names = c("1", "2", "3", "4", "5"), col = "red",
    main = "Sample from multinomial distribution with \n pi=(0.25, 0.35, 0.2, 0.1, 0.1)")

Notes:

  • The first option in rmultinom( ) tells R how many sets of multinomial samples of size = 1000 (in this example) to find.
  • The c() function combines or concatenates values into a vector.
  • Multinomial sampling will be VERY important later in the course.
  • The dmultinom() function can be used to find probabilities with the multinomial distribution. See the program for examples; a small check also appears below.
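As a quick check of the multinomial formula given earlier (my own illustration; the counts below are hypothetical and chosen only to keep the arithmetic small), dmultinom() matches the probability computed directly from $n!/(n_1! \cdots n_k!)\, \pi_1^{n_1} \cdots \pi_k^{n_k}$:

#Probability of observing counts (3, 4, 1, 1, 1) in n = 10 trials
pi.vec <- c(0.25, 0.35, 0.2, 0.1, 0.1)
n.vec <- c(3, 4, 1, 1, 1)
dmultinom(x = n.vec, prob = pi.vec)
#Same value computed directly from the multinomial formula
factorial(10)/prod(factorial(n.vec)) * prod(pi.vec^n.vec)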

Question: Why examine probability distributions?

1.3 Statistical inference for a proportion

Introduction to maximum likelihood estimation (FG kicking)

Suppose the success or failure of a field goal in football can be modeled with a Bernoulli(π) distribution. Let Y = 0 if the field goal is a failure and Y = 1 if the field goal is a success. Then the probability distribution for Y is:

$P(Y = y) = \pi^y(1-\pi)^{1-y}$

where π denotes the probability of success. Suppose we would like to estimate π for a 40-yard field goal. Let y1, …, yn denote a random sample of observed field goal results at 40 yards. Thus, these yi’s are either 0’s or 1’s.

Given the resulting data (y1, …, yn), the “likelihood function” measures the plausibility of different values of π:

$L(\pi \mid y_1, \ldots, y_n) = P(Y_1 = y_1)\cdots P(Y_n = y_n) = \prod_{i=1}^{n} \pi^{y_i}(1-\pi)^{1-y_i} = \pi^{\sum y_i}(1-\pi)^{n - \sum y_i}$

Suppose $\sum_{i=1}^{n} y_i = 4$ and n = 10. Given this observed information, we would like to find the corresponding parameter value for π that produces the largest probability of obtaining this particular sample. The following table can be formed to help find this parameter value:

 /
0.2 / 0.000419
0.3 / 0.000953
0.35 / 0.001132
0.39 / 0.001192
0.4 / 0.001194
0.41 / 0.001192
0.5 / 0.000977

Note that π = 0.4 is the “most plausible” value of π for the observed data since it maximizes the likelihood function. Therefore, 0.4 is the maximum likelihood estimate (MLE).
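The table and the maximization can also be done numerically in R. Below is a brief sketch (my own code, not one of the course programs); optimize() confirms that the likelihood function is largest near π = 0.4:

#Likelihood function for the observed data: sum of the y's = 4 and n = 10
sum.y <- 4
n <- 10
L <- function(pi) pi^sum.y * (1 - pi)^(n - sum.y)
L(c(0.2, 0.3, 0.35, 0.39, 0.4, 0.41, 0.5))           #Values shown in the table above
optimize(f = L, interval = c(0, 1), maximum = TRUE)  #Maximum at approximately pi = 0.4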

In general, the maximum likelihood estimate can be found as follows:

  1. Find the natural log of the likelihood function, $\log[L(\pi \mid y_1, \ldots, y_n)]$.
  2. Take the derivative of $\log[L(\pi \mid y_1, \ldots, y_n)]$ with respect to π.
  3. Set the derivative equal to 0 and solve for π to find the maximum likelihood estimate. Note that the solution is the maximum of $\log[L(\pi \mid y_1, \ldots, y_n)]$ provided certain “regularity” conditions hold (see Mood, Graybill, and Boes, 1974).

For the field goal example,

$\log[L(\pi \mid y_1, \ldots, y_n)] = \left(\sum_{i=1}^{n} y_i\right)\log(\pi) + \left(n - \sum_{i=1}^{n} y_i\right)\log(1-\pi)$

where log means natural log. Taking the derivative with respect to π and setting it equal to 0 gives

$\frac{\partial \log[L(\pi \mid y_1, \ldots, y_n)]}{\partial\pi} = \frac{\sum_{i=1}^{n} y_i}{\pi} - \frac{n - \sum_{i=1}^{n} y_i}{1-\pi} = 0 \;\Rightarrow\; \hat{\pi} = \frac{\sum_{i=1}^{n} y_i}{n}$

Therefore, the maximum likelihood estimator of π is the proportion of field goals made. To avoid confusion between a parameter and a statistic, we will denote the estimator as $p = \sum_{i=1}^{n} y_i / n$. Often, others will put a ^ on π (i.e., $\hat{\pi}$) to denote the estimator. Agresti uses p.

Agresti derives the estimator from a binomial point of view. See the top of p.6 for the connection between the Bernoulli and binomial way of looking at the problem.

Maximum likelihood estimation will be extremely important in this class!!!

For additional examples with maximum likelihood estimation, please see Section 9.15 of my STAT 380 lecture notes at

Properties of maximum likelihood estimators

  • For a large sample, maximum likelihood estimators can be treated as normal random variables.
  • For a large sample, the variance of the maximum likelihood estimator can be computed from the second derivative of the log likelihood function.

Why do you think these two properties are important?

The use of “for a large sample” can also be replaced with the word “asymptotically”. You will often hear these results talked about using the phrase “asymptotic normality of maximum likelihood estimators”.
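To illustrate the first property, here is a small simulation sketch (my own illustration; the sample size, π value, and seed are arbitrary choices) showing that the distribution of p across repeated samples looks approximately normal:

#Simulate 10,000 values of p = (number of successes)/n for n = 100 and pi = 0.6
set.seed(7721)
n <- 100
pi <- 0.6
p <- rbinom(n = 10000, size = n, prob = pi)/n
hist(x = p, freq = FALSE, main = "Estimated distribution of p for n = 100, pi = 0.6")
#Overlay the approximating normal distribution with mean pi and variance pi*(1-pi)/n
curve(expr = dnorm(x, mean = pi, sd = sqrt(pi*(1 - pi)/n)), add = TRUE, lwd = 2)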

You are not responsible for doing derivations like those shown in the next example on exams. This example is helpful to see in order to understand how R will be doing these and more complex calculations.

Example: Field goal kicking

The large-sample variance of p is

$Var(p) = \frac{\pi(1-\pi)}{n}$.

To find this, note that

$Var(p) = Var\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(Y_i) = \frac{1}{n^2}\, n\pi(1-\pi) = \frac{\pi(1-\pi)}{n}$

since only the Yi’s are random variables.

Since π is a parameter, we replace it with its corresponding estimator to obtain the estimated variance. Thus, $\widehat{Var}(p) = \frac{p(1-p)}{n}$.

This same result is derived on p. 474 of Casella and Berger (2002).

See Chapter 18 of Ferguson (1996) for more on the “asymptotic normality” of maximum likelihood estimators.
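For illustration only (this is not part of the course programs, and it is not how the derivation above was done), the second property can also be checked numerically: optim() can return a numerically differentiated second derivative of the log-likelihood, and inverting it gives essentially the same estimated variance:

#Minimize the negative log-likelihood; the returned Hessian is the observed information
sum.y <- 4
n <- 10
negLogL <- function(pi) -(sum.y*log(pi) + (n - sum.y)*log(1 - pi))
fit <- optim(par = 0.5, fn = negLogL, method = "L-BFGS-B", lower = 0.0001,
  upper = 0.9999, hessian = TRUE)
fit$par        #MLE, approximately 0.4
1/fit$hessian  #Estimated variance, approximately p*(1-p)/n = 0.4*0.6/10 = 0.024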

Likelihood Ratio Test (LRT)

The LRT is a general way to test hypotheses. The LRT statistic, Λ, is the ratio of two likelihood functions. The numerator is the likelihood function maximized over the parameter space restricted under the null hypothesis. The denominator is the likelihood function maximized over the unrestricted parameter space. The test statistic is written as

$\Lambda = \frac{\max_{H_0} L(\pi \mid y_1, \ldots, y_n)}{\max_{H_0 \text{ or } H_a} L(\pi \mid y_1, \ldots, y_n)}$

Wilks (1935, 1938) shows that –2log(Λ) can be approximated by a $\chi^2_u$ distribution for a large sample and under Ho, where u is the difference in dimension between the alternative and null hypothesis parameter spaces. See Casella and Berger (2002) for more on the LRT.

See the chi-square (χ²) distribution review in the Chapter 1 additional notes if needed.

Example: Field goal kicking

Continuing the field goal example, suppose the hypothesis test Ho: π = 0.5 vs. Ha: π ≠ 0.5 is of interest. Remember that $\sum_{i=1}^{n} y_i = 4$ and n = 10.

The numerator of Λ is the maximum possible value of the likelihood function under the null hypothesis. Since π = 0.5 is specified by the null hypothesis, the maximum can be found by just substituting π = 0.5 into the likelihood function:

$L(0.5 \mid y_1, \ldots, y_{10}) = 0.5^{\sum y_i}(1-0.5)^{10-\sum y_i} = 0.5^{4} \times 0.5^{6} = 0.000977$

The denominator of Λ is the maximum possible value of the likelihood function under the null OR alternative hypotheses. Since this includes all possible values of π here, the maximum is achieved when the maximum likelihood estimate is substituted for π in the likelihood function! As shown previously, the maximum value is 0.001194.

Therefore,

$\Lambda = \frac{0.000977}{0.001194} = 0.8179$

Then –2log(Λ) = –2log(0.8179) = 0.4020 is the test statistic value. The critical value is $\chi^2_{1, 0.95} = 3.84$ using α = 0.05. Because 0.4020 < 3.84, there is not sufficient evidence to reject the hypothesis that π = 0.5.
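As a quick sketch of this calculation in R (my own code, separate from the course programs):

#LRT for Ho: pi = 0.5 vs. Ha: pi != 0.5 with sum of the y's = 4 and n = 10
sum.y <- 4
n <- 10
pi0 <- 0.5
p <- sum.y/n
Lambda <- (pi0^sum.y * (1 - pi0)^(n - sum.y)) / (p^sum.y * (1 - p)^(n - sum.y))
-2*log(Lambda)                          #Test statistic, approximately 0.40
qchisq(p = 0.95, df = 1)                #Critical value, 3.84
1 - pchisq(q = -2*log(Lambda), df = 1)  #p-value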

Questions:

  • Suppose the ratio is close to 0. What does this say about Ho? Explain.
  • Suppose the ratio is close to 1. What does this say about Ho? Explain.

Confidence interval for the population proportion

A typical form of a confidence interval for a parameter is

Estimator ± (distributional value)×(standard deviation of estimator)

In this case, $p = \sum_{i=1}^{n} y_i/n$ is the estimator, the distribution is normal, and the standard deviation is $\sqrt{p(1-p)/n}$. Thus, the approximate (1 − α)100% confidence interval for π is

$p \pm Z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}$

where $Z_{1-\alpha/2}$ is the 1 − α/2 quantile from a standard normal distribution.

Confidence intervals which use the asymptotic normality of a maximum likelihood estimator are called “Wald” confidence intervals.

Example: Field goal kicking

Suppose $\sum_{i=1}^{n} y_i = 4$ and n = 10. Then the 95% confidence interval is

$0.4 \pm 1.96\sqrt{\frac{0.4(1-0.4)}{10}} = 0.4 \pm 0.3036$

or 0.0964 ≤ π ≤ 0.7036.

Problems!!!

1) Remember the interval “works” if the sample size is large. The field goal kicking example has only n = 10!

2) The discreteness of the binomial distribution often makes the normal approximation work poorly, even with large samples.

The result is a confidence interval that is often too “liberal”. This means when 95% is stated as the confidence level, the true confidence level is often lower.

The problems with this particular confidence interval have been discussed for a long time in the statistical literature. A few recent papers on this topic include:

  • Agresti, A. and Coull, B. A. (1998). Approximate is better than “exact” for interval estimation of binomial proportions. The American Statistician 52(2), 119-126.
  • Agresti, A. and Caffo B. (2000). Simple and effective confidence intervals for proportions and differences of proportions result from adding two successes and two failures. The American Statistician 54(4), 280-288.
  • Agresti, A. and Min, Y. (2001). On small-sample confidence intervals for parameters in discrete distributions. Biometrics 57, 963-971.
  • Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 28, 783-798.
  • Blaker, H. (2001). Corrigenda: Confidence curves and improved exact confidence intervals for discrete distributions. The Canadian Journal of Statistics 29, 681.
  • Borkowf, C. B. (2006). Constructing binomial confidence intervals with near nominal coverage by adding a single imaginary failure or success. Statistics in Medicine 25, 3679-3695.
  • Brown, L. D., Cai, T. T., and DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science 16(2), 101-133.
  • Henderson, M. and Meyer, M. (2001). Exploring the confidence interval for a binomial parameter in a first course in statistical computing. The American Statistician 55(4), 337-344.
  • Newcombe, R. G. (2001). Logit confidence intervals and the inverse sinh transformation. The American Statistician 55(3), 200-202.
  • Vos, P. W. and Hudson, S. (2005). Evaluation criteria for discrete confidence intervals: Beyond coverage and length. The American Statistician 59(2), 137-142.

Brown et al. (2001) serves as a summary of all the proposed methods and gives the following recommendations:

  • For n40, use the “Agresti and Coull” (1998) interval.

Let $\tilde{p} = \frac{\sum_{i=1}^{n} y_i + Z_{1-\alpha/2}^2/2}{n + Z_{1-\alpha/2}^2}$ be an adjusted estimate of π. The approximate (1 − α)100% confidence interval is

$\tilde{p} \pm Z_{1-\alpha/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{n + Z_{1-\alpha/2}^2}}$

Note that when α = 0.05, $Z_{1-\alpha/2} = 1.96 \approx 2$. Then

$\tilde{p} \approx \frac{\sum_{i=1}^{n} y_i + 2}{n + 4}$.

Thus, two successes and two failures are added. Agresti and Coull (1998) refer to this as an “adjusted” Wald confidence interval.

When n<40, the Agresti and Coull interval is generally still better than the Wald interval.

  • For n<40, use the Wilson or Jeffrey’s prior interval.

Wilson interval:

$\tilde{p} \pm \frac{Z_{1-\alpha/2}\sqrt{n}}{n + Z_{1-\alpha/2}^2}\sqrt{p(1-p) + \frac{Z_{1-\alpha/2}^2}{4n}}$

Where does this come from?

Consider the hypothesis test of Ho: π = π0 vs. Ha: π ≠ π0 using the test statistic

$Z_0 = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$

The limits of the Wilson confidence interval come from “inverting” the test. This means finding the set of π0 such that

$-Z_{1-\alpha/2} \le \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} \le Z_{1-\alpha/2}$

is satisfied. Through solving a quadratic equation (see #1.18 in the homework), the interval limits are derived.
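As an aside, the Wilson interval can also be obtained by letting prop.test() invert this test with its continuity correction turned off (a quick check of my own, not part of fg.R):

#Wilson (score) interval for 4 successes out of 10 trials
prop.test(x = 4, n = 10, conf.level = 0.95, correct = FALSE)$conf.int
#Limits are approximately 0.1682 and 0.6873, matching the Wilson C.I. below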

Jeffrey’s interval: Please see the additional notes for Chapter 1. This interval will not be on a test, but could be on a project.

Example: Field goal kicking (fg.R)

Below is the code used to calculate each confidence interval.

> sum.y<-4
> n<-10
> alpha<-0.05
> p<-sum.y/n

> #Wald C.I.
> lower<-p-qnorm(p = 1-alpha/2)*sqrt(p*(1-p)/n)
> upper<-p+qnorm(p = 1-alpha/2)*sqrt(p*(1-p)/n)
> cat("The Wald C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The Wald C.I. is: 0.0964 <= pi <= 0.7036

> #Agresti and Coull C.I.
> p.tilde<-(sum.y+qnorm(p = 1-alpha/2)^2 /2)/(n+qnorm(p = 1-alpha/2)^2)
> lower<-p.tilde-qnorm(p = 1-alpha/2)*sqrt(p.tilde*(1-p.tilde)/(n+qnorm(p = 1-alpha/2)^2))
> upper<-p.tilde+qnorm(p = 1-alpha/2)*sqrt(p.tilde*(1-p.tilde)/(n+qnorm(p = 1-alpha/2)^2))
> cat("The Agresti and Coull C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The Agresti and Coull C.I. is: 0.1671 <= pi <= 0.6884

> #Wilson C.I.
> lower<-p.tilde-qnorm(p = 1-alpha/2)*sqrt(n)/(n+qnorm(p = 1-alpha/2)^2) * sqrt(p*(1-p)+qnorm(p = 1-alpha/2)^2/(4*n))
> upper<-p.tilde+qnorm(p = 1-alpha/2)*sqrt(n)/(n+qnorm(p = 1-alpha/2)^2) * sqrt(p*(1-p)+qnorm(p = 1-alpha/2)^2/(4*n))
> cat("The Wilson C.I. is:", round(lower,4), "<= pi <=", round(upper,4), "\n \n")
The Wilson C.I. is: 0.1682 <= pi <= 0.6873

The intervals produced are:

The Wald C.I. is: 0.0964 <= pi <= 0.7036

The Agresti and Coull C.I. is: 0.1671 <= pi <= 0.6884

The Wilson C.I. is: 0.1682 <= pi <= 0.6873

How useful would these confidence intervals be?


Note that Agresti now gives code for some of these intervals at his website. This code contains actual functions to perform the calculations. Also, the binom package in R provides a simple function to do these calculations as well. Here is an example of how I used its function:

> library(binom)

> binom.confint(x = sum.y, n = n, conf.level = 1-alpha, methods = "all")

method x n mean lower upper

1 agresti-coull 4 10 0.4000000 0.16711063 0.6883959

2 asymptotic 4 10 0.4000000 0.09636369 0.7036363

3 bayes 4 10 0.4090909 0.14256735 0.6838697

4 cloglog 4 10 0.4000000 0.12269317 0.6702046

5 exact 4 10 0.4000000 0.12155226 0.7376219

6 logit 4 10 0.4000000 0.15834201 0.7025951

7 probit 4 10 0.4000000 0.14933907 0.7028372

8 profile 4 10 0.4000000 0.14570633 0.6999845

9 lrt 4 10 0.4000000 0.14564246 0.7000216

10 prop.test 4 10 0.4000000 0.13693056 0.7263303

11 wilson 4 10 0.4000000 0.16818033 0.6873262

Below is a comparison of the performance of the four confidence intervals (Brown et al., 2001). The values on the y-axis represent the true confidence level (coverage) of the confidence intervals. Each of the confidence intervals is supposed to be 95%!

[Figure: estimated true confidence levels (coverage) versus π for the Wald, Agresti-Coull, Wilson, and Jeffreys intervals.]

What does the “true confidence” or “coverage” level mean?

  • Suppose a random sample of size n=50 is taken from a population and a 95% Wald confidence interval is calculated.
  • Suppose another random sample of size n=50 is taken from the same population and a 95% Wald confidence interval is calculated.
  • Suppose this process is repeated 10,000 times.
  • We would expect 9,500 out of 10,000 (95%) of the confidence intervals to contain π.
  • Unfortunately, this does not often happen. It is only guaranteed to happen as n → ∞ for the Wald interval.
  • The true confidence or coverage level is the percent of times the confidence intervals contain or “cover” π.

How can this “true confidence” or “coverage” level be calculated?

Simulate the process of taking 10,000 samples and calculating the confidence interval each time using a computer. This is called a Monte Carlo simulation.

The plots above do this for many possible values of π (0.0005 to 0.9995 by 0.0005). For example, the coverage using the Wald interval is approximately 0.90 for π = 0.184.

Note: The authors of the paper actually use a different method which does not require computer simulation (see the additional Chapter 1 notes for further information). However, computer simulation will provide just about the same answer if the process (take a sample, calculate the C.I.) is repeated a large number of times. Plus, simulation can be used in many more situations where it is the only option.
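For reference, here is one compact way such a simulation could be written for the Wald interval (my own sketch; the CI_sim.R example below builds up the same idea one step at a time). With π = 0.184 and n = 50, the estimated true confidence level comes out near 0.90 rather than 0.95:

#Estimate the true confidence level of the 95% Wald interval by Monte Carlo simulation
set.seed(9182)
n <- 50
pi <- 0.184
alpha <- 0.05
sum.y <- rbinom(n = 10000, size = n, prob = pi)   #10,000 simulated samples
p <- sum.y/n
lower <- p - qnorm(p = 1 - alpha/2)*sqrt(p*(1 - p)/n)
upper <- p + qnorm(p = 1 - alpha/2)*sqrt(p*(1 - p)/n)
mean(lower <= pi & upper >= pi)   #Proportion of intervals that contain pi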

Brown et al. (2001) also discuss other confidence intervals. After the paper, there is a set of responses by other statisticians. Each agrees the Wald interval should NOT be used; however, they do not all agree on which other interval to use!

Example: Calculate estimated true confidence level for Wald (CI_sim.R)

##############################################################

# NAME: Chris Bilder #

# DATE: 1-23-02 #

# UPDATE: 12-7-03 #

# PURPOSE: C.I. pi simulation #

# #

# NOTES: See plot on p. 107 of Brown, et al. (2001) #

##############################################################

############################################################

# Do one time

> n <- 50

> pi <- 0.184

> sum.y <- rbinom(n = 1, size = n, prob = pi)

> sum.y

[1] 6

> alpha <- 0.05

> p <- sum.y/n

> #Wald C.I.

> lower <- p - qnorm(p = 1 - alpha/2) * sqrt((p * (1 - p))/n)