Poisson Regression

Poisson Distribution: P(y) = , where y=0,1,2,3,…. Mean(Y)=Var(Y)=

Model: ; that is, the parameter for each observation depends upon the covariates.

Link function: ln()

Inverse Link function: exp()

Example data: Attendance data on 316 high school juniors from two urban high schools. The response variable of interest is days absent, daysabs. The variables math and langarts give the standardized test scores for math and language arts respectively. The variable male is a binary indicator of student gender. The response variable of interest is days absent, daysabs. The variables math and langarts give the standardized test scores for math and language arts respectively. The variable male is a binary indicator of student gender.

> p<-read.table("http://www.ats.ucla.edu/stat/R/dae/poissonreg.csv",sep=",",header=TRUE)

> attach(p)

> p[1:4,] # first 4 rows of data

id school male math langarts daysatt daysabs

1 1001 1 1 56.98883 42.45086 73 4

2 1002 1 1 37.09416 46.82059 73 4

3 1003 1 0 32.27546 43.56657 76 2

4 1004 1 0 29.05672 43.56657 74 3

> pairs(cbind(daysabs,male,math,langarts)) # scatter plots of data

> table( daysabs ) # How many students were absent 1,2,3,...,45 days

daysabs

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 26

62 46 31 26 22 20 15 12 9 11 5 6 8 9 3 1 5 1 3 2 2 1 2 1 1 1

27 28 30 31 34 35 41 45

3 1 1 1 1 2 1 1

>

> # Some general graphing information:

> # The function colors() lists all 657 colors

> sample( colors(), 20 ) # show names of 20 random numbers

[1] "purple1" "mistyrose3" "blue1" "navajowhite4" "lightgray"

[6] "hotpink" "gray99" "lightsalmon1" "grey75" "darkorchid"

[11] "sienna3" "green4" "coral1" "gray17" "dodgerblue3"

[16] "orangered1" "grey72" "gray94" "tomato4" "gray97"

> windows()

> hist( daysabs, breaks=seq(from=-.5, to=45.5,by=1),xlim=c(0,50), ylim=c(0,70), col="darkseagreen4" )

> # Could have used barplot(), but would have needed to provide 0 values for unobserved values less than 45

>

> fit1 <- glm( daysabs ~ male + math + langarts, family=poisson )

> summary( fit1 )$coef ## Notice "math" p-value is 0.05, on edge of significance

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.687665907 0.072651205 36.994100 1.424696e-299

male -0.400920916 0.048412173 -8.281407 1.217282e-16

math -0.003523222 0.001821349 -1.934402 5.306367e-02

langarts -0.012152122 0.001834826 -6.623039 3.518885e-11

> anova( fit1, test="Chisq" ) ## Notice "math" is very significant

Analysis of Deviance Table

Model: poisson, link: log

Response: daysabs

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 315 2409.8

male 1 45.362 314 2364.5 1.638e-11 ***

math 1 86.492 313 2278.0 < 2.2e-16 ***

langarts 1 43.420 312 2234.6 4.416e-11 ***

---

>

> fit2 <- glm( daysabs ~ male + langarts + math, family=poisson ) # swith math & langarts

> summary( fit2 )$coef ##notice, no change

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.687665907 0.072651205 36.994100 1.424696e-299

male -0.400920916 0.048412173 -8.281407 1.217282e-16

langarts -0.012152122 0.001834826 -6.623039 3.518885e-11

math -0.003523222 0.001821349 -1.934402 5.306367e-02

> anova( fit2, test="Chisq" ) ## Math p-value is now 0.05

Analysis of Deviance Table

Model: poisson, link: log

Response: daysabs

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 315 2409.8

male 1 45.362 314 2364.5 1.638e-11 ***

langarts 1 126.141 313 2238.3 < 2.2e-16 ***

math 1 3.771 312 2234.6 0.05214 .

>

> # as residual deviance suggests, much extra variation remains

> windows()

> plot( fitted(fit1), daysabs-fitted(fit1) )

> # Math and Langarts are highly correlated, even when broken down by gender

> cor( langarts, math )

[1] 0.688792

> cor( langarts[male==1], math[male==1] )

[1] 0.7021633

> cor( langarts[male==0], math[male==0] )

[1] 0.6791752

>

> # Explore more parsimonious model (fewer variables)

> fit3 <- glm( daysabs ~ male + langarts, family=poisson )

> summary( fit3 )$coef

Estimate Std. Error z value Pr(>|z|)

(Intercept) 2.6469765 0.069776423 37.935113 0.000000e+00

male -0.4093528 0.048219172 -8.489419 2.076726e-17

langarts -0.0146700 0.001293449 -11.341770 8.147924e-30

> anova( fit3, test="Chisq" )

Analysis of Deviance Table

Model: poisson, link: log

Response: daysabs

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 315 2409.8

male 1 45.362 314 2364.5 1.638e-11 ***

langarts 1 126.141 313 2238.3 < 2.2e-16 ***

> # See what AIC thinks. (Smaller is better)

> summary(fit1)$aic

[1] 3103.942

> summary(fit3)$aic

[1] 3105.713

> # Predict typical male number of absences (fit1)

> mean(langarts[male==1])

[1] 46.98139

> mean(math[male==1] )

[1] 47.76657

> # log(lambda) for typical male

> 2.7876 + -0.4009 + -.0035*47.7666 + -0.0122*46.9814

[1] 1.646344

> # predicted number of absences for typical male

> exp( 2.7876 + -0.4009 + -.0035*47.7666 + -0.0122*46.9814)

[1] 5.187977

>

> # If model were correct, typical male number of absences

> # would be Poisson distributed with mean of 5.19 days

>

> # female

> mean(langarts[male==0])

[1] 52.99398

> mean(math[male==0] )

[1] 49.68684

> exp( 2.7876 + -.0035*49.6568 + -0.0122*52.9940)

[1] 7.151159

> # If model correct, typical female number of absences

> # would be Poisson distributed with mean of 7.15 days

>

> detach(p)

> ###############################################################

> ### Use of offset() to account for rates in Poisson Regression

> ###############################################################

> # Treat lambda as the expected count per unit time or population

> # t*exp(b0+b1X1+....) = exp(b0+b1X1+...+ln(t))

> # Note: No coefficient for the ln(t) term, it is "offset"

>

> # Number of lung cancer cases in 4 Danish cities 1968-1971.

> # From: James K Lindsey (1995). Modelling frequency and count data, Clarendon Press, page 157.

> # Reference: Andersen, E. B. (1977).

> # Multiplicative Poisson models with unequal cell rates,

> # Scandinavian Journal of Statistics, 4, pages 153–158.

> # age column: midpoint of Age column, with 75 chosen for >74.

>

> lungs <- read.table("http://www.humboldt.edu/~mar13/Data.dir/DanishLung.txt",

+ header=TRUE,skip=7)

> attach( lungs )

> lungs

Cases Pop Age City age

1 11 3059 40-54 Fredericia 47

2 11 800 55-59 Fredericia 57

3 11 710 60-64 Fredericia 62

4 10 581 65-69 Fredericia 67

5 11 509 70-74 Fredericia 72

6 10 605 >74 Fredericia 75

7 13 2879 40-54 Horsens 47

8 6 1083 55-59 Horsens 57

9 15 923 60-64 Horsens 62

10 10 834 65-69 Horsens 67

11 12 634 70-74 Horsens 72

12 2 782 >74 Horsens 75

13 4 3142 40-54 Kolding 47

14 8 1050 55-59 Kolding 57

15 7 895 60-64 Kolding 62

16 11 702 65-69 Kolding 67

17 9 535 70-74 Kolding 72

18 12 659 >74 Kolding 75

19 5 2520 40-54 Vejle 47

20 7 878 55-59 Vejle 57

21 10 839 60-64 Vejle 62

22 14 631 65-69 Vejle 67

23 8 539 70-74 Vejle 72

24 7 619 >74 Vejle 75

> fit <- glm( Cases ~ age + offset( log(Pop) ), family=poisson, data=lungs)

> summary(fit)

Call:

glm(formula = Cases ~ age + offset(log(Pop)), family = poisson,

data = lungs)

Deviance Residuals:

Min 1Q Median 3Q Max

-4.2389 -0.4559 0.1840 0.8213 2.0335

Coefficients:

Estimate Std. Error z value Pr(>|z|)

(Intercept) -8.188971 0.421262 -19.439 <2e-16 ***

age 0.056483 0.006533 8.646 <2e-16 ***

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 129.908 on 23 degrees of freedom

Residual deviance: 50.698 on 22 degrees of freedom

AIC: 151.09

Number of Fisher Scoring iterations: 4

>

> windows()

> plot(fitted(fit),Cases-fitted(fit),ylab="Residual", xlab="Predicted Cases")

> abline(h=0)

> fit2 <- glm( Cases ~ age + I(age^2) + offset( log(Pop) ), family=poisson, data=lungs)

> summary(fit2)$coef

Estimate Std. Error z value Pr(>|z|)

(Intercept) -21.434384043 3.0887571330 -6.939485 3.935317e-12

age 0.501159509 0.1020193966 4.912394 8.997086e-07

I(age^2) -0.003633982 0.0008289412 -4.383884 1.165821e-05

> anova(fit2, test="Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: Cases

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 23 129.908

age 1 79.210 22 50.698 < 2.2e-16 ***

I(age^2) 1 19.732 21 30.966 8.911e-06 ***

> plot(fitted(fit2),Cases-fitted(fit2),ylab="Residual", xlab="Predicted Cases")

> abline(h=0)

>

> fitCity <- glm(Cases ~ age + I(age^2)+ City + offset( log(Pop) ), family=poisson, data=lungs)

> summary(fitCity)$coef

Estimate Std. Error z value Pr(>|z|)

(Intercept) -21.451864835 3.0889482407 -6.944715 3.792254e-12

age 0.509650433 0.1021173677 4.990830 6.012036e-07

I(age^2) -0.003701243 0.0008297504 -4.460670 8.170370e-06

CityHorsens -0.332668142 0.1814706365 -1.833179 6.677597e-02

CityKolding -0.375524185 0.1877886250 -1.999717 4.553079e-02

CityVejle -0.274024990 0.1878413800 -1.458811 1.446173e-01

> anova(fitCity, test="Chisq")

Analysis of Deviance Table

Model: poisson, link: log

Response: Cases

Terms added sequentially (first to last)

Df Deviance Resid. Df Resid. Dev P(>|Chi|)

NULL 23 129.908

age 1 79.210 22 50.698 < 2.2e-16 ***

I(age^2) 1 19.732 21 30.966 8.911e-06 ***

City 3 4.949 18 26.017 0.1756

> # Use fit2 and find the predicted number of cases for the first row of data

> fitted( fit2 )[1]

1

8.31857

> # Show how this was obtained

> fit2$coef # coefficients

(Intercept) age I(age^2)

-21.434384043 0.501159509 -0.003633982

> lungs[1,] # explanatory data

Cases Pop Age City age

1 11 3059 40-54 Fredericia 47

> 3059* exp( -21.4344 + 0.50116*47 - 0.003634*47^2 )

[1] 8.318295

> # Mathematical equivalent

> exp( -21.4344 + 0.50116*47 - 0.003634*47^2 + log( 3059 ) )

[1] 8.318295

>

> detach(lungs)

1