Modeling the Number of Breakfasts Eaten at Rice University’s Central Kitchen over the Course of a Semester

A Study by Brooke Lathram

Introduction

This study examines the total number of breakfasts eaten each day at all of Rice University’s eight residential college kitchens during the fall 2000 and spring 2001 semesters. The main problem this paper addresses is whether a model fit to the fall semester breakfast data can sufficiently explain the number of breakfasts eaten each day in the spring semester. A model with such predictive ability would potentially be of benefit to Rice University’s Housing and Dining, as it would aid in making decisions about how much food and how many workers are necessary on given days throughout a semester. Last year, Food and Housing (now known as Housing and Dining) proposed to save money by serving hot food only at four of the eight colleges. It was estimated that closing four of the colleges’ hot breakfast services each day would save $78,000, or roughly $45 per student per year (Rice Thresher, Oct 27, 2000). The goal of this paper is to provide a tool for Housing and Dining to address this and other cost-cutting measures.

Methods

The data provided by Housing and Dining for the fall semester includes the number of breakfasts eaten at each college for six-day weeks (Monday through Saturday) between August 28 and December 20, with the exception of November 23 through 25, when no breakfasts were served for Thanksgiving break. The data collected for the spring semester includes six-day weeks between January 16 and May 9, with the exception of the week of February 19 through 24, when no data was collected, and March 5 through 10, when no breakfasts were served for spring break. Using the statistical software S-Plus, we fit an auto-regressive model to the fall semester data and examine its predictive ability for the spring.

Dealing with Missing Values

There are two unexplained missing values that we must estimate before we attempt to fit a model. There is no data recorded for Weiss College on Thursday, September 14, and none for Baker College on Wednesday, November 22, the day before Thanksgiving. Since neither Weiss nor Baker was closed on these days, these missing values must be due to errors in collecting the data; possibly the card readers at Weiss and Baker were broken on these days. There are also three days on which no breakfasts were served at any of the colleges: November 23 through 25, the three days of Thanksgiving vacation.

First we estimate the missing value for the number of breakfasts eaten at Weiss College on Thursday, September 14. We fill in the missing values for this Thursday, for the Tuesday of fall break (when four of the colleges, including Weiss, were closed), and for the Thursday and Friday of Thanksgiving vacation using linear interpolation. Before fitting an auto-regressive model, we examine the mean-differenced data, since a zero-mean process is required when fitting an ARIMA model; from the ACF and PACF, it is apparent that no other differencing is necessary, as there appears to be no seasonal structure. After fitting a model, we use its one-step prediction at time t = 14 to estimate the number of breakfasts eaten on Thursday, September 14. Because Weiss College was closed on Saturdays during the fall semester, we fit the model to five-day weeks (omitting Saturdays). Below is a plot of the original data.
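Since the S-Plus function approx may be unfamiliar, here is a minimal sketch of the same linear interpolation in Python. The day indices and counts below are hypothetical, for illustration only, not the actual Weiss data.

```python
# Linear interpolation of a missing daily count, analogous to S-Plus approx().
# All values here are hypothetical, for illustration only.
def interpolate_missing(xs, ys, x_missing):
    """Linearly interpolate y at x_missing from known (xs, ys) pairs.
    xs must be sorted in ascending order."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x_missing <= x1:
            t = (x_missing - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("x_missing outside the observed range")

# Day 14 is missing; its neighbours were observed on days 13 and 15.
estimate = interpolate_missing([13, 15], [90, 86], 14)  # -> 88.0
```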

We plot the mean-differenced series and its histogram, autocorrelation function (ACF) and partial autocorrelation function (PACF).

It appears that the ACF tails off after lag 1 and the PACF cuts off after lag 1, so an auto-regressive model of order 1 seems appropriate. We use the S-Plus function arima.mle to fit an AR(1) model and examine the standardized residuals from a one-step prediction to see how well the model fits.
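arima.mle fits the model by maximum likelihood; as a rough illustration of the idea only (in Python, using the simpler Yule-Walker estimate phi = c1/c0 rather than full maximum likelihood), an AR(1) fit and its one-step predictions can be sketched as:

```python
# Illustrative AR(1) fit for a series (not the paper's actual data):
# estimate phi from the lag-1 sample autocorrelation (Yule-Walker),
# then form one-step predictions xhat_t = mean + phi * (x_{t-1} - mean).
def fit_ar1(x):
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]                           # mean-differenced series
    c0 = sum(v * v for v in z) / n                      # lag-0 autocovariance
    c1 = sum(z[i] * z[i - 1] for i in range(1, n)) / n  # lag-1 autocovariance
    return mean, c1 / c0

def one_step_predict(x, mean, phi):
    # Predictions for t >= 1, with the mean added back as in the paper.
    return [mean + phi * (x[t - 1] - mean) for t in range(1, len(x))]
```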

The top graph shows that there are four outliers falling outside 95th percentile error bounds. The outliers that fall well beyond the bounds correspond to Labor Day (t=6) and the first day of fall break (t=36). Here we come across the issue of “holiday values,” that is, lower than predicted numbers of breakfasts eaten around vacation times. We will encounter this problem again when we fit a model to the total number of breakfasts eaten. Despite the outliers, the ACF (second graph from the top) and PACF (second graph from the bottom) fall within 95th percentile error bounds at all lags, and the P-values for the Ljung-Box Chi-Squared statistics fail to reject the null hypothesis that the model is adequate. To predict the number of breakfasts eaten at Weiss on Thursday, September 14, we look at the one-step prediction at time t = 14. Using the S-Plus function arima.filt, we make a one-step prediction for the mean-differenced data and add back the mean.
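For reference, the Ljung-Box statistic used in these diagnostics is Q = n(n+2) * sum over k=1..h of r_k^2/(n-k), where r_k is the lag-k residual autocorrelation. A minimal Python sketch of the statistic itself (the P-value would then come from a Chi-Squared distribution with the appropriate degrees of freedom) is:

```python
# Ljung-Box Q statistic on residuals: Q = n(n+2) * sum_{k=1..h} r_k^2 / (n-k).
# Under an adequate model, Q is approximately Chi-Squared distributed with
# h minus the number of fitted parameters degrees of freedom; a large Q
# rejects adequacy. Illustrative sketch, not the S-Plus implementation.
def ljung_box_q(resid, h):
    n = len(resid)
    mean = sum(resid) / n
    z = [r - mean for r in resid]
    c0 = sum(v * v for v in z) / n
    q = 0.0
    for k in range(1, h + 1):
        ck = sum(z[i] * z[i - k] for i in range(k, n)) / n
        rk = ck / c0                  # lag-k sample autocorrelation
        q += rk * rk / (n - k)
    return n * (n + 2) * q
```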

Since t=14 is not too close to the values t=6 or t=36, the one-step prediction should give an adequate estimate of the number of breakfasts eaten on that day.

There is one other missing value to consider. There is no data recorded for Baker College on the Wednesday before Thanksgiving. Because fewer students ate breakfast across the colleges on the Wednesday before Thanksgiving, we make the judgment call that the best predictor for the number of breakfasts eaten on that day at Baker would be an average of the breakfasts eaten at the other colleges on that day.

Now that we have estimated the missing values at two of the colleges, we are prepared to examine the total number of breakfasts eaten at all of the colleges during the fall semester. We note that there are three days (November 23 through November 25) on which no breakfasts were served at any of the colleges. We use linear interpolation to estimate the totals for the Thursday and Friday of Thanksgiving. However, the linearly interpolated value for Saturday, November 25, is artificially high because far fewer people eat breakfast on Saturdays, so we instead fill in this value with the average of the totals for the other Saturdays.

Modeling the Fall 2000 Breakfast Data

Shown below is a plot of the number of breakfasts eaten across the colleges during the fall semester.

Before we fit a model, we examine the mean-differenced data. As stated before, we must use the zero-mean process to fit an ARIMA model. Shown below are plots of the ACF and PACF of the mean-differenced data.

We notice two things from these summary statistics: There is a seasonal component of period 6 in the ACF, and the PACF cuts off at lag=7, after exhibiting strong values at lags 6 and 7. We examine two competing models to explain this behavior:

i) An AR(7) to account for the cutting off of the PACF at lag=7

ii) A multiplicative seasonal auto-regressive model, AR(1)xAR(1)6. We arrive at this model by fitting an AR(1)6 to remove the seasonal component, examining the residuals, then noting that the PACF of the residuals cuts off at lag=1. (See graph of the ACF and PACF of the residuals from the purely seasonal model below.)

We fit both the AR(7) and the AR(1)xAR(1)6 to the mean-differenced data and compare the two models. Looking at the residuals and at the ACF and PACF of the residuals of the two models, we see that both appear to fit the data well; however, the distinguishing feature of the multiplicative seasonal AR model is that the P-values of its Ljung-Box Chi-Squared statistics fail to reject the adequacy of the model (see the two graphs below). We therefore believe the AR(1)xAR(1)6 to be the better fit. Intuitively, a seasonal model with period six makes sense because of the weekly nature of the data.
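Multiplying out the AR(1)xAR(1)6 operators gives x_t = phi*x_{t-1} + Phi*x_{t-6} - phi*Phi*x_{t-7} + e_t for the mean-differenced series. A sketch of the implied one-step predictor (in Python, with hypothetical coefficients rather than the values fitted in S-Plus) is:

```python
# One-step predictor implied by the multiplicative AR(1)xAR(1)_6 model:
#   xhat_t = phi*z[t-1] + Phi*z[t-6] - phi*Phi*z[t-7]
# applied to a zero-mean (mean-differenced) series z. The coefficients in
# any use below are hypothetical, not the fitted values from the paper.
def seasonal_ar_one_step(z, phi, Phi):
    preds = []
    for t in range(7, len(z)):
        preds.append(phi * z[t - 1] + Phi * z[t - 6] - phi * Phi * z[t - 7])
    return preds  # predictions for t = 7 .. len(z)-1
```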

Note the following Q-Q plot and histogram of the residuals from the seasonal model:

The values that deviate from the linear trend in the Q-Q plot and the heavy tails in the histogram indicate the problem of outliers. These can be explained by what we referred to previously as “holiday values.” If we had several semesters of data, we would hope to model these “holiday values”; however, with only two semesters of data, we cannot model this trend. We note the same problem when we plot the one-step prediction.

The first outlier at t=13 is due to an under-prediction caused by the low turnout on Labor Day (t=7). The next three outliers occur because of fall break. Time t=43 corresponds to the Monday of fall break; since fewer people ate breakfast on this Monday, the model over-predicts the number. The model also over-predicts the number eaten on the Tuesday of fall break (t=44), though by less, because of the low Monday value. The low values on that Monday and Tuesday then lead to under-predictions for the following Wednesday (t=45) and Monday (t=49). This analysis leads us to argue that a model that takes holidays into account would better explain the outliers.

Using the Fall 2000 Model to Explain the Spring 2001 Data

We now determine whether the multiplicative seasonal auto-regressive model fit to the fall 2000 data adequately predicts the spring 2001 data. To do this, we first adjust the spring 2001 data to account for the difference in the number of meal plans between semesters, multiplying each spring value by the ratio of fall meal plans to spring meal plans. We then compute a one-step prediction for the mean-differenced data and graph the one-step prediction plus the mean against the original data (first graph), along with the standardized residuals (second graph).
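The meal-plan adjustment and the residual standardization are both simple transformations; a Python sketch follows. The meal-plan counts (1826 fall, 1780 spring) come from the S-Plus code below; any series values passed in would be hypothetical.

```python
# Scale spring counts by the ratio of fall to spring meal plans (1826/1780,
# the counts given in the paper), then standardize one-step prediction
# errors: z_t = (x_t - xhat_t) / sqrt(var_t), roughly N(0,1) if the
# model fits. Illustrative sketch, not the S-Plus implementation.
FALL_PLANS, SPRING_PLANS = 1826, 1780

def adjust_spring(counts):
    return [c * FALL_PLANS / SPRING_PLANS for c in counts]

def standardized_residuals(actual, predicted, pred_var):
    return [(a - p) / v ** 0.5 for a, p, v in zip(actual, predicted, pred_var)]
```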

We note that the mean of the standardized residuals is -0.1367513, the variance is 0.9940432, and the histogram is roughly normal (see S-Plus code), supporting our assumption that the standardized residuals are roughly N(0,1).

The outliers in this model can be explained by the same reasoning applied to the one-step prediction for the fall data. The first outlier is the result of an over-prediction at t=63, which corresponds to the Thursday of spring recess, and the next outlier is the result of an under-prediction for the Friday of the following week. This under-prediction results from the low value on the Friday of spring recess, one week prior to that day. Other than these, only two values barely fall beyond 95th percentile confidence intervals. In all, four residuals out of 78, or 5.1%, fall outside the 95% confidence intervals, and we conclude that the model fits the data within two standard deviations of the mean.

Conclusion

We successfully fit a multiplicative seasonal AR model to the number of breakfasts eaten in the fall and used it to model the spring semester data within 95th percentile error bounds. The seasonal component of order 6 makes sense intuitively due to the weekly nature of the data. We explained the outliers as “holiday values,” which leads us to suggest that a model based on several semesters of data that takes into account the holiday trends would better explain the data. Despite these outliers, however, the model presented here explains the weekly trend within two standard deviations. We suggest that it could be a tool for Housing and Dining to use in deciding how much breakfast needs to be prepared on given days throughout the semester.

S-Plus Code

#The Weiss data is stored in weiss

#We use the functions kdescripc and karima.diag from class.

#First, we use the approx function in S-Plus to linearly interpolate the four missing points.

p <- rep(0,76)

for (i in 1:13) p[i] <- i

for (i in 14:35) p[i] <- i + 1

for (i in 36:61) p[i] <- i + 2

for (i in 62:76) p[i] <- i + 4

y <- na.omit(weiss)

approx(p, y, xout=c(14,37,64,65))

#Now we fit a model to the original Weiss data plus the interpolated values.

w<-rts(weissInterp[,1])

ts.plot(rts(w))

ww<-w-mean(w)

kdescripc(rts(ww),"Breakfast eating habits of students at Weiss")

# Based on the PACF value at lag=1, it appears that there is a strong AR(1) pattern.

#We fit an AR(1) model

weissar<-arima.mle(ww,model=list(order=c(1,0,0)))

weissard<-arima.diag(weissar, lag.max=50)

weissardk<-karima.diag(weissard$std.resid,weissar$sigma2,1,0,80)

#Now let's use that model to make a one-step prediction

onestepahead<-arima.filt(ww, model=weissar$model)

ts.plot(ww+mean(w), onestepahead$pred+mean(w),xlab="Days",ylab="Number of Breakfasts")

title(main="One-step Prediction")

legend(55,120,legend=c("Weiss Data","Prediction plus mean"),col=c(4,3),lty=c(1,6))

#The value of the one-step prediction plus mean at t=14 is 87.49, so we estimate the missing

#value to be 87.

#Baker college is missing data for the number of breakfasts eaten on the Wednesday before

#Thanksgiving. The Wednesday before Thanksgiving is a special case to consider because many

#students go home before Thanksgiving and thus aren't around to have breakfast that morning.

#I propose that the best way to determine the number of breakfasts eaten that day is to

#average the number eaten at the other colleges on that day. That average is 59.

#Now, the goal is to fit a model to the sum of each college's

#breakfasts on each day. Note that there are two estimated values: 87 breakfasts eaten

#at Weiss at t=16 (It was t=14 when I looked at Weiss data without Saturdays.), and 59

#breakfasts eaten at Baker for t=75.

#The data is stored by college in column vectors in allcolleges

total<-rowSums(allcolleges,na.rm=F)

#There are three missing values (corresponding to Thanksgiving Thursday and the following

#Friday and Saturday). These occur at t=76,77,78. Use approx for linear interpolation.

q<-rep(0,96)

for (i in 1:75) q[i]<- i

for (i in 76:96) q[i]<- i+3

r<-na.omit(total)

approx(q,r,xout=c(76,77,78))

#From this linear interpolation, it appears that the missing values would be

#506,539,572. Of course, this is very suspicious, because a Saturday value would not

#likely be so great as 572. In fact, because Saturday is a special case, it is reasonable

#to use the linearly interpolated values for time=76,77 and an average of the number of breakfasts

#eaten on the other Saturdays for t=78.

#The average for Saturday is 87.6 ~ 88.

#The estimated values for t=76,77,78 are 506,539,88 respectively.

total[76]<-506

total[77]<-539

total[78]<-88

#Now we plot the data

tot<-rts(total)

ts.plot(tot,xlab="Days", ylab="Number of Breakfasts")

title("Total Number of Breakfasts Eaten at All of the Colleges")

t<-tot-mean(tot)

#We plot the ACF and PACF of the mean-differenced data

plot(acf(t,lag=60))

plot(acf(t,lag=60,type="partial"))

#The PACF has strong values at lag=6, lag=7. Let's try fitting an AR(7) model and see what the residuals

#look like.

ar7<-arima.mle(t,model=list(order=c(7,0,0)),xreg=1)

ard<-arima.diag(ar7, lag.max=50)

ardk<-karima.diag(ard$std.resid,ar7$sigma2,7,0,80)

#Let's try fitting a seasonal AR model of period 6 and examine the residuals.

ars<-arima.mle(t,model=list(order=c(1,0,0),period=6),xreg=1)

arsd<-arima.diag(ars, lag.max=50)

arsdk<-karima.diag(arsd$std.resid,ars$sigma2,6,0,80)

#After that, there appears to be an AR(1) structure, based

#on the strong value of the PACF at lag=1, so let's try

#an AR(1)xAR(1)period=6 model.

arnew<-arima.mle(t,model=list(list(order=c(1,0,0)),list(order=c(1,0,0),period=6)),xreg=1)

arnewd<-arima.diag(arnew, lag.max=50)

arsnewk<-karima.diag(arnewd$std.resid,arnew$sigma2,6,0,80)

#We accept the AR(1)xAR(1)pd6 model

#Let's examine the one-step prediction for the fall data (since we will do the same for the spring)

onestep2<-arima.filt(t,model=arnew$model)

ts.plot(tot,rts(onestep2$pred+mean(tot,na.rm=T)),xlab="Days",ylab="Number of Breakfasts")

title("One-Step Prediction of Total Fall Breakfast Data")

legend(55,800,legend=c("Fall Data","Prediction plus mean"),col=c(4,3),lty=c(1,6))

#Now plot the standardized residuals and see if they are approximately N(0,1)

stdresid2<-rts(tot-(onestep2$pred+mean(tot)))/sqrt(onestep2$var.pred)

ts.plot(stdresid2,rts(1.96*rep(1,99)),rts(-1.96*rep(1,99)), xlab="Days",ylab="Standardized Residuals")

title("Standardized Residuals From One-Step Prediction For the Fall Data")

#Now, see if the model fit for the fall semester fits the data given for the spring semester.

#The data for the spring is stored in springSemester in columns by colleges

totalSpring<-rowSums(springSemester,na.rm=F)

#Adjust the spring data for the percentage change in meal plans

#These are the counts given for total meal plans in the fall and spring

#Fall 00 - 1826

#Spring 01 - 1780

totalSpringAdjusted<-rts(totalSpring*(1826/1780))

#The first value of the totalSpringAdjusted is NA since breakfasts started on a Tuesday,

#so we create a new vector to remove that value

tsa<-rep(0,86)

for (i in 1:86) tsa[i]<-totalSpringAdjusted[i+1]

onestep<-arima.filt(tsa-mean(tsa,na.rm=T),model=arnew$model)

ts.plot(tsa,rts(onestep$pred+mean(tsa,na.rm=T)),xlab="Days",ylab="Number of Breakfasts")

title("One-Step Prediction of Spring Data Using Fall Model")

legend(55,730,legend=c("Spring Data","Prediction plus mean"),col=c(4,3),lty=c(1,6))

#Now plot the standardized residuals and see if they are approximately N(0,1)

stdresid<-rts((tsa-(onestep$pred+mean(tsa,na.rm=T)))/sqrt(onestep$var.pred))

ts.plot(stdresid,rts(1.96*rep(1,86)),rts(-1.96*rep(1,86)), xlab="Days",ylab="Standardized Residuals")

title("Standardized Residuals From One-Step Prediction For the Spring Data")

hist(stdresid)

mean(stdresid,na.rm=T)

var(stdresid,na.method="a")