The data set “foxbooks.txt” shows:
- The number of customers recorded for 400 days, not in any order. This is what we want to predict.
- The temperature recorded that day, in Fahrenheit.
- The humidity recorded that day (humidity can be over 100%).
- The number of advertisements per hour paid for that day.
- The budget for in-store marketing (determined by a computer in dollars each day).
- Whether a “Childrens” hour was held at the bookstore.
- Whether the cashier was “patient” or “mean”, as well as whether they were “Fast” or “slow”.
- The number of treats given to customers for free
- The number of books that are marked at a discount price that day
- There is a data point that is clearly an error.
- Fix that error first.
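One way to hunt for an entry error is to check the range of every column and then pull out the row that breaks it. The sketch below uses toy data invented for illustration (not the foxbooks values). The planted bad value, the row number, and the cutoff of 150 are all assumptions.

```r
# Toy data invented for illustration; these are not the foxbooks values.
set.seed(2)
toy <- data.frame(temperature = round(rnorm(400, 60, 10)),
                  customers   = rpois(400, 50))
toy$temperature[137] <- 6000          # planted entry error (hypothetical)

summary(toy)                          # min and max expose impossible values

# The cutoff is a judgment call; pick one that no real value could exceed.
bad <- which(toy$temperature > 150)
toy[bad, ]                            # inspect the suspicious row(s)

# One possible fix: drop the row. Correcting an obvious typo is another option.
toy_clean <- toy[setdiff(seq_len(nrow(toy)), bad), ]
nrow(toy_clean)
```

Whether to drop the row or correct the value depends on what the error looks like, so state in your answer which one you chose.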
- Then create the model Customers ~ temperature + humidity + adnum + marketing + childrens + cashier + treats + discounts
- Look at the plots of fit$residuals against each of the other variables
- In the residual plots you should see “bowtie/fan” shapes indicating numerical interactions.
- Fix those bowties by including the appropriate interaction(s)
- Which variables were interacting?
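As a rough illustration of what a fan shape looks like and how an interaction term removes it, here is a sketch on toy data. The names x1, x2, and y are hypothetical and say nothing about which foxbooks variables interact.

```r
# Toy data invented for illustration; x1, x2, y are hypothetical names.
set.seed(1)
n  <- 400
x1 <- runif(n, 0, 10)
x2 <- runif(n, 0, 10)
y  <- 5 + 2 * x1 + 3 * x2 + 1.5 * x1 * x2 + rnorm(n)
toy <- data.frame(y, x1, x2)

# Additive model: the missing x1:x2 term shows up as a fan in the residuals.
fit0 <- lm(y ~ x1 + x2, data = toy)
plot(toy$x1, fit0$residuals, main = "Residuals vs x1 (no interaction)")

# x1 * x2 expands to x1 + x2 + x1:x2, so both main effects are kept.
fit1 <- lm(y ~ x1 * x2, data = toy)
plot(toy$x1, fit1$residuals, main = "Residuals vs x1 (with interaction)")
```

In the full model the same idea applies: add the product of the two suspected variables to the formula and check whether the fan flattens into an even band.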
- Run the model that still has all the variables including the interactions that you need from part A
- Now in the residuals there is a variable which shows curvature
- Fix the curvature by including the appropriate polynomial (with the linear term)
- Which variable was it, and how high did the polynomial need to go?
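A sketch of fixing curvature with a polynomial term, again on toy data with hypothetical names x and y. The true degree here is 2 by construction, which may not match the degree needed for foxbooks.

```r
# Toy data invented for illustration; x and y are hypothetical names.
set.seed(3)
x <- runif(400, 0, 10)
y <- 10 + 4 * x - 0.5 * x^2 + rnorm(400)
toy <- data.frame(x, y)

fit_lin <- lm(y ~ x, data = toy)
plot(toy$x, fit_lin$residuals)    # arch shape: curvature left in the residuals

# I() protects ^ inside a formula; the linear term x stays in the model.
fit_quad <- lm(y ~ x + I(x^2), data = toy)
plot(toy$x, fit_quad$residuals)   # should now look like a flat band

# If curvature remains, try the next degree and compare the two fits.
fit_cub <- lm(y ~ x + I(x^2) + I(x^3), data = toy)
anova(fit_quad, fit_cub)
```

Stop raising the degree once the residual plot is flat and the added term no longer improves the fit.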
- Run the model with all the variables, the interactions from part A, and the curvature from part B
- In the residual plot you should see indications of a categorical interaction
- There are two categorical variables: childrens has two categories, cashier has four
- You should be able to tell which variables interact just by these plots
- Plot the residuals with the argument col=data$childrens or col=data$cashier to spot it
- Fix the pattern in the residuals by including the correct interaction
- Which two variables needed to interact?
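The coloring trick can be sketched as follows on toy data, where x is a numerical variable and grp a two-level category (both hypothetical names). Note that col= expects a factor or numbers. If your column was read in as text, wrapping it in as.factor() is one way around the "invalid color name" error.

```r
# Toy data invented for illustration; x, grp, y are hypothetical names.
set.seed(4)
n   <- 400
x   <- runif(n, 0, 10)
grp <- factor(sample(c("no", "yes"), n, replace = TRUE))
y   <- 2 + ifelse(grp == "yes", 3, 1) * x + rnorm(n)
toy <- data.frame(y, x, grp)

# Without the interaction, each color trends in its own direction.
fit0 <- lm(y ~ x + grp, data = toy)
plot(toy$x, fit0$residuals, col = toy$grp)

# x * grp gives each category its own slope.
fit1 <- lm(y ~ x * grp, data = toy)
plot(toy$x, fit1$residuals, col = toy$grp)   # colors should now overlap
```

When the colors separate in the plot against one numerical variable but not the others, that variable and the coloring category are the likely pair.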
- Run the model with all the variables and the needed pieces from parts A, B, and C
- The obvious patterns are gone. Look to see if there are any subtle patterns to include
- Include whatever is needed until the residuals are as good as they can be
- Go to the summary. There will be several terms that are not significant
- Find a final model by removing terms that you believe are not needed in the model
- Note there is more than one correct answer here
- Copy and paste the output from summary(fit)
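One possible workflow for trimming terms is to drop them one at a time with update() and compare the two models. The sketch uses toy data in which the column noise is unrelated to y by construction. All names are hypothetical.

```r
# Toy data invented for illustration; all names are hypothetical.
set.seed(5)
n   <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + noise, data = toy)
summary(full)$coefficients        # the last column holds the p-values

# Drop one term, then check that the fit did not get meaningfully worse.
reduced <- update(full, . ~ . - noise)
anova(reduced, full)
summary(reduced)
```

A common convention, assumed here rather than stated in the assignment, is to keep a variable's main effect and lower-order terms whenever one of its interactions or higher powers stays in the model.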
- Let’s make sure you know how to create the appropriate graphs you will need for the report
- For the interaction you found in Part A
- Graph one of the variables with three different lines based off the second variable
- Graph the second variable with three different lines based off the first variable
- For the curved variable from part B
- Graph the curve (without the data and zoomed in to show the curve)
- For the interaction in part C
- Graph the numerical variable with different lines for each category
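A minimal sketch of the "three lines" graph: predict over a grid of one variable while holding the other at a low, middle, and high value. The toy data and the names x1, x2 are hypothetical, and the 10th, 50th, and 90th percentiles are just one reasonable choice. In the full model, newdata must also contain every other predictor, held at fixed values.

```r
# Toy data invented for illustration; x1, x2, y are hypothetical names.
set.seed(6)
n   <- 400
toy <- data.frame(x1 = runif(n, 0, 10), x2 = runif(n, 0, 10))
toy$y <- 5 + 2 * toy$x1 + 3 * toy$x2 + 1.5 * toy$x1 * toy$x2 + rnorm(n)
fit <- lm(y ~ x1 * x2, data = toy)

# Grid for x1, and three representative values of x2.
x1_grid <- seq(min(toy$x1), max(toy$x1), length.out = 100)
x2_vals <- quantile(toy$x2, c(0.1, 0.5, 0.9))

# One column of predictions per x2 value.
pred <- sapply(x2_vals, function(v)
  predict(fit, newdata = data.frame(x1 = x1_grid, x2 = v)))

matplot(x1_grid, pred, type = "l", lty = 1, col = 1:3,
        xlab = "x1", ylab = "predicted y")
legend("topleft", legend = paste("x2 =", round(x2_vals, 1)),
       col = 1:3, lty = 1)
```

The same predict-over-a-grid approach can draw the curve for a polynomial term, or one line per category if newdata holds a categorical variable at each of its levels in turn.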
- Let us prove you have a good model by making a prediction
- For the day June 4th
- Temperature will be 47°
- Humidity will be 75%
- There will be one ad per hour
- The marketing budget will be 20
- There will be children’s hour at the bookstore
- A patient, fast cashier will be at the front.
- There will be 100 treats available
- There will be 40 books offered at a discount
- How many customers do you predict will come?
- How much can you reasonably expect your answer to be off by?
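One common way to answer both questions at once is predict() with a prediction interval. The sketch below uses a toy model with hypothetical names x and grp. For the real prediction, newdata needs one column per predictor in your final model, spelled exactly as in the data.

```r
# Toy data invented for illustration; x, grp, y are hypothetical names.
set.seed(7)
n   <- 400
toy <- data.frame(x   = runif(n, 0, 10),
                  grp = factor(sample(c("no", "yes"), n, replace = TRUE)))
toy$y <- 20 + 3 * toy$x + 5 * (toy$grp == "yes") + rnorm(n, sd = 2)
fit <- lm(y ~ x + grp, data = toy)

# One row describing the day to predict.
new_day <- data.frame(x = 4, grp = "yes")

# Columns: fit (point prediction), lwr and upr (95% prediction interval).
p <- predict(fit, newdata = new_day, interval = "prediction", level = 0.95)
p

# Residual standard error: the typical size of a miss on a single day.
summary(fit)$sigma
```

The half-width of the interval, or roughly twice the residual standard error, is a reasonable figure to report for how far off the prediction might be.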
Bonus for interested students: The variable cashier has some added tricks that you are not required to catch but do help explain the variability in cashier levels.