Statistics 378 – Applied Regression Analysis

Serious Crime Analysis in Metropolitan U.S. Regions (1976-1977)

Terence Fung

Kamal Harris

Colin Sproat

Prit Phul


Table of Contents

Introduction

Data Explanation

Conclusion

Best Model

Interpretation of the model

Confidence Interval

Correlation

Importance of Variables

Shortcomings of Model

Methodology

Data Splitting

Checking assumptions

Independence

Normality

Homoscedasticity

Fitting the best model

Collinearity

Selection Method

Interaction Terms

Final Model and influence

Fitting model to prediction data

Appendix

Correlation

Interpretation of best model

Collinearity

Selection Method

Influential Points – Training Group

Fitted Model - Holdout Group

Collinearity – Holdout Group

Influential Points – Holdout Group


Introduction

Information regarding 141 Standard Metropolitan Statistical Areas (SMSAs) in the United States is provided in Neter et al. (1996) [1]. The data set contains 11 variables that pertain to each metropolitan area between the years 1976 and 1977. We would like to determine the relationship between our regressors and serious crime, and the importance of each regressor, through the least-squares regression method. By finding this relationship, we want to see how well our model explains serious crime in any given metropolitan area. The goal of the study is to find the best model for estimating the crime rate of a metropolitan area.

Data Explanation

We defined serious crime as our dependent variable. Unlike many of the other variables, serious crime is a random variable that can potentially be predicted from the other variables through the least-squares method. The regressors X2, X3, X4, X5, X6, X7, X8, X9, and X10 correspond to their variable numbers in the given data set. Variable X12 is a categorical (indicator) term for geographic region. To deal with this, we introduced two extra indicator variables, X13 and X14, so that region membership is coded by binary terms; this way the model is not influenced by arbitrary numeric codes for the regions. The coding is as follows: if the area is in NE, then (X12, X13, X14) = (1, 0, 0); if the area is in NC, then (X12, X13, X14) = (0, 1, 0); if the area is in S, then (X12, X13, X14) = (0, 0, 1); if the area is in W, then (X12, X13, X14) = (0, 0, 0).
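As an illustration, this coding can be constructed in R, the package used elsewhere in this report (a minimal sketch; the data frame name smsa and the column name region are assumptions, not names from the original data set):

    smsa$X12 <- as.numeric(smsa$region == "NE")  # 1 if North-East, else 0
    smsa$X13 <- as.numeric(smsa$region == "NC")  # 1 if North-Central, else 0
    smsa$X14 <- as.numeric(smsa$region == "S")   # 1 if South, else 0
    # The West (W) region is the baseline: (X12, X13, X14) = (0, 0, 0)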

Conclusion

Best Model

Interpretation of the model

From the residual analysis of our data, the best fit model for these 141 observations is:

yp = -2.70029 + 1123.22515 tX2 - 26.94736 tX6 + 0.1153 c8 - 229.5947 c10 - 0.44122 X12 - 0.18717 X13 + ε.

It should be noted that this model predicts the number of crimes in an area on the transformed scale; it will not provide an exact count. An increase of 1 unit in tX2 (the inverse of the land area squared) increases yp by 1123.22515 with all other variables held constant. An increase of 1 unit in tX6 (the inverse of the number of active physicians in the area) decreases yp by 26.94736 with all other variables held constant. An increase of 1 unit in c8 (the centered percentage of high school graduates in the area) increases yp by 0.1153 with all other variables held constant. An increase of 1 unit in c10 (the centered inverse of the total personal income in the area) decreases yp by 229.5947 with all other variables held constant. If the observation lies in the NE region, yp is expected to decrease by 0.44122, and if it lies in the NC region, yp is expected to decrease by 0.18717.

The value of R2 for the complete data set is 0.7676: 76.76% of the variability in y is explained by the regression model. Adding variables to a model can only increase R2, so R2 is biased toward models with more variables. A better indicator is R2adj, which adjusts for the number of variables in the model. R2adj for the complete data set is 0.6746, meaning 67.46% of the variability in y is explained after adjusting for model size. R2adj helps safeguard against over-fitting: if we had added unnecessary terms to the model, R2adj would decrease noticeably, which is not the case here.

The mean squared error (MSE) is 0.04999, the lowest among all of the models available to us via the all-possible-regressions method. MSE is an unbiased estimator of the error variance when the errors are normally distributed. The smaller the MSE, the smaller the estimated variance of the model, which is desirable.

Mallow's Cp for our model equals 7. When Cp falls close to p, the regression equation shows minimal bias. A small Cp is preferable, but the model must contain enough terms to be a good estimator. Cp always equals p for the full model, so this statistic alone was of limited use in identifying our best model.

Confidence Interval

If yp = ln((y/(X3*1000))/(1 - y/(X3*1000))), then y = X3*1000*exp(yp)/(1 + exp(yp)). To find a 95% confidence interval for the data set, we first found the 95% confidence interval for the mean of each observation and took the average of the lower and upper bounds. Then, by using the mean of the X3 values in place of X3, we converted this into a 95% confidence interval for the number of serious crimes in an area. The 95% CI for yp was (-3.6664, -2.2564). Considering multiple areas at a time and the total number of serious crimes, the 95% CI was (45379.02, 56762.94).
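A sketch of this back-transformation in R (the CI bounds are those reported above; the column name smsa$X3, population in thousands, is an assumption):

    # inverse of the logit transformation: y = X3*1000*exp(yp)/(1+exp(yp))
    back_transform <- function(yp, X3) X3 * 1000 * exp(yp) / (1 + exp(yp))
    ci_yp <- c(-3.6664, -2.2564)          # averaged 95% CI bounds for yp
    back_transform(ci_yp, mean(smsa$X3))  # 95% CI on the original crime scale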

Correlation

A correlation matrix was generated to see which variables are correlated with one another (see Appendix – Correlation). The matrix shows that only c10 and tX6 (the transformed total personal income versus the transformed number of active physicians) are strongly positively correlated. This relationship is plausible because physicians generally earn higher personal incomes; as a result, the more physicians in an area, the higher the area's total personal income is expected to be. Apart from this pair, we did not find any strongly correlated variables.
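A sketch of this screen in R, assuming a data frame named regressors holding the transformed and centered variables and an assumed cutoff of |r| > 0.8 for "strong" correlation:

    cm <- cor(regressors)                                 # pairwise correlation matrix
    which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)  # flag strongly correlated pairs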

Importance of Variables

Through the forward and stepwise selection methods, X12 (NE region) was found to be the most important variable for indicating serious crime. The second most important variable is tX6, the transformed number of active physicians in the area. Based on our model, if an area is located in the NE region, its predicted number of serious crimes is lower than for an otherwise identical area in the baseline (W) region. A similar relationship holds for the number of physicians in the area.

Shortcomings of Model

Some of our observations were determined to be influential points, so our model must be used with caution. Influential points exert great leverage on the fit, such that the model is driven by 5 or 6 observations rather than the entire data set. We found that the influential points had a large impact on β1, the coefficient corresponding to the transformed land area: the difference between β1 fitted in the holdout group with and without the influential points was 13178. Normally, robust regression would be performed to correct for influential points; robust regression decreases their effect on the model and is far more resilient to outliers than the ordinary least-squares method.

Methodology

Data Splitting

Our data set was split into two groups, training and holdout, using the DUPLEX algorithm. The method begins by standardizing each regressor: zij = (xij - x̄j)/Sjj^0.5, where Sjj = Σi (xij - x̄j)^2. Then, using the Cholesky decomposition, we obtained Z'Z = T'T, where T is an upper triangular matrix, and defined W = ZT^-1, a matrix with orthogonal columns and unit variance. The W matrix was calculated using the R statistical package and then entered into Microsoft Excel, where we calculated the Euclidean distances between all pairs of data points, C(141, 2) = 9,870 distances in total. We found the maximum distance to be between observations 1 and 27, and placed those two observations in the training group. After deleting that pair, the next largest distance was between observations 4 and 7, which we placed in the holdout group. From then on, the point farthest from the observations already in the training group was added to the training group; this was done by summing the distances from all prior training-group observations to each candidate observation. That observation is then removed from the distance matrix, and the same step is performed for the holdout group. Since our data set is relatively small, we placed only 25 observations in the holdout group; the remaining 116 data points went into the training group.
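The standardization and orthogonalization steps can be sketched in R as follows (X is assumed to be the 141 × p matrix of regressors; this reproduces the construction of W and the distance matrix, not the full alternating assignment):

    Z  <- scale(X, center = TRUE, scale = FALSE)  # subtract column means
    Z  <- sweep(Z, 2, sqrt(colSums(Z^2)), "/")    # zij = (xij - xbar_j)/Sjj^0.5
    Tm <- chol(t(Z) %*% Z)                        # Z'Z = T'T, Tm upper triangular
    W  <- Z %*% solve(Tm)                         # orthogonal columns, unit variance
    D  <- as.matrix(dist(W))                      # all C(141, 2) = 9,870 distances
    which(D == max(D), arr.ind = TRUE)            # farthest pair seeds the training group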

The observations entered into our holdout group (in ascending order) were: 2, 3, 4, 7, 10, 12, 15, 17, 19, 24, 42, 48, 59, 68, 73, 78, 92, 107, 112, 113, 124, 133, 137, 138, and 139.

The remaining observations were placed into our training group.

To check the validity of the split, we calculated the ratio of the determinants of the two groups' X'X matrices, raised to the power 1/p = 1/13.

The determinants generated by the R statistical package were 142.1728 for the training group and 127.9630 for the holdout group. Applying these to the equation, we found a value of 2.983374. For an even data split this value should be close to 1; because we placed only 25 observations in the holdout group, a value under 3 is acceptable.
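Under the determinant-ratio form of this check (an assumption about the exact equation the report used), the computation in R would look like the following, where X_train and X_hold are assumed matrices of the 13 regressor columns for each group:

    # (|X'X|_train / |X'X|_hold)^(1/p), with p = 13
    ratio <- (det(crossprod(X_train)) / det(crossprod(X_hold)))^(1/13)
    ratio  # values near 1 indicate a balanced split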

Checking assumptions

Independence

In order to perform least-squares linear regression we must account for three assumptions: independence, normality, and homoscedasticity. The variables in one observation should have no effect on the variables of other observations; assuming the standard metropolitan areas were selected randomly, we can say that the observations are independent of each other. In addition, the least-squares errors of the model should be uncorrelated and normally distributed: when the errors are uncorrelated, their covariance is zero, and under normality this implies independence.

Normality

The split data set was entered into SAS. We must check the data for normality and constant variance; without these assumptions we cannot build a valid model.

It is clear from this graph that the normality assumption does not hold: the residuals show a light-tailed distribution. To fix this, we will have to perform a transformation on Y.

The transformation we used was: yp = ln((y/(X3*1000))/(1-(y/(X3*1000)))).

The variable X3 (total population) is given in thousands, so we multiplied it by 1,000. Y is the number of serious crimes in the area; a more natural quantity to model is serious crime relative to the population of the area. Serious crime is a random variable that follows a Poisson distribution, a discrete distribution on [0, ∞). The transformation converts this toward a normal distribution, which is continuous on (-∞, ∞).

Y* = y/(X3 * 1000) is a proportion, hence its range is [0, 1]. Dividing Y* by 1 - Y* gives Y*/(1 - Y*), which has range [0, ∞). Taking the natural logarithm gives yp = ln(Y*/(1 - Y*)); since ln(0) = -∞, the range of yp is (-∞, ∞), the same as a normal distribution.
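A sketch of this transformation in R (y is the serious-crime count and X3 the population in thousands; the variable names are assumed):

    Ystar <- y / (X3 * 1000)        # proportion of the population, in [0, 1]
    yp <- log(Ystar / (1 - Ystar))  # logit: maps (0, 1) onto (-Inf, Inf)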

Once we transformed y, we again checked the normality of the model:

It is clear from this graph that the normality assumption holds, because the points follow a linear pattern. Next we check for homoscedasticity.

Homoscedasticity

[Plot of r1 (residuals) vs. p1 (predicted values)]

Homoscedasticity does not hold for this data set: we want the observations in this graph to lie within a horizontal band, and they do not. To achieve this, transformation of our variables is required.

Looking at the residuals versus regressor graphs, the following transforms were made:

tX2 = 1/X2²

tX3 = 1/X3

tX6 = 1/X6

tX7 = 1/X7

tX9 = 1/X9

tX10 = 1/X10

where tX* denotes the transformed variable.

[Plot of r1 vs. p1 after the transformations]

This graph shows that homoscedasticity holds for the model after the transformations were made. This also implies that the expected value of the least-squares error is zero, which is another assumption that must hold.
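A sketch of the transformations and the residual check in R (the model formula is illustrative, not the exact full model, and all variables are assumed to be in the workspace):

    tX2 <- 1 / X2^2; tX3 <- 1 / X3; tX6 <- 1 / X6
    tX7 <- 1 / X7;   tX9 <- 1 / X9; tX10 <- 1 / X10
    fit <- lm(yp ~ tX2 + tX3 + tX6 + tX7 + X8 + tX9 + tX10 + X12 + X13 + X14)
    plot(fitted(fit), rstudent(fit))  # want a horizontal band around zero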

Fitting the best model

Now that our assumptions are satisfied, we may begin fitting our best model.

Collinearity

Before selecting our best model, we must check for collinearity between the variables (see Appendix – Collinearity – Training Group). We found that tX3, tX6, X8, and tX10 had collinearity problems. To fix this, we centered the variables by subtracting each variable's mean from its observed values (c* = X* - mean(X*)):

c3 = tX3 - 0.0021298;

c8 = X8 - 54.35;

c9 = tX9 - 0.0047905;

c10 = tX10 - 0.000347226;

After centering, the collinearity problems were removed, and we may now try to find the best model.
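In R, the centering step is simply the following (the comments show the means reported above):

    c3  <- tX3  - mean(tX3)   # mean(tX3)  = 0.0021298
    c8  <- X8   - mean(X8)    # mean(X8)   = 54.35
    c9  <- tX9  - mean(tX9)   # mean(tX9)  = 0.0047905
    c10 <- tX10 - mean(tX10)  # mean(tX10) = 0.000347226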

Selection Method

We performed the forward, backward, and stepwise selection methods to remove variables not deemed important in predicting yp (see Appendix – Selection Method).

Forward Method: tX2, tX6, c8, X12, X13.

Backward Method: tX2, c8, c10, X12, X13.

Stepwise Method: tX2, tX6, c8, X12, X13.

After reducing the model, we performed the all-possible-regressions method on tX2, tX6, c8, c10, X12, and X13, using R2, Cp, R2adj, and MSE as criteria to select the best model. R2_0 is the cutoff for an R2-adequate subset: any R2 value under R2_0 is considered unusable. It is given by R2_0 = 1 - (1 - R2_k+1)(1 + d), where d = k·F(α; k, n-k-1)/(n - k - 1). At the α = 0.05 significance level, d ≈ 0.121682, so R2_0 = 1 - (1 - 0.5074)(1 + 0.121682) = 0.447459; any R2 value below this is deemed unacceptable. MSE should be at its minimum and not increasing. Mallow's Cp should be as close to p as possible, since Cp helps determine how many variables to use. We found that the model could not be narrowed down further, with R2 = 0.5074, Cp = 7, R2adj = 0.4803, and MSE = 0.05081 (see Appendix – Selection Method).
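The cutoff computation can be reproduced in R (a sketch assuming n = 116 training observations and k = 6 regressors):

    n <- 116; k <- 6; alpha <- 0.05
    d <- k * qf(1 - alpha, k, n - k - 1) / (n - k - 1)  # the report quotes d ~ 0.121682
    R2_0 <- 1 - (1 - 0.5074) * (1 + d)                  # ~ 0.4475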

Interaction Terms

We then investigated whether interaction terms could be added to give a model with better estimation. We used our judgment to include interaction terms we felt were logical: 'Number of physicians' with 'Total personal income' (x610, the tX6 × c10 interaction) and 'Percent of high school graduates' with 'Total personal income' (x810, the c8 × c10 interaction). The all-possible-regressions method was performed on the models with interaction terms. For the 8-variable model we found R2 = 0.5516, Cp = 9, R2adj = 0.5181, and MSE = 0.04711; for the 7-variable model with x610 added, R2 = 0.5282, Cp = 8, R2adj = 0.4983, and MSE = 0.04905; and for the other 7-variable model with x810 added, R2 = 0.5259, Cp = 8, R2adj = 0.4952, and MSE = 0.04935. Since none of the three candidate models differ meaningfully from our original best model, no interaction terms were added.

Final Model and influence

We found that our best model from the training group was:

yp = -2.69327 + 1106.12535 tX2 - 24.47901 tX6 + 0.01299 c8 - 332.16264 c10 - 0.45876 X12 - 0.20684 X13 + ε.

This model must now be tested for influential points in the training group. If influential points are present, the model must be used with caution, because only a few observations then drive the fit. We used six diagnostics to check for possible influential points: R-student, hii, Cook's D, DFFITS, the covariance ratio, and DFBETAS (see Appendix – Influential Points – Training Group).

R-student detects outliers: |R-student| > 3 indicates an outlier, while |R-student| > 2 indicates a possible outlier. Observations flagged under the R-student criterion were: 1, 42, 49, 66, 85, 88, 106, 116.

Cook's D measures the squared distance between the least-squares estimates based on all points and the estimates obtained after deleting the i-th point. Points have considerable influence on the least-squares estimates if Cook's D > 1. The observation flagged under Cook's D was: 49.

hii is the i-th diagonal element of the hat matrix; these values range over 0 < hii < 1 and measure the distance of the i-th observation from the center of the x-space. Observations far from the center have significant leverage over the fit. The criterion is that values above 2p/n = 2(7)/116 = 0.1207 indicate leverage points. The observations flagged under hii were: 49, 116.

The covariance ratio expresses the effect of the i-th observation on the precision of the estimates. A covariance ratio outside the range 1 ± 3p/n = 1 ± 3(7)/116 = [0.8190, 1.1810] indicates that the i-th data point has large influence. Observations flagged by the covariance ratio were: 1, 49, 66, 85, 106.

DFFITS is the value of R-student scaled by the leverage of the i-th observation; if a data point is an outlier, its large R-student value produces a large DFFITS. |DFFITS| > 2√(p/n) = 2√(7/116) = 0.4913 flags a possible influential point. Observations flagged by DFFITS were: 1, 42, 49, 66, 85, 106, 107, 112, 116.

DFBETAS indicates how much each regression coefficient changes, in standard-deviation units, when the i-th observation is deleted. |DFBETAS| > 2/√n = 2/√116 = 0.1857 flags possible influential points. Observations flagged under DFBETAS were: 1, 42, 49, 66, 85, 106, 116.
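All six diagnostics are available in base R for a fitted lm object (a sketch; fit is assumed to be the training-group model):

    rstud <- rstudent(fit)        # |value| > 2 suggests a possible outlier
    lev   <- hatvalues(fit)       # flag h_ii > 2p/n = 0.1207
    cd    <- cooks.distance(fit)  # flag values > 1
    dff   <- dffits(fit)          # flag |value| > 2*sqrt(p/n)
    cvr   <- covratio(fit)        # flag values outside 1 +/- 3p/n
    dfb   <- dfbetas(fit)         # flag |value| > 2/sqrt(n)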

Using these criteria, we judged observations 1, 49, 66, 106, and 116 to be potential influential points.

To test whether these observations are influential points rather than mere outliers, we removed them from the training data and refitted the least-squares model to see whether any significant change occurred; a large change in the fitted model indicates that the removed values are influential. After removing them, our model changed to

yp = -2.64213 - 4110.19878 tX2 - 57.57935 tX6 + 0.01022 c8 - 129.71397 c10 - 0.47682 X12 - 0.18943 X13 + ε.

The change in β1 shows that the model changes substantially when the influential points are removed from the original training-group data. Since our model was built with the influential points included, it must be interpreted with caution.

Fitting model to prediction data

Now that we have found our best model using the estimation (training) data, we must check its validity by fitting the model to the prediction (holdout) data set. For the centering transforms of X8 and X10, we used the means of the X8 and X10 columns from the prediction data set. We then checked the model for collinearity and found that the condition number (CN) for the prediction data was 11.19037, well below the value of 30 that indicates collinearity; collinearity in the prediction set is therefore not a concern.

To see how well our model predicts the holdout data, we compared R2 and R2adj between the groups. If the absolute difference between the estimation R2 and the prediction R2 is greater than 0.1, the model is not adequate for predicting serious crime in the holdout group. We found that the difference for R2 was 0.6544 - 0.5646 = 0.0898 and for R2adj was 0.5393 - 0.5392 = 0.0001, so our model appears adequate for predicting serious crime in the holdout group.
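One way to carry out this comparison in R (a sketch; holdout is an assumed data frame containing the transformed response yp and the same columns the training model fit was built from):

    pred <- predict(fit, newdata = holdout)       # training model applied to holdout data
    R2_pred <- 1 - sum((holdout$yp - pred)^2) /
                   sum((holdout$yp - mean(holdout$yp))^2)
    abs(0.6544 - R2_pred) < 0.1                   # the 0.1 rule of thumb used above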

The best fit model in the holdout group is

yp = -2.80488 + 23394 tX2 + 26.66752 tX6 + 0.00974 c8 - 340.22446 c10 - 0.5151 X12 - 0.10199 X13 + ε.

Next, the holdout group must be tested for influential points (see Appendix – Influential Points – Holdout Group).

From these criteria, we felt that observations 7, 15, and 22 could be potential influential points in the holdout group, so we removed them to see how the model was affected. After removing these observations, the new model for the holdout group is

yp = -3.01936 + 36572 tX2 + 179.4668 tX6 + 0.00688 c8 - 1116.6124 c10 - 0.6844 X12 - 0.19492 X13 + ε.

Again, we found significant changes in the coefficients between the two models; hence we conclude that observations 7, 15, and 22 are influential points and that this model must be used with caution.