Multiple Linear Regression ~ Illicit Drug Abuse in U.S.

Data File: Illicit_Drug_Abuse.JMP
Background: This dataset contains data that I collected on % Illicit Drug Abuse and several other factors that may or may not influence it. The data come from two web sites: the Statistical Abstract of the United States and the Substance Abuse and Mental Health Services Administration.
Variables:
* State: Name of State
* % Illicit Drug Abuse: Percent who have used any illicit drug at least once in the past month {Any Illicit Drug indicates use at least once of marijuana/hashish, cocaine (including crack), inhalants, hallucinogens (including PCP and LSD), heroin, or any prescription-type psychotherapeutic used nonmedically}
* % Binge Drinking: Percent who have participated in binge drinking at least once in the past month {"Binge" Alcohol Use is defined as drinking five or more drinks on the same occasion on at least 1 day in the past 30 days. By "occasion" is meant at the same time or within a couple of hours of each other}
* % Poverty: % of population below poverty level
* % HS Dropout: Percent who are not in regular school and have not completed 12th grade or received a GED
* Per Capita Income: Personal Per Capita Income (in Dollars)
Goal: Investigate the relationship between % Illicit Drug Abuse (Y) and the remaining variables, i.e. % Binge Drinking (X1), % Poverty (X2), % HS Dropout (X3), and Per Capita Income (X4).

Assumptions:

1) The response variable (Y) can be modeled using the Xs in the following form:

   E(Y | X1, X2, X3, X4) = β0 + β1·X1 + β2·X2 + β3·X3 + β4·X4

2) The variability in the response variable (Y) must be the same for all specified values of the X variables, i.e.

   Var(Y | X1, X2, X3, X4) = σ²

3) The response measurements (Y) should be independent of each other.

4) The response measurements (Y) should follow a normal distribution.

5) You should also take the time to identify outliers. Outliers can be very problematic in any regression model.

We will examine how to check these assumptions near the end of this handout.

Correlations and the Scatter Plot Matrix:
(An initial investigation of the relationships between the variables)

Correlation grid


The correlations are obtained under Analyze > Multivariate. Select the Y variable first, then select all X variables to be used in the model.

Now, we want the correlations with the Y variable to be away from 0 because the hope is that these X variables allow us to better understand Y. That is, we hope that these X variables are related to, or influence, Y to some degree.

A "new" problem arises in multiple linear regression. The main goal of multiple linear regression is to describe Y to the best of our ability while using a model that is as simple as possible. There is absolutely no reason to include two X variables in a model if they are giving essentially the same information. A model with X variables that are strongly correlated (either positively or negatively) is said to be affected by multicollinearity, and this should be avoided!

Checking for Multicollinearity:

1. Check the pairwise correlations between all pairs of X variables.
2. Multicollinearity is present if two X variables have a correlation greater than 0.80 or less than -0.80.
3. To remove multicollinearity, we must leave one of the two correlated X variables out of the model. As a general rule, the X variable that has the lower correlation with Y should not be used in the model.
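The three steps above can also be sketched outside of JMP. The short Python example below is only an illustration: the function names and the data for six hypothetical states are made up, and it assumes "lower correlation with Y" means the weaker correlation in absolute value.

```python
from itertools import combinations
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def multicollinearity_report(y, xs, cutoff=0.80):
    """Flag every pair of X variables with |r| > cutoff and suggest dropping
    the one that is less correlated (in absolute value) with Y."""
    flagged = []
    for name_a, name_b in combinations(xs, 2):       # step 1: all pairs
        r = pearson(xs[name_a], xs[name_b])
        if abs(r) > cutoff:                          # step 2: the 0.80 rule
            r_ya = abs(pearson(y, xs[name_a]))
            r_yb = abs(pearson(y, xs[name_b]))
            drop = name_a if r_ya < r_yb else name_b # step 3: weaker one goes
            flagged.append((name_a, name_b, round(r, 3), drop))
    return flagged

# Made-up values for six hypothetical states (NOT the handout's data).
y = [6.1, 7.4, 5.2, 8.0, 6.6, 7.1]
xs = {
    "binge": [15.0, 19.2, 13.8, 21.5, 17.0, 18.1],
    "poverty": [14.0, 9.5, 16.2, 8.1, 12.3, 10.0],
    "income": [24000, 29500, 22800, 31000, 26000, 28700],
}
for a, b, r, drop in multicollinearity_report(y, xs):
    print(f"{a} vs {b}: r = {r}; consider dropping {drop}")
```

An empty report means no pair of X variables crossed the cutoff, which is the situation described for our data in the next section.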


Looking at our data:

In the output below a dotted line has been drawn across the correlation matrix. We want strong correlations above this line because these are correlations of X’s with the response Y. Below the dotted line is where we have the potential for multicollinearity. For this example, we do not have multicollinearity issues because none of the correlations below the line are above 0.80 nor are any below -0.80.

A visualization of the correlation matrix is the scatter plot matrix. You should never examine the correlation between two variables without looking at a scatter plot showing the relationship between them.


Fitting the model

We did not have any multicollinearity issues, so we can fit the model that was initially proposed. Recall that % Illicit Drug Abuse is our Y variable and the X variables are % Binge Drinking (X1), % Poverty (X2), % HS Dropout (X3), and Per Capita Income (X4). Select Analyze > Fit Model as follows:


Initial Regression Output:


Conclusion for this overall regression test:

Backward Selection:

Backward selection is the process of getting rid of variables that are NOT statistically useful in helping to describe Y. If an X variable, say X1, influences Y, then as X1 changes there is either an increase or a decrease in Y. In particular, if X1 influences Y, then the slope in the X1 direction is not zero. We should remove from the model any variable whose slope is not significantly different from zero.

Steps for backward selection:

1. Check either the Prob > F values in the Effect Tests table or the Prob > |t| values in the Parameter Estimates table, i.e. the p-values for testing whether or not each slope is zero.
2. If none of these p-values is greater than α (your pre-determined error rate), then STOP; you are finished with backward selection. If one or more of the p-values are greater than α, then select the variable with the largest p-value.
3. Refit the model excluding the variable with the largest p-value from step 2 and repeat step 1. Continue until all p-values are less than α.
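The loop described in these steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `backward_selection` and `fake_fit` are hypothetical names, α is taken to be 0.05, and the p-values are invented stand-ins for what a real refit (for example, the JMP Parameter Estimates table) would report.

```python
def backward_selection(predictors, fit_pvalues, alpha=0.05):
    """Drop the predictor with the largest p-value, refit, and repeat
    until no remaining p-value is greater than alpha.

    fit_pvalues(names) must refit the model using only those predictors
    and return a dict {name: p_value}.
    """
    remaining = list(predictors)
    while remaining:
        pvals = fit_pvalues(remaining)                          # step 1
        worst = max(remaining, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:                               # step 2: stop
            break
        remaining.remove(worst)                                 # step 3: refit
    return remaining

# Hypothetical stand-in for a real model fit: every p-value is invented.
FAKE_PVALUES = {
    frozenset("ABCD"): {"A": 0.01, "B": 0.62, "C": 0.20, "D": 0.02},
    frozenset("ACD"): {"A": 0.005, "C": 0.09, "D": 0.01},
    frozenset("AD"): {"A": 0.003, "D": 0.008},
}

def fake_fit(names):
    """Look up the invented p-values for this set of predictors."""
    return FAKE_PVALUES[frozenset(names)]

final = backward_selection(["A", "B", "C", "D"], fake_fit)
print(final)  # ['A', 'D']
```

Note that only one variable is removed per pass, because the p-values of the remaining variables change each time the model is refit.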


Backward selection for this problem:

Refitting the model without % Poverty

We are finished with backward selection because all remaining variables have slopes that are significantly different from zero. This is referred to as the "final" model. Provided the residual checks are satisfactory, this is the most appropriate model to use.

Determining the usefulness of the model:

Discussions:

The estimated regression model

Interpretations of Parameter Estimates

Interpretations:

Y-Intercept: -4.54%. This value is not meaningful because no state has a value of 0 for all three predictors!

Slope in the % Binge direction: .17*(20.1 – 16.4) = .17*3.7 = .63 %

STATE 1 / STATE 2
% Binge Drinking / 16.4% / 20.1%
% HS Dropout / 10.4% / 10.4%
Per Capita Income / $ 27081 / $ 27081

When comparing two states with 20.1% and 16.4% reporting binge drinking, respectively, and holding everything else constant, we estimate the percent of respondents reporting illicit drug use for the state with more binge drinking to be .63 percentage points higher.

More simply, holding everything else constant, a 1 percentage point increase in the binge drinking percent is associated with a .17 percentage point increase in the percent of illicit drug use.

Slope in the % HS direction: .21*(11.8 – 8.7) = .21*3.1 = .65 %

STATE 1 / STATE 2
% Binge Drinking / 18.5% / 18.5%
% HS Dropout / 8.7% / 11.8%
Per Capita Income / $ 27801 / $ 27801

When comparing two states with 11.8% and 8.7% HS dropouts, respectively, and holding everything else constant, we estimate the percent of respondents reporting illicit drug use for the state with more dropouts to be .65 percentage points higher. More simply, a 1 percentage point increase in the HS dropout rate is associated with a .21 percentage point increase in illicit drug use.

Slope in the Per Capita Income direction: .00014*(30000-25000) = .7%

STATE 1 / STATE 2
% Binge Drinking / 18.5% / 18.5%
% HS Dropout / 10.4% / 10.4%
Per Capita Income / $25000 / $30000

When comparing two states with per capita incomes of $30000 and $25000, respectively, and holding everything else constant, we estimate the percent of respondents reporting illicit drug use for the state with the higher per capita income to be .7 percentage points higher. More simply, a $1000 increase in the per capita income for a state is associated with a .14 percentage point increase in illicit drug use.
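The arithmetic behind these three interpretations can be checked with a short Python sketch. It uses the rounded intercept and slopes quoted above (-4.54, .17, .21, and .00014), so results from the unrounded JMP estimates would differ slightly. The function and variable names are hypothetical.

```python
# Rounded estimates as quoted in the interpretations above.
INTERCEPT = -4.54
SLOPES = {"binge": 0.17, "hs_dropout": 0.21, "income": 0.00014}

def predict(state):
    """Estimated % illicit drug use for a dict of predictor values."""
    return INTERCEPT + sum(SLOPES[name] * state[name] for name in SLOPES)

def difference(state_1, state_2):
    """Estimated change in Y when moving from state_1 to state_2."""
    return predict(state_2) - predict(state_1)

# Each pair differs in exactly one predictor, as in the tables above.
binge_pair = ({"binge": 16.4, "hs_dropout": 10.4, "income": 27081},
              {"binge": 20.1, "hs_dropout": 10.4, "income": 27081})
dropout_pair = ({"binge": 18.5, "hs_dropout": 8.7, "income": 27801},
                {"binge": 18.5, "hs_dropout": 11.8, "income": 27801})
income_pair = ({"binge": 18.5, "hs_dropout": 10.4, "income": 25000},
               {"binge": 18.5, "hs_dropout": 10.4, "income": 30000})

print(round(difference(*binge_pair), 2))    # 0.63 = .17 * 3.7
print(round(difference(*dropout_pair), 2))  # 0.65 = .21 * 3.1
print(round(difference(*income_pair), 2))   # 0.7  = .00014 * 5000
```

Because the other predictors are held constant within each pair, the intercept and the unchanged terms cancel, leaving only slope times change in X.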

Comment: We cannot easily graph the relationship anymore because this model involves four dimensions (Y and three X variables).

Checking the Assumptions:

Checking the assumptions is similar to what was done in simple linear regression. The only change is that the assumptions need to be checked in each of the X directions. Unfortunately, JMP does not automatically create all of the necessary plots.

Plots needed to check assumptions:

1. Plot of Predicted Values versus the Residuals (this one is provided near the bottom of the Fit Model output).
2. Plot of each X variable versus the Residuals (save the residuals into your data set and make a scatterplot using Graph > Overlay). When finished you should have a plot for each X variable.
3. A histogram of the Residuals to check the normality assumption.
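For readers who want to see what goes into these plots, here is a minimal Python sketch. The data for five hypothetical states are made up (they are NOT the handout's data), the names are hypothetical, and the "histogram" is just a text tally standing in for the one JMP draws.

```python
import math

def residuals(observed, predicted):
    """Residual = observed Y minus predicted Y, one per state."""
    return [y - y_hat for y, y_hat in zip(observed, predicted)]

def bin_counts(values, bin_width=0.5):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = {}
    for v in values:
        start = math.floor(v / bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# Made-up numbers for five hypothetical states (NOT the handout's data).
states = ["S1", "S2", "S3", "S4", "S5"]
observed = [6.2, 7.9, 5.1, 6.8, 7.0]
predicted = [6.5, 7.1, 5.3, 6.6, 7.4]
binge = [15.0, 21.0, 13.5, 17.2, 19.0]   # one of the X variables

res = residuals(observed, predicted)

# Plots 1 and 2: the (x, y) pairs that would go on each scatterplot.
predicted_vs_resid = list(zip(predicted, res))
binge_vs_resid = list(zip(binge, res))

# Plot 3: a crude text histogram of the residuals.
for start, count in bin_counts(res).items():
    print(f"{start:5.1f} to {start + 0.5:4.1f} | " + "*" * count)

# States below (negative residual) and above (positive residual) the fit.
lower = [s for s, r in zip(states, res) if r < 0]
higher = [s for s, r in zip(states, res) if r > 0]
print("lower than expected:", lower)
print("higher than expected:", higher)
```

The same residuals feed all three kinds of plot, which is why saving them as a column in the data set is the first step.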

Checking assumptions for our example:

The plot of the Predicted/Fitted Values versus Residuals is automatically given.

From the Response % Illicit_Drug_Use pull-down menu select Save Columns > Predicted Values and Residuals. We will need the residuals in particular for constructing the plots described below.

Creating the remaining plots:

Select Graph > Overlay and place the Residuals in the Y box and each of the X variables in the X box.

The resulting plot

The residuals plots for the other two variables remaining in the "final" model:

Don't forget about the histogram of the residuals to check the normality assumption.

The list (assumptions 1-4 must be checked on each scatter plot):

1) Model Appropriate:

2) Constant Variance:

3) Independence:

4) Identify Outliers:
5) Normality Assumption (histogram or normal quantile plot):

Some final comparisons:

States for which % Illicit Drug Use is lower than expected (negative residuals):

States for which % Illicit Drug Use is higher than expected (positive residuals):

Additional Notes:
