
PRR389 SPSS & Statistics Page

Survey data analysis with SPSS 10.0

We will practice data analysis with the dataset from the 1996 Huron-Clinton Metro-parks visitor survey.

To prepare for lab: (November 11)

1. Read the brief description of HCMA survey methods (page 2).

2. Study questionnaire (handed out in class) and the codebook (end of this handout) to become familiar with questions asked and how variables are coded in the computer data file.

3. Review basic statistical procedures (pages 3-8, particularly 4 and 5)

4. Go to the lab and walk through portions of the SPSS Tutorial, starting at the beginning

5. Skim SPSS Procedure summary (pages 9-15)

6. In lab we will walk you through the practice exercises (Page 13)

TIP: This exercise requires that you have some familiarity with the HCMA survey, some knowledge of basic statistics, and some familiarity with SPSS procedures. Thinking about management, planning, or policy questions that suggest particular analyses of this dataset is also helpful. DON’T COME TO MICRO-LAB “COLD”: look at the questionnaire, attend class on Thursday, and review these materials first.

We will return to micro-lab on NOV 26 to provide individual help. You should first try to complete the exercise yourself.


1996 Huron Clinton Metroparks User Survey

BACKGROUND. The Huron Clinton Metropolitan Parks Authority (HCMA) manages a system of 13 parks in southeast Michigan. As part of HCMA’s continuing effort to meet the needs of people of Southeast Michigan, a user survey was conducted during 1995-96. The results of the survey will be used to update HCMA’s 5 year plan.

OBJECTIVES

1. Describe characteristics and patterns of use of HCMA park users

2. Identify trends in user characteristics and patterns via comparisons with previous surveys.

3. Identify and profile managerially relevant market segments

4. Evaluate visitor satisfaction with HCMA parks and measure visitor preferences for new facilities and programs.

METHODS: A self-administered survey of HCMA visitors was conducted between Dec 1, 1995 and November 30, 1996. Four-page questionnaires were distributed to a sample of visitors in vehicles entering one of the 13 HCMA units during this period.

The sample was stratified by park, season, weekend-weekday and time of arrival at the park. Sampling was disproportionate across these strata to assure an adequate size sample for each park and season. Weights adjust the sample to the actual distribution of visits in 1995-96. Each park distributed questionnaires on 10-12 dates during each season. Dates were uniformly distributed throughout each season and divided evenly between weekends and weekdays. Gate attendants distributed surveys to each vehicle entering the park during the first 5 minutes of each hour on sampling dates. During busy periods surveys were given to every other vehicle and during slow periods sampling was conducted for the first 10 minutes of each hour. Visitors could return surveys at drop boxes located at each park exit or by return mail.

The four page questionnaire was developed from the 1990 HCMA survey instrument. Questions cover party characteristics, use of daily vs annual permits, activities in the park, importance & satisfaction with park attributes (for an I-P analysis), knowledge and use of HCMA parks, preferences for new programs and facilities, and a set of household characteristics.

You will be analyzing data covering winter, spring, summer, and fall seasons. A total of 4,031 surveys were completed over this period (overall response rate of 42%). Surveys by park range from 815 at Kensington to just over 80 at some of the more lightly used parks.

SUGGESTED ANALYSIS

1. Use descriptive statistics to profile parks users - some sample results at website (hcma study).

2. Compare two or more subgroups (maybe market segments defined by age, income, use of annual or daily permit, etc.). Develop segments by classifying visitors into useful subgroups and then describing important differences between the subgroups. Example - see the activity segment table at the hcma results website (link is in (1)).

3. Test for a relationship between two variables using CROSSTABS (Chi square) or COMPARE MEANS (T/F-Test)

We will be using SPSS-PC to analyze this survey. The data file HCMA96.SAV is a specially coded SPSS data file that can be retrieved from within SPSS. It is available in course AFS space.
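If you prefer a syntax window to the menus, a minimal sketch for opening and inspecting the file is below; adjust the path to wherever you copied HCMA96.SAV from the course AFS space.

* Open the HCMA survey data file (substitute the full path to your copy).
GET FILE='HCMA96.SAV'.
* List the variable names and labels to get oriented.
DISPLAY DICTIONARY.
* To apply the survey weights, use WEIGHT BY with the weight variable named in the codebook.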

STATISTICS - SUMMARY

1. Functions of statistics

a. description: summarize a set of data

b. inference: make generalizations from a sample to a population (parameter estimates, hypothesis tests).

2. Types of statistics

i. Descriptive statistics: describe a set of data

a. frequency distribution - SPSS Frequency

b. central tendency: mean, median (order statistics), mode. SPSS - Descriptives

c. dispersion: range, variance & standard deviation in Descriptives

d. Others: shape (skewness, kurtosis).

e. EDA procedures (exploratory data analysis): SPSS Explore

Stem & leaf display: ordered array, freq distrib. & histogram all in one.

Box and Whisker plot: the five number summary (min., Q1, median, Q3, and max.).

Resistant statistics: trimmed and winsorized means, midhinge, interquartile deviation.

ii. Inferential statistics: make inferences from samples to populations.

a. Parameter estimation – compute confidence intervals around population parameters

b. Hypothesis testing - test relationships between variables

iii. Parametric vs nonparametric statistics

a. parametric: assume interval scale measurements and normally distributed variables.

b. nonparametric (distribution-free statistics): generally weaker assumptions; ordinal or nominal measurements, and no assumption about the exact form of the distribution.

3. General rules for interpreting hypothesis tests.

i. You test a NULL hypothesis - The NULL hypothesis is a statement of NO relationship between the two variables (e.g., means are the same for different subgroups, correlation is zero, no relationship between row and column variable in a crosstab table).

a. Pearson Correlation: r_xy = 0

b. T-Test: mean of x = mean of y

c. One Way ANOVA: M1 = M2 = M3 = ... = Mn

d. Chi square : No relationship between X and Y. Formally, this is captured by the "expected table", which assumes cells in the X-Y table can be generated completely from row and column totals.

ii. TESTS are conducted at a given "confidence level" - the most common is a 95% level. At this level there is a 5% chance of incorrectly rejecting the null hypothesis when it is true. For a stricter test, use the 99% confidence level and look for SIGs < .01; for a weaker test, use 90% and SIGs < .10.

iii. On computer output, look for the SIGnificance or PROBability associated with the test. The F, T, Chi-square, etc. are the actual "test statistics", but the SIGs are what you need to complete the test. SIG gives the probability that you could get results like those you see from a random sample of this size IF there were no relationship between the two variables in the population from which it is drawn. If the probability is small (< .05), you REJECT the assumption of no relationship (the null hypothesis).

For 95% level, you REJECT null hypothesis if SIG <.05

If SIG > .05 you FAIL TO REJECT

REJECTING THE NULL HYPOTHESIS means the data suggest that there is a relationship.

iv. Hypothesis tests are assessing if one can generalize from information in the sample to draw conclusions about relationships in the population. With very small samples most null hypotheses cannot be rejected while with very large samples almost any hypothesized relationship will be "statistically significant" - even when not practically significant. Be cognizant of sample size (N) when making tests.

Type I error: rejecting the null hypothesis when it is true. Prob of a Type I error = 1 - confidence level.

Type II error: failing to reject the null hypothesis when it is false. Power of a test = 1 - prob of a Type II error.

DESCRIPTIVE STATISTICS

As the name implies, these are used to describe characteristics of the sample or the population it is intended to represent.

Begin by describing variables one at a time (univariate statistics). There are two basic procedures for this:

FREQUENCIES

If variable is nominal, or ordinal with a small number of categories/levels, use SPSS FREQUENCIES procedure.

This will produce a table giving the number and percentage of cases that gave each of the possible responses.
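If you want to run this from a syntax window instead of the menus, a one-command sketch is below; it assumes the income variable is named INCOME, as in the output that follows.

* Frequency table for household income.
FREQUENCIES VARIABLES=income.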

Here is a sample SPSS output table from a FREQUENCIES run on the INCOME variable. Check the questionnaire to see that income was measured in four categories plus a “choose not to answer” response. The codebook or the Variable View page in the SPSS file will show that the variable was coded 1-4 for the four income groups and 5 for the “Choose not to answer” response.

TOTAL HOUSEHOLD INCOME BEFORE TAXES

                                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid     UNDER $25,000                  112      10.4            14.0                 14.0
          $25,000 TO $49,999             267      25.0            33.5                 47.5
          $50,000 TO $74,999             227      21.2            28.4                 75.9
          $75,000 OR MORE                193      18.0            24.1                100.0
          Total                          798      74.5           100.0
Missing   CHOOSE NOT TO ANSWER           182      17.0
          System                          90       8.4
          Total                          273      25.5
Total                                   1071     100.0

The five possible responses are the rows. Notice the response categories (values) are labeled (“UNDER $25,000”, etc.).

  • Frequency = number of cases selecting this response
  • Percent = percentage this is of all cases
  • Valid Percent = percentage of “non-missing” cases. Here the “Choose not to answer” response is treated as missing, as are the “system missing” cases that left this question blank.
  • Cumulative Percent = running total (not always useful or relevant)

Generally, you want to report the Valid Percent as your best estimate of the percentages of all visitors (in the population) in each income group. Raw counts are largely a function of sample size and not that useful.

DESCRIPTIVES

For interval or ratio scale variables, you usually want to compute means and standard deviations rather than frequencies.

Here’s a table from running the DESCRIPTIVES procedure on the age variable. Age was measured on an interval scale.

                       N   Minimum   Maximum    Mean   Std. Error   Std. Deviation
AGE OF SUBJECT       925        16        86   43.29          .48            14.52
Valid N (listwise)   925

In this case the average age was 43; the lowest age in the sample was 16 and the highest was 86. The average is based on the 925 cases that answered this question. The standard deviation indicates the “spread” of ages in the sample. You may compute a 95% confidence interval for the estimate of average age by computing the standard error = standard deviation / sqrt(n). In this example SE = 14.52/sqrt(925) = .48. A 95% confidence interval is two standard errors either side of the mean = (43.29 plus or minus 2 × .48), or roughly (42, 44). SPSS computes the SE for you.
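A syntax sketch that would produce a table like the one above (assuming the variable is named AGE; the standard error is requested explicitly because it is not part of the default output):

* Descriptive statistics for age, including the standard error of the mean.
DESCRIPTIVES VARIABLES=age
  /STATISTICS=MEAN SEMEAN STDDEV MIN MAX.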

Guidance to Statistical Tests - Hypothesis Tests

Testing hypotheses is a little more complicated, but give it a try. Here we want to test for a relationship between two or more variables. Again, which procedure to use depends on the measurement scales of the variables.

CROSSTABULATIONS – The CROSSTABS procedure – This is simply the bivariate version of FREQUENCIES. Run this when you have two variables that are nominal scale or have small number of categories/levels. (Nominal x Nominal)

SPSS produces the bivariate distribution in the sample and a Chi Square statistic, which tests the null hypothesis of no relationship between the two variables. This is analogous to using a Pivot Table in Excel. The Chi square test needs an expected count of at least 5 cases in each cell of your table, so don’t run this with variables that have too many categories (recode if necessary to collapse categories). The Pearson Chi Square statistic in SPSS provides a test of whether or not the two variables are related.

Example of Crosstabs with HCMA data

Examine the relationship between age and income – CROSSTABS AGE2 BY INCOME (note AGE2 puts age into a small set of categories).

Compare activity participation or attitudes of men and women – CROSSTABS of GENDER with one of the activity or attitude variables.

To get the Chi Square test along with the table, select the Statistics button and check Chi square. Look for significance levels smaller than .05 to reject the null hypothesis of no relationship at the 95% confidence level. If SIG > .05, the sample doesn’t provide enough evidence to conclude there is a relationship in the full population.
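A syntax sketch of the age-by-income run (AGE2, INCOME, and GENDER are variable names mentioned above; the recode at the end is only a hypothetical illustration of collapsing categories, not actual cut points from the survey):

* Crosstab of age category by income, with the Chi square test and row percentages.
CROSSTABS /TABLES=age2 BY income
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW.
* A gender-by-activity table works the same way, with GENDER and an activity variable in TABLES.
* Hypothetical example of collapsing a variable with too many categories.
RECODE age (LO THRU 34=1) (35 THRU 54=2) (55 THRU HI=3) INTO age3.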

COMPARING SUBGROUP MEANS

Another common bivariate analysis is to compare means on an interval scale variable across two or more population subgroups. In this case you want an interval scale dependent variable (the one you compute means for) and a nominal scale independent variable (the one that forms the groups). (Nominal x Interval)

SPSS has several different procedures for comparing means. It will suffice to use the MEANS procedure. Put the interval scale variable in the dependent variable box and the variable for forming subgroups in the independent variable box. To get a hypothesis test, select the Options button, check the “Anova table and eta” box at the bottom, then CONTINUE.
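A syntax sketch of the MEANS approach, using AGE as a hypothetical choice of dependent variable and INCOME as the grouping variable (substitute whatever pair you are actually testing):

* Mean age by income group, with a one-way ANOVA test of the difference in means.
MEANS TABLES=age BY income
  /CELLS=MEAN COUNT STDDEV
  /STATISTICS=ANOVA.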

CORRELATIONS - Interval by Interval

Pearson Correlation: Run CORRELATION procedure to get the correlation coefficient between the two variables AND a test of null hypothesis that the correlation in population is zero. Be sure you understand distinction here between the measure of association between the two variables in the sample (correlation coefficient) and the test of hypothesis that correlation is zero (making inference to the population).

Regression is the multivariate extension of correlation. A linear relationship between a dependent variable and several independent variables is estimated. t-statistics for each regression coefficient test for a relationship between X and Y while controlling for the other independent variables. Standardized regression coefficients (betas) indicate the relative importance of each independent variable. The R square statistic (use adjusted R square) measures the amount of variation in Y explained by the X’s.
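Syntax sketches for both procedures follow; AGE and INCOME are variables mentioned above, while VISITS is only a hypothetical stand-in for an interval scale dependent variable of your choosing. (The syntax command for the correlation procedure is CORRELATIONS.)

* Pearson correlation, with a test that the population correlation is zero.
CORRELATIONS /VARIABLES=age income.
* Regression of a hypothetical dependent variable on age and income.
REGRESSION /DEPENDENT=visits
  /METHOD=ENTER age income.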

EXAMPLES OF T-TEST/ANOVA AND CHI SQUARE

The Independent Samples T-TEST tests for differences in means (or percentages) across two subgroups. ANOVA is simply the extension to more than two groups and uses the F statistic. The null hypothesis with two groups is that the mean of group 1 = mean of group 2. This test assumes an interval scale measure of the dependent variable (the one you compute means for) and that the distribution in the population is normal. The generalization to more than two groups is called a one way analysis of variance (ANOVA) and the null hypothesis is that all the subgroup means are identical. These are parametric statistics since they assume interval scale and normality.

In SPSS use Compare means, several options as follows:

Means: Compare subgroup means; choose Options and check “Anova table and eta” for the statistical test

One Sample T-Test: Test H0: Mean of variable = some constant

Indep. Samples T-Test: Two groups; test H0: Mean for group 1 = Mean for group 2

Paired Samples T-Test: Paired variables; applies in a pre-test, post-test situation

One Way ANOVA: Compare means for more than two groups
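Corresponding syntax sketches for two of these (the variable choices and the GENDER group codes 1 and 2 are assumptions; check the codebook for the actual coding):

* Independent samples t-test: compare mean age of men and women.
T-TEST GROUPS=gender(1,2) /VARIABLES=age.
* One-way ANOVA: compare mean age across the income groups.
ONEWAY age BY income.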

Chi square is a nonparametric statistic to test if there is a relationship in a contingency table, i.e. Is the row variable related to the column variable? Is there any discernible pattern in the table? Can we predict the column variable Y if we know the row variable X?

The Chi square statistic is calculated by comparing the observed table from the sample, with an "expected" table derived under the null hypothesis of no relationship. If Fo denotes a cell in the observed table and Fe a corresponding cell in expected table, then

Chi square ( 2 ) = (Fo -Fe)2/Fe

cells

The cells in the expected table are computed from the row (nr) and column (nc) totals for the sample as follows:

Fe = (nr × nc) / n, where n is the total sample size.

CHI SQUARE TEST EXAMPLE: Suppose a sample (n=100) from student population yields the following observed table of frequencies:

                      GENDER
               Male   Female   Total
IM-USE   Yes     20       40      60
         No      30       10      40
         Total   50       50     100

EXPECTED TABLE UNDER NULL HYPOTHESIS (NO RELATIONSHIP)

                      GENDER
               Male   Female   Total
IM-USE   Yes     30       30      60
         No      20       20      40
         Total   50       50     100

2 = (20-30)2/30 + (40-30)2/30 + (30-20)2/20 + (10-20)2/20

100/30 + 100/30 + 100/20 +100/20 = 13.67

Chi square tables report the probability of getting a Chi square value this high for a particular random sample, given that there is no relationship in the population. If doing the test by hand, you would look up the probability in a table. There are different Chi square tables depending on the size of the table. Determine the degrees of freedom for the table as (rows - 1) × (columns - 1); in this case it is (2-1)*(2-1) = 1. The probability of obtaining a Chi square of 16.67 given no relationship is less than .001. (The last entry in my table gives 10.83 as the chi square value corresponding to a probability of .001, so 16.67 would have a smaller probability.)
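If you want the exact probability instead of a table lookup, SPSS's CDF.CHISQ function will compute it; this sketch adds a variable to whatever data file is open (the variable name P is arbitrary):

* Upper-tail probability of a Chi square of 16.67 with 1 degree of freedom.
COMPUTE p = 1 - CDF.CHISQ(16.67, 1).
EXECUTE.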

If using a computer package, it will normally report both the Chi square and the probability or significance level corresponding to this value. In testing your null hypothesis, REJECT if the reported probability is less than .05 (or whatever confidence level you have chosen). FAIL TO REJECT if the probability is greater than .05.

REVIEW OF STEPS IN HYPOTHESIS TESTING: For the above example :

(1) Nominal level variables, so we used Chi square.

(2) State null hypothesis. No relationship between gender and IM-USE

(3) Choose confidence level. 95%, so alpha = .05; the critical region is χ² > 3.84.

(4) Draw the sample and calculate the statistic: χ² = 16.67.

(5) 16.67 > 3.84, so it falls inside the critical region: REJECT the null hypothesis. Alternatively, SIG < .001 on the computer printout; since this is less than .05, REJECT the null hypothesis. Note we could have rejected the null hypothesis at the .001 level here.

WHAT HAVE WE DONE? We have used probability theory to determine the likelihood of obtaining a contingency table with a Chi square of 16.67 or greater, given that there is no relationship between gender and IM-USE. If there is no relationship (the null hypothesis is true), obtaining a table that deviates as much as the observed table does from the expected table would be very rare - a chance of less than one in 1,000. We therefore assume we didn't happen to get this rare sample, but instead that our null hypothesis must be false. Thus we conclude there is a relationship between gender and IM-USE.

The test doesn't tell us what the relationship is, but we can inspect the observed table to find out. Calculate row or column percents and inspect these. For row percents divide each entry on a row by the row total.

Row percents:

                      GENDER
               Male   Female   Total
IM-USE   Yes    .33      .67    1.00
         No     .75      .25    1.00
         Total  .50      .50    1.00

To find the "pattern" in the table, compare the row percents for each row with the "Total" row at the bottom. Thus, half of the sample are men, whereas only a third of IM users are male and three quarters of nonusers are male. Conclusion: men are less likely to use IM than women.

------

Column Percents: Divide entries in each column by column total.