PY1PR2 Workshop 1: Crosstabs and Chi-square using SPSS (Statistical Package for the Social Sciences)
Today’s workshop will cover the following techniques:
- Testing the null hypothesis that the number of members in each of a set of categories is equal using one variable chi-square
- One variable chi-square with an unequal null hypothesis
- Two variable chi-square (3 * 2 contingency table)
- Finding out where the significant difference in a 3 * 2 contingency table lies by partitioning into a set of 2 * 2 contingency tables
- This will require the use of “select cases” and “relational operators”, which are generally useful techniques to familiarize yourselves with
Chapter 18 of the course textbook by Andy Field covers chi-square, and you can also find information at
If you get stuck during the workshop, ask one of the demonstrators for help.
STEP 1: Finding the files you need
Click on Start, and click on My Computer.
Look for the K: drive, and double-click on it.
Find the Pracs folder and double-click on it (folders are arranged alphabetically).
In the Pracs folder, find the Workshop_PY1PR2_1_SPSS folder and right click on it. From the menu that appears select Send to and then My documents.
You should now have a copy of the folder in your My documents folder in the M: drive. You can access My documents by clicking Start, then clicking My documents. There is also a My Documents icon on the desktop.
STEP 2: Starting SPSS
Start SPSS by clicking on the Start menu at the bottom left of the screen, then select All Programs, SPSS for Windows, and click SPSS 15.0 for Windows.
This opens the SPSS Data Editor window. An SPSS for Windows dialogue box may also appear, giving you the option to open a file that has been used recently. As the file we are using has not been used recently, it is easiest to close this dialogue box: click Cancel.
STEP 3: Opening an SPSS data file
From the main menu, choose File / Open / Data.
All data files of the expected type (*.sav) that are found in your current directory are listed in the dialogue box. SPSS should automatically look in My Documents, where you will find the Workshop_PY1PR2_1_SPSS folder that you copied. Open that folder, click on tennis_handedness.sav, and then click Open. If SPSS is not looking in My Documents, click on the arrow at the right hand side of the Look in: box and select My Documents.
The file contains some imaginary data on the handedness of tennis players. There are two variables, a player identifier, and a categorical variable called handedness, which has a value of 1 if the player is right handed and 2 if the player is left handed. We want to know if the proportion of left handed tennis players is higher than the proportion of people who are left handed in the general population.
STEP 4: Performing a one variable chi square to test the null hypothesis that the number of left handed tennis players in the sample is equal to the number of right handed tennis players. (Yes, this is implausible, but it’s just for practice.)
To obtain the dialogue box for a one variable chi-square, use Analyze -> Nonparametric Tests -> Chi-Square…
Send the variable handedness across into the Test Variable List box and click OK. Now answer the following questions:
QUESTION 1: How many tennis players are in the sample[s1]? How many of those are left handed?
______
QUESTION 2: What’s the value of the chi-square statistic and can the null hypothesis be rejected[s2]? ______
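If you want to check the SPSS output by hand, the calculation behind this test can be sketched in a few lines of Python. This is only an illustration, not part of the SPSS procedure: the function name chi_square_equal is made up for this example, and the counts (46 right handed, 10 left handed) are the ones given in the notes at the end of this handout.

```python
def chi_square_equal(observed):
    """One variable chi-square against a null of equal counts in every category."""
    n = sum(observed)
    expected = n / len(observed)      # the same expected count for each category
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    df = len(observed) - 1            # number of categories minus one
    return statistic, df

# Counts assumed from the notes: category 1 = right handed, category 2 = left handed
statistic, df = chi_square_equal([46, 10])
print(f"chi-square = {statistic:.2f}, df = {df}")
```

Each category contributes (observed - expected) squared, divided by expected, so categories far from their expected count dominate the statistic.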
STEP 5: Performing a one variable chi square to test the more realistic null hypothesis that the proportion of left handed tennis players in the sample is 10%, which is approximately the proportion in the general population. The alternative hypothesis is that left handers will be over represented because they have a slight advantage when playing tennis.
Bring up the same dialogue box as before, but before clicking OK you have to tell SPSS what the null hypothesis is. This is not quite as simple as telling SPSS the percentages you expect to find in each category. You have to work out the actual frequencies based upon your sample sizes and the percentages you expect in each category under the null hypothesis.
For category 1 (right handed) we expect 90% of the sample of 56 people to be right handed if the null hypothesis is true. 90% of 56 is 50.4 (note that decimal places are acceptable for the expected frequency even if they are not realistic).
For category 2 (left handed) we expect 10% of the sample of 56 people to be left handed if the null hypothesis is true. 10% of 56 is 5.6.
It’s a good idea to check that your frequencies add up to the number of people in the sample: 50.4 + 5.6 = 56.
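The arithmetic above can also be written as a short Python sketch, which is handy if you later have more than two categories. The variable names are illustrative only; the sample size and percentages are the ones used in this step.

```python
import math

n = 56                                     # sample size
null_proportions = {1: 0.90, 2: 0.10}      # category code -> proportion under the null

# Expected frequency = sample size x proportion, listed in ascending category order
expected = {code: n * p for code, p in sorted(null_proportions.items())}

# Sanity check: the expected frequencies must add up to the sample size
assert math.isclose(sum(expected.values()), n)

for code, freq in expected.items():
    print(f"category {code}: expected frequency {freq:.1f}")
```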
In the Expected Values area of the dialogue box tick the Values radio button. Then enter 50.4 and click Add. Then enter 5.6 and click Add. Note that SPSS expects you to add the frequencies expected under the null hypothesis according to the numbers you have used for category labels (in ascending order).
Click OK and then answer the following question:
QUESTION 3: [s3] What is the value of chi-square? What is the probability of obtaining a proportion of left handers at least as far from 10% as the one in this sample of tennis players by random sampling from a population in which the proportion of left-handers is 10%?
______
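As an optional cross-check, the sketch below computes the same statistic from the expected frequencies you entered and converts it to a p value. It assumes the observed counts from the notes (46 right handed, 10 left handed), and the helper names are hypothetical. The p value formula used here is only valid for 1 degree of freedom, which is what you have with two categories.

```python
import math

def chi_square_gof(observed, expected):
    """Chi-square statistic for observed counts against given expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))

observed = [46, 10]        # assumed counts: right handed, left handed
expected = [50.4, 5.6]     # 90% and 10% of 56, in ascending category order

statistic = chi_square_gof(observed, expected)
p = p_value_df1(statistic)
print(f"chi-square = {statistic:.2f}, p = {p:.3f}")
```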
STEP 6: Open the data file facebook_age.sav. This contains imaginary data on two variables for 213 people who responded to a questionnaire on facebook use. A value of 0 on the facebook variable indicates that the participant does not use facebook, while a score of 1 indicates that they do – scroll down to see some scores of 1. The age variable divides age into 3 categories (1 = 18-24, 2 = 25-34, 3 = 35+).
The researchers who gathered this data were interested in whether facebook use is more common amongst young people than older people.
QUESTION 4: [s4] What is the null hypothesis? ______
Before running the chi-square test, create a table of the contingency between the two variables using crosstabs, which can be obtained from Analyze -> Descriptive Statistics -> Crosstabs… Send the variable facebook into the Rows box and the variable age into the Columns box. Click on OK. This will produce a crosstabulation of the frequencies in each of the six cells (2 facebook categories by 3 age categories).
QUESTION 5: What is the most frequently occurring age category in the sample? [s5]______
Now ask SPSS to calculate the frequencies that would be expected to occur if the null hypothesis was true. Bring up the dialogue box again and click on the Cells button. Tick the box asking for SPSS to display the Expected values as well as the Observed values. Click on Continue and then OK.
QUESTION 6: Why is the expected count under the null hypothesis for over 35’s using facebook higher than the expected count for 18-24’s using facebook[s6]? ______
QUESTION 7: Which cell of the table has the biggest discrepancy between the expected value and the obtained value[s7]? ______
NB: you should always check the expected values to make sure they are all greater than 5. If any expected values are less than 5 chi-square is not reliable.
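To see where the expected values come from, here is a minimal Python sketch. Each expected count is the row total multiplied by the column total, divided by the overall sample size. The observed counts are the ones listed for the weight cases exercise in Step 8; the dictionary layout and variable names are just one way of writing it.

```python
# Observed counts: facebook use (rows) by age category (columns)
observed = {
    "yes": {"18-24": 37, "25-34": 44, "35+": 26},
    "no":  {"18-24": 22, "25-34": 45, "35+": 39},
}

row_totals = {row: sum(cols.values()) for row, cols in observed.items()}
col_totals = {}
for cols in observed.values():
    for age, count in cols.items():
        col_totals[age] = col_totals.get(age, 0) + count
n = sum(row_totals.values())

# Expected count under independence = row total x column total / N
expected = {
    row: {age: row_totals[row] * col_totals[age] / n for age in col_totals}
    for row in observed
}

# Rule of thumb from this workshop: every expected value should be greater than 5
all_ok = all(e > 5 for cols in expected.values() for e in cols.values())

for row, cols in expected.items():
    print(row, {age: round(e, 2) for age, e in cols.items()})
print("All expected values greater than 5:", all_ok)
```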
To perform the chi-square test of the null hypothesis, bring up the same dialogue box, click on the Statistics button and tick the Chi-square box. Now click on Continue and OK. (Note: the fact that the one variable and the two variable chi-square tests are located in entirely different sections of the interface is one of the many annoying features of SPSS…)
Now use the SPSS output to fill in the blanks in the following sentence:
The chi square value of [s8] _____, df = ___ had an associated p value of ____, so the null hypothesis that facebook use and the three age categories are unrelated can be rejected. This significant result was caused by…
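If you would like to confirm the values in the SPSS output, the Pearson chi-square for the whole 3 * 2 table can be sketched in Python as follows. The counts are the observed frequencies listed in Step 8. The shortcut used for the p value, exp(-chi-square / 2), is exact only when df = 2, which happens to be the case for this table; it is not a general formula.

```python
import math

# Rows: facebook yes / no. Columns: age 18-24, 25-34, 35+
observed = [
    [37, 44, 26],
    [22, 45, 39],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        statistic += (count - expected) ** 2 / expected

# Degrees of freedom = (rows - 1) x (columns - 1)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Upper-tail probability; this closed form holds only for df = 2
p = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p:.3f}")
```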
Because there are 3 age categories the significant result leaves some ambiguity as to where the difference lies. It might be that the young age group differs from the old age group, or perhaps each group is significantly different from each of the others.
To resolve this ambiguity, you need to partition the overall contingency table into a set of three 2 * 2 chi-square tests, in which you compare each of the pairs of age groups.
The practical problem is how to make SPSS compare only two levels of the age variable at a time given that you have 3 in the data file. One solution would be to create 3 separate smaller data files. A more flexible solution is to use the Select Cases dialogue box. Obtain this dialogue box now using Data -> Select Cases… Select the radio button labeled “If condition is satisfied” and then click on the If… button.
This dialogue box allows a subset of the cases (rows) in the data file to be selected based on almost any criteria you can think of, no matter how complex. In our case we have fairly simple criteria, and we can use the “relational operator” buttons to implement them.
You will need to send the age variable across into the box on the right because this is the variable that we want to use as the basis of selecting the cases. Using age in combination with one or more of the relational operators listed below will allow you to select the cases you want. (Note: there is more than one way of doing this, but some ways are simpler than others.)
First, try to select only those cases where the age category has a value of 1 or 2.[s9] You will need to click Continue and OK to see the results of your selection. Excluded cases have a diagonal line through them on the left of the data editor. You can always reinstate excluded cases by selecting the All Cases radio button. Ask a demonstrator for help if you are having difficulty.
> means greater than
< means less than
= means equal to
~= means not equal to
& means AND
| means OR
>= means greater than or equal to
<= means less than or equal to
~ means not
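For comparison, the same kind of case selection can be sketched in Python, where the operators are spelled slightly differently (== for equal to, != for not equal to, and / or / not for &, | and ~). The rows below are made-up cases for illustration, not the contents of facebook_age.sav, and the three conditions are alternative ways of keeping age categories 1 and 2.

```python
# Hypothetical cases: each row has an id and an age category (1, 2 or 3)
rows = [
    {"id": 1, "age": 1},
    {"id": 2, "age": 3},
    {"id": 3, "age": 2},
    {"id": 4, "age": 3},
    {"id": 5, "age": 1},
]

# SPSS condition: age = 1 | age = 2
keep_a = [r for r in rows if r["age"] == 1 or r["age"] == 2]

# SPSS condition: age ~= 3
keep_b = [r for r in rows if r["age"] != 3]

# SPSS condition: age <= 2
keep_c = [r for r in rows if r["age"] <= 2]

print([r["id"] for r in keep_a])
```

All three conditions select the same cases here because age only takes the values 1, 2 and 3; with other codings they would not necessarily agree.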
STEP 7: Repeat the chi-square now that you have only age categories 1 and 2 selected in the data file. Then return to the case selection dialogue box and select a different pair of age categories. Repeat the selection and chi-square process a total of 3 times[s10], and then fill in the blanks in the following sentence, which is a continuation of the sentence you completed in Step 6. Remember from the lecture that the conventional alpha level of 0.05 has to be divided by the number of tests you perform (a Bonferroni correction), and the resulting lower alpha level is used as the cut off for determining if each of the 2 * 2 chi-square tests is significant.
This significant result was caused by the difference in facebook use between [s11] ______ and [s12] ______. The chi square value of [s13] _____, df [s14] = ___ had an associated p value of [s15] ____, which is lower than the multiple comparison corrected alpha level of [s16] ______.
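The partitioning procedure can be sketched in Python as a loop over the three pairs of age groups. This is an illustrative cross-check rather than a replacement for the SPSS steps: the counts are the observed frequencies from Step 8, the function name is made up, and the statistic is the plain Pearson chi-square (SPSS also prints a continuity-corrected value for 2 * 2 tables, which this sketch does not compute).

```python
import math
from itertools import combinations

# Observed counts per age group: (facebook users, non-users)
counts = {"18-24": (37, 22), "25-34": (44, 45), "35+": (26, 39)}

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2 * 2 table [[a, b], [c, d]] (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

pairs = list(combinations(counts, 2))
alpha = 0.05 / len(pairs)          # corrected alpha: 0.05 divided by the number of tests

results = {}
for group1, group2 in pairs:
    yes1, no1 = counts[group1]
    yes2, no2 = counts[group2]
    statistic = chi_square_2x2(yes1, yes2, no1, no2)
    p = math.erfc(math.sqrt(statistic / 2))   # upper-tail p value, valid for df = 1
    results[(group1, group2)] = (statistic, p, p < alpha)

for pair, (statistic, p, significant) in results.items():
    print(pair, f"chi-square = {statistic:.2f}, p = {p:.3f}, significant = {significant}")
```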
STEP 8: How to enter data into SPSS for chi-square analysis. In the facebook_age.sav data file each participant is represented by a single row, in the same way as the data files you used last term for performing t tests. When all the variables you are interested in are categorical ones, to be analyzed using cross tabulation and chi square, there is an efficient shortcut that avoids the error-prone procedure of entering many rows of data. You will need to use this shortcut in one of your laboratory practical classes later this term, so let’s practice it now.
On the File menu, choose New, then Data. You are going to reproduce the facebook_age data using the shortcut method called weight cases. This will allow you to enter observed frequencies directly into SPSS. To use the weight cases method you will need one row in the data file to represent each combination of categories. There were 6 categories in the facebook_age data file, so you will need 6 rows. But first, you have to create two variables to represent facebook use and age. Follow these steps:
1) Click on the Variable View tab.
2) Create one variable called facebook and another called age.
3) Change the Type of each variable to String (string means the values will be text instead of numbers).
4) Return to the Data View tab and enter 6 rows of data to represent the following 6 categories of people by writing the text shown in “” marks into the appropriate data cells:
- facebook = “yes”, age = “18-24”
- facebook = “yes”, age = “25-34”
- facebook = “yes”, age = “35+”
- facebook = “no”, age = “18-24”
- facebook = “no”, age = “25-34”
- facebook = “no”, age = “35+”
5) Now go back to the Variable View tab and create a third variable called Frequencies. You can leave this one as the default numeric data type.
6) In the Data View you can now enter the observed frequencies for the 6 categories. To produce a data file that is equivalent to the one you worked with earlier, which contained a row for each individual participant, enter the following numbers into the Frequencies variable, keeping this order:
- 37
- 44
- 26
- 22
- 45
- 39
7) The numbers you have just entered are the observed frequencies, so SPSS won’t have to count them itself. Before you can calculate expected frequencies under the null hypothesis and a chi square statistic, you need to tell SPSS that you have entered observed frequencies directly. This is done via the Data menu, where you select Weight Cases. Select the radio button Weight cases by, and then send the variable Frequencies across.
8) Produce a cross tabulation including the frequencies that would be expected to occur if the probability of being a facebook user was independent of age, and a chi square statistic to test the adequacy of that null hypothesis model of the data. Hint: look back to questions 4 to 7 to remind yourself how to do this.
9) The output you produce should be identical to the output you produced earlier. Now alter the data, slightly increasing the number of people in the oldest age category who use facebook, and decreasing the number of young people who use facebook by the same amount. (This illustrates the main advantage of this method of data entry: it is quick and easy to modify the data file. The disadvantage is that you can’t use it if any of the variables you want to analyze are continuous rather than categorical.)
10) Repeat the chi square test and verify that the result is now non-significant[DTF17].
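To see why weighting works, the sketch below shows in Python that six weighted rows carry the same information as one row per participant. It is only an illustration of the idea behind Weight Cases, not of how SPSS implements it; the variable names are made up, and the frequencies are the ones you entered in step 6.

```python
from collections import Counter

# One row per combination of categories, plus its observed frequency
weighted_rows = [
    ("yes", "18-24", 37),
    ("yes", "25-34", 44),
    ("yes", "35+", 26),
    ("no", "18-24", 22),
    ("no", "25-34", 45),
    ("no", "35+", 39),
]

# Weighted crosstab: each row adds its frequency to its cell
crosstab = Counter()
for facebook, age, frequency in weighted_rows:
    crosstab[(facebook, age)] += frequency

# Equivalent long format: repeat each row once per participant it stands for
expanded = [
    (facebook, age)
    for facebook, age, frequency in weighted_rows
    for _ in range(frequency)
]

# Counting the long-format rows gives exactly the same table
assert Counter(expanded) == crosstab
print("Number of participants represented:", len(expanded))
```

Changing a frequency in weighted_rows is the equivalent of step 9: one number is edited instead of adding or deleting many individual rows.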
The above shortcut to enter the observed frequencies directly is also explained in chapter 18 of the Field textbook, section 18.5.2. You can also find an explanation at the following URL
[s1]56 in total, available in the output table or scrolling down in the data editor window. 10 are left handed.
[s2]23.14. Yes, it can be rejected. The tricky thing here is to remember that when SPSS prints .000 for a significance level it means the p value has been rounded to three decimal places, i.e. p < .0005 (usually reported as p < .001), which is obviously less than the conventional cut off of 0.05. But people get confused about this because it looks odd and in other cases SPSS gives the exact p value.
[s3]3.84. 0.05 (or 5%)
[s4]There will be no relationship between age and facebook use, or older people are just as likely to use facebook as younger people
[s5]25-34
[s6]Because over 35’s outnumber 18-24 year olds in this sample
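[s7]18-24 year old facebook users (the cell for 18-24 year olds who do not use facebook has a discrepancy of the same size in the opposite direction)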
[s8]See SPSS output on next page, which also contains the contingency table of observed and expected frequencies.
[s9]You can use age = 1 | age = 2. Alternatively, age ~= 3 will do the trick.
[s10]Age 1 versus 3, 1 versus 2, and 2 versus 3 are the necessary comparisons
[s11]18-24 year olds
[s12]Those over 35
[s13]6.38
[s14]1
[s15]0.012
[s16]0.0167 (0.05 / 3)
[DTF17]Before signing the students off, verify that in the Chi Square tests output table the N of valid cases is still 213 as before, and that the “Asymp. Sig 2 sided” is now greater than 0.05