USING SPSS FOR DATA ANALYSIS

Michael Shalev June 2008 (amended March 2009)

GETTING TO KNOW YOUR DATA

Before doing anything else you need to become familiar with all of the variables that you intend to analyze. For this purpose, use the procedures available under Descriptive Statistics (accessed from the Analyze menu).

Start by requesting Frequencies for the variables of interest and study the results carefully. For example, you may find that a variable has one or more categories that are nearly “empty”; decide whether to recode them as “Missing”, combine them with other categories, or leave them as they are.

PAY ATTENTION TO THE SCALE OF MEASUREMENT (סולם מדידה)!

Statistical procedures like correlation and regression require continuous variables (משתנים רציפים), measured on either an interval scale (סולם רווחים) or a ratio scale (סולם מנה). In practice, researchers often treat ordinal data (סולם סדר) as if they were interval, especially questions that ask people to express their attitudes, e.g. on a scale from 1 (“Strongly favor”) to 5 (“Strongly oppose”).

These types of data must be distinguished from categorical variables (סולם שמי) like sex or ethnic origin. Categorical variables can only be included in correlations or regressions if they are converted to dichotomous variables which have only two possible values: 1 or 0. (These are often called dummy variables.) For the variable SEX the solution is easy: recode females 1 and males 0 (or vice versa).

For variables with more than two categories, each of the categories except for one “reference category” (קבוצת התיחסות) must be converted to a dummy variable. It is not possible to convert a categorical variable to multiple dichotomous variables if it is going to be the dependent variable in a regression. (Why? Because there can only be one dependent variable in a regression, whereas we can have as many independent variables as we like.)

The regression coefficient of a dummy variable is interpreted as follows: it is the difference between the average value of the dependent variable for the category coded 1 and the average for the “reference category”, which is coded 0.
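The interpretation of a dummy-variable coefficient can be checked with simple arithmetic. The following is a minimal sketch in Python (not SPSS syntax). The six respondents and their values are invented purely for illustration.

```python
# Hypothetical toy data: (sex_dummy, outcome).
# 1 = female, 0 = male (the reference category).
respondents = [(1, 4.0), (1, 6.0), (1, 5.0), (0, 2.0), (0, 3.0), (0, 4.0)]


def mean(values):
    return sum(values) / len(values)


mean_coded_1 = mean([y for d, y in respondents if d == 1])
mean_reference = mean([y for d, y in respondents if d == 0])

# In a regression of the outcome on this single dummy variable,
# the constant equals the mean of the reference category and the
# coefficient B equals the difference between the two group means.
constant = mean_reference
b_dummy = mean_coded_1 - mean_reference
print(constant, b_dummy)
```

With more than one independent variable in the regression, B is still a difference between the two categories, but adjusted for the other variables.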

DATA ANALYSIS – AN EXAMPLE

Our example is based on Arian and Shamir’s article about the 1981 elections. One of their main arguments was that ethnic origin does not cause differences in voting. The most important causal variable is “hawkishness”. Voters who oppose territorial compromise prefer the Likud. The correlation between ethnicity and voting is spurious (כוזב/מדומה) and is due to the fact that Mizrachim tend to be “hawks” and Ashkenazim “doves”.

The false model is:

ETHNICITY → VOTE

The true model is:

[Diagram: ETHNICITY, HAWKISHNESS, VOTE]

Testing this model requires the following four steps:

  1. Decide exactly how to measure each variable in the model.
  2. Use tables and charts to explore the relationships between the variables in more detail.
  3. Decide whether to change the model in light of the results so far.
  4. Use multiple regression to summarize the relationships between the variables and see if they fit the model(s).

1. DECIDE HOW TO MEASURE THE VARIABLES

The dependent variable is vote: Like Arian and Shamir, we will use “Vote if Knesset elections were held today” (v122) to create a new dummy variable called LIKUD, which is coded 1 if a person said they would vote for the Likud and 0 if they said they would vote for the Maarach. (Of course this is not the only solution. For example, we could have used the Left-Right scale, V116, which gives meaningful values for respondents who supported parties other than Likud and Maarach.)

The first independent variable (the original cause) is ethnic origin (מוצא): there are a variety of ways that ethnic origin could be measured. (We will not do it here, but in this kind of situation it is best to repeat the analysis for each different measure. You can then see if the results depend on how the variable is measured.)
We will try to improve on the way Arian and Shamir measured ethnic origin. They did not distinguish between “first generation” and “second generation”. Also, they defined all “Sabras” (Israeli-born whose father was also born in Israel) as Ashkenazim. We will define immigrant Ashkenazim as the “reference category” (קבוצת התיחסות) and we will create 4 dummy variables for the other 4 categories of the variable “Country of origin” (V137).

V137 / ASH_ISR / MIZ_OLEH / MIZ_ISR / SABRA
1 Israel - Israel / 0 / 0 / 0 / 1
2 Israel - Asia-Africa / 0 / 0 / 1 / 0
3 Israel - Europe-America / 1 / 0 / 0 / 0
4 Asia-Africa - Asia-Africa / 0 / 1 / 0 / 0
5 Europe-America - Europe-America* / 0 / 0 / 0 / 0
6 other combination / Missing / Missing / Missing / Missing
System Missing / Missing / Missing / Missing / Missing

*No dummy variable is created for Ashkenazi immigrants because they serve as the “reference category”.
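The coding scheme in the table above can also be written out as a small recoding rule. This is only a sketch in Python rather than SPSS syntax. The function name and the use of None for a missing value are assumptions made for the illustration.

```python
# Dummy coding of V137, following the table above.
# Category 5 (immigrant Ashkenazim) is the reference category,
# so it gets 0 on all four dummy variables.
DUMMY_NAMES = ("ASH_ISR", "MIZ_OLEH", "MIZ_ISR", "SABRA")
CODED_AS_ONE = {1: "SABRA", 2: "MIZ_ISR", 3: "ASH_ISR", 4: "MIZ_OLEH"}


def dummies_for(v137):
    """Return the four dummy variables for one respondent as a dict.

    None stands in for a missing value (category 6 or system missing).
    """
    if v137 not in (1, 2, 3, 4, 5):
        return {name: None for name in DUMMY_NAMES}
    return {name: int(CODED_AS_ONE.get(v137) == name) for name in DUMMY_NAMES}


print(dummies_for(4))  # a foreign-born Mizrachi respondent
print(dummies_for(5))  # the reference category
```

Each valid respondent gets a 1 on at most one of the four dummies, and the reference category is identified by having 0 on all of them.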

The second independent variable (the “real” cause) is hawkishness: The questionnaire included three different questions concerning attitude to annexation: V7, V8 and V105. Are these three different ways of measuring the same thing? Or does hawkishness have more than one “dimension”? To find out, we could perform a Factor Analysis (a procedure which is explained at the end of this Guide). For now we will use only one question, V8, which ranges from a value of 1 (“no territorial concessions”) to 5 (“concede all of the territories”). This measures “dovishness” (יוניות).

2. GENERATE TABLES

The Custom Tables procedure of SPSS will help us to find out how different “types” of respondents differ in their voting. The dependent variable will be LIKUD, summarized as the mean (ממוצע) Likud vote. The “types” will be combinations of ethnicity (V137) and dovishness (V8). We want to know what happens to the effect of ethnicity when dovishness is controlled. So we need to know the mean Likud vote of each ethnic origin group, for each category of dovishness.

To make the table, from the Analyze menu choose Tables and then Custom Tables.

  1. You will be asked if you want to define how your variables are measured. SPSS requires that the variables which form the rows and columns of the table be either nominal (סולם שמי) or ordinal (סולם סדר). Since both V8 and V137 are defined at the moment as "scale" variables ("סולם רווחים" או מדידה רציפה), you need to change the type of measurement. There are 3 different ways to do this: change the Measure column of the Data Editor (this change is permanent), use the wizard, or right-click a variable in the variable list.
  2. A template (שבלון) is provided for laying out the table. It is recommended that the control variable (in this case V8) be placed in the rows. Drag V8 to "rows" and V137 to "columns". You now have to choose what you want to appear in each cell, which in this case is the mean of the variable LIKUD. The default for cell contents is the number of cases ("count"). Drag the variable LIKUD to the Count labels (make sure all the labels are selected). The calculation will automatically change to Mean, which is what we want. To change the measure, or the number of decimal places, right-click LIKUD in the table and choose Summary Statistics.
  3. The default table doesn't include totals. To add them, right-click V8, select Categories and Totals, then check the box at the bottom marked Show Total and click the Apply button. Repeat for the variable V137. At this stage, the dialog box should look like this:


  4. Finally, choose the Titles tab, and add a title to the table ("Mean Likud Vote in 1981"). Press OK and the table will appear. It should look like this:

Mean Likud Vote in 1981
(rows: v8 Yoniyut; columns: v137 country of origin; each cell shows the mean of likud)

v8 Yoniyut / 1 Israel - Israel / 2 Israel - Asia-Africa / 3 Israel - Europe-America / 4 Asia-Africa - Asia-Africa / 5 Europe-America - Europe-America / 6 other combination / Total
1 against vitur / .62 / .66 / .48 / .60 / .37 / . / .55
2 small vitur / .69 / .35 / .22 / .48 / .34 / . / .38
3 some vitur / .36 / .35 / .24 / .40 / .17 / . / .28
4 nearly all vitur / .33 / .67 / .00 / .00 / .00 / . / .18
5 complete vitur / .00 / .00 / .00 / .33 / .00 / . / .05
Total / .55 / .53 / .32 / .52 / .28 / . / .43
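Each cell of the table above is simply the mean of LIKUD for one combination of V8 and V137. Because LIKUD is coded 0 or 1, that mean is the proportion intending to vote Likud. The logic can be sketched in Python as follows. The nine respondents are invented for illustration and are not taken from the survey.

```python
from collections import defaultdict

# Hypothetical respondents: (v8, v137, likud). Values invented for illustration.
rows = [
    (1, 4, 1), (1, 4, 1), (1, 4, 0),
    (1, 5, 1), (1, 5, 0),
    (3, 4, 1), (3, 4, 0),
    (3, 5, 0), (3, 5, 0),
]


def mean_by_cell(data):
    """Mean of LIKUD for every (v8, v137) combination."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [sum of likud, number of cases]
    for v8, v137, likud in data:
        totals[(v8, v137)][0] += likud
        totals[(v8, v137)][1] += 1
    return {cell: s / n for cell, (s, n) in totals.items()}


cell_means = mean_by_cell(rows)
for cell in sorted(cell_means):
    print(cell, round(cell_means[cell], 2))
```

Note that a mean based on two or three cases, as in this toy data, would not be trustworthy in a real analysis.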

Study the table carefully and you will see several interesting things. First, look at the totals in the bottom row, which show the overall effect of ethnicity: this effect is large. Both Mizrachim and “Sabras” were much more likely than Ashkenazim to vote Likud (more than 50%, compared with about 30%). We can also see that the difference between the foreign-born and Israeli-born generations is small.

Second, look at the totals in the last column, which show the effect of “dovishness” on voting. This effect is also very strong, and it appears to be linear. We can construct a line chart to make sure. Double-click the table, select the first 5 numbers in the last column, then right-click and choose “Create Graph” and “Line”. Here is the result. We see that the influence of dovishness on Likud voting is almost perfectly linear.
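The visual impression of linearity can also be checked numerically. The sketch below (Python, not SPSS) fits a straight line to the five means in the Total column of the table. It is an unweighted fit to the five means that ignores how many respondents fall in each category, so it is only a rough check.

```python
# Means from the Total column of the table (V8 categories 1 to 5).
v8_levels = [1, 2, 3, 4, 5]
likud_means = [0.55, 0.38, 0.28, 0.18, 0.05]

n = len(v8_levels)
mean_x = sum(v8_levels) / n
mean_y = sum(likud_means) / n

# Ordinary least-squares slope and intercept for a single predictor.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(v8_levels, likud_means))
sxx = sum((x - mean_x) ** 2 for x in v8_levels)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Share of the variation in the five means captured by the straight line.
ss_res = sum((y - (intercept + slope * x)) ** 2
             for x, y in zip(v8_levels, likud_means))
ss_tot = sum((y - mean_y) ** 2 for y in likud_means)
r_squared = 1 - ss_res / ss_tot

print(round(slope, 3), round(intercept, 3), round(r_squared, 3))
```

The fitted line falls by about .12 for each step towards the dovish end of the scale, and it accounts for nearly all of the variation between the five means.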


What about the inner cells of the table? If the ethnic effect on voting really is spurious, then within categories of dovishness there will be no “ethnic vote”. A quick look shows that inside each row (level of dovishness) there are still differences in the Likud vote between ethnic groups. However, before going any further we need to answer two questions.

First, are there enough cases for us to have confidence in the means? If there are very few people in a cell, its mean is not worth much! We therefore generate the table again, but this time requesting Count instead of Mean as the desired Statistic. The results (not shown) reveal that there were very few people in the two most extreme categories of the dovishness variable.
Second, are there any categories that should be dropped or combined? The “Sabras” (Israeli-born whose fathers were also born here) should be dropped, because we don't know where their grandparents came from. In addition, we already saw from the previous table that there is little difference in the vote of foreign-born and Israel-born respondents.

We conclude that it would be a good idea to simplify both of our variables. First we recode V8 into a new variable, V8new, and give it the label “Dovishness”. The new variable combines the three most dovish categories (3, 4 and 5). Then we recode V137 into V137new, which leaves out the “Sabras” and compares all Ashkenazim (coded 1) with all Mizrachim (coded 0). (It’s important to add Value Labels so your results will show that 1 is Ashkenazim and 0 is Mizrachim.)
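The two recodes just described can be summarized as simple rules. The following Python sketch is an illustration only, not SPSS syntax. It assumes that the combined dovish category keeps the code 3, that category 6 of V137 stays missing, and that None represents a missing value.

```python
def recode_v8(v8):
    """V8new: keep 1 and 2, combine the three most dovish categories.

    Assumption: the combined category is given the code 3.
    """
    if v8 in (1, 2):
        return v8
    if v8 in (3, 4, 5):
        return 3
    return None  # missing


def recode_v137(v137):
    """V137new: 1 = Ashkenazim, 0 = Mizrachim.

    Sabras (category 1) and other combinations (category 6) are left out.
    """
    if v137 in (3, 5):
        return 1
    if v137 in (2, 4):
        return 0
    return None  # missing


# Value labels, so that output shows group names rather than bare codes.
V137NEW_LABELS = {1: "Ashkenazim", 0: "Mizrachim"}

print(recode_v8(5), recode_v137(4), V137NEW_LABELS[recode_v137(3)])
```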

We will use the two new variables to make a chart showing the difference in Likud vote for Mizrachim and Ashkenazim at different levels of dovishness. Each level of dovishness will be represented by a different line. After creating the table, double-click it, then select all cells except “Totals” and right-click to request a Line chart.
The result is the chart on the left.


The question we asked is: do ethnic differences in voting disappear within categories of dovishness? The answer is NO! Whether they were hawks or doves, many more Mizrachim than Ashkenazim planned to vote Likud. Now let’s ask a different question: can we see any conditional relationships (קשרים מותנים)? In other words, is there any difference between the slopes of the three lines? If so, it would mean that the effect of ethnicity depends on the level of dovishness. But there is actually not much difference.

For a good example of a conditional relationship, look at the chart on the right. Here the effect of ethnicity is examined for different categories of religiousness (שמירת מסורת) instead of for differences in dovishness. We see that among people who are very religious (the top line), the slope is unusual. In this group, Ashkenazim were actually more likely than Mizrachim to vote Likud.

3. SHOULD THE MODEL BE CHANGED?

We try to learn from the tables how best to set up our regression model. In the present example one result (already mentioned) was the discovery that “generation” does not matter: the Likud vote is very similar for immigrant and Israeli Ashkenazim, and for immigrant and Israeli Mizrachim. Therefore, instead of using 4 dummy variables for the 5 categories of ethnic origin in our regressions, we could use only two: “Mizrachim” and “Sabras”, with all Ashkenazim serving as the “reference category”. (We do not actually make this change in the example below.)

More important would be additions or changes to the causal relationships that our regression is designed to test. We have used tables to check two issues: (1) whether the effect of dovishness on voting is linear; and (2) whether dovishness conditions the effect of ethnicity. In reality, neither was a problem. But what if they had been?
(1) Suppose we had found that dovishness made no difference to voting except for the most dovish group. Then, rather than continuing to measure dovishness as a continuous variable it would have been better to turn it into a dummy variable (1=very dovish, 0=everyone else).
(2) What if the effect of ethnicity had been conditional on dovishness? (e.g. all hawks support Likud, regardless of ethnicity; but among doves, there is a difference between Ashkenazim and Mizrachim.) That would have required testing for “interaction”, which is explained later in this Guide.

4. REGRESSIONS

The tables showed that both ethnicity and attitude to annexation affect whether people vote for the Likud. Multiple regression (רגרסיה מרובה) will provide a test of whether the effect of ethnicity on voting is spurious. The purpose of the regression is to see what happens to the effect of ethnicity after we control for dovishness.

  • If the coefficients (מקדמים) of the ethnic “dummy variables” get a lot smaller, this would support the Arian-Shamir hypothesis of a spurious relationship (קשר כוזב).
  • But smaller coefficients may also be consistent with the hypothesis that the effect of ethnicity is mediated by (מתווך ע"י) dovishness, or that both ethnicity and dovishness affect voting (complementary effects, קשרים משלימים).
  • If on the other hand the coefficients remain the same, then the two effects are independent as well as complementary (הפיקוח אינו יוצר תיקון).

SPSS makes it easy to run “before and after” regressions. The original model is defined as the first “block”. The next model (the second “block”) adds variables that were not in the first model.
Use the Linear Regression procedure (from the Analyze menu, choose Regression). Select LIKUD as your Dependent Variable and the four ethnic dummy variables as your Independent Variables. Press Next and add V8 to your second Block. Click on Statistics, select Confidence intervals and R squared change, then click Continue. Now click OK to run the regression.
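To see what the two “blocks” do, here is a minimal sketch in Python of the same “before and after” logic: one regression with an ethnic dummy only, then a second that adds V8. It is not how SPSS computes its output. The helper functions `solve` and `ols` are written for this illustration, only one ethnic dummy is used instead of four, and the twelve respondents are invented.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ols(y, predictors):
    """Return (coefficients with the constant first, R-squared)."""
    X = [[1.0] + list(p) for p in predictors]
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    b = solve(xtx, xty)
    fitted = [sum(bi * xi for bi, xi in zip(b, row)) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return b, 1 - ss_res / ss_tot


# Hypothetical respondents: (miz dummy, v8, likud). Invented for illustration.
data = [(1, 1, 1), (1, 1, 1), (1, 2, 1), (1, 3, 0), (1, 2, 1), (1, 4, 0),
        (0, 1, 1), (0, 2, 0), (0, 3, 0), (0, 4, 0), (0, 3, 0), (0, 5, 0)]
likud = [row[2] for row in data]

b1, r2_1 = ols(likud, [(m,) for m, v8, _ in data])     # Block 1: ethnicity only
b2, r2_2 = ols(likud, [(m, v8) for m, v8, _ in data])  # Block 2: adds V8
print("Model 1:", [round(b, 3) for b in b1], round(r2_1, 3))
print("Model 2:", [round(b, 3) for b in b2], round(r2_2, 3))
```

The comparison of interest is between the coefficient of the dummy in Model 1 and in Model 2, together with the change in R-squared between the two models.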

Understanding Regression Output

The first table of output is the “Model Summary”. It shows the “percentage of variance explained” by each model. (Remember, “Model 1” is what we defined in Block 1, “Model 2” includes Block 2 as well.) We see that ethnicity alone explains 4.7% (.047) of the variance in LIKUD. Adding V8 to the regression more than doubles the Adjusted R-squared, which rises to 10.6% (.106). As discussed in class, we are more interested in the effects (regression coefficients, or “slopes”) than the ability of the model to “explain variance”. However, we may be interested in the F-test of whether the change in R-squared between models is significant. In this case it definitely is statistically significant (see the arrow pointing to .000). This F-test can be very useful when a new “block” adds more than one independent variable. It tests whether these variables as a group add anything significant to the previous model.
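The F-test for the change in R-squared follows a standard formula, sketched here in Python. The function name and all of the input values are hypothetical. They are not taken from the SPSS output discussed in this Guide, which does not report the number of cases.

```python
def f_change(r2_reduced, r2_full, n_cases, k_full, q_added):
    """F statistic for the increase in R-squared between two nested models.

    r2_reduced: R-squared of the earlier model (Block 1)
    r2_full:    R-squared of the later model (Blocks 1 and 2)
    n_cases:    number of cases used in the regression
    k_full:     number of independent variables in the later model
    q_added:    number of independent variables added by the new block
    """
    numerator = (r2_full - r2_reduced) / q_added
    denominator = (1 - r2_full) / (n_cases - k_full - 1)
    return numerator / denominator


# Hypothetical values, for illustration only.
f_stat = f_change(r2_reduced=0.05, r2_full=0.11, n_cases=1000, k_full=5, q_added=1)
print(round(f_stat, 2))
```

The statistic is compared with the F distribution with q_added and n_cases - k_full - 1 degrees of freedom. SPSS reports the resulting significance level in the “Sig. F Change” column.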

Next we need to look at the regression coefficients (the table labeled “Coefficients”). First let’s examine the “unstandardized” coefficients (B) that are highlighted in yellow.

Do Mizrachim vote differently from Ashkenazim, and what difference does it make if we control for dovishness? The B coefficient for MIZ in Model 1 is .232. An unstandardized regression coefficient represents the expected effect on Y of a 1-unit increase in X. In the special case of dummy variables, the coefficient represents the average difference between the category coded 1 and the reference category. The coefficient for MIZ shows that, on average, the proportion voting Likud among foreign-born Mizrachim is 23.2 percentage points higher than among foreign-born Ashkenazim.

What happens when V8 is added to the regression (Model 2)? The coefficients of all the ethnic dummy variables decline slightly, but remain substantial. Comparing the coefficients for MIZ we see that the gap between foreign-born Mizrachim and Ashkenazim falls from 23 to 19 percentage points. Thus, the effect of ethnicity is not spurious. Only a small part of ethnicity’s effect is actually due to dovishness, or is “mediated by” dovishness. Our results therefore do not support Arian and Shamir’s main claim.

Several other features of the regression output are worth noting:

1. The constant (קבוע): it is interpreted as the expected value of Y when all X’s (independent variables) are zero. Usually this is not very informative. But in Model 1 the constant is .283, which means that 28.3% of the reference category (immigrant Ashkenazim) intended to vote Likud. (Take a look at the Total row of the “Mean Likud Vote in 1981” table, in the column for category 5, and you will see the identical result!)
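The link between the Model 1 coefficients and the table of means can be verified with a little arithmetic. The sketch below (Python) uses only numbers reported in this Guide. The variable names are made up for the illustration.

```python
constant_model1 = 0.283  # reported constant: immigrant Ashkenazim (reference)
b_miz_model1 = 0.232     # reported B for foreign-born Mizrachim in Model 1

# With only dummy variables in the model, predicted values are group means.
predicted_reference = constant_model1
predicted_miz = constant_model1 + b_miz_model1

# Totals from the bottom row of the "Mean Likud Vote in 1981" table.
table_total_ashkenazi_immigrants = 0.28
table_total_mizrachi_immigrants = 0.52

print(round(predicted_reference, 3), table_total_ashkenazi_immigrants)
print(round(predicted_miz, 3), table_total_mizrachi_immigrants)
```

Both predictions agree with the table once the table's rounding to two decimal places is taken into account.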

2. The column headed Sig. shows the significance level of each coefficient. The previous column shows the value of the t-statistic, from which significance is computed. If the absolute value of t is at least 2 then the coefficient will usually be significant at the 5% level or better. If the value of “Sig.” is greater than .05 (5%), this means that a coefficient at least this large would turn up in more than 5% of samples even if the true coefficient were zero (i.e. we cannot be confident that the independent variable has any effect).

Significance levels provide a very rough guide to whether a regression coefficient means anything. More helpful are the Confidence Intervals (רווחי בר סמך) shown in the last two columns of the regression table. Recall that “statistical significance” is based on the idea that our data come from a sample which is only one of many possible samples that might have been drawn. The question is, how “typical” are the coefficients in the sample we used compared with what would have been found if we could have run the regression for all possible samples? What SPSS computes for us is the range of coefficients that would have been expected in 95% of these samples. If the values in this range are all meaningful to us, this is a good indication that a coefficient is “solid”. In Model 2 the effect of MIZ in 95% of samples is expected to be somewhere between 10 and 29 percentage points. This is encouraging, because even the lowest expected effect is quite large.
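The confidence interval, the standard error and the t-statistic are closely related. As a rough illustration, the Python sketch below works backwards from the interval reported above for MIZ in Model 2. It assumes the large-sample critical value of 1.96, and the implied standard error and t value are approximations derived from rounded figures, not numbers read from the SPSS output.

```python
# 95% confidence interval for MIZ in Model 2, as reported in the text.
ci_lower, ci_upper = 0.10, 0.29

# Assumption: large-sample (normal) critical value for a 95% interval.
Z_95 = 1.96

# A 95% interval is B plus or minus Z_95 standard errors, so:
midpoint = (ci_lower + ci_upper) / 2             # implied coefficient B
implied_se = (ci_upper - ci_lower) / (2 * Z_95)  # implied standard error
implied_t = midpoint / implied_se                # t = B / standard error

print(round(midpoint, 3), round(implied_se, 3), round(implied_t, 1))
```

An interval that excludes zero, as this one does, goes together with a t-statistic of about 2 or more in absolute value.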