Cris Burgess
Continuing Professional Development: Data management using SPSS
Opening SPSS statistical analysis package:
Click on "Start", select "Data analysis" and then select "SPSS v13"
Handling Files
- Opening an existing data file
Check Open an existing data source
Select file you want and click OK
If the file you want is not displayed, make sure that the option More files is highlighted and then click on OK. You will now be able to select the correct file from your computer's directory of folders. Once you have found the right file, click on it to highlight it and then click on Open.
Existing SPSS data files will have the suffix '.sav' after their name, and will automatically 'read' into your data spreadsheet.
- Entering new data
Check Type in data
Click on OK
- The 'golden rule' is that each individual participant or case gets their own row across the spreadsheet, which will contain all the data that they provide. Therefore, the numbers down the far left-hand column are effectively participants' 'index' numbers. An individual participant’s data set is called a ‘case’.
- Each column will contain data for a particular variable. For example, 'age' values will be contained in one column, 'gender' in the next, 'experience' in the next, and so on.
- You can enter either numbers or text into the cells of the spreadsheet, but for these exercises you will use only numbers.
- Before you enter any data, you must format your spreadsheet in the Data Editor (see next section).
Defining your variables:
It makes it easier in the long run if you 'format' the columns into which your data will be placed. In other words, tell the computer what type of data is being entered in that particular column. If you've opened an existing file, the chances are that it will already have formatted columns. If it doesn't, or if you are entering your own data, the simplest way of doing this is:
Double-click in the grey cell at the top of the column you want to define.
OR: select the Variable View tab at the bottom left-hand corner of the window.
This will bring up the variable information window. Each of the rows corresponds with each of your variables. The columns contain particular kinds of information about those variables:
Name – versions of SPSS before v12 only allow eight characters here; later versions (including v13) accept longer names, but it is still best to keep it short.
Type / Measure – The 'Define variable' window is also used to tell the computer what type of data appears in each column; numerical, categorical or text 'strings', as well as dates, times etc. Different types of data need to be treated in different ways, so it is probably worthwhile looking back to make sure that you understand the differences between the types. See the following sections.
Width/Decimals/Columns/Align – formats the data in the column.
Label – provides a little more information about the variable (this will appear when you move the cursor to the grey cell at the top of each column in the Data View).
Values – allows you to give category variables ‘labels’ for each category value.
Missing – allows you to define missing values of various kinds.
- Text data
If your spreadsheet is to contain text string responses (words, sentences or comments in plain English) from respondents, you must specify this column as containing a 'string' variable:
Bring the Variable View window up and click on the Type cell for the variable that you want to edit.
Check the String variable type, and change the required number of characters (the amount of space required to write what you want to write). You can change this later if you discover that there is not enough space. Click on OK.
- Numeric data
Data from numerical items (eg: 'years since passing driving test', 'preferred speed on motorways', or 'score on sensation-seeking test') can be entered directly into the relevant cells of the spreadsheet without any further formatting.
- Categorical data
Variables for which you have defined response categories (eg: yes/no, male/female, age group, etc.) can be defined by assigning 'value labels' (eg: “1=yes, 2=no” or “1=male, 2=female”). This makes it a lot easier to see what the data represents on your spreadsheet, and in your output. You must enter a number in the spreadsheet, but the computer will display the relevant value label:
Bring the Variable View window up and click on the Values cell for the variable that you want to edit.
A grey square containing three dots will appear in the cell – click on this box.
Enter 1 in the 'value' box; give it a name in the 'value label' box (eg: 'male'). Click on Add. Then enter 2 in the 'value' box and give it a name in the 'value label' box (eg: 'female'). Don't forget to click on Add, or you will lose it! Enter 3 in the 'value' box and ...?
Once all value labels are entered for that variable, click on OK.
Finally, in the Data View window, click on the View option in the top menu, and make sure there is a tick next to Value Labels. If not, then just click on Value Labels, and the labels should appear.
Now when you enter data for this variable, you have two options:
Type the category number into the relevant cell in the spreadsheet (eg: '1') and the value label (eg: 'male') will be displayed.
OR: Move the cursor to the cell you want to fill and click on the right-hand mouse button. Select Pick from labels and then just click once on the required category.
This may seem like a lot of unnecessary work, but it is worth going to the trouble of doing this, as these value labels will now appear on any analysis output and make that output far easier to understand.
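The idea behind value labels can be sketched outside SPSS. The following is a minimal illustration in Python, not SPSS syntax; the data and the helper name label_for are made up for the example. It shows that the stored values stay numeric and the labels are only looked up for display:

```python
# Hypothetical value labels, mirroring the 1 = male, 2 = female example above.
gender_labels = {1: "male", 2: "female"}

# The stored data stay numeric; labels are applied only when displaying them.
gender_codes = [1, 2, 2, 1, 2]


def label_for(code, labels):
    """Return the label for a code, or the code itself as text if it has no label."""
    return labels.get(code, str(code))


displayed = [label_for(code, gender_labels) for code in gender_codes]
print(displayed)
```

A code with no label (for example, a mistyped 3) is shown as the bare number, which is one way that data-entry slips become visible.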
Cleaning your data
- Missing values
Frequently you will find that your data is incomplete; for example, a respondent did not answer a question, their questionnaire response was illegible, or any number of other possibilities. Rather than just leaving the cell blank, it is a good idea to define your 'missing values', so that you know why that cell is empty. However, you need to tell SPSS not to treat these values as actual response values, but to ignore them in any calculations. In order to do this, you need the 'Define Variable' window:
Bring the Variable View window up and click on the Missing cell for the variable that you want to edit.
Check Discrete missing values and enter up to three values, one in each of the available boxes.
Click OK, and OK again.
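To see why missing-value codes must be declared, here is a small sketch in Python (an illustration only, not SPSS syntax). It assumes that 99 has been chosen as the missing-value code and uses made-up 'years since passing driving test' responses:

```python
from statistics import mean

# Assumption: 99 is the code chosen to mean "no answer" (hypothetical data).
MISSING_CODES = {99}
years_driving = [12, 30, 99, 4, 18]

# Drop the missing-value codes before calculating anything.
valid = [value for value in years_driving if value not in MISSING_CODES]

print(mean(years_driving))  # misleading: treats 99 as a genuine response
print(mean(valid))          # based on the four genuine responses only
```

If the code were left undeclared, the 99 would be averaged in as though one respondent had been driving for 99 years.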
- Identifying outliers
There may be some data points that do not ‘fit’ with the main body of data for some reason, called “outliers”. For example, there may be a specific sub-group within your sample that differs in some systematic way from the rest. Hence, this group may not be representative of the population you wish to sample, and so their data should be omitted from your analyses. At a practical level, entering data is a boring job and, like any other boring job, people tend to make mistakes. In addition, some respondents may decide to play a game with you as researcher and return non-relevant data to amuse themselves. In order to identify these data points, you need to ask SPSS to produce some descriptive statistics called “FREQUENCIES”.
Click on Analyze… …Descriptive statistics… …Frequencies.
Select the variables that you want to check for outliers and move them into the Variable(s) box by clicking the arrow button.
It may help if you ask SPSS to produce some bar charts, as this will make outliers stand out more clearly. So, click on Charts, select Bar charts and click on Continue.
Click on OK.
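What a frequency table with a bar chart reveals can be sketched in a few lines of Python (an illustration with made-up ages, not SPSS output). Each distinct value is counted and shown with a simple text bar:

```python
from collections import Counter

# Hypothetical 'age in years' column; 230 is presumably a typing slip for 23.
ages = [23, 31, 27, 23, 230, 45, 31, 23]

# Count how often each value occurs, as a frequency table does.
frequencies = Counter(ages)

# Print the values in order, with one '#' per occurrence as a crude bar chart.
for value in sorted(frequencies):
    print(f"{value:>5}  {'#' * frequencies[value]}")
```

The value 230 sits far from the rest of the list, which is exactly the kind of entry the Frequencies output is meant to expose.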
- Removing outliers
Any outliers will clearly stand out, as they will be either very high or very low compared with the rest of the data set. There are a number of ways to remove these outliers, but the simplest way is to delete the ‘case’ (the complete set of an individual respondent’s data) entirely from the particular analysis for which it is a problem. However, you must remember to save a copy of the file, in order to reload that case before running any other analyses for which it provides valid data:
Run the analysis and make a note of the results.
Select the variable with the problem outlier value by clicking on the grey cell in the spreadsheet that contains the variable name.
Click on Edit… …Find…
Enter the value for the outlier and click on the Search Forward button.
SPSS will highlight the cell that contains the outlier value. You can now delete the problem case:
Click on the relevant grey cell in the far left-hand column and press the Delete button on the keyboard.
Run the analysis again and compare the results with those of the first analysis.
If there are any qualitative differences between the two sets of results, then omit the outlier(s) from the final (reported) analysis. However, if there is no difference in the results, or if the differences are only quantitative, then report the first analysis and note that the inclusion of outliers made no qualitative difference to the results.
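The before-and-after comparison can be sketched as follows. This is a Python illustration with made-up ages, in which 230 is assumed to be the outlier; the original list is left untouched, in the same spirit as keeping a saved copy of the file:

```python
from statistics import mean, median

# Hypothetical data; 230 is the value identified as an outlier.
ages = [23, 31, 27, 23, 230, 45, 31, 23]
OUTLIER = 230

# Work on a filtered copy so that the original data remain available.
without_outlier = [age for age in ages if age != OUTLIER]

# Run the same summary on both versions and compare.
summary_with = (mean(ages), median(ages))
summary_without = (mean(without_outlier), median(without_outlier))

print("with outlier:   ", summary_with)
print("without outlier:", summary_without)
```

In this made-up case the mean changes a great deal while the median barely moves, so whether the outlier matters depends on which statistic you intend to report.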
Analysis: Descriptive Statistics
The first thing that you want to do once you have entered all your data into a spreadsheet is to summarise it. The statistics that summarise data are called 'descriptive statistics'. Which statistics to report is really a case for common sense. Reporting the mean of a category variable does not make sense. For example, if you are asking respondents for their favourite colour, and have coded their responses "1=blue, 2=green, 3=red, etc.", then reporting that the mean 'favourite colour' is "2.7" does not make sense. The modal (most popular) choice would make more sense in this case. Remember that the numbers in your spreadsheet relate to real numbers in the real world – they are not abstract concepts, but things that we can see around us.
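The favourite colour example can be made concrete with a short Python sketch (an illustration only; the responses are made up and the codes follow the "1=blue, 2=green, 3=red" scheme above):

```python
from statistics import mean, mode

# Hypothetical coding scheme and responses for 'favourite colour'.
colour_labels = {1: "blue", 2: "green", 3: "red"}
favourite_colour = [1, 3, 3, 2, 3, 1, 3]

# The mean is a number, but it does not correspond to any colour.
print(mean(favourite_colour))

# The mode is the most frequently chosen category, which can be reported.
most_popular = mode(favourite_colour)
print(colour_labels[most_popular])
```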
- Calculating descriptive statistics
Click on Analyze (sic) in the main menu bar, then select Descriptive statistics and click on Frequencies.
Select those variables you want to report, but now click on Statistics.
Check the boxes relating to those statistics you want to report. Click on Continue.
Click on Charts. Select Bar chart; click on Continue and then on OK.
Windows display
The Microsoft Windows versions of SPSS use three separate windows:
a Data Editor window, which consists of a spreadsheet containing your data
an Output – SPSS Viewer window, which contains a record of everything that you do during a session, including the results of any statistical procedure
a Syntax Editor window, which contains text commands (this will be explained later in this handout)
You can switch between these windows easily:
Click on Window in the top menu. Select the window you want to view (the one currently displayed will have a tick next to it).
Alternatively, use the buttons at the bottom of the screen to switch.
Now you are ready to enter your data, and to run some analyses, but before you do so, if your SPSS has not already been set up to paste commands as well as output into the output window, you should do the following:
Click on Edit, then on Options and then click on the Viewer tab
Click on Display commands in the log so that a tick appears in the bottom left box
Click on Apply and then OK.
Now, whenever you carry out any procedures, the commands should appear along with the output. It is important to have it set up this way so that you can see what transformations you might have made to the data etc. You may need to do this each time you open up the SPSS programme.
Exercises:
- Exercise One: Copying an example datafile to your computer
To start with, you should save a copy of the file ‘CPD-driving.sav’ into your file space. You can find ‘CPD-driving.sav’ by visiting the web-page at:
Go to the ‘Teaching’ section of the page and click on “CPD Statistics”
You will find other relevant files on this page (including the questionnaire from which this data set is derived and copies of all the slides and handouts for the ‘Introduction to Research Statistics’ courses), which you are free to download.
Save the file to a folder on your PC by clicking with the right mouse button and selecting “save target/file as”.
Whenever you need the file again, you now have a copy from which to work.
- Exercise Two: Opening, viewing and changing a data file
Open the file “CPD-driving.sav”
(refer back to the section “Opening an existing data file”)
Look at the first few variables in turn. This should help you to understand what they represent.
Click Utilities… …Variables to get information about the variables.
Close the window and get back to the data window.
Now change some of the variable definitions:
Click the column labelled “licence”, then Data… …Define variables (or just double-click on the grey cell at the top of the relevant column).
Change the name “licence” to “type” and click on OK
Now double-click on the cell at the top of the “gender” column and change the labels for “gender” to “man” and “woman”
Change the “gender” column format to column width 6 and click on OK
Click on View… …value labels. Have the “gender” labels changed?
- Exercise Three: Cleaning your data
Make sure that all missing values are specified as “99” in the Define variable box (you can easily check this by looking at the output from the Utilities… …Variables analysis that you’ve just done). If there are any variables that don’t have missing values specified, follow the relevant procedure to correct this (see “Missing values” under “Cleaning your data”).
Check for outliers in the data. Compare a Descriptives analysis (see next section) for the relevant variable before and after removing the outlying data points. Are the points worth removing?
- Exercise Four: Descriptive statistics
Click Analyze… …Descriptive statistics… …Frequencies. Use this procedure to find out how many males there are, and what percentage of the sample live in an urban area. You can put in as many variables as you like and run a frequency analysis on all of them in one go.
Click Analyze… …Descriptive statistics… …Crosstabs. A dialogue box appears: click on “gender” in the list and then on the arrow in the middle to take “gender” into the top box ‘rows’; then click on “area” and the arrow to take this variable into the ‘columns’ box. Now click OK and the programme will run the analysis. Look at the output. How many of the women live in rural areas?
Now do the same thing with gender and age group (“age”). Once you have done this, run it again, but this time putting “gender” in the columns and “age” in the rows. Which table is easier to read?
Look at these two procedures again, and take them a bit further by looking at the options available at the bottom of the window; e.g., get a frequency table for “age”, but ask for the mean, the median and the mode. Which of these ‘average’ age groups would it not make any sense to report in this case?
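What a crosstab counts can be sketched in Python. This is an illustration only: the codes, labels and cases below are made up and are not taken from the ‘CPD-driving.sav’ file. Each case contributes one count to the cell for its combination of the two variables:

```python
from collections import Counter

# Hypothetical coding schemes (not taken from the example data file).
gender_labels = {1: "male", 2: "female"}
area_labels = {1: "urban", 2: "rural"}

# Hypothetical cases, each recorded as a (gender, area) pair of codes.
cases = [(1, 1), (2, 2), (2, 1), (1, 2), (2, 2), (1, 1)]

# Count how many cases fall into each gender-by-area combination.
crosstab = Counter(cases)

# Print the table with gender in the rows and area in the columns.
print(f"{'':8}" + "".join(f"{area_labels[a]:>8}" for a in sorted(area_labels)))
for g in sorted(gender_labels):
    row = "".join(f"{crosstab[(g, a)]:>8}" for a in sorted(area_labels))
    print(f"{gender_labels[g]:8}{row}")

# The answer to "how many women live in rural areas?" is a single cell.
women_rural = crosstab[(2, 2)]
```

Swapping which variable goes in the rows changes only the layout of the table, not the counts in the cells.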
- Exercise Five: Creating graphs
SPSS is capable of producing many different types of chart and graph. Which one you use depends entirely on the variables that you want to represent. There are so many different types of graph that it is not worth going through them all here, but four useful ones are the bar chart, the line chart, the pie chart and the scatter-plot.
Bar chart
Bar charts are often a good way to communicate information about variables that are made up of ordered categories.
Click on Graphs… …Bar… in the top menu
Click on the Simple type (only one variable to be displayed)
Click on Define
Select “age” and click on the arrow to move it into the category axis box
Click on OK
The number of respondents in each age group will be displayed.
Pie chart
Pie charts are a good option for unordered category variables.
Click on Graphs… …Pie… in the top menu
Make sure that the Summaries for groups of cases option is checked
Click on Define
Select “gender” and click on the arrow to move it into the Define Slices by box
Click on OK
The areas of the pie segments represent the proportion of males and females in the sample.
Scatter plots
This is a good option if you wish to show the relationship between two continuous variables.
Click on Graphs… …Scatter… in the top menu