Date: / 13. 12. 2016
Prepared by: / Irena Vipavc Brvar, Alekdandar Zdravković and Sanja Lužar
Status of the document / Internal
Sintaksa za urejanje podatkovnih datotek
Index
1 Managing data file
1.1 Header
1.2 Comment
1.3 Reading data file
1.4 Frequencies
1.5 Missing values
1.6 Recode
1.7 Recoding with a specified condition
1.8 Hidden value
1.9 Variable and value labels
1.10 COMPUTER / GENERATE
1.11 Formats and rounding xx
1.12 Renaming variable
1.13 Weights
1.14 Saving data file
1.15Creating distribution data file (tab-delimited, Ascii..)
1.16 Merging data files
Appedinx: List of common syntax and its content
1 Managing data file
All changes in a data files should be implemented through programme code – syntax. This is the only way we can ensure traceability of changes. Prepare an empty syntax file (e.g. ID_survey_vX.sps for SPSS or ID_survey_vX.do in Stata ).
In SPSS command sentence finish with a dot (».«).
Note: STATA print a dot (».«) in front of commands in Result window, but the dot is not part of the command and you do not type it.
1.1 Header
Syntax should start with some sort of header. E.g.:
**
* VERSION, DATE
* Version 1, 2010-04-13
* AUTOR
* NAME SURNAME
* CHANGE MADE
* Missing values defined, added labels, anonymized values defined.
***
1.2 Comment
Text that we want to add as a comment needs to start with an »*«at the beginning of the line.
E.g.(same for SPSS and STATA)
* this is a comment
In case we want to add comment inside a command, we use »/*« before the comment and »*/« after it.
E.g. (SPSS)
AUTORECODE VARIABLES=v39dr / INTO /*name of new variable*/ v39new.
1.3 Reading data file
At first we need to import data file we will be working on.
E.g. (SPSS)
GET FILE='..\pb1101.sav'.
or
IMPORT FILE='..\pb1101.por' (when we are working with *.por data file)
(STATA)
Clear if you want the current data in memory cleared.use“…\pb1101.dta”
You can also import some other formats of files in SPSS and STATA, but is it suggested to convert it with another tool (like Nesstar Publisher or Stat transfer) first.
If we can identify at first instance variables that are of no relevance for the clean file, or should not be there because of personal protection issued (like telephone numbers) we drop them at the beginning.
We use command drop(or keep).
E.g. (SPSS)
GET FILE='..\pb1101.sav'
/dropsample sequence str_reg st_citiz veln2 selection1 selection2 address1 thankyou user qi x1 x2 notereason.
(STATA)
use “..\pb1101.dta”
drop str_reg st_citiz veln2 selection1 selection2 address1 thankyou user qi x1 x2 note reason
drop variables:
drop varlist
drop observations if some condition (exp) is fulfilled
drop if exp
drop a range of observations
drop in range if exp
1.4 Frequencies
View your data.
(SPSS)
FREQUENCIES ALL.
(or short: FREQ ALL.)
(STATA)
One-way table of summary statistics
tabulate q1
One-way table for each variable
tab1 q1 q2
or two-way table
tabulate q1 q2
1.5 Missing values
(SPSS)
With command MISSING VALUES we define missing values for variables. For each variable we can define a maximum of 3 missing values or a range. Otherwise we need to use RECODE or COMPUTE command first. As it is common in SPSS we can write a sequence of variables as »fist variable to last variable«.
MISSING VALUESq1 q2 q481c w10 w q773 q774 q775 q780 q781(3).
MISSING VALUESw20 w2 w3 wskup (0).
MISSING VALUESld q776 (4THRU HI).
MISSING VALUESq777 q778 zap2 reg1 (5,8,9).
MISSING VALUESq772a q772b (7).
Use range commands as LO, LOWEST, HI, HIGHEST, THRU.
You can also write
MISSING VALUESq1 q2 q481c w10 w q773 q774 q775 q780 q781 (3),
/w20 w2 w3 wskup (0),
/v1 v2 v3 v10 v9 v5 m1 m3 to m11 m21 m22 m13 m15 m23 (6).
With the command below we clear all defined missing values for the variable list.
MISSING VALUESvarlist ().
(STATA)
For one variable at the time
replace q777 = . if q777 ==9
or with a missing with a name
replace female = .a if female ==-999
replace female = .b if female ==-99
replace female = .c if female >=-3 & female <0
for multiple variables recoding use
According tomvdecode female ses, mv(-999=. a \-99=. b \ -3/-1 =. c)
we can use the key word _all to refer to all the variables in the data set.
mvdecode _all, mv(-999=. \-99=. \ -3/-1 =.)
1.6 Recode
When we want to change values to variable we use command RECODE.
(SPSS)
RECODE v62a (-1=0) (ELSE=COPY) TO v62a_r.
RECODE B14 (1=1) (2=2) (9=3) (ELSE=SYSMIS) TO B14_r.
ELSE refers to all other original values that we don’t define/specify in RECODE.
COPY – copy original values without changing them.
SYSMIS – define values as missing.
For recoding string variables (alphanumeric – like countries; or open questions - occupation) into codes we could use command AURORECODE, which assign a numeric value to each descriptive value.
AUTORECODE VARIABLES=v39dr / INTO v39new.
EXECUTE.
(STATA)
recode v62a (-1=0), gen(v62a_r)
Instead of else, Stata uses nonmissing (for all nonmissing values not explicitly converted in brackets) and missing (for all system missing values).
recode B14 (1=1) (2=2) (nonmissing=3) (missing=0), gen (B14_r)
– where all missing values are converted to zero.
If you want to declare some nonmissing values as missing:
recode B14 (1=1) (2=2) (nonmissing=.), gen (B14_r)
For conversion of alphanumeric variables to numeric
encodev39dr, gen(v39new)
IMPORTANT1: Stata has the following logic of encoding:
If the values that are recoded alphabetical strings, Stata will generate encoding numbers by descending alphabetical order of values regardless of original ordering, for example value starting with »a« will get number 1 no matter where it is originally placed.
If values are numerical strings, like personal id, Stata will generate encoding numbers by descending numerical order of values regardless of original ordering, for example value starting with »1« will get number 1 no matter where it is originally placed.
IMPORTANT2: String values in Stata data browser are marked by red color, and values of variables being encoded marked by blue color. In data browser after recoding you will again see original values in newly created variable (in this case v39new), because Stata considers original values as labels after recoding, and labels of values are shown instead of data their self by default in Stata browser. If you want to seereal encoded values, you have to double click on particular data cell of newly created variable and it will turn to newly encoded value (it holds for any labeled data).
IMPORTANT3: By running encode command, Stata does not automatically rearrange order of data by values of newly created variable. If you want to do it, you have to run command:
sort v39new
IMPORTANT4: Sometimes it happens that Stata do not recognize numerical data, but classify them as string. If you want to convert numerical data, wrongly seen by Stata as string, into numerical format (without any kind of encoding), you can do it by command:
This will ignore alphanumeric charactersdestring v39dr, gen(v39new) force
If missing values are not clearly defined please use as following
Value / Value label99 / not recorded
98 / not provided/no answer
97 / not applicable
96 / don’t known
95 / error
1.7 Recoding with a specified condition
Sometimes it is necessary to recode variables on the basis of previous values, conditions (ie: respondent was asked a question event if he was not supposed to be and has non-missing value).
(SPSS)
DO IF (age50).
RECODE q10 (1=97)(2=97).
END IF.
EXECUTE.
ADD VALUE LABELS q10 97'not applicable'.
(STATA)
recodeq10 (1=97)(2=97)if age50
Note: In case of condition age = 50, proper Stata syntax is age == 50
IMPORTANT!!!
Work with data labels in Stata is extremely complicated relative to SPSS. First you have to predefine so-called list of value labels independently of variable, and then you can attach this list to any variable which data match predefined labels in label list. For example, suppose that you needto label values in variable q10. First you predefine list of labels, e.g. list_q10 by command.
label definelist_q10 1 "one" 2 "two" 3 "three"...
Then you attach this list to given variable.
label values q10 list_q10
So if you want to add label for newly added value to some variable, basically you have to redefine list of value labels and replace existing.
label define list_q10 1 "one" 2 "two" 3 "three"... 97 " not applicable", replace
In the rest of the text we use this logic in explaining change of data labels
1.8 Hidden values
Some variables / values might be critical for identification of interviewed persons. This might be town, municipality, occupation, names, surnames etc.
Values of such variables will be hidden for publish distribution. We still leave variables in and hide the values. In case user really needs them, they can ask for additional access, and could view un-anonymized file in safe room environment. In this case we will have two distribution files (PUF, SUF).
(SPSS)
For string type variables:
STRINGtown_r (A60).
RECODE town (ELSE='Hiden value') INTO town_r.
VARIABLE LABELS town_r 'Town'.
For numeric variables (recoding into -1):
RECODE post (ELSE=-1) INTO post_r.
VARIABLE LABELS post_r 'Post office number'.
VALUE LABELS post_r -1 'Hidden value'.
In case you want to keep missing values:
RECODE post (99=SYSMIS)(ELSE=-1) INTO post_r.
VALUE LABELS post_r -1 'Hiden value'.
FREQUENCIES post_r.
Oz. if values are already defined as missing
RECODEpost (SYSMIS=SYSMIS) (ELSE=-1).
1.9 Variable and value labels
Variables that have inadequate, incorrect or missing names we add/ edit them with:
(SPSS)
VARIABLE LABELS weight 'Poststratification weight.
VARIABLE LABELS region 'Region'.
(STATA)
label define list_labels1weight "Poststratification weight", replace
label define list_labels2 region"Region", replace
It has been assumed that list_label1 and list_label2 are already attached to relevant variables which labels are changing, as explained above.
If variables values (answer categories) are missing or are incorrect:
(SPSS)
VALUE LABEL q807_1 to q807_5 1'yes'
2'no'
3'DK, NA'.
(STATA)
label define list_q807 1 "yes" 2 "no" 3 "DK, NA", replace
Under assumption that list_q807 is already attached to variables q807_1 to q807.
If the variable does not have value labels:
label values var1 list_q807
1.10 COMPUTER / GENERATE
Command COMPUTE (GENERATE or. GEN) is one of the commands to transform data. It will calculate certain value of a variable and adapt new values. It creates a new variable either by alternating single existing variable or by mathematical operations applied to two or more single variables.
(SPSS)
COMPUTE problems=q19_1r + q19_2r + q19_3r + q19_4r.
COMPUTEage = 101-ybirth.
(STATA)
gen problems=q19_1r + q19_2r + q19_3r + q19_4r
gen age = 101-ybirth.
Command COMPUTE (GENERATE) is together with command RECODE most often used as the help of defining missing values.
(SPSS)
COMPUTEOBZ1=OBZ.
RECODEOBZ1(0,5,7,8,9=0).
MISSING VALUESOBZ1(0).
(STATA)
gen OBZ1=OBZ
recode OBZ1(0=0) (5=0) (7=0) (8=0) (9=0), replace or shorter recode OBZ1 0 5 7 8 9=0, replace
replace OBZ1=.ifOBZ1==0
1.11 Formats and rounding
(STATA)
formatvariable name %9.0g (general)
or %9. 2f (fixed)
%10.7 e (exponential)
and so on
(SPSS)
We use command FORMATS to assign a format to specific variable.
If we want to round a variable to whole number we write:
FORMATS age weight(F2.0).
F – stands for numerical variable
2 – number of reserved places
0 – how many of reserved places are decimal places
Note1: Have in mind that ».« uses one place – so if you want a format »3.30« you should write »4.2«.
Note2: Rounding will effect output calculation as well.
It is possible to define number of places also for alphanumeric variables.
FORMAT v31else(A20).
A – stands for alphanumeric variable
20 – number of reserved places
When we want to add variable with a date:
COMPUTE survey = 1.
EXECUTE.
COMPUTE survey = DATE.MOYR(1,2005).
EXECUTE.
FORMATS survey(moyr8).
All values of this variable will have value »jan 2005«.
Or
COMPUTE survey = 1.
EXECUTE.
COMPUTE survey = DATE.DMY(23,1,2005).
EXECUTE.
FORMATS survey(edate10).
All values of this variable will have value »23.1. 2005«.
Do not forget to label new variable.
Formats of dates could be written as follows:
DATE.DMY. DATE.DMY(day,month,year).
DATE.MDY. DATE.MDY(month,day,year).
DATE.MOYR. DATE.MOYR(month,year).
DATE.QYR. DATE.QYR(quarter,year).
DATE.WKYR. DATE.WKYR(weeknum,year).
DATE.YRDAY. DATE.YRDAY(year,daynum).
TIME.DAYS. TIME.DAYS(days).
TIME.HMS. TIME.HMS(hours[,minutes,seconds]).
Rounding XX
(STATA)
generate var2=int (var1)
or
replace var1=floor (var1)
replace var1=ceil (var1)
or
round (x[,y])
x rounded in units of y or x rounded to the nearest integer if the argument y is omitted; x (not ".") if xis missing (meaning that round(.a) = .a and thatround(.a,y) = .a if y is not missing) and if y ismissing, then "." is returned
1.12 Renaming variable
If variables need to be renamed:
(SPSS)
RENAME VARIABLES M20=M20_1.
(STATA)
rename M20 M20_1
1.13 Weights
Data file might have weights in it. You might be using them to compare data file with published results.
However data files that will be distributed should have weights set of.
WEIGHT
OFF.
You can see it weights are on or off in SPSS data view window. There is note in right bottom corner saying »Weight On« or “Weight off”.
Weighting of the data in Stata does not work in similar manner as in SPSS. Basically, there is no possibility to turn on or off weighting of some variable, but only creating a new variable with weighted values of original variable.
Stata has a wide range of commands for survey data, where defining weights is possible (svy)
1.14 Saving data file
For saving a data file:
(SPSS)
SAVE OUTFILE='..\mlaalk07.sav'.
EXPORT OUTFILE='..\mlaalk07.por'.
(STATA)
save“..\mlaalk07.dta”
For exporting in other file formats see help for more information.
1.15Creating distribution data file (tab-delimited, Ascii..)
Creating tab delimited output file for archival purpose.
(SPSS)
SAVE TRANSLATE OUTFILE='..\id_survey_f1.txt'
/TYPE=TAB
/MAP
/REPLACE
/FIELDNAMES
/CELLS=VALUES.
(STATA)
export delimited using “..\id_survey_f1.txt”, delimiter(tab) replace
1.16 Merging data files
In order to merge two or more datasets to one cumulative database, you have to have some common identifier for all merging datasets. Common id should be unique for given data record in both datasets to have sensible merging.
(SPSS)
We call our first dataset:
GET FILE='..\data\pb\2005\pb0501.sav'.
And save it:
SAVE OUTFILE=..\data\pb\2005\pbsi05.sav'.
exec.
Then we add files:
ADD FILES /FILE=*
/FILE='..\data\pb\2005\pbsi0502.sav'.
EXECUTE.
ADD FILES /FILE=*
/FILE='..\data\pb\2005\pbsi0503.sav'.
EXECUTE.
(STATA)
We call our first dataset:
use “..\data\pb\2005\pb0501.dta”
joinby common_idusing “..\data\pb\2005\pbsi0502.dta”
and save it (by creating new data file)
save “..\data\pb\2005\pbsi05.dta”
Note: joinby makes all possible combinations, … often the command merge is used, … to add new observations command appendis used
Note: when merging, data records, wherein common id is not matched, are deleted in both merging datasets. If you want to save data with unmatched id, you can use the option unmatched (both)
joinbycommon_idusing “..\data\pb\2005\pbsi0502.dta”, unmatched (both)
Note: it is possible to merge data using one or more common id, when it has sense, like in panel data you can merge by years and subject id
joinby common_id years using“..\data\pb\2005\pbsi0502.dta”
Appedinx: List of common syntax and its content in SPSS
ADD FILESADD FILES FILE={'savfile'|'dataset'}
{* }
[/RENAME=(old varnames=new varnames)...]
[/IN=varname]
/FILE=... [/RENAME=...] [/IN=...]
[/BY varlist]
[/MAP]
[/KEEP={ALL** }] [/DROP=varlist]
{varlist}
[/FIRST=varname] [/LAST=varname]
Example
ADD FILES FILE="/data/school1.sav" /FILE="/data/school2.sav".
AUTORECODE
AUTORECODE VARIABLES=varlist
/INTO new varlist
[/BLANK={VALID**}
{MISSING}
[/GROUP]
[/APPLY TEMPLATE=’filespec’]
[/SAVE TEMPLATE=’filespec’]
[/DESCENDING]
[/PRINT]
COMPUTE
COMPUTE target variable=expression
Example
COMPUTE newvar1=var1+var2.
COMPUTE newvar2=RND(MEAN(var1 to var4).
COMPUTE logicalVar=(var1>5).
STRING newString (A10).
COMPUTE newString=CONCAT((RTRIM(stringVar1), stringVar2).
Functions and operators available for COMPUTE are described in Transformation Expressions.
•Arithmetic Operations
•Numeric Constants
•Numeric Functions
•Arithmetic Functions
•Statistical Functions
•Random Variable and Distribution Functions
•String Functions
•String/Numeric Conversion Functions
•Logical Functions
•Relational Operators
•NOT Logical Operator
•AND and OR Logical Operators
•LAG Function
•Missing Value Functions
•Scoring Expressions (PASW Statistics Server)
•Date and Time Functions
EXPORT
EXPORT OUTFILE='file'
[/TYPE={COMM**}]
{TAPE }
[/UNSELECTED=[{RETAIN}]
{DELETE}
[/KEEP={ALL** }] [/DROP=varlist]
{varlist}
[/RENAME=(old varnames=new varnames)...]
[/MAP]
[/DIGITS=n]
Example
EXPORT OUTFILE="/data/newdata.por"
/RENAME=(V1 TO V3=ID, SEX, AGE) /MAP.
FORMATS
FORMATS varlist(format) [varlist...]
COMPUTE V3=V1 + V2.
FORMATS V3 (F3.1).
•COMPUTE creates the new numeric variable V3. By default, V3 is assigned an F8.2 format (or the default format specified on SET).
•FORMATS changes both the print and write formats for V3 to F3.1.
FREQUENCIES
FREQUENCIES VARIABLES=varlist [varlist...]
[/FORMAT= [{NOTABLE }] [{AVALUE**}]
{LIMIT(n)} {DVALUE}
{AFREQ }
{DFREQ }
[/MISSING=INCLUDE]
[/BARCHART=[MINIMUM(n)] [MAXIMUM(n)] [{FREQ(n) }]]
{PERCENT(n)}
[/PIECHART=[MINIMUM(n)] [MAXIMUM(n)] [{FREQ }] [{MISSING }]]
{PERCENT} {NOMISSING}
[/HISTOGRAM=[MINIMUM(n)] [MAXIMUM(n)] [{FREQ(n) }] [{NONORMAL}] ]
{NORMAL }
[/GROUPED=varlist [{(width) }]]
{(boundary list)}
[/NTILES=n]
[/PERCENTILES=value list]
[/STATISTICS=[DEFAULT] [MEAN] [STDDEV] [MINIMUM] [MAXIMUM]
[SEMEAN] [VARIANCE] [SKEWNESS] [SESKEW] [RANGE]
[MODE] [KURTOSIS] [SEKURT] [MEDIAN] [SUM] [ALL]
[NONE]]
[/ORDER=[{ANALYSIS}] [{VARIABLE}]
GET DATA
GET DATA
/TYPE = {ODBC }
{OLEDB}
{XLS }
{XLSX }
{XLSM }
{TXT }
/FILE = ’filename’
Subcommands for TYPE = ODBC and OLEDB
/CONNECT=’connection string’
[{/ENCRYPTED }]
{/UNENCRYPTED}
/SQL ’select statement’
[’select statement continued’]
Subcommands for TYPE=ODBC, TYPE=OLEDB, XLS, XLSX, and XLSM
[/ASSUMEDSTRWIDTH={255**}]
{n }
Subcommands for TYPE = XLS, XLSX, and XLSM*
[/SHEET = {INDEX**} {sheet number}]
{NAME } {’sheet name’}
[/CELLRANGE = {RANGE } {’start point:end point’ }]
{FULL**}
[/READNAMES = {on** }]
{off }
Subcommands for TYPE = TXT
[/ARRANGEMENT = {FIXED }]
{DELIMITED**}
[/FIRSTCASE = {n}]
[/DELCASE = {LINE** }]1
{VARIABLES n}
[/FIXCASE = n]
[/IMPORTCASE = {ALL** }]
{FIRST n }
{PERCENT n}
[/DELIMITERS = {"delimiters"}]
[/QUALIFIER = "qualifier"]
VARIABLES subcommand for ARRANGEMENT = DELIMITED
/VARIABLES = varname {format}
VARIABLES subcommand for ARRANGEMENT = FIXED
/VARIABLES varname {startcol - endcol} {format}
{/rec#} varname {startcol - endcol} {format}
MISSING VALUES
MISSING VALUES {varlist}(value list) [[/]{varlist} ...]
{ALL } {ALL }
Keywords for numeric value lists:
LO, LOWEST, HI, HIGHEST, THRU
Example
MISSING VALUES V1 (8,9) V2 V3 (0) V4 ('X') V5 TO V9 (' ').
RENAME VARIABLES
RENAME VARIABLES {(varname=varname) [(varname ...)]}
{(varnames=varnames) }
Example
RENAME VARIABLES (JOBCAT=TITLE).
RECODE
For numeric variables:
RECODE varlist (value list=value)...(value list=value) [INTO varlist]
[/varlist...]
Input keywords:
LO, LOWEST, HI, HIGHEST, THRU, MISSING, SYSMIS, ELSE
Output keywords:
COPY, SYSMIS
For string variables:
RECODE varlist [('string',['string'...]='string')][INTO varlist]
[/varlist...]
Input keywords:
CONVERT, ELSE
Output keyword:
COPY
Examples
RECODE V1 TO V3 (0=1) (1=0) (2,3=-1) (9=9) (ELSE=SYSMIS).
RECODE STRNGVAR ('A','B','C'='A')('D','E','F'='B')(ELSE=' ').
SAVE MODEL
SAVE MODEL OUTFILE='filename'
[/KEEP={ALL** }]
{model names}
{procedures }
[/DROP={model names}]
{procedures }
[/TYPE={MODEL**}]
{COMMAND}
VALUE LABELS
VALUE LABELS varlist value 'label' value 'label'... [/varlist...]
[/datevarlist ’value’ ’label’...]
Example
VALUE LABELS JOBGRADE 'P' 'Parttime Employee' 'C' 'Customer Support'.
VARIABLE LABELS
VARIABLE LABELS varname 'label' [/varname...]
Example
VARIABLE LABELS YRHIRED 'YEAR OF FIRST HIRING'.
WEIGHT
WEIGHT {BY varname}
{OFF }
WEIGHT BY V1.
Stran 1 od 18