1 Managing Data File

SYTAX FOR MANAGING DATA FILES
Date: / 13. 12. 2016
Prepared by: / Irena Vipavc Brvar, Alekdandar Zdravković and Sanja Lužar
Status of the document / Internal

Sintaksa za urejanje podatkovnih datotek

Index

1 Managing data file

1.1 Header

1.2 Comment

1.3 Reading data file

1.4 Frequencies

1.5 Missing values

1.6 Recode

1.7 Recoding with a specified condition

1.8 Hidden value

1.9 Variable and value labels

1.10 COMPUTER / GENERATE

1.11 Formats and rounding xx

1.12 Renaming variable

1.13 Weights

1.14 Saving data file

1.15Creating distribution data file (tab-delimited, Ascii..)

1.16 Merging data files

Appedinx: List of common syntax and its content

1 Managing data file

All changes in a data files should be implemented through programme code – syntax. This is the only way we can ensure traceability of changes. Prepare an empty syntax file (e.g. ID_survey_vX.sps for SPSS or ID_survey_vX.do in Stata ).

In SPSS command sentence finish with a dot (».«).

Note: STATA print a dot (».«) in front of commands in Result window, but the dot is not part of the command and you do not type it.

1.1 Header

Syntax should start with some sort of header. E.g.:

* VERSION, DATE

* Version 1, 2010-04-13

* AUTOR

* NAME SURNAME

* CHANGE MADE

* Missing values defined, added labels, anonymized values defined.

***

1.2 Comment

Text that we want to add as a comment needs to start with an »*«at the beginning of the line.

E.g.(same for SPSS and STATA)

* this is a comment

In case we want to add comment inside a command, we use »/*« before the comment and »*/« after it.

E.g. (SPSS)

AUTORECODE VARIABLES=v39dr / INTO /*name of new variable*/ v39new.

1.3 Reading data file

At first we need to import data file we will be working on.

E.g. (SPSS)

GET FILE='..\pb1101.sav'.

IMPORT FILE='..\pb1101.por' (when we are working with *.por data file)

(STATA)

Clear if you want the current data in memory cleared.

use“…\pb1101.dta”

You can also import some other formats of files in SPSS and STATA, but is it suggested to convert it with another tool (like Nesstar Publisher or Stat transfer) first.

If we can identify at first instance variables that are of no relevance for the clean file, or should not be there because of personal protection issued (like telephone numbers) we drop them at the beginning.

We use command drop(or keep).

E.g. (SPSS)

GET FILE='..\pb1101.sav'

/dropsample sequence str_reg st_citiz veln2 selection1 selection2 address1 thankyou user qi x1 x2 notereason.

(STATA)

use “..\pb1101.dta”

drop str_reg st_citiz veln2 selection1 selection2 address1 thankyou user qi x1 x2 note reason

drop variables:

drop varlist

drop observations if some condition (exp) is fulfilled

drop if exp

drop a range of observations

drop in range if exp

1.4 Frequencies

View your data.

(SPSS)

FREQUENCIES ALL.

(or short: FREQ ALL.)

(STATA)

One-way table of summary statistics

tabulate q1

One-way table for each variable

tab1 q1 q2

or two-way table

tabulate q1 q2

1.5 Missing values

(SPSS)

With command MISSING VALUES we define missing values for variables. For each variable we can define a maximum of 3 missing values or a range. Otherwise we need to use RECODE or COMPUTE command first. As it is common in SPSS we can write a sequence of variables as »fist variable to last variable«.

MISSING VALUESq1 q2 q481c w10 w q773 q774 q775 q780 q781(3).

MISSING VALUESw20 w2 w3 wskup (0).

MISSING VALUESld q776 (4THRU HI).

MISSING VALUESq777 q778 zap2 reg1 (5,8,9).

MISSING VALUESq772a q772b (7).

Use range commands as LO, LOWEST, HI, HIGHEST, THRU.

You can also write

MISSING VALUESq1 q2 q481c w10 w q773 q774 q775 q780 q781 (3),

/w20 w2 w3 wskup (0),

/v1 v2 v3 v10 v9 v5 m1 m3 to m11 m21 m22 m13 m15 m23 (6).

With the command below we clear all defined missing values for the variable list.

MISSING VALUESvarlist ().

(STATA)

For one variable at the time

replace q777 = . if q777 ==9

or with a missing with a name

replace female = .a if female ==-999

replace female = .b if female ==-99

replace female = .c if female >=-3 & female <0

for multiple variables recoding use

According to

mvdecode female ses, mv(-999=. a \-99=. b \ -3/-1 =. c)

we can use the key word _all to refer to all the variables in the data set.

mvdecode _all, mv(-999=. \-99=. \ -3/-1 =.)

1.6 Recode

When we want to change values to variable we use command RECODE.

(SPSS)

RECODE v62a (-1=0) (ELSE=COPY) TO v62a_r.

RECODE B14 (1=1) (2=2) (9=3) (ELSE=SYSMIS) TO B14_r.

ELSE refers to all other original values that we don’t define/specify in RECODE.

COPY – copy original values without changing them.

SYSMIS – define values as missing.

For recoding string variables (alphanumeric – like countries; or open questions - occupation) into codes we could use command AURORECODE, which assign a numeric value to each descriptive value.

AUTORECODE VARIABLES=v39dr / INTO v39new.

EXECUTE.

(STATA)

recode v62a (-1=0), gen(v62a_r)

Instead of else, Stata uses nonmissing (for all nonmissing values not explicitly converted in brackets) and missing (for all system missing values).

recode B14 (1=1) (2=2) (nonmissing=3) (missing=0), gen (B14_r)

– where all missing values are converted to zero.

If you want to declare some nonmissing values as missing:

recode B14 (1=1) (2=2) (nonmissing=.), gen (B14_r)

For conversion of alphanumeric variables to numeric

encodev39dr, gen(v39new)

IMPORTANT1: Stata has the following logic of encoding:

If the values that are recoded alphabetical strings, Stata will generate encoding numbers by descending alphabetical order of values regardless of original ordering, for example value starting with »a« will get number 1 no matter where it is originally placed.

If values are numerical strings, like personal id, Stata will generate encoding numbers by descending numerical order of values regardless of original ordering, for example value starting with »1« will get number 1 no matter where it is originally placed.

IMPORTANT2: String values in Stata data browser are marked by red color, and values of variables being encoded marked by blue color. In data browser after recoding you will again see original values in newly created variable (in this case v39new), because Stata considers original values as labels after recoding, and labels of values are shown instead of data their self by default in Stata browser. If you want to seereal encoded values, you have to double click on particular data cell of newly created variable and it will turn to newly encoded value (it holds for any labeled data).

IMPORTANT3: By running encode command, Stata does not automatically rearrange order of data by values of newly created variable. If you want to do it, you have to run command:

sort v39new

IMPORTANT4: Sometimes it happens that Stata do not recognize numerical data, but classify them as string. If you want to convert numerical data, wrongly seen by Stata as string, into numerical format (without any kind of encoding), you can do it by command:

This will ignore alphanumeric characters

destring v39dr, gen(v39new) force

If missing values are not clearly defined please use as following

Value / Value label
99 / not recorded
98 / not provided/no answer
97 / not applicable
96 / don’t known
95 / error

1.7 Recoding with a specified condition

Sometimes it is necessary to recode variables on the basis of previous values, conditions (ie: respondent was asked a question event if he was not supposed to be and has non-missing value).

(SPSS)

DO IF (age50).

RECODE q10 (1=97)(2=97).

END IF.

EXECUTE.

ADD VALUE LABELS q10 97'not applicable'.

(STATA)

recodeq10 (1=97)(2=97)if age50

Note: In case of condition age = 50, proper Stata syntax is age == 50

IMPORTANT!!!

Work with data labels in Stata is extremely complicated relative to SPSS. First you have to predefine so-called list of value labels independently of variable, and then you can attach this list to any variable which data match predefined labels in label list. For example, suppose that you needto label values in variable q10. First you predefine list of labels, e.g. list_q10 by command.

label definelist_q10 1 "one" 2 "two" 3 "three"...

Then you attach this list to given variable.

label values q10 list_q10

So if you want to add label for newly added value to some variable, basically you have to redefine list of value labels and replace existing.

label define list_q10 1 "one" 2 "two" 3 "three"... 97 " not applicable", replace

In the rest of the text we use this logic in explaining change of data labels

1.8 Hidden values

Some variables / values might be critical for identification of interviewed persons. This might be town, municipality, occupation, names, surnames etc.

Values of such variables will be hidden for publish distribution. We still leave variables in and hide the values. In case user really needs them, they can ask for additional access, and could view un-anonymized file in safe room environment. In this case we will have two distribution files (PUF, SUF).

(SPSS)

For string type variables:

STRINGtown_r (A60).

RECODE town (ELSE='Hiden value') INTO town_r.

VARIABLE LABELS town_r 'Town'.

For numeric variables (recoding into -1):

RECODE post (ELSE=-1) INTO post_r.

VARIABLE LABELS post_r 'Post office number'.

VALUE LABELS post_r -1 'Hidden value'.

In case you want to keep missing values:

RECODE post (99=SYSMIS)(ELSE=-1) INTO post_r.

VALUE LABELS post_r -1 'Hiden value'.

FREQUENCIES post_r.

Oz. if values are already defined as missing

RECODEpost (SYSMIS=SYSMIS) (ELSE=-1).

1.9 Variable and value labels

Variables that have inadequate, incorrect or missing names we add/ edit them with:

(SPSS)

VARIABLE LABELS weight 'Poststratification weight.

VARIABLE LABELS region 'Region'.

(STATA)

label define list_labels1weight "Poststratification weight", replace

label define list_labels2 region"Region", replace

It has been assumed that list_label1 and list_label2 are already attached to relevant variables which labels are changing, as explained above.

If variables values (answer categories) are missing or are incorrect:

(SPSS)

VALUE LABEL q807_1 to q807_5 1'yes'

2'no'

3'DK, NA'.

(STATA)

label define list_q807 1 "yes" 2 "no" 3 "DK, NA", replace

Under assumption that list_q807 is already attached to variables q807_1 to q807.

If the variable does not have value labels:

label values var1 list_q807

1.10 COMPUTER / GENERATE

Command COMPUTE (GENERATE or. GEN) is one of the commands to transform data. It will calculate certain value of a variable and adapt new values. It creates a new variable either by alternating single existing variable or by mathematical operations applied to two or more single variables.

(SPSS)

COMPUTE problems=q19_1r + q19_2r + q19_3r + q19_4r.

COMPUTEage = 101-ybirth.

(STATA)

gen problems=q19_1r + q19_2r + q19_3r + q19_4r

gen age = 101-ybirth.

Command COMPUTE (GENERATE) is together with command RECODE most often used as the help of defining missing values.

(SPSS)

COMPUTEOBZ1=OBZ.

RECODEOBZ1(0,5,7,8,9=0).

MISSING VALUESOBZ1(0).

(STATA)

gen OBZ1=OBZ

recode OBZ1(0=0) (5=0) (7=0) (8=0) (9=0), replace or shorter recode OBZ1 0 5 7 8 9=0, replace

replace OBZ1=.ifOBZ1==0

1.11 Formats and rounding

(STATA)

formatvariable name %9.0g (general)

or %9. 2f (fixed)

%10.7 e (exponential)

and so on

(SPSS)

We use command FORMATS to assign a format to specific variable.

If we want to round a variable to whole number we write:

FORMATS age weight(F2.0).

F – stands for numerical variable

2 – number of reserved places

0 – how many of reserved places are decimal places

Note1: Have in mind that ».« uses one place – so if you want a format »3.30« you should write »4.2«.

Note2: Rounding will effect output calculation as well.

It is possible to define number of places also for alphanumeric variables.

FORMAT v31else(A20).

A – stands for alphanumeric variable

20 – number of reserved places

When we want to add variable with a date:

COMPUTE survey = 1.

EXECUTE.

COMPUTE survey = DATE.MOYR(1,2005).

EXECUTE.

FORMATS survey(moyr8).

All values of this variable will have value »jan 2005«.

COMPUTE survey = 1.

EXECUTE.

COMPUTE survey = DATE.DMY(23,1,2005).

EXECUTE.

FORMATS survey(edate10).

All values of this variable will have value »23.1. 2005«.

Do not forget to label new variable.

Formats of dates could be written as follows:

DATE.DMY. DATE.DMY(day,month,year).

DATE.MDY. DATE.MDY(month,day,year).

DATE.MOYR. DATE.MOYR(month,year).

DATE.QYR. DATE.QYR(quarter,year).

DATE.WKYR. DATE.WKYR(weeknum,year).

DATE.YRDAY. DATE.YRDAY(year,daynum).

TIME.DAYS. TIME.DAYS(days).

TIME.HMS. TIME.HMS(hours[,minutes,seconds]).

Rounding XX

(STATA)

generate var2=int (var1)

replace var1=floor (var1)

replace var1=ceil (var1)

round (x[,y])

x rounded in units of y or x rounded to the nearest integer if the argument y is omitted; x (not ".") if xis missing (meaning that round(.a) = .a and thatround(.a,y) = .a if y is not missing) and if y ismissing, then "." is returned

1.12 Renaming variable

If variables need to be renamed:

(SPSS)

RENAME VARIABLES M20=M20_1.

(STATA)

rename M20 M20_1

1.13 Weights

Data file might have weights in it. You might be using them to compare data file with published results.

However data files that will be distributed should have weights set of.

WEIGHT

OFF.

You can see it weights are on or off in SPSS data view window. There is note in right bottom corner saying »Weight On« or “Weight off”.

Weighting of the data in Stata does not work in similar manner as in SPSS. Basically, there is no possibility to turn on or off weighting of some variable, but only creating a new variable with weighted values of original variable.

Stata has a wide range of commands for survey data, where defining weights is possible (svy)

1.14 Saving data file

For saving a data file:

(SPSS)

SAVE OUTFILE='..\mlaalk07.sav'.

EXPORT OUTFILE='..\mlaalk07.por'.

(STATA)

save“..\mlaalk07.dta”

For exporting in other file formats see help for more information.

1.15Creating distribution data file (tab-delimited, Ascii..)

Creating tab delimited output file for archival purpose.

(SPSS)

SAVE TRANSLATE OUTFILE='..\id_survey_f1.txt'
/TYPE=TAB
/MAP
/REPLACE
/FIELDNAMES
/CELLS=VALUES.

(STATA)

export delimited using “..\id_survey_f1.txt”, delimiter(tab) replace

1.16 Merging data files

In order to merge two or more datasets to one cumulative database, you have to have some common identifier for all merging datasets. Common id should be unique for given data record in both datasets to have sensible merging.

(SPSS)

We call our first dataset:

GET FILE='..\data\pb\2005\pb0501.sav'.

And save it:

SAVE OUTFILE=..\data\pb\2005\pbsi05.sav'.

exec.

Then we add files:

ADD FILES /FILE=*

/FILE='..\data\pb\2005\pbsi0502.sav'.

EXECUTE.

ADD FILES /FILE=*

/FILE='..\data\pb\2005\pbsi0503.sav'.

EXECUTE.

(STATA)

We call our first dataset:

use “..\data\pb\2005\pb0501.dta”

joinby common_idusing “..\data\pb\2005\pbsi0502.dta”

and save it (by creating new data file)

save “..\data\pb\2005\pbsi05.dta”

Note: joinby makes all possible combinations, … often the command merge is used, … to add new observations command appendis used

Note: when merging, data records, wherein common id is not matched, are deleted in both merging datasets. If you want to save data with unmatched id, you can use the option unmatched (both)

joinbycommon_idusing “..\data\pb\2005\pbsi0502.dta”, unmatched (both)

Note: it is possible to merge data using one or more common id, when it has sense, like in panel data you can merge by years and subject id

joinby common_id years using“..\data\pb\2005\pbsi0502.dta”

Appedinx: List of common syntax and its content in SPSS

ADD FILES
ADD FILES FILE={'savfile'|'dataset'}
{* }
[/RENAME=(old varnames=new varnames)...]
[/IN=varname]
/FILE=... [/RENAME=...] [/IN=...]
[/BY varlist]
[/MAP]
[/KEEP={ALL** }] [/DROP=varlist]
{varlist}
[/FIRST=varname] [/LAST=varname]
Example
ADD FILES FILE="/data/school1.sav" /FILE="/data/school2.sav".
AUTORECODE
AUTORECODE VARIABLES=varlist
/INTO new varlist
[/BLANK={VALID**}
{MISSING}
[/GROUP]
[/APPLY TEMPLATE=’filespec’]
[/SAVE TEMPLATE=’filespec’]
[/DESCENDING]
[/PRINT]
COMPUTE
COMPUTE target variable=expression
Example
COMPUTE newvar1=var1+var2.
COMPUTE newvar2=RND(MEAN(var1 to var4).
COMPUTE logicalVar=(var1>5).
STRING newString (A10).
COMPUTE newString=CONCAT((RTRIM(stringVar1), stringVar2).
Functions and operators available for COMPUTE are described in Transformation Expressions.
•Arithmetic Operations
•Numeric Constants
•Numeric Functions
•Arithmetic Functions
•Statistical Functions
•Random Variable and Distribution Functions
•String Functions
•String/Numeric Conversion Functions
•Logical Functions
•Relational Operators
•NOT Logical Operator
•AND and OR Logical Operators
•LAG Function
•Missing Value Functions
•Scoring Expressions (PASW Statistics Server)
•Date and Time Functions
EXPORT
EXPORT OUTFILE='file'
[/TYPE={COMM**}]
{TAPE }
[/UNSELECTED=[{RETAIN}]
{DELETE}
[/KEEP={ALL** }] [/DROP=varlist]
{varlist}
[/RENAME=(old varnames=new varnames)...]
[/MAP]
[/DIGITS=n]
Example
EXPORT OUTFILE="/data/newdata.por"
/RENAME=(V1 TO V3=ID, SEX, AGE) /MAP.
FORMATS
FORMATS varlist(format) [varlist...]
COMPUTE V3=V1 + V2.
FORMATS V3 (F3.1).
•COMPUTE creates the new numeric variable V3. By default, V3 is assigned an F8.2 format (or the default format specified on SET).
•FORMATS changes both the print and write formats for V3 to F3.1.
FREQUENCIES
FREQUENCIES VARIABLES=varlist [varlist...]
[/FORMAT= [{NOTABLE }] [{AVALUE**}]
{LIMIT(n)} {DVALUE}
{AFREQ }
{DFREQ }
[/MISSING=INCLUDE]
[/BARCHART=[MINIMUM(n)] [MAXIMUM(n)] [{FREQ(n) }]]
{PERCENT(n)}
[/PIECHART=[MINIMUM(n)] [MAXIMUM(n)] [{FREQ }] [{MISSING }]]
{PERCENT} {NOMISSING}
[/HISTOGRAM=[MINIMUM(n)] [MAXIMUM(n)] [{FREQ(n) }] [{NONORMAL}] ]
{NORMAL }
[/GROUPED=varlist [{(width) }]]
{(boundary list)}
[/NTILES=n]
[/PERCENTILES=value list]
[/STATISTICS=[DEFAULT] [MEAN] [STDDEV] [MINIMUM] [MAXIMUM]
[SEMEAN] [VARIANCE] [SKEWNESS] [SESKEW] [RANGE]
[MODE] [KURTOSIS] [SEKURT] [MEDIAN] [SUM] [ALL]
[NONE]]
[/ORDER=[{ANALYSIS}] [{VARIABLE}]
GET DATA
GET DATA
/TYPE = {ODBC }
{OLEDB}
{XLS }
{XLSX }
{XLSM }
{TXT }
/FILE = ’filename’
Subcommands for TYPE = ODBC and OLEDB
/CONNECT=’connection string’
[{/ENCRYPTED }]
{/UNENCRYPTED}
/SQL ’select statement’
[’select statement continued’]
Subcommands for TYPE=ODBC, TYPE=OLEDB, XLS, XLSX, and XLSM
[/ASSUMEDSTRWIDTH={255**}]
{n }
Subcommands for TYPE = XLS, XLSX, and XLSM*
[/SHEET = {INDEX**} {sheet number}]
{NAME } {’sheet name’}
[/CELLRANGE = {RANGE } {’start point:end point’ }]
{FULL**}
[/READNAMES = {on** }]
{off }
Subcommands for TYPE = TXT
[/ARRANGEMENT = {FIXED }]
{DELIMITED**}
[/FIRSTCASE = {n}]
[/DELCASE = {LINE** }]1
{VARIABLES n}
[/FIXCASE = n]
[/IMPORTCASE = {ALL** }]
{FIRST n }
{PERCENT n}
[/DELIMITERS = {"delimiters"}]
[/QUALIFIER = "qualifier"]
VARIABLES subcommand for ARRANGEMENT = DELIMITED
/VARIABLES = varname {format}
VARIABLES subcommand for ARRANGEMENT = FIXED
/VARIABLES varname {startcol - endcol} {format}
{/rec#} varname {startcol - endcol} {format}
MISSING VALUES
MISSING VALUES {varlist}(value list) [[/]{varlist} ...]
{ALL } {ALL }
Keywords for numeric value lists:
LO, LOWEST, HI, HIGHEST, THRU
Example
MISSING VALUES V1 (8,9) V2 V3 (0) V4 ('X') V5 TO V9 (' ').
RENAME VARIABLES
RENAME VARIABLES {(varname=varname) [(varname ...)]}
{(varnames=varnames) }
Example
RENAME VARIABLES (JOBCAT=TITLE).
RECODE
For numeric variables:
RECODE varlist (value list=value)...(value list=value) [INTO varlist]
[/varlist...]
Input keywords:
LO, LOWEST, HI, HIGHEST, THRU, MISSING, SYSMIS, ELSE
Output keywords:
COPY, SYSMIS
For string variables:
RECODE varlist [('string',['string'...]='string')][INTO varlist]
[/varlist...]
Input keywords:
CONVERT, ELSE
Output keyword:
COPY
Examples
RECODE V1 TO V3 (0=1) (1=0) (2,3=-1) (9=9) (ELSE=SYSMIS).
RECODE STRNGVAR ('A','B','C'='A')('D','E','F'='B')(ELSE=' ').
SAVE MODEL
SAVE MODEL OUTFILE='filename'
[/KEEP={ALL** }]
{model names}
{procedures }
[/DROP={model names}]
{procedures }
[/TYPE={MODEL**}]
{COMMAND}
VALUE LABELS
VALUE LABELS varlist value 'label' value 'label'... [/varlist...]
[/datevarlist ’value’ ’label’...]
Example
VALUE LABELS JOBGRADE 'P' 'Parttime Employee' 'C' 'Customer Support'.
VARIABLE LABELS
VARIABLE LABELS varname 'label' [/varname...]
Example
VARIABLE LABELS YRHIRED 'YEAR OF FIRST HIRING'.
WEIGHT
WEIGHT {BY varname}
{OFF }
WEIGHT BY V1.

Stran 1 od 18