SOCY7709: Quantitative Data Management

Instructor: Natasha Sarkisian

Advanced Recoding: Working with Numeric Formats, Dates, and String Variables

Numeric Variable Formats

Numbers in Stata can be stored in 5 different types of variables.

There are three integer formats:

·  byte – integers from -127 to 100, ideal for categorical variables

·  int – integers up to about 32,000 (-32,767 to 32,740)

·  long – integers up to about 2 billion (-2,147,483,647 to 2,147,483,620)

And two formats for numbers with fractions:

·  float (the default) – about 7 digits of accuracy (2^24 = 16,777,216 is the largest value up to which every integer can be stored precisely)

·  double – about 16 digits of accuracy

When you create a new numeric variable without specifying a storage type, the new variable is stored as a float, unless you have previously changed the default with the set type command. For example:

. gen hrs40=(hrs1>=40) if hrs1<.

. des hrs40

storage display value

variable name type format label variable label

------

hrs40 float %9.0g

. set type double

. drop hrs40

. gen hrs40=(hrs1>=40) if hrs1<.

(1036 missing values generated)

. des hrs40

storage display value

variable name type format label variable label

------

hrs40 double %10.0g

To set the default back:

. set type float

To create a variable whose storage type differs from the default (float), specify the type in the gen or egen command:

. gen byte hrs40=(hrs1>=40) if hrs1<.

If you declare a variable as an integer type (byte, int, or long) but set it equal to something that in fact contains fractions, the fractional part will be truncated (not rounded, just cut off!). For example:

. sum hrs1

Variable | Obs Mean Std. Dev. Min Max

------+------

hrs1 | 1729 41.77675 14.62304 1 89

. gen hrs1d10=hrs1/10

(1036 missing values generated)

. sum hrs1d10

Variable | Obs Mean Std. Dev. Min Max

------+------

hrs1d10 | 1729 4.177675 1.462304 .1 8.9

. gen byte hrs1d10_b=hrs1d10

(1036 missing values generated)

. sum hrs1d10_b

Variable | Obs Mean Std. Dev. Min Max

------+------

hrs1d10_b | 1729 3.95026 1.485213 0 8
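
The difference between truncating and rounding can be sketched on a tiny made-up dataset rather than the GSS file. The variable names x, x_trunc, and x_round are hypothetical and used only for illustration:

```stata
* Illustrative data only: three values with fractional parts
clear
input float x
0.1
4.5
8.9
end

* Storing fractions in a byte cuts off the fractional part
gen byte x_trunc = x

* Rounding first keeps the nearest integer instead
gen byte x_round = round(x)

list
```

If you want rounded rather than truncated values, apply round() before storing the result in an integer type.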

In most cases, it doesn’t make sense to worry too much about the storage type, except where the default (float) causes an undesirable loss of precision. For example, if your IDs are very large numbers (more than 7 digits) and you store them as the default float, they can be rounded and therefore no longer uniquely identify individuals. Store such IDs as long or double; saving them as a string variable is another safe option.
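
As a rough illustration of the ID problem, the sketch below builds two made-up 9-digit IDs that differ by 1 and stores them both as long and as float. The ID values and the variable names id_long and id_float are assumptions for the example, not part of the GSS data:

```stata
* Two hypothetical 9-digit IDs that differ by 1
clear
set obs 2
gen long id_long = 123456789 + (_n - 1)

* The same IDs stored as float lose their last digits
gen float id_float = id_long

format id_long id_float %12.0f
list

* isid exits with an error if a variable does not uniquely identify observations
isid id_long
capture isid id_float
display "return code for id_float: " _rc
```

Here the two float values collapse into the same stored number, so isid reports a nonzero return code for id_float while id_long passes.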

Float and double can also cause problems if we want to make exact comparisons with fractions: because of the way they are stored (in binary format), the stored value might be a tiny bit off from, say, the 1.3 that is displayed to us. So when dealing with fractions, base your comparisons on intervals rather than exact values. For example:

. tab hrs1d10 if hrs1d10>6 & hrs1d10<7

hrs1d10 | Freq. Percent Cum.

------+------

6.1 | 2 5.00 5.00

6.2 | 6 15.00 20.00

6.3 | 3 7.50 27.50

6.4 | 4 10.00 37.50

6.5 | 18 45.00 82.50

6.6 | 4 10.00 92.50

6.8 | 3 7.50 100.00

------+------

Total | 40 100.00

. list id if hrs1d10==6.1

. list id if hrs1d10==6.2

. list id if hrs1d10==6.3

. list id if hrs1d10==6.4

. list id if hrs1d10==6.5

+------+

| id |

|------|

33. | 33 |

408. | 408 |

453. | 453 |

758. | 758 |

1105. | 1105 |

|------|

1264. | 1264 |

1340. | 1340 |

1414. | 1414 |

1520. | 1520 |

1702. | 1702 |

|------|

1947. | 1947 |

1957. | 1957 |

2096. | 2096 |

2156. | 2156 |

2269. | 2269 |

|------|

2277. | 2277 |

2327. | 2327 |

2743. | 2743 |

+------+

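Only the comparison with 6.5 returns any observations: 6.5 can be represented exactly in binary, whereas 6.1 through 6.4 cannot, so the stored float values do not exactly equal the numbers typed in the condition. This problem never occurs with byte, int, long, or string variables, or with integer values stored as float (up to 2^24). So if you want to use exact conditions, multiply your variable by a power of 10 (here 10; in other cases 100 or 1000) to get rid of decimals.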

. gen hrs1dm=hrs1d10*10

(1036 missing values generated)

. des hrs1dm

storage display value

variable name type format label variable label

------

hrs1dm float %9.0g

. list id if hrs1dm==61

+------+

| id |

|------|

865. | 865 |

2000. | 2000 |

+------+
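
Besides rescaling, Stata's float() function offers another way to make exact comparisons against a float variable, because it rounds the typed constant to float precision before comparing. The sketch below uses a single made-up observation; the variable names x and x10 are hypothetical:

```stata
* One illustrative observation stored as float
clear
set obs 1
gen float x = 6.1

* The constant 6.1 is evaluated in double precision, so this finds nothing
count if x == 6.1

* Rounding the constant to float precision makes the comparison match
count if x == float(6.1)

* An interval comparison also works
count if abs(x - 6.1) < 0.0001

* Rescaling and rounding to an integer allows exact conditions
gen int x10 = round(x*10)
count if x10 == 61
```

Wrapping the rescaled value in round() guards against a product that lands just short of a whole number.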

If your dataset is large, using small storage types like byte can save a lot of memory, but that can be accomplished after all the variables are created, before saving the dataset, using the compress command. It automatically stores variables in smaller types where that is possible without losing precision. It also checks whether strings can be stored as shorter strings.

. compress

emailhr was int now byte

chathr was int now byte

artshr was int now byte

emhrh was int now byte

emhrw was int now byte

wwwhrw was int now byte

emhro was int now byte

wwwhro was int now byte

chldprb was int now byte

chldhlp was int now byte

hrs40 was double now byte

(47,005 bytes saved)

You can also change the storage type of a specific variable using the recast command:

recast type varlist [, force]

where type is byte, int, long, float, double, str1, str2, ..., str2045, or strL. For example:

. recast byte hrs1dm

. des hrs1dm

storage display value

variable name type format label variable label

------

hrs1dm byte %9.0g

. recast byte hrs1d10

hrs1d10: 786 values would be changed; not changed

. sum hrs1d10

Variable | Obs Mean Std. Dev. Min Max

------+------

hrs1d10 | 1729 4.177675 1.462304 .1 8.9

. recast byte hrs1d10, force

hrs1d10: 786 values changed

. sum hrs1d10

Variable | Obs Mean Std. Dev. Min Max

------+------

hrs1d10 | 1729 3.95026 1.485213 0 8

Note that force makes recast unsafe: variables get the new storage type even if that causes a loss of precision, the introduction of missing values, or, for string variables, the truncation of strings.

Display Formats for Numeric Variables

We already saw that formatting date variables helps Stata understand that we specified dates and display them correctly. You can also modify the display format of numeric variables, again using the format command:

format varlist %fmt

Here are the variable formats for numeric variables (from help format):

Numerical
%fmt        Description     Example
------------------------------------
right-justified
  %#.#g     general         %9.0g
  %#.#f     fixed           %9.2f
  %#.#e     exponential     %10.7e
  %21x      hexadecimal     %21x
  %16H      binary, hilo    %16H
  %16L      binary, lohi    %16L
  %8H       binary, hilo    %8H
  %8L       binary, lohi    %8L

right-justified with commas
  %#.#gc    general         %9.0gc
  %#.#fc    fixed           %9.2fc

right-justified with leading zeros
  %0#.#f    fixed           %09.2f

left-justified
  %-#.#g    general         %-9.0g
  %-#.#f    fixed           %-9.2f
  %-#.#e    exponential     %-10.7e

left-justified with commas
  %-#.#gc   general         %-9.0gc
  %-#.#fc   fixed           %-9.2fc
------------------------------------
You may substitute comma (,) for period (.) in any of the above formats to make comma the decimal point. In %9,2fc, 1000.03 is 1.000,03. Or you can use “set dp comma.”

The %g format is usually used as %width.0g, with 0 decimal places specified. In fact, this means that the format decides how many digits to display to the right of the decimal point depending on how many digits there are in total, whereas in %f the number of digits after the decimal point is fixed by the format. Also, %g will switch to a %e (exponential) display if the number is too large or too small, while %f does not do that.

. des spsei

storage display value

variable name type format label variable label

------

spsei float %3.2f spsei r's spouse's socioeconomic index

. list spsei in 7/8

+------+

| spsei |

|------|

7. | 64.1 |

8. | 29.2 |

+------+

. format spsei %3.2f

. list spsei in 7/8

+------+

| spsei |

|------|

7. | 64.10 |

8. | 29.20 |

+------+

. format spsei %09.2f

. list spsei in 7/8

+------+

| spsei |

|------|

7. | 000064.10 |

8. | 000029.20 |

+------+

. format spsei %3.2e

. list spsei in 7/8

+------+

| spsei |

|------|

7. | 6.4e+01 |

8. | 2.9e+01 |

+------+
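
The same formats can be tried out without changing any variable by passing them to the display command. This is a small sketch with made-up numbers, not output from the GSS data:

```stata
* General format: shows only as many decimals as needed
display %9.0g 64.1

* Fixed format with two decimals
display %9.2f 64.1

* Fixed format padded with leading zeros to a width of 9
display %09.2f 64.1

* Fixed format with a comma as the thousands separator
display %9.2fc 1000.03

* A number too wide for %9.0g is shown in exponential notation
display %9.0g 123456789012
```

Experimenting with display first is a cheap way to pick a format before applying it to a variable with the format command.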

The default formats are:

byte %8.0g

int %8.0g

long %12.0g

float %9.0g

double %10.0g

You can also change the default format for displaying all coefficients using the set cformat command. For example, to show only 2 decimal places, we can use the following command prior to running our regression models:

set cformat %9.2f
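
A brief sketch of how this might be used, assuming Stata's bundled auto dataset (loaded with sysuse) as a stand-in for your own data. The regression itself is arbitrary and only illustrates the display:

```stata
* Example data shipped with Stata; an assumption for illustration only
sysuse auto, clear

* Session-wide setting: coefficients in later output use two decimals
set cformat %9.2f
regress mpg weight

* Alternative: the cformat() display option affects only this one command
regress mpg weight, cformat(%9.3f)
```

The per-command option is convenient when you want a different number of decimals for a single table without altering the session default.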

Dealing with Date Variables

Stata wants dates stored as the number of units since January 1, 1960; the units can be milliseconds, days, weeks, months, quarters, half-years, or years. So if we want to be able to use date procedures in Stata (e.g., calculate the number of months between some events), we should store date variables in Stata format. Coding and interpretation of date and time values in Stata are as follows:

+------------------------------------------------------------------+
|        |            |      Numerical value & interpretation      |
| Format | Meaning    | Value = -1   | Value = 0    | Value = 1    |
|--------+------------+--------------+--------------+--------------|
| %tc    | clock      | 31dec1959    | 01jan1960    | 01jan1960    |
|        |            | 23:59:59.999 | 00:00:00.000 | 00:00:00.001 |
| %td    | days       | 31dec1959    | 01jan1960    | 02jan1960    |
| %tw    | weeks      | 1959w52      | 1960w1       | 1960w2       |
| %tm    | months     | 1959m12      | 1960m1       | 1960m2       |
| %tq    | quarters   | 1959q4       | 1960q1       | 1960q2       |
| %th    | half-years | 1959h2       | 1960h1       | 1960h2       |
| %tg    | generic    | -1           | 0            | 1            |
| %ty    | year       | 1959         | 1960         | 1961         |
| %tC    | clock      | 31dec1959    | 01jan1960    | 01jan1960    |
|        |            | 23:59:59.999 | 00:00:00.000 | 00:00:00.001 |
+------------------------------------------------------------------+

(Note: %tC with capital C includes leap seconds).
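
The coding in the table can be checked directly with display, which accepts a date format in front of a number. The lines below are a small sketch using the values from the table plus one 2002 date:

```stata
* Daily dates count days since 01jan1960
display mdy(1, 1, 1960)
display mdy(1, 2, 1960)

* The same numbers shown through date formats
display %td 0
display %td -1
display %tm 1
display %tq -1

* A date in 2002 is simply a larger day count
display mdy(2, 6, 2002)
display %td mdy(2, 6, 2002)
```

The stored value is always a plain number; the %t format only controls how that number is displayed.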

We will work with the interview date variable in GSS 2002 as an example.

·  DATEINTV

·  Date of interview

Survey Question: Date of interview.

Range of Valid Numeric Responses
Minimum value=1 Maximum value=9998

Response Categories
Category / Label / Frequency
0 / Not applicable / 0
9999 / Not available / 18

Column: 1276 Width: 4 Type: numeric
Text: REMARKS: This variable consists of the month (Cols. 5734-5735) and date (Cols. 5736-5737) on which the interview was conducted. Collapsed information by month is listed above for convenience of display only.

. sum dateintv

Variable | Obs Mean Std. Dev. Min Max

------+------

dateintv | 2747 383.1736 120.219 206 626

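One way to manage this would be to split the original variable into day and month and use a numeric importing function. To split it, it might be easier to work with it as a string, so we convert the original variable into a string using the tostring command. Because all interviews took place in months 2 through 6 (the values range from 206 to 626), the month is always the first character and the day is the next two: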

. tostring dateintv, gen(datestr2)

datestr2 generated as str3

. gen month=substr(datestr2, 1, 1)

. gen day=substr(datestr2, 2, 2)

(18 missing values generated)

. tab month

month | Freq. Percent Cum.

------+------

. | 18 0.65 0.65

2 | 557 20.14 20.80

3 | 745 26.94 47.74

4 | 703 25.42 73.16

5 | 526 19.02 92.19

6 | 216 7.81 100.00

------+------

Total | 2,765 100.00

. tab day, m

day | Freq. Percent Cum.

------+------

| 18 0.65 0.65

01 | 79 2.86 3.51

02 | 77 2.78 6.29

03 | 60 2.17 8.46

04 | 79 2.86 11.32

05 | 55 1.99 13.31

06 | 95 3.44 16.75

07 | 87 3.15 19.89

08 | 90 3.25 23.15

09 | 80 2.89 26.04

10 | 83 3.00 29.04

11 | 122 4.41 33.45

12 | 101 3.65 37.11

13 | 134 4.85 41.95

14 | 86 3.11 45.06

15 | 103 3.73 48.79

16 | 94 3.40 52.19

17 | 65 2.35 54.54

18 | 104 3.76 58.30

19 | 97 3.51 61.81

20 | 101 3.65 65.46

21 | 99 3.58 69.04

22 | 120 4.34 73.38

23 | 110 3.98 77.36

24 | 83 3.00 80.36

25 | 119 4.30 84.67

26 | 92 3.33 87.99

27 | 84 3.04 91.03

28 | 110 3.98 95.01

29 | 72 2.60 97.61

30 | 54 1.95 99.57

31 | 12 0.43 100.00

------+------

Total | 2,765 100.00

Now we convert these back into numbers:

. destring month, replace

month has all characters numeric; replaced as byte

(18 missing values generated)

. destring day, replace

day has all characters numeric; replaced as byte

(18 missing values generated)

. tab month, m

month | Freq. Percent Cum.

------+------

2 | 557 20.14 20.14

3 | 745 26.94 47.09

4 | 703 25.42 72.51

5 | 526 19.02 91.54

6 | 216 7.81 99.35

. | 18 0.65 100.00

------+------

Total | 2,765 100.00

. sum day

Variable | Obs Mean Std. Dev. Min Max

------+------

day | 2747 15.97306 8.259537 1 31
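
The string-based split relies on the month being a single digit. A numeric alternative that would also handle two-digit months is sketched below on a few made-up values of dateintv; the variable names month_n and day_n are hypothetical:

```stata
* Illustrative values in MDD/MMDD form, plus one missing value
clear
input int dateintv
206
315
626
1104
.
end

* Integer division by 100 gives the month; the remainder gives the day
gen byte month_n = floor(dateintv/100)
gen byte day_n = mod(dateintv, 100)

list
```

Missing values of dateintv stay missing in both new variables, so no separate recoding step is needed.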

We need to add the year, but such a variable already exists:

. tab year

gss year |

for this |

respondent | Freq. Percent Cum.

------+------

2002 | 2,765 100.00 100.00

------+------

Total | 2,765 100.00

Now we need to import the date information from these three numeric variables into a single numeric variable that is coded in a way that Stata understands. There are various functions for importing dates from numeric variables; the structure of the command would be, for example:

gen varname= mdyhms(M, D, Y, h, m, s)

where mdyhms is the function you use and M, D, Y, h, m, s in parentheses are replaced with names of variables where information on each component is stored. Here are all possible functions:

Format | Function
-------+---------------------------
%tc    | mdyhms(M, D, Y, h, m, s)
%tc    | dhms(td, h, m, s)
%tc    | hms(h, m, s)
       |
%tC    | Cmdyhms(M, D, Y, h, m, s)
%tC    | Cdhms(td, h, m, s)
%tC    | Chms(h, m, s)
       |
%td    | mdy(M, D, Y)
       |
%tw    | yw(Y, W)
%tm    | ym(Y, M)
%tq    | yq(Y, Q)
%th    | yh(Y, H)
%ty    | Y

So for our example:

. gen intervdate=mdy(month, day, year)

(18 missing values generated)

. sum intervdate

Variable | Obs Mean Std. Dev. Min Max

------+------

intervdate | 2747 15436.14 35.72223 15377 15517
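
The values of intervdate are day counts since January 1, 1960, so they become readable once a %td display format is attached. The sketch below reproduces the earliest and latest interview dates from the summary above on a two-row made-up dataset:

```stata
* Two illustrative interview dates: 6 February and 26 June 2002
clear
input byte month byte day int year
2  6 2002
6 26 2002
end

* Build the Stata daily date and attach a date display format
gen intervdate = mdy(month, day, year)
format intervdate %td
list

* Because dates are numbers, differences are simple arithmetic (in days)
display intervdate[2] - intervdate[1]
```

Changing the display format does not alter the stored numbers, so the variable can still be used in calculations and comparisons.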