How to Set Up Missing Values

in a SAS Data Set

(commands=missing.sas)

Missing Data in Numeric Variables:

Handling missing data is one of the most important tasks involved in creating and managing data. Sometimes the missing values for numeric data are coded as numbers that are not possible as real data values (e.g. 98 or 99 for variables whose valid codes can only be as large as 5). It is important that these missing values be correctly identified as missing, so SAS will not use them in calculations. If your data has missing value codes that are numeric, they need to be replaced by the SAS missing value codes.

The SAS missing value code for numeric data is a period (.). If your raw data were entered with periods for missing data, they will be correctly read by SAS as missing values for numeric data, and you will not need to do any recoding of missing values.

Reading in Raw Data:

The commands below are used to read in a raw data file create a SAS data set called OWEN.

data owen;

infile "owen.dat";

input fam_num childnum age sex race w_rank income_c height

weight hemo vit_c vit_a head_cir fatfold b_weight

mot_age b_order m_height f_height ;

run;

title “Owen Data. Missing Value Codes Have Not Been Fixed ”;

proc means data=owen;

run;

The output from these commands is shown below. Notice that there are no missing values for any of the variables (n=1006 for each variable) but by looking at the maximum values, it can be readily seen that some of the values are impossible. To be sure that these are actually supposed to be missing value codes, check the documentation for the Owen data set.

Owen Data. Missing Value Codes Have Not Been Fixed

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

fam_num 1006 4525.11 1634.03 2000.00 7569.00

childnum 1006 1.3359841 0.5716672 1.0000000 3.0000000

age 1006 44.0248509 16.6610452 12.0000000 73.0000000

sex 1006 1.4890656 0.5001291 1.0000000 2.0000000

race 1006 1.2823062 0.4503454 1.0000000 2.0000000

w_rank 1006 2.2127237 0.9024440 1.0000000 4.0000000

income_c 1006 1581.31 974.2279710 80.0000000 6250.00

height 1006 103.5159046 64.3384339 70.0000000 999.0000000

weight 1006 21.4941153 75.8424096 8.2400000 999.0000000

hemo 1006 12.4606362 1.1578850 6.2000000 24.1000000

vit_c 1006 1.1302187 0.6599121 0.1000000 3.5000000

vit_a 1006 51.2465209 28.0530567 15.0000000 99.0000000

head_cir 1006 49.7216700 4.6155769 39.0000000 99.0000000

fatfold 1006 5.6780318 10.8109068 2.6000000 99.0000000

b_weight 1006 338.4502982 111.0447134 91.0000000 999.0000000

mot_age 1006 30.9990060 12.4970444 17.0000000 99.0000000

b_order 1006 5.4304175 15.4013836 1.0000000 99.0000000

m_height 1006 185.3499006 132.7438368 122.0000000 999.0000000

f_height 1006 203.5119284 142.1009149 152.0000000 999.0000000

--------------------------------------------------------------------------------

Setting up missing value codes in the Data Step:

SAS missing data codes are set up in the data step. Once the missing value codes are set for a variable, they will be recognized as missing in all later analyses. The following SAS code can be used to set up the missing values in the Owen data set. Note that this can all be done by simply altering the original data step, and does not require a second data step.

data owen;

infile "owen.dat";

input fam_num childnum age sex race w_rank income_c height

weight hemo vit_c vit_a head_cir fatfold b_weight

mot_age b_order m_height f_height ;

/*set up missing value codes*/

if vit_a = 99 then vit_a = .;

if head_cir = 99 then head_cir = .;

if fatfold = 99 then fatfold = .;

if mot_age = 99 then mot_age = .;

if b_order = 99 then b_order = .;

if height = 999 then height = .;

if weight = 999 then weight = .;

if b_weight = 999 then b_weight = .;

if m_height = 999 then m_height = .;

if f_height = 999 then f_height = .;

run;

title "Owen Data. Missing Value Codes Have Been Replaced by . ";

proc means data=owen;

run;

Owen Data. Missing Value Codes Have Been Replaced by .

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

--------------------------------------------------------------------------------

fam_num 1006 4525.11 1634.03 2000.00 7569.00

childnum 1006 1.3359841 0.5716672 1.0000000 3.0000000

age 1006 44.0248509 16.6610452 12.0000000 73.0000000

sex 1006 1.4890656 0.5001291 1.0000000 2.0000000

race 1006 1.2823062 0.4503454 1.0000000 2.0000000

w_rank 1006 2.2127237 0.9024440 1.0000000 4.0000000

income_c 1006 1581.31 974.2279710 80.0000000 6250.00

height 1001 99.0429570 11.4300111 70.0000000 130.0000000

weight 1000 15.6290800 3.6523446 8.2400000 41.0800000

hemo 1006 12.4606362 1.1578850 6.2000000 24.1000000

vit_c 1006 1.1302187 0.6599121 0.1000000 3.5000000

vit_a 763 36.0380079 8.8951237 15.0000000 78.0000000

head_cir 999 49.3763764 2.0739057 39.0000000 56.0000000

fatfold 993 4.4562941 1.6683194 2.6000000 42.0000000

b_weight 986 325.0517241 59.5162936 91.0000000 544.0000000

mot_age 981 29.2660550 6.2603025 17.0000000 51.0000000

b_order 980 2.9479592 2.1939526 1.0000000 16.0000000

m_height 980 163.7632653 6.3663343 122.0000000 199.0000000

f_height 975 178.2194872 7.3821354 152.0000000 210.0000000

-------------------------------------------------------------------------------

Check the output above to see that the sample size (N) for many variables is now less than 1006, and that the maximum values are no longer the missing value codes (99 or 999).

Create a subset of complete cases:

You can also use the nmiss function to select complete cases as a subset of your data:

data complete;

set owen;

if nmiss(of fam_num -- f_height)=0;

run;

title "Complete Data";

proc means data=complete;

run;

The output below shows that only 702 cases had complete data for all variables.

Complete Data

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

-------------------------------------------------------------------------------

fam_num 702 5033.72 1545.09 2000.00 7569.00

childnum 702 1.3133903 0.5538790 1.0000000 3.0000000

age 702 44.8874644 16.4842946 12.0000000 72.0000000

sex 702 1.4928775 0.5003057 1.0000000 2.0000000

race 702 1.2464387 0.4312443 1.0000000 2.0000000

w_rank 702 2.2763533 0.8909014 1.0000000 4.0000000

income_c 702 1639.91 963.5405113 80.0000000 6250.00

height 702 99.5569801 11.3280245 71.0000000 130.0000000

weight 702 15.7570798 3.7269699 8.2400000 41.0800000

hemo 702 12.4481481 1.1131253 6.2000000 15.8000000

vit_c 702 1.1417379 0.6528191 0.1000000 3.5000000

vit_a 702 36.1566952 8.9369932 15.0000000 78.0000000

head_cir 702 49.3931624 2.0945848 39.0000000 56.0000000

fatfold 702 4.4860399 1.8726362 2.6000000 42.0000000

b_weight 702 328.4259259 58.5042721 109.0000000 544.0000000

mot_age 702 29.1424501 6.2377703 17.0000000 51.0000000

b_order 702 2.8219373 2.1461381 1.0000000 16.0000000

m_height 702 163.9330484 6.0725351 147.0000000 188.0000000

f_height 702 178.2236467 7.2359744 152.0000000 210.0000000

-------------------------------------------------------------------------------

Missing Values for Character Variables:

If the value of a character variable is missing, it is stored by SAS as " " (quote-blank-quote). When entering data for character variables in a free format, you need to put a placeholder for the missing value—simply use a dot. This will be interpreted as missing when SAS reads it in. If you are reading character data from a file with the values lined up in columns, simply leave missing values blank in the raw data; this will also be interpreted as missing by SAS. When entering character data into an Excel file, simply leave the cell blank (null) and skip to the next value.

The example below illustrates the use of missing values, such as “NA” or “Unk” for character variables. The second block of SAS code shows how to set these up as the SAS missing value code of “ “.

data test ;

input name $ sex $ age city $;

cards;

Gene M 62 Saginaw

Cyndi F 45 Unk

Alice NA 51 Bay City

Bob M 55 Unk

;

proc print data=test;

title "Test Data Set with Missing Values Still in Data Set";

run;

Test Data Set with Missing Values Still in Data Set

Obs name sex age city

1 Gene M 62 Saginaw

2 Cyndi F 45 Unk

3 Alice NA 51 Bay

4 Bob M 55 Unk

data test ;

input name $ sex $ age city $;

if sex = "NA" then sex=' ';

if city = "Unk" then city = ' ';

cards;

Gene M 62 Saginaw

Cyndi F 45 Unk

Alice NA 51 Bay City

Bob M 55 Unk

;

proc print data=test;

title "Test Data Set with Missing Values Set to ' ' ";

run;

Test Data Set with Missing Values Set to ' '

Obs name sex age city

1 Gene M 62 Saginaw

2 Cyndi F 45

3 Alice 51 Bay

4 Bob M 55

1