Exploratory Analysis of Survey Data

Ian Duling, AstraZeneca LP, Wilmington DE

Valuable information can be derived from sample survey data that is collected on a sample of observations, which are selected, from the population of interest using a probability-based sample design. The complex multistage probability sample design used in a survey like the National Health and Nutrition Examination Survey (NHANES)[1] improves the precision and controls costs of survey data collection, but makes analysis more complex in order to obtain unbiased estimates.

Understanding the design of the questionnaires and the flow of data collection based on conditional responses to initial interview questions can be challenging. How this conditional logic influences the structure of the resulting datasets has a direct impact of the ease of identification, extraction and unbiased interpretation of responses to questions. Statistical inference to the entire population, requires the use of sample weights due to the differential probabilities of selection., i.e. the oversampling of certain subsets of the population.

This discussion will relate specific examples of the author’s use of SAS® software to identify correlated variables within survey data and the generalization of population characteristics.

The task undertaken in this example is to identify the overlap between NHANES participants responding YES to “Has a doctor or other health professional ever told you that you had arthritis?” and those responding YES to “During the past 12 months, have you had pain, aching, stiffness or swelling in or around a joint?

  • From the Medical Conditions Section of the Sample Person Questionnaire (MCQ):

MCQ.160
Has a doctor or other health
professional ever told {you/SP}
that {you/s/he} . . . / MCQ.190
Which type of arthritis was it?
a. had arthritis?
YES ...... 1
NO ...... 2 (b)
REFUSED . . . . . 7 (b)
DON'T KNOW . . 9 (b) / RHEUMATOID ARTHRITIS . . . . 1
OSTEOARTHRITIS ...... 2
OTHER ...... 3
REFUSED ...... 7
DON'T KNOW ...... 9
  • From the Miscellaneous Pain Section of Sample Person Questionnaire (MPQ):

MPQ.010 During the past 12 months, {have you/has SP} had pain, aching, stiffness or swelling in or around a joint?

[Do not include neck pain.]

YES ...... 1

NO ...... 2 (MPQ.060)

REFUSED ...... 7 (MPQ.060)

DON'T KNOW ...... 9 (MPQ.060)

MPQ.020 Were these symptoms present on most days for at least 1 month?

YES ...... 1

NO ...... 2

REFUSED ...... 7

DON'T KNOW ...... 9

MPQ.030 Did these symptoms begin only because of an injury?

YES ...... 1

NO ...... 2 (MPQ.050)

REFUSED ...... 7 (MPQ.050)

DON'T KNOW ...... 9 (MPQ.050)

MPQ.050 Please look at this card and give me the joints that were affected.

CODE ALL THAT APPLY. HAND CARD MPQ1

SHOULDER - RIGHT ...... 10

SHOULDER - LEFT ...... 11

ELBOW - RIGHT ...... 12

ELBOW - LEFT ...... 13

HIP - RIGHT ...... 14

HIP - LEFT ...... 15

WRIST - RIGHT ...... 16

WRIST - LEFT ...... 17

KNEE - RIGHT ...... 18

KNEE - LEFT ...... 19

ANKLE - RIGHT ...... 20

ANKLE - LEFT ...... 21

TOES - RIGHT ...... 22

TOES - LEFT ...... 23

FINGERS/THUMB - RIGHT ...... 24

FINGERS/THUMB - LEFT ...... 25

OTHER (SPECIFY) ...... 26

REFUSED ...... 77

DON'T KNOW ...... 99

  • The NHANES codebook provides SAS metadata:

NHANES 99+

Codebook for Data Collection (1999-2000)

  • Create SAS variables for each instance of joint pain by recoding the original values:

Shoulder50 = (MPD050a = 10 or MPD050b = 11);

Refused50 = (MPD050a = 77);

Dontknow50 = (MPD050a = 99);

Elbow50 = (MPD050c = 12 or MPD050d = 13);

Hip50 = (MPD050e = 14 or MPD050f = 15);

Wrist50 = (MPD050g = 16 or MPD050h = 17);

Knee50 = (MPD050i = 18 or MPD050j = 19);

Ankle50 = (MPD050k = 20 or MPD050l = 21);

Toes50 = (MPD050m = 22 or MPD050n = 23);

Fingers50 = (MPD050o = 24 or MPD050p = 25);

Other50 = (MPD050q = 26);

MPQ.060 The following questions are about pain {you/SP} may have experienced in the past 3 months. Please refer to pain that lasted a whole day or more. Do not report aches and pains that were fleeting or minor.

MPQ.120 Regarding {your/SP's} pain problem, which regions are affected?

CODE ALL THAT APPLY HAND CARD MPQ2

HEAD ...... 10

FACE/DENTAL ...... 11

SHOULDER GIRDLE - RIGHT ...... 12

SHOULDER GIRDLE - LEFT ...... 13

UPPER ARM - RIGHT ...... 14

UPPER ARM - LEFT ...... 15

MID-ARM - RIGHT ...... 16

MID-ARM - LEFT ...... 17

LOWER ARM - RIGHT ...... 18

LOWER ARM - LEFT ...... 19

UPPER BACK - RIGHT ...... 20

UPPER BACK - LEFT ...... 21

  • Create SAS variables for each instance of region pain by recoding the original values:

Head120 = (MPQ120a = 10);

Refused120 =(MPQ120a = 77);

Dontknow120 = (MPQ120a = 99);

Facedental120 = (MPQ120b = 11);

Shoulder120 = (MPQ120c = 12 or MPQ120d = 13);

Arm120 = (MPQ120e = 14 or MPQ120f = 15 or MPQ120g = 16 or MPQ120h = 17

or MPQ120i = 18 or MPQ120j = 19);

Upperback120 = (MPQ120k = 20 or MPQ120l = 21);

Lowerback120 = (MPQ120m = 22 or MPQ120n = 23);

Buttocks120 = (MPQ120o = 24 or MPQ120p = 25);

Leg120 = (MPQ120q = 26 or MPQ120r = 27 or MPQ120s = 28 or MPQ120t = 29

or MPQ120u = 30 or MPQ120v = 31);

Neck120 = (MPQ120w = 32);

Sternum120 = (MPQ120x = 33);

Chest120 = (MPQ120y = 34 or MPQ120z = 35);

Abdomen120 = (MPQ120aa = 36);

Spine120 = (MPQ120ab = 37);

Hand120 = (MPQ120ac = 38 or MPQ120ad = 39);

Foot120 = (MPQ120ae = 40 or MPQ120af = 41);

  • Generate a frequency table for every possible combination of joint pain and chronic region pain values:

%macro wtfreqs(var,table);

proc freq data=anlys_1 noprint;

table &var*Head120/ out=&table.1;

table &var*Refused120 / out=&table.2;

table &var*Dontknow120 / out=&table.3;

table &var*Facedental120 / out=&table.4;

table &var*Shoulder120 / out=&table.5;

table &var*Arm120/ out=&table.6;

table &var*Upperback120 / out=&table.7;

table &var*Lowerback120 / out=&table.8;

table &var*Buttocks120 / out=&table.9;

table &var*Leg120 / out=&table.10;

table &var*Neck120 / out=&table.11;

table &var*Sternum120 / out=&table.12;

table &var*Chest120 / out=&table.13;

table &var*Abdomen120 / out=&table.14;

table &var*Spine120 / out=&table.15;

table &var*Hand120 / out=&table.16;

table &var*Foot120 / out=&table.17;

weight wtint2yr;

run;

%mend wtfreqs;

From the NHANES ANALYTIC AND REPORTING GUIDELINES [2]regarding weighting of sample data:

“NHANES is based on a complex multistage probability sample design. Several aspects of the NHANES design must be taken into account in data analysis, including the sampling weights and the complex survey design. Appropriate sampling weights are needed to estimate prevalence, means, medians, and other statistics. Sampling weights are used to produce correct population estimates because each sample person does not have an equal probability of selection. The sampling weights incorporate the differential probabilities of selection and include adjustments for noncoverage and nonresponse. Although initial exploratory analyses may be performed on unweighted data with standard statistical packages assuming simple random sampling, final analyses should be done on weighted data using appropriate sampling weights.”

The SAS FREQ procedure weight statement treats observations as if they appear multiple times in the input data set. The syntax is as follows, WEIGHT variable, where variable specifies a numeric variable whose value represents the frequency of the observation[3]. The FREQ procedure is unique in its application of sampling weights using the weight statement. Other SAS procedures such as the REPORT procedure apply weighting using the FREQ statement.

%wtfreqs(Shoulder50,a)

%wtfreqs(Refused50,b)

%wtfreqs(Dontknow50,c)

%wtfreqs(Elbow50,d)

%wtfreqs(Hip50,e)

%wtfreqs(Wrist50,f)

%wtfreqs(Knee50,g)

%wtfreqs(Ankle50,h)

%wtfreqs(Toes50,i)

%wtfreqs(Fingers50,j)

%wtfreqs(Other50,k)

run;

  • Create a comprehensive dataset representing only those combinations of joint pain and chronic region pain values that include non-missing observations from both pain questions

%macro dsfreqs(ds,table,var);

data &ds;

set &table.1 (where=(&var and Head120))

&table.2 (where=(&var and Refused120))

&table.3 (where=(&var and Dontknow120))

&table.4 (where=(&var and Facedental120))

&table.5 (where=(&var and Shoulder120))

&table.6 (where=(&var and Arm120))

&table.7 (where=(&var and Upperback120))

&table.8 (where=(&var and Lowerback120))

&table.9 (where=(&var and Buttocks120))

&table.10 (where=(&var and Leg120))

&table.11 (where=(&var and Neck120))

&table.12 (where=(&var and Sternum120))

&table.13 (where=(&var and Chest120))

&table.14 (where=(&var and Abdomen120))

&table.15 (where=(&var and Spine120))

&table.16 (where=(&var and Hand120))

&table.17 (where=(&var and Foot120));

run;

data all (drop=percent);

set a b c d e f g h i j k;

Create two variables (dim1 and dim2), that will represent row and column frequencies of the recoded survey response variables in a cross tabulation

attrib dim1 length=$13 label='Affected Regions' dim2 length=$10 label='Affected Joints';

select;

when (Head120) dim1 = 'Head';

when (Refused120) dim1 = 'Refused';

when (Dontknow120) dim1 = 'Dont know';

when (Facedental120) dim1 = 'Face/dental';

when (Shoulder120) dim1 = 'Shoulder';

when (Arm120) dim1 = 'Arm';

when (Upperback120) dim1 = 'Upper back';

when (Lowerback120) dim1 = 'Lower back';

when (Buttocks120) dim1 = 'Buttocks';

when (Leg120) dim1 = 'Leg';

when (Neck120) dim1 = 'Neck';

when (Sternum120) dim1 = 'Sternum';

when (Chest120) dim1 = 'Chest';

when (Abdomen120) dim1 = 'Abdomen';

when (Spine120) dim1 = 'Spine';

when (Hand120) dim1 = 'Hand';

when (Foot120) dim1 = 'Foot';

otherwise dim1 = '';

end;

select;

when (Shoulder50) dim2 = 'Shoulder';

when (Refused50) dim2 = 'Refused';

when (Dontknow50) dim2 = 'Dont know';

when (Elbow50) dim2 = 'Elbow';

when (Hip50) dim2 = 'Hip';

when (Wrist50) dim2 = 'Wrist';

when (Knee50) dim2 = 'Knee';

when (Ankle50) dim2 = 'Ankle';

when (Toes50) dim2 = 'Toes';

when (Fingers50) dim2 = 'Fingers';

when (Other50) dim2 = 'Other';

otherwise dim2 = '';

end;

run;

%mend dsfreqs;

%dsfreqs(a,a,Shoulder50);

%dsfreqs(b,b,Refused50);

%dsfreqs(c,c,Dontknow50);

%dsfreqs(d,d,Elbow50);

%dsfreqs(e,e,Hip50);

%dsfreqs(f,f,Wrist50);

%dsfreqs(g,g,Knee50);

%dsfreqs(h,h,Ankle50);

%dsfreqs(i,i,Toes50);

%dsfreqs(j,j,Fingers50);

%dsfreqs(k,k,Other50);

  • Generate the final report

procreportdata=all nowindowsmissingheadskip;

column dim1 dim2,count;

define dim1 / groupwidth=13 ;

define dim2 / across;

define count / analysissumformat=COMMA20.0'n';

run;

In the previous example saving weighted output generated by the FREQ procedure to an output dataset and displaying that output using the REPORT procedure provides a means for display of wide integer values. The REPORT procedure can accommodate these values and allows for the control of cell formatting. The default output of cell values in a FREQ table could result in the display of cell values using scientific notation, e.g. numeric values equal to or greater than eight positions wide.

A solution to this problem on the Microsoft Windows Platform is available from SAS support documentation: FAQ # 1786 Is there away to format the statistics in the cross-table produced by PROC FREQ? Answer: Beginning with Release 8.1 of SAS, there is an undocumented FORMAT= TABLES statement option that allows you to specify a format for the statistics. This option was added because there is no crosstabfreq ODS template that could be modified to change the format of the statistics[4].

NHANES 99-00
Cross tabulation of positive responses to survey questions MPQ120 (Regions Affected)
and MPD050 (Joints Affected) for NHANES survey participants who answered Yes to MCQ160
Unweighted
AffectedJoints
Ankle / Elbow / Fingers / Hip / Knee / Other / Shoulder / Toes / Wrist
AffectedRegions / n / n / n / n / n / n / n / n / n
Abdomen / 16 / 11 / 17 / 12 / 22 / 1 / 16 / 4 / 8
Arm / 45 / 48 / 37 / 36 / 59 / 7 / 55 / 20 / 38
Buttocks / 40 / 24 / 32 / 37 / 56 / 9 / 34 / 17 / 26
Chest / 8 / 5 / 7 / 6 / 9 / 2 / 9 / 3 / 4
Dont know / . / . / 1 / . / . / . / . / 1 / .
Face/dental / 13 / 11 / 12 / 8 / 14 / 2 / 14 / 5 / 11
Foot / 53 / 25 / 33 / 26 / 48 / 6 / 39 / 28 / 26
Hand / 37 / 33 / 48 / 28 / 49 / 1 / 42 / 26 / 35
Head / 33 / 23 / 31 / 29 / 45 / 5 / 30 / 13 / 24
Leg / 79 / 44 / 51 / 65 / 138 / 12 / 75 / 26 / 47
Lower back / 73 / 53 / 62 / 66 / 107 / 16 / 75 / 25 / 50
Neck / 44 / 40 / 44 / 41 / 69 / 10 / 58 / 24 / 38
Refused / 1 / 1 / 1 / 1 / 1 / . / 1 / 1 / 1
Shoulder / 54 / 44 / 52 / 50 / 76 / 6 / 103 / 27 / 41
Spine / 25 / 20 / 22 / 25 / 34 / 7 / 34 / 12 / 19
Sternum / 5 / 5 / 6 / 4 / 7 / 1 / 6 / 4 / 4
Upper back / 24 / 22 / 16 / 21 / 28 / 5 / 27 / 7 / 19
NHANES 99-00
Cross tabulation of positive responses to survey questions MPQ120 (Regions Affected)
and MPD050 (Joints Affected) for NHANES survey participants who answered Yes to MCQ160
Full Sample 2 Year Interview Weight
AffectedJoints
Ankle / Elbow / Fingers / Hip / Knee / Other / Shoulder / Toes / Wrist
AffectedRegions / n / n / n / n / n / n / n / n / n
Abdomen / 429,958 / 400,104 / 274,076 / 316,051 / 640,187 / 2,710 / 329,883 / 88,574 / 136,008
Arm / 1,525,207 / 2,056,496 / 1,526,275 / 1,404,824 / 2,561,355 / 194,005 / 2,194,921 / 640,246 / 1,405,588
Buttocks / 1,651,581 / 1,178,071 / 1,403,960 / 1,249,935 / 2,272,391 / 327,929 / 1,690,884 / 753,215 / 1,240,717
Chest / 398,286 / 253,426 / 323,141 / 303,541 / 352,429 / 11,507 / 280,376 / 152,394 / 251,292
Dont know / . / . / 3,498 / . / . / . / . / 3,498 / .
Face/dental / 605,156 / 607,129 / 540,795 / 296,764 / 606,602 / 11,507 / 654,113 / 147,754 / 504,916
Foot / 2,223,468 / 1,082,243 / 1,100,993 / 746,624 / 1,891,356 / 271,638 / 1,742,052 / 812,139 / 1,140,223
Hand / 1,364,416 / 1,365,395 / 1,813,441 / 981,372 / 1,862,868 / 62,508 / 1,512,258 / 995,245 / 1,405,502
Head / 1,419,714 / 1,126,895 / 1,326,280 / 991,503 / 1,761,166 / 143,371 / 1,191,221 / 434,422 / 918,543
Leg / 2,570,419 / 1,843,554 / 1,868,701 / 2,265,130 / 5,050,079 / 335,623 / 2,784,371 / 956,380 / 1,633,988
Lower back / 2,632,807 / 2,301,164 / 2,608,339 / 2,521,009 / 4,193,287 / 455,485 / 3,341,103 / 912,681 / 1,880,029
Neck / 1,886,262 / 1,935,908 / 1,865,593 / 1,683,254 / 2,976,353 / 228,065 / 2,526,951 / 963,832 / 1,611,639
Refused / 26,311 / 26,311 / 26,311 / 26,311 / 26,311 / . / 26,311 / 26,311 / 26,311
Shoulder / 2,157,715 / 2,097,878 / 2,052,722 / 1,917,835 / 3,386,778 / 299,144 / 4,161,594 / 888,175 / 1,549,446
Spine / 1,212,026 / 1,060,117 / 1,028,620 / 1,010,831 / 1,458,014 / 251,270 / 1,656,260 / 563,453 / 1,111,666
Sternum / 230,539 / 315,435 / 303,362 / 188,447 / 327,222 / 72,823 / 318,273 / 147,161 / 190,315
Upper back / 1,146,370 / 1,082,466 / 737,917 / 830,672 / 1,263,165 / 129,676 / 1,464,090 / 287,056 / 943,161

Conclusion

Weighted frequencies are more appropriate when analyzing the potential effect of non-response on survey estimates. Weighted response frequencies provide a more reliable tool for identifying data trends and issues. In this example the weighted survey response rates provide a more statistically significant basis for conclusions taken from sample estimates.

REFERENCES

National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health Service. 2000.

SAS Procedures Guide, SAS Institute Inc. SAS OnlineDoc®, Version 8 February 2000

Copyright ©2000, SAS Institute Inc..

CONTACT INFORMATION

Ian Duling

Astra Zeneca LP

1800 Concord Pike

PO Box 15437

Wilmingotn, DE 19850-5437

1

[1] National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health Service. 2000.

[2] ANALYTIC AND REPORTING GUIDELINES, The Third National Health and Nutrition Examination Survey, NHANES III (1988-94), October, 1996, 2-3

National Center for Health Statistics. Health, United States, 2000. Hyattsville, Maryland: Public Health Service. 2000.

[3] SAS Procedures Guide, SAS Institute Inc. SAS OnlineDoc®, Version 8 February 2000

Copyright ©2000, SAS Institute Inc..

[4]