SAS Users Guide

to accompany

Statistics: Unlocking the Power of Data

by Lock, Lock, Lock, Lock, and Lock

Getting Started

Statistical Analysis System (SAS) is text-based statistical software, as opposed to point-and-click software. Throughout this users guide, written code is provided for most applications, along with pointers to more information (e.g. links to documents at the SAS support website) and options for the procedures. Text such as DataName, VarName, Yvar, and Xvar marks a location that must be replaced with a specific data name or variable name, and is written in italics.

To enter data:

Remember that each column is a different variable and the rows are the cases.

  1. If your data already exists in some format, such as Excel, you can import it into SAS by selecting

File → Import

This will open the import wizard. The following steps outline using the import wizard to enter data:

     1. Select a data source from the drop-down list and click Next

Example: For Excel, select Microsoft Excel Workbook

     2. Browse for the location of your file; once it is selected, click OK
     3. Select the appropriate worksheet and click Next
     4. Name the dataset in the Member: box and click Finish

  2. If you are typing your data in yourself, use a data statement:

The Data Statement Guide

Example:

data DataName;

input VarName VarName2;

cards;

1 2

3 4

5 6

;

run;

Warning: SAS has quantitative (numeric) variables, which can contain only numbers, and categorical (character) variables, which can contain anything. If a column being read in has anything other than a number, it is treated as categorical; this includes things like dollar signs and units. To enter a categorical variable in the data statement, place a $ after the variable name in the input line (e.g. VarName $).

If you enter something other than a number in a quantitative variable column by mistake SAS will give you the error:

Invalid data for VarName in line #
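As an illustration of the $ rule, the sketch below reads one categorical and one quantitative variable. The data table name Pets, the variable names, and the values are made up for this example:

data Pets;

input Species $ Weight; /* the $ marks Species as categorical */

cards;

Dog 30

Cat 10

Dog 22

;

run;

Leaving out the $ here would make SAS try to read Dog and Cat as numbers, which produces the invalid data message shown above.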

Using SAS in Chapter 2

Note: For most tasks in SAS there are multiple approaches; we present at least one option for each. This manual attempts to present the easiest approach, which may not always be the best.

Categorical Variables

Tables for categorical variables:

For most tasks involving categorical variables we use the frequency procedure. Creating a frequency table for one categorical variable:

proc freq data = DataName;

table VarName;

run;

This provides you with both the count and the percent for each category.

For a relationship between categorical variables:

proc freq data = DataName;

table VarName*VarName2;

run;

This provides the count, percent, row percent, and column percent for each combination of categories. For more information about the frequency procedure:

The Frequency Procedure Guide

Graphs for categorical variables:

We use the gchart procedure for graphical presentations. For a bar chart:

proc gchart data = DataName;

vbar VarName;

run;

For a pie chart:

proc gchart data = DataName;

pie VarName;

run;

For a relationship between categorical variables, side-by-side bar charts:

proc gchart data = DataName;

vbar VarName / group = VarName2;

run;

We use the gchart procedure often throughout Chapter 2. For more information:

The gchart Procedure Guide

One Quantitative Variable

Statistics for a single quantitative variable:

For statistics and graphs involving a single quantitative variable we use the univariate procedure:

proc univariate data = DataName;

var VarName;

run;

Graphs for a single quantitative variable:

We again use the univariate procedure in order to produce a histogram:

proc univariate data = DataName;

histogram VarName;

run;

Boxplots can also be created using the univariate procedure:

proc univariate data = DataName plot;

var VarName;

run;

For more information:

The Univariate Procedure Guide

One Quantitative Variable by groups in One Categorical Variable

Statistics for a quantitative variable by groups in a categorical variable:

We will again use the univariate procedure here:

proc univariate data = DataName;

by CatVarName;

var QuantVarName;

run;
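A by statement assumes the data are already sorted by the by variable; if they are not, SAS stops with an error saying the data set is not sorted. A minimal sketch of sorting first, where SortedData is a made-up name for the sorted copy:

proc sort data = DataName out = SortedData;

by CatVarName;

run;

proc univariate data = SortedData;

by CatVarName;

var QuantVarName;

run;

Writing the sorted rows to a separate table with out = leaves the original data table in its original order.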

Graphs for a quantitative variable by categories in a categorical variable:

Use the gchart procedure to produce side by side histograms:

proc gchart data = DataName;

vbar QuantVarName / group = CatVarName;

run;

Two Quantitative Variables

Statistics for two quantitative variables:

Correlation: Use the correlation procedure:

proc corr data = DataName;

var VarName VarName2;

run;

This provides some summary statistics for each variable (mean, standard deviation, etc.) as well as the correlation between the two variables, titled “Pearson Correlation Coefficients.”

Linear Regression: Use the regression procedure:

proc reg data = DataName;

model Yvar = Xvar;

run;

The two parameter estimates, y-intercept and slope, are provided in the “Parameter Estimates” section of the output.

For more information on either of these procedures:

The Corr Procedure Guide

The Reg Procedure Guide

Graphs for two quantitative variables:

We can produce a scatterplot by using the gplot procedure:

proc gplot data = DataName;

plot Yvar*Xvar;

run;

Using SAS for Chapters 3 and 4

The current version of SAS (9.2) has no easy procedures for creating bootstrap or randomization distributions. We believe the StatKey tools at lock5stat.com/statkey provide better options for these procedures.

However, for those instructors wishing to do so, we include sample code to create bootstrap distributions and perform randomization tests in SAS.

Chapter 3: Bootstrapping

Creating a set of bootstrap samples:

The code below is one way to generate a set of bootstrap samples (currently set up to generate 1,000):

data bootsamp;

do sampnum = 1 to 1000; /* 1,000 replicates */

do i = 1 to nobs;

x = ceil(ranuni(0) * nobs); /* random row number from 1 to nobs */

set DataName

nobs = nobs

point = x;

output;

end; end; stop;

run;

Finding a statistic for each bootstrap sample:

The following code takes a set of bootstrap samples and finds the mean for each, saved under the data table “bootmeans”:

proc univariate data = bootsamp noprint;

var VarName;

by sampnum;

output out = bootmeans mean = means;

run;

If we wanted to find a different statistic, we would use whatever procedure is appropriate for the statistic of interest (see Chapter 2). For example, for a correlation we would use the corr procedure on the bootstrap samples, but much of the code would look exactly the same.

Finding the confidence interval:

Once we have our set of means we can use the univariate procedure again to check that the distribution is symmetric and bell shaped (histogram), find the standard error for a confidence interval (standard deviation of the means), or find the percentiles for a confidence interval (percentiles provided in the output):

proc univariate data = bootmeans;

var means;

histogram means;

run;
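If the percentile interval is wanted as numbers in a data table rather than read off the printed output, the output statement of the univariate procedure can save chosen percentiles. A minimal sketch, assuming a 95% interval; the table name bootci and the prefix P are illustrative choices, not part of the original guide:

proc univariate data = bootmeans noprint;

var means;

output out = bootci pctlpts = 2.5 97.5 pctlpre = P; /* 2.5th and 97.5th percentiles */

run;

proc print data = bootci;

run;

The saved variables are named from the prefix and the percentile, with the decimal point replaced by an underscore (here P2_5 and P97_5), and together they give the endpoints of the percentile interval.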

Chapter 4: Randomization Tests

Difference in Means:

We can utilize the npar1way procedure to perform a randomization test for a difference in means:

proc npar1way scores = data data = DataName;

class CatVar;

var QuantVar;

exact scores = data / N = 1000;

run;

Anything else:

If we are attempting to do a randomization test for anything other than a difference in means, we need to use sampling methods. Below is one example, using sampling methods to test whether p = 0.50 with a sample size of n = 100.

Create a set of 1,000 proportions assuming the null hypothesis is true:

data NullSamp;

do samp = 1 to 1000;

p = (rand('Binomial',0.5,100))/100;

output;

end;

run;

Plot a histogram of this sample to make sure it is symmetric and bell shaped:

proc univariate data = NullSamp;

histogram p;

run;

Find how many of your proportions are beyond your sample p (for the example we’ll assume the sample proportion = 0.55):

data more;

set NullSamp;

if p >= 0.55;

run;

The p-value is then calculated as the number beyond (the number of rows in the dataset more above) divided by the number created (in this case 1,000). Note: which direction counts as “beyond” is determined by your alternative hypothesis of interest.
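The counting and dividing can also be left to SAS. A minimal sketch, assuming the NullSamp data table created above and a right-tailed alternative; Flagged and beyond are illustrative names. The comparison p >= 0.55 evaluates to 1 when true and 0 when false, so the mean of beyond is the proportion of simulated values at or beyond the sample proportion:

data Flagged;

set NullSamp;

beyond = (p >= 0.55); /* 1 if at or beyond the sample proportion, 0 otherwise */

run;

proc means data = Flagged mean;

var beyond;

run;

For a left-tailed alternative the comparison would be reversed.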

Randomization tests using sampling methods for other parameters (one mean, correlation, etc.) can be conducted with sampling procedures similar to the bootstrapping methods shown for Chapter 3.

Using SAS for Theoretical Distributions in Chapters 5 – 10

Finding Values for Theoretical Distributions

To find a probability or a percentile from a theoretical distribution, we use the prob functions (such as probnorm and probt), which give the probability of being less than the value presented.

Normal Distribution:

We use probnorm(z-value), where z-value denotes the specific z-value of interest.

Example: Find the probability that a standard normal (Z) is less than 1.4, less than -2, and between -2 and 1.4.

data DataName;

pn1 = probnorm(1.4);

pn2 = probnorm(-2);

pn3 = probnorm(1.4)-probnorm(-2);

run;

proc print;

run;

t Distribution:

We use probt(t-value,df), where t-value denotes the specific t-value of interest and df denotes the degrees of freedom.

Example: Find the probability that a t with 10 df is less than 1.4, a t with 9 df is more than 2.3, and a t with 24 df is between -2 and 1.4.

data DataName;

pt1 = probt(1.4,10);

pt2 = 1- probt(2.3,9);

pt3 = probt(1.4,24)-probt(-2,24);

run;

proc print;

run;
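The examples above go from a value to a probability. To go the other way, from a probability to the value with that much probability below it, SAS provides the inverse functions probit (standard normal) and tinv (t distribution). A minimal sketch, assuming the 97.5th percentile is wanted; the data table and variable names are made up:

data Percentiles;

z975 = probit(0.975); /* standard normal value with 0.975 below it */

t975 = tinv(0.975,10); /* t value with 10 df and 0.975 below it */

run;

proc print;

run;

These should come out near 1.96 and 2.228, the familiar multipliers for a 95% interval.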

Using SAS in Chapter 6

Inference for Means: t-Intervals and t-Tests

For both hypothesis tests and confidence intervals involving means we will be using the ttest procedure.

Confidence Intervals:

A confidence interval for one mean with confidence 1 - alpha:

proc ttest data = DataName alpha = 0.05;

var QuantVar;

run;

Confidence interval for two means with confidence 1 - alpha:

proc ttest data = DataName alpha = 0.05;

var QuantVar;

class CatVar;

run;

This gives the confidence interval for both samples individually as well as the difference.

Hypothesis Tests:

A hypothesis test for one mean with a specific null hypothesis (example: H0: μ = 3):

proc ttest data = DataName H0 = 3;

var QuantVar;

run;

A hypothesis test for difference in two means:

proc ttest data = DataName H0 = 0;

var QuantVar;

class CatVar;

run;

The output of the t-test for two means includes a “Pooled” row, which assumes equal variances, and a “Satterthwaite” row, which does not. The Satterthwaite t-statistic should match the t-statistic from the text, but with different degrees of freedom.

For more information on the ttest procedure:

The ttest Procedure Guide

Inference for Proportions: Intervals and Tests

Hypothesis Test and Confidence Interval for One Proportion:

The code for a hypothesis test and confidence interval for a single proportion using freq:

proc freq data = DataName;

table VarName / binomial(Wald level = 2 P = 0.50) alpha = 0.05;

run;

The level = … determines which category of the categorical variable you wish to run inference on (1 or 2), and the P = … is the p0 value under the null hypothesis.

Hypothesis Test and Confidence Interval for a Difference in Proportions:

We use the genmod procedure for a hypothesis test and confidence interval with a difference in proportions:

proc genmod data = DataName;

class CatVar;

model Yvar = CatVar;

run;

This provides a large amount of output; the important section is the “Analysis of Maximum Likelihood Parameter Estimates”. The second row of this output provides the estimated difference in proportions, a standard error, 95% confidence limits, and a p-value.

For more information on the genmod procedure:

The genmod Procedure Guide

Using SAS in Chapter 7

Chi-Square Tests

Chi-Square Goodness-of-Fit Test for a Single Categorical Variable

We again use the freq procedure to perform a chi-square test:

proc freq data = DataName;

tables Varname / chisq testp = (0.5 0.3 0.2);

run;

The values in parentheses following testp= are the expected proportions under the null hypothesis. Removing the testp= option will perform a chi-square test for equally likely categories.

Chi-Square Test for Association for Two Categorical Variables

To use the freq procedure for a chi-square test for association between two categorical variables:

proc freq data = DataName;

tables Varname*VarName2 / chisq;

run;

Alternatively, one could calculate the chi-square test statistic by hand for either test, and compare the value to the chi-square distribution with appropriate degrees of freedom:

data ChiSquare;

px = 1 - CDF('Chisquare',X-value,df); /* the value comes first, then the degrees of freedom */

run;

proc print;

run;

This gives you the probability of being greater than a chi-square value (X-value) for a given degrees of freedom (df).
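As a concrete instance, the sketch below finds the probability of a chi-square value greater than 5.99 with 2 degrees of freedom; these numbers are chosen only for illustration. Note that the CDF function takes the chi-square value first and the degrees of freedom second, as does the older probchi function, which is shown alongside as a check:

data ChiSquareExample;

px = 1 - CDF('Chisquare',5.99,2); /* upper-tail probability */

px2 = 1 - probchi(5.99,2); /* same calculation with probchi */

run;

proc print;

run;

Both values should be about 0.05, since 5.99 is the usual 5% cutoff for a chi-square with 2 degrees of freedom.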

For more information on conducting a chi-square test in the freq procedure:

The freq Procedure: Chi-square Tests and Statistics Guide

Using SAS in Chapter 8

Analysis of Variance for Difference in Means

An ANOVA to compare means is performed using the GLM procedure:

proc glm data = DataName;

class FactorVar;

model Yvar = FactorVar;

lsmeans FactorVar / stderr CL pdiff;

run;

The first page of output gives you the analysis of variance table with degrees of freedom, sums of squares, mean squares, F statistic, and p-value. The second page gives you summary statistics for the specific groups, 95% confidence limits for each mean and each difference in means, and a matrix of p-values for pairwise comparisons of means.

For more information on the GLM procedure:

The GLM Procedure Guide

Using SAS in Chapters 9 and 10

Correlation

The code for correlation here is the same as the code in Chapter 2. This provides summary statistics for each variable, the correlation, and a p-value for testing the correlation.

proc corr data = DataName;

var VarName VarName2;

run;

Linear Regression

The code for a simple or multiple linear regression is also the same as in Chapter 2. More variables can be added to the model statement for a multiple regression:

proc reg data = DataName;

model Yvar = Xvar1 Xvar2;

run;

Within the output you will see:

  • The ANOVA for regression table, including the p-value for the ANOVA test
  • The estimate, standard error, test statistic (t Value), and p-value (Pr > |t|) for the slope(s) and intercept
  • The value of R-squared
  • Miscellaneous other pieces of information

See Chapters 9 and 10 for an explanation of all output.