MERGE Statement

Merging combines observations from two or more SAS data sets into one SAS data set horizontally.

Syntax:MERGE SAS-data-set-list;

BY variable-list;

One-to-One Merging: If the by variableis not specified, two or more SAS data sets merge to one SAS data set by observation numbers.

Example:

libname cdat '/courses/ddbf9765ba27fe300';

data left_dat;

input week $8.;

cards;

Week1

Week2

Week3

Week4

Week5

;

run;

data right_dat;

input topic $8.;

cards;

Topic 1

Topic 2

Topic 3

Topic 4

Topic 5

;

run;

data all;

merge left_dat right_dat;

run;

procprintdata=all;

run;

The output data set is printed as following:

Obs week topic

1 Week1 Topic 1

2 Week2 Topic 2

3 Week3 Topic 3

4 Week4 Topic 4

5 Week5 Topic 5

Data sets with the same variables

When two or more SAS data sets have the same variables, the values of the last SAS data set in the merge statement will over write the early data set, and get into the output SAS data set. If over writing is not what you want, use RENAME data step option to rename one of the input data sets variable.

Example:

data left_dat;

input ID $3. balance 4.;

cards;

001 102

005 89

002 231

004 147

003 192

;

run;

data right_dat;

input Name $ 1-15 @17 balance 4.;

cards;

John Smith 96

Ted Husion 80

Martha Chen 150

Sandy Lee 100

Paul Leny 192

;

run;

data all;

merge left_dat right_dat;

run;

procprintdata=all;

run;

The following is the output of data set all:

Obs ID balance Name

1 001 96 John Smith

2 002 80 Ted Husion

3 003 150 Martha Chen

4 004 100 Sandy Lee

5 005 192 Paul Leny

Match-Merging:

Merging with a by statement allows match observations according to the values of the by variables. Before the merging, all input data sets must be sorted by the BY variables or KEY variables.

Example:

data left_dat;

input ID $3. balance 4. zip 6.;

cards;

001 102 16431

005 89 46298

002 231 98704

004 147 42316

003 192 44765

007 479 21496

;

run;

data right_dat;

input Name $ 1-15 @17 balance 4. @23 ID $3.;

cards;

Sandy Lee 100 004

Paul Leny 192 003

John Smith 96 001

Ted Husion 80 005

Martha Chen 150 002

Jason Tod 244 006

;

run;

procsortdata=left_dat;

by id;

run;

procsortdata=right_dat;

by id;

run;

data all;

merge left_dat right_dat;

by id;

run;

procprintdata=all; run;

The following is the output of data set all:

Obs ID balance zip Name

1 001 96 1643 John Smith

2 002 150 9870 Martha Chen

3 003 192 4476 Paul Leny

4 004 100 4231 Sandy Lee

5 005 80 4629 Ted Husion

6 006 244 . Jason Tod

7 007 479 2149

IN= option :

In the above example, the observation number 6 and 7 are contributed from one of the two datasets. The IN= option create a variable that can identify whether the data set has contribution to the output.

Example: In the above example, we add another variable ‘source’, and use IN= option to identify the contribution from each input dataset:

data all;

length source $8;

merge left_dat(in=in1) right_dat(in=in2);

by id;

if in1 and in2 then source='Both';

elseif in1 then source='Left';

else source='Right';

run;

procprintdata=all; run;

The Following is the output of data all:

Obs source ID balance zip Name

1 Both 001 96 1643 John Smith

2 Both 002 150 9870 Martha Chen

3 Both 003 192 4476 Paul Leny

4 Both 004 100 4231 Sandy Lee

5 Both 005 80 4629 Ted Husion

6 Right 006 244 . Jason Tod

7 Left 007 479 2149

Example: In the above example, if we want the output data set contains only the observations that have contribution from both input data sets:

data all;

merge left_dat(in=in1) right_dat(in=in2);

by id;

if in1 and in2 ;

run;

ExampleIf we want the output data set contains all observations that have contribution from right_dat input data set:

data all;

merge left_dat(in=in1) right_dat(in=in2);

by id;

if in2 ;

run;

Note: If the purpose of the match merging is one-to-one match merging, the input data set should have no duplicate keys. Therefore, before the merging, the NODUPKEY option might need to be used for proc sort before the merging.

Match-Merging for one-to-many or many-to-one

procsortdata=cdat.sort_data2 out=sort_data;

by type order;

run;

data sum_data(keep=type order sum_bal);

set sort_data;

by type order;

if first.order then sum_bal=bal;

else sum_bal + bal;

if last.order thenoutput;

run;

data final;

merge sort_data sum_data;

by type order;

run;

Important note for one-to-many or many-to-one Merge: If the MANY dataset has the same variable name with the ONE dataset, it has to be renamed or dropped before merge. Otherwise, only the first observation of each by group will be merged in.

Example:

data sort_data2;

set sort_data;

by type order;

if first.type or first.order then sum_bal=bal;

else sum_bal+bal;

run;

data final2;

merge sort_data2 sum_data;

by type order;

run;

Compare with proc SQL method

libname cdat '/courses/ddbf9765ba27fe300';

procsql;

select industry, ticker, price, sum(price) as sumprice,

mean(price) as meanprice, std(price) as stdprice,

price/sum(price) as pctprice

from cdat.stocks2

groupby industry

having mean(price)>50

;

quit;

Note: Proc SQL usually is more convenient in calculation of percentage, group percentage etc. than data steps, such as combined data step and merge, or combined proc summary (or means) and merge.

Data Step Interface

The data step performs analysis within the data step. At the end of the data step, all information will be saved in the output data sets. How do we bring some of information to the outside of the data step, or the next data step without carrying the whole column of the data set?

CALL SYMPUT routine: The SYMPUT routine assigns a value produced in the data step to a macro variable. If the macro variable does not exist, it will be created. If the macro variable does exist, it will be over written. In most cases macro variables created by SYMPUT routine is GLOBAL, i.e. it is available for the whole SAS session.

SYNTAX: CALL SYMPUT(mvname, value);

Note: The SYMPUT routine assigns the value of the macro variable during data setp execution, but the macro variable references resolve during the compilation of a step or global statement used outside a step.Therefore:

  • Macro variable generated by the SYMPUT routine cannot be referenced in the same data step.
  • To reference a value in a global statement following the data step (for example, a TITLE or FOOTNOTE statement), an explicit RUN statement is required.

To use the macro variable, use macro reference &. For example: a=&mvname;

Example:

data stocks;

set cdat.stocks2 end=last;

if ticker='AOL'thencall symput('aol_price',price);

if last thencall symput('nobs', _n_); * Try trim(left(_n_)) here;

run;

%put &aol_price &nobs;7

Title"Stock price for &nobs companies";

procprintdata=stocks;

run;

data check;

ticker='AOL';

price=&aol_price;

put'AOL price is ' price;

put"The price for AOL is &aol_price";

put"The stocks data set has &nobs observations";

run;

What’s happened without RUN:

data stocks;

set cdat.stocks2 (obs=5) end=last;

if ticker='AOL'thencall symput('aol_price',price);

if last thencall symput('nobs', _n_); * ty trim(left(_n_)) here;

* run;

Title"Stock price for &nobs companies";

procprintdata=stocks;

run;

In the above example, we commented out the RUN statement in the data step. Also, we added a data step option (obs=5). The title is: Stock price for the 5 companies.

Why It is not working?

data stocks;

set cdat.stocks2;

if ticker='AOL';

call symput('aol_price',price);

new_price=&aol_price;* using the macro variable in the same data step is wrong;

put'The new price is ' new_price;

run;

Running the above code in a new SAS session, you will get ERROR message for the line:

new_price=&aol_price; Because at compile time, the compiler does not recognize &aol_price as an macro variable.If run the above code in an existing SAS session and the macro variable &aol_price is already exist in the session, change the value for price, you will see that the first run after the change the price is still the old value. Start from the second run, the price is the new value.

Example for using call symput: An efficient way to get number of observations in a SAS data set:

DATA_NULL_;

SET cdat.stocks2 NOBS=obsnumber;

CALL SYMPUT('numobs',trim(left(obsnumber)));

STOP;

RUN;

%put &numobs;

%put Data set stocks has &numobs observations;

Note:%PUT statement put global macro variable’s value into SAS LOG.

Example: Calculate the percent of total:

data random(drop=i);

do i=1to10; * Preparation: Create a random number data;

balance=ceil(ranuni(1)*10);

output;

end;

run;

data_null_;* Calculate the total and create a macro variable;

set random end=eof;

tot+balance;

if eof thencall symput('total',tot);

run;

data cal_pct;* Calculate percent;

set random;

pct=balance/&total;

format pct percent8.2; * Note: percent6.2 is not working, so make it longer to 8.2;

run;

procprintdata=cal_pct;

run;

CALL SYMPUTX() Routine: Assigns a value to a macro variable and removes both leading and trailing blanks.

data_null_;

pi=3.1416;

call symput('pi', pi);

call symputx('pi2', pi);

run;

%put pi=π %put; %put; %put pi2=&pi2;

Directing Output To External Files

FILE statement:FILE statement specifies the current output file for put statement.

Example: The following program use the FILE statement direct the output of the put statement to an external text file.

libname cdat '/courses/ddbf9765ba27fe300';

data_null_;

set cdat.stocks;

file'/home/yihong1/temp/stocks.txt'; * Output go to this text file;

put @1 ticker 5. -R

@8 price dollar8.2 -R; * -R make the output right aliened;

run;

Example: The above can also be written as the following way.

filename outfile '/home/yihong1/temp/stocks.txt';

data_null_;

set cdat.stocks;

file outfile; * Output go to this text file;

put @1 ticker 5. -R

@8 price dollar8.2 -R; * -R make the output right aliened;

run;

Details of file statement will be given in data _null_ / PUT reporting session later.

PROC PRINTTO procedure: The PRINTTO procedure defines destinations for SAS procedure output and for the SAS LOG.

Example: Direct print output to external file.

filename out1 'c:\sas_class\temp\date_time.txt'; * Specify output file;

procprinttoprint=out1 new; * Direct output to file out1. Without new option, append;

run;

procprintdata=cdat.date_time; run;

procprintto; * Reset proc printto to default, usually it is output windows;

run;

Example: Direct bothprint output and log to external files.

filename out1 '/home/yihong1/temp/date_time.txt';

filename log1 '/home/yihong1/temp/date_time.log';

libname cdat '/courses/ddbf9765ba27fe300';

procprinttoprint=out1 log=log1 new;

run;

procprintdata=cdat.date_time;

run;

procprintto;

run;

Miscellaneous

X command:To execute a single host system operating system command.

Syntax: X ‘operating-system-command’;

Example;

x'cd /home/yihong1/temp';

x'mkdir jk';

x'rmdir jk';

The above command requires user manually close the host system window. To automate the process, use NOXWAIT system option:

Options noxwait;

x'cd /home/yihong1/temp';

x'mkdir jk';

x'rmdir jk';

Example: To start an Excel spread sheet:

x'cd c:\sas_class\classdata';

x'stock.xls';

Run Excel macro from SAS:

options noxwait noxsync;

x'cd C:\sas_class\classdata';

x'dow_history.xls';

filename EXCEL DDE'EXCEL|SYSTEM';

data_null_;

file excel;

put'[RUN("macro2")]';

run;

To interrupt a running SAS process: use <Ctrl>-<Break, i.e. hold <Ctrl> key and press <Break> key.

To end the SAS session: Use ENDSAS statement.

Example: Endsas;

Batch submit SAS program

In Windows environment:

"C:\program Files\SAS\sas 9.1\Sas.exe" -sysin c:\temp\aa.sas

In Unix environment: In the directory of your program,

>sas myprog.sas then <Enter>batch run.

> sas myprog.sas & then <Enter>batch background run.

> nohup sas myprog.sas & then <Enter> batch background run and do not hang up.

Frequently Used SAS Procedures

The FREQ procedure

The FREQ procedure produces one-way to n-way frequency crosstabulation tables.

Syntax:

PROC FREQ <options>;

BY variables;

TABLES requests </options>;

WEIGHT variable;

Run;

Example: Without TABLES statement, all variables frequency will be calculated without crosstabulation.

procfreqdata=cdat.sort_data2;

run;

Example: Two one way frequency tables will be generated.

*options date number;

options nodate nonumber;

procfreqdata=cdat.sort_data2;

tables type order;

Title'One way frequency for two variables';

run;

Example: A two way frequency tables will be generated.

options nodate nonumber;

procfreqdata=cdat.sort_data2;

tables type * order;

Title'Two way crosstabulation';

run;

The rules of request for table statement are as following:

Tables a*(b c);tables a*b a*c;

Tables (a b)*(c d); tables a*c b*c a*d b*d;

Tables (a b c)*d;tables a*d b*d c*d;

Tables (a—c);tables a b c;

Tables (a—c)*d;tables a*d b*d c*d

Tables _numeric_;one way freq for all numeric variables.

Tables _Character_;one way freq for all character variables.

Example:

options nodate nonumber;

procfreqdata=cdat.sort_data2;

tables type * order /list;

Title'Two way crosstabulation with list option';

run;

Note: List option prints two-way to n-way tables in a list format rather than as crosstabulation tables.

Example:

procfreqdata=cdat.stocks2;

table industry / missing;

Title'One way frequency table with missing option';

run;

Note: Missing option interprets missing values as non-missing and includes them in calculations of percentage and other statistics.

Example: Direct output to a SAS data set

options nodate nonumber;

procfreqdata=cdat.sort_data2;

tables type * order / out=freq_out;

Title'Two way crosstabulation with out= option';

run;

ODS Basics:

To find out the object name for the output sas dataset, add ods trace on before the proc procedure, then find the name from the log.

odstraceon;

procfreqdata=cdat.sort_data2;

tables type * order;

Title'Two way crosstabulation with out= option';

run;

The following is the sas log output:

Output Added:

------

Name: CrossTabFreqs

Label: Cross-Tabular Freq Table

Template: Base.Freq.CrossTabFreqs

Path: Freq.Table1.CrossTabFreqs

------

NOTE: There were 36 observations read from the data set CDAT.SORT_DATA2.

NOTE: PROCEDURE FREQ used (Total process time):

real time 0.03 seconds

cpu time 0.01 seconds

From the log, we can see that the name for Cross-Tabular Freq Tables is: CrossTabFreqs in the line begins with: Name:

Then, to create the sas output dataset, replace ods trace on with the following statement: odsoutput CrossTabFreqs=cross_freq;

Where the output sas dataset name is cross_freq. You can change this name with any legal sas dataset name.

odsoutput CrossTabFreqs=cross_freq;

procfreqdata=cdat.sort_data2;

tables type * order;

Title'Two way crosstabulation with out= option';

run;

Note: PROC FREQ has many other options and can perform many statistical analysis.

For detail, see: SAS Procedure Guide

The MEANS Procedure

The MEANS procedure produces simple univariate descriptive statistics for numeric variables.

PROC MEANS <option-list> <statistic-keyword-list>;

VAR variable-list;

CLASS variable-list;

FREQ variable;

WEIGHT variable;

ID variable-list;

BY variable-list;

OUTPUTOUT= SAS data-set> <output-statistic-list>

<other options> ;

RUN;

Option-list:

The following is a few frequently used options:

DATA=SAS-data-set

MISSING: Treat missings as valid subgroup values for the CLASS variables.

MAXDEC=: Maximum number of decimal places (0 to 8) in printing results.

NOPRINT: Suppresses printing of all the descriptive statistics. Used when the

Only purpose is to create a SAS data set.

NWAY: Specify that statistics be output for only the observation with the highest

_TYPE_ value (highest level of interaction among CLASS variables).

VAR statement: List variables that statistics want to be calculated. If a VAR statement is not used, all numeric variables in the input data set, except for those listed in BY, CLASS, FREQ, ID, or WEIGHT statements, are analyzed.

CLASS statement: Segmentation variable list. Can be numeric or character, each has a small number of discrete values. Class variables do not need to be sorted. Can be used with BY variables together, each has a separate variable list.

BY statement: BY group variable list. It is similar to the CLASS variables. The main difference is the output format.

FREQ statement: Each observation in the input data set is assumed to represent m observations in the calculation of statistics, where n is the value of the FREQ variable.

ID Statement: Not used often. Self study as needed.

OUTPUT statement: Request output statistics to a new SAS data set.

OUT=SAS-data-set; Example: OUT=cdat.mean_balance;

Output-statistic-list: list of statistics to calculate and the corresponding output

variable names. Example:

mean=mean_bal std=std_bal sum=sum_bal

The available statistics including: N NMISS MIN MAX RANGE

SUM SUMWGT MEAN CSS USS VAR STD STDERR CV

SKEWNESS KURTOSIS T PRT

Example 1: Simple means using class.

libname cdat '/courses/ddbf9765ba27fe300';

procmeansdata=cdat.balance;

class type ;

var bal;

run;

The SAS System

The MEANS Procedure

Analysis Variable : bal

N

type Obs N Mean Std Dev Minimum Maximum

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

A 120 120 497.4250000 279.0705059 15.0000000 996.0000000

B 80 80 468.0875000 292.2642810 8.0000000 980.0000000

C 200 200 535.1750000 284.5463716 4.0000000 996.0000000

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

Example 2: Simple means with by.

procsortdata=cdat.balance out=balance;

by type;

run;

procmeansdata=balance;

by type ;

var bal;

run;

The SAS System

------type=A ------

The MEANS Procedure

Analysis Variable : bal

N Mean Std Dev Minimum Maximum

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

120 497.4250000 279.0705059 15.0000000 996.0000000

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

------type=B ------

Analysis Variable : bal

N Mean Std Dev Minimum Maximum

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

80 468.0875000 292.2642810 8.0000000 980.0000000

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

------type=C ------

Analysis Variable : bal

N Mean Std Dev Minimum Maximum

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

200 535.1750000 284.5463716 4.0000000 996.0000000

ƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒƒ

Example 3: With noprint option and output send to a SAS data set.

procmeansdata=cdat.type_missing noprint;

class type ;

var bal sale;

outputout=type_out;

run;

Note: NOPRINT will cause the result not print to the output window.

Example 4: With missing option and output send to a SAS data set

procmeansdata=cdat.type_missing noprintmissing;

class type ;

var bal sale;

outputout=type_out_missing;

run;

Note: Missing option will cause proc means to treat missing values as valid subgroup values for the class variables.

Example: With NWAY option.

procmeansdata=cdat.type_missing noprintnway;

class type ;

var bal sale;

outputout=type_out;

run;

Note: NWAY option cause the statistics to be output for only the observation with the highest type value.

Example: Specify statistics key words.

procmeansdata=cdat.type_missing noprintnway;

class type ;

var bal sale;

outputout=type_out sum=;

run;

Note: When output variable name is not specified after sum=, variable names in the var statement will be the output variable names, but the value is summed value.