The Array Statement Defines a Set of Variables As Elements of an Array

The Array Statement Defines a Set of Variables As Elements of an Array

ArrayArray Array

The array statement defines a set of variables as elements of an array.

The array variables has to be either all numeric or all character.

Syntax:ARRAY array-name{subscript} <$> <length>

<array-elements> <(initial-values>;

Example:

Array v(5) v1-v5; *var1 to var5 are variables in the data set.

Array score(4) score_a score_c score_f score_t;

Array bb(3) (10 20 30) ; * Initialize numeric array;

Array aa(3) $3. (‘AAA’ ‘BBB’ ‘CCC’); * Initialize character array;

Array week(7) $3. ('Sun' 'Mon' 'Tue' 'Wed' 'Thu' 'Fri' 'Sat');

Array wk(7) $3. ('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat');

Example 1: Use array to perform calculations on a group of variables.

*** generate data;

data test_array(drop=i r);

length ID 8;

array mth(12) mth1-mth12;

do ID=1to100;

do i=1to12;

mth(i) = int(ranuni(1)*1000);

r=ranuni(0);

if r<0.1then mth(i)=.;

end;

output;

end;

run;

data test_array;

array mth(12) mth1-mth12;

set test_array;

tax_rate=0.06;

do i=1to12;

if mth(i) ne .then mth(i)=mth(i)+mth(i)*0.06;

else mth(i)=0;

end;

run;

Example 2: Use array to rotate column to row and row to column.

data vert;* Create a column data;

input v;

cards;

1

2

3

4

5

;

run;

data hori(drop=v);* Rotate column to row;

array h(5) h1-h5;

retain h1-h5 0;

set vert end=last;

h(_n_)=v;

if last thenoutput;

run;

data vert2(keep=v);* Rotate row to column;

array h(5) h1-h5;

set hori;

do i=1to5;

v=h(i);

output;

end;

run;

Explicit ARRAY vs. Implicit ARRAY

Explicit ARRAY contain an explicit specification of the number of elements in the array.

Example: Array profile(6) prof1-prof6;

^

Explicit

All arrays we used above are explicit arrays.

Implicit ARRAY does not explicitly specify the number of elements in the array.

Example: Array profile(*) ;

Example of implicit array:

data implicit;

array ar(*) ar1-ar4;

input ar(*);

cards;

11 12 13 14

21 22 23 24

31 32 33 34

41 42 43 44

;

run;

Example: Missing treatment:

libname cc '/courses/ddbf9765ba27fe300';

data Missing_treat(drop=i);

set cc.missing_data;

array Numvar(*) _numeric_;

do i = 1to dim(Numvar);

if Numvar(i)=.then Numvar(i)=0;

end;

run;

Another difference between explicit array and implicit array:

Implicit array allows specify array bounds other than 1 and n.

Example:

DATA aa;

array new(3:6) (2345);

do i=3to6;

put new(i);

end;

run;

Note: In this example, though the array is defined as new(3) to new(6), but in the data set work.aa, the variable names are new1-new4. Therefore, the array statement associate

New(3) with variable new1, new(4) with variable new2, etc.

Concept: Array variable’s program data vector buffer is created at compile time, not at execution time. This is the same for the variables defined by LENGTH statement. Therefore, you cannot define array or any variable at execution time.

Example: Incorrect array and length statement:

data bb;

a=1;

if a=1thenlength new $4;* WRONG WRONG WRONG;

if a=1thenarray ar(4) ar1-ar4;* WRONG WRONG WRONG;

run;

Two dimensional and multiple dimensional Array

Example: The following program creates 2X3=6 variables for twod1-twod6, and

2X3X2=12 variables for threed1-threed12.

data bb;

array twod(2,3);

array threed(2,3,2);

run;

How the multidimensional array associated with the one-dimensional variable names

The following SAS code demonstrate the association relationship:

data a;

array v(15) 8;

do i=1to15;

v(i)=i;

end;

run;

data b;

array a(3,5) v1-v15;

array b(5,3) v1-v15;

set a;

do i=1to3;

do j=1to5;

put'i= ' i ' j= ' j ' a(i,j)= ' a(i,j);

end;

end;

do i=1to5;

do j=1to3;

put'i= ' i ' j= ' j ' b(i,j)= ' b(i,j);

end;

end;

run;

/*

Var Name v1 v2 v3 v4 v5 v6 v7 v8 ...

a(3,5) a(1,1) a(1,2) a(1,3) a(1,4) a(1,5) a(2,1) a(2,2) a(2,3) ...

b(5,3) b(1,1) b(1,2) b(1,3) b(2,1) b(2,2) b(2,3) b(3,1) b(3,2) ...

*/

Using ARRAY to perform look up

Example: One dimensional look up

data oned; * generate random number of 1-100;

do i=1to50;

var1=CEIL(ranuni(100)*10);

output;

end;

run;

DATA oned2;

array coef(10) $6. ('1-10''11-20''21-30''31-40''41-50'

'51-60''61-70''71-80''81-90''91-100' );

set oned;

var1_fmt=coef(var1);

run;

Note: The above data step perform an one dimensional look up that requires 10 lines of IF

… THEN conditions, or a PROC FORMAT. Obviously, the array look up is more

efficient.

Disadvantage: If webrowse the above output data set “oned2”, we will find that the array created 10 variables (coef1 – coef10) that we don’t need at all in the output data set. It wasted a lot of space and make the process running slow. To solve this problem, SAS allows user to create temporary array.

Temporary Array

Temporal array does not associate the array element with the array group or variable names. Therefore, the array elements do not appear in the output data set, and it can only be referenced by array name and dimensions. Temporary array elements are automatically retained, i.e. it’s values do not reset to missing at the beginning of each data step iteration.

Examjple: array temar(5) _temporary_ (1 2 3 4 5);

Example: Using temporary array for array look up.

data oned; * generate random number of 1-100;

do i=1to50;

var1=CEIL(ranuni(100)*10);

output;

end;

run;

DATA oned2;

array coef(10) $6._temporary_ ('1-10''11-20''21-30''31-40''41-50'

'51-60''61-70''71-80''81-90''91-100');

set oned;

var1_fmt=coef(var1);

run;

Note: With this temporary array method, the array elements coef1-coef10 will not be appeared in the output data set as variables.

Multidimensional Array

Two dimensional look up

The following is the same example we used in IF … THEN look up:

Suppose there are two variables x and y. For each x and y variable there is a coefficient value as the following 2D table:

Y123

X

1111213

2212223

3313233

The follow program implement this 2D look up table in the SAS code:

data xy;

input x y;

cards;

1 1

1 2

1 3

2 1

2 2

2 3

3 1

3 2

3 3

;

run;

data lookup_xy;

set xy;

if x=1 and y=1then coef=11;

elseif x=1 and y=2then coef=12;

elseif x=1 and y=3then coef=13;

elseif x=2 and y=1then coef=21;

elseif x=2 and y=2then coef=22;

elseif x=2 and y=3then coef=23;

elseif x=3 and y=1then coef=31;

elseif x=3 and y=2then coef=32;

else coef=33;

run;

The following program use the ARRAY lookup method to solve the same problem:

data lookup_xy;

array table2d(3,3) _temporary_ (111213212223313233);

set xy;

coef=table2d(x,y);

run;

procprintdata=lookup_xy;

run;

Advanced Topic:

Array Sort:

Example: The following example sort a random array named randar(10) into a sorted array sortar(10). The ordinal function first sort all arguments (here is the randar(10)), then return the value of the I th argument.

data arraysort;

array randar(10) (41839567210);

array sortar(10);

do i=1to10;

sortar(i)=ordinal(i,of randar1-randar10);

end;

run;

data aa;

array randar(10) (38175924106);

array sortar (10);

array varlink(10) $32. ;

do i=1 to 10;

sortar(i)=ordinal(i,of randar1-randar10);

end;

do i=1 to 10;

do k=1 to 10;

if sortar(i)=randar(k) then varlink(i)='randar'||compress(k);

end;

end;

run;

End Advanced Topic.

An automated array look up example: Array_lookup.sas

PROC SORT

Syntax:PROC SORT <option-list> ;

Example: In the following example, input data is: cdat.sort_data. If data is not specified, the default data will be the most recently created SAS data set. output data is: work.aa. If output data is not specified, the output data set will over write the input data set. The BY statement specify which variable is to use to sort the data. By default the by variable is ascending.

Libname cdat '/courses/ddbf9765ba27fe300';

procsortdata=cdat.sort_data out=aa;

by type;

run;

The following code sort the data by descending order.

procsortdata=cdat.sort_data out=aa;

by descending type;

run;

The following code sort the data with 3 by variables, all in ascending order.

procsortdata=cdat.sort_data out=aa;

by type order bal;

run;

The following code sort the data with 3 by variables, variables TYPE and BAL are in ascending order, variable ORDER is in descending order.

procsortdata=cdat.sort_data out=aa;

by type descending order bal;

run;

Note 1: For proc sort, you can have many by variables. But the number of segment grows vary fast. If you have five by variables, each of the by variables has 10 values, then you will have 10,000 segments.

NODUPKEY option: Eliminate all duplicates for by variables. In the following example, variable TYPE has four values: ‘A’, ‘B’, ‘C’, and ‘D’. Therefore, in the output data, there are only 4 observations.

procsortdata=cdat.sort_data out=aa NODUPKEY;

by type;

run;

In the following example, variable TYPE has 4 values, variable order has 4 values. The output has 3*4=12 observations.

procsortdata=cdat.sort_data out=aa NODUPKEY;

by type order;

run;

Another example:

proc sort data=sashelp.cars out=cars;

by make EngineSize;

run;

proc sort data=cars out=small_engine nodupkey;

by make;

run;

NODUP option: Compare all variable values for each observation to the previous observation. If exact match is found, the current observation will not write to the output data set.

Data aa(drop=I);

Do I=1to10;

A=mod(I,2);

B=mod(I,3);

Output;

End;

Run;

procsortdata=aa ;

by a b;

run;

procsortdata=aa nodup;

by a;

run;

Note: For a data set that has duplicates, if they are not next each other, the duplicate will not be eliminated. In the above example, without the second data step, the duplicate will not be removed.

DUPOUT Option:

Dupout option output the duplicate observations to an output sas dataset:

procsortdata=cdat.sort_data out=nodup dupout=dups NODUPKEY;

by type order;

run;

TAGSORT

Stores only the BY variables and the observation number in temporary files. This is very useful when the disk space is limited. But, the speed of sorting is slower.

Note: There are many other options for the PROC SORT. These information can be find in online document or help information.

FIRST.BYVAR and LAST.BYVAR

Where BYVAR is the name of the by variable.

The BY statement causes the SAS system to create two FLAG variables for each by var.

The first is FIRST.BYVAR. It’s value is 1 for each first observation of the by group, and 0 for other observations. The second is LAST.BYVAR. It’s value is 1 for each last observation of the by group, and 0 for other observations.

Example:

libname cdat '/courses/ddbf9765ba27fe300';

data date_time;

set cdat.date_time;

run;

proc sort data=date_time;

by group;

run;

data date_time;

set date_time;

by group;

first_dot_group=first.group;

last_dot_group =last.group;

run;

proc print data=date_time;

run;

Example:

Data aa(drop=I);

Do I=1to10;

A=int(mod(I,2));

B=int(mod(I,3));

Output;

End;

Run;

procsortdata=aa ;

by b;

run;

data_null_;

set aa;

by b;

put'b= ' b ' first.b=' first.b ' last.b= ' last.b;

run;

options nocenter;

procprintdata=aa;

run;

The following is the output of the PROC PRINT

Obs A BBY variable is B

1 1 0== First.b=1 Last.b=0

2 0 0== First.b=0Last.b=0

3 1 0== Firse.b=0Last.b=1

4 1 1 == First.b=1 Last.b=0

5 0 1== First.b=0Last.b=0

6 1 1== Firse.b=0Last.b=0

7 0 1== Firse.b=0Last.b=1

8 0 2== First.b=1 Last.b=0

9 1 2== Firse.b=0Last.b=0

10 0 2== Firse.b=0Last.b=1

Eample: The following code only keep the first observation for variable b’s values.

procsortdata=aa ;

by b;

run;

data bb;

set aa;

by b;* by variables must be sorted;

if first.b;* variable b here must be by variable;

run;

Note: if first.b; equivalent to: if first.b ne 0;

The following is the output:

Obs A B

1 1 0

2 1 1

3 0 2

Example: The following code finds the observations with duplicate id.

data find_dup;

input id number;

cards;

111 2346

444 4567

222 567

222 9325

333 345

444 1895

222 1037

555 9238

666 9384

777 4522

;

run;

procsortdata=find_dup;

by id;

run;

* Method 1;

data dup_id;

set find_dup;

by id;

if (first.id ne last.id) or (first.id=0 and last.id=0);

run;

* Method 2;

data dup_id2;

set find_dup;

by id;

if not ((first.id=1) and (last.id=1));

run;

RETAINstatement

Retain statement cause the retained variable’s value remain to be the current value and not reset to missing at each iteration of the data step until it’s value is re-assigned.

  • Input variables read with SET, MERGE statement cannot be retained, because their value are reset to missing at each iteration of the data step.
  • Variable whose value is assigned in a sum statement is automatically retained, no need to retain it again.
  • Automatic variable, _n_, error, MSG etc don’t need retain.
  • END= or IN= option in SET and merge statement do not need retain.
  • As a simple rule, only the newly defined variables in the data step need to be retained if you need their vales to be retained.

Example 1: The following SAS code intend to calculate the sum and mean of the variable price. Without retain statement, sum_price and mean_price are all missing.

data stocks;

set cdat.stocks;

sum_price = sum_price + price;

mean_price= sum_price/_n_;

run;

procprintdata=stocks;

run;

The following is the selected output:

Variable_ sum_ mean_

Obs ticker price Name price price

1 ATT 55.25 TECH . .

2 LU 48.80 TECH . .

3 MSFT 67.87 TECH . .

4 PFS 45.90 PHAR . .

5 CPQ 28.60 TECH . .

Example 2: The following SAS code calculate the sum and mean with retain statement.

data stocks;

retain sum_price 0;

set cdat.stocks;

sum_price = sum_price + price;

mean_price= sum_price/_n_;

run;

procprintdata=stocks;

run;

The following is the selected output:

mean_ sum_ Variable_

Obs price price ticker price Name

1 55.2500 55.25 ATT 55.25 TECH

2 52.0250 104.05 LU 48.80 TECH

3 57.3067 171.92 MSFT 67.87 TECH

4 54.4550 217.82 PFS 45.90 PHAR

5 49.2840 246.42 CPQ 28.60 TECH

Note: Calculate sum with this method need to take care of missing value first. Because any number add missing is missing. The following is an example:

data stocks2;

* retain sum_price 0;

set cdat.stocks2;

sum_price + price;

mean_price=sum_price/_n_;

run;

procprintdata=stocks2;

run;

Note: With this method, the missing price s ignored in SUM calculation, but the N is not ignored. So the mean_price is skewed. The best practice is to remove the missing values in the calculation by WHERE statement.

The following is the selected output:

data stocks2;

retain sum_price 0;

set cdat.stocks2;

sum_price = sum_price + price;

mean_price= sum_price/_n_;

run;

procprintdata=stocks2;

run;

The following is the selected output:

mean_ sum_

Obs price price ticker price industry

1 55.2500 55.25 ATT 55.25 TECH

2 52.0250 104.05 LU 48.80 TECH

3 57.3067 171.92 MSFT 67.87 TECH

4 . . PFS . PHAR

5 . . CPQ 28.60 TECH

6 . . MRK 72.43 PHAR

7 . . AHP 67.29 PHAR

Note: Obviously, once the price has a missing value, all sum_price are missing.

To avoid this problem, either use “where price ne .” or use “if price = .

then price = 0” depend on your analysis need.

Another way to calculate sum:

data stocks2;

set cdat.stocks2;

sum_price + price;

mean_price=sum_price/_n_;

run;

procprintdata=stocks2;

run;

In the above code, sum_price + price is similar to:

sum_price= sum_price + price; But sum_price + price will treat missing value as 0. The following is the selected output:

sum_ mean_

Obs ticker price industry price price

1 ATT 55.25 TECH 55.25 55.2500

2 LU 48.80 TECH 104.05 52.0250

3 MSFT 67.87 TECH 171.92 57.3067

4 PFS . PHAR 171.92 42.9800

In the above output, the observation #4, the sum_price=171.92, missing value of price is equivalent to 0 or not counted. But the N is counted, and so the mean_price dropped.

Calculation of group sum

The following code calculate the sum for each order number in two methods. This example demonstrate how to use proc sort, retain, first.byvar, last.byvar to manulate SAS data vertically to achieve analysis purposes.

procsortdata=cdat.sort_data2 out=sort_data;

by type order;

run;

* Using retain method;

data sort_data2;

retain sum_bal 0;

set sort_data;

by type order;

if (first.type) or (first.order) then sum_bal=bal;

else sum_bal = sum_bal + bal;

* if (last.type) or (last.order) then output;

run;

* Not using retain method;

data sort_data3;

set sort_data;

by type order;

if (first.type) or (first.order) then sum_bal=bal;

else sum_bal + bal;

* if (last.type) or (last.order) then output;

run;

Calculation of percentage of group sum:

Example: The following example calculate the percentage of each observation over it’s by group sum.

procsortdata=sort_data2;

by type order descending sum_bal;

run;

data sum_data;

retain sum_bal2;

set sort_data2;

by type order;

if first.order then sum_bal2=sum_bal;

group_pct=bal/sum_bal2;

run;

Rotation of one column to 2D matrix:

data matrix2d(drop=temp coef);

array cf(14) cf1-cf14;

retain cf1-cf14 0;

set cdat.one_column;

temp=mod(_n_,14);

if temp=0then temp=14;

cf(temp)=coef*1;

if temp=14thenoutput;

run;

Rotate multiple columns with by variable horizontally

procsortdata=cdat.one_column_and_byvar out=temp;

by letter order;

run;

data hori(drop=cnt type coef);

array cf(14) cf1-cf14;

retain cf1-cf14 0;

array tp(14) $ tp1-tp14;

retain tp1-tp14 ' ';

retain cnt 0;

set temp;

by letter order;

if first.letter then cnt=1;

else cnt=cnt+1;

tp(cnt)=type;

cf(cnt)=coef;

if last.letter thenoutput;

run;

Concatenating SAS Data Sets

Concatenating combines two or more SAS data sets one after the other into a single data set. The number of observations in the new data set is the sum of the numbers of observations in the original data sets.

1. Using SET statement when two or more SAS data sets have the same variable attributes:

libname cdat '/courses/ddbf9765ba27fe300';

data date_time1 date_time2;* preparation, split the data into two data;

set cdat.date_time;

if _n_<13thenoutput date_time1;

else output date_time2;

run;

data date_time_all;* Simply use SET statement to combine;

set date_time1 date_time2;

run;

2. When data sets contain different variables

data date_time2;* prepare, rename some variables in the second data set;

set date_time2;

rename bdd=bd;

rename bmm=bm;

rename byy=by;

run;

data date_time_all2;* Set two data sets that have different variable names;

set date_time1 date_time2;

run;

The following is the output:

Obs group bdd bmm byy BIRTHDT AGE bd bm by

1 GROUP_B 4 FEB 1948 04FEB48:00:00:00 47 . .

2 GROUP_A 25 APR 1949 25APR49:00:00:00 46 . .

3 GROUP_A 4 MAR 1934 04MAR34:00:00:00 61 . .

4 GROUP_A 8 OCT 1951 08OCT51:00:00:00 44 . .

5 GROUP_B 27 FEB 1942 27FEB42:00:00:00 53 . .

6 GROUP_B 14 JUL 1952 14JUL52:00:00:00 43 . .

7 GROUP_B 19 SEP 1946 19SEP46:00:00:00 49 . .

8 GROUP_A 28 DEC 1942 28DEC42:00:00:00 53 . .

9 GROUP_B 6 JUL 1962 06JUL62:00:00:00 33 . .

10 GROUP_B 7 JUL 1948 07JUL48:00:00:00 47 . .

11 GROUP_B 23 JAN 1957 23JAN57:00:00:00 38 . .

12 GROUP_A 8 SEP 1948 08SEP48:00:00:00 47 . .

13 GROUP_B . . 12OCT53:00:00:00 42 12 OCT 1953

14 GROUP_B . . 09FEB41:00:00:00 55 9 FEB 1941

15 GROUP_B . . 09JAN45:00:00:00 51 9 JAN 1945

16 GROUP_B . . 10DEC38:00:00:00 57 10 DEC 1938

17 GROUP_A . . 20MAY58:00:00:00 37 20 MAY 1958

18 GROUP_B . . 11APR41:00:00:00 54 11 APR 1941

19 GROUP_B . . 28DEC39:00:00:00 56 28 DEC 1939

20 GROUP_B . . 14FEB51:00:00:00 45 14 FEB 1951

3. When two or more SAS data sets contain the same variables but some variables have different types: ERROR, the program will not run. Change the variable types, make all variables the same type, then run the program.

libname cdat '/courses/ddbf9765ba27fe300';* Prepare: Separate to two data sets;

data date_time1 date_time2;

set cdat.date_time;

if _n_<13thenoutput date_time1;

else output date_time2;

run;

data date_time2(rename= (bd_temp=bdd by_temp=byy)); * Prepare: Change variable types;

length bd_temp $2 by_temp $2;* for some variables;

set date_time2;

bd_temp=put(bdd,z2.);

by_temp=put(byy,4.);

drop bdd byy;

run;

data date_time_all3; * In date_time1, bariables bdd and byy are num;

set date_time1 date_time2; * In date_time1, bariables bdd and byy are char;

run;

The following is the LOG message:

53 data date_time_all3;

54 set date_time1 date_time2;

ERROR: Variable bdd has been defined as both character and numeric.

ERROR: Variable byy has been defined as both character and numeric.

55 run;

NOTE: The SAS System stopped processing this step because of errors.

WARNING: The data set WORK.DATE_TIME_ALL3 may be incomplete. When this step

was stopped there were 0 observations and 6 variables.

NOTE: DATA statement used:

real time 0.04 seconds

4. Variables in two or more SAS data sets have the same name, but some of them have different format, informat, or labels:

  • An explicitly defined format, informat, or label overrides a default, regardless the order of the data sets in the set statement.
  • If two or more data sets explicitly define different formats, informats, or labels for the same variable, the first data set in the SET statement that explicitly define the variable will set the variable attribute in the output data set.

libname cdat '/courses/ddbf9765ba27fe300';

data date_time1 date_time2;

set cdat.date_time;

if _n_<13thenoutput date_time1;

else output date_time2;

run;

data date_time2;

set date_time2;

label age='Age in years'; * label variable age for data date_time2;

run;

data date_time_all4;

set date_time1 date_time2; * The output data set has label of date_time1;

run;

options label;

proccontentsdata=date_time_all4;

run;

data date_time_all5;

set date_time2 date_time1; * The output data set has label of date_time2;

run;

proccontentsdata=date_time_all5;

run;

5. Variables in two or more SAS data sets have the same name, but some of them have different length:

  • An explicitly defined length overrides a default, regardless the order of the data sets in the set statement.
  • If two or more data sets explicitly define different length for the same variable, the first data set in the SET statement that explicitly define the variable will set the variable attribute in the output data set.

Using PROC APPEND Concatenate SAS data sets

The APPEND procedure adds the observations from one SAS data set to the end of another SAS data set. PROC APPEND doesn’t process the observations in the first data set. It simple adds the observations in the second data set to the end of the first one. Therefore, it is more efficient, especially for large data sets.