DataStage Tech Tip

Detecting Changed Column Value Using Stage Variables

1.Stage Variables

Stage variables were introduced in release 4.0 of DataStage, to provide a local storage mechanism for information within a Transformer stage.

Stage variables are initialized in the stage properties window, Stage Variables tab, and are updated for every row processed in the Stage Variables area within the graphical Transformer stage.

By default this area is hidden. It can be made visible by choosing Show Stage Variables from the pop-up menu that appears when you click with the right mouse button in the background area of the Transformer stage, or by choosing the Show/Hide Stage Variables tool from the toolbar.

2.Detecting Changed Column Value

It is a common requirement to detect that a value in a column has changed. For example, this may mean that you want to begin processing a new category of data.

Prior to stage variables, it was necessary to construct variables in a COMMON area using BASIC code, to initialize these in a before-stage subroutine, and to access them within a transform function. This technique remains valid, and will continue to remain valid.

With the advent of stage variables there is another way.

2.1Example Job

To illustrate this technique a job will be constructed using a UV stage to extract data from the UniVerse table ENGAGEMENTS.T. This is one of ten tables in a demonstration database that is available with the UniVerse RDBMS. To create these tables, execute the UniVerse command MAKE.DEMO.TABLES, for example from a DataStage Administrator command window.

Use DataStage Manager to import the table definition for ENGAGEMENTS.T.

Using Designer, construct a job that uses a UV stage, a Transformer stage and a Sequential File stage. The UV stage has a data source name of localuv, and a table name of ENGAGEMENTS.T. Load column definitions for the columns LOCATION_CODE and DATE, and include the clause ORDER BY 1,2 in the Other clauses field. If your project name is Project1 the SQL generated by DataStage should appear as:

SELECT Project1.ENGAGEMENTS.T.LOCATION_CODE, Project1.ENGAGEMENTS.T.DATE FROM "Project1"."ENGAGEMENTS.T" ORDER BY 1,2;

Test that this works (and that LOCATION_CODE is sorted) by clicking on View Data.

In the Sequential File stage choose any legal directory pathname, file name and format that you wish.

2.2Stage Variables

Within the Transformer stage open the stage properties window, then choose the Stage Variables tab. Create two stage variables, one called IsChanged and one called PrevLocCode, as shown below. Use Expression Editor to set initial values.

Click OK to close the stage properties window. If the Stage Variables area is not visible within the Transformer stage, click the Show/Hide Stage Variables tool (fourth from the left in the toolbar) or choose Show Stage Variables from the pop-up menu

Stage variables are calculated before any output links, and in the order in which they are specified. Thus, because IsChanged is calculated before PrevLocCode, the previous value of PrevLocCode is available to the expression that updates IsChanged.

Add derivation expressions for IsChanged and PrevLocCode as follows:

IsChangedDSLink3.LOCATION_CODE > PrevLocCode And @INROWNUM > 1

PrevLocCodeDSLink3.LOCATION_CODE

2.3Output Columns

Create three output columns LOCATION_CODE, DIFFERENT and DATE.

LOCATION_CODE and DATE are derived directly from their equivalent input columns. DIFFERENT is derived directly from the stage variable IsChanged.

Note that the relationship between the stage variables area and the output link area is drawn on the right hand side of the Transformer stage.

Save, compile and run the job, then use View Data for the input link on the Sequential File stage to observe that the DIFFERENT column has the value 1 for each row where the value of LOCATION_CODE is different from the previous row, and 0 for each row where it is the same.

3.Caveats

There is a couple of caveats for this approach.

3.1Sorted Data

Obviously this approach only makes sense on columns that are sorted. While it will work on unsorted columns, chances are that every row will be different from its predecessor, which rather limits the usefulness of the technique.

3.2Handling NULL

The second problem is that of NULL. Because we cannot assert that anything is equal to NULL, NULL will fail any test (Boolean) expression. In the example job, a key column was deliberately chosen since it contains no NULL value.

Where a NULL appears, the IsChanged necessarily returns NULL, which can not successfully be compared against the previous row. Further, since that NULL would then be loaded into the other stage variable, the following comparison would also return NULL.

Page 1 of 4