DATAWAREHOUSING

WHY Data Warehousing?

Data warehousing is mainly done for the reporting purposes. All the historical data is put into a Data warehouse, which can be thought of as a Very large Database. Later on, reports are generating out of this Data Warehouse to do analysis of the business.

What is difference between Enterprise Data Warehouse (EDW) and a Data Mart?

EDW consists of all the information associated with the entire Organization. For example, it will contain information about all the departments (Say Finance, Human Resource, Marketing, Sales etc).

Where as Data Mart ONLY contains the data that is specific to one department (Say only Finance).

Data Warehousing Tools

ETL Tools

ETL means Extraction, Transformation and Loading. And tools that extract data from different data sources (SQLServer, Oracle, Flat Files, Sybase etc) into a Datawarehouse are known as ETL tools. Some popular ETL tools in market are Informatica, Ab Initio and Datastage.

Reporting Tools

Reporting tools are used to generate the reports out of the information (data) stored in the Data warehouse. Some popular reporting tools in the market are Business Objects, Cognos, Microstrategy etc.

Data Modeling

A Datawarehouse is based on Fact and Dimension tables. Establishing relationship between various Fact table(s) and Dimension tables is called “Data Modeling”. Fact table contains numeric data that is needed in reports e.g. revenue, sales etc Fact table contain information about all dimension table that it is related to. This means FACT table has all the Dimension Keys as Foreign Keys.

Data Modeling is of two types:

1. Star Schema Design:

Dimension tables surrounds Fact table. Data is in de-normalized form.

2. Snow Flakes Schema Design:

Dimension tables surrounds Fact table. Data is in normalized form. Dimension table may be further split into a sub-dimension table.

Informatica Tool Installation

1. Install Oracle.

2. Install Informatica Client Tools.

3. Install Informatica Server.

While Installing the Informatica Server, Give keys for all databases, Give name for Repository (and user name and password), Give TCP/IP Port number (4001), Choose Oracle Version.

The ODBC Driver For Oracle is “Merant-32 bit for Oracle”.

INFORMATICA

About PowerCenter and PowerMart

Welcome to PowerMart and PowerCenter, Informatica’s integrated suite of software products that deliver an open, scalable solution addressing the complete life cycle for data warehouse and analytic application development. Both PowerMart and PowerCenter combine the latest technology enhancements for reliably managing data repositories and delivering information resources in a timely, usable manner.

The metadata repository coordinates and drives a variety of core functions including extraction, transformation, loading, and management. The Informatica Server can extract large volumes of data from multiple platforms, handle complex transformations, and support high-speed loads. PowerMart and PowerCenter can simplify and accelerate the process of moving data warehouses from development to test to full production.

Software features that differ between the PowerMart and PowerCenter:

If You Are Using PowerCenter

With PowerCenter, you receive all product functionality, including the ability to register multiple servers, share metadata across repositories, and partition data.

A PowerCenter license lets you create a single repository that you can configure as a global repository, the core component of a data warehouse.

When this guide mentions a PowerCenter Server, it is referring to an Informatica Server with a PowerCenter license.

If You Are Using PowerMart

This version of PowerMart includes all features except distributed metadata, multiple registered servers, and data partitioning. Also, the various options available with PowerCenter (such as PowerCenter Integration Server for BW, PowerConnect for IBM DB2, PowerConnect for IBM MQSeries, PowerConnect for SAP R/3, PowerConnect for Siebel, and PowerConnect for PeopleSoft) are not available with PowerMart.

When this guide mentions a PowerMart Server, it is referring to an Informatica Server with a PowerMart license.

Informatica Client Tools:

Designer

Server Manager

Repository Manager

Informatica Server Tools:

1. Informatica Server

Load Manager Process and Data Transformation Manager Process

The Load Manager is the primary Informatica Server process. It performs the following tasks:

  • Manages session and batch scheduling.
  • Locks the session and reads session properties.
  • Reads the parameter file.
  • Expands the server and session variables and parameters.
  • Verifies permissions and privileges.
  • Validates source and target code pages.
  • Creates the session log file.
  • Creates the Data Transformation Manager (DTM) process, which executes the session.

The Data Transformation Manager (DTM) process executes the session.

DESIGNER

The Designer has five tools to help you build mappings and mapplets so you can specify how to move and transform data between sources and targets. The Designer helps you create source definitions, target definitions, and transformations to build your mappings.

The Designer allows you to work with multiple tools at one time and to work in multiple folders and repositories at the same time. It also includes windows so you can view folders, repository objects, and tasks.

Designer Tools

The Designer provides the 5 following tools:

Source Analyzer.

Used to import or create source definitions for flat file (Fixed-width and delimited flat files), XML, COBOL, ERP, and relational sources (tables, views, and synonyms).

Warehouse Designer.

Used to import or create target definitions.

Transformation Developer.

Used to create reusable transformations.

Mapplet Designer.

Used to create mapplets.

Mapping Designer.

Used to create mappings.

What is a Transformation?

A transformation is a repository object that generates, modifies, or passes data. The Designer provides a set of transformations that perform specific functions. For example, an Aggregator transformation performs calculations on groups of data.

Transformations in a mapping represent the operations the Informatica Server performs on the data. Data passes into and out of transformations through ports that you connect in a mapping or mapplet.

Transformations can be active or passive. An active transformation can change the number of rows that pass through it, such as a Filter transformation that removes rows that do not meet the configured filter condition. A passive transformation does not change the number of rows that pass through it, such as an Expression transformation that performs a calculation on data and passes all rows through the transformation.

Transformations can be connected to the data flow, or they can be unconnected. An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.

Table 8-1 provides a brief description of each transformation:

Table 8-1. Transformation Descriptions
Transformation / Type / Description
Advanced External Procedure / Active/
Connected / Calls a procedure in a shared library or in the COM layer of Windows NT.
Aggregator / Active/
Connected / Performs aggregate calculations.
ERP Source Qualifier / Active/
Connected / Represents the rows that the Informatica Server reads from an ERP source when it runs a session.
Expression / Passive/
Connected / Calculates a value.
External Procedure / Passive/
Connected or Unconnected / Calls a procedure in a shared library or in the COM layer of Windows NT.
Filter / Active/
Connected / Filters records.
Input / Passive/
Connected / Defines mapplet input rows. Available only in the Mapplet Designer.
Joiner / Active/
Connected / Joins records from different databases or flat file systems.
Lookup / Passive/
Connected or Unconnected / Looks up values.
Normalizer / Active/
Connected / Normalizes records, including those read from COBOL sources.
Output / Passive/
Connected / Defines mapplet output rows. Available only in the Mapplet Designer.
Rank / Active/
Connected / Limits records to a top or bottom range.
Sequence Generator / Passive/
Connected / Generates primary keys.
Source Qualifier / Active/
Connected / Represents the rows that the Informatica Server reads from a relational or flat file source when it runs a session.
Router / Active/
Connected / Routes data into multiple transformations based on a group expression.
Stored Procedure / Passive/
Connected or Unconnected / Calls a stored procedure.
Update Strategy / Active/
Connected / Determines whether to insert, delete, update, or reject records.
XML Source Qualifier / Passive/
Connected / Represents the rows that the Informatica Server reads from an XML source when it runs a session.

Overview Of Transformations:

1. Aggregator

The Aggregator transformation allows you to perform aggregate calculations, such as averages and sums. The Aggregator transformation is unlike the Expression transformation, in that you can use the Aggregator transformation to perform calculations on groups. The Expression transformation permits you to perform calculations on a row-by-row basis only.

When using the transformation language to create aggregate expressions, you can use conditional clauses to filter records, providing more flexibility than SQL language.

The Informatica Server performs aggregate calculations as it reads, and stores necessary data group and row data in an aggregate cache.

After you create a session that includes an Aggregator transformation, you can enable the session option, Incremental Aggregation. When the Informatica Server performs incremental aggregation, it passes new source data through the mapping and uses historical cache data to perform new aggregation calculations incrementally.

2. Filter

The Filter transformation provides the means for filtering rows in a mapping. You pass all the rows from a source transformation through the Filter transformation, and then enter a filter condition for the transformation. All ports in a Filter transformation are input/output, and only rows that meet the condition pass through the Filter transformation.

In some cases, you need to filter data based on one or more conditions before writing it to targets. For example, if you have a human resources data warehouse containing information about current employees, you might want to filter out employees who are part-time and hourly.

The mapping in Figure 18-1 passes the rows from a human resources table that contains employee data through a Filter transformation. The filter only allows rows through for employees that make salaries of $30,000 or higher.

3. Joiner

While a Source Qualifier transformation can join data originating from a common source database, the Joiner transformation joins two related heterogeneous sources residing in different locations or file systems. The combination of sources can be varied. You can use the following sources:

a) Two relational tables existing in separate databases

b) Two flat files in potentially different file systems

c) Two different ODBC sources

d) Two instances of the same XML source

e) A relational table and a flat file source

f) A relational table and an XML source

You use the Joiner transformation to join two sources with at least one matching port. The Joiner transformation uses a condition that matches one or more pairs of ports between the two sources.

For example, you might want to join a flat file with in-house customer IDs and a relational database table that contains user-defined customer IDs. You could import the flat file into a temporary database table, and then perform the join in the database. However, if you use the Joiner transformation, there is no need to import or create temporary tables.

If two relational sources contain keys, then a Source Qualifier transformation can easily join the sources on those keys. Joiner transformations typically combine information from two different sources that do not have matching keys, such as flat file sources. The Joiner transformation allows you to join sources that contain binary data.

The Joiner transformation supports the following join types, which you set in the Properties tab:

1. Normal (Default)

2. Master Outer

3. Detail Outer

4. Full Outer

4. Source Qualifier

When you add a relational or a flat file source definition to a mapping, you need to connect it to a Source Qualifier transformation. The Source Qualifier represents the records that the Informatica Server reads when it runs a session.

You can use the Source Qualifier to perform the following tasks:

a) Join data originating from the same source database

You can join two or more tables with primary-foreign key relationships by linking the sources to one Source Qualifier.

b) Filter records when the Informatica Server reads source data

If you include a filter condition, the Informatica Server adds a WHERE clause to the default query.

c) Specify an outer join rather than the default inner join

If you include a user-defined join, the Informatica Server replaces the join information specified by the metadata in the SQL query.

d) Specify sorted ports

If you specify a number for sorted ports, the Informatica Server adds an ORDER BY clause to the default SQL query.

e) Select only distinct values from the source

If you choose Select Distinct, the Informatica Server adds a SELECT DISTINCT statement to the default SQL query.

f) Create a custom query to issue a special SELECT statement for the Informatica Server to read source data

For example, you might use a custom query to perform aggregate calculations or execute a stored procedure.

5. Stored Procedure

A Stored Procedure transformation is an important tool for populating and maintaining databases. Database administrators create stored procedures to automate time-consuming tasks that are too complicated for standard SQL statements.

A stored procedure is a precompiled collection of Transact-SQL statements and optional flow control statements, similar to an executable script. Stored procedures are stored and run within the database. You can run a stored procedure with the EXECUTE SQL statement in a database client tool, just as you can run SQL statements. Unlike standard SQL, however, stored procedures allow user-defined variables, conditional statements, and other powerful programming features.

Not all databases support stored procedures, and database implementations vary widely on their syntax. You might use stored procedures to:

a) Drop and recreate indexes.

b) Check the status of a target database before moving records into it.

c) Determine if enough space exists in a database.

d) Perform a specialized calculation.

Database developers and programmers use stored procedures for various tasks within databases, since stored procedures allow greater flexibility than SQL statements. Stored procedures also provide error handling and logging necessary for mission critical tasks. Developers create stored procedures in the database using the client tools provided with the database.

The stored procedure must exist in the database before creating a Stored Procedure transformation, and the stored procedure can exist in a source, target, or any database with a valid connection to the Informatica Server.

You might use a stored procedure to perform a query or calculation that you would otherwise make part of a mapping. For example, if you already have a well-tested stored procedure for calculating sales tax, you can perform that calculation through the stored procedure instead of recreating the same calculation in an Expression transformation.

6. Sequence Generator

The Sequence Generator transformation generates numeric values. You can use the Sequence Generator to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers.

The Sequence Generator transformation is a connected transformation. It contains two output ports that you can connect to one or more transformations. The Informatica Server generates a value each time a row enters a connected transformation, even if that value is not used. When NEXTVAL is connected to the input port of another transformation, the Informatica Server generates a sequence of numbers. When CURRVAL is connected to the input port of another transformation, the Informatica Server generates the NEXTVAL value plus one.

You can make a Sequence Generator reusable, and use it in multiple mappings. You might reuse a Sequence Generator when you perform multiple loads to a single target.

For example, if you have a large input file that you separate into three sessions running in parallel, you can use a Sequence Generator to generate primary key values. If you use different Sequence Generators, the Informatica Server might accidentally generate duplicate key values. Instead, you can use the same reusable Sequence Generator for all three sessions to provide a unique value for each target row.

7. Rank

The Rank transformation allows you to select only the top or bottom rank of data. You can use a Rank transformation to return the largest or smallest numeric value in a port or group. You can also use a Rank transformation to return the strings at the top or the bottom of a session sort order. During the session, the Informatica Server caches input data until it can perform the rank calculations.