Automation of Candidate Extract and Load Process

DWH Material

Version 1.0

REVISION HISTORY

The following table reflects all changes to this document.

Date / Author / Contributor / Version / Reason for Change
01-Nov-2004 / 1.0 / Initial Document
14-Sep-2010 / 1.1 / Updated Document

Table of Contents

1Introduction

1.1Purpose

2ORACLE

2.1DEFINATIONS

NORMALIZATION:

First Normal Form:

Second Normal Form:

Third Normal Form:

Boyce-Codd Normal Form:

Fourth Normal Form:

ORACLE SET OF STATEMENTS:

Data Definition Language :(DDL)

Data Manipulation Language (DML)

Data Querying Language (DQL)

Data Control Language (DCL)

Transactional Control Language (TCL)

Syntaxes:

ORACLE JOINS:

Equi Join/Inner Join:

Non-Equi Join

Self Join

Natural Join

Cross Join

Outer Join

Left Outer Join

Right Outer Join

Full Outer Join

What’s the difference between View and Materialized View?

View:

Materialized View:

Inline view:

Indexes:

Why hints Require?

Explain Plan:

Store Procedure:

Packages:

Triggers:

Data files Overview:

2.2IMPORTANT QUERIES

3DWH CONCEPTS

What is BI?

4ETL-INFORMATICA

4.1Informatica Overview

4.2Informatica Scenarios:

4.3Development Guidelines

4.4Performance Tips

4.5Unit Test Cases (UTP):

5UNIX

Detailed Design Document

1Introduction

1.1Purpose

The purpose of this document is to provide the detailed information about DWH Concepts and Informatica based on real-time training.

2ORACLE

2.1DEFINATIONS

Organizations can store data on various media and in different formats, such as a hard-copy document

in a filing cabinet or data stored in electronic spreadsheets or in databases.

A database is an organized collection of information.

To manage databases, you need database management systems (DBMS). A DBMS is a program that

stores, retrieves, and modifies data in the database on request. There are four main types of databases:

hierarchical, network, relational, and more recently object relational(ORDBMS).

NORMALIZATION:

Some Oracle databases were modeled according to the rules of normalization that were intended to eliminate redundancy.

Obviously, the rules of normalization are required to understand your relationships and functional dependencies

First Normal Form:

A row is in first normal form (1NF) if all underlying domains contain atomic values only.

Eliminate duplicative columns from the same table.
Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Second Normal Form:

An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and additionally:

Does not have a composite primary key. Meaning that the primary key can not be subdivided into separate logical entities.
All the non-key columns are functionally dependent on the entire primary key.
A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully dependent on the key.
2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. An example is resolving many: many relationships using an intersecting entity.

Third Normal Form:

An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and additionally:

Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level, all non-key fields are dependent on the primary key.
A row is in third normal form if and only if it is in second normal form and if attributes that do not contribute to a description of the primary key are move into a separate table. An example is creating look-up tables.

Boyce-Codd Normal Form:

Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A row is in Boyce Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are already in BCNF.

Fourth Normal Form:

An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and additionally:

Has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship.

ORACLE SET OF STATEMENTS:

Data Definition Language :(DDL)

Create

Alter

Drop

Truncate

Data Manipulation Language (DML)

Insert

Update

Delete

Data Querying Language (DQL)

Select

Data Control Language (DCL)

Grant

Revoke

Transactional Control Language (TCL)

Commit

Rollback

Save point

Syntaxes:

CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES

CREATEDATABASELINK CAASEDW CONNECTTO ITO_ASA IDENTIFIEDBY exact123 USING'CAASEDW’

Materialized View syntax:

CREATEMATERIALIZEDVIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW

REFRESHCOMPLETE

STARTWITHsysdate

NEXTTRUNC(SYSDATE+1)+ 4/24

WITHPRIMARYKEY

select * from HWMD_MTH_ALL_METRICS_CURR_VW;

Another Method to refresh:

DBMS_MVIEW.REFRESH('MV_COMPLEX','C');

Case Statement:

Select NAME,

(CASE

WHEN (CLASS_CODE = 'Subscription')

THEN ATTRIBUTE_CATEGORY

ELSE TASK_TYPE

END) TASK_TYPE,

CURRENCY_CODE

From EMP

Decode()

Select empname,Decode(address,’HYD’,’Hyderabad’,

‘Bang’, Bangalore’, address) as address from emp;

Procedure:

CREATE OR REPLACE PROCEDURE Update_bal (

cust_id_IN In NUMBER,

amount_IN In NUMBER DEFAULT 1) AS

BEGIN

Update account_tbl Set amount= amount_IN where cust_id= cust_id_IN

End

Trigger:

CREATE OR REPLACE TRIGGER EMP_AUR

AFTER/BEFORE UPDATE ON EMP

REFERENCING

NEW AS NEW

OLD AS OLD

FOR EACH ROW

DECLARE

BEGIN

IF (:NEW.last_upd_tmst > :OLD.last_upd_tmst) THEN

-- Insert into Control table record

Insert into table emp_w values('wrk',sysdate)

ELSE

-- Exec procedure

Exec update_sysdate()

END;

ORACLE JOINS:

Equi join
Non-equi join
Self join
Natural join
Cross join
Outer join

Left outer

Right outer

Full outer

Equi Join/Inner Join:

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno;

USING CLAUSE

SQL> select empno,ename,job ,dname,loc from emp e join dept d using(deptno);

ON CLAUSE

SQL> select empno,ename,job,dname,loc from emp e join dept d on(e.deptno=d.deptno);

Non-Equi Join

A join which contains an operator other than ‘=’ in the joins condition.

Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno > d.deptno;

Self Join

Joining the table itself is called self join.

Ex: SQL> select e1.empno,e2.ename,e1.job,e2.deptno from emp e1,emp e2 where e1.empno=e2.mgr;

Natural Join

Natural join compares all the common columns.

Ex: SQL> select empno,ename,job,dname,loc from emp natural join dept;

Cross Join

This will gives the cross product.

Ex: SQL> select empno,ename,job,dname,loc from emp cross join dept;

Outer Join

Outer join gives the non-matching records along with matching records.

Left Outer Join

This will display the all matching records and the records which are in left hand side table those that are not in right hand side table.

Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept d on(e.deptno=d.deptno);

SQL> select empno,ename,job,dname,loc from emp e,dept d where

e.deptno=d.deptno(+);

Right Outer Join

This will display the all matching records and the records which are in right hand side table those that are not in left hand side table.

Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);

SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) = d.deptno;

Full Outer Join

This will display the all matching records and the non-matching records from both tables.

Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);

SQL> select p.part_id, s.supplier_name

2 from part p, supplier s

3 where p.supplier_id = s.supplier_id (+)

4 union

5 select p.part_id, s.supplier_name

6 from part p, supplier s

7 where p.supplier_id (+) = s.supplier_id;

What’s the difference between View and Materialized View?

View:

Why Use Views?

• To restrict data access

• To make complex queries easy

• To provide data independence

A simple view is one that:

– Derives data from only one table

– Contains no functions or groups of data

– Can perform DML operations through the view.

A complex view is one that:

– Derives data from many tables

– Contains functions or groups of data

– Does not always allow DML operations through the view

A view has a logical existence but a materialized view has

a physical existence.Moreover a materialized view can be

Indexed, analysed and so on....that is all the things that

we can do with a table can also be done with a materialized

view.

We can keep aggregated data into materialized view. we can schedule the MV to refresh but table can’t.MV can be created based on multiple tables.

Materialized View:

In DWH materialized views are very essential because in reporting side if we do aggregate calculations as per the business requirement report performance would be de graded. So to improve report performance rather than doing report calculations and joins at reporting side if we put same logic in the MV then we can directly select the data from MV without any joins and aggregations. We can also schedule MV (Materialize View).

Inline view:

If we write a select statement in from clause that is nothing but inline view.

Ex:

Get dept wise max sal along with empname and emp no.

Select a.empname, a.empno, b.sal, b.deptno

From EMP a, (Select max (sal) sal, deptno from EMP group by deptno) b

Where

a.sal=b.sal and

a.deptno=b.deptno

What is the difference between view and materialized view?

View / Materialized view
A view has a logical existence. It does not contain data. / A materialized view has a physical existence.
Its not a database object. / It is a database object.
We cannot perform DML operation on view. / We can perform DML operation on materialized view.
When we do select * from view it will fetch the data from base table. / When we do select * from materialized view it will fetch the data from materialized view.
In view we cannot schedule to refresh. / In materialized view we can schedule to refresh.
We can keep aggregated data into materialized view. Materialized view can be created based on multiple tables.

What is the Difference between Delete, Truncate and Drop?

DELETE

The DELETE command is used to remove rows from a table. A WHERE clause can be used to only remove some rows. If no WHERE condition is specified, all rows will be removed. After performing a DELETE operation you need to COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.

TRUNCATE

TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUCATE is faster and doesn't use as much undo space as a DELETE.

DROP

The DROP command removes a table from the database. All the tables' rows, indexes and privileges will also be removed. The operation cannot be rolled back.

Difference between Rowid and Rownum?

ROWID

A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when it is removed from a table.'BBBBBBBB.RRRR.FFFF' where BBBBBBBB is the block number, RRRR is the slot(row) number, and FFFF is a file number.
ROWNUM
For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1, the second has 2, and so on.
You can use ROWNUM to limit the number of rows returned by a query, as in this example:
SELECT * FROM employees WHERE ROWNUM < 10;

Rowid / Row-num
Rowid is an oracle internal id that is allocated every time a new record is inserted in a table. This ID is unique and cannot be changed by the user. / Row-num is a row number returned by a select statement.
Rowid is permanent. / Row-num is temporary.
Rowid is a globally unique identifier for a row in a database. It is created at the time the row is inserted into the table, and destroyed when it is removed from a table. / The row-num pseudocoloumn returns a number indicating the order in which oracle selects the row from a table or set of joined rows.

Order of where and having:

SELECT column, group_function

FROM table

[WHERE condition]

[GROUP BY group_by_expression]

[HAVING group_condition]

[ORDER BY column];

The WHERE clause cannot be used to restrict groups. you use the

HAVING clause to restrict groups.

Differences between where clause and having clause

Where clause / Having clause
Both where and having clause can be used to filter the data.
Where as in where clause it is not mandatory. / But having clause we need to use it with the group by.
Where clause applies to the individual rows. / Where as having clause is used to test some condition on the group rather than on individual rows.
Where clause is used to restrict rows. / But having clause is used to restrict groups.
Restrict normal query by where / Restrict group by function by having
In where clause every record is filtered based on where. / In having clause it is with aggregate records (group by functions).

MERGEStatement

You can use merge command to perform insert and update in a single command.

Ex:Merge into student1 s1

Using (select * from student2) s2

On (s1.no=s2.no)

When matched then

Update set marks = s2.marks

When not matched then

Insert (s1.no, s1.name, s1.marks) Values (s2.no, s2.name, s2.marks);

What is the difference between sub-query & co-related sub query?

A sub query is executed once for the parent statement

whereas the correlated sub query is executed once for each

row of the parent query.

Sub Query:

Example:

Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade=’A’ or sal_grade=’B’)

Co-Related Sun query:

Example:

Find all employees who earn more than the averagesalary in their department.

SELECT last-named, salary, department_id FROM employees A

WHERE salary > (SELECT AVG (salary)

FROM employees BWHERE B.department_id =A.department_id

Group by B.department_id)

EXISTS:

The EXISTS operator tests for existence of rows in

the results set of the subquery.

Select dname from dept where exists
(select 1 from EMP
where dept.deptno= emp.deptno);

Sub-query / Co-related sub-query
A sub-query is executed once for the parent Query / Where as co-related sub-query is executed once for each row of the parent query.
Example:
Select * from emp where deptno in (select deptno from dept); / Example:
Select a.* from emp e where sal >= (select avg(sal) from emp a where a.deptno=e.deptno group by a.deptno);

Indexes:

Bitmap indexes are most appropriate for columns having low distinct values—such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent systems. In fact, as I'll demonstrate here, a bitmap index on a column with 100-percent unique values (a column candidate for primary key) is as efficient as a B-tree index.
When to Create an Index
You should create an index if:
A column contains a wide range of values
A column contains a large number of null values
One or more columns are frequently used together in a WHERE clause or a join condition
The table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows
By default if u create index that is nothing but b-tree index.

Why hintsRequire?

It is a perfect valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all.

Sometimes, however, the characteristics of the data in the database are changing rapidly, so that the optimizer (or more accuratly, its statistics) are out of date. In this case, a hint could help.

You should first get the explain plan of your SQL and determine what changes can be done to make the code operate without using hints if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance

Tables analyze and update Analyze Statement

The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be computed exactly, or estimated based on a specific number of rows, or a percentage of rows:

ANALYZE TABLE employees COMPUTE STATISTICS;

ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 15 PERCENT;

EXEC DBMS_STATS.gather_table_stats('SCOTT', 'EMPLOYEES');

Automatic Optimizer Statistics Collection

By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB. By default this job runs within maintenance windows between 10 P.M. to 6 A.M. week nights and all day on weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure which gathers statistics for tables with either empty or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that tables most urgently requiring statistics updates are processed first.

Hint categories:

Hints can be categorized as follows:

ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing systems.

(/*+ ALL_ROWS */)

FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.

(/*+ FIRST_ROWS */)

CHOOSE
One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose (between ALL_ROWS and FIRST_ROWS, based on statistics gathered.

Hints for Join Orders,
Hints for Join Operations,
Hints for Parallel Execution, (/*+ parallel(a,4) */) specify degree either 2 or 4 or 16
Additional Hints
HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash index to find corresponding records. Therefore not suitable for < or > join conditions.

/*+ use_hash */