DWH Material
Version 1.0
REVISION HISTORY
The following table reflects all changes to this document.
Date / Author / Contributor / Version / Reason for Change01-Nov-2004 / 1.0 / Initial Document
14-Sep-2010 / 1.1 / Updated Document
Table of Contents
1Introduction
1.1Purpose
2ORACLE
2.1DEFINATIONS
NORMALIZATION:
First Normal Form:
Second Normal Form:
Third Normal Form:
Boyce-Codd Normal Form:
Fourth Normal Form:
ORACLE SET OF STATEMENTS:
Data Definition Language :(DDL)
Data Manipulation Language (DML)
Data Querying Language (DQL)
Data Control Language (DCL)
Transactional Control Language (TCL)
Syntaxes:
ORACLE JOINS:
Equi Join/Inner Join:
Non-Equi Join
Self Join
Natural Join
Cross Join
Outer Join
Left Outer Join
Right Outer Join
Full Outer Join
What’s the difference between View and Materialized View?
View:
Materialized View:
Inline view:
Indexes:
Why hints Require?
Explain Plan:
Store Procedure:
Packages:
Triggers:
Data files Overview:
2.2IMPORTANT QUERIES
3DWH CONCEPTS
What is BI?
4ETL-INFORMATICA
4.1Informatica Overview
4.2Informatica Scenarios:
4.3Development Guidelines
4.4Performance Tips
4.5Unit Test Cases (UTP):
5UNIX
Detailed Design Document
1Introduction
1.1Purpose
The purpose of this document is to provide the detailed information about DWH Concepts and Informatica based on real-time training.
2ORACLE
2.1DEFINATIONS
Organizations can store data on various media and in different formats, such as a hard-copy document
in a filing cabinet or data stored in electronic spreadsheets or in databases.
A database is an organized collection of information.
To manage databases, you need database management systems (DBMS). A DBMS is a program that
stores, retrieves, and modifies data in the database on request. There are four main types of databases:
hierarchical, network, relational, and more recently object relational(ORDBMS).
NORMALIZATION:
Some Oracle databases were modeled according to the rules of normalization that were intended to eliminate redundancy.
Obviously, the rules of normalization are required to understand your relationships and functional dependencies
First Normal Form:
A row is in first normal form (1NF) if all underlying domains contain atomic values only.
- Eliminate duplicative columns from the same table.
- Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).
Second Normal Form:
An entity is in Second Normal Form (2NF) when it meets the requirement of being in First Normal Form (1NF) and additionally:
- Does not have a composite primary key. Meaning that the primary key can not be subdivided into separate logical entities.
- All the non-key columns are functionally dependent on the entire primary key.
- A row is in second normal form if, and only if, it is in first normal form and every non-key attribute is fully dependent on the key.
- 2NF eliminates functional dependencies on a partial key by putting the fields in a separate table from those that are dependent on the whole key. An example is resolving many: many relationships using an intersecting entity.
Third Normal Form:
An entity is in Third Normal Form (3NF) when it meets the requirement of being in Second Normal Form (2NF) and additionally:
- Functional dependencies on non-key fields are eliminated by putting them in a separate table. At this level, all non-key fields are dependent on the primary key.
- A row is in third normal form if and only if it is in second normal form and if attributes that do not contribute to a description of the primary key are move into a separate table. An example is creating look-up tables.
Boyce-Codd Normal Form:
Boyce Codd Normal Form (BCNF) is a further refinement of 3NF. In his later writings Codd refers to BCNF as 3NF. A row is in Boyce Codd normal form if, and only if, every determinant is a candidate key. Most entities in 3NF are already in BCNF.
Fourth Normal Form:
An entity is in Fourth Normal Form (4NF) when it meets the requirement of being in Third Normal Form (3NF) and additionally:
Has no multiple sets of multi-valued dependencies. In other words, 4NF states that no entity can have more than a single one-to-many relationship.
ORACLE SET OF STATEMENTS:
Data Definition Language :(DDL)
Create
Alter
Drop
Truncate
Data Manipulation Language (DML)
Insert
Update
Delete
Data Querying Language (DQL)
Select
Data Control Language (DCL)
Grant
Revoke
Transactional Control Language (TCL)
Commit
Rollback
Save point
Syntaxes:
CREATE OR REPLACE SYNONYM HZ_PARTIES FOR SCOTT.HZ_PARTIES
CREATEDATABASELINK CAASEDW CONNECTTO ITO_ASA IDENTIFIEDBY exact123 USING'CAASEDW’
Materialized View syntax:
CREATEMATERIALIZEDVIEW EBIBDRO.HWMD_MTH_ALL_METRICS_CURR_VIEW
REFRESHCOMPLETE
STARTWITHsysdate
NEXTTRUNC(SYSDATE+1)+ 4/24
WITHPRIMARYKEY
AS
select * from HWMD_MTH_ALL_METRICS_CURR_VW;
Another Method to refresh:
DBMS_MVIEW.REFRESH('MV_COMPLEX','C');
Case Statement:
Select NAME,
(CASE
WHEN (CLASS_CODE = 'Subscription')
THEN ATTRIBUTE_CATEGORY
ELSE TASK_TYPE
END) TASK_TYPE,
CURRENCY_CODE
From EMP
Decode()
Select empname,Decode(address,’HYD’,’Hyderabad’,
‘Bang’, Bangalore’, address) as address from emp;
Procedure:
CREATE OR REPLACE PROCEDURE Update_bal (
cust_id_IN In NUMBER,
amount_IN In NUMBER DEFAULT 1) AS
BEGIN
Update account_tbl Set amount= amount_IN where cust_id= cust_id_IN
End
Trigger:
CREATE OR REPLACE TRIGGER EMP_AUR
AFTER/BEFORE UPDATE ON EMP
REFERENCING
NEW AS NEW
OLD AS OLD
FOR EACH ROW
DECLARE
BEGIN
IF (:NEW.last_upd_tmst > :OLD.last_upd_tmst) THEN
-- Insert into Control table record
Insert into table emp_w values('wrk',sysdate)
ELSE
-- Exec procedure
Exec update_sysdate()
END;
ORACLE JOINS:
- Equi join
- Non-equi join
- Self join
- Natural join
- Cross join
- Outer join
Left outer
Right outer
Full outer
Equi Join/Inner Join:
SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno=d.deptno;
USING CLAUSE
SQL> select empno,ename,job ,dname,loc from emp e join dept d using(deptno);
ON CLAUSE
SQL> select empno,ename,job,dname,loc from emp e join dept d on(e.deptno=d.deptno);
Non-Equi Join
A join which contains an operator other than ‘=’ in the joins condition.
Ex: SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno > d.deptno;
Self Join
Joining the table itself is called self join.
Ex: SQL> select e1.empno,e2.ename,e1.job,e2.deptno from emp e1,emp e2 where e1.empno=e2.mgr;
Natural Join
Natural join compares all the common columns.
Ex: SQL> select empno,ename,job,dname,loc from emp natural join dept;
Cross Join
This will gives the cross product.
Ex: SQL> select empno,ename,job,dname,loc from emp cross join dept;
Outer Join
Outer join gives the non-matching records along with matching records.
Left Outer Join
This will display the all matching records and the records which are in left hand side table those that are not in right hand side table.
Ex: SQL> select empno,ename,job,dname,loc from emp e left outer join dept d on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where
e.deptno=d.deptno(+);
Right Outer Join
This will display the all matching records and the records which are in right hand side table those that are not in left hand side table.
Ex: SQL> select empno,ename,job,dname,loc from emp e right outer join dept d on(e.deptno=d.deptno);
Or
SQL> select empno,ename,job,dname,loc from emp e,dept d where e.deptno(+) = d.deptno;
Full Outer Join
This will display the all matching records and the non-matching records from both tables.
Ex: SQL> select empno,ename,job,dname,loc from emp e full outer join dept d on(e.deptno=d.deptno);
OR
SQL> select p.part_id, s.supplier_name
2 from part p, supplier s
3 where p.supplier_id = s.supplier_id (+)
4 union
5 select p.part_id, s.supplier_name
6 from part p, supplier s
7 where p.supplier_id (+) = s.supplier_id;
What’s the difference between View and Materialized View?
View:
Why Use Views?
• To restrict data access
• To make complex queries easy
• To provide data independence
A simple view is one that:
– Derives data from only one table
– Contains no functions or groups of data
– Can perform DML operations through the view.
A complex view is one that:
– Derives data from many tables
– Contains functions or groups of data
– Does not always allow DML operations through the view
A view has a logical existence but a materialized view has
a physical existence.Moreover a materialized view can be
Indexed, analysed and so on....that is all the things that
we can do with a table can also be done with a materialized
view.
We can keep aggregated data into materialized view. we can schedule the MV to refresh but table can’t.MV can be created based on multiple tables.
Materialized View:
In DWH materialized views are very essential because in reporting side if we do aggregate calculations as per the business requirement report performance would be de graded. So to improve report performance rather than doing report calculations and joins at reporting side if we put same logic in the MV then we can directly select the data from MV without any joins and aggregations. We can also schedule MV (Materialize View).
Inline view:
If we write a select statement in from clause that is nothing but inline view.
Ex:
Get dept wise max sal along with empname and emp no.
Select a.empname, a.empno, b.sal, b.deptno
From EMP a, (Select max (sal) sal, deptno from EMP group by deptno) b
Where
a.sal=b.sal and
a.deptno=b.deptno
What is the difference between view and materialized view?
View / Materialized viewA view has a logical existence. It does not contain data. / A materialized view has a physical existence.
Its not a database object. / It is a database object.
We cannot perform DML operation on view. / We can perform DML operation on materialized view.
When we do select * from view it will fetch the data from base table. / When we do select * from materialized view it will fetch the data from materialized view.
In view we cannot schedule to refresh. / In materialized view we can schedule to refresh.
We can keep aggregated data into materialized view. Materialized view can be created based on multiple tables.
What is the Difference between Delete, Truncate and Drop?
DELETE
The DELETE command is used to remove rows from a table. A WHERE clause can be used to only remove some rows. If no WHERE condition is specified, all rows will be removed. After performing a DELETE operation you need to COMMIT or ROLLBACK the transaction to make the change permanent or to undo it.
TRUNCATE
TRUNCATE removes all rows from a table. The operation cannot be rolled back. As such, TRUCATE is faster and doesn't use as much undo space as a DELETE.
DROP
The DROP command removes a table from the database. All the tables' rows, indexes and privileges will also be removed. The operation cannot be rolled back.
Difference between Rowid and Rownum?
ROWID
A globally unique identifier for a row in a database. It is created at the time the row is inserted into a table, and destroyed when it is removed from a table.'BBBBBBBB.RRRR.FFFF' where BBBBBBBB is the block number, RRRR is the slot(row) number, and FFFF is a file number.
ROWNUM
For each row returned by a query, the ROWNUM pseudo column returns a number indicating the order in which Oracle selects the row from a table or set of joined rows. The first row selected has a ROWNUM of 1, the second has 2, and so on.
You can use ROWNUM to limit the number of rows returned by a query, as in this example:
SELECT * FROM employees WHERE ROWNUM < 10;
Rowid is an oracle internal id that is allocated every time a new record is inserted in a table. This ID is unique and cannot be changed by the user. / Row-num is a row number returned by a select statement.
Rowid is permanent. / Row-num is temporary.
Rowid is a globally unique identifier for a row in a database. It is created at the time the row is inserted into the table, and destroyed when it is removed from a table. / The row-num pseudocoloumn returns a number indicating the order in which oracle selects the row from a table or set of joined rows.
Order of where and having:
SELECT column, group_function
FROM table
[WHERE condition]
[GROUP BY group_by_expression]
[HAVING group_condition]
[ORDER BY column];
The WHERE clause cannot be used to restrict groups. you use the
HAVING clause to restrict groups.
Differences between where clause and having clause
Where clause / Having clauseBoth where and having clause can be used to filter the data.
Where as in where clause it is not mandatory. / But having clause we need to use it with the group by.
Where clause applies to the individual rows. / Where as having clause is used to test some condition on the group rather than on individual rows.
Where clause is used to restrict rows. / But having clause is used to restrict groups.
Restrict normal query by where / Restrict group by function by having
In where clause every record is filtered based on where. / In having clause it is with aggregate records (group by functions).
MERGEStatement
You can use merge command to perform insert and update in a single command.
Ex:Merge into student1 s1
Using (select * from student2) s2
On (s1.no=s2.no)
When matched then
Update set marks = s2.marks
When not matched then
Insert (s1.no, s1.name, s1.marks) Values (s2.no, s2.name, s2.marks);
What is the difference between sub-query & co-related sub query?
A sub query is executed once for the parent statement
whereas the correlated sub query is executed once for each
row of the parent query.
Sub Query:
Example:
Select deptno, ename, sal from emp a where sal in (select sal from Grade where sal_grade=’A’ or sal_grade=’B’)
Co-Related Sun query:
Example:
Find all employees who earn more than the averagesalary in their department.
SELECT last-named, salary, department_id FROM employees A
WHERE salary > (SELECT AVG (salary)
FROM employees BWHERE B.department_id =A.department_id
Group by B.department_id)
EXISTS:
The EXISTS operator tests for existence of rows in
the results set of the subquery.
Select dname from dept where exists
(select 1 from EMP
where dept.deptno= emp.deptno);
A sub-query is executed once for the parent Query / Where as co-related sub-query is executed once for each row of the parent query.
Example:
Select * from emp where deptno in (select deptno from dept); / Example:
Select a.* from emp e where sal >= (select avg(sal) from emp a where a.deptno=e.deptno group by a.deptno);
Indexes:
- Bitmap indexes are most appropriate for columns having low distinct values—such as GENDER, MARITAL_STATUS, and RELATION. This assumption is not completely accurate, however. In reality, a bitmap index is always advisable for systems in which data is not frequently updated by many concurrent systems. In fact, as I'll demonstrate here, a bitmap index on a column with 100-percent unique values (a column candidate for primary key) is as efficient as a B-tree index.
- When to Create an Index
- You should create an index if:
- A column contains a wide range of values
- A column contains a large number of null values
- One or more columns are frequently used together in a WHERE clause or a join condition
- The table is large and most queries are expected to retrieve less than 2 to 4 percent of the rows
- By default if u create index that is nothing but b-tree index.
Why hintsRequire?
It is a perfect valid question to ask why hints should be used. Oracle comes with an optimizer that promises to optimize a query's execution plan. When this optimizer is really doing a good job, no hints should be required at all.
Sometimes, however, the characteristics of the data in the database are changing rapidly, so that the optimizer (or more accuratly, its statistics) are out of date. In this case, a hint could help.
You should first get the explain plan of your SQL and determine what changes can be done to make the code operate without using hints if possible. However, hints such as ORDERED, LEADING, INDEX, FULL, and the various AJ and SJ hints can take a wild optimizer and give you optimal performance
Tables analyze and update Analyze Statement
The ANALYZE statement can be used to gather statistics for a specific table, index or cluster. The statistics can be computed exactly, or estimated based on a specific number of rows, or a percentage of rows:
ANALYZE TABLE employees COMPUTE STATISTICS;
ANALYZE TABLE employees ESTIMATE STATISTICS SAMPLE 15 PERCENT;
EXEC DBMS_STATS.gather_table_stats('SCOTT', 'EMPLOYEES');
Automatic Optimizer Statistics Collection
By default Oracle 10g automatically gathers optimizer statistics using a scheduled job called GATHER_STATS_JOB. By default this job runs within maintenance windows between 10 P.M. to 6 A.M. week nights and all day on weekends. The job calls the DBMS_STATS.GATHER_DATABASE_STATS_JOB_PROC internal procedure which gathers statistics for tables with either empty or stale statistics, similar to the DBMS_STATS.GATHER_DATABASE_STATS procedure using the GATHER AUTO option. The main difference is that the internal job prioritizes the work such that tables most urgently requiring statistics updates are processed first.
Hint categories:
Hints can be categorized as follows:
- ALL_ROWS
One of the hints that 'invokes' the Cost based optimizer
ALL_ROWS is usually used for batch processing or data warehousing systems.
(/*+ ALL_ROWS */)
- FIRST_ROWS
One of the hints that 'invokes' the Cost based optimizer
FIRST_ROWS is usually used for OLTP systems.
(/*+ FIRST_ROWS */)
- CHOOSE
One of the hints that 'invokes' the Cost based optimizer
This hint lets the server choose (between ALL_ROWS and FIRST_ROWS, based on statistics gathered.
- Hints for Join Orders,
- Hints for Join Operations,
- Hints for Parallel Execution, (/*+ parallel(a,4) */) specify degree either 2 or 4 or 16
- Additional Hints
- HASH
Hashes one table (full scan) and creates a hash index for that table. Then hashes other table and uses hash index to find corresponding records. Therefore not suitable for < or > join conditions.
/*+ use_hash */