TEN DATABASE MYSTERIES

NORTHERN CALIFORNIA

ORACLE USERS GROUP

November 19, 1998

Chris Lawson

chris_lawson @ yahoo.com


FOCUS OF PRESENTATION:

Explore some "strange" rdbms problems that have baffled some DBAs.

Most of the mysteries occurred on critical production systems, although some were on development systems.

Some of these problems lasted for MONTHS, until someone took the time to investigate; others were solved in a few hours.

ALL the mysteries were eventually explained.

Depending on your personal experience, some of these "mysteries" will seem trivial or commonplace; others will indeed seem mysterious.

Most mysteries have a simple explanation.

Most mysteries have a simple fix.


WHY SPEND TIME ON THESE RDBMS MYSTERIES?

Each DBA has a unique set of experiences and biases. What one DBA thinks is obvious, another will not.

An oracle "detective" is part scientist/part artist. Many solutions require CREATIVITY, not just logic.

A superior DBA will look for ways to "stretch" and learn ways to handle difficult problems.

Without working difficult problems, you will not advance as a DBA.

You will be the "hero" if you encounter a mystery and solve it; but remember the solution, as you may see it again!

NOTE: If you have a better solution to any of these mysteries, please share it!


MYSTERY #1 "THE CASE OF THE BERSERK APPLICATION"

Clue # 1: “Big Phone Company” 1997

Using HPUX, rdbms 7.3.2.3.

Help-desk application (Vantive) that connects to oracle rdbms suddenly goes berserk, creating thousands of connections.

Program had worked normally for many months.

DBAs watch helplessly as CPU load driven from 1 to 50.

As DBAs kill extra processes, more take their place.

Alert log and recent rdbms trace files show nothing unusual.

DBAs are united in accusing the application as the culprit.


MYSTERY #1 (cont’d)

Clue # 2: “Big Publisher Ltd.” 1998

Running SUN Solaris, rdbms 7.3.2.3

Users complain that performance has degraded in recent months.

Manager states that “something must be wrong with the network.”

Application is CORIS, a document management/printing app.

DBA investigates. Discovers that time to connect in sql*plus is about 30-45 seconds, even though server load is very low.

Connect time is terrible whether remote (PC) or directly on server.

Server load (file I/O and CPU) is generally low.


MYSTERY # 2 "THE CASE OF THE RELUCTANT PATCH"

BACKGROUND:

To correct several bugs, decision is made to upgrade from 7.3.2.2 to 7.3.2.3 (HPUX)

Patch is obtained from Oracle, and applied to test server. DBA notes that patch ran very quickly, and runs again "just to be sure"

Bug is now gone on Test server.

PROBLEM:

Patch is similarly applied to production server--same OS & server.

Production application is tested, but bug is still there!

Another DBA reviews patch file, location, etc. All seem correct.


MYSTERY # 3 “THE CASE OF THE SLEAZY SQL”

“Big Publisher Ltd” runs an MRP system called "AVALON," similar to Oracle Manufacturing. Database stores inventory, part information, vendors, etc.

Server is ATT3555, running NCR UNIX. RDBMS version is 7.1.6

Issue: Users report that certain common operations are very slow.

DBA investigates, and queries v$sqlarea using:

Select sql_text from v$sqlarea where

Disk_reads/executions > 1000;

Query yields troublesome sql statement, with these stats:

DISK_READS PER EXECUTION = 5,000

BUFFER_GETS PER EXECUTION = 5,100


MYSTERY # 3 cont’d

Statement has been accidentally designed to ensure worst possible performance by making index usage totally impossible:

SELECT * FROM ABC WHERE

NVL (COL_W) = NVL (:1) AND

NVL (COL_X) = NVL (:2) AND

NVL (COL_Y) = NVL (:3) AND

NVL (COL_Z) = NVL (:4);

TABLE ABC SIZE = 3 mb, about 25,000 rows

DBA requests developers to alter statement to eliminate NVL (COL_N) functions.

DBA advised that no resources available to make change.

Problem: If code can’t be changed, what can be done to improve performance?


MYSTERY # 4 “THE CASE OF THE NON-OPTIMAL OPTIMIZER”

A large software company based in “Cedar Shores” has designed a large financials application. Program has been tuned for Rule Based Optimizer.

The application runs very well, and is a mature product. Product is used in thousands in companies around the world.

Some users clamor for new features--more horns and whistles.

The new development team, afraid to become obsolete, wants to convert to Cost-Based Optimizer (CBO). They also wisely consider that Oracle recommends using CBO on new projects.

The older developers, now nearing peaceful retirement, predict disaster if the database is switched to CBO, because the execution plans will change.

Issue: How can Optimizer be selectively switched to CBO without changing the code?


MYSTERY # 5 “THE CASE OF THE FORGETFUL MEMORY”

A new internet-transaction application, ECXpert, and its database have been installed on a Sun ULTRA Enterprise Server.

Sun Solaris 2.5.1; RDBMS version 7.2.3.

Application appears to run smoothly for several months, although it occasionally creates large dump files.

Trace files appear occasionally with ORA-4030 “Out of Process Memory” and recommends “increase process memory quota.”

New “feature”: Server seems to “hang” occasionally. Server reboot fixes.

Sysadmin checks kernel parameters related to memory. All correct, and match other servers. Not using any large stored procedures.

Problem: What is causing memory/hang problems?


MYSTERY # 6 “THE RELUCTANT INDEX AFFAIR”

BACKGROUND:

DBA asked to analyze and tune Australia manufacturing database. RDBMS is running CBO. One particularly bothersome sql statement is identified.

The "where" condition is perfect for a new index, because of excellent selectivity.

Index is quickly created. Table is also analyzed.

PROBLEM:

Even though index is a "perfect" solution to the query, a full table scan is used instead.


MYSTERY # 7 "MYSTERY OF THE HANGING DATABASE"

At random intervals, a 7.3.2.3 database "hangs." No trace files, and nothing unusual in the alert log.

When problem occurs, no response to new connections requests; over 1200 existing connections "hang." At this point, the DBAs get to know the IS VP on a first name basis.

Oracle Support is alerted to priority 1 problem; experts across the world investigate for weeks.

Early one morning, DBA happens to be using OEM Lock Mgr tool and notices user who is blocking about 25 other users. The “hang” occurs soon after.

Oracle Australia recommends checking indexes. This suggestion led to the solution.

PROBLEM: How did index problems hang database?


MYSTERY # 8 "THE CASE OF THE MYSTERIOUS PACKAGE"

A new manufacturing application was installed on a SUN ULTRA 3000 server. A small database was created for testing purposes. RDBMS version 7.2.3. Shared Pool size about 60 mb.

At first, all went well. Then, seemingly randomly, when the users began to try new features, they would receive a "funny" error message, and the application failed.

A trace file recommended increasing shared pool.

PROBLEM:

How can application fail with such a sizable shared pool?

Aside from massive increase in shared pool, what can be done?


MYSTERY # 9 “THE CASE OF THE UNCOOPERATIVE ROLLBACK”

In mid-afternoon, DBA (running "OEM Top Sessions") notices many users "ACTIVE" but showing 0 file i/o. Lock manager reveals one user performing big update blocking all.

Culprit tracked down--agrees to be terminated. DBA disconnects session.

Locks are not released, but user is "marked for kill."

Very little file i/o activity.

Alert log shows very slow switching of redo logs.

DBA performs shutdown abort then startup. Database starts up after 2 minutes. All is well.

PROBLEM?

Why did user not rollback and release locks?


MYSTERY # 10 THE SINGULAR CASE OF THE PHANTOM USERS

A manufacturing database in Sydney, Australia, needed performance tuning. Sql tuning on US databases had yielded good results.

The table v$sqlarea was queried to find resource-intensive sql statements. Several commonly run statements were isolated.

Performance was improved through index additions.

Statistics were re-examined over the next 4 hours, in order to confirm improvements.

However, repeated look at execution statistics showed no change.

DBA puzzles over enigma for several hours, then realizes that NOTHING is WRONG! What did he finally realize?


MYSTERY # 11 THE CASE OF THE SLOW PHYSICIAN

Health application is experiencing slow run times. Analysis shows following sql statement causing 3000 disk reads

select * from view “COSIGN_VIEW” where doctor_id = ‘DR. MCKENZIE’

COSIGN_VIEW is a join of 2 tables (DOCS + COSIGN), joined on patient_id (indexed)

Search criteria “DR.MCKENZIE’ is very selective; thus, nested loop IS expected choice for optimizer, with DOCS as Driving table.

Even with index on DOCS(doctor_id), optimizer (CBO) insists on using hash-join, and refuses to ever use index on doctor_id !

Repeated analyze table commands do not correct.

Substituting query not using a view yields expected NL result.

Why does using the view cause optimizer to make the “wrong” choice?


SOLUTION TO MYSTERY # 1

OTRACE is the culprit. It is active by default on many 7.3 rdbms versions.

Excerpt from Oracle Corp Alert:

“Problems described here can occur when Oracle Trace is not configured and is widely enabled.”

TO DETECT

Check directory ORACLE_HOME/rdbms/otrace: As size of files process.dat and regid.dat approach 10mb, problems arise.

-rw-r--r-- 1 oracle dba 3161424 Jun 05 09:43 process.dat

-rw-r--r-- 1 oracle dba 263808 Jun 05 09:43 regid.dat

To correct simply remove these two files, then issue command otrccref

SOLUTION TO MYSTERY # 1 (cont’d)

TO PREVENT

1) Add line to listener.ora for each database (after ORACLE_HOME):

(ENVS=’EPC_DISABLED=TRUE’)

2) Set and export environment variable EPC_DISABLED=TRUE for ALL USERS.

Put standard profile in /etc directory.

3) Restart all databases and restart listener.

SOLUTION TO MYSTERY #2

DBA happens to notice that upon sqlplus startup, rdbms is 7.3.2.2 !

The patch was really only applied on the SECOND run. This is apparently a quirk in the patch readme file.

The command what oracle (then grep for patch#) can be used to determine which patches are applied.


SOLUTION TO MYSTERY #3

CACHE the table! For example, alter table xyz cache;

Normally, blocks from full-table scans are designated for rapid age-out; otherwise, they would “wipe-out” the db cache. Cache of table causes blocks to be treated “normally.”

Caching table disables rapid age-out of this table

Logical reads will not be reduced, but disk reads approach zero!

Note: DB_BLOCK_BUFFERS was slightly increased to compensate for the cached table that now consumes a few megabytes of database cache.


SOLUTION TO MYSTERY #4

Simply substitute a view having a “hint” for the table needing CBO.

For example:

rename DEPT to DEPT_ORIG

create view DEPT as select /*+ ALL_ROWS */

* from DEPT_ORIG;

Now, application will use the VIEW when it looks for DEPT.

All queries using DEPT will use CBO.

Note: Upon renaming a table, the indexes and constraints will “move” with the table; however, synonyms and grants may need to be reset.


SOLUTION TO MYSTERY #5

DBA checks /tmp (swap area on server) and notes 99% consumed.

Investigation reveals that application occasionally goes berserk and consumes ENTIRE SWAP area with log files.

Deletion of log files does not return disk space, as application is still “holding” the files.

Reboot of server cleaned up /tmp area, thereby correcting problem.

SUGGESTION:

If memory-related errors messages exist, check swap area first.


SOLUTION TO MYSTERY #6

The values in the table are very "lopsided." Optimizer, however, will assume uniform distribution, which is incorrect in many cases.

Re-analyze and specify histogram:

ANALYZE TABLE XYZ FOR ALL INDEXED COLUMNS SIZE 75

This creates histogram of 75 “buckets” for each indexed column.

With these statistics, optimizer will “know” how values are distributed, and will more often make right decision to use an index or not.


SOLUTION TO MYSTERY #7

Application design flaw.

There are hundreds of foreign keys in the database; 99% had indexes; A few did NOT, violating good design practice. When batch program began updates, locking increased rapidly.

Without FK index, updates on parent table completely block updates on child (vice versa for 7.1.6)

Ref: Server Application Developers Guide.

Although not admitted as RDBMS "bug," RDBMS was completely overwhelmed by the locks.

Once indexes on all FK’s created, problem disappeared.


SOLUTION TO MYSTERY #8

The application used about 20 massive PL/SQL packages. Some are 5x the "standard" package. When a package load is attempted, it will not fit in the shared pool.

Memory-intensive packages should be "pinned" or ‘kept” in shared pool after rdbms startup.

execute sys.dbms_shared_pool.keep (‘OBJECT_NAME’);

But first, must find the “big” packages (will also list ‘Standard’):

select owner, name, sharable_mem from v$db_object_cache where sharable_mem > 100000;

Example script to find "big" packages and generate sql script to “pin” them in memory:

select ‘execute sys.dbms_shared_pool.keep(‘||’’’’||owner||’.’||name||’’’’ ||’);’

from v$db_object_cache

where sharable_mem > 100000

and type not in (‘VIEW’, ‘SYNONYM’, ‘TABLE’)

and name not like ‘%SHARED_POOL%’;


SOLUTION TO MYSTERY #9

If session is terminated, speed of rollback is proportional to init.ora parameter CLEANUP_ROLLBACK_ENTRIES

If default value (20) is used, rollback of killed session can take 50x time of original update. Alternatively, shutdown abort/startup cleans up database much faster.

Rationale: Parameter prevents rollback of one user from hogging all the resources on a busy system.

Solution: Increase parameter to reduce rollback time (since shutdown abort is usually not an option)

SOLUTION TO MYSTERY #10

Nothing is wrong, because the users were still asleep, as it was still 5 AM in Sydney.


SOLUTION TO MYSTERY #11

Everything seemed to point to a problem with the view, because all worked normally as long as the view was excluded.

Finally, DBA compared view definition (in OEM Schema Mgr) to definition seen using “describe table” syntax. The columns did NOT match!

Examining the object-create script revealed that the view switched column names, so that column DOC_ID in the view did NOT match DOC_ID in the table!

Once the correct column was indexed, a Nested-Loop Join was selected by the optimizer.