Computer Architecture Impact on Oracle OLTP and OLAP Performance
James McKenna

Introduction

This paper analyzes current hardware choices for database management systems (DBMS). The hardware analysis starts at the microprocessor level and builds to the system level. DBMS workloads are divided into OLTP (online transaction processing) and OLAP (online analytical processing, often called DSS, decision support). Both workload types are defined, and then specific architectures are analyzed for their suitability for OLTP and OLAP workloads.

System Processors

Many high-end DBMS servers are built using commodity microprocessors. There has been a definite trend away from custom processors and toward commodity microprocessors such as the Intel x86 and the PowerPC. The reasons are compelling: cost and performance. Commodity chips are produced on a huge scale, which allows fixed costs to be spread over many more units than proprietary chips and leads to a lower unit cost. In addition, sophisticated design techniques formerly reserved for high-end proprietary designs have been applied to commodity microprocessors, making them competitive with all but the fastest proprietary designs.

Microprocessors are improving rapidly. The transistor count of a microprocessor doubles every 18 months (Moore’s Law), and microprocessor clock frequency increases 50% per year. Disk density increases 50% per year, and DRAM density increases just under 60% per year. Programs’ address spaces need between ½ and 1 additional bit per year, and 32-bit address spaces are now sometimes insufficient.
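
The arithmetic implied by these rates can be made concrete with a small sketch (in C); the starting transistor count and address-space figures are assumptions chosen only for illustration:

    /* Illustrative growth projections; the base figures are assumptions, not data
     * from this paper. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double transistors = 5.5e6;  /* assume a P6-class CPU die today */
        double addr_bits   = 32.0;   /* assume a program that just fits in 32 bits today */

        for (int year = 0; year <= 10; year += 5) {
            double t = transistors * pow(2.0, year / 1.5);  /* doubling every 18 months */
            double b = addr_bits + 0.75 * year;             /* roughly 1/2 to 1 extra bit per year */
            printf("year %2d: ~%.2e transistors, ~%.0f address bits needed\n", year, t, b);
        }
        return 0;
    }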

The dominant microprocessor is the Intel x86. The introduction of the P6 (marketing name: Pentium Pro) was significant. What appeared to be challenging problems (complex instruction decoding, matching RISC levels of performance, and a lack of general purpose registers, or GPRs) have been solved with the P6. Very few microprocessors can claim to be much faster than the new P6 (the DEC Alpha is one that clearly is). For designers of all but the fastest systems it is now very hard to justify the use of anything but the P6.

It is worth understanding the P6 in more detail since it is often the building block for high-end systems. The P6 has been called a CISC chip (especially by RISC chip vendors), but it is no longer helpful to distinguish between CISC and RISC: CISC has incorporated RISC ideas as needed (and vice versa). For example, since the introduction of the 486, x86 microprocessors no longer execute complex instructions directly. Instead, small RISC-like instructions called micro-ops are executed internally.

The P6 is a superscalar (i.e. more than one instruction can be dispatched per clock cycle), super-pipelined (i.e. the pipeline is divided into a larger number of shorter stages), MCM (multi-chip module) processor that performs out-of-order execution, dynamic branch prediction, and speculative execution.

The P6’s superscalar design can decode and retire three instructions per clock cycle and dispatch up to five micro-ops per cycle to its execution units [MICROAPR96A]. There are five parallel execution units: two integer, one floating-point, one load, and one store.

The pipeline has 14 stages divided into an in-order front section (eight stages) that does fetching and decoding, an out-of-order execution section (three stages) that does dispatching and executing (including speculative execution), and an in-order retirement section (three stages) that commits instructions. All sections are decoupled, and Intel prefers to represent the P6 as three pipelines.

Figure 1 - P6 Microarchitecture

The P6 performs dynamic scheduling (hardware rearranges instruction execution to reduce stalls) and speculative execution (instructions are executed but the results are not committed immediately). A variation of Tomasulo’s algorithm [HennPatt96A] is used, including register renaming and reservation stations (see Figure 1 [MICRODEC96B]). The x86 architecture has only 8 integer GPRs and 8 floating-point GPRs, which significantly constrains compilers. Fortunately, after decoding, the 16 logical x86 registers are renamed onto 40 physical registers by the RAT (register alias table) [MICROAPR96A] [BYTEAPR95D]. The expanded register file is wide enough to hold either integer or floating-point values and functions as a general purpose instruction pool called the ROB (reorder buffer). Uniform-sized micro-ops (118 bits long) leave the ROB and are transferred to a 20-entry reservation station in front of the functional units.
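
To make the renaming idea concrete, the following toy sketch (in C) mimics a RAT mapping logical registers onto a larger physical file; the table sizes echo the figures above, but the allocation scheme and everything else are simplified assumptions, not the P6’s actual mechanism:

    /* Toy register-renaming sketch; sizes mirror the text (16 logical, 40 physical),
     * the rest is simplified illustration. */
    #include <stdio.h>

    #define LOGICAL_REGS  16
    #define PHYSICAL_REGS 40

    static int rat[LOGICAL_REGS];   /* logical register -> current physical register */
    static int next_phys;           /* naive allocator; real hardware recycles via the ROB */

    /* Rename one micro-op of the form "dst = src1 op src2". */
    static void rename(int dst, int src1, int src2)
    {
        int p1 = rat[src1];                        /* sources read the current mappings */
        int p2 = rat[src2];
        int pd = next_phys++ % PHYSICAL_REGS;      /* allocate a fresh physical register */
        rat[dst] = pd;                             /* later readers of dst now see pd */
        printf("uop: p%d <- p%d op p%d\n", pd, p1, p2);
    }

    int main(void)
    {
        for (int i = 0; i < LOGICAL_REGS; i++)
            rat[i] = next_phys++;

        /* Two back-to-back writes to logical register 0 no longer conflict (WAW removed). */
        rename(0, 1, 2);
        rename(0, 3, 4);
        return 0;
    }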

The P6 is an MCM: the processor die is packaged with a separate die holding a 256KB non-blocking L2 (level 2) cache. The P6 CPU die itself has a non-blocking 8KB L1 (level 1) instruction cache and an 8KB L1 data cache.

The bus out of the core (referred to as “front-side”) is much slower than the dedicated bus to the L2 cache (e.g. 66MHz vs. a 200MHz core clock). Both the “front-side” and “dedicated” buses transfer data as transactions. The P6 bus architecture allows multiple P6s to be interconnected via the “front-side” bus to form SMP (symmetric multiprocessing) systems.

The P7 (co-designed with Hewlett-Packard) is expected to at least double the number of functional units, make extensive use of instruction pre-decoding, and add tag bits to decoded micro-ops to indicate the destination functional unit. The tagging could also specify ranges of instructions that could safely be executed in parallel. While these ideas are associated with VLIW techniques [HennPatt96B], the P7 is expected to remain a superscalar, super-pipelined processor that performs dynamic scheduling and speculative execution.

I/O Technologies

There have been some definite improvements in I/O technologies that can be exploited for both OLTP and OLAP workloads. At the node-to-node (cluster) level, the ANSI/IEEE 1596-1992 SCI (Scalable Coherent Interface) has become a popular standard [MICROFEB96A]; clusters using SCI have announced 1 Gbit/sec. bandwidths. The dominant I/O bus standard is Intel’s PCI (Peripheral Component Interconnect), rated at 133 MBytes/sec. Parallel interfaces such as SCSI are now thirty-two bits wide and are rated at 40 MBytes/sec. Serial interface standards like Serial Storage Architecture (SSA) and Fibrechannel are becoming popular. SSA can deliver 80 MBytes/sec. to a node (typically a RAID array) using two full-duplex (20 MBytes/sec. in each direction) ports. Fibrechannel (a.k.a. Fibre Channel Arbitrated Loop) connections are rated at between 25 and 100 MBytes/sec.; Fibrechannel is a true loop (transfers occur in a unidirectional fashion) [IDCSI95]. Also, IEEE 1394 (FireWire), rated at up to 50 MBytes/sec., should soon be available for PC platforms. These new interface technologies compete with IBM mainframe I/O, which can comprise hundreds of 17 MBytes/sec. ESCON channels.
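
The SSA figure above is simply the sum of its port bandwidths, as the following trivial calculation shows (the 100-channel count used for the mainframe comparison is an assumption):

    /* Bandwidth arithmetic behind two of the figures above. */
    #include <stdio.h>

    int main(void)
    {
        int ssa   = 2 /* ports */ * 2 /* directions */ * 20;  /* = 80 MBytes/sec per node */
        int escon = 100 /* assumed channel count */ * 17;     /* aggregate MBytes/sec */
        printf("SSA per node: %d MBytes/sec\n", ssa);
        printf("100 ESCON channels: %d MBytes/sec aggregate\n", escon);
        return 0;
    }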

Architectural Choices

The architectural choices available in industry and academia are extensive. The focus here will be on architectures that have achieved commercial success. Tanenbaum [TanDistOS95] details a full taxonomy for parallel and distributed computers. Useful definitions include the distinction between tightly-coupled and loosely-coupled machines. Tightly-coupled machines are usually called multiprocessors, and all of their processors share the same memory (address space). Loosely-coupled machines do not share the same memory (address space), and each node runs its own copy of the operating system (OS); these machines are sometimes called multi-computers. Another important distinction is the interconnection between machines (CPUs): bus or switched. A bus implies there is a single network, backplane, bus, or cable that connects all the machines. A switched system does not have a single bus connecting the machines (CPUs); instead there can be many connections between different machines (CPUs), and decisions are made dynamically about which wire a message should be sent on.

With these definitions the parallel and distributed architectures of interest include:

SMP (symmetric multiprocessor), which is a tightly-coupled multiprocessor (and can internally use a bus or a crossbar). The term shared-memory is often used as well. Examples include the IBM System/370, Sequent Symmetry, IBM RS/6000 J40, HP 9000 K-Series, NCR 3455 and 3525, and Sun SS1000 and SS2000.

MPP (massively parallel processor), which is a loosely-coupled multi-computer. Examples include the nCUBE, Tandem Himalaya, and NCR WorldMark 3600 and 5100.

NUMA (non-uniform memory access), which is a tightly-coupled (usually bus-based) multiprocessor whose interconnects have different speeds. NUMA is also known as cache-coherent NUMA or ccNUMA. Examples include Sequent’s NUMA-Q, Data General’s CC-NUMA, and Stanford’s Dash/Flash.

Clusters, which are groups of loosely-coupled uniprocessors, SMPs, or MPPs connected via a cluster interconnect. There are many varieties and features of clusters; one important variety is called shared-disk. Examples of shared-disk clusters include the original VAXcluster, the IBM RS/6000 with HACMP, and IBM’s Sysplex.

There are many variations, of course, and many architectures do not fit neatly into one category.

SMP

SMP systems were introduced with mainframes during the 1960s. SMPs run a single system image: the OS has access to all processors and can dispatch processes (or threads in thread-aware systems) to each processor. Data coherency is provided via a coherency protocol; the two most common are snoopy protocols and directory-based protocols. Many SMP systems have a snoopy bus on which each CPU monitors writes made by the other CPUs and invalidates any data in its local cache that another CPU has changed. A directory-based coherency system maintains the status and location of all data blocks in one location in memory (used by NCR and Sequent).
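
The snoopy-invalidation idea can be illustrated with a toy sketch (in C) in which every cache watches writes on a shared bus and drops its own copy of the written line; the two-state protocol and the sizes are deliberate simplifications, not any vendor’s actual coherency engine:

    /* Toy snoopy invalidation: every cache observes bus writes and invalidates
     * its own copy of the written line. */
    #include <stdio.h>

    #define NCPUS  4
    #define NLINES 8

    enum state { INVALID, VALID };

    struct cache {
        enum state line[NLINES];       /* per-CPU cache line states */
    };

    static struct cache cpu[NCPUS];

    /* Each cache "snoops" a write on the shared bus and drops its own copy. */
    static void bus_write(int writer, int line)
    {
        for (int c = 0; c < NCPUS; c++)
            if (c != writer && cpu[c].line[line] == VALID) {
                cpu[c].line[line] = INVALID;
                printf("cpu%d invalidates line %d (written by cpu%d)\n", c, line, writer);
            }
        cpu[writer].line[line] = VALID;
    }

    int main(void)
    {
        cpu[0].line[3] = VALID;
        cpu[2].line[3] = VALID;
        bus_write(1, 3);               /* cpu1 writes line 3; cpu0 and cpu2 invalidate */
        return 0;
    }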

SMPs have a hard time scaling past 32 processors; if more capacity is required, clusters, NUMA machines, or MPPs can be used. A process that cannot be broken into several processes or threads benefits little from an SMP. The main appeal of SMP is ease of software development: programs can initially run without change and later be converted to take advantage of the multiple processors. SMP has proved very popular in the marketplace, as shown by the list of vendors with SMP offerings.
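
The following minimal POSIX-threads sketch illustrates the conversion step mentioned above: a job split into threads lets the SMP operating system dispatch each thread to its own processor (the array size and thread count are arbitrary assumptions; compile with -pthread):

    /* Split a summation into threads so an SMP can run the slices in parallel. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N        1000000

    static double data[N];
    static double partial[NTHREADS];

    static void *sum_slice(void *arg)
    {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += data[i];
        partial[id] = s;               /* each thread works on a private slice */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, sum_slice, (void *)i);

        double total = 0.0;
        for (long i = 0; i < NTHREADS; i++) {
            pthread_join(t[i], NULL);
            total += partial[i];
        }
        printf("sum = %.0f\n", total);
        return 0;
    }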

The SMP architecture has been the focus of R&D efforts because of its commercial popularity. NUMA resulted from efforts to eliminate the 32-CPU limit. Simpler techniques have been used as well. For example, IBM’s SMP offerings (the RS/6000 J40 and R40 servers) use a crossbar switch to create multiple data paths between units (CPU to CPU or CPU to memory) to increase bandwidth. This provides more bandwidth and accommodates more CPUs than the single bus found on many SMP servers. Sun uses a similar technique in their high-end Ultra servers.

SMP servers will soon be commodities now that Intel is selling a 4-processor P6 motherboard, called the Standard High Volume (SHV) motherboard, which makes developing P6-based SMPs simple. These 4xP6 SMP machines are widely available. The SHV also allows third-party devices to control the processor bus to create even larger systems (see the NUMA section). Intel will be introducing an 8-way motherboard co-developed with NCR in early 1997.

MPP

The MPP model offers enormous computing power at relatively low cost. The original MPP model was a shared-nothing model in which each node (often called a cell or plex) had its own CPU(s), memory system, interconnect interface, and disk, and ran its own copy of the operating system. The cells could have one or more CPUs. Communication between cells occurred via a message-passing library such as PVM (Parallel Virtual Machine) or MPI (Message-Passing Interface) [MICROFEB96B]. Synchronization via message passing is complex (and much slower than hardware synchronization), so in effect hardware simplicity is traded for software complexity.
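
A minimal MPI-style example (in C) shows the explicit send/receive pattern cells must use in a shared-nothing system; it illustrates only the programming model, not any particular vendor’s MPP runtime:

    /* Minimal message-passing sketch using MPI; compile with an MPI wrapper such as mpicc. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which cell am I? */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many cells in total? */

        if (size >= 2) {
            if (rank == 0) {
                value = 42;
                /* Cell 0 ships a value to cell 1; all coordination is explicit messages. */
                MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            } else if (rank == 1) {
                MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                printf("cell 1 of %d received %d from cell 0\n", size, value);
            }
        }

        MPI_Finalize();
        return 0;
    }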

The MPP architecture generated much excitement [CACMJUN92A]. The relational model fit well with highly parallel MPPs for read-only DSS queries: a query can be spread across many commodity processors connected only by a messaging system, the partial results combined, and another relational operator applied to the resultant relation (partitioned and pipelined parallelism). OLTP-type operations that add or modify data are more difficult for MPPs.
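
A toy sketch (in C) of partitioned parallelism for a read-only aggregate query follows; the table contents, hash function, and cell count are invented for illustration, and a real system would run each partition on a separate cell and merge the partial results over the interconnect:

    /* Partitioned parallelism in miniature: hash rows to cells, aggregate locally, merge. */
    #include <stdio.h>

    #define NCELLS 4
    #define NROWS  12

    struct row { int cust_id; double amount; };

    static struct row table[NROWS] = {
        {101, 10}, {102, 20}, {103, 30}, {104, 40}, {105, 50}, {106, 60},
        {107, 70}, {108, 80}, {109, 90}, {110, 15}, {111, 25}, {112, 35},
    };

    int main(void)
    {
        double partial[NCELLS] = {0};

        /* "Scatter" phase: each row is routed to a cell by hashing its key. */
        for (int i = 0; i < NROWS; i++) {
            int cell = table[i].cust_id % NCELLS;
            partial[cell] += table[i].amount;      /* each cell sums only its partition */
        }

        /* "Gather" phase: a coordinator merges the per-cell results. */
        double total = 0.0;
        for (int c = 0; c < NCELLS; c++) {
            printf("cell %d partial sum = %.0f\n", c, partial[c]);
            total += partial[c];
        }
        printf("query result (SUM) = %.0f\n", total);
        return 0;
    }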

Market acceptance of MPP machines has been slower than expected (some MPP vendors have recently filed for bankruptcy protection). The problem for MPP vendors is that MPPs are harder to program than single-image systems and clusters. The data must be properly partitioned to avoid saturating the interconnection network, and partitioning is not simple since some data, e.g. customer information and price information, will be needed by many cells. Compounding the problem is the availability of single-image systems that scale past the limits of SMP by using NUMA technologies. Still, while more difficult to program and partition, MPPs have many advantages, including lower cost (since they can be made from commodity parts without complex and expensive coherency hardware) and enormous computing power.

MPPs are used for very high-end database systems such as Tandem’s NonStop SQL/MP (running the NonStop Kernel OS) and NCR’s Teradata DBC. Tandem’s Himalaya can scale to thousands of RISC processors. NCR’s WorldMark 5100M MPP can run Teradata and Oracle Parallel Server on a P6-based system that can scale to 4096 P6s. MPPs are most often aimed at DSS/OLAP workloads.

There are two popular types of I/O systems for MPPs. One type divides all the cells of the MPP into two non-overlapping sets: compute cells and I/O cells. Compute cells run programs, and I/O cells contain disk drives and run the parallel file system (used by the nCUBE, Intel’s iPSC and Paragon, and Thinking Machines’ CM-5). A parallel file system is required since one job can have multiple processes all performing independent I/O operations. The second type is the same as the first except that I/O devices can be connected to compute cells rather than to the interconnection network (used by the IBM SP2 and the Meiko CS-2) [TOCSAUG96A]. The SP2 has a flexible feature that allows blocks of MPP cells to be configured as up to 8-way SMP systems.

The interconnection network technologies are quite varied. Teradata employs a redundant tree-structured communication network. Tandem uses a three-level duplexed network, with two levels within groups of processors and rings connecting the groups. The SP2 uses multistage switches running at 40 MBytes/sec.

NUMA

Sequent’s recently announced NUMA-Q system is an example of how SMP systems can be scaled past 32 processors. The NUMA-Q’s building block is called a “quad” and consists of four P6 microprocessors, two PCI I/O buses (133 MBytes/sec.), seven PCI slots, memory, and a 500 MBytes/sec. system bus. The interconnect, called IQ-Link, is based on the SCI standard and is rated at 1 GByte/sec. The system can support 63 “quads” for a total of 252 P6s, all running in the same address space (parts of the address space are assigned to each quad).
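
The idea that “parts of the address space are assigned to each quad” can be illustrated with a small sketch (in C) that maps a physical address to its home quad; the amount of memory per quad is an assumption made only for the example:

    /* Map a physical address to its home quad under contiguous range assignment.
     * The per-quad memory size is an assumption for illustration. */
    #include <stdio.h>

    #define MEM_PER_QUAD (1UL << 30)   /* assume 1 GB of local memory per quad */

    static unsigned home_quad(unsigned long paddr)
    {
        return (unsigned)(paddr / MEM_PER_QUAD);   /* assumes paddr is within installed memory */
    }

    int main(void)
    {
        unsigned long addr = 5UL * MEM_PER_QUAD + 4096;
        printf("address 0x%lx lives on quad %u\n", addr, home_quad(addr));
        /* A reference from any other quad travels over the IQ-Link interconnect,
         * which is why memory access latency is non-uniform. */
        return 0;
    }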

Clusters

A cluster is composed of several independent systems, which can be uniprocessors, SMPs, or MPPs. The systems communicate using network messages and shared disks. Clusters can provide (1) a failover (hot standby) mode or (2) an application-scaling/application-sharing (cluster-aware) mode. In failover configurations, if one of the cluster’s nodes fails, another cluster node can take over the failed node’s transactions and locks via the shared disk. The surviving node can also assume the failed node’s network address (via an extra network card), so that clients bound to the original node can continue operation. Capacity of the cluster interconnect is not a large issue for failover clusters since heartbeats are the main cluster-specific interconnect traffic. If the application running on the cluster is cluster-aware (such as Oracle Parallel Server), the application can use all nodes in the cluster, and a distributed lock manager (DLM) handles concurrency issues. This mode is particularly powerful since cluster nodes can be serviced by diverting jobs to the other cluster nodes and then temporarily removing the node from the cluster to perform service.
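
A toy heartbeat-monitoring loop (in C) makes the failover mode concrete; the timeout value and the takeover routine are invented for illustration and do not describe any particular cluster product:

    /* Toy failover monitor: declare the peer dead after a silence threshold. */
    #include <stdio.h>
    #include <time.h>

    #define HEARTBEAT_TIMEOUT 10   /* assumed seconds of silence before declaring failure */

    static time_t last_heartbeat;

    static void take_over_failed_node(void)
    {
        /* In a real cluster: recover the failed node's state from the shared disk,
         * acquire its locks, and assume its network address. */
        printf("peer declared dead: taking over its disks and network address\n");
    }

    static void monitor_peer(time_t now)
    {
        if (now - last_heartbeat > HEARTBEAT_TIMEOUT)
            take_over_failed_node();
    }

    int main(void)
    {
        last_heartbeat = time(NULL) - 30;   /* simulate a peer that stopped responding */
        monitor_peer(time(NULL));
        return 0;
    }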

The most famous cluster is the VAXcluster [DTJSEP87A] [DTJSUM91A]. It is hard to overestimate the impact of the 1984 introduction of Digital’s VAXcluster. While Tandem may have introduced the first cluster with their Guardian system, Digital’s wide market appeal in the 1980s ensured that a wide audience would be exposed to cluster technology. Digital’s clustering knowledge and technology have been viewed as the company’s corporate jewels, and Digital has installed over 400,000 clusters.

Digital’s original cluster design used a shared-disk approach. The VAXcluster’s operating system is called VMS. The cluster network interconnection is called the CI (Computer Interconnect), and each VAX in the cluster connects to it via a CI adapter card. The disk and tape drives in the high-end VAXcluster are connected to the CI via an HSC (Hierarchical Storage Controller) connection (see Figure 2). All VAXs in the cluster can access the cluster’s disks and tape drives. A strong group membership protocol is enforced by the CM (Connection Manager) using the SCS (System Communication Services) layer. Disk and tape drives cannot be written to unless the cluster’s membership holds a greater-than-50% quorum; this prevents network partitions from causing data corruption. A sophisticated distributed lock manager uses the CM to handle concurrency. Disks and tape drives are no longer viewed as being captive to one node, and as long as the software is cluster-aware, scaling a service becomes simple: add another node to the VAXcluster.
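
The quorum rule can be captured in a few lines (in C): writes are permitted only when strictly more than half of the cluster’s votes are present, so a partitioned minority can never write to the shared disks (the vote counts below are illustrative):

    /* Minimal quorum check: writes are allowed only with a strict majority of votes. */
    #include <stdio.h>

    static int have_quorum(int votes_present, int total_votes)
    {
        return 2 * votes_present > total_votes;    /* strictly more than 50% */
    }

    int main(void)
    {
        int total = 5;
        for (int present = 1; present <= total; present++)
            printf("%d of %d votes: writes %s\n", present, total,
                   have_quorum(present, total) ? "allowed" : "blocked");
        return 0;
    }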