Intel Itanium Architecture, Overview (12/1/2011)

Intel® Itanium ® Architecture HM


The Itanium® processor is Intel’s first published, commercial 64-bit computer product, launched in 2001. 64-bit means that the logical address range spans 2^64 different memory cells (bytes), and that natural integer objects are 64 bits wide. The exact format of integer objects is described in section Data and Memory. Floating-point formats are not outlined here. During its development, the first generation of Itanium processors was code-named Merced. The family is now officially called IPF, for Itanium Processor Family, while early in its development it was referred to as IA-64, for Intel 64-bit architecture. This architecture is radically different from the widely used IA-32 architecture. IA-32 is better referred to as the x86 architecture, lest one infer that it is restricted to 32-bit addresses and integer types. That has not been the case since Intel introduced 64-bit computing on x86, about half a year after AMD’s extension of IA-32 to 64 bits; see EM64T below.

Interestingly, IA-32 object code is also executable on the Itanium processor. More interesting yet, even the Hewlett-Packard PA-RISC code is natively executable on this new 64-bit IPF processor. HP was Intel’s strategic partner in the definition, development, and cost sharing of the IPF. One should not make any performance inference: For example, just because IA-32 object code is executable on IPFs, you should not deduce that such code executes on an Itanium processor as fast as, or faster than on an IA-32 processor assuming identical clock speeds.

This chapter succinctly characterizes the IPF, enough to give a flavor of the new architecture; not enough to start programming in Itanium assembly language. It is meant as material for one week’s worth of lectures on Architecture.

Characterization

The Itanium processor is Intel’s (and HP’s) first instance of the new EPIC architecture. EPIC stands for explicitly parallel instruction computing. It is Intel’s first launched 64-bit architecture; the second one was launched a few years later (first quarter of 2004), with EM64T, the 64-bit version of the old x86 architecture. HP already had a 64-bit version of its Performance Architecture (PA) RISC processor at the time Itanium was launched.

Explicit means that it is the intellectual burden of the assembly language programmer (or of the smart compiler) to take advantage of the parallelism in the architecture, i.e. to write Itanium code that exploits the numerous computing modules. One consequence is that compilers for the IPF are highly complex. This is not desirable, since increased complexity means more errors and decreased object code quality, something the promoter of a new architecture should avoid. On the other hand, the IPF provides explicit architectural features that support the writing of highly optimizing compilers. A case in point is the architectural support for software-pipelined (SW PL) loops. Certain source constructs let the compiler emit SW PL loops that need no prologue and epilogue. This renders the object code not only more compact, but also faster.

Parallel means an Itanium processor gains speed not (primarily) via high clock rates, but via simultaneous execution of multiple operations.

Key concepts refined, or newly introduced, in IPF include predication, branch prediction, branch elimination, conditional move, speculation, parallel comparisons, and a large register file.

Itanium is the first implementation of the new 64-bit Intel Architecture. Contrary to what you would expect, the first generation (Merced) implemented only 44 physical address bits of the 64 logical address bits. This means that the physical address range of that first Itanium implementation was only about a millionth of the logical address range, but still about 4000 times larger than that of a conventional 32-bit architecture. In the second generation (Itanium® 2), 56 physical bits of the 64-bit logical address space were implemented in hardware. In the short term, no severe limitations are expected due to the short 56-bit addresses. Integer type operands are of course full 64 bits wide.

Unlike earlier parallel VLIW architectures, EPIC does not use a fixed width instruction encoding. Instead, operations can be combined to execute in parallel, from a single instruction to as many instructions as desired. What is critical in EPIC is that all code is written assuming parallel semantics within a group (to be explained later), and sequential semantics across groups. To be able to run in parallel, the machine is built with multiple execution modules that can all work at the same time. This allows a natural architectural migration from, say, 6 HW modules executing on today’s Itanium to as many as can be crammed into a future silicon chip a few years from now. To illustrate, here is a sample taken from ref [1]:

Consider 2 memory operands a and b to be swapped.

temp := a;

a := b;

b := temp;

The semicolon operator ‘;’ implies sequential semantics. On a machine with parallel semantics, it would be sufficient to write:

a := b,

b := a;

with the comma operator ‘,’ implying parallel semantics, similar to syntactic conventions in the programming language Algol-68. This source snippet is just an example, and is NOT a sample of the Itanium assembly language.

Synopsis

· Definitions

· Data and Memory

· Itanium Registers

· Itanium Instruction Set Architecture (ISA)

· References

Definitions

See also related definitions in the separate chapter on Software Pipelining.

Branch Elimination:

Replacing object code that has multiple execution paths with an equivalent but different instruction stream that is straight-line, lacking branches. The second version with branches eliminated must be semantically equivalent to the original code with branches.

Bundle:

Group of 3 instructions plus a template, that all fit into a 16-byte long, 16-byte aligned section of instruction memory.

Conditional Move:

Move instruction that transfers bits from a source to some destination, but only if an associated condition is true. Otherwise the instruction operates like a noop. Such a move can serve as a special case of branch elimination. For example, the C source construct:

if ( a > 0 ) x = 99; -- HL source program

could be mapped into the conditional move:

cmov x, #99, a, #0, gt -- hypothetical asm

which has no branches. Source operand #99 is moved into memory location x only if the greater-than condition holds between operands a and integer literal #0.

Endianness:

Convention defining the order in which the bytes of a multi-byte data object are addressed. If the byte at the higher address holds the more significant part of the value, we call this little-endian ordering; the opposite we call big-endian ordering.

EPIC:

Explicitly Parallel Instruction Computing, with IPF being the first commercial architecture that implements EPIC.

Epilogue:

When the steady state of a software-pipelined loop terminates, some valid operands are yet to be used. These last operands must be consumed, some must still be generated through operations, and ultimately the pipeline must be drained. This is accomplished in the object code after the steady state, and is called the Epilogue.

Group:

A sequence of instructions, each with an associated template, terminated by a defined stop. A group is composed of one or more bundles. The stop means that the hardware cannot start executing any subsequent group until the current one has completed. The syntax for a stop in Itanium assembler is the double semicolon ;;.

Parallel Comparison:

A composite source condition of the form ( ( a > b ) & ( c <= d ) ) requires multiple steps to compute a boolean predicate. Generally, on a sequential architecture these multiple steps are combined via explicit instructions for anding and oring, or else the flow of control of execution selects a matching true label. All this takes time. The Itanium processor allows parallel evaluation of certain composite Boolean expressions in a single step. The result can be used as a predicate in subsequent instructions. Notice that such combined Boolean expressions must be side-effect free. For example, another expression ( fun( j, k ) & ( i < MAX ) ) cannot be mapped into a parallel compare, since one operand is a call fun( j, k ) with a possibly large number of parameters, which may have a side-effect on one of the other operands, for example “i”, which is yet to be compared.

Predication:

Association of a boolean condition with the execution of an instruction, or an instruction sequence. This allows the following: Two instruction streams can be executed in parallel, which clearly requires multiple hardware modules. Both streams have a predicate associated with their operations. Only the stream with the predicate eventually being true is actually retired; the other will be aborted and ignored. Note that the abort can happen as soon as the predicate is known. This means, the computation of the predicate can proceed in parallel with the execution of the two code streams, but must be complete by the time the 2 code streams need to know which of them is the winner.

An ISA with predication requires instruction bits that specify which of the predicates to use, and which direction (true or false) to select. Also, the discarded code path must not cause any side-effect, such as a write to memory.

Prologue:

Before a SW pipelined loop body can be initiated, hardware resources (e.g. registers) must be initialized, also known as primed. This is accomplished in the object code before the steady state, called the Prologue.

Register File:

The IPF has a rich set of registers. This includes 128 general purpose registers (for integer operations), 128 floating-point, 64 predicate, 8 branch, and 128 so-called Application Registers. A variety of special purpose registers is also visible (i.e. can be accessed by the assembly language programmer), including user mask, stack marker (frame marker), ip, processor id, and performance monitoring registers.

Speculation:

If it is suspected that (but not known for sure whether) an operand o will be used in the future, and this operand is not readily available (say in a high-speed register), and it takes long –relative to instruction execution– to fetch o, a processor may initiate the fetch well before it is actually used. Advantage: by the time o is needed, it is already available without delay. Disadvantage: if the flow of control never reaches the place where o was thought to be needed, then the speculative fetch was superfluous. However, this may still be meaningful, if a) no side-effects occurred that are harmful to program correctness, and b) if the hardware resource required to fetch o was available at no loss to other work.

Steady State:

The software-pipelined object code that is executed repeatedly, after the Prologue has completed and before the Epilogue begins, is called the Steady State. Each iteration of the Steady State makes progress toward multiple iterations of the original source loop.

Syllable:

The instruction-only portion of a bundle. A bundle always holds 3 instructions plus a template, the template specifying additional necessary information about an instruction. The instruction alone, without the needed template information, is a syllable.

Data and Memory

The native data types of the IPF are similar to conventional 32-bit architectures, except for the longer 64-bit integer and unsigned formats. An extension over IA-32 object code is the IPF bundle. Types include integer, unsigned, floating-point, and pointer. Integers are of different widths: byte, word, double-word, or quad-word precision. Length in bits as well as min and max values are listed below.

Type / Byte / Word / Double-word+ / Quad-word+
Integer [bits] / 8 / 16 / 32 / 64
Unsigned [bits] / 8 / 16 / 32 / 64
Pointer [bits] / NA / NA / Compatibility 32 / 64
Float [bits] / NA / NA / 32, 64 / 64, 80

Type / Byte / Word / Double-word / Quad-word
Minint / -128 / -32,768 / -2,147,483,648 / -9,223,372,036,854,775,808
Maxint / +127 / +32,767 / +2,147,483,647 / +9,223,372,036,854,775,807
Minunsigned / 0 / 0 / 0 / 0
Maxunsigned / 255 / 65,535 / 4,294,967,295 / 18,446,744,073,709,551,615

Negative numbers are represented in two’s complement format, with the sign-bit in the most-significant position. To represent floating-point data, IPF uses the IEEE 754 standard. Bits representing integer values are numbered from 0 in the least significant position (rightmost position) to higher values. For example, the most significant bit in a double word is in position 31, shown in the leftmost position of all 32 bits. The maximum address on the first generation Itanium processor (Merced) was only 17,592,186,044,415 or 2^44 - 1. It grew in the second generation to 56 bits, and is now a full 64 bits long.

Bytes are stored in little-endian order by default. However, it is possible to programmatically select little- or big-endian order, by setting the be bit in the user mask, a special status register. The be bit does not affect how instructions are stored or fetched from memory. Object code is always represented in little-endian order; programmer selected endianness only impacts data.

In little-endian order, the least significant data byte is stored in the byte with the lowest address; conversely for big-endian order. A quad-word of data 0x1102030455060708 would be stored:

Data stored in 8 adjacent bytes in memory in little-endian order: