Comparison of the AMD64 and Intel IA-32e Extensions and the Itanium Architecture
By:
Mathew Makai
Michael Parrill
Elizabeth Rommel
David Winfield
Table of Contents
Introduction
AMD64
Itanium
IA-32e 64-bit Extensions
Comparison
Conclusion
Bibliography
Introduction
For years the computing world has relied on 32-bit processors, but as technology advances they are quickly becoming inadequate. As consumers demand more and more memory, the number of addresses available from 32 bits is no longer sufficient. For this reason, among others, many companies have begun producing and marketing 64-bit processors or 64-bit extensions to 32-bit processors. One example of these extensions is AMD64, which expands AMD’s current 32-bit processors while following the x86 architecture. IA-32e is a comparable 64-bit extension to Intel’s 32-bit processors that aims to solve the same problem. The Itanium processor by Intel is a native 64-bit processor sold in various models. Some of these models have been released and are on sale, while others are still in the prototype and development stages.
AMD64
AMD’s approach to the 64-bit computing market is the AMD64 technology. The AMD64 technology doubles the number of processor registers and increases the addressable memory space well beyond the 4GB limit. The AMD64 technology offers leading-edge performance on current software applications while providing a seamless migration to future 64-bit computing.
AMD64 expands the industry-standard x86 instruction set architecture into the AMD 64-bit platform. The platform is unique because it is the first to be fully backwards compatible with existing x86 solutions while still delivering 64-bit performance. In April 2003, AMD released the AMD Opteron processor, a 64-bit processor for servers and workstations. The Opteron uses the AMD64 architecture and opened the door for AMD’s new class of computing. In September 2003, AMD then released the world’s first and only Windows-compatible 64-bit desktop and mobile processors (Bennett, 2003). AMD64 processors are available for servers, workstations, desktops, and mobile PCs, which makes the technology available to anyone.
The AMD64 Instruction Set Architecture (ISA) was created by AMD to extend the x86 architecture, which has been the foundation of all PCs since the 8086 processor was developed in 1978, so that it supports 64-bit registers. The ISA enables 64-bit computing while remaining compatible with 32-bit x86 applications. This lets consumers keep using 32-bit applications and operating systems while making the transition to 64-bit computing at their own pace. AMD achieves this by letting AMD64 processors run in two modes, Legacy mode and Long mode. Legacy mode removes all 64-bit support and lets the processor run in 32-bit mode. Long mode consists of two sub-modes, compatibility mode and 64-bit mode. Compatibility mode is designed for running 32-bit applications under 64-bit operating systems, such as the upcoming Microsoft Windows XP 64-bit Edition and the Windows Server 2003 64-bit edition. The advantage of compatibility mode is that, even though 32-bit applications are still limited to 4 gigabytes of memory, each program can have all 4 gigabytes to itself because 64-bit addressing lets the computer address additional memory. 64-bit mode is used only in a pure 64-bit environment and offers the advantage of eight extra general purpose registers, which are available only in that mode. The AMD64 Instruction Set Architecture improves many aspects of the x86 instruction set. AMD64 differs from Intel’s approach to 64-bit processors as seen in the Itanium line, which uses a completely different architecture (Bennett, 2003).
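The relationship between these modes can be summarized as a small lookup table. The following is only an illustrative sketch: the names `MODES`, `submodes`, and `modes_for_app` are hypothetical, and the values simply restate the description above rather than any hardware specification.

```python
# Illustrative table of the AMD64 operating modes described above.
# Field names are hypothetical; the values only restate the text.
MODES = {
    "legacy":        {"group": "legacy", "os_bits": 32, "app_bits": 32, "gprs": 8},
    "compatibility": {"group": "long",   "os_bits": 64, "app_bits": 32, "gprs": 8},
    "64-bit":        {"group": "long",   "os_bits": 64, "app_bits": 64, "gprs": 16},
}


def submodes(group):
    """Return the mode names that belong to a group, in sorted order."""
    return sorted(name for name, info in MODES.items() if info["group"] == group)


def modes_for_app(app_bits):
    """Return the modes in which an application of the given width can run."""
    return sorted(name for name, info in MODES.items() if info["app_bits"] == app_bits)


if __name__ == "__main__":
    print("Long mode sub-modes:", submodes("long"))
    print("Modes that run 32-bit applications:", modes_for_app(32))
```

Laid out this way, it is easy to see that the eight extra general purpose registers appear only in the row for 64-bit mode.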
A DDR memory controller is built into each AMD64 processor to reduce the time it takes the CPU to access memory. Data still has to travel between the processor and memory, but communication with the memory controller no longer has to leave the processor. The Opteron and Athlon 64 FX processors incorporate a dual-channel memory controller instead of the single-channel controller present in the other AMD64 models. The memory path in this controller is doubled from 64 to 128 bits. Although it is technically not a dual-channel controller, data still flows along two distinct paths; the design doubles the data pathway and makes a true 128-bit connection to system memory. Theoretically, a 64-bit processor could access over 18 exabytes of memory, but in AMD64 the physical address space currently supports up to one terabyte of installed RAM. The upcoming release of Microsoft Windows XP 64-Bit Edition for 64-Bit Extended Systems supports up to 32 gigabytes of RAM and up to 16 terabytes of virtual memory. The current four-gigabyte limitation applies only in Compatibility mode and Legacy mode of the AMD64 ISA (Shimpi, 2003).
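The memory limits quoted above follow directly from the number of address bits. The sketch below works through the arithmetic; the 40-bit physical width is an assumption inferred from the one-terabyte figure, and the helper names are hypothetical. Note that the "18 exabytes" figure uses decimal units, while the helper prints binary units (powers of 1024).

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given width."""
    return 2 ** address_bits


def human(n_bytes):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1024 or unit == units[-1]:
            return f"{value:g} {unit}"
        value /= 1024


if __name__ == "__main__":
    # 32 bits: the limit that applies in Legacy and Compatibility mode
    print("32-bit:", human(addressable_bytes(32)))
    # 40 bits: assumed physical width behind the one-terabyte figure
    print("40-bit:", human(addressable_bytes(40)))
    # 64 bits: the theoretical ceiling, in binary and in decimal exabytes
    print("64-bit:", human(addressable_bytes(64)))
    print("64-bit in decimal exabytes:", addressable_bytes(64) // 10**18)
```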
The AMD64 architecture widens the general purpose registers to 64 bits and adds eight more for a total of 16, which quadruples the general purpose register space currently available. Sixteen 128-bit XMM registers, the registers used by the SSE instructions for enhanced multimedia performance, double the space currently provided for SSE/SSE2 implementations. The Level 1 cache has been increased to 64 KB and is organized as 2-way set associative, meaning that each set contains two cache locations. The Level 1 cache supports two simultaneous 64-bit operations. The Level 2 cache has been increased to 1 MB and is organized as 16-way set associative. When a given line in the L2 cache contains instruction stream information, the ECC bits associated with that line are used to store pre-decode and branch prediction information. The SSE2 instruction set has also been introduced, and 3DNow! Professional, which includes SSE and 3DNow! Enhanced, is supported as well. Because AMD chose to extend the x86 architecture, software can continue to build on x86 technology while receiving all of the benefits of 64-bit computing (Welker, 2004).
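The register and cache figures in this paragraph can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical helper names; the 64-byte cache line size is an assumption, since the text does not state it, so the set counts are illustrative only.

```python
def register_file_bits(count, width_bits):
    """Total storage, in bits, of a register file."""
    return count * width_bits


def cache_sets(size_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache.

    line_bytes=64 is an assumption; the text does not give the line size.
    """
    return size_bytes // (ways * line_bytes)


if __name__ == "__main__":
    old_gpr = register_file_bits(8, 32)    # eight 32-bit x86 registers
    new_gpr = register_file_bits(16, 64)   # sixteen 64-bit AMD64 registers
    print("GPR space grows by a factor of", new_gpr // old_gpr)

    old_xmm = register_file_bits(8, 128)
    new_xmm = register_file_bits(16, 128)
    print("XMM space grows by a factor of", new_xmm // old_xmm)

    print("L1 sets:", cache_sets(64 * 1024, ways=2))
    print("L2 sets:", cache_sets(1024 * 1024, ways=16))
```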
One of the most important features of the AMD64 processors is HyperTransport technology, a high-performance system bus. HyperTransport supports an overall bus speed of 1.6 GHz for up to 6.4 GB/sec of bandwidth per link by using a series of point-to-point data links. HyperTransport links the CPU to the Northbridge chipset, the Southbridge chipset, system memory, and even other CPUs in a multiprocessor Opteron system. This technology is present throughout the AMD64 processor line, but some models have fewer HyperTransport links (Bennett, 2003). Coherent HyperTransport links carry cache-coherency information between processors that share data in a multiprocessor system. Opterons have three HyperTransport links, allowing a total system bandwidth of 19.2 GB/sec. The 800 Series Opteron has three coherent links, which allow up to four CPUs, while the 200 Series combines one coherent link, allowing up to two CPUs, with two non-coherent links. The Opteron 100 Series has three non-coherent HyperTransport links, because with just one processor there is no coherency issue to worry about. The Athlon 64 and Athlon 64 FX are made for single-processor use, so they have no need for coherent links. Their system bandwidth is limited to 6.4 GB/sec with one HyperTransport link (Shimpi, 2003). The HyperTransport specification allows plenty of bandwidth for current hardware and applications and will support upcoming technologies such as PCI Express, which is expected to replace AGP as the standard graphics card interface.
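The bandwidth totals above are simply the per-link figure multiplied by the number of links. The sketch below tabulates the link configurations as described in this paragraph; the table layout and the function name `total_bandwidth` are hypothetical, and the numbers are taken from the text rather than from the HyperTransport specification.

```python
LINK_GB_PER_SEC = 6.4  # per-link bandwidth quoted in the text

# Link configuration per processor, as described above.
PROCESSORS = {
    "Opteron 800": {"coherent": 3, "non_coherent": 0, "max_cpus": 4},
    "Opteron 200": {"coherent": 1, "non_coherent": 2, "max_cpus": 2},
    "Opteron 100": {"coherent": 0, "non_coherent": 3, "max_cpus": 1},
    "Athlon 64":   {"coherent": 0, "non_coherent": 1, "max_cpus": 1},
}


def total_bandwidth(name):
    """Aggregate HyperTransport bandwidth in GB/sec for one processor."""
    config = PROCESSORS[name]
    links = config["coherent"] + config["non_coherent"]
    # Round to one decimal place to hide floating-point noise.
    return round(links * LINK_GB_PER_SEC, 1)


if __name__ == "__main__":
    for name in PROCESSORS:
        print(f"{name}: {total_bandwidth(name)} GB/sec")
```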
The AMD64 processors released so far include the Opteron, Athlon 64, Athlon 64 FX, and Mobile Athlon 64. The Opteron series supports multiprocessing and is marketed towards servers and workstations. The Athlon 64 FX is a port of the Opteron to the single-processor line and is marketed towards gamers. The Athlon 64 and Mobile Athlon 64 are marketed as high-end processors for desktops and notebooks (Welker, 2004). AMD is making the transition from 32-bit to 64-bit computing easy by making the technology available to everyone.
With all of the improvements made to the x86 instruction set, AMD is paving the way for the future of computing. Introducing the first Windows-compatible 64-bit processor and making sure that it is backwards compatible with current and older instructions lets people adapt to the new technology at their own pace. Users are therefore not forced to throw everything out and start from scratch, as they are with the Itanium architecture. Until now, AMD has always been seen as second best to Intel, but the AMD64 technology has shown that AMD has what it takes to at least compete with Intel, if not outperform it.
Itanium
The Itanium Processor is the result of a partnership between Hewlett Packard (HP) and Intel Corporation. While many other companies have created 64-bit and even 128-bit processors, Intel and HP hope to successfully introduce a 64-bit processor into an environment which is dominated by 32-bit (IA-32) processors.
Throughout the years, computer scientists have underestimated how fast computers would need to be in the future, and the move to 64-bit processors continues this pattern. One reason to move past 32-bit processors is their limitations, one of which is memory addressing. The IA-32 processors can, without special help, address only 4 GB of memory at one time. It is possible to use special memory paging application program interfaces (APIs), but these APIs slow down performance because it takes longer to translate each address (Intel Corporation, 2002a). In addition, under Microsoft Windows half of the addressable memory is allocated to the operating system. The Itanium processors can address as many as 2^64 memory locations, which should be sufficient for some time.
Initially, software written for the IA-32 processors did not run as efficiently on the Itanium processors as it did on IA-32, because it had to be run through an emulator (Intel Corporation, 2002a). The only other option was to port the software to the new 64-bit environment. With the improvements introduced in the Itanium 2 processors, this is no longer the case: the Itanium 2 can run both 32-bit and 64-bit software.
The Itanium 2 has a top clock speed of 1.5 GHz and a 400 MHz system bus. It has three caches, all on-die, meaning that they are all on the processor chip. The L1 cache is not unified: the instruction cache (L1I) and the data cache (L1D) are separate. The L1D caches only integer data; floating-point data must use another cache. The L1I and the L1D are each 16 KB, for a combined total of 32 KB. The L2 is a unified cache that also serves floating-point data (Intel Corporation, 2002a). Its size is 256 KB. The L3 cache, while still on-die, is separate from the regular system bus and is accessed through a 128-bit back-side bus. It is the largest of the three caches at 6 MB. The L2 and L3 caches can be accessed at the full clock speed of the processor, which minimizes overall latency (Intel Corporation, 2002a).
The Itanium processors have three Translation Lookaside Buffers: the first-level Data Translation Lookaside Buffer (DTLB1), the second-level Data Translation Lookaside Buffer (DTLB2), and the Instruction Translation Lookaside Buffer (ITLB). The DTLB1 has only 32 entries, and its main job is to keep a cached copy of entries from the main table held in the DTLB2. The DTLB2, on the other hand, has 96 entries and keeps a record of the defined page sizes, the data Translation Registers (TRs), and the entries in the data Translation Cache (TC) (Intel Corporation, 2002a). The ITLB is very similar to the DTLB2, except that it holds only 64 entries and holds instruction TRs and TCs instead of data ones.
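The way a small first-level buffer sits in front of a larger second-level one can be sketched as a two-level lookup. This is a simplified model, not the actual hardware behavior: the least-recently-used replacement policy, the class `TLB`, the function `translate`, and the dictionary standing in for the page table are all assumptions made for illustration. Only the entry counts (32 and 96) come from the text.

```python
from collections import OrderedDict


class TLB:
    """A tiny translation buffer with (assumed) LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page -> physical frame

    def lookup(self, vpn):
        if vpn in self.entries:
            self.entries.move_to_end(vpn)  # mark as most recently used
            return self.entries[vpn]
        return None

    def insert(self, vpn, pfn):
        self.entries[vpn] = pfn
        self.entries.move_to_end(vpn)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def translate(vpn, dtlb1, dtlb2, page_table):
    """Return (frame, where_found) for a virtual page number."""
    pfn = dtlb1.lookup(vpn)
    if pfn is not None:
        return pfn, "DTLB1"
    pfn = dtlb2.lookup(vpn)
    source = "DTLB2"
    if pfn is None:
        pfn = page_table[vpn]      # slow path: consult the page table
        dtlb2.insert(vpn, pfn)
        source = "page table"
    dtlb1.insert(vpn, pfn)         # keep a copy in the first level
    return pfn, source
```

With a 32-entry first level and a 96-entry second level, a page that has been pushed out of the first level by other accesses can still be found in the second level without going back to the page table.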
The Itanium processor pipeline is based on EPIC (Explicitly Parallel Instruction Computing) when executing its “fetch, decode and execute” cycle. The pipeline is broken up into 10 stages, though not all of them can operate in parallel with each other. The first three stages fetch the instructions and deliver them in such a way that the front end of the machine can work independently of the back end. The next two stages involve dispersing the instructions and renaming the registers. The sixth and seventh stages involve reading the register file and distributing the data. The last three stages handle parallel execution, exception handling, and retirement of the instruction. The pipeline also provides multiple execution units: six integer ALUs, six multimedia ALUs, two Extended Precision Floating Point Units, two Single Precision Floating Point Units, and two Load/Store Units. With this combination, even though not all of these units are used in parallel at once, the processor can fetch, decode and execute six instructions per clock cycle (Intel Corporation, 2002a).
Working alongside the pipeline is a structure that handles prefetching. Prefetching acts as a bridge between the L1I and L2 caches: instructions are prefetched from the L2 cache in the hope of preventing misses in the L1I. Instructions are prefetched speculatively, and the choice of which instructions to prefetch is made by branch prediction. The processor holds up to four different predictions, all of which are made in parallel with the instruction prefetch from the L2 to the L1I.
The Floating Point Units listed above have a four-stage pipeline. This pipeline allows either two floating-point (FP) operations or two integer multiplications at once, along with two FP loads and two FP stores. Two FP Multiply Accumulate (FMAC) hardware units are also supported. These FMACs can execute single-precision, double-precision, or mixed FP operations.
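A multiply-accumulate unit computes a product and a sum in one operation, which is why two such units can be counted as four floating-point operations per cycle. The sketch below illustrates that arithmetic under stated assumptions: the function names are hypothetical, counting each multiply-accumulate as two operations is a common convention rather than something stated in the text, and the 1.5 GHz clock is the Itanium 2 figure given earlier. The result is a theoretical peak, not a measured value.

```python
def fmac(a, b, c):
    """Multiply-accumulate: a * b + c.

    A hardware FMAC does this in one operation; plain Python
    arithmetic is used here only to show what is computed.
    """
    return a * b + c


def peak_gflops(fmac_units, clock_ghz, flops_per_fmac=2):
    """Theoretical peak in GFLOPS.

    flops_per_fmac=2 assumes the multiply and the add are counted separately.
    """
    return fmac_units * flops_per_fmac * clock_ghz


if __name__ == "__main__":
    print("fmac(2.0, 3.0, 1.0) =", fmac(2.0, 3.0, 1.0))
    print("Peak with two FMACs at 1.5 GHz:", peak_gflops(2, 1.5), "GFLOPS")
```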
The Itanium architecture implements a very large number of registers, which minimizes reads from and writes to memory. There are 128 general registers (GR), 128 floating-point registers (FR), 64 predicate registers (PR), and 8 branch registers (BR). Each general register is 64 bits wide and provides the processor with integer and multimedia integer computation. All general registers are available to all programs. The general registers are divided into two sets. GR0 through GR31 are static, and GR0 always reads as 0 when used as an operand. GR32 through GR127 are stacked registers and are made available to programs in groups. All floating-point registers are 82 bits wide and are also divided into two sets. FR0 through FR31 are static, and FR0 and FR1 read as +0.0 and +1.0, respectively, when sourced as operands. FR32 through FR127 may be renamed in order to accelerate loops.
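The split between static and stacked general registers, and the special behavior of GR0, can be modeled in a few lines. This is a toy model with a hypothetical class name; rejecting writes to GR0 with an exception is a modeling choice made here for clarity, not a description of the exact hardware response.

```python
class GeneralRegisters:
    """Toy model of the 128 64-bit Itanium general registers."""

    STATIC = range(0, 32)     # GR0 through GR31
    STACKED = range(32, 128)  # GR32 through GR127
    MASK = 2**64 - 1          # registers hold 64 bits

    def __init__(self):
        self._regs = [0] * 128

    def read(self, n):
        # GR0 always reads as 0 when used as an operand.
        return 0 if n == 0 else self._regs[n]

    def write(self, n, value):
        if n == 0:
            # Modeling choice: treat a write to GR0 as an error.
            raise ValueError("GR0 is read-only")
        self._regs[n] = value & self.MASK  # keep only the low 64 bits


if __name__ == "__main__":
    gr = GeneralRegisters()
    gr.write(32, 12345)
    print("GR32 =", gr.read(32), "| GR0 =", gr.read(0))
    print("stacked registers:", len(GeneralRegisters.STACKED))
```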
Predicate registers (PR) hold the results of comparisons. There are 64 PRs, each only 1 bit wide. PR0 through PR15 are static, and PR0 always reads as 1 when used as an operand. The rest, PR16 through PR63, are rotating registers and are used to make pipelined loops more efficient. Branch registers (BR) hold the branching information discussed above. There are only 8 of these registers, and each is 64 bits wide. Unlike the other register files, they are not subdivided, and the branching information they hold consists of the address of each predicted branch.
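One way to picture how a 1-bit predicate register is used is to let a comparison set a pair of predicates and then guard each following instruction with one of them, so that no branch is needed. The sketch below models this idea in Python. It assumes the common Itanium pattern in which a compare writes a predicate and its complement; the function name and the choice of PR6 and PR7 are arbitrary, and the instruction mnemonics in the comments are only indicative.

```python
def predicated_max(a, b):
    """Compute max(a, b) in the style of predicated execution."""
    pr = [False] * 64
    pr[0] = True  # PR0 always reads as 1

    # cmp.gt p6, p7 = a, b
    # The compare writes a predicate and its complement (assumed pattern).
    pr[6] = a > b
    pr[7] = not pr[6]

    result = None
    # (p6) mov result = a   -- takes effect only when p6 is set
    if pr[6]:
        result = a
    # (p7) mov result = b   -- takes effect only when p7 is set
    if pr[7]:
        result = b
    return result


if __name__ == "__main__":
    print(predicated_max(3, 9), predicated_max(9, 3), predicated_max(4, 4))
```

Both guarded instructions are issued on every call, but exactly one of them has an effect, which is the property that lets the hardware avoid a conditional branch.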
The Register Stack Engine (RSE) is used to manage the registers. It ensures that a register is not overwritten when an empty register of the same type is available, and it handles register spilling. If, for instance, all the registers hold live information, the RSE saves the registers that are about to be overwritten to memory so that their contents can easily be restored later. Such an implementation gives the illusion of an unlimited supply of registers (provided the computer’s memory does not fill up).
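The spill-and-restore behavior described above can be sketched as a small simulation. This is a deliberately simplified model: it spills whole procedure frames rather than individual registers, the class and method names are hypothetical, and the default of 96 physical registers is taken from the GR32 through GR127 range mentioned earlier.

```python
class RegisterStackEngine:
    """Toy model: spill the oldest frames to memory when registers run out."""

    def __init__(self, physical=96):
        self.physical = physical   # stacked registers available
        self.resident = []         # frames held in registers, oldest first
        self.backing_store = []    # frames spilled to memory

    def _used(self):
        return sum(len(frame) for frame in self.resident)

    def call(self, values):
        """Allocate a new frame, spilling older frames if needed."""
        frame = list(values)
        if len(frame) > self.physical:
            raise ValueError("frame larger than the register file")
        while self._used() + len(frame) > self.physical:
            self.backing_store.append(self.resident.pop(0))  # spill oldest
        self.resident.append(frame)

    def ret(self):
        """Drop the current frame and return the caller's frame, if any."""
        self.resident.pop()
        if not self.resident and self.backing_store:
            # Fill: bring the most recently spilled frame back.
            self.resident.append(self.backing_store.pop())
        return self.resident[-1] if self.resident else None


if __name__ == "__main__":
    rse = RegisterStackEngine()
    rse.call(range(60))        # caller uses 60 registers
    rse.call(range(100, 150))  # callee needs 50 more, so the caller is spilled
    print("frames in memory:", len(rse.backing_store))
    print("caller restored intact:", rse.ret() == list(range(60)))
```

From the program's point of view, the caller's registers are intact after the return, even though they spent part of the call in memory.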
Switching between the IA-32 and Itanium instruction sets requires only three special instructions, along with interruptions (Intel Corporation, 2002b). The first is jmpe, an IA-32 instruction that jumps to a targeted Itanium instruction and tells the processor to use the Itanium instruction set. The second is br.ia, an Itanium instruction that branches to an IA-32 instruction and makes IA-32 the active instruction set (Intel Corporation, 2002b). The last is rfi, which stands for “Return from Interrupt.” This is another Itanium instruction, and it returns to either a targeted IA-32 or Itanium instruction.
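The three transitions can be written down as a small state machine. The sketch below is illustrative only: the function name and its arguments are hypothetical, the IA-32 jump instruction is written as `jmpe` (the spelling used in Intel's manuals), and interruptions are left out of the model for brevity.

```python
IA32 = "IA-32"
ITANIUM = "Itanium"


def next_instruction_set(current, instruction, rfi_target=None):
    """Return the active instruction set after a switching instruction.

    rfi_target names the set that rfi returns to; it is a hypothetical
    parameter standing in for the saved processor state.
    """
    if current == IA32 and instruction == "jmpe":
        return ITANIUM          # IA-32 instruction that jumps to Itanium code
    if current == ITANIUM and instruction == "br.ia":
        return IA32             # Itanium instruction that branches to IA-32 code
    if current == ITANIUM and instruction == "rfi":
        if rfi_target not in (IA32, ITANIUM):
            raise ValueError("rfi needs a target instruction set")
        return rfi_target       # rfi can return to either set
    raise ValueError(f"{instruction!r} does not switch sets from {current}")


if __name__ == "__main__":
    state = IA32
    for instruction, target in [("jmpe", None), ("br.ia", None)]:
        state = next_instruction_set(state, instruction, target)
        print(instruction, "->", state)
```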
IA-32e 64-bit Extensions
The IA-32e 64-bit extension set is Intel’s response to the AMD64 architecture used in the Athlon 64 and Opteron processors. Although Intel already sells the 64-bit Itanium processor, widespread adoption of that architecture remains elusive in both the desktop and server markets. The central problem with the Itanium is that 32-bit applications must run in an emulation mode whose performance is vastly inferior to that of current 32-bit processors. The new IA-32e extensions allow users to run current 32-bit programs while also supporting future 64-bit applications. However, the specifications of the new IA-32e operating modes differ from those of previous chips, and applications must be recompiled to take advantage of the 64-bit extensions. Both the Xeon server line and the Pentium desktop line of processors will support the new IA-32e extensions when they are implemented in mid-2004 (Turner, 2004).