Chapter 8
CPU and Memory: Design, Enhancement, and Implementation

8.1 (BL3) To keep the x86 current, Intel has added numerous features in the more than thirty years since the basic design was first released. The goal has been to maintain the ability to execute older programs while providing increased capabilities for modern systems. The result is an instruction set with instruction words ranging from a single byte to fifteen bytes, a variety of op codes of different lengths, a number of different and inconsistent addressing formats, registers ranging in size from eight bits to 128 bits, and more. All of these modifications have made support for superscalar processing difficult. Nonetheless, Intel has come up with a number of creative solutions that make superscalar processing possible and effective.

(1) Since instructions don't fall on well-defined, consistent address boundaries, early x86 processors used a buffer to load a large number of bytes into the CPU for instruction analysis and preprocessing. This allowed the CPU to identify upcoming instructions in advance of their execution. More recently, Intel expanded this buffer into a large instruction pool.

(2) Intel created a secondary, uniform-length internal instruction set. Commonly used instructions are reformatted into the secondary instruction set for pipelining. Rarely used instructions are executed as built-in software subroutines.

(3) Intel created multiple execution units with different pipelines for different types of instructions, together with rename registers that can be used for the temporary storage of data. This allowed Intel to create algorithms for determining data dependencies and allowing out-of-order processing. An instruction completion pool allows the machine to retire instructions in the correct order. The parallel execution units make it possible to process multiple instructions in parallel. Essentially, the CPU pulls instructions from the pool as they become available for execution, requiring only that doing so does not affect the outcome of the original program; a minimal sketch of this selection rule appears below. This allows the CPU to keep the various pipelines filled as much as possible.

With this flexibility in place, the x86 is able to implement superscalar processing.
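
The following minimal Python sketch illustrates the selection rule described in point (3). The instruction format, latencies, and single-issue scheduler are invented for illustration and do not correspond to any actual Intel design; real hardware tracks dependencies with considerably more machinery.

    # Each instruction: (text, destination register, source registers, latency).
    program = [
        ("MUL R5,R1,R2", "R5", {"R1", "R2"}, 15),  # long-latency multiply
        ("ADD R5,R5,R3", "R5", {"R5", "R3"}, 1),   # depends on the MUL
        ("ADD R7,R4,R6", "R7", {"R4", "R6"}, 1),   # independent of both
    ]

    # Cycle at which each register's value becomes available.
    ready_at = {r: 0 for r in ("R1", "R2", "R3", "R4", "R5", "R6", "R7")}

    pool = list(program)        # the instruction pool
    clock = 0
    while pool:
        for instr in pool:      # issue at most one ready instruction per cycle
            text, dest, srcs, latency = instr
            if all(ready_at[r] <= clock for r in srcs):
                print(f"cycle {clock:2d}: issue {text}")
                ready_at[dest] = clock + latency
                pool.remove(instr)
                break
        clock += 1

Run as written, the independent ADD issues on cycle 1, while the dependent ADD waits until cycle 15 for the multiply's result — exactly the behavior the instruction pool is designed to exploit.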

8.2 (BL2) The instruction that requires four execution steps stalls in the pipeline, while the instruction behind it waits for three instruction cycles, creating a three-cycle delay.

8.3 This problem illustrates the issue of data dependencies.

a. (BL3) If both instructions access an integer unit simultaneously, the ADD instruction will operate on the initial data in R5, 0, instead of the result from the multiplication, as obviously intended. Furthermore, the final value in R5 will then be destroyed when the MULTIPLY completes and stores its own value, 24, into R5. It is as though the ADD instruction never executed. If data dependencies are taken into account, the CPU recognizes that the MULTIPLY instruction must complete before the ADD instruction can proceed. Therefore, the execution unit with the ADD instruction sits idle for 15 cycles.

b. (BL3) If out-of-order execution is permitted, the idle execution unit could execute other, non-dependent instructions while it waits. The problem states that the later ADD instruction meets the requirements. Although it can perform the addition, it must wait for the R5 register to become available to hold the result until the previous MULTIPLY and ADD are completed, so there is no real gain. If the system provides rename registers, however, the later ADD can be executed and its result stored in a rename register standing in for R5. When the instruction is retired, the result is moved to the actual R5 register, preserving the correct flow of the program. In this case, we have gained the cycles normally required for the last ADD instruction by executing it out of order.
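
A minimal Python sketch of the rename-register idea follows. The mapping table and register names are hypothetical; the point is only that a later write to R5 can be directed to a fresh physical register, so it need not wait for earlier instructions that still use the old R5.

    # Architectural register names are mapped to physical (rename) registers.
    free_list = ["P1", "P2", "P3"]     # unused physical registers
    rename_map = {"R5": "P0"}          # R5's current value lives in P0

    def rename_destination(arch_reg):
        """Give a new value of arch_reg a fresh physical register."""
        phys = free_list.pop(0)
        rename_map[arch_reg] = phys
        return phys

    # The earlier MULTIPLY/ADD pair still read and write P0. The later,
    # independent ADD gets its own register immediately:
    target = rename_destination("R5")
    print(f"later ADD writes {target}; retirement copies it to R5 in program order")

The design choice is the one described above: the rename register holds the result until the instruction is retired, at which point the value becomes the architectural R5.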

8.4 (BL3) A branch dependency requires a delay while the pipeline is refilled with instructions from the correct branch. If it is known that the CPU always executes the two instructions following the branch instruction, then those instructions can be executed while the refilling operation takes place, eliminating the delay. To be useful, this technique requires reordering instructions in such a way that the two instructions following a branch are free of dependencies and would normally be executed prior to the branch.

8.5 (BL3-) The student would observe that most backward conditional branches occur as a result of loops, and most forward branches result from IF-THEN-ELSE conditions. These results are consistent with normal programming situations. IF-THEN-ELSE situations always execute a block of code following the decision point; statistically, true conditions are executed more frequently than false conditions, so a program design that branches forward on true conditions will predict correctly more than 50% of the time. The prediction that a backward branch will be taken performs even better, since most loops are repeated multiple times, often many times.
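
The loop argument can be quantified with a short calculation. Assuming, purely for illustration, that a loop body executes an average of n times, the backward branch at the bottom is taken n - 1 times and falls through once:

    # Accuracy of "predict backward branches taken," assuming each loop
    # runs an average of n iterations (illustrative values, not measured data).
    for n in (2, 10, 100):
        accuracy = (n - 1) / n      # taken n-1 times out of n executions
        print(f"average {n:3d} iterations -> {accuracy:.0%} predicted correctly")

Even very short loops give at least 50% accuracy, and typical loops push the prediction well above 90%.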

8.6 (BL2+ - 4) This is a creative problem, with no single best answer, but one would expect a solution that provides multiple personnel, representing each execution unit, and possibly one or more assembly lines to represent pipelining. One solution suggests a fast-food restaurant in which orders are taken and hung in order on clips awaiting action by the kitchen personnel. A number of cooks, representing execution units, take the orders and prepare them. To keep the diners happy, orders for a single table are delivered together (data dependencies) and orders are delivered to tables in the order that they were taken (in-order completion).

8.7a. (BL2+) The LOAD and STORE each take five steps (or four, if you assume that the PC may be incremented in parallel with other operations). The ADD and SUBTRACT also require five (or four) steps, IN and OUT require four (or three), SKIPs require four (or three), and JUMPs require three. Then a typical program mix requires

S = 0.25 (5 + 5) + 0.10 (5 + 5 + 4 + 4) + 0.05 (4 + 3) = 4.65 steps per instruction on average.

If the clock ticks at 10 MHz, the number of instructions executed in one second is

N = 10,000,000 / 4.65 = approximately 2.15 million instructions per second.

If we assume that the PC is incremented in parallel with execution, then the result is

N = 10,000,000 / ( 0.25 (4 + 4) + 0.10 (4 + 4 + 3 + 3) + 0.05 (3 + 3) )

= approx. 2.7 million IPS

b. (BL2+) With pipelining, each instruction is reduced by the two steps required for the fetch. (This, of course, assumes that bubbles in the pipeline can usually be avoided.) Then,

N = 10,000,000 / ( 0.25 (2 + 2) + 0.10 (2 + 2 + 1 + 1) + 0.05 (2 + 1) )

= approx. 5.7 million IPS

Although these calculations are only rough estimates of CPU performance, they illustrate clearly the substantial gains that have resulted from sophisticated modern design techniques.
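
These figures are easy to verify mechanically. The short Python script below reproduces all three calculations; the step counts are exactly those assumed above.

    # Average steps per instruction and resulting IPS for a 10 MHz clock.
    CLOCK = 10_000_000

    def steps_and_ips(load, store, add, sub, inp, out, skip, jump):
        # Mix: LOAD/STORE 25% each, ADD/SUB/IN/OUT 10% each, SKIP/JUMP 5% each.
        s = (0.25 * (load + store) + 0.10 * (add + sub + inp + out)
             + 0.05 * (skip + jump))
        return s, CLOCK / s

    for label, counts in [
        ("serial fetch      ", (5, 5, 5, 5, 4, 4, 4, 3)),
        ("parallel PC update", (4, 4, 4, 4, 3, 3, 3, 3)),
        ("pipelined         ", (2, 2, 2, 2, 1, 1, 2, 1)),
    ]:
        s, n = steps_and_ips(*counts)
        print(f"{label}: {s:.2f} steps/instruction, {n / 1e6:.2f} million IPS")

The output shows 4.65, 3.70, and 1.75 steps per instruction, or roughly 2.15, 2.70, and 5.71 million IPS respectively.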


8.8 (BL1) 2 billion instructions per second; approximately 6 billion instructions per second, assuming that instructions can be processed continuously, with no delays for memory access and data dependencies.

8.9 (BL3) To see the interaction between the program and the cache memory, first suppose that the program processes the array one column at a time. Each cache line holds sixteen values. Since the array is stored in column order, each cache line holds the next sixteen values of the column being processed, so all sixteen values are used immediately; at that point, the line is finished with “forever.” This is true for the entire array, so the total number of cache transfers is 160,000/16, or 10,000 transfers.

Now consider processing the array one row at a time. This situation is more difficult. As the program processes the first row, each value is found in a separate part of linear memory, and a new cache line must be loaded for each value in the row. When the 301st value is accessed, the cache is full, and a previous cache line must be replaced. No matter which line is replaced, values in that line will be required for the next several rows, and the line will have to be reloaded. There are a number of different ways of selecting a cache line for replacement, but for simplicity we can assume that the oldest line is replaced. In that case, the first 100 lines loaded will be replaced in order to finish the first row. Processing the second row will then require reloading each of those lines, which in turn evicts the next hundred, and so on. The result is a new cache line transfer for every value in the array. Although it would be possible to design a replacement algorithm that reduces the number of transfers somewhat, the improved algorithm would be more complex and would apply only to particular cases, making it impractical and not worthwhile.
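
Both counts can be checked with a small simulation. The Python sketch below assumes the 400 x 400 array implied by the numbers above (160,000 values, stored in column order), a 16-value cache line, a 300-line cache, and oldest-first (FIFO) replacement.

    from collections import OrderedDict

    ROWS = COLS = 400      # 160,000 values, stored column by column
    LINE = 16              # values per cache line
    CAPACITY = 300         # cache lines available

    def count_transfers(access_order):
        cache = OrderedDict()                  # insertion order = age
        transfers = 0
        for row, col in access_order:
            line = (col * ROWS + row) // LINE  # column-major linear address
            if line not in cache:
                transfers += 1                 # cache line load
                if len(cache) >= CAPACITY:
                    cache.popitem(last=False)  # evict the oldest line
                cache[line] = True
        return transfers

    by_column = [(r, c) for c in range(COLS) for r in range(ROWS)]
    by_row    = [(r, c) for r in range(ROWS) for c in range(COLS)]
    print("column order:", count_transfers(by_column))  # 10000
    print("row order:   ", count_transfers(by_row))     # 160000

The simulation confirms the analysis: 10,000 transfers in column order, and one transfer for every one of the 160,000 values in row order.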

8.10 (BL1+) When a cache miss occurs, a new cache line must be moved from memory to cache before the memory access can occur. In the simplest case, there is space available in the cache, and the only delay is the additional time required to access conventional memory to load the cache line. If it is necessary to write back another cache line to make space for the new line, however, there is an additional delay for writing the old cache line to memory.
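
The cost of these cases can be expressed as an effective access time. The timings below are invented purely for illustration: a 1 ns cache access, 30 ns to load a line from memory, and another 30 ns when a modified line must be written back first.

    HIT_TIME, LOAD_TIME, WRITE_BACK_TIME = 1.0, 30.0, 30.0   # nanoseconds

    def effective_access_ns(hit_rate, dirty_fraction):
        """Average access time; dirty_fraction = misses needing a write-back."""
        miss_rate = 1.0 - hit_rate
        miss_penalty = LOAD_TIME + dirty_fraction * WRITE_BACK_TIME
        return HIT_TIME + miss_rate * miss_penalty

    print(effective_access_ns(0.95, 0.0))   # 2.5 ns: no write-backs needed
    print(effective_access_ns(0.95, 0.5))   # 3.25 ns: half the misses write back

Even a modest fraction of write-backs adds measurably to the average, which is the extra delay the answer describes.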

8.11 (BL1+) The tag identifies the actual physical memory addresses corresponding to the data stored in a particular cache block. As such, the tags together act as a directory to identify the physical memory addresses for everything stored in cache memory.

8.12 (BL1+) Both memory write-through and write-back result from the fact that writes to cache are actually destined for physical memory, so writes to cache must ultimately be transferred to physical memory. Memory write-through writes every cache write to memory immediately. This assures that physical memory is always current, and it simplifies the design. The cost is that a memory write is required for every cache write, which slows performance. The write-back technique writes to physical memory only when a cache line is replaced, which is faster, but there is a small risk of data loss, and careful design is required to assure that data loss doesn't occur.
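
The trade-off can be made concrete with a toy model. The two classes below are a sketch, not a realistic cache: each holds a single line and simply counts memory writes under the two policies.

    memory = {}

    class WriteThroughLine:
        def __init__(self):
            self.line, self.mem_writes = {}, 0
        def write(self, addr, value):
            self.line[addr] = value
            memory[addr] = value       # every cache write also goes to memory
            self.mem_writes += 1

    class WriteBackLine:
        def __init__(self):
            self.line, self.dirty, self.mem_writes = {}, False, 0
        def write(self, addr, value):
            self.line[addr] = value    # memory is temporarily out of date
            self.dirty = True
        def evict(self):
            if self.dirty:             # one memory write, only at replacement
                memory.update(self.line)
                self.mem_writes += 1
                self.dirty = False

    wt, wb = WriteThroughLine(), WriteBackLine()
    for value in range(100):           # 100 writes to the same location
        wt.write(0, value)
        wb.write(0, value)
    wb.evict()
    print(wt.mem_writes, wb.mem_writes)   # 100 versus 1

One hundred writes to the same location cost one hundred memory writes under write-through but only one under write-back, which is exactly the performance difference, and the data-loss window, described above.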

8.13 (BL2) Master-slave multiprocessing uses a single CPU core to manage the system and delegate work to the slave cores. The master CPU represents a single point of failure for the system. In addition, the master CPU carries the load of delegating work to the slaves; if the system load is heavy, the master may itself be too busy to keep the slaves fully occupied. On the other hand, master-slave multiprocessing is useful when the slaves are assigned specific, known tasks that can be executed relatively independently. This is the case, for example, with the Cell processor used in the Sony PlayStation 3 game console. The slaves perform specifically assigned, compute-intensive graphics processing tasks, reducing the burden on the master CPU.

For most general purpose processing, symmetric multiprocessing is to be preferred. Work can be divided evenly among the various CPU cores, and each has access to the data and program instructions, so that any CPU can pick up work as it becomes available. There is no single point of failure. However, the shifting of a program from core to core during execution can make it more difficult to implement cache memory in such a way that data integrity is maintained. Furthermore, the operating system requires more sophistication to delegate work effectively and efficiently.

In general, you would select symmetric multiprocessing for fault-tolerant systems, since the failure of a single CPU slows down, but does not normally disable, system processing. This is true because in symmetric multiprocessing any CPU can execute any task.

8.14 (BL2) The tasks performed by the slave processors are determined, of course, by the game designers or programmers of the applications that run on the processor. Typically, the assigned tasks are processing-intensive, repetitive tasks that can take advantage of the vector processing capability of a slave and that can run relatively independently of the master CPU. For gaming, these tasks include terrain generation, lighting and shadowing, MPEG encoding and decoding, transformation of 3-D graphics, audio decoding, and other vector and multimedia processing tasks.

The primary role of the master processor is to manage the system and to delegate work to the slave processors. This master-slave configuration is well suited to multimedia-intensive applications, because typical applications of this type require intensive processing that is well defined and easily isolated from the main application.

Other applications of this type include many scientific applications, in particular problems with large amounts of data to be analyzed. Current problems observed in the literature include cancer analysis, particle physics analysis, weather analysis, and DNA analysis.

8.15 (BL3) When the number of cores grows large, there is usually a diminishing rate of return. There are three primary reasons for this.

(1) The number of processors exceeds the number of programs that require execution time, on average, at any given moment; therefore a number of the processors sit idle, not performing useful work. Even when the number of programs being executed is relatively large, some of them will typically be waiting for I/O, and therefore not capable of execution at that time.

(2) Each core must have access to memory to execute its instructions. As the number of cores grows large, the time required for each memory access becomes a limiting resource. Although giving each core its own cache memory alleviates this problem to some extent, cache memory creates other problems (see (3) below). Also, cache misses require access to memory to replace cache lines; as the number of cores grows, the number of cache lines that must be loaded also grows, consuming memory access time.

(3) The amount of overhead required by the operating system to manage and keep track of each program being executed grows large. A considerable amount of effort, for example, is needed to keep track of data changes in the cache memory of each processor, to make sure that the data throughout the cache memory in every core remains consistent.

The number of cores can be increased somewhat if we can assure that the programs running in each core are relatively independent of each other (which solves problem (3)) and if the system is designed in such a way that each processor has access to its own local memory (which solves problem (2)). In this case, access to the main, shared computer memory is limited to block DMA transfers, treating main memory as I/O. This is the case for master-slave multiprocessors like the Cell processor. Similar designs have been created in the past for transaction processing applications, where each processor handled its own transactions independently.