Software Pipelining

CS 410 / 510 Mastery in Programming

Chapter 11: Software Pipelining

Herbert G. Mayer, PSU CS

Status 8/5/2013


Software Pipelining (SW PL) is a performance-enhancing technique that exploits instruction-level parallelism in a wide instruction word to accomplish execution speed-up. As the name implies, software pipelining here is not primarily achieved through hardware, but rather refers to the interleaved execution of distinct iterations of a SW loop. Parts of multiple iterations of a source program’s loop body are executed simultaneously, in a pipelined fashion.

This requires that the target machine have multiple HW modules that can execute simultaneously. For example, an LIW architecture has instructions that perform, say, a floating point multiply at the same time they execute a floating point add, and perhaps also an integer addition plus a load operation. We have seen this on a smaller scale with superscalar machines.

Similarly, a VLIW architecture provides instructions that do all of the above plus a store, several other integer operations, and perhaps loop instructions. Very roughly, the boundary between VLIW and LIW is 128 instruction bits.

Synopsis

  • Motivation
  • Outline of Software Pipelining
  • General Pattern of VLIW Instructions Used
  • Definitions
  • Comparison of Pipelining, SW PL, and Superscalar Execution
  • Typical VLIW instruction
  • Example of Simple SW PL Loop
  • Peeling-Off Iterations
  • Software Pipelined Matrix Multiply
  • More Complex SW Pipelined Loop
  • Skipping the Epilogue?
  • Literature References

Motivation

Imagine you construct a super-machine instruction, one that can perform numerous operations in parallel. The time for the novel instruction to complete equals the time of the longest sub-operation. Once the faster sub-operations complete, their HW modules sit idle, waiting for another instruction to keep them busy, until the most complex sub-operation terminates. Let us name this super-machine instruction a Very Long Instruction Word (VLIW) operation.

This VLIW operation may compute a floating point (FP) multiply (fmul) at the same time it performs an FP add (fadd), and a load (ld), and a store (st), and two integer operations, and perhaps more. Below is a sequence of 3 fictitious assembler operations, executed in sequence; later we shall package them into a single VLIW operation:

0  -- 2 loads, 1 add   -- assume good addresses in r1, r2
1  ld   r3, (r1)++     -- load r3 ind. thru r1, post-incr.
2  ld   r4, (r2)++     -- load into r4 memory word at (r2)
3  fadd r5, r3, r4     -- floating-point add r5 ← r3 + r4

The sequence of instructions above executes sequentially. Our novel VLIW instruction can do all of the above, and more, in parallel. The new super-assembler expresses this simultaneity via new assembly pseudo-operations, the { and } braces. These braces say: all enclosed operations are packed into a single VLIW instruction, and all the sub-opcodes, regardless of the order listed, are executed in parallel, in a single logical step, in a single instruction. See the new VLIW instruction below:

{                      -- assume good addresses in r1, r2
1  ld   r3, (r1)++     -- load r3 indirect thru r1, post-incr.
2  ld   r4, (r2)++     -- load into r4 memory word at (r2)
3  fadd r5, r3, r4     -- FP add r5 ← r3 + r4
}

It is not reasonable to define the architecture such that the sum of r3 and r4, ending up in r5, would be the addition of the newly fetched memory words, indirectly accessed through r1 and r2. Instead, the old values of r3 and r4, those already in the registers when the instruction started, are added into r5; the new values are then loaded into r3 and r4. So the meanings of the sequential and VLIW instruction sequences above are not equivalent. It may then seem that a VLIW operation cannot truly execute in parallel, and fast, what had to be evaluated in sequence, and slowly. That is a correct observation; yet maybe we can invent something great with VLIW operations after all?

This is where Software Pipelining comes into play. It is a technique that takes advantage of the built-in parallelism, while overcoming the limitation that the destination register of one sub-operation cannot be used in the same logical step as a source for another. Software Pipelining packs parts of multiple iterations of one source loop by assembling them into the same VLIW instruction. For this to work, the Software Pipelined loop must be suitably prepared; this is done in the loop Prologue. And before the iterations are all complete, the VLIW instruction must be emptied or drained; that is done in the loop Epilogue.

Outline of Software Pipelining

  • Software Pipelining requires a target processor with multiple executing agents (HW modules) that can run simultaneously
  • Target may be an LIW, a VLIW, or a heavily superscalar processor
  • These multiple HW modules operate in parallel on (parts of) different iterations of a source program’s loop
  • It is key that none of these multiple, simultaneous operations generate results that would be expected as input of another operation in the same iteration
  • Instead, in step 1 an operation generates a result in destination register r1, while in step 2 another operation can use r1 as an input to generate the next result
  • It is the compiler’s or programmer’s task to pack operations of different iterations of the source loop into suitable fields of one long instruction, so that they can be executed in one step
  • In extreme cases, a complete loop body may consist of a single VLIW instruction; clearly desirable, though a rare and special case
  • This was done in Intel’s iWarp program in the 1990s for a double-precision matrix multiply

General Pattern of VLIW Instructions Used:

In the following examples we use VLIW instructions. The assembly language syntax is similar to conventional assemblers. There will be individual load and store operations, and floating point adds and multiplies. In addition, VLIW instructions can group several of these operations together; this is expressed by the { and } braces, indicating that all operations enclosed can be executed together, jointly in a single step. The sequential order in which the various VLIW sub-operations are written in assembler source by no means implies sequential execution order! Instead, all sub-operations are --permitted to be-- initiated at the same time. For example:

1  ld   r3, (r1)++     -- load r3 indirect thru r1, post-incr.
2  add  r0, r6, #2     -- add int 2 to r6 and put sum into r0
3  {                   -- start of VLIW operation
4  fadd r5, r3, r4     -- add current r3 + r4 into r5
5  ld   r3, (r1)++     -- new value in r3, old value used
6  ld   r4, (r2)++     -- new value in r4, old value used
7  }                   -- end of VLIW instruction

Lines 1 and 2 above display typical sequential load and add instructions. However, lines 3 through 7 display one single VLIW operation that lumps a floating point add, two integer increments, and two loads into one single executable step. Note that r3 and r4 are both used as operands, as inputs to the floating point add operation, but they also receive new values as a result of the load functions. This works perfectly well, as long as the processor latches their old values for use in the fadd. After completion of the VLIW instruction the loaded values in r3 and r4 can be used as operands again.

The execution time of the above VLIW instruction is dominated by the time to execute the two loads. This could be the time for a single load, if the 2 addresses (in r1 and r2) happen to activate different memory banks. In that case, the time for the two loads would be the time for one load plus a small number of clock cycles. Otherwise the 2 loads are executed in sequence.

Loop Instructions, Not Germane to VLIW

The loop instruction used in the examples here is akin to that of the Intel 8086 processor. The hypothetical operation:

loop L              -- using special-purpose, implied loop register rx

has the following meaning: an implied loop register rx is assumed initially to hold the number of iterations of a countable loop; this integer value is decreased by 1 each time the loop instruction is executed. If that decremented value is not yet 0, execution continues at the label given, here L. Typically that label resides at an address physically before the loop instruction, so this results in a back branch most of the time, and once in a fall-through. The fall-through happens when the special register has counted down to 0.

Definitions

Column:

A Software Pipelined loop operates on (parts of) more than one iteration of a source program’s loop body. Each distinct loop iteration toward which one compute step makes progress is called a column. Hence, if one execution of the loop makes some progress toward 4 different iterations, such a SW-pipelined loop is said to have 4 columns.

Draining:

Execute further instructions on a VLIW architecture after VLIW loop execution, such that a SW-pipelined loop completes correctly. That means, all registers still holding good values after loop termination will finally be used up by such sequential instructions. Synonym: flushing. Antonym: Priming.

Epilogue:

When the (steady state of the) loop terminates, there will generally be some valid operands left in the hardware resources used. For example, an operand may have been loaded and is yet to be used. Or a product has been computed into a register and is yet to be used, i.e. added to the final sum. Thus the last operands must be consumed; the pipeline must be drained. This is accomplished in the object code after the steady state and is called the Epilogue. Antonym: Prologue.

LIW:

Long Instruction Word; an instruction requiring more opcode bits than a conventional (e.g. RISC) instruction, because multiple simultaneous operations are packed and encoded into a single LIW instruction. Typically, LIW instructions are longer than 32 but shorter than 128 bits.

The opcode proper may be short, possibly even a single bit, but generally there will be further bits to specify sub-opcodes. For example, the floating point add may perform either a subtract, negate, or unsigned add etc. and that must be specified via bits for sub opcodes.

Peeling Off:

Removal of one iteration from the original complete loop is called peeling off. Usually this is done to perform or prepare for an optimization. In software pipelining this is done to ensure the pipeline is primed before and drained after execution of the loop. The object code of peeled off iterations can be scheduled together with other instructions. Hence the Prologue and Epilogue may end up in VLIW instructions of code preceding or following a software pipelined loop.

Priming:

Execute sequential instructions on a VLIW architecture before loop execution, such that a SW-pipelined loop can run correctly. That means, all registers are holding good values at the start of the loop, but hold also partly unused values at termination of the loop. Antonym: Flushing.

Prologue:

Before the Software Pipelined loop body can be initiated, the various hardware resources (e.g. registers) that partake in the SW PL must be initialized. For example, the first operands may have to be loaded, or the first sum of two loaded operands must be computed. Thus the first operands must be generated; the pipeline must be primed. This is accomplished in the object code before the steady state and is called the Prologue. Antonym: Epilogue.

Steady State:

The object code executed repeatedly, after the Prologue has been initiated and before the Epilogue becomes active, is called the Steady State. Each single execution of the Steady State makes some progress toward multiple iterations of the source loop. This loop was mapped into Prologue, Steady State, plus Epilogue by the Software Pipelining compiler.

VLIW:

Very Long Instruction Word; like an LIW instruction, but VLIW instructions typically consume 128 or more bits for the opcode, sub-opcodes, plus all operands. Some of the sub-opcodes may actually be NOPs.

Comparison of Pipelining, SW PL, and Superscalar Execution

Pipelining:

  • Pipelining assumes: multiple independent HW units that collectively execute one single instruction in sequence; hardware modules are not replicated
  • Pipelining does: at any step (clock cycle) each HW module executes one part of a different instruction
  • Pipelining allows: simultaneous execution of multiple parts of different instructions at the same time. This does not accelerate the execution time of a single instruction
  • Pipelining speeds up: the throughput of numerous consecutive instructions, provided there are no stalls (AKA hazards), or few stalls relative to the number of instructions run

Superscalar Execution:

  • Superscalar architecture assumes: multiple independent HW units that each can execute a complete instruction. Also assumes that some instruction sequences are arranged in the object code such that they can be fetched and executed together; implies they have no data dependence on one another
  • Superscalar architecture does: at any step execute either a single, or sometimes multiple, instructions simultaneously. Always fetches greedily
  • Superscalar architecture allows: simultaneous execution of some select sequences of instructions. Note that not all instruction sequences (or instruction pairs in a superscalar architecture with a maximum of 2 identical HW modules) can be executed concurrently
  • Superscalar architecture speeds up: those select instruction sequences, for which the architecture provides multiple HW modules, and for which the logic of the running program provides data independence, i.e. the output of one is not required as the input to the other

Software Pipelining:

  • Software Pipelining assumes: VLIW or LIW instructions.
  • Software Pipelining does: at any VLIW-step executes multiple operations at the same time, packaged into a single VLIW instruction.
  • Software Pipelining allows: execution of parts of multiple iterations of the same source loop at run time. But generally one VLIW instruction does not map into a full source loop; only a portion of a whole loop body.
  • At each iteration, the Software pipelined loop executes VLIW instructions, which together hold instructions for multiple source statements of a loop body. Thus VLIW technology speeds up the total throughput of a complete loop.

Typical VLIW Instruction Performs

  • floating point multiply
  • floating point add
  • load from memory, with optional auto-increment, decrement, pre- or post
  • second load or store, with auto-increment, decrement, pre- or post
  • integer add, also with optional auto-increment etc. Can also subtract, invert sign, etc.
  • integer multiply or divide
  • loop-related operation

Note: If both a load and a store are performed in one VLIW instruction on a system with a single memory controller, the load should be initiated first. The execution time for the overall instruction can then be greater than the maximum of the times for a load and a store. Having two store operations in the same VLIW instruction would require special care to define the case of both memory destinations being identical or partly overlapping. We skip that discussion here; it belongs in an architecture class, not in Programming Mastery.

Example of Simple SW PL Loop

The program to be software pipelined adds all elements of floating-point vector a[] to the corresponding elements of vector b[], placing the sums into a[]. First we show the source program in pseudo-C, with a possible mapping into some hypothetical assembly language. Then we show a picture of the problem with the hardware resources used, and a software pipelined version in VLIW assembly language. Note that registers r0, r1, and r2 are used as pointers: r0 points to the next element in a[] to be fetched, r1 to the next address in a[] to be stored into, and r2 to the next element of b[] to be fetched.

Source of Vector Add Program:

for ( i = 0; i < N; i++ ) {
   a[ i ] = a[ i ] + b[ i ];   // a[] and b[] are floats
                               // same as: a[ i ] += b[ i ]
} // end for

Sequential Assembly Program of Vector Add:

   mov  r0, Addr( a[] )   -- address of a[0] is in r0 for load
   mov  r1, r0            -- copy pointer to r1 for store; need both?
   mov  r2, Addr( b[] )   -- address of b[0] in r2 for load
   mov  rx, N             -- number of iterations into register rx
L_1:
   ld   r3, (r0)++        -- load word indirect into r3 through r0
   ld   r4, (r2)++        -- what r2 points to is loaded into r4
   fadd r5, r3, r4        -- r5 holds sum of two elements
   st   r5, (r1)++        -- store result and post-increment
   loop L_1               -- if not completed, jump to L_1

Pictorial Representation of Vector Add Program:

Software Pipelined Assembly Version of Vector Add:

The iteration space spans 0 through N-1. Of those N iterations, 2 will be peeled off into the Prologue plus Epilogue. Each iteration of the Steady State consists of two loads, two integer additions done via post-increment, a floating point add, a store, and the loop-overhead step. It is fairly unrealistic to have one VLIW instruction do 2 loads and a store!

The Prologue executes a pair of loads and one floating point add. Then comes another pair of loads, since the previously loaded values are used up by the floating point add. The Epilogue then completes the last-but-one store, performs the last add, and concludes by storing that final value.

-- Initialization:         -- N is the iteration count
   mov  r0, Addr( a[] )    -- r0 is pointer to float array a[0]
   mov  r1, r0             -- copy address of a[0] into r1
   mov  r2, Addr( b[] )    -- r2 is pointer to b[0]
   mov  rx, N-2            -- rx holds iteration count N-2

-- Prologue primes the SW pipe: peeling off parts of 2 iterations
   ld   r3, (r0)++         -- r3 holds a[0], r0 now points to a[1]
   ld   r4, (r2)++         -- r4 holds b[0], r2 now points to b[1]
   fadd r5, r3, r4         -- old a[0] + b[0] in r5
   ld   r3, (r0)++         -- a[1] in r3, r0 now points to a[2]
   ld   r4, (r2)++         -- b[1] in r4, r2 now points to b[2]

-- Steady State executes repeatedly, (N-2) times:
L_2:
   {
   st   r5, (r1)++         -- store into address pointed to by r1
   fadd r5, r3, r4         -- r5 holds sum of two elements
   ld   r3, (r0)++         -- load indirect into r3 through r0
   ld   r4, (r2)++         -- what r2 points to is loaded into r4
   loop L_2                -- decrement rx; if != 0, jump to L_2
   }

-- r1 now points to last-but-2 element of a[] for storing
-- Epilogue drains the SW pipe:
   st   r5, (r1)++         -- store, then increment r1
   fadd r5, r3, r4         -- use last pair of FP operands
   st   r5, (r1)           -- and store last sum in a[N-1]

Peeling Off Iterations

Two iterations had to be peeled off. The overall net effect of all peeled-off operations is 2 pairs of loads, 2 floating-point adds, and 2 stores. All those loads and one add are done early, in the Prologue. The other add and the 2 stores are executed in the Epilogue. The former primes the software pipeline for the Steady State; the latter drains the pipe after completion of the Steady State.