# ECE 252 / CPS 220 Advanced Computer Architecture I

# Lecture 6 Pipelining – Part 1

Benjamin Lee Electrical and Computer Engineering Duke University

www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class\_ece252fall11.html



#### 29 September – Homework #2 Due

- Use blackboard forum for questions
- Attend office hours with questions
- Email for separate meetings

#### 4 October – Class Discussion

Roughly one reading per class. Do not wait until the day before!

- 1. Srinivasan et al. "Optimizing pipelines for power and performance"
- 2. Mahlke et al. "A comparison of full and partial predicated execution support for ILP processors"
- 3. Palacharla et al. "Complexity-effective superscalar processors"
- 4. Yeh et al. "Two-level adaptive training branch prediction"



Latency = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

#### Performance Enhancement

- Increases number of cycles per instruction
- Reduces number of seconds per cycle

### Instruction-Level Parallelism

- Begin with multi-cycle design
- When one instruction advances from stage-1 to stage=2, allow next instruction to enter stage-1.
- Individual instructions require the same number of stages
- Multiple instructions in-flight, entering and leaving at faster rate

| Multi-cycle | insn0.fetch | insn0.dec   | insn0.exec |             |           |            |
|-------------|-------------|-------------|------------|-------------|-----------|------------|
|             |             |             |            | insn1.fetch | insn1.dec | insn1.exec |
| Pipelined   | insn0.fetch | insn0.dec   | insn0.exec |             | _         |            |
|             |             | insn1.fetch | insn1.dec  | insn1.exec  |           |            |





- All objects go through the same stages
- No resources shared between any two stages
- Equal propagation delay through all pipeline stages
- An object entering the pipeline is not affected by objects in other stages
- These conditions generally hold for industrial assembly lines
- But can an instruction pipeline satisfy the last condition?

## **Technology** Assumptions

- Small, very fast memory (caches) backed by large, slower memory
- Multi-ported register file, which is slower than a single-ported one
- Consider 5-stage pipelined Harvard architecture



#### Pipeline Overheads

- Each stage requires registers, which hold state/data communicated from one stage to next, incurring hardware and delay overheads
- Each stage requires partitioning logic into "equal" lengths
- Introduces diminishing marginal returns from deeper pipelines

### **Pipeline Hazards**

- Instructions do not execute independently
- Instructions entering the pipeline depend on in-flight instructions or contend for shared hardware resources



### First, build MIPS without pipelining

- Single-cycle MIPS datapath

### Then, pipeline into multiple stages

- Multi-cycle MIPS datapath
- Add pipeline registers to separate logic into stages
- MIPS partitions into 5 stages
- 1: Instruction Fetch (IF)
- 2: Instruction Decode (ID)
- 3: Execute (EX)
- 4: Memory (MEM)
- 5: Write Back (WB)

## **5-Stage Pipelined Datapath (MIPS)** Figure A.17, Page A-29



 $\mathsf{IR} \leftarrow \mathsf{mem}[\mathsf{PC}]; \mathsf{PC} \leftarrow \mathsf{PC} + 4; \mathsf{Reg}[\mathsf{IR}_{\mathsf{rd}}] \leftarrow \mathsf{Reg}[\mathsf{IR}_{\mathsf{rs}}] \mathsf{op}_{\mathsf{IRop}} \mathsf{Reg}[\mathsf{IR}_{\mathsf{rt}}]$ 

## **5-Stage Pipelined Datapath (MIPS)** Figure A.17, Page A-29



WB  $\leftarrow$  Result; Reg[IR<sub>rd</sub>]  $\leftarrow$  WB



Figure A.2, Page A-8



© 2007 Elsevier, Inc. All rights reserved.



Hazards prevent next instruction from executing during its designated clock cycle

#### Structural Hazards

- Hardware cannot support this combination of instructions.
- Example: Limited resources required by multiple instructions (e.g. FPU)

#### Data Hazards

- Instruction depends on result of prior instruction still in pipeline
- Example: An integer operation is waiting for value loaded from memory

#### **Control Hazards**

- Instruction fetch depends on decision about control flow
- Example: Branches and jumps change PC





A single memory port causes structural hazard during data load, instr fetch





Stall the pipeline, creating bubbles, by freezing earlier stages  $\rightarrow$  interlocks Use Harvard Architecture (separate instruction, data memories)





© 2007 Elsevier, Inc. All rights reserved.

Instruction depends on result of prior instruction still in pipeline



#### Read After Write (RAW)

- Caused by a dependence, need for communication
- Instr-j tries to read operand before Instr-I writes it

i: add r1, r2, r3 j: sub r4, r1, 43

#### Write After Read (WAR)

- Caused by an anti-dependence and the re-use of the name "r1"
- Instr-j writes operand (r1) before Instr-I reads it

i: add r4, r1, r3 j: add r1, r2, r3 k: mul r6, r1, r7

#### Write After Write (WAW)

- Caused by an output dependence and the re-use of the name "r1"
- Instr-j writes operand (r1) before Instr-I writes it

i: sub r1, r4, r3 j: add r1, r2, r3 k: mul r6, r1, r7





## Strategy 1 – Interlocks and Pipeline Stalls

- Later stages provide dependence information to earlier stages, which can stall or kill instructions
- Works as long as instruction at stage i+1 can complete without any interference from instructions In stages 1 through I (otherwise, deadlocks may occur)





Resource Usage







- Compare the <u>source registers</u> of instruction in decode stage with the <u>destination registers</u> of uncommitted instructions
- Stall if a source register in decode matches some destination register?
- No, not every instruction writes to a register
- No, not every instruction reads from a register
- Derive stall signal from conditions in the pipeline





Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions.





Should we always stall if RS/RT matches some RD? No, because not every instruction writes/reads a register. Introduce write/read enable signals (we/re)



| R-type: | ор | rs | rt          | rd    | func    |
|---------|----|----|-------------|-------|---------|
| I-type: | ор | rs | rt          | immed | liate16 |
| J-type: | ор |    | immediate26 |       |         |

| instruction |                                  | source(s) | <u>destination</u> |
|-------------|----------------------------------|-----------|--------------------|
| ALU         | rd $\leftarrow$ (rs) func (rt)   | rs, rt    | rd                 |
| ALUi        | rt 🗲 (rs) op imm                 | rs        | rt                 |
| LW          | rt $\leftarrow$ M[(rs) + imm]    | rs        | rt                 |
| SW          | M [(rs) + imm]                   | rs, rt    |                    |
| ΒZ          | cond (rs)                        |           |                    |
|             | true: PC $\leftarrow$ (PC) + imm | rs        |                    |
|             | false: PC $\leftarrow$ (PC) + 4  | rs        |                    |
| J           | PC 🗲 (PC) + imm                  |           |                    |
| JAL         | r31 ← (PC), PC ← (PC) + imm      |           | R31                |
| JR          | $PC \leftarrow (rs)$             | rs        |                    |
| JALR        | r31 ← (PC), PC ← (rs)            | rs        | R31                |





Should we always stall if RS/RT matches some RD? No, because not every instruction writes/reads a register. Introduce write/read enable signals (we/re)



| <u>Cdest</u> | WS  | Case(opcode)<br>ALU:<br>ALUi:<br>JAL, JALR:                   | ws ← rd<br>ws ← rt<br>ws ← R31           |
|--------------|-----|---------------------------------------------------------------|------------------------------------------|
|              | we  | Case(opcode)<br>ALU, ALUi, LW<br>JAL, JALR<br>otherwise       | we ← (ws != 0)<br>we ← 1<br>we ← 0       |
| <u>Cre</u>   | rel | Case(opcode)<br>ALU, ALUi<br>LW, SW, BZ<br>JR, JALR<br>J, JAL | re1 ← 1<br>re1 ← 1<br>re1 ← 1<br>re1 ← 0 |
|              | re2 | Case(opcode)<br><< same as re1 b                              | ut for register rt>>                     |



Xrs denote register rs for instruction in pipeline stage X Xrt denote register rt for instruction in pipeline stage X Xws denote destination register for instruction in pipeline stage X

| <u>Cstall</u> | stall-1 ←          | (           | (Drs == Ews) & Ewe  <br>(Drs == Mws) & Mwe  <br>(Drs == Wws) & Wwe |
|---------------|--------------------|-------------|--------------------------------------------------------------------|
|               |                    | ) & Dre1    |                                                                    |
|               | stall-2 ←          |             | (Drt == Ews) & Ewe  <br>(Drt == Mws) & Mwe  <br>(Drt == Wws) & Wwe |
|               |                    | ) & Dre2    |                                                                    |
|               | stall <del>(</del> | stall-1   s | itall-2                                                            |



 $M[(r1)+7] \leftarrow (r2)$ r4  $\leftarrow M[(r3)+5]$ 

What is the problem here? What if (r1)+7 == (r3+5)?

Load/Store hazards may be resolved in the pipeline or may be resolved in the memory system. More later.



## Strategy 2 – Forwarding (aka Bypasses)

- Route data as soon as possible to earlier stages in the pipeline
- Example: forward ALU output to its input









This forwarding path only applies to the ALU operations...

| Eforward | Case(Eopcode) |                                 |
|----------|---------------|---------------------------------|
|          | ALU, ALUi     | Eforward $\leftarrow$ (ws != 0) |
|          | otherwise     | Eforward $\leftarrow$ 0         |

...and all other operations will need to stall as before

| Estall | Case(Eopcode) |                               |
|--------|---------------|-------------------------------|
|        | LW            | Estall $\leftarrow$ (ws != 0) |
|        | JAL, JALR     | Estall 🗲 1                    |
|        | otherwise     | Estall 🗲 0                    |

Asrc ← (Drs == Ews) & Dre1 & Eforward

Remember to update stall signal, removing case covered by this forwarding path

# **Multiple Forwarding Paths**



© 2007 Elsevier, Inc. All rights reserved.

D A









# **Forwarding Loads/Stores**

Figure A.8, Page A-19



© 2007 Elsevier, Inc. All rights reserved.

# Data Hazard Despite Forwarding

Figure A.9, Page A-20



LD cannot forward (backwards in time) to DSUB. What is the solution?



#### Try producing faster code for

- A = B + C; D = E F;
- Assume A, B, C, D, E, and F are in memory
- Assume pipelined processor

| <u>Slow Code</u> |            | <u>Fast Code</u> |            |  |
|------------------|------------|------------------|------------|--|
| LW               | Rb, b      | LW               | Rb, b      |  |
| LW               | Rc, c      | LW               | Rc, c      |  |
| ADD              | Ra, Rb, Rc | LW               | Re, e      |  |
| SW               | a, Ra      | ADD              | Ra, Rb, Rc |  |
| LW               | Ree        | LW               | Rf, f      |  |
| LW               | Rf, f      | SW               | a, Ra      |  |
| SUB              | Rd, Re, Rf | SUB              | Rd, Re, Rf |  |
| SW               | d, RD      | SW               | d, RD      |  |



These slides contain material developed and copyright by

- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- Alvin Lebeck (Duke)
- David Patterson (UCB)
- Daniel Sorin (Duke)