## ESE532: System-on-a-Chip Architecture

Day 7: September 27, 2021
Pipelining

Penn ESE532 Fall 2021 -- DeHon



### Previously

- Pipelining in the large
  - Not just for gate-level circuits
- Throughput and Latency
- Pipelining as a form of parallelism


### Today

Pipelining details (for gates, primitive ops)

- Systematic Approach (Part 1)
- Justify Operator and Interconnect Pipelining (Part 2)
- Loop Bodies
- Cycles in the Dataflow Graph (Part 3)
- C-slow [online-only recording] (Part 4)


### Message

 Pipelining is an efficient way to reuse hardware to perform the same set of operations at high throughput


## Cycle

Two uses of term in this lecture:

- Repetitive waveform
  - E.g. sine wave or square wave
- Graph cycle




















### Pipeline Reuse

- Lower delay between clocks
  - Higher clock rate
  - Higher potential throughput
  - Faster we reuse our logic
  - More capacity we get out of the design
- Assuming registers are cheap in area and time overhead
  - T<sub>setup</sub>, T<sub>clk→q</sub> ~ 20ps, T<sub>add</sub> ~ 500ps
  - Registers ~ 10 transistors/bit
  - Adder ~ 40-50 transistors/bit
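The timing numbers on this slide can be turned into a quick overhead estimate. The following is a minimal sketch, assuming setup and clock-to-q are each about 20 ps (the slide gives one figure for both); the function names are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>

/* Clock period (ps) of one pipeline stage: logic delay plus the
 * register overhead (setup + clock-to-q). Hypothetical helper. */
int stage_period_ps(int logic_ps, int setup_ps, int clk_to_q_ps) {
    assert(logic_ps >= 0 && setup_ps >= 0 && clk_to_q_ps >= 0);
    return logic_ps + setup_ps + clk_to_q_ps;
}

/* Share of the period spent on register overhead, in percent,
 * rounded down by integer division. */
int overhead_percent(int logic_ps, int setup_ps, int clk_to_q_ps) {
    int period = stage_period_ps(logic_ps, setup_ps, clk_to_q_ps);
    assert(period > 0);
    return 100 * (setup_ps + clk_to_q_ps) / period;
}
```

With a 500 ps adder and 20 ps each for setup and clock-to-q, `stage_period_ps(500, 20, 20)` gives 540 ps and `overhead_percent(500, 20, 20)` gives 7, which is the sense in which the register is cheap in time.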











### Note Registers on Links

- Some links end up with multiple registers.
- Why?









### Add Registers and Move

- If we're willing to add pipeline delay
  - Add any number of pipeline registers at input
  - Move registers into circuit to reduce cycle time
    - Reduce max delay between registers




### Add Register and Retime

- Add chain of registers on every input
- Retime registers into circuit
  - Minimizing delay between registers


## Add Registers and Retime

- Lets us think about behavior
  - What the pipelining is doing to cycles of delay
- Separate from the details of how to redistribute registers
- Behavioral equivalence between the registers-at-front version and the properly retimed version of the circuit
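The equivalence claim can be checked on a toy circuit. This is a minimal sketch, not the lecture's example: it assumes a two-operator chain `f(g(x))` with made-up operators `g` and `f`, and compares two registers at the input against one register at the input plus one retimed between `g` and `f`. Registers start at 0 in both versions, so outputs are only compared once the pipeline has filled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical combinational operators, chosen only for illustration. */
static int g(int v) { return v + 1; }
static int f(int v) { return v * 2; }

/* Version A: both pipeline registers sit at the input, then g and f. */
void sim_regs_at_input(const int *x, int *out, size_t n) {
    int r1 = 0, r2 = 0;
    for (size_t t = 0; t < n; t++) {
        out[t] = f(g(r2));   /* output seen during cycle t */
        r2 = r1;             /* clock edge: shift the register chain */
        r1 = x[t];
    }
}

/* Version B: one register retimed forward to sit between g and f. */
void sim_retimed(const int *x, int *out, size_t n) {
    int r1 = 0, r2 = 0;
    for (size_t t = 0; t < n; t++) {
        out[t] = f(r2);      /* only f remains after the last register */
        r2 = g(r1);          /* clock edge: g now sits between r1 and r2 */
        r1 = x[t];
    }
}

/* Returns 1 if the two output traces agree from cycle `fill` onward. */
int match_after_fill(const int *a, const int *b, size_t n, size_t fill) {
    for (size_t t = fill; t < n; t++)
        if (a[t] != b[t]) return 0;
    return 1;
}
```

Both versions deliver `f(g(x[t-2]))` in cycle `t`, so the latency in cycles is the same; only the longest register-to-register path changes (from `g` plus `f` down to the larger of the two).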


# Justify Pipelining

(or composing pipelined operators)

Part 2

### **Handling Pipelined Operators**

- Given a pipelined operator
  - (or a pipelined interconnect)
- Discipline of picking a frequency target and designing everything for that
  - May be necessary to pipeline operator since its delay is too high
- Due to hierarchy
  - Pipelined this operator and now want to use it as a building block

### **Examples**

- Run at 500MHz
- Floating-point unit that takes 9ns
  - Can pipeline into five 2ns stages
- Multiplier that takes 6ns
- Memory can access in 2ns
  - Only if registers on address/inputs and output
  - i.e. exist in own clock stage
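The stage counts in these examples follow from dividing each operator delay by the target clock period and rounding up. A minimal sketch of that arithmetic, with hypothetical helper names and ignoring register overhead (as the slide's numbers appear to):

```c
#include <assert.h>

/* Clock period in picoseconds for a frequency given in MHz. */
int period_ps_from_mhz(int mhz) {
    assert(mhz > 0);
    return 1000000 / mhz;
}

/* Number of pipeline stages needed so that no stage exceeds the
 * target period: ceil(delay / period). */
int stages_needed(int delay_ps, int period_ps) {
    assert(delay_ps >= 0 && period_ps > 0);
    return (delay_ps + period_ps - 1) / period_ps;
}
```

At 500MHz the period is 2000 ps, so the 9ns floating-point unit needs 5 stages, the 6ns multiplier needs 3, and the 2ns memory fits in a single stage of its own.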


### Interconnect Delay

- Delay to cross chip >> clock cycle
- Chip may be 100s of operators wide
- May only be able to reach across 10 operators in a 2ns cycle
- Must pipeline long interconnect links


### Interconnect Example

## Methodology: Pipelined Operator Graph

- Start with logical, unpipelined graph
- Treat each pipelined operator as a set of unit-delay operators of mandatory depth
- Treat each interconnect pipeline stage as a unit-delay buffer
- Add registers at input
- Retime into graph



# Pipeline Loop (and use for justify pipeline example)



### **Example Operators**

- Operator and interconnect delays
  - Multiplier: 3 cycles
  - Reading from input array
    - Memory op is the cycle after computing the address
    - Takes one cycle of delay to bring data back to multiplier (or adder)








### Pipelining Lesson

- Can always pipeline an acyclic graph (no graph cycles) to fixed frequency target
  - fixed pipelining of primitive operators
  - Pipeline interconnect delays
- Need to keep track of registers to balance paths
  - So see consistent delays to operators
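Keeping track of balancing registers can be done mechanically on an acyclic operator graph. The sketch below is one simple way to do it, not necessarily how any tool does: each node carries its mandatory pipeline depth in cycles, each edge carries its interconnect stages, and the extra registers on an edge are whatever is needed so all inputs of a node arrive in the same cycle. The `Edge` type, `balance_paths`, and the requirement that nodes be numbered in topological order are assumptions made for this illustration.

```c
#include <assert.h>

#define MAX_NODES 16

typedef struct {
    int src;          /* producer node index */
    int dst;          /* consumer node index */
    int wire_cycles;  /* pipelined interconnect stages on this link */
} Edge;

/* Nodes must be numbered in topological order (src < dst on every edge).
 * latency[v] is the pipeline depth of node v in cycles (0 for inputs).
 * Fills balance_regs[e] with the extra registers needed on edge e and
 * returns the total latency of the graph in cycles. */
int balance_paths(int n_nodes, const int *latency,
                  int n_edges, const Edge *edges, int *balance_regs) {
    int in_time[MAX_NODES] = {0};  /* cycle at which each node's inputs align */
    int total = 0;
    assert(n_nodes > 0 && n_nodes <= MAX_NODES);

    /* Pass 1: a node can start once its slowest input has arrived. */
    for (int v = 0; v < n_nodes; v++) {
        for (int e = 0; e < n_edges; e++) {
            if (edges[e].dst != v) continue;
            assert(edges[e].src >= 0 && edges[e].src < v);
            int arrive = in_time[edges[e].src] + latency[edges[e].src]
                         + edges[e].wire_cycles;
            if (arrive > in_time[v]) in_time[v] = arrive;
        }
        if (in_time[v] + latency[v] > total) total = in_time[v] + latency[v];
    }

    /* Pass 2: pad every early-arriving edge up to its consumer's start. */
    for (int e = 0; e < n_edges; e++) {
        int arrive = in_time[edges[e].src] + latency[edges[e].src]
                     + edges[e].wire_cycles;
        balance_regs[e] = in_time[edges[e].dst] - arrive;
    }
    return total;
}
```

For a hypothetical graph computing `a*b + c` with a 3-cycle multiplier, a 1-cycle adder, and one interconnect stage from multiplier to adder, the `c` input needs 4 balancing registers so that it reaches the adder in the same cycle as the product.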


# Graph Cycles

(Watch the terms: clock cycle, cycle time, cycle in graph)

Part 3

### Preclass 3

- Can we retime to reduce clock cycle time?



## (Graph) Cycle Observation

- Retiming does not allow us to change the number of registers inside a graph cycle.
- Limit to clock cycle time
  - Max delay in graph cycle / Registers in graph cycle
- Pipelining doesn't help inside graph cycle
  - Cannot push registers into graph cycle
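The limit stated above is a simple ratio. A minimal sketch of it, with a hypothetical function name, rounding up to whole picoseconds and assuming the delay in the graph cycle could be split evenly across its registers (so the result is a lower bound on the achievable period):

```c
#include <assert.h>

/* Lower bound on the clock period (ps) imposed by one graph cycle:
 * total delay around the cycle divided by the registers in it. */
int min_period_ps(int cycle_delay_ps, int regs_in_cycle) {
    assert(cycle_delay_ps >= 0 && regs_in_cycle > 0);
    return (cycle_delay_ps + regs_in_cycle - 1) / regs_in_cycle;
}
```

For example, a graph cycle with 6000 ps of logic and a single register cannot be clocked faster than 6 ns, no matter how many registers are added at the inputs; the same logic with three registers in the cycle could in principle reach 2 ns.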


### Simple Graph Cycle

- Delay of graph cycle?
- Registers in graph cycle?
- What happens to graph cycle if try to apply lead/lag?

















### Lesson

- Cyclic dependencies limit throughput on single task or data stream
  - Cycle-length / registers-in-cycle


### **Vector Pipelines**

- Data Parallel Vector Operations are interesting even when Vector Lanes
- Within Vector operation, data parallel so no cyclic dependencies
  - So get an II=1 issuing Vector Lane operations
  - May have data dependences between Vector operations


### Vector Pipeline Example

`for (int i = 0; i < 32; i++) c[i] += a[i] * b[i];`
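Because each iteration of this loop touches only its own `c[i]`, there is no graph cycle across iterations and a new element can be issued every clock cycle. The sketch below wraps the loop in a function and adds a rough cycle-count model; the pipeline depth is a free parameter here (the value 5 used in the note below is an assumption, not a figure from the lecture), and the helper names are hypothetical.

```c
#include <assert.h>

/* The slide's loop as a function: element-wise multiply-accumulate. */
void vec_mac(int *c, const int *a, const int *b, int n) {
    for (int i = 0; i < n; i++)
        c[i] += a[i] * b[i];
}

/* Cycles to push n independent elements through a pipeline of the given
 * depth when a new element is issued every ii cycles. */
int pipelined_cycles(int n, int depth, int ii) {
    assert(n > 0 && depth > 0 && ii > 0);
    return depth + (n - 1) * ii;
}

/* Cycles if each element must finish before the next one starts. */
int unpipelined_cycles(int n, int depth) {
    assert(n > 0 && depth > 0);
    return n * depth;
}
```

With 32 elements and an assumed depth of 5 cycles, `pipelined_cycles(32, 5, 1)` is 36, against 160 for `unpipelined_cycles(32, 5)`.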



### Big Ideas

- Pipeline computations to reuse hardware and maximize computational capacity
- Can compose pipelined operators and accommodate fixed-frequency target
  - Be careful with data retiming
- Graph cycles limit pipelining on single stream -- II
- C-slow to share hardware among multiple, data-parallel streams (part 4)

### Admin

- Remember Feedback form
  - Including HW3
- Reading for Day 8 on web
- HW4 due Friday


### C-Slow

(Probably record separately)
Part 4


### Problem

- Pipelining cannot push registers into a graph cycle
- Graph cycles can prevent running at full pipeline target (maximum clock frequency)
- If not reusing operators at full pipeline target, are underutilizing resources
- Can we use the resources for something?

[Figure: dataflow graph with nodes Mem A[i], C[i-1], Mul1, Mul2, Mul3, Mod1, Mod2, Mod3, C[i]]

### C-Slow

- Observation: if we have data-level parallelism, can use to solve independent problems on same hardware
- Transformation: make C copies of each register
- **Guarantee:** C computations operate independently
  - Do not interact with each other
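The independence guarantee can be seen in a small behavioral model. This is a sketch under stated assumptions rather than the lecture's circuit: the loop operator is a made-up multiply-then-mod, the C registers in the graph cycle are modeled as a shift register, and C independent streams are issued round-robin, one per clock. All names and constants are hypothetical.

```c
#include <assert.h>

#define C_STREAMS 3      /* C: registers in the loop, streams interleaved */
#define N_ITEMS   4      /* inputs per stream */
#define MODULUS   1009   /* arbitrary modulus for the toy operator */

/* Toy loop-carried operator: next state from previous state and input. */
static int loop_op(int acc, int x) { return (acc * x) % MODULUS; }

/* Reference: one stream computed on its own, starting from state 1. */
int run_single(const int *x, int n) {
    int acc = 1;
    for (int i = 0; i < n; i++)
        acc = loop_op(acc, x[i]);
    return acc;
}

/* C-slowed loop: pipe[] models the C registers inside the graph cycle.
 * In cycle t the oldest register holds the state of stream t % C. */
void run_c_slow(int x[C_STREAMS][N_ITEMS], int final[C_STREAMS]) {
    int pipe[C_STREAMS];
    for (int k = 0; k < C_STREAMS; k++)
        pipe[k] = 1;                       /* initial state of each stream */

    for (int t = 0; t < C_STREAMS * N_ITEMS; t++) {
        int s = t % C_STREAMS;             /* stream served this cycle */
        int result = loop_op(pipe[C_STREAMS - 1], x[s][t / C_STREAMS]);
        for (int k = C_STREAMS - 1; k > 0; k--)
            pipe[k] = pipe[k - 1];         /* clock edge: shift the loop */
        pipe[0] = result;
    }

    /* After the last cycle, stream s sits C-1-s registers from the front. */
    for (int s = 0; s < C_STREAMS; s++)
        final[s] = pipe[C_STREAMS - 1 - s];
}
```

Each stream only ever sees its own previous state coming out of the register chain, so `final[s]` equals `run_single(x[s], N_ITEMS)` for every stream, while the shared operator accepts a new input on every clock.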







### Automation

- No mainstream tool today will perform C-slow transformation for you automatically
- Synthesis tools will retime registers


### Lesson

- Cyclic dependencies limit throughput on single task or data stream
  - II=Cycle-length / registers-in-cycle
- Can use on C (C<=II) independent (data parallel) tasks
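The two bullets above combine into a utilization estimate. A minimal sketch, assuming the cycle length is counted in pipeline stages and rounding the ratio up to a whole number of cycles; both function names are hypothetical.

```c
#include <assert.h>

/* Initiation interval for a single stream: pipeline stages around the
 * graph cycle divided by the registers originally in it, rounded up. */
int initiation_interval(int cycle_len, int regs_in_cycle) {
    assert(cycle_len > 0 && regs_in_cycle > 0);
    return (cycle_len + regs_in_cycle - 1) / regs_in_cycle;
}

/* Percentage of clock cycles in which the shared operators do useful
 * work when c independent streams are interleaved (requires c <= ii). */
int utilization_percent(int c, int ii) {
    assert(ii > 0 && c >= 1 && c <= ii);
    return 100 * c / ii;
}
```

For a graph cycle of 6 pipeline stages with 1 register, II is 6; a single stream keeps the operators busy about 16% of the time, 3 streams reach 50%, and 6 streams reach 100%.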


### Big Ideas

- Pipeline computations to reuse hardware and maximize computational capacity
- Can compose pipelined operators and accommodate fixed-frequency target
  - Be careful with data retiming
- Graph cycles limit pipelining on single stream -- II
- C-slow (C<=II) to share hardware among multiple, data-parallel streams (part 4)
