### ESE532: System-on-a-Chip Architecture

Day 7: September 28, 2020 Pipelining

Penn ESE532 Fall 2020 -- DeHon



### Previously

- Pipelining in the large
  - Not just for gate-level circuits
- Throughput and Latency
- Pipelining as a form of parallelism


### Today

Pipelining details (for gates, primitive ops)

- Systematic Approach (Part 1)
- Justify Operator and Interconnect Pipelining (Part 2)
- Loop Bodies
- Cycles in the Dataflow Graph (Part 3)
- C-slow [probably separate record] (Part 4)


### Message

 Pipelining is an efficient way to reuse hardware to perform the same set of operations at high throughput


### Multiplexer Gate

- MUX: Mux2(S, i0, i1)
  - When S=0, output=i0
  - When S=1, output=i1

### Cycle

Two uses of term in this lecture:

- Repetitive waveform
  - E.g. sine wave or square wave
- Graph cycle














### Synchronous Circuit Discipline

- Registers that sample inputs at clock edge and hold value throughout clock period
- Compute from registers to registers
- Clock cycle time large enough for longest logic path between registers
  - Min cycle = Max path delay between registers

















### Note Registers on Links

- Some links end up with multiple registers.
- Why?









### Add Registers and Move

- If we're willing to add pipeline delay
  - Add any number of pipeline registers at input
  - Move registers into circuit to reduce cycle time
    - Reduce max delay between registers




### Add Register and Retime

- Add chain of registers on every input
- Retime registers into circuit
  - Minimizing delay between registers


### Add Registers and Retime

- Lets us think about behavior
  - What the pipelining is doing to cycles of delay
- Separate from details of how to redistribute registers
- Behavioral equivalence between the registers-at-front and properly retimed versions of the circuit
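
The equivalence can be checked on a toy circuit. The following is a minimal sketch, assuming a two-stage combinational chain; `stage_f`, `stage_g`, `same_behavior`, and `demo_same_behavior` are hypothetical names chosen for illustration. It clocks a registers-at-front version and a retimed version side by side and compares their outputs after the first cycle, when both registers still hold reset state.

```c
#include <assert.h>

/* Two hypothetical combinational stages; any pure functions would do. */
unsigned stage_f(unsigned x) { return x * 3u + 1u; }
unsigned stage_g(unsigned x) { return x ^ (x >> 2); }

/* Returns 1 if the registers-at-front circuit and the retimed circuit
 * produce the same output on every cycle after the first. */
int same_behavior(const unsigned *x, int n)
{
    unsigned reg_front = 0;  /* register at the input, before f and g */
    unsigned reg_mid = 0;    /* same register retimed to sit between f and g */

    assert(n > 0);
    for (int t = 0; t < n; t++) {
        /* Registers-at-front: the path between registers is f followed by g. */
        unsigned out_front = stage_g(stage_f(reg_front));
        /* Retimed: the path between registers is f alone or g alone. */
        unsigned out_mid = stage_g(reg_mid);

        if (t >= 1 && out_front != out_mid)  /* cycle 0 shows only reset state */
            return 0;

        reg_front = x[t];           /* clock edge: both versions sample x[t] */
        reg_mid = stage_f(x[t]);
    }
    return 1;
}

/* Usage sketch on arbitrary input data. */
int demo_same_behavior(void)
{
    const unsigned x[6] = {4u, 9u, 0u, 17u, 250u, 3u};
    return same_behavior(x, 6);
}
```

Both versions add one cycle of delay from input to output; only the longest register-to-register path differs.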


### Justify Pipelining

(or composing pipelined operators)

Part 2


### **Handling Pipelined Operators**

- Given a pipelined operator
  - (or a pipelined interconnect)
- Discipline of picking a frequency target and designing everything for that
  - May be necessary to pipeline an operator since its delay is too high
- Due to hierarchy
  - Pipelined this operator and now want to use it as a building block

### **Examples**

- Run at 500MHz (2ns clock cycle)
- Floating-point unit that takes 9ns
  - Can pipeline into five 2ns stages
- Multiplier that takes 6ns
  - Can pipeline into three 2ns stages
- Memory can access in 2ns
  - Only if registers on address/inputs and output
  - i.e. it exists in its own clock stage
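
The stage counts follow from dividing each operator delay by the clock period and rounding up. A minimal sketch of that arithmetic, assuming delays are given in integer picoseconds (`stages_needed` and `period_ps_from_mhz` are hypothetical helper names):

```c
#include <assert.h>

/* Clock period in picoseconds for a frequency given in MHz.
 * Exact only when 1,000,000 is divisible by freq_mhz. */
int period_ps_from_mhz(int freq_mhz)
{
    assert(freq_mhz > 0);
    return 1000000 / freq_mhz;
}

/* Number of pipeline stages needed so that each stage fits in one clock
 * period: ceil(delay / period), computed with integer arithmetic. */
int stages_needed(int delay_ps, int period_ps)
{
    assert(delay_ps > 0 && period_ps > 0);
    return (delay_ps + period_ps - 1) / period_ps;
}
```

With a 500MHz target, `period_ps_from_mhz(500)` gives 2000, so `stages_needed(9000, 2000)` is 5 for the floating-point unit, `stages_needed(6000, 2000)` is 3 for the multiplier, and `stages_needed(2000, 2000)` is 1 for the memory. This ignores register setup and clock-to-output overhead, which a real design would have to budget for.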


### Interconnect Delay

- Chip-crossing delay >> clock cycle
- Chip may be 100s of operators wide
- May only be able to reach across 10 operators in a 2ns cycle
- Must pipeline long interconnect links
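
As a rough sketch of what this implies, the number of clock cycles a link needs is its length divided by the per-cycle reach, rounded up. The function name `link_cycles` and the choice of measuring distance in operator widths are assumptions made for illustration.

```c
#include <assert.h>

/* Clock cycles needed for a signal to travel `distance_ops` operator widths
 * when it can cross at most `reach_ops` operator widths per cycle. */
int link_cycles(int distance_ops, int reach_ops)
{
    assert(distance_ops >= 0 && reach_ops > 0);
    return (distance_ops + reach_ops - 1) / reach_ops;
}
```

For a hypothetical chip exactly 100 operators wide with a reach of 10 operators per 2ns cycle, `link_cycles(100, 10)` is 10, so a corner-to-corner link costs 10 cycles. A 25-operator link costs `link_cycles(25, 10)`, which is 3.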


### Interconnect Example

### Methodology: **Pipelined Operator Graph**

- Start with logical, unpipelined graph
- Treat each pipelined operator as a set of unit-delay operators of mandatory depth
- Treat each interconnect pipeline stage as a unit-delay buffer
- Add registers at input
- Retime into graph



### Pipeline Loop (also used as the justify-pipelining example)



### **Example Operators**

- Operator and interconnect delays
  - Multiplier takes 3 cycles
  - Reading from input array
    - Memory op is the cycle after computing the address
    - Takes one cycle of delay to bring data back to the multiplier










### **Pipelining Lesson**

- Can always pipeline an acyclic graph (no graph cycles) to a fixed frequency target
  - Fixed pipelining of primitive operators
  - Pipeline interconnect delays
- Need to keep track of registers to balance paths
  - So operators see consistent delays on all their inputs
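
Balancing paths can be sketched as a single pass over the acyclic graph in topological order. This is an illustrative sketch, not a retiming algorithm from the lecture: the `struct op` layout, `balance_paths`, and the three-operator demo graph (computing something like `i * a[i]`) are all assumptions. Each operator starts when its latest input arrives, and every earlier input gets extra registers to make up the difference.

```c
#include <assert.h>

#define MAX_IN 2

/* One operator in an acyclic dataflow graph. Operators must be listed in
 * topological order, so every producer index is smaller than the consumer's. */
struct op {
    int latency;        /* pipeline depth of the operator, in cycles */
    int n_in;           /* number of inputs actually used */
    int pred[MAX_IN];   /* index of the producer of each input */
    int wire[MAX_IN];   /* interconnect pipeline cycles on each input link */
};

/* Fills start[i], the cycle at which operator i begins, and extra[i][k], the
 * balancing registers to add on input k so that all inputs of operator i
 * arrive in the same cycle. Returns the cycle at which the output of the
 * last listed operator is available. */
int balance_paths(const struct op *g, int n, int *start, int extra[][MAX_IN])
{
    assert(n > 0);
    for (int i = 0; i < n; i++) {
        int arrive[MAX_IN] = {0};
        start[i] = 0;                     /* primary inputs arrive at cycle 0 */
        for (int k = 0; k < g[i].n_in; k++) {
            int p = g[i].pred[k];
            assert(p >= 0 && p < i);      /* topological order, no graph cycles */
            arrive[k] = start[p] + g[p].latency + g[i].wire[k];
            if (arrive[k] > start[i])
                start[i] = arrive[k];
        }
        for (int k = 0; k < MAX_IN; k++)
            extra[i][k] = (k < g[i].n_in) ? start[i] - arrive[k] : 0;
    }
    return start[n - 1] + g[n - 1].latency;
}

/* Usage sketch: op 0 computes an index/address (1 cycle), op 1 reads memory
 * (1 cycle), op 2 is a 3-cycle multiplier fed by the memory data (1 cycle of
 * wire delay) and directly by the index. Returns the balancing registers
 * needed on the multiplier's direct index input. */
int demo_index_balance(void)
{
    const struct op g[3] = {
        { .latency = 1, .n_in = 0 },
        { .latency = 1, .n_in = 1, .pred = {0}, .wire = {0} },
        { .latency = 3, .n_in = 2, .pred = {1, 0}, .wire = {1, 0} },
    };
    int start[3], extra[3][MAX_IN];
    int total = balance_paths(g, 3, start, extra);
    assert(total == 6);
    return extra[2][1];
}
```

In the demo, the memory path reaches the multiplier at cycle 3 while the direct index path would arrive at cycle 1, so the direct link needs 2 extra registers. This is one way a single link ends up carrying multiple registers.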








### (Graph) Cycle Observation

- Retiming does not allow us to change the number of registers inside a graph cycle.
- Lower bound on clock cycle time
  - Max delay in graph cycle / Registers in graph cycle
- Pipelining doesn't help inside graph cycle
  - Cannot push registers into graph cycle
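
Written as a formula, with $d(c)$ and $r(c)$ as notation introduced here for the total logic delay and the register count on graph cycle $c$, the bound reads:

$$
T_{\text{clk}} \;\ge\; \max_{c \,\in\, \text{graph cycles}} \frac{d(c)}{r(c)}
$$

For instance, a hypothetical graph cycle with $d(c) = 6\,\text{ns}$ and $r(c) = 1$ forces $T_{\text{clk}} \ge 6\,\text{ns}$ no matter how deeply the rest of the graph is pipelined, because retiming leaves $r(c)$ unchanged.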






### Loop

- Consider [multiply and mod each take 3 cycles]:
  - `for (i=1; i<N; i++) C[i] = (C[i-1]*A[i]) % N;`
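
The loop-carried dependence through `C[i-1]` puts the multiply and the mod on a graph cycle. A minimal sketch of the resulting initiation interval (II) bound, assuming those two operators are the only delays on the cycle; `recurrence_ii` and `loop_cycles` are hypothetical helper names:

```c
#include <assert.h>

/* Lower bound on the initiation interval imposed by one loop-carried
 * dependence: the operators on the graph cycle take `cycle_latency` clock
 * cycles in total, and the value is consumed `distance` iterations later. */
int recurrence_ii(int cycle_latency, int distance)
{
    assert(cycle_latency > 0 && distance > 0);
    return (cycle_latency + distance - 1) / distance;
}

/* Clock cycles to run n iterations at a given II on a pipeline in which one
 * iteration takes `depth` cycles from start to finish. */
long loop_cycles(long n, int ii, int depth)
{
    assert(n > 0 && ii > 0 && depth > 0);
    return (n - 1) * ii + depth;
}
```

With a 3-cycle multiply and a 3-cycle mod, `recurrence_ii(3 + 3, 1)` is 6: a new iteration can start only every 6 cycles, so 100 iterations take `loop_cycles(100, 6, 6)`, which is 600 cycles, versus 105 cycles if II=1 were achievable.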


















### **Vector Pipelines**

- Data Parallel Vector Operations are interesting even when Vector Lanes < Vector Length
- Within Vector operation, data parallel so no cyclic dependencies
  - So get an II=1 issuing Vector Lane operations
  - May have data dependences between Vector operations




### Dependence between Vector Operations

    for (int i=0; i<32; i++) c[i] += a[i]*b[i];
    for (int i=0; i<32; i++) c[i] += d[i]*e[i];
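
To make the lane structure concrete, here is a sketch that issues each vector operation in groups of `LANES` elements. The lane count of 4, the function `vmac`, and the test data in `demo_c5` are assumptions chosen for illustration; only the vector length of 32 comes from the loops above.

```c
#include <assert.h>

#define VLEN  32   /* vector length, from the loops above */
#define LANES 4    /* hypothetical lane count, smaller than VLEN */

/* One vector multiply-accumulate, c[i] += x[i]*y[i], issued LANES elements
 * at a time. Elements within the operation are independent, so a new group
 * could issue every cycle (II=1). Returns the number of groups issued. */
int vmac(int *c, const int *x, const int *y)
{
    int issues = 0;
    for (int base = 0; base < VLEN; base += LANES) {
        for (int lane = 0; lane < LANES && base + lane < VLEN; lane++)
            c[base + lane] += x[base + lane] * y[base + lane];
        issues++;
    }
    return issues;
}

/* Runs both vector operations back to back on small test data and returns
 * c[5]. The second vmac reads the c[] values written by the first, which is
 * the data dependence between the two vector operations. */
int demo_c5(void)
{
    int a[VLEN], b[VLEN], d[VLEN], e[VLEN], c[VLEN];
    for (int i = 0; i < VLEN; i++) {
        a[i] = i; b[i] = 2; d[i] = 1; e[i] = i + 1; c[i] = 0;
    }
    int n1 = vmac(c, a, b);   /* c[i] = 2*i */
    int n2 = vmac(c, d, e);   /* c[i] = 2*i + (i + 1) */
    assert(n1 == VLEN / LANES && n2 == VLEN / LANES);
    return c[5];
}
```

Each operation issues 8 groups here. Element `i` of the second operation needs only element `i` of the first, so the dependence is between operations, never between elements of the same operation.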



### Big Ideas

- Pipeline computations to reuse hardware and maximize computational capacity
- Can compose pipelined operators and accommodate fixed-frequency target
  - Be careful with data retiming
- Graph cycles limit pipelining on single stream
- C-slow to share hardware among multiple, data-parallel streams


### Admin

- Remember Feedback form
  - Including HW3
- Reading for Day 8 on web
- HW4 due Friday


### C-Slow

(Probably record separately) Part 4

### **Problem**

- Pipelining cannot push registers into a graph cycle
- Graph cycles can prevent running at full pipeline target (maximum clock frequency)
- If we are not reusing operators at the full pipeline target, we are underutilizing resources
- Can we use the resources for something?


### C-Slow

- Observation: if we have data-level parallelism, can use it to solve independent problems on the same hardware
- Transformation: make C copies of each register
- **Guarantee:** C computations operate independently
  - Do not interact with each other
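
The transformation can be simulated in software. The sketch below assumes a simple multiply-mod recurrence as the logic inside the graph cycle; the names `step`, `run_single`, `run_cslow`, `demo_equivalent`, and the constants `CSLOW` and `MODULUS` are hypothetical. The single state register becomes a chain of `CSLOW` registers, and inputs from the independent streams are interleaved one per clock.

```c
#include <assert.h>

#define CSLOW   2          /* C: number of interleaved streams */
#define MODULUS 1000003u   /* hypothetical modulus for the example recurrence */

/* The combinational logic inside the graph cycle (unchanged by C-slow). */
unsigned step(unsigned state, unsigned in)
{
    return (unsigned)(((unsigned long long)state * in) % MODULUS);
}

/* Reference: one stream on the original circuit with one state register. */
unsigned run_single(const unsigned *in, int n)
{
    unsigned reg = 1;
    for (int t = 0; t < n; t++)
        reg = step(reg, in[t]);
    return reg;
}

/* C-slowed circuit: the state register is replaced by a chain of CSLOW
 * registers. Inputs arrive interleaved, one per clock: s0[0], s1[0],
 * s0[1], s1[1], ... Writes each stream's final state to out[]. */
void run_cslow(const unsigned *interleaved, int n_per_stream, unsigned out[CSLOW])
{
    unsigned chain[CSLOW];
    assert(n_per_stream > 0);
    for (int k = 0; k < CSLOW; k++)
        chain[k] = 1;
    for (int t = 0; t < n_per_stream * CSLOW; t++) {
        unsigned next = step(chain[CSLOW - 1], interleaved[t]);
        for (int k = CSLOW - 1; k > 0; k--)   /* shift the register chain */
            chain[k] = chain[k - 1];
        chain[0] = next;
    }
    /* After the last clock, chain[0] holds the stream that went last. */
    for (int s = 0; s < CSLOW; s++)
        out[s] = chain[CSLOW - 1 - s];
}

/* Usage sketch: two short streams give the same results interleaved on the
 * 2-slow circuit as when each runs alone on the original circuit. */
int demo_equivalent(void)
{
    const unsigned s0[3] = {3u, 5u, 7u}, s1[3] = {2u, 4u, 6u};
    unsigned mixed[2 * 3], out[CSLOW];
    for (int i = 0; i < 3; i++) {
        mixed[2 * i] = s0[i];
        mixed[2 * i + 1] = s1[i];
    }
    run_cslow(mixed, 3, out);
    return out[0] == run_single(s0, 3) && out[1] == run_single(s1, 3);
}
```

Each stream only ever sees state it produced itself, so the streams do not interact. In hardware, the extra registers in the chain can then be retimed into the logic of the graph cycle to shorten the clock period.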






### Equivalence

- The 2-slow operator is equivalent to two data parallel operators running at half the speed
  - E.g. processing separate audio channels

### Automation

- No mainstream tool today will perform C-slow transformation for you automatically
- Synthesis tools will retime registers


### Lesson

- Cyclic dependencies limit throughput on a single task or data stream
  - Cycle-length / registers-in-cycle
- Can use C-slow to share the hardware among C independent (data parallel) tasks
