# ESE5320: System-on-a-Chip Architecture

Day 7: September 26, 2022 Pipelining




#### Previously

- Pipelining in the large
  - Not just for gate-level circuits
- Throughput and Latency
- Pipelining as a form of parallelism

Penn ESE5320 Fall 2022 -- DeHon


#### Today

Pipelining details (for gates, primitive ops)

- Systematic Approach (Part 1)
- Justify Operator and Interconnect Pipelining (Part 2)
- Loop Bodies
- Cycles in the Dataflow Graph (Part 3)
- C-slow [supplemental recording] (Part 4)


#### Message

Pipelining is an efficient way to reuse hardware to perform the same set of operations at high throughput.


#### Multiplexer Gate

- MUX
  - When S=0, output=i0
  - When S=1, output=i1

| S | i0 | i1 | Mux2(S,i0,i1) |
|---|----|----|---------------|
| 0 | 0  | 0  | 0             |
| 0 | 0  | 1  | 0             |
| 0 | 1  | 0  | 1             |
| 0 | 1  | 1  | 1             |
| 1 | 0  | 0  | 0             |
| 1 | 0  | 1  | 1             |
| 1 | 1  | 0  | 0             |
| 1 | 1  | 1  | 1             |

#### Cycle

Two uses of the term in this lecture:

- Repetitive waveform
  - E.g. sine wave or square wave
- Graph cycle




#### Latch

- Element that can hold a previous value of an input

#### Register

- Use a pair of latches to create a flip-flop
  - Also called a register
- What happens when
  - CLK is low (0)?
  - CLK is high (1)?
  - CLK transitions from 0 to 1?
- What is output Q until the next 0 to 1 CLK transition?
- Sample D input on 0→1 transition of clock (CLK)
- Never an open path from $D \rightarrow Q$
  - One of the mux latches is always in hold state
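The flip-flop behavior described above can be sketched in C as two mux-based latches in series. This is a minimal behavioral model, not a circuit description: the names `flipflop` and `ff_eval` are hypothetical, one-bit signals are modeled as `int`, and the latch polarity (first latch transparent while CLK is low) is an assumption chosen so that Q samples D on the 0→1 transition.

```c
#include <assert.h>

/* 2-input multiplexer: returns i0 when s == 0, i1 when s == 1. */
static int mux2(int s, int i0, int i1) { return s ? i1 : i0; }

/* State of a flip-flop built from two mux-based latches. */
typedef struct {
    int master; /* latch that follows D while CLK is low */
    int slave;  /* latch that follows master while CLK is high; drives Q */
} flipflop;

/* Evaluate the flip-flop for one (clk, d) input pair and return Q.
 * Each latch is a mux that either passes its input or recirculates
 * its own stored value (the hold state). */
int ff_eval(flipflop *ff, int clk, int d)
{
    assert(clk == 0 || clk == 1);
    /* Master: transparent when clk == 0, holds when clk == 1. */
    ff->master = mux2(clk, d, ff->master);
    /* Slave: holds when clk == 0, transparent when clk == 1. */
    ff->slave = mux2(clk, ff->slave, ff->master);
    return ff->slave;
}
```

For either value of `clk`, exactly one of the two muxes selects its own stored value, so there is no combinational path from D to Q in this model.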

#### Synchronous Circuit Discipline

- Registers sample inputs at the clock edge and hold the value throughout the clock period
- Compute from registers to registers
- Clock cycle time must be large enough for the longest logic path between registers
  - Min cycle = Max path delay between registers
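As a small illustration of the rule that the minimum cycle equals the maximum register-to-register path delay, the sketch below takes a list of path delays and returns the worst one. The function name `min_cycle_ps` and the choice of picoseconds as the unit are assumptions made for this example.

```c
#include <assert.h>

/* Minimum clock period (in picoseconds) for a synchronous circuit:
 * the largest register-to-register path delay.
 * path_delay_ps[] holds one entry per register-to-register path. */
int min_cycle_ps(const int *path_delay_ps, int num_paths)
{
    assert(num_paths > 0);
    int worst = path_delay_ps[0];
    for (int i = 1; i < num_paths; i++)
        if (path_delay_ps[i] > worst)
            worst = path_delay_ps[i];
    return worst;
}
```

With hypothetical path delays of 1200ps, 2000ps, and 800ps, the function returns 2000, so the clock period can be no shorter than 2ns.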








#### Preclass 2: What Happens?

- What would be wrong with this pipelining?










#### Note Registers on Links

- Some links end up with multiple registers.
- Why?

#### Consistent Pipelining

- Makes sure a consistent input set arrives at each gate/operator
  - Don't get mixing between input sets
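One way to check for consistent pipelining is to count registers along every path from the primary inputs and require that all inputs of a gate agree. The sketch below is an illustrative checker under stated assumptions: the graph is acyclic, the `edge` type and `is_consistent` name are hypothetical, and edges are listed so that every edge into a node appears before any edge out of it.

```c
#include <assert.h>

#define MAX_NODES 64

/* One link in the pipelined graph: src feeds dst through `regs` registers. */
typedef struct { int src, dst, regs; } edge;

/* Returns 1 if every node sees all of its inputs delayed by the same
 * number of registers from the primary inputs, 0 otherwise.
 * Assumes an acyclic graph with edges in topological order. */
int is_consistent(const edge *edges, int num_edges, int num_nodes)
{
    int depth[MAX_NODES]; /* registers between primary inputs and node */
    int seen[MAX_NODES];  /* whether depth[] has been set for the node */
    assert(num_nodes > 0 && num_nodes <= MAX_NODES);
    for (int n = 0; n < num_nodes; n++) { depth[n] = 0; seen[n] = 0; }

    for (int e = 0; e < num_edges; e++) {
        int arrival = depth[edges[e].src] + edges[e].regs;
        if (!seen[edges[e].dst]) {
            depth[edges[e].dst] = arrival;
            seen[edges[e].dst] = 1;
        } else if (depth[edges[e].dst] != arrival) {
            return 0; /* inputs from different input sets would mix */
        }
    }
    return 1;
}
```

Consider a hypothetical graph where inputs 0 and 1 feed gate 2, and gate 2 and input 1 feed gate 3. With one register on every link, gate 3 receives input 1 one cycle earlier than the result of gate 2, so the check fails. Putting two registers on the link from input 1 to gate 3 makes it pass, which is why some links end up with multiple registers.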


#### Legal Register Moves

- Retiming Lag/Lead
  - Lag: remove a register from every input, add a register to every output
  - Lead: remove a register from every output, add a register to every input
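The two moves can be sketched as an operation on a graph whose links carry register counts. This is an illustrative model only: the `edge` type and `retime_node` name are hypothetical, and the legality rule used here (every link that gives up a register must currently hold at least one) is the natural reading of the moves as stated above.

```c
#include <assert.h>

/* One link in the graph: src feeds dst through `regs` registers. */
typedef struct { int src, dst, regs; } edge;

/* Move one register across `node`.
 * dir = +1: take one register off every input link of the node and put
 *           one on every output link (the lag move as stated above).
 * dir = -1: take one off every output link and put one on every input
 *           link (the lead move).
 * Returns 1 on success. Returns 0 and leaves the graph unchanged if some
 * link on the removing side has no register to give up. */
int retime_node(edge *edges, int num_edges, int node, int dir)
{
    assert(dir == 1 || dir == -1);
    /* Legality: every link we remove from must currently hold a register. */
    for (int e = 0; e < num_edges; e++) {
        int removing = (dir == 1) ? (edges[e].dst == node)
                                  : (edges[e].src == node);
        if (removing && edges[e].regs < 1)
            return 0;
    }
    for (int e = 0; e < num_edges; e++) {
        if (edges[e].dst == node) edges[e].regs -= dir; /* input link */
        if (edges[e].src == node) edges[e].regs += dir; /* output link */
    }
    return 1;
}
```

Because a register is removed from every link on one side and added to every link on the other, every path through the node keeps the same total register count.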






#### Add Registers and Move

- If we're willing to add pipeline delay
  - Add any number of pipeline registers at input
  - Move registers into circuit to reduce cycle time
    - Reduce max delay between registers


#### Add Register and Retime

- Add chain of registers on every input
- Retime registers into circuit
  - Minimizing delay between registers
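For the special case of a simple chain of gates, the effect of adding registers and retiming them into the circuit can be sketched as choosing where to cut the chain into stages. The code below is a simplified illustration, not the general retiming algorithm: it assumes a linear chain, integer gate delays, and hypothetical function names, and it uses a greedy fill plus binary search chosen for this example.

```c
#include <assert.h>

/* Number of pipeline stages needed for a chain of gates when no stage
 * may exceed `cycle` total delay (greedy: fill each stage as far as
 * possible). Returns -1 if a single gate is slower than `cycle`. */
int stages_for_cycle(const int *delay, int n, int cycle)
{
    int stages = 1, used = 0;
    for (int i = 0; i < n; i++) {
        if (delay[i] > cycle) return -1;
        if (used + delay[i] > cycle) { /* place a register before gate i */
            stages++;
            used = 0;
        }
        used += delay[i];
    }
    return stages;
}

/* Smallest cycle time achievable when the chain is split into at most
 * `max_stages` stages by registers retimed into the chain. */
int best_cycle(const int *delay, int n, int max_stages)
{
    assert(n > 0 && max_stages > 0);
    int lo = 0, hi = 0;
    for (int i = 0; i < n; i++) {
        if (delay[i] > lo) lo = delay[i]; /* cannot beat the slowest gate */
        hi += delay[i];                   /* a single stage always works */
    }
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (stages_for_cycle(delay, n, mid) <= max_stages)
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

For a hypothetical chain with gate delays {1, 2, 1, 3, 2, 1}, `best_cycle` returns 10 with one stage, 6 with two stages, 4 with three, and 3 with four. Beyond that point the slowest single gate sets the limit.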




#### **Justify Pipelining**

(or composing pipelined operators) Part 2


#### **Examples**

- Run at 500MHz (2ns clock cycle)
- Floating-point unit that takes 9ns
  - Can pipeline into five 2ns stages
- Multiplier that takes 6ns
- Memory can access in 2ns
  - Only if registers on address/inputs and output
  - i.e. exists in its own clock stage
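The stage counts in these examples follow from dividing each operator delay by the clock period and rounding up. A minimal sketch, assuming the operator can be cut into equal stages and ignoring register overhead (the function names are hypothetical, and delays are in picoseconds to stay in integer arithmetic):

```c
#include <assert.h>

/* Clock period in picoseconds for a frequency target given in MHz. */
int cycle_ps_for_mhz(int freq_mhz)
{
    assert(freq_mhz > 0);
    return 1000000 / freq_mhz;
}

/* Pipeline stages an operator needs so each stage fits in one period. */
int stages_needed(int op_delay_ps, int cycle_ps)
{
    assert(op_delay_ps > 0 && cycle_ps > 0);
    return (op_delay_ps + cycle_ps - 1) / cycle_ps; /* ceiling division */
}
```

At 500MHz the period is 2000ps, so the 9ns floating-point unit needs 5 stages, the 6ns multiplier needs 3, and the 2ns memory fits in 1.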


#### Interconnect Example

#### **Handling Pipelined Operators**

- Given a pipelined operator (or a pipelined interconnect)
- Discipline of picking a frequency target and designing everything for that
  - May be necessary to pipeline operator since its delay is too high
- Due to hierarchy
  - Pipelined this operator and now want to use it as a building block


#### Interconnect Delay

- Chips >> Clock Cycles
  - A signal cannot cross the whole chip in one clock cycle
- May have a chip 100s of operators wide
- May only be able to reach across 10 operators in a 2ns cycle
- Must pipeline long interconnect links


#### Methodology: **Pipelined Operator Graph**

- Start with logical, unpipelined graph
- Treat each pipelined operator as a set of unit-delay operators of mandatory depth
- Treat each interconnect pipeline stage as a unit-delay buffer
- Add registers at input
- Retime into graph
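The outcome of this methodology on an acyclic graph can be sketched as a scheduling pass: each operator starts when its latest input arrives, and links that arrive early are padded with extra registers. The code is an illustrative model under stated assumptions, not a retiming tool: the `conn` type, the `balance` name, and the example graph are hypothetical, and links must be listed in topological order.

```c
#include <assert.h>

/* Link from src to dst. wire_stages is its mandatory interconnect
 * pipeline depth; extra_regs is filled in by balance(). */
typedef struct { int src, dst, wire_stages, extra_regs; } conn;

/* depth[n]: mandatory pipeline depth in cycles of operator n (0 for a
 * primary input). Links must be in topological order: all links into a
 * node before any link out of it. Fills start[] with the cycle in which
 * each operator receives its inputs, sets extra_regs so all inputs of
 * an operator arrive together, and returns the graph latency in cycles. */
int balance(const int *depth, int num_nodes, conn *links, int num_links,
            int *start)
{
    assert(num_nodes > 0 && num_links >= 0);
    for (int n = 0; n < num_nodes; n++) start[n] = 0;

    /* Pass 1: an operator starts when its latest input arrives. */
    for (int l = 0; l < num_links; l++) {
        int s = links[l].src;
        int arrive = start[s] + depth[s] + links[l].wire_stages;
        if (arrive > start[links[l].dst]) start[links[l].dst] = arrive;
    }
    /* Pass 2: pad early-arriving links with balancing registers. */
    for (int l = 0; l < num_links; l++) {
        int s = links[l].src;
        int arrive = start[s] + depth[s] + links[l].wire_stages;
        links[l].extra_regs = start[links[l].dst] - arrive;
    }
    int latency = 0;
    for (int n = 0; n < num_nodes; n++)
        if (start[n] + depth[n] > latency) latency = start[n] + depth[n];
    return latency;
}
```

As a hypothetical example, let node 0 be an input, node 1 a 1-cycle memory, node 2 a 3-cycle multiplier, and node 3 a 1-cycle adder, with one interconnect stage between memory and multiplier. The input also feeds the multiplier and the adder directly. Those two direct links then need 2 and 5 balancing registers respectively, and the graph latency is 6 cycles.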




#### Pipeline Loop

(and use for justify pipeline example)



#### **Example Operators**

- Operator and Interconnect delays
  - Multiplier: 3 cycles
  - Reading from Input array
    - Memory op is the cycle after computing the address
    - Takes one cycle of delay to bring data back to multiplier (or adder)


















#### Pipeline Graph

- Result after next retime (top register)?









#### **Pipelining Lesson**

- Can always pipeline an acyclic graph (no graph cycles) to a fixed frequency target
  - Fixed pipelining of primitive operators
  - Pipeline interconnect delays
- Need to keep track of registers to balance paths
  - So operators see consistent delays




#### Preclass 3

- Can we retime to reduce clock cycle time?

#### **Retiming Limits?**

- What prevents us from retiming?

#### (Graph) Cycle Observation

- Retiming does not allow us to change the number of registers inside a graph cycle.
- Limit to clock cycle time
  - Max delay in graph cycle / Registers in graph cycle
- Pipelining doesn't help inside graph cycle
  - Cannot push registers into graph cycle
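The limit can be written as a small calculation: for each graph cycle, divide the total delay around the cycle by the number of registers on it, and take the worst case over all graph cycles. A minimal sketch, with the `graph_cycle` type, the function name, and picosecond units assumed for illustration:

```c
#include <assert.h>

/* One graph cycle (feedback loop): total logic delay around the loop
 * and the number of registers on it. Retiming cannot change `regs`. */
typedef struct { int delay_ps; int regs; } graph_cycle;

/* Lower bound on the clock period: the worst ceil(delay / registers)
 * over all graph cycles. */
int cycle_time_bound_ps(const graph_cycle *loops, int num_loops)
{
    int bound = 0;
    for (int i = 0; i < num_loops; i++) {
        assert(loops[i].regs > 0); /* a loop needs at least one register */
        int per_reg = (loops[i].delay_ps + loops[i].regs - 1) / loops[i].regs;
        if (per_reg > bound) bound = per_reg;
    }
    return bound;
}
```

For hypothetical loops of 6000ps with 3 registers and 5000ps with 2 registers, the bound is 2500ps. Adding a third loop of 6000ps with a single register raises it to 6000ps, and no amount of retiming elsewhere in the graph lowers it.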


#### Initiation Interval (II)

- Cyclic dependencies in a dataflow graph can limit throughput
- Due to data-dependent cycles in graph,
  - May not be able to initiate a new computation on every clock cycle
- II: clock cycles (delay) before the next computation can be initiated
- Throughput = 1/II
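To make the effect of II concrete, the sketch below computes throughput and the total time to finish a batch of computations. The formula latency + (n - 1) * II is the usual model for a pipeline that starts a new computation every II cycles. The function names are hypothetical.

```c
#include <assert.h>

/* Results produced per clock cycle in steady state. */
double throughput(int ii)
{
    assert(ii > 0);
    return 1.0 / ii;
}

/* Clock cycles to finish n computations on a pipeline with the given
 * latency when a new computation starts every ii cycles: the first
 * result takes `latency` cycles, and each later one follows ii cycles
 * after the previous. */
long total_cycles(long n, int latency, int ii)
{
    assert(n > 0 && latency > 0 && ii > 0);
    return latency + (n - 1) * ii;
}
```

For 100 computations on a hypothetical 6-cycle pipeline, II=1 gives 105 cycles while II=6 gives 600 cycles, so for long runs the II matters far more than the latency.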




#### Loop

- Consider
  - for (i=0;i<N;i++) C[i]=(C[i-1]\*A[i])%N;
  - [multiply and mod each take 3 cycles]
- Initiation Interval?

[Figure: pipelined dataflow graph of the loop with nodes Mem, C[i-1], A[i], Mul1, Mul2, Mul3, Mod1, Mod2, Mod3, and output C[i].]
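The loop-carried dependence can be explored with a small runnable model that computes the recurrence and also records when each iteration could start. This is a sketch under stated assumptions, offered as one reading of the question the slide poses: it starts at i=1 with C[0] supplied by the caller, uses a separate modulus `m` where the slide reuses N, assumes one A[i] arrives per cycle, and charges 3 cycles each for multiply and mod as stated above.

```c
#include <assert.h>

#define MUL_CYCLES 3 /* from the slide: multiply takes 3 cycles */
#define MOD_CYCLES 3 /* from the slide: mod takes 3 cycles */

/* Computes C[i] = (C[i-1] * A[i]) % m for i = 1..n-1, with C[0]
 * supplied by the caller, and records in start[i] the earliest clock
 * cycle at which iteration i can enter the multiplier. Returns the
 * steady-state initiation interval, start[n-1] - start[n-2]. */
int run_loop(const unsigned *A, unsigned *C, int *start, int n, unsigned m)
{
    assert(n >= 3 && m > 0);
    int ready_prev = 0; /* cycle at which C[0] is available */
    start[0] = 0;
    for (int i = 1; i < n; i++) {
        /* A[i] is assumed available by cycle i, but the multiply cannot
         * begin until C[i-1] has been produced. */
        start[i] = (ready_prev > i) ? ready_prev : i;
        C[i] = (C[i - 1] * A[i]) % m;
        ready_prev = start[i] + MUL_CYCLES + MOD_CYCLES;
    }
    return start[n - 1] - start[n - 2];
}
```

Under these assumptions each iteration must wait for the full multiply and mod of the previous one, so the model reports an interval of 6 cycles between iteration starts even though both operators are pipelined.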







#### II and Latency

- II? (assume willing to pipeline inputs)
- Latency?



#### Lesson

- Cyclic dependencies limit throughput on single task or data stream
  - Cycle-length / registers-in-cycle

#### **Vector Pipelines**

- Data Parallel Vector Operations are interesting even when
  - Vector Lanes<Vector Length
- Within Vector operation, data parallel so no cyclic dependencies
  - So get an II=1 issuing Vector Lane operations
  - May have data dependences between Vector operations
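A rough cycle count for one vector operation follows from these points: with no dependencies between elements, a new group of elements can issue to the lanes every cycle. The sketch below assumes a simple model in which every lane is an identical pipeline. The function name and the example lane count and depth are hypothetical.

```c
#include <assert.h>

/* Clock cycles to complete one vector operation over `vector_length`
 * elements on `lanes` parallel lanes, each a pipeline `depth` stages
 * deep. Elements are independent, so a new group of `lanes` elements
 * issues every cycle (II = 1). */
int vector_op_cycles(int vector_length, int lanes, int depth)
{
    assert(vector_length > 0 && lanes > 0 && depth > 0);
    int issue_cycles = (vector_length + lanes - 1) / lanes;
    return depth + issue_cycles - 1;
}
```

For a 32-element vector on 4 lanes with a 5-stage pipeline, this model gives 8 issue cycles and 12 cycles in total. With 32 lanes the whole vector issues at once and only the 5-cycle pipeline depth remains.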


#### Dependence between Vector Operations

for (int i=0;i<32; i++) d[i]=a[i]\*b[i]+c[i]

for (int i=0;i<32; i++)




#### Big Ideas

- Pipeline computations to reuse hardware and maximize computational capacity
- Can compose pipelined operators and accommodate fixed-frequency target
  - Be careful with data retiming
- Graph cycles limit pipelining on single stream – II (Initiation Interval)
- C-slow to share hardware among multiple, data-parallel streams (part 4)


#### Admin

- Remember Feedback form
  - Including HW3
- Reading for Day 8 on web
- HW4 due Friday


#### C-Slow

(See uploaded recording)
Part 4



#### C-Slow

- Observation: if we have data-level parallelism, can use it to solve independent problems on the same hardware
- Transformation: make C copies of each register
- Guarantee: C computations operate independently
  - Do not interact with each other
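The transformation can be illustrated on the simplest feedback loop, an accumulator y = y + x. The sketch below is a behavioral model with C=2, not generated hardware: the `cslow_acc` type and `cslow_tick` name are hypothetical, and the accumulator was chosen only because its result is easy to check by hand.

```c
#include <assert.h>

#define C_SLOW 2 /* number of interleaved streams (C) */

/* A C-slowed accumulator: the single feedback register of y = y + x
 * is replaced by a chain of C_SLOW registers. */
typedef struct { int reg[C_SLOW]; } cslow_acc;

/* One clock tick: combine the oldest register value with input x, then
 * shift the chain. The returned value belongs to the same stream as x. */
int cslow_tick(cslow_acc *a, int x)
{
    assert(a);
    int sum = a->reg[C_SLOW - 1] + x;    /* feedback from end of chain */
    for (int i = C_SLOW - 1; i > 0; i--) /* shift the register chain */
        a->reg[i] = a->reg[i - 1];
    a->reg[0] = sum;
    return sum;
}
```

Feeding two streams on alternating cycles, for example 1, 2, 3 interleaved with 10, 20, 30, produces 1, 10, 3, 30, 6, 60. Each stream sees only its own running sum, because the value fed back on any tick was written C_SLOW ticks earlier by the same stream.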


#### 2-Slow Simple Cycle

- Replace register with pair
- Retime
- Observe independence of red/blue computations

#### Equivalence

- The 2-slow operator is equivalent to two data-parallel operators running at half the speed



#### Automation

- No mainstream tool today will perform C-slow transformation for you automatically
- Synthesis tools will retime registers


#### Lesson

- Cyclic dependencies limit throughput on single task or data stream
  - II=Cycle-length / registers-in-cycle
- Can use the hardware on C (C<=II) independent (data-parallel) tasks


#### Big Ideas

- Pipeline computations to reuse hardware and maximize computational capacity
- Can compose pipelined operators and accommodate fixed-frequency target
  - Be careful with data retiming
- Graph cycles limit pipelining on single stream -- II
- C-slow (C<=II) to share hardware among multiple, data-parallel streams (part 4)
