# ESE5320: System-on-a-Chip Architecture

Day 5: September 18, 2023 **Dataflow Process Model** 



Penn ESE5320 Fall 2023 -- DeHon


Today

**Dataflow Process Model** 

- Terms (part 1)
- Issues
- Abstraction
- Performance Prospects (part 2)
- Basic Approach
- As time permits (part 3)
  - Dataflow variants
  - Motivations/demands for variants

#### Message

- Parallelism can be natural
- Expression can be agnostic to substrate
  - Abstract out implementation details
  - Tolerate variable delays that may arise in implementation
- Divide-and-conquer
  - Start with coarse-grain streaming dataflow
- Basis for performance optimization and parallelism exploitation


#### Programmable SoC

- Implementation platform for innovation
  - This is what you target (avoid NRE)
  - Implementation vehicle

Reminder

Goal: exploit parallelism on heterogeneous PSoC to achieve desired performance (energy)




#### Term: Process

- Abstraction of a processor
- Looks like each process is running on a separate processor
- Has own state, including
  - Program Counter (PC)
  - Memory
  - Input/output
- May not actually run on processor
  - Could be specialized hardware block
- May share a processor


#### **Thread**

- Has a separate control location (PC)
- May share memory (contrast process)
  - Run in common address space with other threads
- May not actually run on processor
  - Could be specialized hardware block
  - May share a processor


#### **FIFO** (Day 4)

- Hardware block
- Outputs data in order received
  - First-In, First-Out
- Tell it when you are providing data
  - Write
  - May choose not to insert on a cycle
    - Need to signal
- Tell it when you are consuming data
  - Read
- Tells you when it's empty and has no data to provide
- Tells you when it's full and can hold nothing else

[Figure: FIFO block with DataIn, DataOut, and Full signals]
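The FIFO interface described above can be sketched in software. This is a minimal illustrative model, not the hardware block from Day 4: the class name `Fifo`, the `depth` parameter, and the method names are assumptions chosen to mirror the read/write/full/empty signals.

```python
from collections import deque


class Fifo:
    """Bounded first-in, first-out buffer with full/empty status flags."""

    def __init__(self, depth):
        self.depth = depth      # hypothetical capacity parameter
        self.items = deque()

    def full(self):
        """True when the FIFO can hold nothing else."""
        return len(self.items) >= self.depth

    def empty(self):
        """True when the FIFO has no data to provide."""
        return len(self.items) == 0

    def write(self, value):
        """Insert one item; the producer is expected to check full() first."""
        if self.full():
            raise RuntimeError("write to full FIFO")
        self.items.append(value)

    def read(self):
        """Remove and return the oldest item; check empty() first."""
        if self.empty():
            raise RuntimeError("read from empty FIFO")
        return self.items.popleft()


fifo = Fifo(depth=2)
fifo.write("a")
fifo.write("b")
print(fifo.full())   # True
print(fifo.read())   # a
print(fifo.read())   # b
print(fifo.empty())  # True
```

Data comes out in the order it went in, and the producer and consumer only interact through the status flags.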


#### **Process**

- Processes (threads) allow expression of independent control
- Convenient for things that advance independently
- Process (thread) is the easiest way to express some behaviors
  - Easier than trying to describe as a single process
- Can be used for performance optimization to improve resource utilization


#### Preclass 2

- Average time for TF, SG independently?
  - 1 cycle 99% of time, 100 cycles 1% of time
- Throughput TF->SG with no FIFO?
  - Hint: what must wait on TF miss? SG miss?
- Throughput with FIFO?
  - How is FIFO changing?
- What benefit from FIFO and processes?




#### Preclass 2

- Independent probability of miss
  - $P_f$, $P_g$
- Concretely
  - 1 cycle if in map
  - 100 cycles to run function and put in map
- If each runs independently (in isolation)
  - $T \approx 1 \times (1-P) + 100 \times P$
- If run together in lock step
  - Either can stall: $P = P_f + P_g - P_f P_g$
  - $T \approx 1 \times (1-P) + 100 \times P$
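As a worked instance of the two formulas above, the sketch below plugs in the preclass numbers (1 cycle on a hit, 100 cycles on a miss, 1% miss rate for each stage). The function and variable names are illustrative assumptions.

```python
def average_cycles(p_miss, hit_cycles=1, miss_cycles=100):
    """Expected cycles per item for a stage that misses with probability p_miss."""
    return hit_cycles * (1 - p_miss) + miss_cycles * p_miss


p_f = 0.01  # miss probability of the first stage (preclass value)
p_g = 0.01  # miss probability of the second stage (preclass value)

# Each stage running in isolation.
t_alone = average_cycles(p_f)

# Lock step: the pair stalls if either stage misses.
p_either = p_f + p_g - p_f * p_g
t_lockstep = average_cycles(p_either)

print(round(t_alone, 4))     # 1.99
print(round(t_lockstep, 4))  # 2.9701
```

Under these assumptions, coupling the two stages in lock step costs roughly one extra cycle per item compared with either stage alone.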


#### Multithread Web Page Load

- Typical browsers load images in separate threads
  - Allows parallelism in image loads
  - Doesn't block display of text content (or of images that have already downloaded)
    - Get to see that even if an image load is slow
  - Separate thread keeps track of separate location in each image load


### Model (from Day 4) **Communicating Threads**

- Computation is a collection of sequential/control-flow "threads"
- Threads may communicate
  - Through dataflow I/O
  - (Through shared variables)
- View as hybrid or generalization
- CSP (Communicating Sequential Processes): canonical model example


#### Today's Stand

- Communication: FIFO-like channels
- Synchronization: dataflow with FIFOs
- Determinism: how to achieve
  - ...until you must give it up.
    - Only hint at giving up at end of lecture, time permitting


## Operation/Operator

- Operation: logical computation to be performed
  - A process that communicates through dataflow inputs and outputs
- Operator: physical block that performs an Operation
  - E.g. processor, hardware block


#### **Dataflow Process Model**

[Figure: hierarchy of computational models; labels: Multithread/CSP, Dataflow, Sequential Control, Dynamic DF with Peek, Sequential Control with Allocation, Dynamic Streaming DF, Data-centric, Synchronous Dataflow (SDF), Finite, Single-Rate SDF]

#### Issues

- Communication – how move data between processes?
  - What latency does this add?
  - Throughput achievable?
- Synchronization – how define how processes advance relative to each other?
- **Determinism** – for the same inputs, do we get the same outputs?


#### Dataflow / Control Flow

#### **Dataflow**

- Program is a graph of operations
- Operation consumes tokens and produces tokens
- All operations run concurrently
  - All processes


#### Control flow (e.g. C)

- Program is a sequence of operations
- Operation reads inputs and writes outputs into common store
- One operation runs at a time
  - defines successor


#### **Streams**

- Captures communications structure
  - Explicit producer → consumer link up
- Abstract communications
  - Physical resources or implementation
  - Delay from source to sink
  - Delay of Operators
- Contrast
  - C: producer->consumer implicit through memory
  - Verilog/VHDL: cycles visible in implementation
- (can add on top of either C or Verilog)


#### Variable Delay Source to Sink

- How would placement of source and sink operator impact delay?
- How could sharing of interconnect between source and sink impact delay?

[Figure: producer P connected by dedicated wires to consumers C1 and C2]




#### On-Chip Delay

- Delay is proportional to distance travelled
- Make a wire twice the length
  - Takes twice the latency to traverse
  - (can pipeline)
- Modern chips
  - Run at 100s of MHz to GHz
  - Take 10s of ns to cross the chip
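To make the last two bullets concrete, crossing time can be converted into clock cycles. The sketch below is only arithmetic on the order-of-magnitude figures above; the specific values (10 ns, 200 MHz, 1 GHz) and the helper name `crossing_cycles` are illustrative assumptions, not measurements.

```python
def crossing_cycles(crossing_ns, clock_mhz):
    """Clock cycles (rounded up) needed to cover crossing_ns at clock_mhz.

    The clock period is 1000 / clock_mhz ns, so the cycle count is
    crossing_ns * clock_mhz / 1000, rounded up with integer arithmetic.
    """
    return -(-crossing_ns * clock_mhz // 1000)


print(crossing_cycles(10, 200))   # 2
print(crossing_cycles(10, 1000))  # 10
print(crossing_cycles(15, 300))   # 5
```

At GHz clock rates a cross-chip transfer spans many cycles, which is one reason a model that abstracts delay from source to sink is attractive.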




#### Dataflow Process Network

- Collection of Operations
- Connected by Streams
- Communicating with Data Tokens
- (CSP restricted to stream communication)
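A small software analogue of such a network, as a sketch only: each operation runs in its own thread and communicates solely through bounded queues that play the role of streams. The `DONE` end-of-stream marker, the function names, and the queue depth of 2 are assumptions made for this example.

```python
import queue
import threading

DONE = object()  # hypothetical end-of-stream marker, not part of the model


def source(out_stream, values):
    """Operation with no inputs: emits each value as a token."""
    for v in values:
        out_stream.put(v)          # blocks while the stream is full
    out_stream.put(DONE)


def scale(in_stream, out_stream, factor):
    """Operation that consumes one token and produces one token."""
    while True:
        token = in_stream.get()    # blocks until a token arrives
        if token is DONE:
            out_stream.put(DONE)
            return
        out_stream.put(token * factor)


def sink(in_stream, results):
    """Operation with no outputs: collects tokens in arrival order."""
    while True:
        token = in_stream.get()
        if token is DONE:
            return
        results.append(token)


def run_network(values, factor):
    s1 = queue.Queue(maxsize=2)    # stream: source -> scale
    s2 = queue.Queue(maxsize=2)    # stream: scale -> sink
    results = []
    threads = [
        threading.Thread(target=source, args=(s1, values)),
        threading.Thread(target=scale, args=(s1, s2, factor)),
        threading.Thread(target=sink, args=(s2, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_network([1, 2, 3, 4], factor=10))  # [10, 20, 30, 40]
```

Because every read blocks until data is present, the result does not depend on how the threads happen to be scheduled.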


#### **Dataflow Abstracts Timing**

- Doesn't say
  - On which cycle calculation occurs
- Does say
  - What order operations occur in
  - How data interacts
    - i.e. which inputs get mixed together
- Permits
  - Scheduling on different # and types of resources
  - Operators with variable delay
  - Variable delay in interconnect




**Dataflow Graphs** 

Parallel Performance Prospect

Part 2

# Synchronous Dataflow (SDF) with fixed operators

- Particular, restricted form of dataflow
- Each operation
  - Consumes a fixed number of input tokens
  - Produces a fixed number of output tokens
  - Operator performs fixed number of operations (in fixed time) – data independent
  - When full set of inputs is available
    - Can produce output
  - Can fire any (all) operations with inputs available at any point in time
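The firing rule above can be illustrated with a token-counting sketch for a linear chain of operations. This is a simplified model under stated assumptions: each operation has one input and one output stream, every firing takes one step, and the names `run_sdf`, `consume`, and `produce` are hypothetical.

```python
def run_sdf(input_tokens, consume, produce, steps):
    """Simulate a chain of SDF operations for a number of steps.

    consume[i] and produce[i] are the fixed token counts of operation i.
    Returns (history, tokens): the operations fired at each step and the
    final token count on each stream.
    """
    # tokens[i] counts tokens waiting on the input stream of operation i;
    # tokens[-1] is the output stream of the last operation.
    tokens = [input_tokens] + [0] * len(consume)
    history = []
    for _ in range(steps):
        # Decide firings from the state at the start of the step, so all
        # operations with a full set of inputs fire concurrently.
        fired = [i for i in range(len(consume)) if tokens[i] >= consume[i]]
        for i in fired:
            tokens[i] -= consume[i]
            tokens[i + 1] += produce[i]
        history.append(fired)
    return history, tokens


# Operation 0 consumes 1 and produces 1; operation 1 consumes 2 and produces 1.
history, tokens = run_sdf(input_tokens=4, consume=[1, 2], produce=[1, 1], steps=5)
print(history)  # [[0], [0], [0, 1], [0], [1]]
print(tokens)   # [0, 0, 2]
```

Since the token counts are fixed, the firing pattern is fully determined by the graph and can be worked out before any data is seen.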




**Processor Model** 

- Simple (for today's lecture)
  - Assume one primitive operation per cycle
- Could embellish
  - Different time per operation type
    - E.g. adds: 1 cycle, multiply: 3 cycles
  - Multiple memories with different timings


#### Time for Graph Iteration on Processors

- Single processor: $T_{one} = \sum_{i} Nops_{i}$
- One processor per Operation (process): $T_{each} = \max(Nops_{1}, Nops_{2}, Nops_{3}, \dots)$
- General (simplified resource bound model):

$$T_{map} = \max \left( \sum_{i} c(1, i) \times Nops_{i}, \sum_{i} c(2, i) \times Nops_{i}, \sum_{i} c(3, i) \times Nops_{i}, \dots \right)$$

$$c(x, y) = 1 \text{ if Processor } x \text{ runs task } y \text{, and } 0 \text{ otherwise}$$
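The three bounds can be evaluated directly. The sketch below assumes one primitive operation per cycle, as in the simple processor model; the operation counts and the `assignment` list (which stands in for the indicator $c(x, y)$) are hypothetical values chosen for illustration.

```python
def t_one(nops):
    """All operations on a single processor: sum of the operation counts."""
    return sum(nops)


def t_each(nops):
    """One processor per operation: the slowest operation sets the time."""
    return max(nops)


def t_map(nops, assignment, num_processors):
    """General mapping: assignment[i] is the processor that runs operation i,
    i.e. c(x, i) = 1 exactly when assignment[i] == x."""
    load = [0] * num_processors
    for op, proc in enumerate(assignment):
        load[proc] += nops[op]
    return max(load)


nops = [4000, 1000, 3000, 2000]  # hypothetical operation counts per graph iteration

print(t_one(nops))                   # 10000
print(t_each(nops))                  # 4000
print(t_map(nops, [0, 1, 1, 1], 2))  # 6000
print(t_map(nops, [0, 0, 1, 1], 2))  # 5000
```

With two processors, the balanced assignment reaches 5000 cycles, while a poor assignment leaves one processor as the bottleneck at 6000.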


#### Intel Knights Landing

[Figure: Knights Landing Overview]


#### GRVI/Phalanx

- Puts 1680 32b RISC-V integer cores
- On XCVU9P FPGA
- http://fpga.org/2017/01/12/grvi-phalanx-joins-the-kilocore-club/






#### Apple A16 Bionic

- ? 110+ mm<sup>2</sup>, 4nm
- 16 Billion Tr.
- iPhone 14
- 6 ARM cores
  - 2 fast (3.5GHz)
  - 4 low energy (2GHz)
- 5 custom GPUs (1.4GHz)
- 16 Neural Engines
  - 17 Trillion ops/s?




#### Heterogeneous Processor

- GPU can perform 10 primitive FFT ops per cycle
- Fast CPU can perform 2 ops/cycle
- Slow CPU: 1 op/cycle
- Map: FFT to GPU, Select to 2 Fast CPUs, Quantize and Entropy each to own Slow CPU
- Cycles/graph iteration?

[Figure: dataflow graph with Windowed FFT, Select, Quantize, and Entropy operations annotated with operation counts; legible annotations include "7,500 each" and "3,000"]








#### **Custom Accelerator**

- Dataflow Process doesn't need to be mapped to a processor
- Map FFT to custom datapath on FPGA
  - Read and produce one element per cycle
  - 1024 cycles to process 1024-point FFT

[Figure: dataflow graph with operations Windowed FFT, Select Freq., Quantize, and Entropy Encode; annotations 1024, 15,000, 3,000, 2,000]

#### **Operations**

- Can be implemented on different operators with different characteristics
  - Small or large processor
  - Hardware unit
  - Different levels of internal
    - Data-level parallelism
    - Instruction-level parallelism
    - Pipeline parallelism
- May itself be described as
  - Dataflow process network, sequential, hardware register transfer language


#### **Streams**


- Stream: logical communication link
- Some implementation options:
  - TCP/IP link over Internet
  - On-Chip bus
  - Buffer in memory
- Appropriate for
  - 2 processes on separate processors on same chip
  - 2 threads on same processor
  - One process at Penn, one at Amazon




#### Semantics (meaning)

- Need to implement semantics
  - i.e. get same result as if computed as indicated
- But can implement any way we want
  - That preserves the semantics
  - Exploit freedom of implementation


### Basic Approach


### Approach (1)

- Identify natural parallelism
- Convert to streaming flow
  - Initially leave operations in software
  - Focus on correctness
- Identify flow rates, computation per operator, parallelism needed
- Refine operations
  - Decompose further parallelism?
  - E.g. data parallel split, ILP implementations
  - Model potential hardware


# Approach (2)

- Refine coordination as necessary for implementation
- Map operations and streams to resources
  - Provision hardware
  - Scheduling: Map operations to operators
  - Memories, interconnect
- Profile and tune
- Refine


#### **Dataflow Variants**

#### Part 3:

(coverage here depends on time available)


#### Variable Delay

- Two different causes of "variable" delay
  - 1. Operator-dependent
  - 2. Data-dependent
- Operator-dependent
  - Depends on operator selected
    - Fast processor, slow processor, GPU
    - Fixed time once selected
- Data-Dependent
  - Depends on data being processed
    - Examples to come


## Data-Dependent Variable **Delay Operators**

- Why might a multiplier have data-dependent variable delay?
  - Hint: consider shift-and-add multiply
  - Multiply by 3 vs. multiply by 16,777,215
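The hint can be made concrete with a sketch of shift-and-add multiplication that counts its own work. It assumes non-negative integers and an implementation that stops once no multiplier bits remain; the function name and the returned counters are illustrative.

```python
def shift_add_multiply(a, b):
    """Multiply non-negative integers a * b by shift-and-add.

    Returns (product, iterations, adds) so the data dependence is visible.
    """
    product = 0
    iterations = 0
    adds = 0
    while b != 0:          # stop early once no multiplier bits remain
        if b & 1:          # add the shifted multiplicand for each 1 bit
            product += a
            adds += 1
        a <<= 1
        b >>= 1
        iterations += 1
    return product, iterations, adds


print(shift_add_multiply(5, 3))         # (15, 2, 2)
print(shift_add_multiply(5, 16))        # (80, 5, 1)
print(shift_add_multiply(5, 16777215))  # (83886075, 24, 24)
```

Multiplying by 3 finishes in 2 iterations, while multiplying by 16,777,215 (24 one bits) takes 24 iterations and 24 adds, so the delay depends on the operand value.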


#### **Data-Dependent Variable Delay Operators**

- Operators with Data-Dependent Variable Delay
  - Cached memory or computation
  - Shift-and-add multiply
  - Iterative divide or square-root


#### **Motivations and Demands** for Dataflow Options

Time Permitting


#### GCD (Preclass 3)

What is delay of GCD computation?

- while(a!=b)
  - t=max(a,b)-min(a,b)
  - a=min(a,b)
  - b=t
- return(a);
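A runnable version of the loop above, instrumented to count iterations, shows how the delay varies with the inputs. It assumes positive integers (the loop does not terminate if one input is 0); the function name and the iteration counter are additions for illustration.

```python
def gcd_with_count(a, b):
    """Subtraction-based GCD for positive integers.

    Returns (gcd, iterations) so the data-dependent delay is visible.
    """
    iterations = 0
    while a != b:
        t = max(a, b) - min(a, b)
        a = min(a, b)
        b = t
        iterations += 1
    return a, iterations


print(gcd_with_count(7, 7))     # (7, 0)
print(gcd_with_count(48, 18))   # (6, 4)
print(gcd_with_count(1000, 1))  # (1, 999)
```

The same operator takes anywhere from 0 to 999 iterations on these inputs, so no single fixed delay describes it.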


#### **Dynamic Rates?**

- Dynamic rates: use of inputs or production of outputs is data-dependent
  - if (good\_input(x)) out.write(x)
  - if (destination\_high(x)) high.write(x) else low.write(x)
- What is the implication of static rates
  - On compression?
  - On filtering?
    - (e.g. discard all spam packets)


# **Data-Dependent Rates?**

- Static Rates limiting
  - Compress/decompress
    - Lossless
    - Even Run-Length-Encoding
  - Filtering
    - Discard all packets from spamRus
  - Anything data dependent
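Run-length encoding makes the limitation easy to see: the number of output tokens per input token depends on the data, so no fixed production rate fits. The sketch below is a plain list-based encoder with an assumed (value, run length) output format, not a streaming implementation.

```python
def run_length_encode(tokens):
    """Return a list of (value, run_length) pairs for the input sequence.

    The number of output pairs depends on the data, not just its length.
    """
    out = []
    run_value = None
    run_length = 0
    for tok in tokens:
        if run_length > 0 and tok == run_value:
            run_length += 1                  # extend the current run
        else:
            if run_length > 0:
                out.append((run_value, run_length))
            run_value = tok                  # start a new run
            run_length = 1
    if run_length > 0:
        out.append((run_value, run_length))  # flush the final run
    return out


print(run_length_encode("aaaaaaaa"))       # [('a', 8)]
print(len(run_length_encode("abababab")))  # 8
```

Both inputs have 8 tokens, yet one produces 1 output pair and the other produces 8.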


# Primitives

Non-Blocking Stream

- Blocking
  - only primitives are read, write
  - If data not present, block for data to be available
- Non-blocking
  - Add operations to ask if data is available (if stream ready for write)

if (not(empty(in1))) next\_pkt=in1.read()
else if (not(empty(in2))) next\_pkt=in2.read()


#### When non-blocking necessary?

- What are cases where we need the ability to ask if a data item is present?
- Consider a server with multiple clients
- Client requests are independent, random
  - No guarantee they make the same number or rate of requests
  - What happens if we must wait for a request from each of the clients?
- What would we prefer to do?


### When non-blocking necessary?

- Consider an IP packet router
- Why do we need non-blocking here?


#### Non-Blocking

- Removed model restriction
  - Can ask if token present
- Gained expressive power
  - Can grab data as it shows up
- Weakened our guarantees
  - Possible to get non-deterministic behavior
    - Depends on timing
      - Which we've said may vary with mapping
- Use when necessary, avoid if possible
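The timing dependence can be demonstrated with a small simulation. This is a sketch under simplifying assumptions: time advances in discrete steps, one token is consumed per step, and tokens are given as hypothetical (arrival time, token) pairs. The non-blocking merge polls in1 and then in2; the blocking merge reads in a fixed alternating order and assumes both streams carry the same number of tokens.

```python
from collections import deque


def nonblocking_merge(arrivals1, arrivals2):
    """Merge by polling: take from in1 if non-empty, else from in2.

    arrivalsN is a list of (time, token) pairs sorted by time.
    """
    pending1, pending2 = deque(arrivals1), deque(arrivals2)
    in1, in2 = deque(), deque()
    out = []
    total = len(pending1) + len(pending2)
    time = 0
    while len(out) < total:
        # Deliver tokens whose arrival time has been reached.
        while pending1 and pending1[0][0] <= time:
            in1.append(pending1.popleft()[1])
        while pending2 and pending2[0][0] <= time:
            in2.append(pending2.popleft()[1])
        if in1:                      # not(empty(in1))
            out.append(in1.popleft())
        elif in2:                    # not(empty(in2))
            out.append(in2.popleft())
        time += 1
    return out


def blocking_alternate_merge(arrivals1, arrivals2):
    """Blocking reads in fixed order in1, in2, in1, ...: the output depends
    only on token order within each stream, never on arrival times."""
    out = []
    for (_, a), (_, b) in zip(arrivals1, arrivals2):
        out.extend([a, b])
    return out


b_stream = [(0, "b0"), (1, "b1")]
early = [(0, "a0"), (1, "a1")]   # same tokens, arriving early
late = [(2, "a0"), (3, "a1")]    # same tokens, arriving late

print(nonblocking_merge(early, b_stream))        # ['a0', 'a1', 'b0', 'b1']
print(nonblocking_merge(late, b_stream))         # ['b0', 'b1', 'a0', 'a1']
print(blocking_alternate_merge(early, b_stream)) # ['a0', 'b0', 'a1', 'b1']
print(blocking_alternate_merge(late, b_stream))  # ['a0', 'b0', 'a1', 'b1']
```

The same input data yields different non-blocking outputs when only the arrival times change, while the blocking version gives one answer regardless of timing.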


### **Turing Complete**

- Can implement any computation describable with a Turing Machine
  - (theoretical model of computing by Alan Turing)
- Turing Machine captures our notion of what is computable
  - If it cannot be computed by a Turing Machine, we don't know how to compute it


#### **Process Network Roundup**

| Model | Good for correctness | Real-Time | Completeness (compute anything) |
|---|---|---|---|
| SDF + fixed-delay operators | Y | Y | N |
| SDF + variable (data-dependent) delay operators | Y | N | N |
| Dynamic Rate DF, blocking | Y | N | Y |
| Dynamic Rate DF, non-blocking | N | N | Y |

# Big Ideas

- Capture gross parallel structure with Process Network
- Use dataflow synchronization for determinism
  - Abstract out timing of implementations
  - Give freedom of implementation
- Exploit freedom to refine mapping to optimize performance
- Minimally use non-determinism as necessary


#### Admin

- Remember feedback
  - Today's lecture and HW2
- Reading for Day 6 on web
- HW3 due Friday
  - Implementing multiprocessor solutions on homogeneous (ARM) processor cores
