## ESE532: System-on-a-Chip Architecture

Day 12: October 14, 2020 Data Movement (Interconnect, DMA)

Penn ESE532 Fall 2020 -- DeHon











#### Previously

- Want data in small memories
  - Low latency, high bandwidth
- FPGA has many memories all over fabric
- Want C arrays in small memories
  - Partitioned so can perform enough reads (writes) in a cycle to avoid memory bottleneck


#### Today

- Interconnect Infrastructure
- Peripherals (Part 2)
- Data Movement Threads (Part 3)
- DMA -- Direct Memory Access (Part 4)


#### Message

- Need to move data
- Shared interconnect to make physical connections
- Useful to move data as separate thread of control
  - Dedicating a processor is inefficient
  - Useful to have dedicated data-movement hardware: Direct Memory Access (DMA)


#### Memory and I/O Organization

- Architecture contains
  - Large memories
    - For density, necessary sharing
  - Small memories local to compute
    - For high bandwidth, low latency, low energy
  - Peripherals for I/O
- Need to move data
  - Among memories and I/O
    - Large to small and back
    - Among small
    - From inputs, to outputs


#### Term: Peripheral

- "On the edge (or periphery) of something"
- Peripheral device: a device used to put information onto or get information off of a computer
  - E.g.
    - Keyboard, mouse, modem, USB flash drive, ...





#### How move data?

- Abstractly, using stream links
- Connect stream between producer and consumer
- Ideally: dedicated wires
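
A stream link can be pictured as a bounded FIFO with a write side for the producer and a read side for the consumer. The sketch below is a minimal software model under that assumption; the `Stream` class and its `has_space`/`has_data` methods are hypothetical names for illustration, not a Vitis or HLS library type.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical bounded FIFO modeling one stream link (illustration only).
template <typename T>
class Stream {
 public:
  explicit Stream(std::size_t capacity) : capacity_(capacity) {
    assert(capacity > 0);
  }

  bool has_space() const { return fifo_.size() < capacity_; }
  bool has_data() const { return !fifo_.empty(); }

  // Producer side: the caller checks has_space() before writing.
  void write(const T& value) {
    assert(has_space());
    fifo_.push_back(value);
  }

  // Consumer side: the caller checks has_data() before reading.
  T read() {
    assert(has_data());
    T value = fifo_.front();
    fifo_.pop_front();
    return value;
  }

 private:
  std::size_t capacity_;
  std::deque<T> fifo_;
};
```

The producer and consumer only see the two ends of the link, so the same code works whether the link is implemented with dedicated wires or with shared interconnect.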


#### **Dedicated Wires?**

 What might prevent us from having dedicated wires between all communicating units?


#### **Making Connections**

- Cannot always be dedicated wires
  - Connections must be programmable
  - Wires take up area
  - Don't always have enough traffic to consume the bandwidth of a point-to-point wire
  - May need to serialize use of a resource
    - E.g., one memory read per cycle
  - Source or destination may be sequentialized on hardware






#### Alternate: Crossbar

- Provide programmable connection between all sources and destinations
- Any destination can be connected to any single source
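
One way to picture the crossbar is as a multiplexer per destination, with a configuration word that names the single source each destination listens to. The sketch below models that behavior in software; the function name `crossbar` and the `select` vector are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical crossbar model: select[d] names the single source that
// drives destination d. Several destinations may pick the same source.
std::vector<int> crossbar(const std::vector<int>& sources,
                          const std::vector<std::size_t>& select) {
  std::vector<int> dests(select.size());
  for (std::size_t d = 0; d < select.size(); ++d) {
    assert(select[d] < sources.size());  // each destination picks a valid source
    dests[d] = sources[select[d]];       // one multiplexer per destination
  }
  return dests;
}
```

Because every destination needs a multiplexer over every source, the hardware grows with the product of the source and destination counts, which is why the programmability is not free.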











































#### Masters and Slaves

- Two kinds of entities on interconnect
- Masters can initiate requests
  - E.g., **processor** that can perform a read or write
- Slaves can only respond to requests
  - E.g., **memory** that can return the read data from a read request






#### **Timing Demands**

- Must read each input before it is overwritten
- Must write each output within real-time window
- Must guarantee processor is scheduled to service each I/O at the appropriate frequency
- How many cycles between 32b input words for a 1Gb/s network and a 32b, 1GHz processor?
  - Consider input data shifted into register 1b per ns
  - Must read out 32b register before it is overwritten
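
The question above can be worked as a unit conversion: the time to receive one word is the word size divided by the link rate, and multiplying by the clock frequency turns that time into cycles. The helper below is a small sketch of that arithmetic; the function name and parameters are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Cycles available between consecutive input words.
// word_bits / link_bps is seconds per word; times clock_hz gives cycles per word.
std::uint64_t cycles_between_words(std::uint64_t word_bits,
                                   std::uint64_t link_bps,
                                   std::uint64_t clock_hz) {
  assert(link_bps > 0);
  return word_bits * clock_hz / link_bps;  // integer division rounds down
}
```

With 32-bit words, a 1 Gb/s link, and a 1 GHz clock, this gives 32 cycles: the register fills at 1 bit per ns, so the processor must read it at least once every 32 cycles.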




### Long Latency Memory Operations


#### Day 3

- Large memories are slow
  - Latency increases with memory size
- Distant memories are high latency
  - Multiple clock-cycles to cross chip
  - Off-chip memories even higher latency


#### Day 3, Preclass 2

- 10-cycle latency to memory
- If must wait for data return, latency can degrade throughput
- 10-cycle latency + 10-cycle compute + (assorted overhead)
  - More than 20 cycles / result
```
for (i = 0; i < MAX; i++) {
  in = a[i];    // memory read: 10-cycle latency
  out = f(in);  // 10-cycle compute
  b[i] = out;   // memory write
}
```

#### Preclass 3

 Throughput using 3 threads on 3 processors: P1, P2, P3?

```
P1: for(i=0;i<MAX;i++) Astream.write(a[i]);
P2: while(1) {Astream.read(aval); Bstream.write(f(aval));}
P3: for(i=0;i<MAX;i++) Bstream.read(b[i]);
```
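
A rough way to compare the single loop with the three-thread version is to treat each step as a stage with a cycle cost. When one thread does everything, the costs add; when each stage has its own thread and streams decouple them, the slowest stage sets the steady-state rate. The sketch below encodes that reasoning. The per-stage costs passed in are assumptions for illustration (for example, the cost of the write stage is not given in the slides).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Cycles per result when one thread performs every stage back to back.
int sequential_cycles_per_result(const std::vector<int>& stage_cycles) {
  int total = 0;
  for (int c : stage_cycles) total += c;
  return total;
}

// Steady-state cycles per result when each stage runs in its own thread
// and streams decouple the stages: the slowest stage sets the rate.
int pipelined_cycles_per_result(const std::vector<int>& stage_cycles) {
  assert(!stage_cycles.empty());
  return *std::max_element(stage_cycles.begin(), stage_cycles.end());
}
```

For stage costs of 10 (read), 10 (compute), and an assumed 1 (write), the sequential model gives 21 cycles per result, while the pipelined model gives 10.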


#### Fetch (Write) Threads

- Potentially useful to move data in separate thread
- Especially when
  - Long (potentially variable) latency to data source (memory)
- Useful to split request/response


#### Part 4: DMA

**Direct Memory Access** 


#### Preclass 4a

`P1: for(i=0;i<MAX;i++) Astream.write(a[i]);`

[Figure: hardware block with signals `WriteAstart`, `NewAddr[24]`, `WriteAstop`, `FIFO_Has_Space`, `FIFO_DataIn[32]`]

`int *p;`

`P1: for(p=&(a[0]);p<&(a[MAX]);p++) Astream.write(*p);`

#### Preclass 4

- How much hardware?
  - Counter bits?
  - Registers?
  - Comparators?
  - Control logic gates? (4cd)
- Compare to MicroBlaze
  - Small RISC processor optimized for Xilinx
  - Minimum config 630 6-LUTs
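
To get a feel for how little state the pointer loop in Preclass 4a needs, the sketch below models it as two 24-bit registers, one equality comparator, and one incrementer. This is one plausible reading of the signals, not the course's reference answer: the `AddrGen` struct, the word-addressed memory, and the way `FIFO_Has_Space` gates each step are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kAddrMask = 0xFFFFFF;  // 24-bit addresses

// Hypothetical model of the fetch hardware: current-address register,
// stop-address register, equality comparator, and incrementer.
struct AddrGen {
  std::uint32_t addr = 0;  // 24-bit counter register
  std::uint32_t stop = 0;  // 24-bit stop register

  void load(std::uint32_t start, std::uint32_t stop_addr) {
    assert(start <= stop_addr && stop_addr <= kAddrMask);
    addr = start;
    stop = stop_addr;
  }

  bool done() const { return addr == stop; }  // the comparator

  // One clock cycle: if not done and the FIFO has space, read the word
  // at addr, push it into the FIFO, and increment the address.
  // Returns true if a word moved this cycle.
  bool step(const std::vector<std::uint32_t>& mem, bool fifo_has_space,
            std::vector<std::uint32_t>& fifo) {
    if (done() || !fifo_has_space) return false;
    assert(addr < mem.size());
    fifo.push_back(mem[addr]);
    addr = (addr + 1) & kAddrMask;
    return true;
  }
};
```

Under these assumptions the datapath is roughly 48 register bits plus a 24-bit comparator and incrementer, which is small next to the 630 6-LUTs of a minimum MicroBlaze.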


#### Observe

- Modest hardware can serve as data movement thread
  - Much less hardware than a processor
  - Offload work from processors
- Small hardware allows peripherals to be master devices on the interconnect





#### **DMA Engine**

- Data Movement Thread
  - Specialized Processor that moves data
- Act independently
- Implement data movement
- Can build to move data between memories (Slave devices)
- E.g., Implement P1, P3 in Preclass 3




#### Programmable DMA Engine

- Where to copy from?
- How much?
- Where to copy to?
- Stride?
- What size data?
- Loop?
- Transfer rate?
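
Several of these questions map naturally onto fields of a transfer descriptor that software fills in and the engine then executes. The sketch below is a software model of that idea, covering source, destination, count, and stride only; data size, looping, and rate control are left out. The `DmaDescriptor` fields and `dma_run` are hypothetical names, not the interface of any particular DMA controller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor; field names are illustrative, not a real DMA API.
struct DmaDescriptor {
  std::size_t src;         // where to copy from (word index)
  std::size_t dst;         // where to copy to (word index)
  std::size_t count;       // how many elements
  std::size_t src_stride;  // step between source elements
  std::size_t dst_stride;  // step between destination elements
};

// Software model of the transfer an engine would perform within one memory.
void dma_run(const DmaDescriptor& d, std::vector<std::uint32_t>& mem) {
  for (std::size_t i = 0; i < d.count; ++i) {
    std::size_t s = d.src + i * d.src_stride;
    std::size_t t = d.dst + i * d.dst_stride;
    assert(s < mem.size() && t < mem.size());
    mem[t] = mem[s];
  }
}
```

A source stride of 2 with a destination stride of 1, for example, gathers every other word into a dense block, which is the kind of access pattern a strided DMA saves the processor from issuing itself.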


#### Multithreaded DMA Engine

- One copy task does not necessarily saturate the bandwidth of the DMA engine
- Share engine performing many transfers (channels)
- Separate transfer state for each
  - Hence thread (or channel)
- Swap among threads
  - Simplest: round-robin
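
Round-robin sharing can be sketched as a loop that visits each channel in turn and lets it move one word if it still has work. The model below keeps only a remaining-word count per channel; the `Channel` struct and `round_robin` function are assumptions for illustration, and a real engine would also hold addresses and strides in each channel's state.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-channel transfer state (only the remaining count here).
struct Channel {
  std::size_t remaining;  // words left to move
};

// Returns the order in which channels each move one word, visiting
// channels round-robin and skipping channels with nothing left.
std::vector<std::size_t> round_robin(std::vector<Channel> chans) {
  std::size_t total = 0;
  for (const Channel& c : chans) total += c.remaining;

  std::vector<std::size_t> order;
  bool moved = true;
  while (moved) {
    moved = false;
    for (std::size_t c = 0; c < chans.size(); ++c) {
      if (chans[c].remaining > 0) {
        --chans[c].remaining;
        order.push_back(c);
        moved = true;
      }
    }
  }
  assert(order.size() == total);  // every word was moved exactly once
  return order;
}
```

With three channels holding 2, 1, and 3 words, the service order is 0, 1, 2, 0, 2, 2: short transfers finish early and the remaining channels absorb the freed slots.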







#### AXI

- Advanced eXtensible Interface
  - Originally developed by ARM
  - On-chip communication bus standard
  - Particular communication protocol
- Full AXI
  - Read/write operations with bursts
    - Burst = single address + length
  - Separate send/receive data channels
- AXI-S for streaming connections
- AXI-Lite: simpler, no bursts
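
The idea that a burst is a single address plus a length can be illustrated by expanding one request into the address of each data beat. The sketch below assumes a simple incrementing burst with a fixed number of bytes per beat; it is a conceptual model, not a signal-level description of AXI, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified incrementing-burst model: one request (address + length)
// expands to one address per data beat. Not a signal-level AXI model.
std::vector<std::uint64_t> burst_beat_addresses(std::uint64_t base,
                                                unsigned beats,
                                                unsigned bytes_per_beat) {
  assert(beats > 0 && bytes_per_beat > 0);
  std::vector<std::uint64_t> addrs;
  for (unsigned i = 0; i < beats; ++i) {
    addrs.push_back(base + static_cast<std::uint64_t>(i) * bytes_per_beat);
  }
  return addrs;
}
```

The master sends only `base` and the length; the per-beat addresses are implied, so the address channel is used once while the data channel carries every beat.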







#### DMA in Vitis

- Vitis/OpenCL demands that we write code to perform DMA of data to and from accelerators in FPGA fabric
- We will see specifics on Monday
- Have some options to control
  - With pragmas
  - With choice of data and burst sizes
  - Explore in HW6


#### Big Ideas

- Need to move data
- Shared interconnect to make physical connections; can tune area/bandwidth/locality
- Useful to
  - move data as separate thread of control
  - Have dedicated data-movement hardware: DMA


#### Admin

- Feedback
- Hardware distribution
  - Thursday (for those who did not pick up yesterday)
- HW5
  - Due Friday, long build
- HW6
  - Out soon
  - Assign so at least one partner has Ultra96
