# ESE532: System-on-a-Chip Architecture Day 11: October 7, 2019 Data Movement (Interconnect, DMA)







# Previously

- Want data in small memories
  - Low latency, high bandwidth
- FPGA has many memories all over the fabric
- Want C arrays in small memories
  - Partitioned so that enough reads (writes) can be performed in a cycle to avoid a memory bottleneck

Penn ESE532 Fall 2019 -- DeHon

# Today

- Interconnect Infrastructure
- Data Movement Threads
- Peripherals
- DMA -- Direct Memory Access


# Message

- Need to move data
- Shared interconnect to make physical connections
- Useful to move data as separate thread of control
  - Dedicating a processor is inefficient
  - Useful to have dedicated data-movement hardware: Direct Memory Access (DMA)


# Memory and I/O Organization

- Architecture contains
  - Large memories
    - For density, necessary sharing
  - Small memories local to compute
    - For high bandwidth, low latency, low energy
  - Peripherals for I/O
- Need to move data
  - Among memories and I/O
    - Large to small and back
    - Among small
  - From Inputs, To Outputs

# Term: Peripheral

- "On the edge (or periphery) of something"
- Peripheral device: a device used to put information onto or get information off of a computer
  - E.g. keyboard, mouse, modem, USB flash drive, ...




#### How move data?

- Abstractly, using stream links
- Connect stream between producer and consumer
- Ideally: dedicated wires


#### **Dedicated Wires?**

 What might prevent us from having dedicated wires between all communicating units?


# **Making Connections**

- Cannot always be dedicated wires
  - Programmable
  - Wires take up area
  - Don't always have enough traffic to consume the bandwidth of point-to-point wire
  - May need to serialize use of resource
    - E.g. one memory read per cycle
  - Source or destination may be sequentialized on hardware

# Model

- Programmable, possibly shared interconnect

[Figure: units labeled P attached to the interconnect]



# Alternate: Crossbar

- Provide programmable connection between all sources and destinations
- Any destination can be connected to any single source











#### General Interconnect

- Generally, want to be able to parameterize designs
- Here: tune area-bandwidth
  - Control how much bandwidth to provide
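
To make the area-bandwidth knob concrete, here is a rough counting sketch. The cost model is an assumption, not something the slides specify: a crossbar pays one switch point per (source, destination) pair, and a bus pays one tap per attached unit and carries one transfer per cycle. The names `InterconnectCost`, `crossbar`, and `multi_bus` are hypothetical, and using several buses is only one possible way to land between the two extremes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative cost summary; both fields are proxies, not measured values.
struct InterconnectCost {
    std::size_t switch_points;        // rough stand-in for area
    std::size_t transfers_per_cycle;  // rough stand-in for peak bandwidth
};

// Full crossbar: every destination can select any single source,
// so there is one switch point per (source, destination) pair.
InterconnectCost crossbar(std::size_t sources, std::size_t destinations) {
    return {sources * destinations, std::min(sources, destinations)};
}

// `buses` shared buses (assumed model): every unit taps every bus, and each
// bus carries at most one transfer per cycle. buses == 1 is a single bus.
InterconnectCost multi_bus(std::size_t sources, std::size_t destinations,
                           std::size_t buses) {
    assert(buses >= 1);
    const std::size_t peak = std::min(buses, std::min(sources, destinations));
    return {buses * (sources + destinations), peak};
}
```

Under this model, 8 sources and 8 destinations give 64 switch points and up to 8 transfers per cycle for the crossbar, versus 16 switch points and 1 transfer per cycle for a single bus. Raising `buses` trades area for bandwidth between those points.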


#### Interconnect

- How might we get design points between bus and crossbar?
- How could we reduce the number of
  - Inputs to crossbar?
  - Outputs from crossbar?




















# Day 3

- Large memories are slow
  - Latency increases with memory size
- Distant memories are high latency
  - Multiple clock-cycles to cross chip
  - Off-chip memories even higher latency


# Day 3, Preclass 2

- 10 cycle latency to memory
- If must wait for data return, latency can degrade throughput
- 10-cycle latency + 10-cycle op + (assorted overhead)
  - More than 20 cycles / result

```
for (i = 0; i < MAX; i++) {
   in = a[i];    // memory read (10 cycle latency)
   out = f(in);  // 10 cycle compute
   b[i] = out;   // memory write
}
```
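
The cycle count on this slide can be written as a small model. This is a sketch under stated assumptions: the function names are hypothetical, `overhead` stands in for the slide's "(assorted)" cycles, and the second function assumes the memory read is fully overlapped with compute, so the slower activity sets the steady-state rate.

```cpp
#include <algorithm>

// Each iteration waits for its read, then computes: costs add up.
constexpr int serial_cycles_per_result(int mem_latency, int compute, int overhead) {
    return mem_latency + compute + overhead;
}

// Assumed ideal overlap: a new read is accepted every `read_interval` cycles
// while compute runs, so the slower of the two bounds cycles per result.
inline int overlapped_cycles_per_result(int read_interval, int compute) {
    return std::max(read_interval, compute);
}
```

With the slide's numbers and an assumed 2 cycles of overhead, the serial loop costs 22 cycles per result, while the overlapped case is bounded by the 10-cycle compute.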

#### Preclass 3

 Throughput using 3 threads on 3 processors: P1, P2, P3?

```
P1: for(i=0;i<MAX;i++) Astream.write(a[i]);
P2: while(1) {Astream.read(aval); Bstream.write(f(aval));}
P3: for(i=0;i<MAX;i++) Bstream.read(b[i]);
```
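
The three loops above can be tried in software with ordinary threads. This is a minimal sketch, not the course's stream library: `Stream` is a hypothetical unbounded blocking FIFO standing in for a hardware stream link, `f` is a placeholder computation, and P2 is bounded to `n` items (the slide uses `while(1)`) so that the threads can be joined.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical blocking FIFO standing in for a stream link.
template <typename T>
class Stream {
public:
    void write(const T& v) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(v);
        }
        cv_.notify_one();
    }
    void read(T& v) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });  // block until data arrives
        v = q_.front();
        q_.pop();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

// Placeholder for the slide's 10-cycle compute.
inline int f(int x) { return 2 * x + 1; }

// Runs P1 (fetch), P2 (compute), and P3 (write back) as three threads.
std::vector<int> run_pipeline(const std::vector<int>& a) {
    const std::size_t n = a.size();
    std::vector<int> b(n);
    Stream<int> Astream, Bstream;

    std::thread p1([&] { for (std::size_t i = 0; i < n; i++) Astream.write(a[i]); });
    std::thread p2([&] {
        for (std::size_t i = 0; i < n; i++) {  // bounded, unlike while(1) on the slide
            int aval;
            Astream.read(aval);
            Bstream.write(f(aval));
        }
    });
    std::thread p3([&] { for (std::size_t i = 0; i < n; i++) Bstream.read(b[i]); });

    p1.join();
    p2.join();
    p3.join();
    return b;
}
```

Because each stream has a single writer and a single reader, results arrive in order, so `b[i]` ends up equal to `f(a[i])`.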


### Fetch (Write) Threads

- Potentially useful to move data in separate thread
- Especially when
  - Long (potentially variable) latency to data source (memory)
- Useful to split request/response


# Peripherals






#### Masters and Slaves

- Two kinds of entities on interconnect
- Master can initiate requests
  - E.g. **processor** that can perform a read or write
- Slaves can only respond to requests
  - E.g. memory that can return the read data from a read request


#### Simple Peripheral Model

- Peripherals are slave devices
  - Masters can read input data
  - Masters can write output data
- To move data, master (e.g. processor) initiates
- Demanding that the processor touch every data item has some negative consequences

[Figure: peripherals shown include USB, Ethernet, A/D, and HDMI]
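
A processor-initiated transfer in this model can be sketched as a polling loop. Everything here is a hypothetical software stand-in: `InputPeripheral` mimics a slave device with a status register and a data register, and `copy_in` plays the processor (master), which issues every read itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical input peripheral (slave): it only responds to reads.
struct InputPeripheral {
    std::vector<std::uint32_t> pending;  // stand-in for words arriving from outside
    std::size_t next = 0;

    bool data_ready() const { return next < pending.size(); }  // status register read
    std::uint32_t read_data() { return pending[next++]; }      // data register read
};

// Programmed I/O: the processor touches every word that moves.
// Returns the number of words copied into buf.
std::size_t copy_in(InputPeripheral& dev, std::uint32_t* buf, std::size_t max_words) {
    std::size_t n = 0;
    while (n < max_words && dev.data_ready()) {
        buf[n++] = dev.read_data();  // one processor-issued transfer per word
    }
    return n;
}
```

Every word costs the processor at least one status read and one data read, which is the overhead that dedicated data-movement hardware is meant to remove.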

#### **Timing Demands**

- Must read each input before it is overwritten
- Must write each output within real-time window
- Must guarantee processor is scheduled to service each I/O at the appropriate frequency
- How many cycles between 32b input words for a 1 Gb/s network and a 32b, 1 GHz processor?
  - Consider input data shifted into a register 1b per ns
  - Must read out the 32b register before it is overwritten
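
The question above is a unit conversion, sketched here as a helper. The function name is hypothetical. Time per word is `word_bits / link_bits_per_sec`, and multiplying by the clock rate gives processor cycles per word.

```cpp
#include <cassert>
#include <cstdint>

// Processor cycles between consecutive input words (rounded down).
inline std::uint64_t cycles_between_words(std::uint64_t word_bits,
                                          std::uint64_t link_bits_per_sec,
                                          std::uint64_t clock_hz) {
    assert(link_bits_per_sec > 0);
    // (bits per word) / (bits per second) * (cycles per second)
    return (word_bits * clock_hz) / link_bits_per_sec;
}
```

For 32b words on a 1 Gb/s link with a 1 GHz clock, a new word is complete every 32 ns, so the processor has 32 cycles to read each word before it is overwritten.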










# Observe

- Modest hardware can serve as data movement thread
  - Much less hardware than a processor
  - Offload work from processors
- Small hardware allows peripherals to be master devices on interconnect






# **DMA Engine**

- Data Movement Thread
  - Specialized processor that moves data
- Acts independently
- Implements data movement
- Can build to move data between memories (slave devices)
- E.g., implement P1, P3 in Preclass 3




# Programmable DMA Engine

- What to copy from?
- How much?
- Where to copy to?
- Stride?
- What size data?
- Loop?
- Transfer rate?
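
One way to picture these parameters is as a transfer descriptor that software fills in and the engine executes. The sketch below is hypothetical and does not follow any particular DMA controller's register layout. `DmaDescriptor` has a field for most of the questions above, and `dma_run` is a software model of the copy. Looping and transfer-rate control are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical descriptor: roughly one field per question on the slide.
struct DmaDescriptor {
    const std::uint8_t* src;  // what to copy from
    std::uint8_t* dst;        // where to copy to
    std::size_t count;        // how much (number of elements)
    std::size_t elem_bytes;   // what size data
    std::size_t src_stride;   // bytes between successive source elements
    std::size_t dst_stride;   // bytes between successive destination elements
};

// Software model of the transfer; real hardware would issue
// reads and writes on the interconnect instead of calling memcpy.
inline void dma_run(const DmaDescriptor& d) {
    assert(d.src != nullptr && d.dst != nullptr);
    for (std::size_t i = 0; i < d.count; i++) {
        std::memcpy(d.dst + i * d.dst_stride, d.src + i * d.src_stride, d.elem_bytes);
    }
}
```

Setting `src_stride` to twice the element size and `dst_stride` to the element size gathers every other element into a contiguous buffer, for example.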


# Multithreaded DMA Engine

- One copy task does not necessarily saturate the bandwidth of the DMA engine
- Share engine among many transfers (channels)
- Separate transfer state for each
  - Hence thread (or channel)
- Swap among threads
  - Simplest: round-robin
    - 1, 2, 3, ..., K, 1, 2, 3, ..., K, 1, ...
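
Round-robin service across channels can be sketched as follows. The model is an assumption for illustration: each channel's transfer state is reduced to a count of remaining words, the engine moves one word per turn, and finished channels are skipped. Channels are indexed from 0 here rather than from 1 as on the slide.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-channel transfer state: words still to move.
struct Channel {
    std::size_t remaining;
};

// Serve channels round-robin, one word per turn, until all are done.
// Returns the order in which channel indices were served.
std::vector<std::size_t> round_robin(std::vector<Channel> channels) {
    std::vector<std::size_t> order;
    bool any_active = true;
    while (any_active) {
        any_active = false;
        for (std::size_t c = 0; c < channels.size(); c++) {
            if (channels[c].remaining == 0) continue;  // channel has finished
            channels[c].remaining--;                   // engine moves one word
            order.push_back(c);
            any_active = true;
        }
    }
    return order;
}
```

With three channels holding 2, 1, and 3 words, the service order is 0, 1, 2, 0, 2, 2. The engine stays busy as long as any channel has work left.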



### Hardwired and Programmable

- Zynq has a hardwired DMA engine
  - 8 channels
- Can also add data movement engines (Data Movers) in FPGA fabric





# Example

- Networking application
- Header on processor
- Payload (encrypt, checksum) on FPGA
- DMA from ethernet → main memory
- DMA main memory → BRAM
- Stream between payload components
- DMA from chksum to ethernet out

[Figure labels: ethernet (in), header, chksum, ethernet (out)]

#### Automation

- SDSoC will automatically take care of DMA of memory to and from accelerators in FPGA fabric
  - Inserting logic, programming DMA
- Mostly need to be aware that it is happening
  - Understand impacts on performance
- Have some options to control
  - With pragmas
  - With choice of data sizes
  - Explore in HW6


# Big Ideas

- Need to move data
- Shared interconnect to make physical connections
  - Can tune area/bandwidth/locality
- Useful to
  - Move data as separate thread of control
  - Have dedicated data-movement hardware: DMA


#### **Admin**

- Midterm Wednesday, here, during class time
  - Review at Tuesday evening office hours
- Day 13 (Monday) reading
  - Chapter 9 of Parallel Programming for FPGAs (available on web)
  - DRAM reading if not read on Day 3
- HW6 out soon??
  - We learned quite a bit putting it together
  - Hopefully, the assignment passes that on to you
