









### Previously

- Want data in small memories
  - Low latency, high bandwidth
- FPGA has many memories all over fabric
- Want C arrays in small memories
  - Partitioned so can perform enough reads (writes) in a cycle to avoid memory bottleneck

### Today

- Interconnect Infrastructure
- Data Movement Threads
- Peripherals
- DMA

Penn ESE532 Fall 2017 -- DeHon


### Message

- Need to move data
- Shared interconnect to make physical connections
- Useful to move data as separate thread of control
  - Dedicating a processor is inefficient
  - Useful to have dedicated data-movement hardware: DMA


### Memory and I/O Organization

- Architecture contains
  - Large memories
    - For density, necessary sharing
  - Small memories local to compute
    - For high bandwidth, low latency, low energy
  - Peripherals for I/O
- Need to move data
  - Among memories and I/O
    - Large to small and back
    - Among small
    - From inputs, to outputs


### How move data?

- Abstractly, using stream links.
- Connect stream between producer and consumer.
- Ideally: dedicated wires


### Dedicated Wires?

- Why might we not be able to have dedicated wires?


### Making Connections

- Cannot always be dedicated wires
  - Programmable
  - Wires take up area
  - Don't always have enough traffic to consume the bandwidth of point-to-point wire
  - May need to serialize use of resource
    - E.g. one memory read per cycle
  - Source or destination may be sequentialized on hardware














### General Interconnect

- Generally, want to be able to parameterize designs
- Here: tune area-bandwidth
  - Control how much bandwidth to provide


### Interconnect

- How might we get design points between bus and crossbar?


### Multiple Busses

- Think of crossbar as one bus per output
- Simple bus is one bus total
- In between,
  - How many simultaneous busses to support?
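One rough way to see the design points between a single bus and a full crossbar is to count crosspoints. The sketch below assumes a simple model that is not taken from the slides: outputs are partitioned across the shared busses, every input can drive every bus, area is proportional to the crosspoint count, and each bus carries at most one transfer per cycle. The names `InterconnectCost` and `sharedOutputInterconnect` are hypothetical.

```cpp
#include <algorithm>

// Rough cost of an interconnect under the assumptions stated above.
struct InterconnectCost {
    long crosspoints;        // area proxy
    long transfersPerCycle;  // bandwidth proxy
};

// Outputs are partitioned across `busses` shared busses; every input can
// drive every bus. busses == 1 models a simple bus; busses == outputs
// models a crossbar (one bus per output).
InterconnectCost sharedOutputInterconnect(long inputs, long outputs, long busses) {
    // Keep the bus count in the meaningful range [1, outputs].
    busses = std::max(1L, std::min(busses, outputs));
    return { inputs * busses, std::min(inputs, busses) };
}
```

For 8 inputs and 8 outputs, this model gives 8 crosspoints and 1 transfer per cycle for a single bus, 32 and 4 for four busses, and 64 and 8 for the full crossbar.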



### Share Crossbar Outputs

- Group a set of outputs together on a bus



### Share Crossbar Inputs



### Locality in Interconnect

- How to allow physically local items to be closer?






### Interconnect

- Will need an infrastructure for programmable connections
- Rich design space to tune area-bandwidth-locality
  - Will explore more later in course



### Masters and Slaves

- Regardless of form, potentially have two kinds of entities on interconnect
- Master can initiate requests
  - E.g. processor that can perform a read or write
- Slaves can only respond to requests
  - E.g. memory that can return the read data from a read request


## Long Latency Memory Operations


### Day 3

- Large memories are slow
  - Latency increases with memory size
- Distant memories are high latency
  - Multiple clock cycles to cross chip
  - Off-chip memories even higher latency


### Day 3, Preclass 2

- 10 cycle latency to memory
- If must wait for data return, latency can degrade throughput
- 10 cycle latency + 10 cycle op + (assorted overhead)
  - More than 20 cycles / result

```
for (i = 0; i < MAX; i++) {
  in = a[i];    // memory read (10 cycle latency)
  out = f(in);  // 10 cycle compute
  b[i] = out;   // memory write
}
```


### Preclass 3

- Throughput using 3 threads?

```
P1: for(i=0;i<MAX;i++) Astream.write(a[i]);
P2: while(1) {Astream.read(aval); Bstream.write(f(aval));}
P3: for(i=0;i<MAX;i++) Bstream.read(b[i]);
```
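A back-of-the-envelope model for comparing the single loop with the three-thread version. It rests on assumptions that are not stated on the slides: the memory is pipelined and accepts a new access every cycle, the streams have enough buffering, and `f` is not pipelined, so it starts one input every 10 cycles. The function names are hypothetical.

```cpp
#include <algorithm>

// Single thread: each iteration waits for the read, then computes.
// Ignores loop and write overhead, so this is a lower bound.
int cyclesPerResultSequential(int memLatency, int computeCycles) {
    return memLatency + computeCycles;
}

// Decoupled threads connected by streams: in steady state the slowest
// stage sets the rate. Memory latency adds startup delay but is
// overlapped with compute afterwards.
int cyclesPerResultDecoupled(int readInterval, int computeInterval, int writeInterval) {
    return std::max(readInterval, std::max(computeInterval, writeInterval));
}
```

With the slide's numbers, `cyclesPerResultSequential(10, 10)` gives at least 20 cycles per result, while `cyclesPerResultDecoupled(1, 10, 1)` gives 10 cycles per result under these assumptions, since the fetch latency is paid once at startup rather than on every iteration.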


### Fetch (Write) Threads

- Potentially useful to move data in separate thread
- Especially when
  - Long (potentially variable) latency to data source (memory)
- Useful to split request/response


### Peripherals







### Timing Demands

- Must read each input before it is overwritten
- Must write each output within real-time window
- Must guarantee processor scheduled to service each I/O at appropriate frequency
- How many cycles between inputs for 1Gb/s network and 32b, 1GHz processor?
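Working the last question through, assuming one input is a single 32-bit word arriving at the full 1 Gb/s line rate and ignoring framing overhead:

```latex
\[
t_{\text{word}} = \frac{32\ \text{b}}{10^{9}\ \text{b/s}} = 32\ \text{ns},
\qquad
\text{cycles between inputs} = 32\ \text{ns} \times 10^{9}\ \text{cycles/s} = 32 .
\]
```

Under these assumptions, the processor must service the network at least once every 32 cycles.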




## DMA

### Preclass 4

- How much hardware to support fetch thread:
  - Counter bits?
  - Registers?
  - Comparators?
  - Other gates?
- Compare to MicroBlaze
  - (minimum config 630 6-LUTs)


### Observe

- Modest hardware can serve as data movement thread
  - Much less hardware than a processor
  - Offload work from processors
- Small hardware allows peripherals to be master devices on interconnect




### DMA Engine

- Data Movement Thread
  - Specialized processor that moves data
- Acts independently
- Implements data movement
- Can build to move data between memories (slave devices)
- E.g., implement P1, P3 in Preclass 3




### Programmable DMA Engine

- What copy from?
- Where copy to?
- Stride?
- How much?
- What size data?
- Loop?
- Transfer Rate?
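The questions above map naturally onto the fields of a transfer descriptor. The following is a software model only, a sketch under assumptions: the struct `DmaDescriptor`, its field names, and `runTransfer` are hypothetical and do not correspond to the register layout of any particular DMA engine. Transfer rate is left out, since it is a timing property and this model has no notion of time.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical transfer descriptor: one field per question on the slide.
struct DmaDescriptor {
    const std::uint8_t* src;  // what to copy from
    std::uint8_t* dst;        // where to copy to
    std::size_t elemBytes;    // what size data
    std::size_t count;        // how much (number of elements)
    std::size_t srcStride;    // stride between source elements, in bytes
    std::size_t dstStride;    // stride between destination elements, in bytes
    std::size_t repeats;      // loop: how many times to run the whole transfer
};

// Software model of the engine: performs the copy and returns the
// number of elements moved.
std::size_t runTransfer(const DmaDescriptor& d) {
    std::size_t moved = 0;
    for (std::size_t r = 0; r < d.repeats; ++r) {
        for (std::size_t i = 0; i < d.count; ++i) {
            std::memcpy(d.dst + i * d.dstStride,
                        d.src + i * d.srcStride,
                        d.elemBytes);
            ++moved;
        }
    }
    return moved;
}
```

For example, a descriptor with `elemBytes = 1`, `count = 4`, `srcStride = 2`, `dstStride = 1`, and `repeats = 1` gathers every other byte of the source into a contiguous destination. Repeating a transfer is mainly useful when the source is a buffer that is refilled between passes.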


### Multithreaded DMA Engine

- One copy task does not necessarily saturate the bandwidth of the DMA engine
- Share engine performing many transfers (channels)
- Separate transfer state for each
  - Hence thread (or channel)
- Swap among threads
  - E.g., round-robin
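The channel idea can be sketched as a small software model: each channel keeps its own transfer state, and one shared engine serves the channels in round-robin order. This is an illustration under assumptions, not a description of a real engine: it moves one word per step, and `Channel`, `RoundRobinDma`, `addChannel`, and `step` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Per-channel transfer state: where to read, where to write, what is left.
struct Channel {
    const int* src;
    int* dst;
    std::size_t remaining;
};

// One engine shared by several channels. Each step moves at most one word,
// taken from the next channel in round-robin order that still has work.
class RoundRobinDma {
public:
    void addChannel(const int* src, int* dst, std::size_t words) {
        channels_.push_back(Channel{src, dst, words});
    }

    // Returns the index of the channel served, or -1 if all are done.
    int step() {
        const std::size_t n = channels_.size();
        for (std::size_t k = 0; k < n; ++k) {
            const std::size_t c = (next_ + k) % n;
            Channel& ch = channels_[c];
            if (ch.remaining > 0) {
                *ch.dst++ = *ch.src++;  // move one word
                --ch.remaining;
                next_ = (c + 1) % n;    // resume after this channel next time
                return static_cast<int>(c);
            }
        }
        return -1;
    }

private:
    std::vector<Channel> channels_;
    std::size_t next_ = 0;
};
```

Calling `step()` once per modeled cycle interleaves the active transfers, so a channel that stalls or finishes does not leave the engine idle while others still have work.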




### Hardwired and Programmable

- Zynq has hardwired DMA engine
- Can also add data movement engines (Data Movers) in FPGA fabric




### Example

- Networking Application



- Header on processor
- Payload (encrypt, checksum) on FPGA
- DMA from Ethernet→main memory
- DMA main memory→BRAM
- Stream between payload components
- DMA from checksum to Ethernet out


### Big Ideas

- Need to move data
- Shared interconnect to make physical connections
  - Can tune area/bandwidth/locality
- Useful to
  - Move data as separate thread of control
  - Have dedicated data-movement hardware: DMA


### Admin

- Day 12
  - DRAM reading if not read on Day 3
- HW5 due Friday
