# ESE532: System-on-a-Chip Architecture

Day 3: September 11, 2017 Memory

Penn ESE532 Spring 2017 -- DeHon



## Today

#### Memory

- Memory Bottleneck
- Memory Scaling
- Latency Engineering
  - Scheduling
  - Data Reuse: Scratchpad and cache
- Bandwidth Engineering
  - Wide word
  - Banking (form of parallelism for data)
- DRAM


## Message

- Memory bandwidth and latency can be bottlenecks
- Minimize data movement
- Exploit small, local memories
- Exploit data reuse


# **Memory Scaling**


#### On-Chip Delay

- Delay is proportional to distance travelled
- Make a wire twice the length
  - Takes twice the latency to traverse
  - (can pipeline)
- Modern chips
  - Run at 100s of MHz to GHz
  - Take 10s of ns to cross the chip
  - Take 100s of ns to reference off-chip data
- What does this say about placement of computations and memories?


#### Memory Block

- Linear wire delay with distance
- Assume constant area/memory bit
- N-bit memory, arranged in square
  - Width?
  - Depth?
  - How does delay scale with N?
- How does delay change with 4x capacity?
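One way to work these questions, as a sketch under the slide's stated assumptions (delay linear in wire length, constant area per bit, square layout). The symbols $A_{bit}$ (area per bit) and $k$ (delay per unit length) are introduced here for illustration only:

```latex
\begin{aligned}
\text{area} &= N \cdot A_{bit} \\
\text{width} = \text{depth} &= \sqrt{N \cdot A_{bit}} \;\propto\; \sqrt{N} \\
T_{delay}(N) &\approx k \cdot (\text{width} + \text{depth}) \;\propto\; \sqrt{N} \\
\frac{T_{delay}(4N)}{T_{delay}(N)} &= \frac{\sqrt{4N}}{\sqrt{N}} = 2
\end{aligned}
```

Under this model, quadrupling the capacity roughly doubles the access delay.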











# Latency Engineering: Scheduling


#### Preclass 2

- 10 cycle latency to memory
- Throughput in each case?

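```
// Case 1: read, then compute, in every iteration
for (i=0;i<MAX;i++) {
    in=a[i];    // memory read
    out=f(in);  // 10 cycle compute
    b[i]=out;
}

// Case 2: fetch the next input while computing on the current one
next_in=a[0];
for (i=1;i<MAX;i++) {
    in=next_in;
    next_in=a[i];  // memory read for the next iteration
    out=f(in);     // 10 cycle compute
    b[i-1]=out;
}
b[MAX-1]=f(next_in);
```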

#### Lesson

- Long memory latency can impact throughput
  - When we must wait on it
  - When it is part of a cyclic dependency
- Overlap memory access with computations when possible
  - Exploit parallelism between compute and memory access
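A rough cycle-count model can make the difference between the two Preclass 2 cases concrete. This is a minimal sketch, not the course's model: the function names are hypothetical, writes to `b[]` are treated as free, and the overlapped case assumes a read and a compute can proceed fully in parallel.

```c
/* Hypothetical cost model (names and simplifications are assumptions):
 * t_mem  = cycles for one memory read
 * t_comp = cycles for one call to f()
 * Writes to b[] are treated as free. Assumes max >= 1. */

/* Case 1: each iteration reads, then computes, with no overlap. */
long serial_cycles(long max, long t_mem, long t_comp) {
    return max * (t_mem + t_comp);
}

/* Case 2: the read for the next iteration overlaps the compute for the
 * current one, so each steady-state step costs the slower of the two. */
long overlapped_cycles(long max, long t_mem, long t_comp) {
    long step = (t_mem > t_comp) ? t_mem : t_comp;
    /* first read, then max-1 overlapped steps, then the final compute */
    return t_mem + (max - 1) * step + t_comp;
}
```

With a 10-cycle read and a 10-cycle compute, `serial_cycles(1000, 10, 10)` gives 20000 and `overlapped_cycles(1000, 10, 10)` gives 10010: roughly one result every 20 cycles versus one every 10.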


# Latency Engineering: Data Reuse


#### Preclass 3abc

- MAX=10<sup>6</sup>, WSIZE=5
- How many reads to x[] and w[]?
- How many times does t+=x[]\*w[] execute?

- Runtime
  - read x, w: 20
  - t+=...: 5

```
for (i=0;i<MAX;i++) {
    t=0;
    for (j=0;j<WSIZE;j++)
        t+=x[i+j]*w[j];
    y[i]=t;
}
```
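The counting questions above can be checked by tallying the operations the loop nest performs. This is a sketch only: the struct and function names are hypothetical, and the runtime estimate assumes every `x[]` and `w[]` read goes to the large memory at the slide's cost of 20, every `t+=` costs 5, and nothing overlaps.

```c
/* Hypothetical helper: tallies the accesses made by the windowed loop.
 * Struct and function names are illustrative, not from the slides. */
struct op_counts {
    long x_reads;   /* reads of x[] */
    long w_reads;   /* reads of w[] */
    long macs;      /* executions of t += x[i+j]*w[j] */
};

struct op_counts count_ops(long max, long wsize) {
    struct op_counts c = {0, 0, 0};
    for (long i = 0; i < max; i++) {
        for (long j = 0; j < wsize; j++) {
            c.x_reads++;   /* x[i+j] */
            c.w_reads++;   /* w[j]   */
            c.macs++;      /* t += ... */
        }
    }
    return c;
}

/* Runtime estimate, assuming each read costs t_read, each
 * multiply-accumulate costs t_mac, and there is no overlap. */
long runtime_estimate(struct op_counts c, long t_read, long t_mac) {
    return (c.x_reads + c.w_reads) * t_read + c.macs * t_mac;
}
```

For MAX=10<sup>6</sup> and WSIZE=5 this counts 5\*10<sup>6</sup> reads of each array and 5\*10<sup>6</sup> multiply-accumulates, for an estimate of 2.25\*10<sup>8</sup> under this model.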





#### Lesson

- Data can often be reused
  - Keep data needed for computation in closer, smaller (faster, lower-energy) memories
  - Reduces latency costs
  - Reduces bandwidth required from large memories
- Reuse hint: value used multiple times
  - Or value produced and then consumed

#### **Enhancing Reuse**

- Computations can often be (re-)organized around data reuse


# Reorganization Example

- What is reused?
- Why is this problematic as written?
- How can we revise the code to promote local reuse?

```
for (i=0;i<MAX;i++) {
    t=0;
    for (j=0;j<WSIZE;j++)
        t+=x[i+j]*w[j];
    y[i]=t;
}
s=0;
for (i=0;i<MAX;i++) {
    s+=y[i];
}
avg=s/MAX;
```


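#### Reorganize

```
// Original: y[] is written by one loop and read back by a second loop
for (i=0;i<MAX;i++) {
    t=0;
    for (j=0;j<WSIZE;j++)
        t+=x[i+j]*w[j];
    y[i]=t;
}
s=0;
for (i=0;i<MAX;i++) {
    s+=y[i];
}
avg=s/MAX;

// Reorganized: accumulate s while t (the value of y[i]) is still local
s=0;
for (i=0;i<MAX;i++) {
    t=0;
    for (j=0;j<WSIZE;j++)
        t+=x[i+j]*w[j];
    y[i]=t;
    s+=t; // y[i]
}
avg=s/MAX;
```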

#### **Processor Data Caches**

- Traditional processor data caches are a heuristic instance of this
  - Add a small memory local to the processor
    - It is fast, low latency
  - Store anything fetched from large/remote memory in local memory
    - Hoping for reuse in near future
  - On every fetch, check local memory before going to large memory
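The check-then-fetch behavior described above can be sketched with a tiny software model. Everything here is an assumption made for illustration: a direct-mapped organization, 8 one-word lines, and hypothetical names throughout. Real processor caches differ in size, line width, and associativity.

```c
#include <stdbool.h>

#define NLINES 8  /* assumed cache size: 8 one-word lines */

/* Hypothetical direct-mapped cache model: one word per line. */
struct cache {
    bool valid[NLINES];
    unsigned long tag[NLINES];  /* large-memory address held in each line */
    int data[NLINES];
    long hits, misses;
};

void cache_init(struct cache *c) {
    for (int k = 0; k < NLINES; k++) c->valid[k] = false;
    c->hits = c->misses = 0;
}

/* Read large_mem[addr] through the cache. */
int cache_read(struct cache *c, const int *large_mem, unsigned long addr) {
    unsigned long line = addr % NLINES;
    if (c->valid[line] && c->tag[line] == addr) {
        c->hits++;                        /* found locally */
    } else {
        c->misses++;                      /* go to large memory, keep a copy */
        c->valid[line] = true;
        c->tag[line] = addr;
        c->data[line] = large_mem[addr];
    }
    return c->data[line];
}

/* Windowed access pattern from the preclass loop: touches x[i+j]. */
long window_sum(struct cache *c, const int *x, long max, long wsize) {
    long s = 0;
    for (long i = 0; i < max; i++)
        for (long j = 0; j < wsize; j++)
            s += cache_read(c, x, (unsigned long)(i + j));
    return s;
}
```

For 100 windows of width 5 over 104 words, this model sees 104 misses (one per distinct word) and 396 hits, since each word stays resident until its last use.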




#### Cache

- Goal: performance of small memory with density of large memory



#### **Processor Data Caches**

- Demands more than a small memory
  - Need to sparsely store address/data mappings from large memory
  - Makes it more expensive in area, delay, and energy than a simple memory of the same capacity
- Don't need explicit data movement
- Cannot control when data is moved/saved
  - Bad for determinism
- Limited ability to control what stays in small memory simultaneously

# **Terminology**

- Cache
  - Hardware-managed small memory in front of larger memory
- Scratchpad
  - Small memory
  - Software (or logic) managed
  - Explicit reference to scratchpad vs. large (other) memories
  - Explicit movement of data


# Bandwidth Engineering


#### Bandwidth Engineering

- High bandwidth is easier to engineer than low latency
  - Wide-word
  - Banking
    - Decompose memory into independent banks
    - Route requests to appropriate bank




- Relatively easy to have a wide memory
  - As long as we share the address
  - One address to select wide word or bits
- Efficient if all read together






#### Preclass 3 + wide

- Impact of 8 word datapath between large and local memory?
- Runtime with local memories:

$$
WSIZE \cdot MAX \cdot (T_{comp} + 3 \cdot T_{local}) + WSIZE \cdot (T_w + T_x) + MAX \cdot T_x
$$

$$
5 \cdot 10^6 \cdot (5+3) + 5 \cdot (20+20) + 10^6 \cdot 20 \approx 6 \cdot 10^7
$$

```
// Load the weights and the first WSIZE-1 inputs into local memory
for (j=0;j<WSIZE;j++)
    local_w[j]=w[j];
for (j=0;j<WSIZE-1;j++)
    local_x[j+1]=x[j];

for (i=0;i<MAX;i++) {
    // Shift the window and bring in one new input from large memory
    for (j=0;j<WSIZE-1;j++)
        local_x[j]=local_x[j+1];
    local_x[WSIZE-1]=x[i+WSIZE-1];
    t=0;
    for (j=0;j<WSIZE;j++)
        t+=local_x[j]*local_w[j];
    y[i]=t;
}
```



#### Lesson

- Cheaper to access wide/contiguous blocks of memory
  - In hardware
  - From the architectures we typically build
- Can achieve higher bandwidth on large block data transfer
  - Than random access of small data items

#### Bank Memory

- Break memory into independent banks
  - Allow banks to operate concurrently

[Figure: Proc with two large mem banks]
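Routing requests to banks can be sketched as a simple address mapping. The choices here are assumptions for illustration, not taken from the slides: 4 banks, low-order interleaving, one request per bank per cycle, and hypothetical function names.

```c
#define NBANKS 4  /* assumed number of banks */

/* Low-order interleaving (one possible mapping, assumed here):
 * consecutive word addresses land in consecutive banks. */
unsigned bank_of(unsigned long addr) {
    return (unsigned)(addr % NBANKS);
}

unsigned long offset_of(unsigned long addr) {
    return addr / NBANKS;
}

/* Cycles to serve a batch of requests if each bank handles one request
 * per cycle and all banks run concurrently: the busiest bank sets the time. */
long batch_cycles(const unsigned long *addrs, int n) {
    long load[NBANKS] = {0};
    long worst = 0;
    for (int k = 0; k < n; k++) {
        long l = ++load[bank_of(addrs[k])];
        if (l > worst) worst = l;
    }
    return worst;
}
```

In this model, four requests to addresses 0, 1, 2, 3 hit four different banks and finish in one cycle, while addresses 0, 4, 8, 12 all map to bank 0 and take four.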








# DRAM
















#### **DRAM**

- Latency is large (10s of ns)
- Throughput can be high (GB/s)
  - If accessed sequentially
  - If we exploit wide-word block transfers
- Throughput is low on random accesses
  - As we saw for random access on wide-word memory


## **Memory Organization**

- Architecture contains
  - Large memories
    - For density, necessary sharing
  - Small memories local to compute
    - For high bandwidth, low latency, low energy
- Need to move data
  - Among memories
    - Large to small and back
    - Among small


# Big Ideas

- Memory bandwidth and latency can be bottlenecks
- Exploit small, local memories
  - Easy bandwidth, low latency, low energy
- Exploit data reuse
  - Keep in small memories
- Minimize data movement
  - Small, local memories keep distance short
  - Minimally move into small memories


#### Admin

- Reading for Wednesday on Canvas
- HW2 due Friday
