# ESE532: System-on-a-Chip Architecture

Day 3: September 9, 2019

Memory

Penn ESE532 Fall 2019 -- DeHon

### Today

- Memory Bottleneck
- Memory Scaling
- Latency Engineering
  - Scheduling
  - Data Reuse: Scratchpad and cache
- Memory Bandwidth Engineering
  - Wide word
  - Banking (form of parallelism for data)




# On-Chip Delay

- Delay is proportional to distance travelled
- Make a wire twice the length
  - Takes twice the latency to traverse
  - (can pipeline)
- Modern chips
  - Run at 100s of MHz to GHz
  - (cycle times from under 10 ns down to about 1 ns)
  - Take 10s of ns to cross the chip
  - Take 100s of ns to reference off-chip data
- What does this say about placement of computations and memories?



































| Preclass 3 | |
|---|---|
| <ul> <li>WSIZE\*T<sub>w</sub></li> <li>WSIZE\*T<sub>x</sub></li> <li>WSIZE\*MAX\*T<sub>local</sub></li> <li>MAX\*T<sub>x</sub></li> <li>WSIZE\*MAX\*T<sub>comp</sub></li> <li>WSIZE\*MAX\*2\*T<sub>local</sub></li> </ul> | <pre>for (j=0;j&lt;WSIZE;j++)<br/>  local_w[j]=w[j];<br/>for (j=0;j&lt;WSIZE-1;j++)<br/>  local_x[j+1]=x[j];<br/>for (i=0;i&lt;MAX;i++) {<br/>  for (j=0;j&lt;WSIZE-1;j++)<br/>    local_x[j]=local_x[j+1];<br/>  local_x[WSIZE-1]=x[i+WSIZE-1];<br/>  t=0;<br/>  for (j=0;j&lt;WSIZE;j++)<br/>    t+=local_x[j]*local_w[j];<br/>  y[i]=t;<br/>}</pre> |

| Preclass 3 | |
|---|---|
| <ul> <li>WSIZE\*MAX\*(T<sub>comp</sub>+3\*T<sub>local</sub>)<br/>+WSIZE\*(T<sub>w</sub>+T<sub>x</sub>)<br/>+MAX\*T<sub>x</sub></li> <li>5\*10<sup>6</sup>\*(5+3)<br/>+5\*(20+20)<br/>+10<sup>6</sup>\*20</li> <li>≈ 6\*10<sup>7</sup></li> </ul> | <pre>for (j=0;j&lt;WSIZE;j++)<br/>  local_w[j]=w[j];<br/>for (j=0;j&lt;WSIZE-1;j++)<br/>  local_x[j+1]=x[j];<br/>for (i=0;i&lt;MAX;i++) {<br/>  for (j=0;j&lt;WSIZE-1;j++)<br/>    local_x[j]=local_x[j+1];<br/>  local_x[WSIZE-1]=x[i+WSIZE-1];<br/>  t=0;<br/>  for (j=0;j&lt;WSIZE;j++)<br/>    t+=local_x[j]*local_w[j];<br/>  y[i]=t;<br/>}</pre> |









## **Processor Data Caches**

- Demands more than a small memory
  - Need to sparsely store address/data mappings from large memory
  - Makes a cache more expensive in area/delay/energy than a simple memory of the same capacity
- Don't need explicit data movement
- Cannot control when data is moved/saved
  - Bad for determinism
- Limited ability to control what stays in small memory simultaneously

### Terminology

- Cache
  - Hardware-managed small memory in front of larger memory
- Scratchpad
  - Small memory
  - Software (or logic) managed
  - Explicit reference to scratchpad vs. large (other) memories
  - Explicit movement of data



# Terminology: Unroll Loop

- Replace loop with instantiations of body
- For WSIZE=5

| Loop | Unrolled |
|---|---|
| <pre>for (i=0;i&lt;MAX;i++) {<br/>  t=0;<br/>  for (j=0;j&lt;WSIZE;j++)<br/>    t+=x[i+j]*w[j];<br/>  y[i]=t;<br/>}</pre> | <pre>for (i=0;i&lt;MAX;i++) {<br/>  t=x[i+0]*w[0]+x[i+1]*w[1]+x[i+2]*w[2]<br/>    +x[i+3]*w[3]+x[i+4]*w[4];<br/>  y[i]=t;<br/>}</pre> |





### Amdahl's Law

- If you only speed up a fraction Y of the code, the most you can accelerate your application is 1/(1-Y)
- T<sub>before</sub> = 1\*Y + 1\*(1-Y)
- Speed up that fraction by a factor of S
- T<sub>after</sub> = (1/S)\*Y + 1\*(1-Y)
- Limit S $\rightarrow$ infinity: T<sub>before</sub>/T<sub>after</sub> = 1/(1-Y)




































# Admin

- Reading for Wednesday on canvas
- HW2 due Friday
