## Midterm

① This is a preview of the draft version of the quiz

Started: Oct 6 at 10:27pm

## **Quiz Instructions**

Regulations: https://www.seas.upenn.edu/~ese532/fall2020/midterm\_details.pdf

| Question 1                                                                    | 1 pts    |
|-------------------------------------------------------------------------------|----------|
| I certify that I have complied with the University of Pennsylvania's Code of  |          |
| Academic Integrity and the exam regulations                                   |          |
| https://www.seas.upenn.edu/~ese532/fall2020/midterm_details.pdf               |          |
| (https://www.seas.upenn.edu/~ese532/fall2020/midterm_details.pdf) in complete | ing this |
| exam.                                                                         |          |
|                                                                               |          |
| O True                                                                        |          |
| ○ False                                                                       |          |
|                                                                               |          |
|                                                                               |          |

Consider the following code in answering the questions on this exam.

#include <stdint.h>
#define NUM\_POINTS 1000
#define LOG2\_NUM\_POINTS 10
#define MAX\_AREA (((uint64\_t)1<<63)-1)
#define MAX\_TIME (((uint64\_t)1<<63)-1)
uint64\_t min(uint64\_t a, uint64\_t b); // assume single instruction
uint64\_t max(uint64\_t a, uint64\_t b); // assume single instruction
extern int \*\*tp1set, \*\*tp2set, \*\*ap1set; // can hold negative numbers</pre>

```
extern int *dom;
extern uint64_t *ma, *mt;
uint64 t area param(int arg, int num, int *a)
{
 uint64 t res=0;
 for (int i=0;i<num;i++) // loop F
  {
    int b=(arg & 0x01);
    arg=arg>>1;
    res+=b*a[i];
  }
 return(res);
}
uint64 t time param(int arg, int num, int *t1, int *t2)
{
 uint64 t res=0;
 for (int i=0;i<num;i++) // loop G
  {
    int b=(arg & 0x01);
    arg=arg>>1;
    int tmp=(b*t1[i]+res);
    int t2i=t2[i];
    if (tmp==t2i)
       res=res+1;
    else
      res=max(t2i,res);
  }
 return(res);
}
void opt (int *tp1, int *tp2, int *ap1,
int *non_dom_count_ptr, uint64_t *min_area_ptr, uint64_t *min_time_ptr)
{
 uint64 t a[NUM POINTS];
 uint64 t t[NUM POINTS];
 uint16_t dom[NUM_POINTS];
 uint64 t min area=MAX AREA;
```

```
uint64_t min_time=MAX_TIME;
 uint64_t non_dom_count=0;
 for (int i=0;i<NUM POINTS;i++) // loop A
  {
   a[i]=area_param(i,LOG2_NUM_POINTS,ap1);
   t[i]=time_param(i,LOG2_NUM_POINTS,tp1,tp2);
   dom[i]=0;
  }
 for (int i=0;i<NUM POINTS;i++) // loop B
  {
   min_area=min(a[i],min_area);
   min_time=min(t[i],min_area);
  }
 for (int i=0;i<NUM_POINTS;i++) // loop C
  for (int j=0;j<NUM POINTS;j++) // loop D
   {
      if ((i!=j) && (a[j]<=a[i]) && (t[j]<=t[i])) dom[i]++;
   }
 for (int i=0;i<NUM POINTS;i++) // loop E
   {
    if (dom[i]==0) non_dom_count++;
   }
 *non_dom_count_ptr=non_dom_count;
 *min_area_ptr=min_area;
 *min_time_ptr=min_time;
}
void multi_opt(int num)
{
 for (int i=0;i<num;i++) // loop H
  {
   opt(*tp1set,*tp2set,*ap1set,dom,ma,mt);
   dom++;
   ma++;
   tp1set++;
   tp2set++;
   ap1set++;
  }
```

We start with a baseline, single processor system as shown.



## local scratchpad memory

- For simplicity throughout, we will treat non-memory indexing adds (subtracts count as adds), compares, logical operations (&&, ||), min, max, and multplies as the only compute operations. We'll assume the other operations take negligible time or can be run in parallel (ILP) with the listed compute and memory operations. (Some consequences: You may ignore loop and conditional overheads in processor runtime estimates; you may ignore computations in array indicies.)
- Baseline processor can execute one multiply, compare, or add per cycle and runs at 1 GHz.
- Reads from and writes to the 1 MB main memory issue in one cycle, but require 5 cycles of latency (including issue) to get a result; memory can supply one read or write each cycle.
- Reads from and writes to the 1 KB scratchpad memory take 1 cycle.
- By default, all arrays live in the main memory and all array references are to main memory.
- Assume non-array variables live in registers.
- Assume all additions are associative. Max and min are associative.
- Assume comparisons, adds, min, max, and multiplies take 1 ns when implemented in hardware accelerator, so fully pipelined accelerators also run at 1 GHz. A compare-mux operation can also be implemented in 1 ns.
- A lookup in a small memory (1KB or small) can complete in 1ns.



| Question 3 | 5 pts |
|------------|-------|
|            |       |

| For the single | -processor implementation, identify the bottleneck top-level loop. |
|----------------|--------------------------------------------------------------------|
| 🔵 Loop A       |                                                                    |
| 🔵 Loop B       |                                                                    |
| 🔿 Loop C       |                                                                    |
| 🔿 Loop E       |                                                                    |

| Question 4                                                                                                                                     | 5 pts                  |
|------------------------------------------------------------------------------------------------------------------------------------------------|------------------------|
| What is the maximum Amdahl's Law speedup if we only accelerate the loop identified in Question 3. (show work for partial credit consideration) | I                      |
| HTMLEC<br>BIUA + A + Ix E E E E E X <sup>2</sup> × <sub>2</sub> E i<br>⊞ + ⊡ & X I √x ▷ M ¶ 12pt + Paragra                                     | <u>litor</u> ∰<br>ph ▼ |
|                                                                                                                                                |                        |

| Question 5                                                                                                                                                                     | 10 pts            |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------|
| Consider the coarse-grained dataflow graph for the top-level loops (A, B, C<br>What precedence constraints exist among these loops (what producer->co<br>relationships exist). | C, E).<br>onsumer |
| LoopA>LoopB                                                                                                                                                                    |                   |
| LoopB>LoopC                                                                                                                                                                    |                   |
| C LoopC>Loop E                                                                                                                                                                 |                   |
| LoopE>LoopA                                                                                                                                                                    |                   |
| LoopA>LoopC                                                                                                                                                                    |                   |
| LoopA>LoopE                                                                                                                                                                    |                   |

| C LoopB>LoopE |  |
|---------------|--|
| LoopC>LoopB   |  |
| LoopC>LoopA   |  |
| C LoopE>LoopB |  |
| LoopE>LoopC   |  |
| LoopB>LoopA   |  |

| Question 6                                 | 7 pts |
|--------------------------------------------|-------|
| Identify the loops that are data parallel. |       |
| Loop A                                     |       |
| C Loop B                                   |       |
| C Loop C                                   |       |
| C Loop D                                   |       |
| C Loop E                                   |       |
| C Loop F                                   |       |
| Loop G                                     |       |

| Question 7                                           | 7 pts |
|------------------------------------------------------|-------|
| Explain why each loop above is data parallel or not. |       |
|                                                      |       |









| Question 11                                                                                                                                                                         | 5 pts        |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------|
|                                                                                                                                                                                     |              |
| When pipelined, what is the minimum clock cycle time achievable for loop G time_param()?                                                                                            | in :         |
| [Do not unroll the loop. Think about pipelining the loop body to start one ne iteration of the body on each clock cycle.]                                                           | W            |
| [Assume you can have small local memory banks to hold t1 and t2.]                                                                                                                   |              |
| HTML EC                                                                                                                                                                             | <u>litor</u> |
| $B I \cup A \bullet A \bullet \underline{\mathcal{I}}_{x} \equiv \Xi \equiv \underline{\mathcal{I}} \equiv x^{2} \times_{2} \underline{\mathcal{I}} \equiv \underline{\mathcal{I}}$ |              |
| ⊞ ▼ 🗈 🔗 🔆 🛋 √× 🕞 州 ¶, 12pt 🔹 Paragra                                                                                                                                                | ph 🔻         |

| Question 12                                                                                                                                                                                                          | 10 pts                  |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| Design a pipeline for loop G in the time_param() calculation that achie<br>minimum cycle time (in ns) [as identified in the previous question].<br>[Continue to assume you can have small local memory banks to hold | eves the<br>t1 and t2.] |
| Upload Choose a File                                                                                                                                                                                                 |                         |

**Question 13** 

Describe how to map the multi\_opt() computation to the following heterogeneous system to maximize the throughput of opt() calculations.



- 1 instance of 1 GHz, II=1 area\_param (loop F) calculation pipeline (assume have local memory banks for ap1 that can be used for each function invocation)
- 1 instance of time\_param (loop G) calculation pipeline (as you designed above)

(assume have local memory banks for tp1, tp2 that can be used for each function invocation)

- 2 single-issues, baseline processors (P), each running at 1 GHz.
- 8 Vector Processors (VP) with 8, 64b-wide vector lanes, each running at 1 GHz.

(for the loops that are data-parallel, you may assume computation achieves the resource bound)

There is a shared, 512b wide path to main memory available to the hardware pipelines and the Vector Processors. It can transfer one contiguous blocks of 512b into a Vector Register File (VRF) or hardware pipe each 1 ns cycle, but it takes 5 cycles of latency before a fetched value is available in the VRF or pipe. The single-issue, baseline processors can only move 64b from the main memory in cycle.







| Quiz saved at 10:28pm Submit Quiz |
|-----------------------------------|
|-----------------------------------|