#### University of Pennsylvania, Department of Electrical and Systems Engineering, System-on-a-Chip Architecture

| ESE532, Fall 2019 | HW4: SIMD | Wednesday, September 18 |
|-------------------|-----------|-------------------------|

Due: Friday, September 27, 5:00PM

In this assignment, we will accelerate the streaming application from the last homework using the ARM NEON vector processor. Note that the application has been modified in a few places. You can find the sources for this homework on the course website, along with a data set.

## Collaboration

In this assignment, you will work with the partners that we assigned. You can find the assignment on Canvas in the *Partners* map under the *Files* section. If the partner assignment does not work out, contact the instructor or TA as soon as possible. Partners may share code and results and discuss analysis, but each writeup should be prepared independently. Outside the assigned groups, only sharing of tool knowledge is allowed. See the course web page, http://www.seas.upenn.edu/~ese532, for full details of our policies for this course.

## ARM NEON

Information about the NEON architecture and datatypes is available in the ARM assembler user guide. Another section in the same guide lists the instructions. Note that not all of this information may apply to the ARMv8 architecture of the Cortex-A53 processor that we are using. You are encouraged to locate other sources as needed and to share them.
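To give a feel for the NEON datatypes before you open the guide, here is a minimal sketch in C++ of a 16-lane saturating byte addition using intrinsics from `arm_neon.h`. The function name `add_saturate_u8` is hypothetical and is not part of the homework sources. The scalar branch is an assumption added so that the sketch also compiles on machines without NEON.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical example (not from the homework sources):
// out[i] = min(a[i] + b[i], 255) for i in [0, n).
void add_saturate_u8(const uint8_t* a, const uint8_t* b, uint8_t* out, int n) {
    int i = 0;
#if defined(__ARM_NEON)
    // uint8x16_t holds sixteen 8-bit lanes in one 128-bit register.
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);       // load 16 bytes
        uint8x16_t vb = vld1q_u8(b + i);       // load 16 bytes
        vst1q_u8(out + i, vqaddq_u8(va, vb));  // saturating add, then store
    }
#endif
    // Scalar loop: handles the tail, or everything when NEON is unavailable.
    for (; i < n; ++i) {
        unsigned sum = static_cast<unsigned>(a[i]) + b[i];
        out[i] = sum > 255u ? 255u : static_cast<uint8_t>(sum);
    }
}
```

The lane count follows from the register width: a 128-bit `q` register holds sixteen 8-bit values, eight 16-bit values, or four 32-bit values, so narrower datatypes process more elements per instruction.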

## **Homework Submission**

### 1. Teamwork

As the difficulty of the homework is ramping up, we encourage you to spend a moment planning how to tackle the homework as a team.

(a) Describe which tasks of this homework you will perform, which tasks will be performed by your teammate(s), and which tasks you will perform together (e.g., pair programming, where you both sit together at the same terminal). Motivate your task distribution. (5 lines)

[Screenshot: the *Import Projects from File System or Archive* dialog. The import source is the `HW4` directory in the workspace; the folders `HW4\Baseline`, `HW4\floorplan_static_wrapper_hw_platform_0`, and `HW4\ultra96_bsp` are selected for import as Eclipse projects (3 of 5 selected).]

Figure 1: Import projects into SDx

(b) Give an estimate of the duration of each of the tasks. (5 lines)

(c) Record the actual time spent on tasks as you work through the assignment.

(d) Explain how you will make sure that the lessons and knowledge gained from the exercises are shared with everybody in the team. (3 lines)

### 2. Compiler Optimizations

Before we dive into the vector optimizations, we will investigate the effects of different levels of compiler optimization. Import the project by clicking File → Open Projects from File System. Choose the directory of the imported project as shown in Figure 1. In case the build output disappears (as it did for me), you can find it by going to Project → Properties → C/C++ Build → Logging and opening the file it points to.
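As a rough illustration of what comparing optimization levels involves, the sketch below times a small stand-in workload with `std::chrono`. Both `scale_and_sum` and `time_ns` are hypothetical names, not functions from the Baseline sources, and the sketch assumes a toolchain where `std::chrono::steady_clock` is available. If the Baseline project already contains its own timing code, use that instead.

```cpp
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical stand-in workload: scales every byte by 3/4 in place
// and returns the sum of the results.
uint64_t scale_and_sum(std::vector<uint8_t>& data) {
    uint64_t sum = 0;
    for (uint8_t& value : data) {
        value = static_cast<uint8_t>((value * 3) / 4);
        sum += value;
    }
    return sum;
}

// Runs `work` once and returns the elapsed wall-clock time in nanoseconds.
template <typename F>
double time_ns(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count();
}
```

A call such as `double ns = time_ns([&] { sum = scale_and_sum(data); });` gives one latency sample. The workload returns a checksum because at higher optimization levels the compiler may remove computations whose results are never used, which would make the measured latency meaningless.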

(a) Measure the latency and size of the Baseline project at the different optimization levels. Because the project was created in SDK, you need to run and debug the Baseline project in a different way. Click Run → Run Configurations. Choose Debugger_Baseline(Default). Specify the ELF file in the Application tab as shown in Figure 2, and configure Target Setup as shown in Figure 3. Click Run. Put your measurements in a table like Table 1. You can change the optimization level as follows: Right-click on the project in the Project Explorer, and select C/C++ Build Settings from the popup menu. In the Settings tab, go to ARM v8 gcc compiler → Optimization, and select

[Screenshot: the *Run Configurations* dialog with the `Debugger_Baseline(Default)` configuration selected under *Xilinx SDx Application Debugger*. The *Application* tab lists the processors `psu_cortexa53_0` through `psu_cortexa53_3`, `psu_cortexr5_0`, `psu_cortexr5_1`, and `psu_pmu_0`. The *Application* field points to `HW4/Baseline/Debug/Baseline.elf` in the workspace.]

Figure 2: Debugger Application Configurations for Baseline

[Screenshot: the *Target Setup* tab of the `Debugger_Baseline(Default)` run configuration. *Hardware Platform* points to `HW4\ultra96V2_wrapper_hw_platform_0\system.hdf` in the workspace, *Bitstream File* points to a file in the same `ultra96V2_wrapper_hw_platform_0` directory, and *FPGA Device* and *PS Device* are both set to *Auto Detect*. The tab also offers a *Use FSBL flow for initialization* option.]
                                                                                                                                                                                                       |                                         | 1.12                  |                 |  |
|                                                                  | Initialization File:                                                                  | C:\Users\ybsao\ws_183\HW4\ultra96V2_wrapper_hw_platform_U\psu_init.tcl                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
                                                                                                                                                                                                       | Search                                  | Browse                |                 |  |
|                                                                  | Reset RPU                                                                             | A non-point or matching of the powerup and reset. Required after programming FPG4     A. The following processors will be reset and suspended.     The set of th |                                         |                       |                 |  |
|                                                                  | ☐ Reset RPU<br>☐ Enable RPU Split<br>☐ Program FPGA<br>☑ Run psu_init<br>☑ PL Powerup | B. Request trigger for PL powerup and reset. Required after programming FPG4<br>Hotel of the set of suspended.     The following processors will be rest and suspended.     S. All processors in the system will be suspended, and Applications will be dow<br>tab.     To provide the suspended, and Applications will be dow<br>tab.     To provide the suspended, and Applications will be dow<br>tab.     To provide the suspended and the set of the suspended of the set of the s         | nloaded to the following process<br>if) | ors as specified in t | the Application |  |
|                                                                  | Reset RPU  Reader RPU Split  Program FPGA  Run psu_init  PL Powerup                   | B. Request Trigger for PL powerup and resct. Required after programming FPG/<br>ded     d. The following processors will be rest and puspended.     1) psu_cortexa53.0     S. All processors in the system will be suspended, and Applications will be dow<br>tab.     1) psu_cortexa53.0 (C\Users\ybiao\ws_183)HW4\Baseline\Debug\Baseline.e                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
                                                                                                                                                                                                       | nloaded to the following process<br>If) | ors as specified in t | th              |  |

Figure 3: Debugger Target Setup Configurations for Baseline

| Optimization level | Latency (ms) | Code size (bytes) |
|--------------------|--------------|-------------------|
| -O0                |              |                   |
| -O1                |              |                   |
| -O2                |              |                   |
| -O3                |              |                   |
| -Os                |              |                   |

Table 1: Latency and Code Size per Optimization Level

one of the optimization levels under *Optimization Level*. You can see the code size by opening the *CDT Global Build Console*. The code size is in the column *text*.

- (b) Include the assembly code of the inner loop of Filter\_horizontal at optimization level -O0 in your report. Click *Run→Debug Configurations*. Choose *Debugger Baseline (Default)*. Specify the elf file in the *Application* tab and configure *Target Setup* as shown in Figure 3. Click *Debug* to enter debug mode.
- (c) Include the assembly code of the inner loop of Filter\_horizontal at optimization level -O2 in your report.
- (d) Based on the machine code of questions 2b and 2c, explain the most important difference between the -O0 and -O2 versions. (2 lines) Hints (leading questions):
  - for each case (-O0, -O2), how many times does the loop read the variable i?
  - for each case (-O0, -O2), how many times does the loop read and write the variable Sum?
  - why is the -O2 loop able to avoid recalculating Y\*INPUT\_WIDTH+X inside the loop body?
  - what else is the -O2 loop able to avoid reading from memory? recalculating?
  - how is the -O2 loop able to perform fewer operations?
- (e) Why would you want to use optimization level -O0?
  Hint: Compile the code with -O3 and track the values of the variables X, Y, and i as you step through Filter\_horizontal. (3 lines)
- (f) Include the assembly code of the inner loop of Filter\_horizontal at optimization level -O3 in your report.
- (g) Based on the machine code of questions 2c and 2f, explain the most important difference between the -O2 and -O3 versions. (1 line)
- (h) What are two drawbacks of using a higher optimization level? (5 lines)
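
The hints in question 2d ask about values that do not change inside a loop. As a rough illustration only, the sketch below contrasts a loop that recomputes an invariant offset on every iteration with a hand-written equivalent that computes it once. The names `filter_naive`, `filter_hoisted`, `WIDTH`, `TAPS`, and `coeff` are hypothetical and are not taken from the homework sources; what the compiler actually does at each optimization level is what you should read from the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WIDTH 8 /* hypothetical row width, not the homework's INPUT_WIDTH */
#define TAPS 3  /* hypothetical filter length */

static const uint8_t coeff[TAPS] = {1, 2, 1}; /* made-up coefficients */

/* Naive form: the offset y * WIDTH + x is written inside the loop body,
 * so it is conceptually recomputed for every tap. */
unsigned filter_naive(const uint8_t *in, size_t x, size_t y)
{
    assert(in != NULL && x + TAPS <= WIDTH);
    unsigned sum = 0;
    for (size_t i = 0; i < TAPS; i++) {
        sum += coeff[i] * in[y * WIDTH + x + i];
    }
    return sum;
}

/* Hand-hoisted form: the loop-invariant offset is computed once, before
 * the loop. Both functions return the same value for the same inputs. */
unsigned filter_hoisted(const uint8_t *in, size_t x, size_t y)
{
    assert(in != NULL && x + TAPS <= WIDTH);
    const uint8_t *p = in + y * WIDTH + x;
    unsigned sum = 0;
    for (size_t i = 0; i < TAPS; i++) {
        sum += coeff[i] * p[i];
    }
    return sum;
}
```

Compiling both functions at different optimization levels and comparing the generated assembly is one way to see whether the compiler performs this kind of rewrite on its own.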

### 3. Automatic Vectorization

The easiest way to take advantage of vector instructions is the automatic vectorization feature of the GCC compiler, which generates NEON instructions from loops. In this part, we tell you how to change the compilation flag to enable vectorization. Automatic vectorization is only sparsely covered in the GCC documentation. Although we are not using the ARM compiler, the ARM compiler user guide may give some more insight.
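
To make the idea concrete, here is a minimal sketch of the kind of loop that a vectorizing compiler can typically handle. The function `add_arrays` is a hypothetical example and is not part of the homework sources. It assumes that the three arrays do not overlap, which the `restrict` qualifiers state explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example, not taken from the homework sources.
 * Every iteration is independent of the others, and `restrict` promises
 * the compiler that dst does not overlap a or b, so the loop is a
 * candidate for automatic vectorization. */
void add_arrays(uint8_t *restrict dst, const uint8_t *restrict a,
                const uint8_t *restrict b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint8_t)(a[i] + b[i]); /* result wraps modulo 256 */
    }
}
```

Whether a given loop is actually vectorized depends on the compiler version and flags. GCC's `-fopt-info-vec` option can report which loops were vectorized, which may help when checking your own code.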

- (a) Report the latency of each stage of the baseline application at -O3. (Start a table that includes each stage and an overall application latency; we will continue to expand this table throughout this problem.)
- (b) Based on your understanding of the C code, which loops in the streaming stages of the application have sufficient data parallelism for vectorization? Motivate your answer. (Add a column to the table you started in Q 3a for marking suitability; add explanation in 2–5 lines after table.)
- (c) Identify the critical path lower bound for Filter\_vertical in terms of compute operations. Focus on the data path. Ignore control flow and offset computations. (5 lines)

Hint: Consider only the dependencies in the computation. What happens if you unroll the loops completely?

(d) Report the resource capacity lower bound for Filter\_vertical. Focus on the computation; you may ignore control flow and addressing computations. There are many resources that may limit the performance. (5 lines)

Hint: As with any resource capacity lower bound analysis, you may have multiple resources and may need to consider them each to identify the one that is most constraining.

Hint: you will need to review the NEON architecture (which we discuss in class) and reason about what resources it has available to be used on each cycle.

(e) What speedup do you expect your application can achieve under ideal circumstances? (5 lines)

Hint: remember Amdahl's Law; think about critical path lower bounds and resource capacity lower bounds.

(Add another column to the table you started in Q 3a showing expected performance after ideal vectorization; separately show Amdahl's Law calculation for overall speedup.)
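
The Amdahl's Law calculation mentioned in the hint can be sketched as follows. This is only an illustration: the function name `overall_speedup` is hypothetical, and it assumes that the stages execute one after another, so that the application latency is the sum of the stage latencies.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the homework sources.
 * latency[i]       : measured latency of stage i before acceleration
 * stage_speedup[i] : expected speedup of stage i (1.0 means unchanged)
 * Returns the overall speedup, assuming the stages run sequentially. */
double overall_speedup(const double *latency, const double *stage_speedup,
                       size_t n)
{
    double before = 0.0;
    double after = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(latency[i] >= 0.0 && stage_speedup[i] > 0.0);
        before += latency[i];
        after += latency[i] / stage_speedup[i];
    }
    assert(after > 0.0);
    return before / after;
}
```

For instance, with two made-up stages of 60 ms and 40 ms, where only the first stage is sped up by a factor of 4, the function returns 100/55, or about 1.82. The stage that is not accelerated limits the overall gain.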

(f) We will enable vectorization in gcc. Right-click the project and choose *Properties→C/C++ Build→Settings→ARM v8 gcc compiler→Miscellaneous*. In "Other flags", change "nosimd" to "simd", and build the project again. Report the program size.

(g) Report the speedup of the vectorized code with respect to the baseline. (Add two more columns to the table you started in Q 3a showing per stage and overall latency (first column) and speedup relative to non-vectorized baseline (second column)).

(h) Explain the discrepancy between your measured and ideal performance based on the optimization of Filter\_horizontal. (3 lines)
Hint: look at the size of the multiplications in the disassembly. If the disassembly window cannot show the complete assembly code (for example, it shows raw words like .word 0x2e20c2a1), you can open the elf file. In *Project Explorer*, double-click *Baseline→Binaries→Baseline.elf*. You can use the line numbers as clues.

Hint: to read this code, you probably need to understand the relation between Q and V registers. Perhaps useful:

- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dht0002a/ch01s03s02.html
- https://developer.arm.com/docs/den0024/latest/armv8-registers/neon-and-floa scalar-register-sizes
- https://developer.arm.com/docs/den0024/latest/armv8-registers/neon-and-floavector-register-sizes
- (i) Show how you can resolve the issue that you identified in the previous problem. (1 line)
- (j) Report the speedup with respect to the baseline after resolving the issue in both Filter\_horizontal and Filter\_vertical. (Add two more columns to the table you started in Q 3a showing per stage and overall speedup after resolving.)

### 4. Reflection

Reflect on the cooperation in your team.

- Compare your actual time on tasks with your original estimates. (table with 1–2 line explanation of major discrepancies)
- Reflect on your task decomposition (Q 1a). Were you able to complete the task as you originally planned? What aspects of your original task distribution worked well and why? Did you refine the plan during the assignment? How and why? In hindsight, how should you have distributed the tasks? (paragraph)
- What was the most useful thing you learned from or while working with your teammate? (2–4 lines)
- What do you believe was the most useful thing that you were able to contribute to your team? (1–3 lines)