2. Setup and Walk-through

Warning

Make sure to stop your Amazon instances! We have only $150 in credits, and they need to last through Homework 7. If you take a long break during a homework, or spend a while reading about something on Google or Stack Overflow, remember to stop your instance and restart it when you are ready to continue.

2.1. Vectorization

We will divide the computation into vectors that run on the NEON units in our ARM cores. Fig. 2.2 shows the microarchitecture of the ARM cores in our A1 instance. Each core is a 3-way decode, out-of-order design with two 128-bit NEON SIMD units.

Fig. 2.2 Microarchitecture of ARM Cortex A-72. Source: PC Watch

The Ultra96 boards have ARM Cortex A-53 cores. Compared to the A-72, the A-53 is a 2-way decode, in-order core with one 64-bit NEON SIMD unit, as shown in Table 2.1.

Table 2.1 ARM Core Comparison, Source: A-53, A-72

Feature                  Cortex A-53                 Cortex A-72
-----------------------  --------------------------  ----------------------
ARM ISA                  ARMv8 (32/64-bit)           ARMv8 (32/64-bit)
Decoder Width            2 micro-ops                 3 ops (5 micro-ops)
Maximum Pipeline Length  8 stages                    16 stages
Integer Add              2                           2
Integer Mul              1                           1
Load/Store Units         1                           1 + 1 (Dedicated L/S)
Branch Units             1                           1
FP/NEON ALUs             1x64-bit                    2x128-bit
L1 Cache                 8KB-64KB I$ + 8KB-64KB D$   48KB I$ + 32KB D$
L2 Cache                 128KB - 2MB (Optional)      512KB - 4MB

We will use the NEON Intrinsics API to program the NEON units in our cores. An intrinsic behaves syntactically like a function, but the compiler translates it to a specific instruction that is inlined in the code. In the following sections, we will guide you through reading the NEON Programmer’s Guide and learning to use these APIs.
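As a small illustration of what an intrinsic looks like in practice, the sketch below adds two 16-byte arrays with a single vector instruction. The function name add16 is hypothetical and not part of the homework code; vld1q_u8, vaddq_u8, and vst1q_u8 are standard intrinsics from arm_neon.h. The scalar branch is there only so that the snippet also compiles on a non-ARM machine.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] = a[i] + b[i] for exactly 16 bytes.
// Each 8-bit lane wraps modulo 256, just like uint8_t arithmetic.
void add16(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#if defined(__ARM_NEON)
    uint8x16_t va = vld1q_u8(a);         // load 16 bytes into a 128-bit register
    uint8x16_t vb = vld1q_u8(b);
    uint8x16_t vsum = vaddq_u8(va, vb);  // one vector add across all 16 lanes
    vst1q_u8(out, vsum);                 // store the 16 results
#else
    // Scalar fallback with the same wrapping behavior.
    for (int i = 0; i < 16; i++) {
        out[i] = static_cast<uint8_t>(a[i] + b[i]);
    }
#endif
}
```

Each intrinsic call looks like an ordinary function call, but on an ARM target the compiler is expected to emit the corresponding NEON instruction in place rather than an actual call.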

2.2. Obtaining and Running the Code

In the previous homework, we dealt with a streaming application that compressed a video stream, and explored how to implement coarse-grain data-level parallelism and pipeline parallelism using std::thread to speed up the application. For this homework, we will use the same application and implement fine-grain, data-level parallelism on a vector architecture; we will explore both auto-vectorization with the compiler and hand-crafted NEON vector intrinsics.
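To make the distinction concrete, here is a minimal sketch of the kind of loop a compiler can typically auto-vectorize on its own. The function brighten is hypothetical and does not appear in the homework code. With GCC, -O3 enables the vectorizer and -fopt-info-vec reports which loops were vectorized; whether the homework Makefile passes these flags is an assumption you should check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example: add a constant to every pixel, clamping at 255.
// __restrict (a GCC/Clang extension) promises that in and out do not
// overlap, which removes one common obstacle to auto-vectorization.
void brighten(const uint8_t *__restrict in, uint8_t *__restrict out,
              std::size_t n, uint8_t delta) {
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < n; i++) {
        int sum = in[i] + delta;  // computed as int, so no wrap-around
        out[i] = static_cast<uint8_t>(sum > 255 ? 255 : sum);
    }
}
```

A hand-written version of the same loop would process 16 pixels per iteration with the saturating-add intrinsic vqaddq_u8, plus a scalar loop for any leftover elements.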

  • Log in to your a1.xlarge instance and clone the ese532_code repository using the following command:

    git clone https://github.com/icgrp/ese532_code.git
    

    If you already have it cloned, pull in the latest changes using:

    cd ese532_code/
    git pull origin master
    

    The code you will use for homework submission is in the hw4 directory. The directory structure looks like this:

    hw4/
        assignment/
            Makefile
            common/
                App.h
                Constants.h
                Stopwatch.h
                Utilities.h
                Utilities.cpp
            src/
                App.cpp
                Compress.cpp
                Differentiate.cpp
                Filter.cpp
                Scale.cpp
            neon_example/
                Example.cpp
        data/
            Input.bin (symlinks to hw3)
            Golden.bin (symlinks to hw3)
    
  • There are 3 targets. You can build all of them by executing make all in the hw4/assignment directory, or build and run each one separately:

    • make baseline and ./baseline to run the project with no vectorization of the Filter_vertical function.

    • make neon_filter and ./neon_filter to run the project with Filter_vertical vectorized (you will modify the vectorized code later).

    • make example and ./example to run the NEON example.

  • The data folder contains the input data, Input.bin, which has 100 frames of 960 by 540 pixels, where each pixel is one byte. Golden.bin contains the expected output. Each program compares its output against this file to check for mismatches.

  • The assignment/common folder has header files and helper functions used by the four parts.

  • You will mostly be working with the code in the assignment/src folder.
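The input dimensions above translate into the following sizes. This is a sketch only: the constant names are hypothetical (the real definitions in the course code may differ), and pixel_offset assumes a tightly packed, row-major layout, which the text does not state. Later pipeline stages may also work on different dimensions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names, chosen for illustration only.
constexpr std::size_t kFrames = 100;
constexpr std::size_t kWidth = 960;
constexpr std::size_t kHeight = 540;
constexpr std::size_t kLanes = 16;  // uint8 lanes in one 128-bit NEON register

constexpr std::size_t kFrameBytes = kWidth * kHeight;       // 518,400 bytes
constexpr std::size_t kInputBytes = kFrames * kFrameBytes;  // 51,840,000 bytes
constexpr std::size_t kVectorsPerRow = kWidth / kLanes;     // 60 vectors

// A 960-pixel row splits evenly into 16-byte vectors, so the input rows
// themselves need no scalar clean-up loop.
static_assert(kWidth % kLanes == 0, "row width is a multiple of 16");

// Byte offset of pixel (x, y) in frame f, assuming row-major packing.
std::size_t pixel_offset(std::size_t f, std::size_t y, std::size_t x) {
    assert(f < kFrames && y < kHeight && x < kWidth);
    return f * kFrameBytes + y * kWidth + x;
}
```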

2.3. Working with NEON

In the following sections, we will do some reading from articles on the Arm Developer website and from the NEON Programmer’s Guide.

2.3.1. Basics

Read Introducing Neon for Armv8-A and answer the following questions. We have given you the answers; however, make sure you do the reading! Knowing where to look in a programmer’s guide is a skill in itself, and we want you to learn it now rather than later.


Read NEON and floating-point registers and answer the following questions:


Read chapter four from the NEON Programmer's Guide and answer the following questions:

2.3.2. Coding with NEON Intrinsics

Read chapter four from the NEON Programmer’s Guide and answer the following questions. Use the Neon Intrinsics Reference website to find and understand any instruction.

Tip

This will help you in coding for your homework.
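As one possible illustration of the load, compute, and store pattern you will meet in the reading, the sketch below blends two rows of pixels using a widening multiply-accumulate. It is not the homework's Filter_vertical: the function name blend_rows, the weights, and the shift by 8 are assumptions chosen for illustration. The intrinsics used (vdup_n_u8, vld1_u8, vmull_u8, vmlal_u8, vshrn_n_u16, vst1_u8) are standard ones from arm_neon.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: out[i] = (a[i] * wa + b[i] * wb) >> 8.
void blend_rows(const uint8_t *a, const uint8_t *b, uint8_t *out,
                std::size_t n, uint8_t wa, uint8_t wb) {
    // With wa + wb <= 255 the sum is at most 255 * 255 = 65025,
    // so it fits in an unsigned 16-bit lane without overflow.
    assert(wa + wb <= 255);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x8_t vwa = vdup_n_u8(wa);  // broadcast each weight to 8 lanes
    const uint8x8_t vwb = vdup_n_u8(wb);
    for (; i + 8 <= n; i += 8) {
        uint8x8_t va = vld1_u8(a + i);       // load 8 pixels from each row
        uint8x8_t vb = vld1_u8(b + i);
        uint16x8_t acc = vmull_u8(va, vwa);  // widen to 16 bits: a * wa
        acc = vmlal_u8(acc, vb, vwb);        // acc += b * wb
        vst1_u8(out + i, vshrn_n_u16(acc, 8));  // shift right by 8 and narrow
    }
#endif
    // Scalar tail; also the full loop when NEON is unavailable.
    for (; i < n; i++) {
        out[i] = static_cast<uint8_t>((a[i] * wa + b[i] * wb) >> 8);
    }
}
```

The products are widened to 16 bits because an 8-bit by 8-bit multiply does not fit in 8 bits; this is why the loop handles 8 pixels per iteration rather than 16, even though the registers are 128 bits wide.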

2.3.3. Optimization