## ESE532: System-on-a-Chip Architecture

Day 6: September 18, 2019
Data-Level Parallelism

Penn ESE532 Fall 2019 -- DeHon

## Today

**Data-level Parallelism**

- For Parallel Decomposition
- Architectures
- Concepts
- NEON

## Message

- Data parallelism is an easy basis for decomposition
- Data-parallel architectures can be compact
  - Pack more computations onto a fixed-size IC die
  - OR perform the computation in less area

## Preclass 1

- 400 news articles
- Count total occurrences of a string
- How can we exploit data-level parallelism on this task?
- How much parallelism can we exploit?

## Parallel Decomposition

## Data Parallel

- Data-level parallelism can serve as an organizing principle for parallel task decomposition
- Run computation on independent data in parallel

## Exploit

- Can exploit with
  - Threads
  - Pipeline Parallelism
  - Instruction-level Parallelism
  - Fine-grained Data-Level Parallelism

## Performance Benefit

- Ideally linear in number of processors (resources)
- Resource Bound: $T_{dp} = (T_{single} \times N_{data})/P$
  - $T_{single}$ = latency on a single data item
- $T_{dp} = \max(T_{single}, N_{data}/P)$

## SPMD

- Single Program Multiple Data
  - Only need to write code once
  - Get to use many times

## Preclass 2: Common Examples

- What are common examples of DLP?
  - Simulation
  - Numerical Linear Algebra
  - Signal or Image Processing
  - Image Processing
  - Optimization

## Hardware Architectures

## Idea

- If we're going to perform the same operations on different data, exploit that to reduce area and energy
- Reduced area means we can have more computation on a fixed-size die
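
One way to picture the data-parallel decomposition asked about in Preclass 1 is the SPMD-style C sketch below. The function names (`count_in_article`, `count_slice`, `count_total`) and the contiguous-slice partitioning are assumptions made for illustration, not something taken from the lecture. The "workers" run one after another here; on real hardware each slice could run on its own processor, since no article depends on another.

```c
#include <stddef.h>
#include <string.h>

/* Count (possibly overlapping) occurrences of pattern in one article. */
static size_t count_in_article(const char *article, const char *pattern)
{
    size_t count = 0;
    if (pattern[0] == '\0')
        return 0;                       /* treat an empty pattern as no match */
    for (const char *p = strstr(article, pattern); p != NULL;
         p = strstr(p + 1, pattern))
        count++;
    return count;
}

/* SPMD kernel: worker `id` of `nworkers` handles one contiguous slice.
 * Every worker runs this same code on different data. */
static size_t count_slice(const char *const *articles, size_t n_articles,
                          const char *pattern, size_t id, size_t nworkers)
{
    size_t begin = (n_articles * id) / nworkers;
    size_t end = (n_articles * (id + 1)) / nworkers;
    size_t partial = 0;
    for (size_t i = begin; i < end; i++)
        partial += count_in_article(articles[i], pattern);
    return partial;
}

/* Run every worker's slice (sequentially in this sketch) and then
 * reduce the partial counts into one total. */
size_t count_total(const char *const *articles, size_t n_articles,
                   const char *pattern, size_t nworkers)
{
    size_t total = 0;
    for (size_t id = 0; id < nworkers; id++)
        total += count_slice(articles, n_articles, pattern, id, nworkers);
    return total;
}
```

The total does not depend on `nworkers`, which is the property that makes the per-article work a candidate for data-level parallelism. At article granularity, 400 articles would allow up to 400 independent slices under this decomposition.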
## W-bit ALU as SIMD

- Familiar idea
- A W-bit ALU (W = 8, 16, 32, 64, ...) is SIMD
- Each bit of the ALU works on separate bits
  - Performing the same operation on it
- Trivial to see for bitwise AND, OR, XOR
- Also true for ADD (each bit performing a Full Adder)
- Share one instruction across all ALU bits

## Preclass 6

- What do we get when we add 65280 to 257?
  - 32b unsigned add?
  - 16b unsigned add?

## ALU vs. SIMD?

- What's different between
  - 128b wide ALU
  - SIMD datapath supporting eight 16b ALU operations

## Segmented Datapath

- Relatively easy (few additional gates) to convert a wide datapath into one supporting a set of smaller operations
  - Just need to squash the carry at points
- But need to keep instructions (description) small
  - So typically have limited, homogeneous widths supported

## Preclass 3: Opportunity

- Don't need 64b variables for lots of things
- Natural data sizes?
  - Audio samples?
  - Input from A/D?
  - Video pixels?
  - X, Y coordinates for a 4K x 4K image?
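
The carry squashing described on the Segmented Datapath slide, and the Preclass 6 question, can be imitated in software. The sketch below is a minimal C illustration, assuming a 32b word split into two 16b lanes; the function names `add_1x32` and `add_2x16` are hypothetical, and the masking trick is a software stand-in for what a hardware datapath would do with a few gates on the carry chain.

```c
#include <stdint.h>

/* Ordinary 32b add: the carry out of bit 15 ripples into bit 16. */
uint32_t add_1x32(uint32_t a, uint32_t b)
{
    return a + b;
}

/* Two independent 16b adds packed into one 32b word.
 * The top bit of each lane is cleared before the add, so no carry can
 * leave a lane; the top bits are then recombined with XOR, which drops
 * the carry out of each lane. */
uint32_t add_2x16(uint32_t a, uint32_t b)
{
    const uint32_t msb = 0x80008000u;        /* bit 15 of each 16b lane */
    uint32_t low = (a & ~msb) + (b & ~msb);  /* add the low 15 bits per lane */
    return low ^ ((a ^ b) & msb);            /* fold in top bits, no carry out */
}
```

With 65280 (0xFF00) and 257 (0x0101), `add_1x32` gives 65537 (0x00010001), while `add_2x16` gives 1: the low 16b lane wraps around exactly as a 16b unsigned add would, and the upper lane stays untouched.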
## Vector Computation

- Easy to map to SIMD flow if we can express the computation as operations on vectors
  - Vector Add
  - Vector Multiply
  - Dot Product

## Terminology: Scalar

- Simple: non-vector
- When we have a vector unit controlled by a normal (non-vector) processor core, we often need to distinguish:
  - Vector operations that are performed on the vector unit
  - Normal = non-vector = scalar operations performed on the base processor core

## Vector Register File

- Need to be able to feed the SIMD compute units
  - Not be bottlenecked on data movement to the SIMD ALU
- Wide RF to supply
- With wide path to memory

## Point-wise Vector Operations

- Easy: just like wide-word operations (now with segmentation)

## Point-wise Vector Operations

- ...but alignment matters.
- If not aligned, need to perform data movement operations to get aligned

## Ideal

- `for (i = 0; i < 64; i++) c[i] = a[i] + b[i];`
- No data dependencies
- Access every element
- Number of operations is a multiple of the number of vector lanes

## Skipping Elements?

- How does this work with the datapath?
- Assume a[0], a[1], ..., a[63] and b[0], b[1], ..., b[63] are loaded into the vector register file
- `for (i = 0; i < 64; i = i + 2) c[i/2] = a[i] + b[i];`

## Stride

- Stride: the distance between vector elements used
- `for (i = 0; i < 64; i = i + 2) c[i/2] = a[i] + b[i];`
- Accessing data with stride = 2

## Load/Store

- Strided load/stores
  - Some architectures provide strided memory access that compacts the data when it is read into the register file
- Scatter/gather
  - Some architectures provide memory operations to grab data from different places to construct a dense vector

## Neon ARM Vector Accelerator on Zynq

## Neon Vector

- 128b wide register file, 16 registers
- Supports
  - 2x64b
  - 4x32b (also single-precision float)
  - 8x16b
  - 16x8b

## Sample Instructions

- VADD: basic vector
- VCEQ: compare equal
  - Sets to all 0s or 1s, useful for masking
- VMIN: avoid using if's
- VMLA: accumulating multiply
- VPADAL: maybe useful for reduce
  - Vector pair-wise add
- VEXT: for "shifting" vector alignment
- VLDn: deinterleaving load

## Neon Notes

- Didn't see
  - Vector-wide reduce operation
- Do need to think about operations being pipelined within lanes

## ARM Cortex A53 (similar to A-7 Pipeline)

- 2-issue
- In-order
- 8-stage pipe

[Figure: pipeline diagram with Decode and Issue stages and a Floating-Point / NEON unit]

## Big Ideas

- Data parallelism is an easy basis for decomposition
- Data-parallel architectures can be compact: pack more computations onto a chip
  - SIMD, Pipelined
  - Benefit by sharing (instructions)
  - Performance can be brittle
    - Drop from peak as mismatch

## Admin

- SDSoC available on Linux machines
  - See piazza
- Reading for Day 7 online
- HW3 due Friday
- HW4 out
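
Returning to the stride-2 loop from the Stride and Load/Store slides: the sketch below shows, in plain C, what a compacting strided load does before a point-wise vector add. It is an illustration under stated assumptions, not NEON code. The names `strided_load`, `vector_add`, and `strided_add` are hypothetical, and the scalar loops stand in for operations that a vector unit would perform across lanes.

```c
#include <stddef.h>

enum { N = 64, STRIDE = 2, N_DENSE = N / STRIDE };

/* Compacting strided load: pick every `stride`-th element of src and
 * pack the picked elements densely into dst. */
void strided_load(int *dst, const int *src, size_t n_out, size_t stride)
{
    for (size_t i = 0; i < n_out; i++)
        dst[i] = src[i * stride];
}

/* Point-wise add on dense vectors: no dependencies between elements,
 * so each iteration could map onto its own vector lane. */
void vector_add(int *c, const int *a, const int *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Same result as: for (i = 0; i < 64; i = i + 2) c[i/2] = a[i] + b[i]; */
void strided_add(int c[N_DENSE], const int a[N], const int b[N])
{
    int a_dense[N_DENSE];
    int b_dense[N_DENSE];
    strided_load(a_dense, a, N_DENSE, STRIDE);
    strided_load(b_dense, b, N_DENSE, STRIDE);
    vector_add(c, a_dense, b_dense, N_DENSE);
}
```

Separating the compaction step from the add mirrors the point of the slides: once the data is dense and aligned, the add is the "ideal" point-wise case, and the cost of the stride shows up as extra data movement.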