

#### Who is this guy?



See http://www.seas.upenn.edu/~pcozzi/

# How did this happen?



#### **Graphics Workloads**

Triangles/vertices and pixels/fragments







http://proteneer.com/blog/?p=263

Right image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch14.html





Note: SGI was already building interactive rendering supercomputers, but this was the beginning of interactive 3D graphics on the PC.

Slide from http://s09.idav.ucdavis.edu/talks/01-BPS-SIGGRAPH09-mhouston.pdf

#### Why GPUs?

- Graphics workloads are embarrassingly parallel
  - Data-parallel
  - Pipeline-parallel
- CPU and GPU execute in parallel
- Hardware: texture filtering, rasterization, etc.

#### **Data Parallel**

#### Beyond Graphics

- Cloth simulation
- Particle system
- Matrix multiply



# NVIDIA GeForce 6 (2004)



#### **NVIDIA G80 Architecture**



Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

#### Why Unify Shader Processors?



Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf



#### **GPU Architecture Big Ideas**

- GPUs are specialized for
  - Compute-intensive, highly parallel computation
  - Graphics is just the beginning
- Transistors are devoted to:
  - Processing
  - Not:
    - Data caching
    - Flow control

Slide from http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

#### "CPU-style" cores

- Fetch/Decode
- ALU (Execute)
- Execution Context
- Data cache (a big one)
- Out-of-order control logic
- Fancy branch predictor
- Memory pre-fetcher

#### Slimming down

- Fetch/Decode
- ALU (Execute)
- Execution Context

Idea #1: Remove components that help a single instruction stream run fast

Beyond Programmable Shading Course, ACM SIGGRAPH 2010, 07/29/10

Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf




#### Add ALUs



Idea #2: Amortize cost/complexity of managing an instruction stream across many ALUs

#### **SIMD processing**


Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf

**128 fragments in parallel**

16 cores = 128 ALUs, 16 simultaneous instruction streams

Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf




**Eighteen small contexts (maximal latency hiding)**

Figure: one core's Fetch/Decode unit and ALUs, with its context storage split into eighteen small contexts.

Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf



**Twelve medium contexts** 





Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf


**My chip!**

- 16 cores
- 8 mul-add ALUs per core (128 total)
- 16 simultaneous instruction streams
- 64 concurrent (but interleaved) instruction streams
- 512 concurrent fragments
- = 256 GFLOPs (@ 1GHz)
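
The chip's numbers can be rederived with a few multiplications. This is a back-of-the-envelope sketch in Python, not anything from the original slides; it assumes a mul-add counts as two floating-point operations and that the 64 interleaved streams come from 4 contexts per core (both constants are assumptions chosen to match the slide).

```python
# Back-of-the-envelope model of the example chip from the slide.
CORES = 16
ALUS_PER_CORE = 8
CONTEXTS_PER_CORE = 4    # assumption: 64 interleaved streams / 16 cores
FLOPS_PER_MUL_ADD = 2    # assumption: one mul-add = one multiply + one add
CLOCK_GHZ = 1.0

total_alus = CORES * ALUS_PER_CORE                          # ALUs on the whole chip
simultaneous_streams = CORES                                # one instruction stream per core
interleaved_streams = CORES * CONTEXTS_PER_CORE             # streams resident at once
concurrent_fragments = interleaved_streams * ALUS_PER_CORE  # one fragment per ALU lane
peak_gflops = total_alus * FLOPS_PER_MUL_ADD * CLOCK_GHZ

print(total_alus, simultaneous_streams, interleaved_streams,
      concurrent_fragments, peak_gflops)
```

The peak figure only holds if every ALU issues a mul-add on every cycle.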

**NVIDIA G80** 


Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf



#### NVIDIA G80



Streaming Processor (SP)



**NVIDIA G80** 

Streaming Multi-Processor (SM)

#### **NVIDIA G80**

- 16 SMs
- Each with 8 SPs
  - 128 total SPs
- Each SM hosts up to 768 threads
- Up to 12,288 threads in flight

Figure: GeForce 8: Modern GPU Architecture (Host, Input Assembler, Setup & Rasterize, Vertex / Geom / Pixel Thread Issue, Thread Processor, Framebuffer partitions)

Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

#### **NVIDIA GT200**

- 30 SMs
- Each with 8 SPs
  - 240 total SPs
- Each SM hosts up to 1024 threads
- In flight, up to 30,720 threads
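
The totals for G80 and GT200 are just products of the per-SM figures. A minimal Python check; `gpu_totals` is a hypothetical helper name introduced for this sketch.

```python
def gpu_totals(num_sms, sps_per_sm, max_threads_per_sm):
    """Return (total SPs, max threads in flight) for a chip."""
    return num_sms * sps_per_sm, num_sms * max_threads_per_sm

g80 = gpu_totals(num_sms=16, sps_per_sm=8, max_threads_per_sm=768)
gt200 = gpu_totals(num_sms=30, sps_per_sm=8, max_threads_per_sm=1024)

print("G80:", g80)      # (128, 12288)
print("GT200:", gt200)  # (240, 30720)
```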



Slide from David Luebke: http://s08.idav.ucdavis.edu/luebke-nvidia-gpu-architecture.pdf

# **GPU** Computing History

- 2001/2002 – researchers see GPU as data-parallel coprocessor
  - The GPGPU field is born
- 2007 – NVIDIA releases CUDA
  - CUDA – Compute Unified Device Architecture
  - GPGPU shifts to GPU Computing
- 2008 – Khronos releases OpenCL specification

# Let's program this thing!

#### **CUDA** Abstractions

- A hierarchy of thread groups
- Shared memories
- Barrier synchronization

#### **CUDA Kernels**

 Executed N times in parallel by N different CUDA threads
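
To make "executed N times by N threads" concrete, here is a sequential Python stand-in, not CUDA code. The kernel body handles exactly one element, and a hypothetical `launch` helper runs it once per thread id; on a GPU those N runs would proceed in parallel. The names `vec_add_kernel` and `launch` are assumptions for illustration.

```python
def vec_add_kernel(thread_id, a, b, c):
    """Body that one 'thread' runs: it handles exactly one element."""
    c[thread_id] = a[thread_id] + b[thread_id]

def launch(kernel, n, *args):
    """Sequential stand-in for a launch: run the kernel body once per thread id."""
    for thread_id in range(n):
        kernel(thread_id, *args)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [0.0] * len(a)

launch(vec_add_kernel, len(a), a, b, c)  # N = 4 "threads"
print(c)  # [11.0, 22.0, 33.0, 44.0]
```

There is no loop over elements inside the kernel itself: the thread id replaces the loop index.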







**Thread Hierarchies** 

- Grid – one or more thread blocks
  - 1D or 2D
- Block – array of threads
  - 1D, 2D, or 3D
  - Each block in a grid has the same number of threads
  - Each thread in a block can
    - Synchronize
    - Access shared memory
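
A common use of the hierarchy is to turn a (block, thread) pair into one global element index. The sketch below models the 1D case in plain Python. The formula mirrors the usual `blockIdx.x * blockDim.x + threadIdx.x` idiom, while the helper names and the sizes (10 elements, 4 threads per block) are assumptions for illustration.

```python
def global_index(block_idx, block_dim, thread_idx):
    """1D global index of a thread, given its block and its index in the block."""
    return block_idx * block_dim + thread_idx

def blocks_needed(n, block_dim):
    """Smallest number of equal-sized blocks that covers n elements."""
    return (n + block_dim - 1) // block_dim

n = 10         # elements to process (assumed)
block_dim = 4  # threads per block (assumed)
grid_dim = blocks_needed(n, block_dim)

covered = []
for block_idx in range(grid_dim):
    for thread_idx in range(block_dim):
        i = global_index(block_idx, block_dim, thread_idx)
        if i < n:  # the last block may have threads with no element
            covered.append(i)

print(grid_dim, covered)
```

Because every block has the same number of threads, the last block usually needs the bounds check shown here.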

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

#### **Thread Hierarchies**

- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution
    - For hazard-free shared memory accesses
  - Efficiently sharing data through a low latency shared memory
- Two threads from two different blocks cannot cooperate



Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

#### **Thread Hierarchies**

- Thread Block
  - Group of threads
    - G80 and GT200: Up to 512 threads
    - Fermi: Up to 1024 threads
  - Reside on same processor core
  - Share memory of that core



Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

#### **Thread Hierarchies**



A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.
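
A rough model of this scaling, in Python: assume every block takes one time step and each core runs one block at a time (both simplifying assumptions), so the run time is the number of "waves" of blocks. The block and core counts are illustrative.

```python
import math

def time_steps(num_blocks, num_cores):
    """Waves needed when each core runs one equal-length block at a time."""
    return math.ceil(num_blocks / num_cores)

num_blocks = 8  # assumed program size
print(time_steps(num_blocks, num_cores=2))  # 4
print(time_steps(num_blocks, num_cores=4))  # 2
```

The same program finishes in half the steps on twice the cores without any change to the code, which is only possible because blocks do not depend on each other.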

Figure 1-4. Automatic Scalability

#### **Thread Hierarchies**

- Threads in a block
  - Share (limited) low-latency memory
  - Synchronize execution
    - To coordinate memory accesses
- \_\_syncthreads()
  - Barrier – threads in block wait until all threads reach this
  - Lightweight



Image from: http://developer.download.nvidia.com/compute/cuda/3\_2\_prod/toolkit/docs/CUDA\_C\_Programming\_Guide.pdf

#### Scheduling Threads

- Warp – threads from a block
  - G80 / GT200 – 32 threads
  - Run on the same SM
  - Unit of thread scheduling
  - Consecutive threadIdx values
  - An implementation detail – in theory
    - warpSize
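
Since a warp holds consecutive threadIdx values, the warp a thread belongs to follows from integer division. A small Python sketch for a 1D block; the helper names, including the term "lane" for a thread's position inside its warp, are assumptions not used in the slides.

```python
WARP_SIZE = 32  # G80 / GT200 value from the slide

def warp_id(thread_idx):
    """Which warp of its block a thread falls into."""
    return thread_idx // WARP_SIZE

def lane_id(thread_idx):
    """Position of the thread inside its warp."""
    return thread_idx % WARP_SIZE

def warps_per_block(block_dim):
    """Warps needed for a block; the last warp may be partly empty."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE

print(warp_id(31), warp_id(32), lane_id(33))
print(warps_per_block(256), warps_per_block(100))
```

A block of 100 threads still occupies 4 warps, so block sizes that are multiples of 32 waste no slots.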

#### **Scheduling Threads**

 Warps for three blocks scheduled on the same SM.



Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf

# Scheduling Threads

#### Remember this:



Image from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian\_gpuArchTeraflop\_BPS\_SIGGRAPH2010.pdf

#### Scheduling Threads

- SM implements zero-overhead warp scheduling
  - At any time, only one of the warps is executed by SM
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible Warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected



© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE498AL, University of Illinois, Urbana-Champaign

# Scheduling Threads

What happens if branches in a warp diverge?



# Scheduling Threads

- 32 threads per warp but 8 SPs per SM. What gives?
- When an SM schedules a warp:
  - Its instruction is ready
  - 8 threads enter the SPs on the 1<sup>st</sup> cycle
  - 8 more on each of the 2<sup>nd</sup>, 3<sup>rd</sup>, and 4<sup>th</sup> cycles
  - Therefore, 4 cycles are required to dispatch a warp
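
The 4-cycle figure is 32 threads divided by 8 SPs. The Python sketch below lists which threads would enter the SPs on each cycle; issuing them in index order is an assumption made for illustration, and `dispatch_schedule` is a hypothetical helper.

```python
def dispatch_schedule(warp_size=32, sps_per_sm=8):
    """Return, per cycle, the thread indices of the warp that enter the SPs."""
    return [list(range(start, min(start + sps_per_sm, warp_size)))
            for start in range(0, warp_size, sps_per_sm)]

schedule = dispatch_schedule()
for cycle, threads in enumerate(schedule, start=1):
    print(f"cycle {cycle}: threads {threads[0]}-{threads[-1]}")

print("cycles to dispatch a warp:", len(schedule))
```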

# Scheduling Threads

#### Question

- A kernel has
  - 1 global memory read (200 cycles)
  - 4 non-dependent multiplies/adds
- How many warps are required to hide the memory latency?

# **Scheduling Threads**

- Solution
  - Each warp has 4 multiplies/adds
    - 4 instructions × 4 cycles to dispatch each = 16 cycles
  - We need to cover 200 cycles
    - 200 / 16 = 12.5
    - ceil(12.5) = 13
  - 13 warps are required
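
The same calculation as a small Python function, so other latencies or instruction counts can be tried. It is a sketch of the slide's simple model only: each instruction occupies the SM for 4 cycles per warp, and nothing else stalls. The function name is hypothetical.

```python
import math

def warps_to_hide_latency(latency_cycles, instructions_per_warp,
                          cycles_per_instruction=4):
    """Warps needed so that ALU work from other warps covers one memory read."""
    busy_cycles_per_warp = instructions_per_warp * cycles_per_instruction
    return math.ceil(latency_cycles / busy_cycles_per_warp)

print(warps_to_hide_latency(latency_cycles=200, instructions_per_warp=4))  # 13
```

Doubling the independent arithmetic per warp to 8 instructions would drop the requirement to 7 warps under the same model.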

# Memory Model

#### Recall:

- Device code can:
  - R/W per-thread registers
  - R/W per-thread local memory
  - R/W per-block shared memory
  - R/W per-grid global memory
  - Read only per-grid constant memory
- Host code can
  - R/W per-grid global and constant memories

Figure: (Device) Grid containing Block (0, 0) and Block (1, 0). Each block has its own Shared Memory, and each of its threads, Thread (0, 0) and Thread (1, 0), has its own Registers. Global Memory and Constant Memory are shared by the whole grid and connect to the host.

#### Thread Synchronization

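- Threads in a block can synchronize
  - Call \_\_syncthreads to create a barrier
  - A thread waits at this call until all threads in the block reach it, then all threads continue

```
Mds[i] = Md[j];            // each thread stages one element in shared memory
__syncthreads();           // wait until every thread in the block has written
func(Mds[i], Mds[i + 1]);  // now safe to read a neighbor's element
```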
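
The barrier pattern can be imitated on the CPU with Python's `threading.Barrier`. This is a model of the idea, not CUDA: each Python thread plays one CUDA thread, `mds` plays the shared array, and the neighbor index wraps around with `% n` to stay in bounds (an assumption added for this sketch, as is using the same index for `md` and `mds`).

```python
import threading

def run_block(md, func):
    """Run one 'block' of len(md) threads: write, barrier, then read a neighbor."""
    n = len(md)
    mds = [None] * n                # stands in for per-block shared memory
    out = [None] * n
    barrier = threading.Barrier(n)  # all n threads must arrive before any continues

    def thread_body(i):
        mds[i] = md[i]              # phase 1: each thread writes its own slot
        barrier.wait()              # plays the role of __syncthreads()
        out[i] = func(mds[i], mds[(i + 1) % n])  # phase 2: read a neighbor's slot

    threads = [threading.Thread(target=thread_body, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

result = run_block([1, 2, 3, 4], lambda x, y: x + y)
print(result)  # [3, 5, 7, 5]
```

Without the barrier, a thread could read its neighbor's slot before the neighbor has written it.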



#### Thread Synchronization

Figure (animation, Time: 0 to Time: 3): Thread 0 and Thread 1 each run Mds[i] = Md[j]; \_\_syncthreads(); func(Mds[i], Mds[i+1]);. Threads 0 and 1 are blocked at the barrier until all threads in the block reach it; at Time: 3 they can continue.

#### **Thread Synchronization**

- Why is it important that execution time be similar among threads?
- Why does it only synchronize within a block?

#### **Thread Synchronization**



Figure 3.5 Lack of synchronization across blocks enables transparent scalability of CUDA programs

Image from http://courses.engr.illinois.edu/ece498/al/textbook/Chapter3-CudaThreadingModel.pdf

#### **Thread Synchronization**

- Can \_\_syncthreads() cause a thread to hang?

#### Thread Synchronization

```
if (someFunc())
{
    __syncthreads();
}
// ...
```

# Thread Synchronization

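```
if (someFunc())
{
    __syncthreads();  // barrier reached only by threads that take the if branch
}
else
{
    __syncthreads();  // a different barrier, reached only by the other threads
}
```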
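
A CPU-side model of why this can hang, again using Python's `threading.Barrier` rather than CUDA. It treats the two call sites as two separate barriers that each expect every thread in the block, which is a modeling assumption, and it uses a timeout so that the "hang" shows up as an error instead of blocking forever. The names `run_block` and `some_func` are hypothetical.

```python
import threading

def run_block(num_threads, some_func, timeout=0.2):
    """Return the ids of threads that never get past their barrier."""
    # One Barrier object per call site; each expects every thread in the block.
    barrier_in_if = threading.Barrier(num_threads)
    barrier_in_else = threading.Barrier(num_threads)
    hung = []
    lock = threading.Lock()

    def thread_body(i):
        barrier = barrier_in_if if some_func(i) else barrier_in_else
        try:
            barrier.wait(timeout=timeout)  # would wait forever without the timeout
        except threading.BrokenBarrierError:
            with lock:
                hung.append(i)

    threads = [threading.Thread(target=thread_body, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(hung)

divergent = run_block(4, lambda i: i % 2 == 0)  # threads split across branches
uniform = run_block(4, lambda i: True)          # every thread takes the same branch
print(divergent)  # [0, 1, 2, 3]
print(uniform)    # []
```

In this model the barrier is safe only when the condition evaluates the same way for every thread in the block.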