CSC367 Lecture Notes: Part 6
GPU Programming, CUDA
See all lecture notes here.
PCI Express (PCIe)
GPU computing is one of the most common forms of accelerated computing. The accelerator is an additional piece of hardware connected to the rest of the system via the PCIe bus.
One PCIe lane consists of 4 wires (two differential pairs, one per direction), so each lane can transmit one bit per cycle in each direction simultaneously. The following table shows the maximum bandwidth of the PCIe bus.
CPU-GPU Architecture
The memory model for the CPU-GPU system is as follows.
The entire cache system and memory hierarchy on the CPU is optimized for latency. From a design perspective, the general-purpose GPU (GP-GPU) is designed purely for throughput and often has high latency: we do not care how fast each single item is computed, only how fast we can compute a very large number of elements.
Inside a streaming multiprocessor (SM), we see the following.
GPU threads do light-weight jobs and have a huge register file, making context switching essentially free (which hides memory latency by switching away from threads that are waiting for memory).
GPUs are mostly optimized for single precision. This is a design choice: single-precision units are physically smaller, so they can be clocked faster. Also, single-precision computation is more common, which is the market the vendors are targeting.
Programming Model
We will use CUDA to program Nvidia GPUs. The CUDA Programming Guide is a good reference for CUDA programming.
Note that CUDA is not the only GPU programming model. A major disadvantage of CUDA is that it can only run on Nvidia GPUs. There are attempts to make a GPU programming model that can run on all GPUs; a promising new model is SYCL. However, we will focus on CUDA since it is the most mature GPU programming model. Most of the big ideas are the same for all GPU programming models; they may just have different names.
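To make the programming model concrete, here is a minimal sketch of a complete CUDA program, a vector addition. The kernel and variable names are illustrative, not from the notes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of the output array.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                // guard: the last block may be partially full
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
          *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy the inputs over the PCIe bus.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough blocks of 256 threads to cover all n elements.
    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back; this also waits for the kernel to finish.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

Note the throughput-oriented structure: we launch one light-weight thread per element rather than looping over elements on a few heavy threads.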
One of the most important characteristics of the GPU programming model is the compute capability, which is the instruction set available to the GPU.
For example, half precision, which is commonly used for deep learning, is only available for compute capability 5.3 and above.
The warp size is the width of the warp scheduler (on CPUs the analogous unit is called the instruction scheduler). When a GPU issues an instruction, it is executed by 32 threads in lockstep.
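A sketch of what the 32-thread granularity means in code (the kernel and variable names are illustrative): every thread in a warp executes the same instruction, so a branch that splits a warp forces the hardware to run both sides one after the other (warp divergence), while a branch that is uniform across each warp costs nothing extra:

```cuda
__global__ void branching(float *data) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % warpSize;  // position within the warp (0..31)
    int warp = threadIdx.x / warpSize;  // which warp of the block

    // Divergent: threads of the SAME warp take different paths,
    // so the two paths execute serially.
    if (lane < 16) data[tid] *= 2.0f;
    else           data[tid] += 1.0f;

    // Not divergent: the condition is uniform within each warp,
    // so every warp takes exactly one path.
    if (warp % 2 == 0) data[tid] -= 3.0f;
}
```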
Note that the L1 cache (at the bottom) is shared by all threads. It can also be reconfigured to be private to a block, in which case it is called shared memory. Note that shared memory is not coherent.
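Because shared memory is not kept coherent automatically, threads must synchronize explicitly before reading what other threads wrote. A minimal sketch (the kernel and buffer names are illustrative) using shared memory as a per-block scratchpad:

```cuda
#define BLOCK 256

// Reverse the elements within each block, staging them through
// shared memory that is private to the block.
__global__ void blockReverse(float *data) {
    __shared__ float tile[BLOCK];       // per-block shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];

    __syncthreads();  // required: shared memory is not coherent, so we must
                      // wait until every thread's write is visible before
                      // reading another thread's slot

    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}
```

Omitting the `__syncthreads()` here is a classic bug: some threads would read `tile` slots that have not been written yet.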
The register file is very large, sufficiently large to hold the registers of every thread in an entire block.
Types of GPU Memory
Atomic Operations
Kernels and Memory Allocations
Parallel Reduction
Part 7 and Onwards
See Part 7.