CSC367 Lecture Notes: Part 2
Single Processor Optimization and Memory Hierarchy
See all lecture notes here.
Memory Hierarchy
Most applications run at <10% of peak performance, even on a single processor. Most of the lost time is spent waiting for data to arrive from memory. To write efficient programs, we need to understand how the hardware works.
Note that all arithmetic operations are performed in registers. The different types of variables (int, float, double, pointers, etc.) are really just words in the processor. In an idealized model, we assume each operation (like add, multiply, etc.) has roughly the same cost.
Compilers can perform many optimizations, such as loop unrolling, loop fusion, and reordering instructions to improve register reuse. Why do we still need to worry about the memory hierarchy? Because compilers can't do everything. A compiler often gives up on optimizing complex code, since correctness comes first, and it applies optimizations based on general rules, so sometimes even very simple code is left unoptimized. You, on the other hand, know exactly what your code does, so you can always try to write it efficiently.
The memory hierarchy for a (modern) single processor contains a multi-level cache (MLC) system, main memory, and potentially multiple levels of storage. The memory hierarchy is based on a small amount of fast memory that is very close to the processor and a large amount of slow memory that is further from the processor (in terms of physical distance).
Latency and Bandwidth
The two most important metrics of memory performance are latency and bandwidth.
Latency is the amount of time it takes to go from one point to another. For example, if the read latency of main memory is 100ns, then it takes 100ns to fetch one word from memory.
Bandwidth is the amount of data that can be transferred per unit time at steady state. For example, if the read bandwidth of main memory is 1GB/s, then after a short ramp-up we can fetch 1GB of data from memory every second. As a rough order-of-magnitude guide to latency across the hierarchy: registers take about a nanosecond to access, caches a few nanoseconds, main memory on the order of 100ns, and disk milliseconds or more.
A "real world" analogy of latency and bandwidth is the example of driving.
- A narrow neighborhood street has low bandwidth because you can only drive one car at a time, and high latency since it has a low speed limit.
- A single-lane highway still has low bandwidth because you can only drive one car at a time, but low latency since you can drive very fast.
- A major city road has high bandwidth because it's wide, and high latency because speed limit is still low.
- A multi-lane freeway has high bandwidth and low latency since many cars can drive at high speed at the same time.
If you are a city planner who wants to maximize the throughput of cars, you want to build a multi-lane freeway. Note also that decreasing latency increases throughput as well. Latency is therefore very important for memory performance, but it is also the major challenge for memory systems, since the speed of electrical signals is finite.
Caching and Locality
One way to hide the high latency of main memory is caching. In an MLC system, data is first loaded into the cache, and the processor then accesses it from there. The cache has much lower latency than main memory, so we want to maximize the number of memory accesses served from cache, known as cache hits. If the data required by the processor is not in cache, we have a cache miss, and the data must be fetched from main memory.
Temporal locality is the reuse of data that is already in the cache due to a recent memory access.
Spatial locality means taking advantage of data stored close together in memory, which is likely brought into the cache together. This is because the cache is organized into cache lines, typically 64 bytes, the smallest unit of data fetched from main memory into the cache. Some modern processors also support cache prefetching, a technique that predicts which data will be needed in the future and fetches it into the cache before it is actually needed.
Example: Matrix Multiplication
We will revisit the matrix multiplication example in a new light. For this simple example, assume we have one level of cache, 256KB in size. Each cache line is 64 bytes, and we have \(4096\times4096\) matrices of doubles (so each cache line holds 8 elements). Recall the code in
Part 1, with \(i,j,k\) loop ordering. How many accesses to main memory will we have? Consider one full execution of the inner-most loop (over \(k\)): \(C(i,j)\) is accessed repeatedly, row \(i\) of \(A\) is traversed with unit stride, and column \(j\) of \(B\) is traversed with stride 4096.
With an ideal cache eviction policy, this gives 1 cache miss (for C), 512 (for A, since each line holds 8 of its 4096 elements), and 4K (for B, one miss per element). The accesses to B are very far apart in memory, so it has very poor spatial locality. Now we understand why switching the order of the loops caused such a drastic speedup. In the \(i,k,j\) ordering, the inner loop instead walks along row \(i\) of \(C\) and row \(k\) of \(B\) with unit stride.
This gives only 512 cache misses (for C), 1 (for A), and 512 (for B), about 4.5x fewer, so we have much better spatial locality. We also get better temporal locality, since the single element of A is reused across the entire inner loop.
False Sharing in Shared Memory
Consider two processors that share a cache line from memory. What happens when they write at the same time to different positions within that line? Each processor's private cache ends up holding a different version of the line, and the two copies disagree even though the processors touched logically unrelated data. This is called
false sharing. The cache coherence protocol resolves the disagreement by invalidating stale copies, which typically costs a cache miss for at least one of the processors. This will be discussed in more detail when we cover
shared memory parallelism.
Pipelining
Pipelining is a technique for implementing instruction-level parallelism on a single processor. The idea is to keep as many different parts of the processor busy as possible at all times. This is done by dividing every instruction into multiple stages, each of which utilizes a different part of the processor. For example, instruction fetch (IF), decode (ID), execute (EX), memory access (MEM), and write back (WB) for different instructions can proceed at the same time using the following pattern:
| Instr. No. | Cycle 1 | Cycle 2 | Cycle 3 | Cycle 4 | Cycle 5 | Cycle 6 | Cycle 7 |
|------------|---------|---------|---------|---------|---------|---------|---------|
| 1          | IF      | ID      | EX      | MEM     | WB      |         |         |
| 2          |         | IF      | ID      | EX      | MEM     | WB      |         |
| 3          |         |         | IF      | ID      | EX      | MEM     | WB      |
| 4          |         |         |         | IF      | ID      | EX      | MEM     |
| 5          |         |         |         |         | IF      | ID      | EX      |
This increases the bandwidth of the processor, since we can do more work per clock cycle. However, this may also lead to
stalls/bubbles in the pipeline. For example, if the memory access stage (MEM) takes longer than the other stages, then the processor will have to wait for the memory access to finish before it can continue.
Hyperthreading is a technique to hide some of the performance penalty of pipeline stalls.
Laundry example
A "real-world" analogy for pipelining is doing 4 loads of laundry. Suppose the washer takes 30 minutes, the dryer takes 40 minutes, and folding takes 20 minutes. Completing all the laundry does not require 6 hours (4 loads at 90 minutes each), since the machines can run simultaneously: load 2 can be washing while load 1 is in the dryer, and so on.

The pipelined procedure takes only 3.5 hours. Note that the steady-state bandwidth for laundry is 1.5 loads per hour (the dryer, the slowest stage, finishes a load every 40 minutes), which is more than twice that of the naive procedure. However, the latency is still 90 minutes per load, which brings us to an important point: pipelining does not reduce latency. It only increases bandwidth.
Vectorization and SIMD
Modern CPUs have a set of large registers (128, 256, or 512 bits) used for vectorized operations, along with special instructions for performing vectorized arithmetic (add, multiply, etc.) on them. These are called SIMD (single instruction, multiple data) instructions.

There are some challenges to using SIMD instructions: data may need to be aligned in memory, and extra instructions are required to move data from one part of a register to another. Math libraries like MATLAB and NumPy take advantage of these special instructions to speed up vector operations.
Performance Modeling
We will explore a simple performance model of single-processor machines, which will help us gain insight into different architectures as well as various common algorithms used in scientific computing. We assume that we have only one level of cache (fast memory, with zero latency), and that each cache line fits only one element. For this model, we define the following terms:
- \(m\) = number of memory elements moved between fast and slow memory
- \(t_m\) = time per slow memory operation
- \(f\) = number of arithmetic operations
- \(t_f\) = time per arithmetic operation, where \(t_f\ll t_m\)
- \(q=f/m\) = the average number of flops per slow memory access
Note that \(f\) and \(m\) are determined solely by the algorithm, while \(t_m\) and \(t_f\) are determined solely by the hardware. \(q\) is called the
computational intensity, a measure of the algorithm's efficiency. The minimum time it takes to run an algorithm is \(t_{opt}=ft_f\), which occurs when all the data is in fast memory. The actual runtime is
$$t=\underbrace{ft_f}_{\text{compute}}+\underbrace{mt_m}_{\text{memory}}=t_{opt}\left(1+\frac{t_m}{t_f}\cdot\frac{1}{q}\right).$$
The factor \(t_m/t_f\) is called the
machine balance. For an algorithm to achieve at least half the peak speed, we require
$$q\ge t_m/t_f.$$
The machine balance for typical (older) CPUs ranges from 5 to 50; this is the minimum computational intensity an algorithm needs to achieve half of peak speed.
Consider the following matrix-vector multiplication algorithm, with the assumption that the vectors \(x\) and \(y,\) and one row of matrix \(A\) can fit in fast memory.
{read x(1:n) into fast memory} // n memory refs
{read y(1:n) into fast memory} // n memory refs
for i = 1 to n
{read A(i,1:n) into fast memory} // n memory refs (n times)
for j = 1 to n
y(i) = y(i) + A(i,j)*x(j) // 2 operations (n² times)
{write y(1:n) to slow memory} // n memory refs
The total number of memory references is \(m=3n+n^2\), the total number of arithmetic operations is \(f=2n^2\), and so the computational intensity is \(q=f/m\approx2\) for large \(n\). This shows that matrix-vector multiplication is a memory-bound operation.
For naive matrix multiplication, if we assume that all the data can fit in fast memory, then we have the following algorithm:
{read A,B,C into fast memory} // 3n² memory refs
for i = 1 to n
for j = 1 to n
for k = 1 to n
C(i,j) = C(i,j) + A(i,k)*B(k,j) // 2 operations (n³ times)
{write C to slow memory} // n² memory refs
The algorithm has \(m=4n^2\) memory references, \(f=2n^3\) arithmetic operations. Thus, \(q=f/m=\mathcal O(n)\), which is very good. For large \(n\), we can utilize all the performance of the CPU. However, this assumption is not very realistic since fast memory is typically very small.
If we make the more conservative assumption that only three matrix rows can fit in fast memory, the algorithm becomes:
for i = 1 to n
{read A(i,1:n) into fast memory} // n memory refs (n times)
for j = 1 to n
{read B(1:n,j) into fast memory} // n memory refs (n² times)
{read C(i,j) into fast memory} // 1 memory ref (n² times)
for k = 1 to n
C(i,j) = C(i,j) + A(i,k)*B(k,j) // 2 operations (n³ times)
{write C(i,j) to slow memory} // 1 memory ref (n² times)
Then we have \(m=n^2+n^3+n^2+n^2=3n^2+n^3\) memory references (for A, B, and reading and writing C, respectively) and \(f=2n^3\) arithmetic operations. Thus \(q=f/m\approx2\) for large \(n\), the same computational intensity as the matrix-vector multiply. This algorithm is not very memory efficient; there are better ways to do matrix multiplication.
Recall that block matrix multiplication behaves the same way as regular matrix multiplication, except that instead of scalar multiplications you perform matrix multiplications on the blocks. Suppose we have a block size of \(b\times b\); then there will be \(n_b\times n_b\) blocks in the matrix (where \(n_b=n/b\)), and we assume 3 blocks can fit in fast memory. In the algorithm below, the \((i,j)\)-th block of \(C\) means the \(b\times b\) submatrix whose top-left element is \(C(bi,bj)\). The algorithm itself is as follows:
for i = 1 to n_b
for j = 1 to n_b
{read the (i,j)-th block of C into fast memory}
// b² memory refs (n_b² times)
for k = 1 to n_b
{read the (i,k)-th block of A into fast memory}
// b² memory refs (n_b³ times)
{read the (k,j)-th block of B into fast memory}
// b² memory refs (n_b³ times)
for bi = 1 to b
for bj = 1 to b
for bk = 1 to b
C(i*b+bi,j*b+bj) += A(i*b+bi,k*b+bk)*B(k*b+bk,j*b+bj)
// 2 operations (n_b³*b³=n³ times)
{write the (i,j)-th block of C to slow memory}
// b² memory refs (n_b² times)
In total there are \(m=b^2(2n_b^2+2n_b^3)=2n^2(1+n_b)\) memory references and \(f=2n^3\) arithmetic operations, so the computational intensity is \(q=f/m\approx n/n_b=b\) for large \(n\). This outperforms the naive matrix multiply when \(b>2\). Why can't we make \(b\) as big as we want? Because fast memory is limited: all three blocks must fit at once, so if fast memory holds \(M\) elements, then \(3b^2\le M\) and we can achieve a computational intensity of at most \(q\le\sqrt{M/3}\).
Overview of Common Computational Methods
This paper from Berkeley in 2006, when parallel computing was still on the rise, gives a good overview of common algorithms and their use in various real-world applications. Red indicates algorithms that are very important for the given application, and blue means less important. As can be seen, dense matrix multiplication is very important, which is why we use it as a running example to illustrate many of these concepts.
Part 3 and Onwards
See Part 3.