CSC367 Lecture Notes: Part 1
Why, Moore's Law, and How to Write Fast Code

See all lecture notes here.

The Asterisk in Moore's Law

Everyone interested in computer science has heard of Moore's Law, an observation made by Gordon Moore (at Intel, in 1965) that the number of transistors on a chip doubles roughly every 18 months (or every 2 years, depending on who you ask). This trend can be seen in the graph below.

Moore's Law 1970

Unfortunately, the exponential growth in computational performance cannot simply come from increasing the clock speed (which controls the number of instructions that can be executed per second). The dynamic power is proportional to \(V^2fC\), where \(V\) is the voltage, \(f\) is the frequency, and \(C\) is the capacitance. We won't worry about what capacitance means here, but voltage is roughly proportional to frequency, so frequency has a cubic effect on dynamic power. If we were to follow Moore's law by just increasing frequency, we would have the following trend.
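Substituting \(V \propto f\) into the dynamic power formula makes the cubic dependence explicit:

```latex
P_{\text{dyn}} \propto V^2 f C \propto f^2 \cdot f \cdot C = f^3 C
```

So doubling the clock frequency would increase dynamic power roughly eightfold (\(2^3 = 8\)), which is why scaling frequency alone hit a wall.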

Moore's Law 1970

This is clearly not sustainable. Yet we kept realizing Moore's law well into the 21st century. Why? Instead of having one fast processor, we can fit many slower processors onto a single chip. Moreover, the world's fastest supercomputers take advantage of thousands of chips connected in a network.

On today's computers, a simple C program cannot come close to the full performance of even a laptop, let alone a supercomputer. This is why parallel programming is so important for any kind of scientific computing. But to write good parallel programs, we must first understand the architecture and how to write fast serial code.

Dense Matrix Multiplication

Let us start with a classic example of dense matrix multiplication. Our experimental results will be for \(4096\times 4096\) matrices on a (relatively old now) dual CPU Intel Haswell machine with a peak performance of 836 GFLOPS. We will use the naive algorithm, which requires \(2n^3=2^{37}\) floating point operations (since \(n=2^{12}\)).

Python

We first start with everyone's favorite language, Python. The matrices are represented as lists of lists and the code snippet for matrix multiplication is shown below.
# A, B, C are n-by-n lists of lists; C starts as all zeros
for i in range(n):
    for j in range(n):
        for k in range(n):
            C[i][j] += A[i][k] * B[k][j]
This code takes about 6 hours (21042s) to execute. Is this fast? Well, if we do the math, this program achieves about 6.5 MFLOPS, which is only about 0.0008% of the peak system performance. This is because Python is an interpreted language and is notorious for being slow.
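As a sanity check, the achieved rate is just the operation count divided by the runtime (using the runtime and peak figures quoted above):

```python
n = 4096
flops = 2 * n ** 3      # 2n^3 floating point operations, = 2^37
runtime_s = 21042       # measured Python runtime in seconds
peak = 836e9            # 836 GFLOPS peak for the Haswell machine

achieved = flops / runtime_s
print(f"{achieved / 1e6:.1f} MFLOPS")           # about 6.5 MFLOPS
print(f"{100 * achieved / peak:.5f}% of peak")  # about 0.00078%
```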

Java

What about Java? We can implement the same algorithm in Java, which compiles to bytecode that is then JIT-compiled to machine code at runtime.
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
The runtime is about 40 minutes (2387s), roughly 8.8 times faster than Python. But there are faster languages still.

C

C compiles directly into machine code and has a reputation for being a fast programming language.
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        for (int k = 0; k < n; k++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
This code takes only 19 minutes (1156s) to execute, over 18 times faster than Python. However, it still achieves only about 0.01% of the peak system performance.

Interchanging Loop Order

Now we are tinkering around and come up with a question: what if we just swap the order of the loops? Will anything change? Naively, it seems like nothing should, because the loop order does not affect the result. However, if we run the experiment again with this code
for (int i = 0; i < n; i++) {
    for (int k = 0; k < n; k++) {
        for (int j = 0; j < n; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
we see it takes under 3 minutes (178s) to execute. This is 6.5 times faster than the original code, just from switching the loop order. This may seem strange, but to understand it we need to understand some hardware, specifically the memory hierarchy, which we explore in detail in Part 2. We can also use compiler optimizations to further improve performance. Using -O2, we get a runtime of 55 seconds. However, this is still only 0.3% of the peak system performance. Why? Our system has a total of 18 cores, but this code is only using 1.

Parallelizing the Code

The simplest way to parallelize the code is to use OpenMP. By adding a single line, we can parallelize the computation across cores. It is not always this simple, and parallelizing algorithms will be explored in detail in Part 3.
#pragma omp parallel for
for (int i = 0; i < n; i++) {
    for (int k = 0; k < n; k++) {
        for (int j = 0; j < n; j++) {
            C[i][j] += A[i][k] * B[k][j];
        }
    }
}
This code now runs in 3 seconds, which is almost 18 times faster than the serial code. This is expected, since we do have 18 cores. However, we are still only at about 5% of the peak system performance. The reason again has to do with the hardware, but this time it is more complicated. Tiling, vectorization with AVX instructions, matrix transposition, data alignment, and preprocessing can all be used to further improve performance. With all of these advanced techniques, we can achieve performance similar to the Intel MKL library, a highly optimized mathematics library: both our method and MKL can perform the computation in about 0.4 seconds. However, this is still only about 40% of the peak system performance.
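As one example of these techniques, here is a minimal sketch of loop tiling (cache blocking). The block size and flat row-major layout are illustrative assumptions, not the exact optimized implementation measured above:

```c
#define BS 64  /* block size: an assumption, tuned so blocks fit in cache */

/* C += A * B for n x n row-major matrices, one cache block at a time,
   so each block of A, B, and C is reused while it is still resident. */
void matmul_tiled(int n, const double *A, const double *B, double *C) {
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS && i < n; i++)
                    for (int k = kk; k < kk + BS && k < n; k++) {
                        double a = A[i * n + k];  /* invariant in the j loop */
                        for (int j = jj; j < jj + BS && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The outer ii loop is also a natural place for `#pragma omp parallel for`, since different row blocks of C are independent.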

Conclusion and Notes

The table below summarizes the performance of the different implementations of this common task, matrix multiplication, on our test machine.

Implementation         Time (s)   GFLOPS   Absolute speedup   Relative speedup   % of peak
Python                 21042      0.006    1                  N/A                0.00
Java                   2387       0.058    9                  8.8                0.00
C                      1156       0.119    18                 2.07               0.01
+reorder               178        0.774    118                6.5                0.09
+O2                    55         2.516    385                3.25               0.3
+parallel              3          45.2     6921               17.97              5.4
+advanced techniques   0.39       352.4    53292              7.79               42

As can be seen, we took a program that ran for over 6 hours down to under half a second, a speedup of over 50000 times. This shows that good code is fast code: nobody wants to wait 50000 times longer than they have to.

The reason we still achieve only about 40% of the peak system performance lies in the name: peak performance is the maximum achievable under ideal conditions, and it is sustained only for very short bursts. The steady-state performance is lower, and that is close to what we achieved.

Part 2 and Onwards

See Part 2.