CSC367 Lecture Notes: Part 3
Designing Parallel Algorithms

See all lecture notes here.

Parallel Hardware and Programming Models

There are many different types of parallel hardware. We will mainly focus on
  • Shared Memory: This is the most common in our daily lives. A shared memory architecture has multiple cores (typically on a single chip) that share the same memory. This type of hardware is used in most phones, laptops, and desktops. For shared memory architectures, we typically use the pthreads or OpenMP libraries.
  • Distributed Systems: This is common in supercomputers and compute clusters, where multiple computers are connected to each other via a network. Communication is the main focus when designing algorithms for distributed systems, since it is much more expensive (higher latency) than on shared memory systems. Communication is usually done by message passing. For distributed systems, we typically use the MPI library.
  • SIMD and Vector: The most common example of this architecture is graphics cards (GPUs). These compute units typically take advantage of many specialized processors for vector computations, and are programmed with CUDA (for Nvidia GPUs). This type of architecture is also commonly used in machine learning.
For these different architectures, we may need to design our algorithms differently. However, many principles are the same, which is what we will focus on in this section. The general guidelines are:
  • Identify tasks in your program that can be performed concurrently
  • Map concurrent tasks onto multiple threads or processes, to be run in parallel
  • Partition the input, output, and/or intermediate data and assign to processes
  • Handle concurrent accesses to shared data by multiple processes
  • Add synchronization between stages of the parallel execution, where necessary
  • Profile performance and determine what the bottlenecks are
  • Target optimizations based on profiling information and performance analysis
  • Write small benchmarks to test your program in a variety of configurations

Tasks: Decomposition, Dependency, Granularity, Interaction, Mapping, Balance

A task is a unit of computation that can be extracted from the main program and assigned to a process, and can be run in parallel with other tasks. We need to decompose the main program into tasks, and then map these tasks to processes. Choosing the right task decomposition is an important part of designing parallel algorithms.

Note that not all tasks are independent of each other. Some tasks may depend on the output of other tasks. For example, in image filtering, applying the filter on each pixel may be independent of each other. However, normalizing the image after applying the filter can only be done after the entire image has been filtered.

The dependencies of any algorithm can be represented by a task dependency graph, which is a directed acyclic graph (DAG). The nodes in the graph represent the tasks, and the edges represent the dependencies. A start node is a node with no incoming edges, which represents a task a processor can execute immediately. A finish node is a node with no outgoing edges. Some examples of task dependency graphs are shown below.

TaskDAG

Note that a processor does not have to traverse a path through the task dependency graph, although doing so may be efficient: after computing the current task, the processor may already have some of the data needed for the next task in cache (improved temporal locality).

The granularity of a task decomposition is determined by how many tasks and what their sizes are. A coarse-grained decomposition has few large tasks, while a fine-grained decomposition has many small tasks.

Also, note the tradeoff between computation and communication. If we have a large number of small tasks, then we may have to communicate a lot between processors, as well as synchronize them often. But if we have a few large tasks, then we may have to wait for a long time for a processor to finish a task before we can start the next task. We will discuss later how in certain cases, we can do computation while waiting for communication, which can save a lot of time.

Some Useful Definitions

Ideal parallelism is when all tasks are independent of each other, and no communication is needed. Coarse-grained parallelism is when a lot of computation can be performed before communication is necessary. This type of parallelism is common for distributed systems, where communication is expensive. Fine-grained parallelism is when frequent communication is needed, which is more suitable for shared memory systems. The parallelism granularity is the measure of how much processing can be done before communication is necessary, for a given task decomposition.

The maximum degree of concurrency is the maximum number of tasks that can be executed simultaneously at any point in time. The average degree of concurrency is the number of concurrently executing tasks averaged over the entire execution. Consider the following two task dependency graphs:

DependencyGraphs

Note that each task takes one unit of time. The maximum degree of concurrency for both graphs is 4, which is realized in the first step. The average degree of concurrency for the left graph is \(\frac{4+2+1}{1+1+1}=\frac{7}{3}\), whereas for the right graph it is only \(\frac{4+1+1+1}{1+1+1+1}=\frac{7}{4}\).

Note that tasks need not all be the same size. Consider the following examples:

DependencyGraphsUnweighted

Here, the maximum degree of concurrency for both graphs is 38, but the average degree of concurrency for the left graph is \(\frac{38+12+9}{12+7+9}=2.11\) while for the right graph it is \(\frac{38+7+5+9}{12+7+5+9}=1.79\). Note that the denominator is the length of the critical path, which is defined as the longest path between any pair of start and finish nodes. The length of the critical path is the minimum amount of time to execute the entire program. The critical path length for the left graph is 28, and for the right graph it is 33. We can calculate the average degree of concurrency by dividing the total amount of work by the critical path length.

Question: If the granularity of a task decomposition is more fine-grained, more concurrency can be achieved. Is this always true? No! There are inherent limits to fine-grained decompositions, such as hitting indivisible tasks, or having to do redundant or excessive computation.

Task Interactions

The task dependency graph captures producer-consumer interactions. There are also other types of interactions, like when tasks on different processors need to exchange data or synchronize. Consider the following decomposition of a matrix vector multiplication:

Matrix Multiplication Task Interaction

Although the vector \(b\) has one element assigned to each task, every task needs the entirety of \(b\) to compute its result (in \(c\)). This is called a read-only interaction, where we can use a shared vector (on shared memory machines), or a broadcast (on distributed systems). There are also read-write interactions, which are more complicated since we need to ensure there are no race conditions. Therefore, read-write interactions should be kept on the same processor as much as possible.

Mapping

A mapping is an assignment of tasks to processors, and it depends on the decomposition as well. The goals of a good mapping are
  • Maximize the degree of concurrency
  • Minimize interaction among processors
  • Minimize the total completion time
Often, the task decomposition and mapping can result in conflicting goals, so we must find a good balance. The degree of concurrency is affected by decomposition choice, but the mapping affects how much of the concurrency can be utilized. To map effectively, we must consider the task size as well as the size of the data associated with each task. This will be discussed in more detail when we discuss different mapping techniques.

Decomposition Techniques

There are two commonly used decomposition techniques: recursive decomposition, where we primarily decompose tasks based on the structure of the algorithm, and data decomposition, where we primarily decompose tasks based on partitioning the data. Data decomposition is more common and typically simpler to understand.

Recursive Decomposition

Recursive decomposition divides a problem into subproblems, solves each subproblem by recursively subdividing it in the same way, and combines the results. The subproblems are solved in parallel.

Mapping Techniques

Parallel Algorithm Models

Parallel Performance Model and Amdahl's Law

Common Patterns in Scientific Computing

Communication, Surface to Volume Ratio, and Ghost Zones

Approximating Particle Interactions

Data Compression

Often in scientific computing (e.g., finite element methods), a lot of data is represented in matrix form. These matrices are often very sparse (typically only 1–3% of entries are non-zero). For example, we can represent the stencil for a 2D finite difference method as a matrix.

Compressed Sparse Row (CSR) Format

The compressed sparse row format is a common format for storing sparse matrices. It stores 3 different arrays: val stores the values of the non-zero elements, ind stores the column indices of the non-zero elements, and ptr stores, for each row, the index in val of that row's first non-zero element (with a final entry equal to the total number of non-zeros).

CSR

For example, matrix vector multiplication can be done as follows:

Compression helps to reduce the memory and storage costs in matrix computations. However, optimizing sparse matrix operations is more difficult for the following reasons:
  • Load balance is a bigger issue, since the distribution of the non-zero elements is often not uniform and not known in advance.
  • Data is not well-aligned in memory, so we may need to add padding to enforce alignment.
  • The compute pattern is irregular, making parallel computation more difficult.

Part 4 and Onwards

See Part 4.