Parallel Programming

Introduction

Single-processor performance growth has slowed dramatically compared to 20 years ago. You cannot keep adding more and more transistors onto an integrated circuit: at some point the huge number of transistors generates so much heat that the processor's performance actually degrades. Manufacturers therefore decided that, instead of developing ever faster monolithic processors, they would put multiple processors on a single integrated circuit. These are called multicore processors.

Consequence: simply adding more processors will not improve the performance of a serial program, because a serial program cannot make use of the multiple cores.

Programmers need to learn about parallel programming, because the best parallelization may be obtained by stepping back and devising a new algorithm.

Parallel programming follows the basic idea of partitioning. There are two widely used approaches:

  • task-parallelism: partition the tasks used to solve the problem among the cores (not the same instructions)
  • data-parallelism: partition the data used to solve the problem among the cores (with the same instructions)
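To make the data-parallel idea concrete, here is a small sketch (the item count, core count and the printing are made up for illustration) that block-partitions n items among core_count cores. In a real program each block would be processed by its own core; task-parallelism would instead hand each core a different kind of task.

#include <stdio.h>

/* Sketch of block-partitioning data among cores (illustration only:
   nothing actually runs in parallel here, we just print the ranges). */
int main(void) {
  int n = 16;              /* number of data items (made up)  */
  int core_count = 4;      /* number of cores (made up)       */
  int block = n / core_count;

  for (int rank = 0; rank < core_count; rank++) {
    int first = rank * block;       /* first index owned by this core   */
    int last  = first + block;      /* one past the last index it owns  */
    printf("core %d works on items %d..%d\n", rank, first, last - 1);
  }
  /* Task-parallelism would instead give each core a different task,
     e.g. one core reads input while another computes results. */
  return 0;
}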

Points to consider:

  • communication
  • load balancing
  • synchronization

There are two main types of parallel systems:

  • shared-memory: each core can read and write each memory location
  • distributed-memory: each core has its own private memory and cores must communicate explicitly, e.g. via messages
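For distributed memory, explicit communication could look like the following minimal sketch, written with MPI purely as an illustration (MPI is not covered in these notes): process 1 sends a value to process 0, because neither process can read the other's private memory.

#include <stdio.h>
#include <mpi.h>

/* Minimal distributed-memory sketch (MPI used only as an example):
   each process has its own private memory, so data must be sent
   explicitly as a message. Run with at least two processes. */
int main(int argc, char *argv[]) {
  int rank, value;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
    value = 42;                                   /* lives only in process 1's memory */
    MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  } else if (rank == 0) {
    MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("process 0 received %d\n", value);
  }

  MPI_Finalize();
  return 0;
}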

Concurrent, parallel, distributed

  • concurrent: multiple tasks can be in progress at any time
  • parallel: multiple tasks cooperate closely to solve a problem
  • distributed: a program may need to cooperate with other programs (focused on using resources)

Parallel programming is a subset of concurrent programming.

Every parallel program contains at least one serial program.

Parallel hardware and software

The classical von Neumann architecture consists of

  • main memory which can store both data and instructions
  • central processing unit (CPU) divided into control unit and arithmetic logic unit (ALU)
  • interconnection between memory and CPU

Data in the CPU and information about the current state of a program are stored in fast storage called registers.

The separation between memory and CPU is often called the von Neumann bottleneck, because executing instructions on the CPU is way faster than fetching the data from memory.

Modifications to the basic von Neumann architecture that address the bottleneck:

Caching

A cache is a collection of memory locations that can be accessed in less time than other memory locations. How can we decide which data and instructions should be stored in the cache?

Programs tend to use data and instructions that are physically close to recently used data and instructions (for example, a loop over an array). This is called locality. Therefore, memory accesses operate on blocks of data and instructions. These blocks are called cache lines. The cache is often divided into levels (L1, L2, L3, ...). Data and instructions that lie in level i are usually also stored in level i+1. The CPU first checks L1, then L2, and so on. If the desired information is found, this is called a cache hit, otherwise it is a cache miss.

The CPU cache is controlled by the system's hardware, but programmers still have indirect control over the cached data through their knowledge of locality. Memory is a huge one-dimensional array, so it really matters how a programmer writes code.

/* Inner loop walks down a column of A: consecutive accesses are MAX elements apart in memory. */
for (j = 0; j < MAX; j++) {
  for (i = 0; i < MAX; i++) {
    y[i] += A[i][j] * X[j];
  }
}

/* Inner loop walks along a row of A: consecutive accesses are adjacent in memory. */
for (i = 0; i < MAX; i++) {
  for (j = 0; j < MAX; j++) {
    y[i] += A[i][j] * X[j];
  }
}

The second version is much faster because of how the array is laid out in memory.
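C stores two-dimensional arrays in row-major order: A[i][j] lives at offset i*MAX + j from the start of the array, so the second loop touches consecutive memory locations (and therefore whole cache lines), while the first jumps MAX elements at every step. The short check below (the array contents are irrelevant, only the positions matter) makes the layout visible.

#include <stdio.h>

#define MAX 4

/* Row-major layout: the element A[i][j] sits i*MAX + j elements from
   the start of the array, so walking along a row touches consecutive
   memory locations. */
int main(void) {
  double A[MAX][MAX];

  for (int i = 0; i < MAX; i++) {
    for (int j = 0; j < MAX; j++) {
      long offset = ((char *)&A[i][j] - (char *)&A[0][0]) / (long)sizeof(double);
      printf("A[%d][%d] is element %ld from the start (i*MAX + j = %d)\n",
             i, j, offset, i * MAX + j);
    }
  }
  return 0;
}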

Instruction-level parallelism

Processors can execute multiple instructions simultaneously. There are two main approaches: pipelining and multiple issue.

SIMD systems

A classical von Neumann machine is a single instruction stream, single data stream (SISD) system, since it executes a single instruction at a time and can fetch or store one item of data at a time.

Single instruction, multiple data (SIMD) systems are parallel systems that operate on multiple data streams by applying the same instruction to multiple data items (a single control unit, multiple ALUs). These systems work well for data-parallel problems, but less so for other kinds of parallelism.
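A typical SIMD-friendly pattern is a loop that applies the same operation to every element of an array, as in the sketch below (the function and its arguments are made up for illustration); on SIMD hardware, or with a vectorizing compiler, several of these additions can be carried out by a single instruction.

#include <stddef.h>

/* SIMD-friendly pattern: the same add is applied to every element,
   so the hardware can process several elements per instruction. */
void add_arrays(double *x, const double *y, size_t n) {
  for (size_t i = 0; i < n; i++) {
    x[i] += y[i];
  }
}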

MIMD systems

Multiple instruction, multiple data (MIMD) systems support multiple simultaneous instruction streams operating on multiple data streams (a number of independent processing units, each with its own control unit and ALU). These systems are asynchronous.

Shared-memory systems

Most shared-memory systems in use today consist of one or more multicore processors (multiple CPUs on a single chip). Each core has a private L1 cache; the other cache levels may or may not be shared.

Uniform memory access (UMA): all processors are directly connected to main memory. Time to access all memory locations is the same for all cores.

Nonuniform memory access (NUMA): each processor is directly connected to one block of memory, and the other processors can reach that block only through special hardware. Access times therefore differ: a core accesses its directly connected block faster.

UMA systems are easier to program, since the programmer does not have to worry about different access times.

Cache coherence

Multiprocessor systems introduce the cache coherence problem: when multiple processors maintain locally cached copies of a shared memory location, a local modification of that location can leave the caches and memory inconsistent.

Cache coherence prevents this problem by maintaining a uniform state for each cached block of data.

False sharing

CPU caches operate on cache lines. When multiple processors work on data that lies in the same cache line, each write invalidates the other processors' copies of that line, which then have to be updated even though no variable is actually shared. This can be prevented by using temporary storage that is local to each process or thread and copying the local storage to shared storage only at the end.
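A sketch of that remedy, assuming OpenMP (introduced later in these notes) and made-up names such as partial and sum_blocks: each thread accumulates into a private local variable and writes to the adjacent shared array elements only once at the end, so the threads do not keep invalidating each other's cache line.

#include <omp.h>

#define NTHREADS 4

/* Avoiding false sharing: the elements of partial[] are adjacent in
   memory and would share a cache line, so each thread works in a
   private local variable and writes to the shared array only once.
   For simplicity, n is assumed to be divisible by NTHREADS. */
double partial[NTHREADS];

void sum_blocks(const double *x, int n) {
  #pragma omp parallel num_threads(NTHREADS)
  {
    int rank = omp_get_thread_num();
    int block = n / NTHREADS;
    double local = 0.0;                    /* private to this thread */

    for (int i = rank * block; i < (rank + 1) * block; i++) {
      local += x[i];
    }
    partial[rank] = local;                 /* single write to shared storage */
  }
}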

Amdahl's law

Performance of parallel programs is limited by the non-parallel part of the program.

If a fraction r of the serial program remains unparallelized, then Amdahl's law says we cannot get a speedup better than 1/r.
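Written out: if r is the unparallelized fraction and p the number of cores, the best possible speedup is

$$
S(p) = \frac{1}{r + \frac{1-r}{p}} \le \frac{1}{r}
$$

For example, with r = 0.1 (10 % of the program serial), the speedup can never exceed 10, no matter how many cores are added.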

This law does not take the problem size into consideration. If we increase the problem size, the fraction of the program that is serial typically decreases (this is known as Gustafson's law).

OpenMP

OpenMP (short for Open Multiprocessing) is an API for shared-memory parallel programming, designed for systems where each thread has access to all of the available memory. To use OpenMP you need a C compiler that supports it. The compiler and runtime system take over some of the work, which makes writing parallel code easier.

OpenMP allows for incremental parallelization. It provides a directive-based shared-memory API.
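A minimal example of the directive-based style (the sum computed here is made up for illustration): the #pragma is the only difference from a serial program, which is what makes incremental parallelization possible.

#include <stdio.h>

#define N 1000000

/* Incremental parallelization: the pragma is the only change compared
   to the serial version. reduction(+:sum) gives each thread a private
   copy of sum and combines the copies at the end. */
int main(void) {
  double sum = 0.0;

  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < N; i++) {
    sum += 1.0 / (i + 1.0);
  }

  printf("sum = %f\n", sum);
  return 0;
}

With GCC this is compiled with the -fopenmp flag; without that flag the pragma is ignored and the program runs serially.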

Last update: November 23, 2020