Cheat sheet: http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf

To compile:

gcc -fopenmp -o code code.c

OpenMP follows the SPMD (single program, multiple data) model of parallel processing. Parallelism is specified with #pragma omp directives, but the runtime also provides several C functions, declared in omp.h, to specialize the behavior of each thread:

// Return the thread id, ranging from 0 (the master thread) to N-1
int omp_get_thread_num();

// Return the total number of threads in the current team
int omp_get_num_threads();
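
For example, a minimal sketch using both calls (it relies on the parallel construct described below; the order of output lines is nondeterministic):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        /* Each thread in the team reports its id and the team size */
        printf("Thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}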

The block of code that multiple threads should run is marked with

#pragma omp parallel num_threads(N)
{ /* block */ }

With this directive, N threads are spawned and each runs the block once. The threads are joined again at the end of the block. If no num_threads clause is given, the number of threads is determined automatically. Variables can be declared private (each thread has its own copy) or shared (all threads share the same memory location) with clauses:

#pragma omp parallel shared(x,y) private(i,j,k)
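A minimal sketch of the difference (the variable names are illustrative; the atomic construct used to protect the shared update is covered later):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int sum = 0;   /* shared: one copy seen by all threads */
    int tid;       /* private: each thread gets its own copy */

    #pragma omp parallel num_threads(4) shared(sum) private(tid)
    {
        tid = omp_get_thread_num();   /* safe: tid is private */

        /* sum is shared, so concurrent updates need synchronization */
        #pragma omp atomic
        sum += tid;
    }
    printf("sum = %d\n", sum);        /* 0+1+2+3 = 6 */
    return 0;
}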

There are several ways to specify parallelism. One is to split a for-loop across multiple threads, which is useful when the loop iterations are data-independent. The syntax is

#pragma omp for
for (i=lowerbound; i op upperbound; inc_expr)
{ /* block */ }

The for-loop must use an integer loop counter i, and the lowerbound/upperbound must be loop-invariant integer expressions. The increment expression inc_expr can be any of the following: ++i, i++, --i, i--, i += n, i -= n, i = i+n, i = n+i, i = i-n. If, inside a parallel region, a portion of the code must not be executed by more than one thread (e.g. initialization), we can surround it with a single construct:

#pragma omp single
{ /* block */ }
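
Putting the two together, a sketch (the array a and its size N are illustrative): one thread performs the initialization, then the team shares the loop iterations:

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void)
{
    int a[N], i;

    #pragma omp parallel shared(a) private(i)
    {
        #pragma omp single
        printf("initialization, run by exactly one thread\n");

        /* implicit barrier after single; then the iterations
           are divided among the threads of the team */
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = i * i;
    }

    for (i = 0; i < N; i++)
        printf("a[%d] = %d\n", i, a[i]);
    return 0;
}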

But for non-loop parallelism, we need to use sections:

#pragma omp sections
{
  #pragma omp section
  { /* block */ }
  #pragma omp section
  { /* block */ }
}

The sections construct contains a set of structured blocks, each introduced by a section construct. Each section is executed by one thread, so different sections can run in parallel.

Note that if a parallel region contains only a single for construct or a single sections construct, there is a shortcut:

#pragma omp parallel for
for (i=lowerbound; i op upperbound; inc_expr)
{ /* block */ }

#pragma omp parallel sections
{
  #pragma omp section
  { /* block */ }
  #pragma omp section
  { /* block */ }
}
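
For instance, a sketch with two independent tasks (if there are more sections than threads, some threads run more than one section):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("section 1 on thread %d\n", omp_get_thread_num());
        #pragma omp section
        printf("section 2 on thread %d\n", omp_get_thread_num());
    }
    return 0;
}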

Critical sections

A critical section is declared with

#pragma omp critical (critical_section_name)
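The name is optional; all unnamed critical constructs share a single global lock, so naming them lets unrelated critical sections proceed concurrently. A sketch (the name update_total is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int total = 0;

    #pragma omp parallel num_threads(4)
    {
        int contribution = omp_get_thread_num() + 1;

        /* only one thread at a time may be inside a critical
           section with this name */
        #pragma omp critical (update_total)
        total += contribution;
    }
    printf("total = %d\n", total);   /* always 1+2+3+4 = 10 */
    return 0;
}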

But if the critical section only serves to make a read, write, or update of a single variable atomic, it can be done instead with

#pragma omp atomic [read|write|update|capture]
x = expr;

#pragma omp atomic capture
{ /* copy and update */ }

For atomic read, it must be in the form of

v = x;

For atomic write, it must be in the form of

x = expr;

For atomic update, it can be any of the following forms:

x++; ++x;
x--; --x;
x op= expr;
x = x op expr;

For atomic capture, it can be any of the following forms:

v = x++;
v = x--;
v = ++x;
v = --x;
v = (x op= expr);
{ v = x; x op= expr; }
{ x op= expr; v = x; }
{ v = x; x = x op expr; }
{ x = x op expr; v = x; }
{ v = x; x++; }
{ v = x; ++x; }
{ v = x; x--; }
{ v = x; --x; }
{ x++; v = x; }
{ ++x; v = x; }
{ x--; v = x; }
{ --x; v = x; }
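
A sketch of the update and capture forms (atomic with a clause requires OpenMP 3.1, i.e. gcc 4.7 or later; the captured values vary between runs, but the final x does not):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int x = 0;

    #pragma omp parallel num_threads(4)
    {
        int v;

        /* update: concurrent increments are never lost */
        #pragma omp atomic update
        x++;

        /* capture: fetch the old value and increment in one atomic step */
        #pragma omp atomic capture
        v = x++;

        printf("thread %d captured %d\n", omp_get_thread_num(), v);
    }
    printf("final x = %d\n", x);   /* always 8: four updates plus four captures */
    return 0;
}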

Other tricks

Sometimes different threads may have an inconsistent view of memory because values are kept in registers or caches. To force a consistent view at a certain point, use

#pragma omp flush
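
A common use is a producer/consumer flag, a sketch adapted from the pattern in the OpenMP examples document (it assumes a team of exactly two threads):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int data = 0, flag = 0;

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            data = 42;
            #pragma omp flush(data)       /* make data visible first */
            flag = 1;
            #pragma omp flush(flag)       /* then publish the flag */
        } else {
            do {
                #pragma omp flush(flag)   /* re-read flag from memory */
            } while (!flag);
            #pragma omp flush(data)       /* now data is guaranteed visible */
            printf("got %d\n", data);
        }
    }
    return 0;
}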

If a variable is manipulated inside the parallel region but is eventually reduced to a single value upon the completion of all threads, it is more efficient to use

#pragma omp parallel reduction(op: var1, var2)

where the listed variables are reduced by the operator op (which can be +, *, &, |, ^, &&, ||, max, or min; max and min were added in OpenMP 3.1). Each thread then gets an independent copy of each variable, initialized to the operator's identity value, and upon completion the copies are combined using the specified operator.
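
For example, summing 1 to 100 with a reduction (a minimal sketch):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i, sum = 0;

    /* each thread accumulates into a private copy of sum,
       initialized to 0 (the identity of +); the copies are
       combined when the threads join */
    #pragma omp parallel for reduction(+: sum)
    for (i = 1; i <= 100; i++)
        sum += i;

    printf("sum = %d\n", sum);   /* 5050 */
    return 0;
}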