Kyle Kloberdanz's Blog

OpenMP - Static or Dynamic Scheduling?

10 May 2020

OpenMP, or OMP, is an API for parallelizing your C, C++, and Fortran code. It works via compiler pragmas, meaning that it will only work if you have a compiler that supports it. Thankfully, most modern compilers support OMP, including

GCC
Clang
MSVC
ICC (Intel C Compiler)

OpenMP offers many tunables, which can greatly change the performance characteristics. One of which, is the scheduler .

Two of the most common scheduling types to use are static and dynamic. Static scheduling means that each thread gets assigned a chunk of work at the start of the parallel section. If a thread finishes early, then it will stop executing and sit idle while the other threads continue their work. This might sound extremely wasteful, but for workloads where the amount of work per unit is about the same, static may be the best solution, because it has much less overhead than dynamic scheduling. Dynamic scheduling means that threads will get assigned work from a work queue, and can get assigned more work if they finish early and there is more work to be done. This is great if you have a workload where each task may differ in the amount of time it takes to solve.

Let's create some benchmarks to see when one scheduling type will be better than the other. Below I have the base code that we will use. This is the single threaded implementation of a program that will find the primes from a range of numbers. The astute mathematicians reading this will note that this is a very naïve algorithm for finding prime numbers, and I should probably be using a sieve for better performance. This is not a post about algorithm optimization, however. I just need to simulate a workload that keeps my CPUs busy :)

      1 #include <stdio.h>
      2 #include <stdbool.h>
      3 #include <math.h>
      4 
      5 bool is_prime(unsigned long n) {
      6     if (n < 2) {
      7         return false;
      8     } else {
      9         unsigned long max = sqrt(n) + 1;
     10         unsigned long i;
     11         for (i = 2; i < max; i++) {
     12             if (n % i == 0) {
     13                 return false;
     14             }
     15         }
     16         return true;
     17     }
     18 }
     19 
     20 int main(void) {
     21     unsigned long start = 9223372036854775808UL;
     22     unsigned long end = start + 1000;
     23     unsigned long i;
     24 
     25     for (i = start; i < end; i++) {
     26         printf("testing: %lu\n", i);
     27         if (is_prime(i)) {
     28             printf("%lu is prime\n", i);
     29         }
     30     }
     31     return 0;
     32 }

    $ clang -o single single.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm
    $ time ./single
    9223372036854775837 is prime
    9223372036854775907 is prime
    9223372036854775931 is prime
    9223372036854775939 is prime
    9223372036854775963 is prime
    9223372036854776063 is prime
    9223372036854776077 is prime
    9223372036854776167 is prime
    9223372036854776243 is prime
    9223372036854776257 is prime
    9223372036854776261 is prime
    9223372036854776293 is prime
    9223372036854776299 is prime
    9223372036854776351 is prime
    9223372036854776393 is prime
    9223372036854776407 is prime
    9223372036854776561 is prime
    9223372036854776657 is prime
    9223372036854776687 is prime
    9223372036854776693 is prime
    9223372036854776711 is prime
    9223372036854776803 is prime
    
    real    10m28.951s
    user    10m28.739s
    sys     0m0.020s

10m28.951s - Now that we have a baseline, let's start multithreading with OpenMP!

As you can see, I only had to add a few lines to make this program multithreaded. I simply added the pragma to tell openmp to parallelize this for loop, as well as a critical section to enforce that only a single thread can print at a time. This will keep our messages from colliding with each other.

      1 #include <stdio.h>
      2 #include <stdbool.h>
      3 #include <math.h>
      4 
      5 bool is_prime(unsigned long n) {
      6     if (n < 2) {
      7         return false;
      8     } else {
      9         unsigned long max = sqrt(n) + 1;
     10         unsigned long i;
     11         for (i = 2; i < max; i++) {
     12             if (n % i == 0) {
     13                 return false;
     14             }
     15         }
     16         return true;
     17     }
     18 }
     19 
     20 int main(void) {
     21     unsigned long start = 9223372036854775808UL;
     22     unsigned long end = start + 1000;
     23     unsigned long i;
     24 
     25     #pragma omp parallel for schedule(static)
     26     for (i = start; i < end; i++) {
     27         if (is_prime(i)) {
     28             #pragma omp critical
     29             {
     30                 printf("%lu is prime\n", i);
     31             }
     32         }
     33     }
     34     return 0;
     35 }

    $ clang -o static static.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm
    9223372036854776351 is prime
    9223372036854775837 is prime
    9223372036854775939 is prime
    9223372036854776063 is prime
    9223372036854776167 is prime
    9223372036854776657 is prime
    9223372036854776561 is prime
    9223372036854776393 is prime
    9223372036854776687 is prime
    9223372036854776257 is prime
    9223372036854776803 is prime
    9223372036854775907 is prime
    9223372036854776243 is prime
    9223372036854775963 is prime
    9223372036854776077 is prime
    9223372036854776407 is prime
    9223372036854776693 is prime
    9223372036854776261 is prime
    9223372036854775931 is prime
    9223372036854776711 is prime
    9223372036854776293 is prime
    9223372036854776299 is prime
    
    real    2m14.245s
    user    15m17.936s
    sys     0m0.212s

2m14.245s - That's a quite a bit bit faster... but let's think about how the complexity of our prime algorithm works. If we start with the number 96, then when we apply our is_prime function, the first thing we divide by is 2... which has a remainder of 0, and then we're done.

Next number, 97
Let's try 2. Is the remainder 0? no
How about 3? no
Maybe 4? no
5? no
6? no
7? no
8? no
9? no
10? no
And now we have tested numbers up to the sqrt(97), and we can now conclude that 97 is prime!

97 was A LOT more work to test than 96 even though they were right next to each other!

This algorithm has wildly different runtimes for different inputs, so let's get smarter with how we multithread it. Let's try to keep those worker threads as busy as possible!

In OpenMP, changing the scheduling type is as simple as replacing static with dynamic

      1 #include <stdio.h>
      2 #include <stdbool.h>
      3 #include <math.h>
      4 
      5 bool is_prime(unsigned long n) {
      6     if (n < 2) {
      7         return false;
      8     } else {
      9         unsigned long max = sqrt(n) + 1;
     10         unsigned long i;
     11         for (i = 2; i < max; i++) {
     12             if (n % i == 0) {
     13                 return false;
     14             }
     15         }
     16         return true;
     17     }
     18 }
     19 
     20 int main(void) {
     21     unsigned long start = 9223372036854775808UL;
     22     unsigned long end = start + 1000;
     23     unsigned long i;
     24 
     25     #pragma omp parallel for schedule(dynamic)
     26     for (i = start; i < end; i++) {
     27         if (is_prime(i)) {
     28             #pragma omp critical
     29             {
     30                 printf("%lu is prime\n", i);
     31             }
     32         }
     33     }
     34     return 0;
     35 }

    clang -o dynamic dynamic.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm
    time ./dynamic
    9223372036854775939 is prime
    9223372036854775931 is prime
    9223372036854776167 is prime
    9223372036854776063 is prime
    9223372036854776257 is prime
    9223372036854776077 is prime
    9223372036854775963 is prime
    9223372036854775907 is prime
    9223372036854776243 is prime
    9223372036854775837 is prime
    9223372036854776261 is prime
    9223372036854776293 is prime
    9223372036854776299 is prime
    9223372036854776407 is prime
    9223372036854776393 is prime
    9223372036854776711 is prime
    9223372036854776351 is prime
    9223372036854776687 is prime
    9223372036854776693 is prime
    9223372036854776657 is prime
    9223372036854776803 is prime
    9223372036854776561 is prime
    
    real    1m24.828s
    user    18m10.744s
    sys     0m0.436s

1m24.828s - That's a pretty good speedup, considering all we had to do was understand what kind of work our algorithm was doing, and tweak how we use OMP with a single line change!

* All benchmarks were run on a an AMD Ryzen 7 2700X Eight-Core Processor, running Ubuntu 18.04

About

bicycle, and the Des Moines Capitol Building

This blog is about whatever I find interesting at this very moment. I'll post how-to articles on occasion and various lessons that I've learned in my career.

Me

I am a software developer from Des Moines Iowa. I graduated from the University of Iowa with a B.S. in Computer Science.

I am interested in math, compilers, and interpreters, and I always strive to understand how various systems work from the ground up.

The languages I use most often include

C
Rust
Python
C++
Go

I like working in a UNIX or Linux environment.

Posts

Follow Me

Scientific Research Publications

[ICSE 2022 Technical Track] DeepStability: A Study of Unstable Numerical Methods and Their Solutions in Deep Learning