OpenMP, or OMP, is an API for parallelizing your C, C++, and Fortran code. It works via compiler pragmas, meaning that it will only work if you have a compiler that supports it. Thankfully, most modern compilers support OMP, including

- GCC
- Clang
- MSVC
- ICC (Intel C Compiler)

Two of the most common scheduling types to use are **static**
and **dynamic**. Static scheduling means that each thread gets
assigned a chunk of work at the start of the parallel section. If a thread
finishes early, then it will stop executing and sit idle while the other
threads continue their work. This might sound extremely wasteful, but for
workloads where the amount of work per unit is about the same, static may
be the best solution, because it has much less overhead than dynamic
scheduling. Dynamic scheduling means that threads will get assigned work
from a work queue, and can get assigned more work if they finish early and
there is more work to be done. This is great if you have a workload where
each task may differ in the amount of time it takes to solve.

Let's create some benchmarks to see when one scheduling type will be better than the other. Below I have the base code that we will use. This is the single threaded implementation of a program that will find the primes from a range of numbers. The astute mathematicians reading this will note that this is a very naïve algorithm for finding prime numbers, and I should probably be using a sieve for better performance. This is not a post about algorithm optimization, however. I just need to simulate a workload that keeps my CPUs busy :)

1 #include <stdio.h> 2 #include <stdbool.h> 3 #include <math.h> 4 5 bool is_prime(unsigned long n) { 6 if (n < 2) { 7 return false; 8 } else { 9 unsigned long max = sqrt(n) + 1; 10 unsigned long i; 11 for (i = 2; i < max; i++) { 12 if (n % i == 0) { 13 return false; 14 } 15 } 16 return true; 17 } 18 } 19 20 int main(void) { 21 unsigned long start = 9223372036854775808UL; 22 unsigned long end = start + 1000; 23 unsigned long i; 24 25 for (i = start; i < end; i++) { 26 printf("testing: %lu\n", i); 27 if (is_prime(i)) { 28 printf("%lu is prime\n", i); 29 } 30 } 31 return 0; 32 }

$ clang -o single single.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm $ time ./single 9223372036854775837 is prime 9223372036854775907 is prime 9223372036854775931 is prime 9223372036854775939 is prime 9223372036854775963 is prime 9223372036854776063 is prime 9223372036854776077 is prime 9223372036854776167 is prime 9223372036854776243 is prime 9223372036854776257 is prime 9223372036854776261 is prime 9223372036854776293 is prime 9223372036854776299 is prime 9223372036854776351 is prime 9223372036854776393 is prime 9223372036854776407 is prime 9223372036854776561 is prime 9223372036854776657 is prime 9223372036854776687 is prime 9223372036854776693 is prime 9223372036854776711 is prime 9223372036854776803 is prime real 10m28.951s user 10m28.739s sys 0m0.020s

As you can see, I only had to add a few lines to make this program
multithreaded. I simply added the **pragma** to tell openmp
to parallelize this for loop, as well as a **critical**
section to enforce that only a single thread can print at a time. This will
keep our messages from colliding with each other.

1 #include <stdio.h> 2 #include <stdbool.h> 3 #include <math.h> 4 5 bool is_prime(unsigned long n) { 6 if (n < 2) { 7 return false; 8 } else { 9 unsigned long max = sqrt(n) + 1; 10 unsigned long i; 11 for (i = 2; i < max; i++) { 12 if (n % i == 0) { 13 return false; 14 } 15 } 16 return true; 17 } 18 } 19 20 int main(void) { 21 unsigned long start = 9223372036854775808UL; 22 unsigned long end = start + 1000; 23 unsigned long i; 24 25 #pragma omp parallel for schedule(static) 26 for (i = start; i < end; i++) { 27 if (is_prime(i)) { 28 #pragma omp critical 29 { 30 printf("%lu is prime\n", i); 31 } 32 } 33 } 34 return 0; 35 }

$ clang -o static static.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm 9223372036854776351 is prime 9223372036854775837 is prime 9223372036854775939 is prime 9223372036854776063 is prime 9223372036854776167 is prime 9223372036854776657 is prime 9223372036854776561 is prime 9223372036854776393 is prime 9223372036854776687 is prime 9223372036854776257 is prime 9223372036854776803 is prime 9223372036854775907 is prime 9223372036854776243 is prime 9223372036854775963 is prime 9223372036854776077 is prime 9223372036854776407 is prime 9223372036854776693 is prime 9223372036854776261 is prime 9223372036854775931 is prime 9223372036854776711 is prime 9223372036854776293 is prime 9223372036854776299 is prime real 2m14.245s user 15m17.936s sys 0m0.212s

2m14.245s - That's a quite a bit bit faster... but let's think about how
the complexity of our prime algorithm works. If we start with the number
**96**, then when we apply our **is_prime**
function, the first thing we divide by is **2**... which has a
remainder of **0**, and then we're done.

Next number, **97**

Let's try **2**. Is the remainder **0**? no

How about **3**? no

Maybe **4**? no

**5**? no

**6**? no

**7**? no

**8**? no

**9**? no

**10**? no

And now we have tested numbers up to the **sqrt(97)**, and we
can now conclude that **97** is prime!

**97** was **A LOT** more work to test than
**96** even though they were right next to each other!

This algorithm has wildly different runtimes for different inputs, so let's get smarter with how we multithread it. Let's try to keep those worker threads as busy as possible!

1 #include <stdio.h> 2 #include <stdbool.h> 3 #include <math.h> 4 5 bool is_prime(unsigned long n) { 6 if (n < 2) { 7 return false; 8 } else { 9 unsigned long max = sqrt(n) + 1; 10 unsigned long i; 11 for (i = 2; i < max; i++) { 12 if (n % i == 0) { 13 return false; 14 } 15 } 16 return true; 17 } 18 } 19 20 int main(void) { 21 unsigned long start = 9223372036854775808UL; 22 unsigned long end = start + 1000; 23 unsigned long i; 24 25 #pragma omp parallel for schedule(dynamic) 26 for (i = start; i < end; i++) { 27 if (is_prime(i)) { 28 #pragma omp critical 29 { 30 printf("%lu is prime\n", i); 31 } 32 } 33 } 34 return 0; 35 }

clang -o dynamic dynamic.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm time ./dynamic 9223372036854775939 is prime 9223372036854775931 is prime 9223372036854776167 is prime 9223372036854776063 is prime 9223372036854776257 is prime 9223372036854776077 is prime 9223372036854775963 is prime 9223372036854775907 is prime 9223372036854776243 is prime 9223372036854775837 is prime 9223372036854776261 is prime 9223372036854776293 is prime 9223372036854776299 is prime 9223372036854776407 is prime 9223372036854776393 is prime 9223372036854776711 is prime 9223372036854776351 is prime 9223372036854776687 is prime 9223372036854776693 is prime 9223372036854776657 is prime 9223372036854776803 is prime 9223372036854776561 is prime real 1m24.828s user 18m10.744s sys 0m0.436s

1m24.828s - That's a pretty good speedup, considering all we had to do was understand what kind of work our algorithm was doing, and tweak how we use OMP with a single line change!

* All benchmarks were run on a an AMD Ryzen 7 2700X Eight-Core Processor, running Ubuntu 18.04

This blog is about whatever I find interesting at this very moment. I'll post how-to articles on occasion and various lessons that I've learned in my career.

I am a software developer from Des Moines Iowa. I graduated from the University of Iowa with a B.S. in Computer Science.

I am interested in math, compilers, and interpreters, and I always strive to understand how various systems work from the ground up.

The languages I use most often include

- C++
- Python
- C
- Rust

I like working in a UNIX or Linux environment.