OpenMP, or OMP, is an API for parallelizing your C, C++, and Fortran code. It works via compiler pragmas, meaning that it will only work if you have a compiler that supports it. Thankfully, most modern compilers support OMP, including
Two of the most common scheduling types to use are static and dynamic. Static scheduling means that each thread gets assigned a chunk of work at the start of the parallel section. If a thread finishes early, then it will stop executing and sit idle while the other threads continue their work. This might sound extremely wasteful, but for workloads where the amount of work per unit is about the same, static may be the best solution, because it has much less overhead than dynamic scheduling. Dynamic scheduling means that threads will get assigned work from a work queue, and can get assigned more work if they finish early and there is more work to be done. This is great if you have a workload where each task may differ in the amount of time it takes to solve.
Let's create some benchmarks to see when one scheduling type will be better than the other. Below I have the base code that we will use. This is the single threaded implementation of a program that will find the primes from a range of numbers. The astute mathematicians reading this will note that this is a very naïve algorithm for finding prime numbers, and I should probably be using a sieve for better performance. This is not a post about algorithm optimization, however. I just need to simulate a workload that keeps my CPUs busy :)
1 #include <stdio.h> 2 #include <stdbool.h> 3 #include <math.h> 4 5 bool is_prime(unsigned long n) { 6 if (n < 2) { 7 return false; 8 } else { 9 unsigned long max = sqrt(n) + 1; 10 unsigned long i; 11 for (i = 2; i < max; i++) { 12 if (n % i == 0) { 13 return false; 14 } 15 } 16 return true; 17 } 18 } 19 20 int main(void) { 21 unsigned long start = 9223372036854775808UL; 22 unsigned long end = start + 1000; 23 unsigned long i; 24 25 for (i = start; i < end; i++) { 26 printf("testing: %lu\n", i); 27 if (is_prime(i)) { 28 printf("%lu is prime\n", i); 29 } 30 } 31 return 0; 32 }
$ clang -o single single.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm $ time ./single 9223372036854775837 is prime 9223372036854775907 is prime 9223372036854775931 is prime 9223372036854775939 is prime 9223372036854775963 is prime 9223372036854776063 is prime 9223372036854776077 is prime 9223372036854776167 is prime 9223372036854776243 is prime 9223372036854776257 is prime 9223372036854776261 is prime 9223372036854776293 is prime 9223372036854776299 is prime 9223372036854776351 is prime 9223372036854776393 is prime 9223372036854776407 is prime 9223372036854776561 is prime 9223372036854776657 is prime 9223372036854776687 is prime 9223372036854776693 is prime 9223372036854776711 is prime 9223372036854776803 is prime real 10m28.951s user 10m28.739s sys 0m0.020s
As you can see, I only had to add a few lines to make this program multithreaded. I simply added the pragma to tell openmp to parallelize this for loop, as well as a critical section to enforce that only a single thread can print at a time. This will keep our messages from colliding with each other.
1 #include <stdio.h> 2 #include <stdbool.h> 3 #include <math.h> 4 5 bool is_prime(unsigned long n) { 6 if (n < 2) { 7 return false; 8 } else { 9 unsigned long max = sqrt(n) + 1; 10 unsigned long i; 11 for (i = 2; i < max; i++) { 12 if (n % i == 0) { 13 return false; 14 } 15 } 16 return true; 17 } 18 } 19 20 int main(void) { 21 unsigned long start = 9223372036854775808UL; 22 unsigned long end = start + 1000; 23 unsigned long i; 24 25 #pragma omp parallel for schedule(static) 26 for (i = start; i < end; i++) { 27 if (is_prime(i)) { 28 #pragma omp critical 29 { 30 printf("%lu is prime\n", i); 31 } 32 } 33 } 34 return 0; 35 }
$ clang -o static static.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm 9223372036854776351 is prime 9223372036854775837 is prime 9223372036854775939 is prime 9223372036854776063 is prime 9223372036854776167 is prime 9223372036854776657 is prime 9223372036854776561 is prime 9223372036854776393 is prime 9223372036854776687 is prime 9223372036854776257 is prime 9223372036854776803 is prime 9223372036854775907 is prime 9223372036854776243 is prime 9223372036854775963 is prime 9223372036854776077 is prime 9223372036854776407 is prime 9223372036854776693 is prime 9223372036854776261 is prime 9223372036854775931 is prime 9223372036854776711 is prime 9223372036854776293 is prime 9223372036854776299 is prime real 2m14.245s user 15m17.936s sys 0m0.212s
2m14.245s - That's a quite a bit bit faster... but let's think about how
the complexity of our prime algorithm works. If we start with the number
96, then when we apply our is_prime
function, the first thing we divide by is 2... which has a
remainder of 0, and then we're done.
Next number, 97
Let's try 2. Is the remainder 0? no
How about 3? no
Maybe 4? no
5? no
6? no
7? no
8? no
9? no
10? no
And now we have tested numbers up to the sqrt(97), and we
can now conclude that 97 is prime!
97 was A LOT more work to test than 96 even though they were right next to each other!
This algorithm has wildly different runtimes for different inputs, so let's get smarter with how we multithread it. Let's try to keep those worker threads as busy as possible!
1 #include <stdio.h> 2 #include <stdbool.h> 3 #include <math.h> 4 5 bool is_prime(unsigned long n) { 6 if (n < 2) { 7 return false; 8 } else { 9 unsigned long max = sqrt(n) + 1; 10 unsigned long i; 11 for (i = 2; i < max; i++) { 12 if (n % i == 0) { 13 return false; 14 } 15 } 16 return true; 17 } 18 } 19 20 int main(void) { 21 unsigned long start = 9223372036854775808UL; 22 unsigned long end = start + 1000; 23 unsigned long i; 24 25 #pragma omp parallel for schedule(dynamic) 26 for (i = start; i < end; i++) { 27 if (is_prime(i)) { 28 #pragma omp critical 29 { 30 printf("%lu is prime\n", i); 31 } 32 } 33 } 34 return 0; 35 }
clang -o dynamic dynamic.c -fopenmp -Wall -Wextra -Wpedantic -O2 -lm time ./dynamic 9223372036854775939 is prime 9223372036854775931 is prime 9223372036854776167 is prime 9223372036854776063 is prime 9223372036854776257 is prime 9223372036854776077 is prime 9223372036854775963 is prime 9223372036854775907 is prime 9223372036854776243 is prime 9223372036854775837 is prime 9223372036854776261 is prime 9223372036854776293 is prime 9223372036854776299 is prime 9223372036854776407 is prime 9223372036854776393 is prime 9223372036854776711 is prime 9223372036854776351 is prime 9223372036854776687 is prime 9223372036854776693 is prime 9223372036854776657 is prime 9223372036854776803 is prime 9223372036854776561 is prime real 1m24.828s user 18m10.744s sys 0m0.436s
1m24.828s - That's a pretty good speedup, considering all we had to do was understand what kind of work our algorithm was doing, and tweak how we use OMP with a single line change!
* All benchmarks were run on a an AMD Ryzen 7 2700X Eight-Core Processor, running Ubuntu 18.04
This blog is about whatever I find interesting at this very moment. I'll post how-to articles on occasion and various lessons that I've learned in my career.
I am a software developer from Des Moines Iowa. I graduated from the University of Iowa with a B.S. in Computer Science.
I am interested in math, compilers, and interpreters, and I always strive to understand how various systems work from the ground up.
The languages I use most often include
I like working in a UNIX or Linux environment.