Multithreading With OMP Tools

Introduction

In the last post, we discussed the parallel construct and left off at Execution Model Events. Reading through them, I found that they involve a lot of thread dispatching, so in this blog I’ll discuss multithreading and how a small OpenMP example can be rewritten using it.

Also, I have a few questions in mind that I’ll post at the end.

Multithreading

We all know why and when to use threading. In C++ it is provided by the <thread> header, and a thread can be created with the following syntax:

std::thread thread_object(callable);

std::thread is the thread class that represents a single thread in C++. To start a thread, we simply need to create a new thread object and pass the code to be executed (i.e., a callable object) into the constructor of the object. Once the object is created, a new thread is launched which will execute the code specified in the callable. A callable can be any of the following five:

  • A Function Pointer
  • A Lambda Expression
  • A Function Object
  • Non-Static Member Function
  • Static Member Function

Launching a Thread Using a Function Pointer

A function pointer can be a callable object to pass to the std::thread constructor for initializing a thread. The following code snippet demonstrates how it is done.

void foo(param)
{ 
  Statements; 
}
// The arguments to the function are passed after the function pointer
std::thread thread_obj(foo, params);
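For instance, a minimal complete version of the above (the function name greet and the argument 42 are just placeholders I picked) would be:

#include <iostream>
#include <thread>

// Placeholder function used as the thread's callable
void greet(int id) {
    std::cout << "Hello from thread object, got " << id << std::endl;
}

int main() {
    // The argument for greet is passed after the function pointer
    std::thread thread_obj(greet, 42);
    // Wait for the launched thread to finish before main returns
    thread_obj.join();
    return 0;
}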

Examples

Example 1

For our understanding, I’ll use the following toy example:

OpenMP

#include <stdio.h>
#include <omp.h>

int main() {
    int thread_id;

    printf("omp_get_max_threads(): %d\n", omp_get_max_threads());
    #pragma omp parallel private(thread_id)
    {
        thread_id = omp_get_thread_num();
        printf("Hello from thread: %d\n", thread_id);
    }
    return 0;
}

One can execute this using:

% /opt/homebrew/bin/g++-13 -fopenmp a.cpp && ./a.out
omp_get_max_threads(): 8
Hello from thread: 1
Hello from thread: 5
Hello from thread: 6
Hello from thread: 3
Hello from thread: 2
Hello from thread: 4
Hello from thread: 0
Hello from thread: 7
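A quick note on the private(thread_id) clause: it gives every thread its own copy of thread_id so the assignments don’t race. Declaring the variable inside the region does the same thing, so an equivalent sketch of just the parallel region would be:

#pragma omp parallel
{
    // Variables declared inside the region are automatically private to each thread
    int thread_id = omp_get_thread_num();
    printf("Hello from thread: %d\n", thread_id);
}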

std::thread

This can be very easily translated to:

#include <iostream>
#include <thread>
#include <mutex>
#include <omp.h>
using namespace std;

std::mutex mtx;

void helloFromThread(int thread_id) {
    mtx.lock();
    std::cout << "Hello from thread: " << thread_id << std::endl;
    mtx.unlock();
}

int main() {
    const int num_threads = omp_get_max_threads();

    std::cout << "omp_get_max_threads(): " << omp_get_max_threads() << "\n";

    // Array of threads sized at runtime; this relies on a compiler extension (VLA)
    std::thread threads[num_threads];

    for (int i = 0; i < num_threads; ++i) {
        threads[i] = std::thread(helloFromThread, i);
    }

    for (int i = 0; i < num_threads; ++i) {
        threads[i].join();
    }

    return 0;
}

To execute this, you will have to use the -pthread flag, and the command will be:

% /opt/homebrew/bin/g++-13 a-openmp.cpp -fopenmp -pthread && ./a.out
omp_get_max_threads(): 8
Hello from thread: 0
Hello from thread: 1
Hello from thread: 3
Hello from thread: 4
Hello from thread: 5
Hello from thread: 6
Hello from thread: 7
Hello from thread: 2

To ensure that the output from different threads doesn’t get mixed up, we use a mutex; the mutex synchronizes access to the output stream.
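As a side note (a sketch of my own, not part of the example above), the manual lock()/unlock() pair can be replaced with std::lock_guard, which releases the mutex automatically when the scope ends, even if the code in between throws:

#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx;

void helloFromThread(int thread_id) {
    // The guard locks mtx on construction and unlocks it when it goes out of scope
    std::lock_guard<std::mutex> lock(mtx);
    std::cout << "Hello from thread: " << thread_id << std::endl;
}

int main() {
    std::thread t1(helloFromThread, 0), t2(helloFromThread, 1);
    t1.join();
    t2.join();
    return 0;
}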

POSIX threads

One may definitely think: we have already achieved multithreading using std::thread, so what is the need to read about POSIX threads? As per what Stack Overflow suggests:

If you want to run code on many platforms, go for Posix Threads. They are available almost everywhere and are quite mature. On the other hand if you only use Linux/gcc std::thread is perfectly fine - it has a higher abstraction level, a really good interface and plays nicely with other C++11 classes.

The C++11 std::thread class unfortunately doesn’t work reliably (yet) on every platform, even if C++11 seems available. For instance in native Android std::thread or Win64 it just does not work or has severe performance bottlenecks (as of 2012).

A good replacement is boost::thread - it is very similar to std::thread (actually it is from the same author) and works reliably, but, of course, it introduces another dependency from a third party library.

Now, knowing the requirement for POSIX threads, this is how the modified code looks:

#include <iostream>
#include <pthread.h>
#include <omp.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

void* helloFromThread(void* arg) {
    int thread_id = *((int*)arg);
    pthread_mutex_lock(&mtx);
    std::cout << "Hello from thread: " << thread_id << std::endl;
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main() {
    const int num_threads = omp_get_max_threads();
    pthread_t threads[num_threads];
    int thread_ids[num_threads];

    std::cout << "omp_get_max_threads(): " << omp_get_max_threads() << "\n";

    for (int i = 0; i < num_threads; ++i) {
        thread_ids[i] = i;
        pthread_create(&threads[i], NULL, helloFromThread, (void*)&thread_ids[i]);
    }

    for (int i = 0; i < num_threads; ++i) {
        pthread_join(threads[i], NULL);
    }

    return 0;
}

You now just have to include <pthread.h>, use a pthread_mutex_t initialized with PTHREAD_MUTEX_INITIALIZER instead of std::mutex, and apply the corresponding changes wherever the mutex and threads are used.
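Roughly, the correspondence I used between the two APIs is (a quick, non-exhaustive reference):

std::mutex mtx;            →  pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
mtx.lock();                →  pthread_mutex_lock(&mtx);
mtx.unlock();              →  pthread_mutex_unlock(&mtx);
std::thread t(fn, arg);    →  pthread_create(&t, NULL, fn, (void*)&arg);
t.join();                  →  pthread_join(t, NULL);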

% time /opt/homebrew/bin/g++-13 a-pthread-openmp.cpp -fopenmp -pthread && ./a.out
/opt/homebrew/bin/g++-13 a-pthread-openmp.cpp -fopenmp -pthread  0.17s user 0.16s system 78% cpu 0.432 total
omp_get_max_threads(): 8
Hello from thread: 0
Hello from thread: 3
Hello from thread: 2
Hello from thread: 4
Hello from thread: 1
Hello from thread: 5
Hello from thread: 6
Hello from thread: 7

Example 2

Increasing the complexity of the example, here is a C++ program that uses #pragma omp parallel together with a worksharing #pragma omp for to assign a value to an array of length 1000000.

OpenMP

#include <stdio.h>
#include <omp.h>

void initialize_array( int n, float *a, float val ) {
	int i;
	#pragma omp parallel
	{
		// i is the loop variable of the omp for below, so it is implicitly private
		#pragma omp for
		for (i = 0; i < n; i++) {
			a[i] = val;
		}
	}
}

int main() {
	float a[1000000];
	initialize_array(1000000, a, 12.91);
}
% time /opt/homebrew/bin/g++-13 normal.cpp -fopenmp && ./a.out
/opt/homebrew/bin/g++-13 normal.cpp -fopenmp  0.07s user 0.16s system 67% cpu 0.343 total
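Incidentally, the nested parallel + for pair above can also be written with the combined parallel for construct, which does the same worksharing in a single directive; an equivalent sketch of the function would be:

void initialize_array( int n, float *a, float val ) {
	// Combined construct: creates the team and splits the loop in one directive
	#pragma omp parallel for
	for (int i = 0; i < n; i++) {
		a[i] = val;
	}
}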

std::thread

The translated version of the example is shown below:

#include <iostream>
#include <omp.h>
#include <thread>
#include <mutex>

std::mutex mtx; // Declare a mutex

void initialize_array(int n, float *a, float val, int start, int end) {
    for (int i = start; i < end; ++i) {
        // Lock the mutex before accessing the shared array
        mtx.lock();
        a[i] = val;
        // Unlock the mutex after modifying the array
        mtx.unlock();
    }
}

int main() {
    const int array_size = 1000000;
    int num_threads = omp_get_max_threads();

    float a[array_size];

    int chunk_size = array_size / num_threads;

    std::thread threads[num_threads];
    for (int i = 0; i < num_threads; ++i) {
        int start = i * chunk_size;
        int end = (i == num_threads - 1) ? array_size : (i + 1) * chunk_size;
        threads[i] = std::thread(initialize_array, array_size, a, 12.91f, start, end);
    }

    // Join threads
    for (int i = 0; i < num_threads; ++i) {
        threads[i].join();
    }

    return 0;
}
% time /opt/homebrew/bin/g++-13 b.cpp -fopenmp -pthread && ./a.out
/opt/homebrew/bin/g++-13 b.cpp -fopenmp -pthread  0.24s user 0.05s system 107% cpu 0.271 total
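One thing worth noting about this translation: each thread writes a disjoint [start, end) chunk, so the per-element mutex isn’t actually required for correctness, and it serializes the writes. A mutex-free sketch of the worker (the name initialize_array_chunk is just mine) would be:

// Each thread owns its own [start, end) range, so no locking is needed
void initialize_array_chunk(float *a, float val, int start, int end) {
    for (int i = start; i < end; ++i) {
        a[i] = val;
    }
}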

POSIX threads

Using POSIX threads, it looks like:

#include <iostream>
#include <omp.h>
#include <pthread.h>

pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

struct ThreadArgs {
    float *array;
    float value;
    int start;
    int end;
};


void* initialize_array(void* arg) {
    ThreadArgs* args = static_cast<ThreadArgs*>(arg);

    for (int i = args->start; i < args->end; ++i) {
        pthread_mutex_lock(&mtx);
        args->array[i] = args->value;
        pthread_mutex_unlock(&mtx);
    }

    delete args;
    return nullptr;
}

int main() {
    int num_threads = omp_get_max_threads();

    float a[1000000];

    int chunk_size = 1000000 / num_threads;

    pthread_t threads[num_threads];
    for (int i = 0; i < num_threads; ++i) {
        ThreadArgs* args = new ThreadArgs();
        args->array = a;
        args->value = 12.91f;
        args->start = i * chunk_size;
        args->end = (i == num_threads - 1) ? 1000000 : (i + 1) * chunk_size;
        pthread_create(&threads[i], nullptr, initialize_array, static_cast<void*>(args));
    }

    for (int i = 0; i < num_threads; ++i) {
        pthread_join(threads[i], nullptr);
    }

    return 0;
}
% time /opt/homebrew/bin/g++-13 b-pthread-openmp.cpp -fopenmp -pthread && ./a.out
/opt/homebrew/bin/g++-13 b-pthread-openmp.cpp -fopenmp -pthread  0.17s user 0.05s system 106% cpu 0.207 total

Question

  • I see no built-in way to execute a parallel construct via the runtime library routines provided by OpenMP. The GNU and LLVM folks have developed their own wrappers over these runtime libraries, so I think we will have to do the same, and if so, was the above example a correct direction to proceed?

References