LFortran Openmp
Table of Contents
Introduction
The next step in development of LFortran is to add support for openmp, we have to support it in AST and ASR.
Approach
Let’s take the following and document approach on how shall we work on this.
subroutine simple(n, a, b)
integer :: i, n
real :: a(n), b(n)
!$omp parallel private(i)
!$omp do
do i = 2, n
    b(i) = (a(i) + a(i-1)) / 2.0
end do
!$omp end do
!$omp end parallel
end subroutine
GFortran parse tree
When we do gfortran e.f90 -fopenmp -fdump-tree-all, in a-e.f90.017t.ompexp we get the following
GFortran Parse tree
OMP region tree
bb 2: gimple_omp_parallel
    bb 3: gimple_omp_for
    bb 4: GIMPLE_OMP_CONTINUE
    bb 5: GIMPLE_OMP_RETURN
bb 6: GIMPLE_OMP_RETURN
Added new low gimple function simple_._omp_fn.0 to callgraph
Introduced new external node (__builtin_omp_get_num_threads/2).
Introduced new external node (__builtin_omp_get_thread_num/3).
;; Function simple_._omp_fn.0 (simple_._omp_fn.0, funcdef_no=1, decl_uid=3028, cgraph_uid=2, symbol_order=1)
void simple_._omp_fn.0 (struct .omp_data_s.8 & restrict .omp_data_i)
{
  real(kind=4)[0:D.3037] * restrict b [value-expr: .omp_data_i->b];
  real(kind=4)[0:D.3038] * restrict a [value-expr: .omp_data_i->a];
  integer(kind=4) & restrict n [value-expr: .omp_data_i->n];
  real(kind=4)[0:D.3037] * D.3091;
  real(kind=4) D.3090;
  integer(kind=8) D.3089;
  integer(kind=8) D.3088;
  real(kind=4) D.3087;
  real(kind=4) D.3086;
  real(kind=4) D.3085;
  real(kind=4)[0:D.3038] * D.3084;
  integer(kind=8) D.3083;
  integer(kind=8) D.3082;
  integer(kind=4) D.3081;
  real(kind=4) D.3080;
  real(kind=4)[0:D.3038] * D.3079;
  integer(kind=8) D.3078;
  integer(kind=8) D.3077;
  integer(kind=4) D.3076;
  integer(kind=4) i;
  integer(kind=4) D.3074;
  integer(kind=4) D.3073;
  integer(kind=4) D.3072;
  integer(kind=4) tt.11;
  integer(kind=4) q.10;
  integer(kind=4) D.3069;
  integer(kind=4) D.3068;
  integer(kind=4) D.3067;
  integer(kind=4) D.3066;
  integer(kind=4) D.3065;
  integer(kind=4) D.3064;
  integer(kind=4) & D.3063;
  <bb 11> :
  <bb 3> :
  D.3063 = .omp_data_i->n; // n address
  D.3064 = *D.3063; // n value
  D.3065 = D.3064; // n value
  D.3066 = __builtin_omp_get_num_threads (); // num_threads
  D.3067 = __builtin_omp_get_thread_num (); // thread_id
  D.3068 = D.3065 + 1; // n + 1
  D.3069 = D.3068 + -2; // n - 1
  q.10 = D.3069 / D.3066; // n - 1 / num_threads --> how much portion to each thread
  tt.11 = D.3069 % D.3066; // n - 1 % num_threads --> portion left
  if (D.3067 < tt.11) // --> not sure what this is doing
    goto <bb 9>; [25.00%]
  else
    goto <bb 10>; [75.00%]
  <bb 10> :
  D.3072 = q.10 * D.3067; // chunk * thread_id
  D.3073 = D.3072 + tt.11;
  D.3074 = D.3073 + q.10;
  if (D.3073 >= D.3074)
    goto <bb 5>; [INV]
  else
    goto <bb 8>; [INV]
  <bb 8> :
  i = D.3073 + 2;
  D.3076 = D.3074 + 2;
  <bb 4> :
  D.3077 = (integer(kind=8)) i;
  D.3078 = D.3077 + -1;
  D.3079 = .omp_data_i->a; // a
  D.3080 = MEM <real(kind=4)[0:D.3040]> [(real(kind=4)[0:D.3040] *)D.3079][D.3078]; // a(i)
  D.3081 = i + -1; // i - 1
  D.3082 = (integer(kind=8)) D.3081;
  D.3083 = D.3082 + -1;
  D.3084 = .omp_data_i->a;
  D.3085 = MEM <real(kind=4)[0:D.3040]> [(real(kind=4)[0:D.3040] *)D.3084][D.3083]; // a(i-1)
  D.3086 = D.3080 + D.3085; // a(i) + a(i-1)
  D.3087 = ((D.3086));
  D.3088 = (integer(kind=8)) i;
  D.3089 = D.3088 + -1;
  D.3090 = D.3087 / 2.0e+0; // (a(i) + a(i-1)) / 2.0
  D.3091 = .omp_data_i->b; // b
  MEM <real(kind=4)[0:D.3041]> [(real(kind=4)[0:D.3041] *)D.3091][D.3089] = D.3090; // b = (a(i) + a(i-1)) / 2.0
  i = i + 1;
  if (i < D.3076) // if iterator less than a chunk in thread
    goto <bb 4>; [INV]
  else
    goto <bb 5>; [INV]
  <bb 5> :
  <bb 6> :
  return;
  <bb 9> :
  tt.11 = 0;
  q.10 = q.10 + 1;
  goto <bb 10>; [100.00%]
}
......
// called as 
D.3011 = size.7_8 * 4;
.omp_data_o.9.n = n;
.omp_data_o.9.a = a;
.omp_data_o.9.b = b;
__builtin_GOMP_parallel (simple_._omp_fn.0, &.omp_data_o.9, 0, 0);
.omp_data_o.9 = {CLOBBER};
OpenMP APIs
In runtime library, we can see these functions 0000000000073128 T _GOMP_parallel, 00000000000805d0 T _omp_get_num_threads and 0000000000080ae8 T _omp_get_thread_num which can be simply C interfaces with arguments as ( CPtr? )
I have added comments to understand what gfortran guys are doing, although I did not get what they are doing at if (D.3067 < tt.11). Rest code logic is simply splitting range ( 2, n ) into num_threads and then applying loop normally. All these inside __builting_GOMP_parallel.
DoConcurrent version
Let’s take DoConcurrent version of this and brainstorm design for pass.
DoConcurrent version of this will look like
subroutine simple(n, a, b)
integer :: i, n
real :: a(n), b(n)
do concurrent (i = 2:n)
b(i) = (a(i) + a(i-1)) / 2.0
end do
end subroutine
Compiler Flag
We may need to add a compiler flag --openmp to link, if enabled will execute the pass, else skip it.
We do have one --openmp Enable openmp, that is good.
Pass replace_openmp
Pass:
replace_openmp
DoConcurrent is a statement, we will create a simple ASR::BaseWalkVisitor with special handling for DoConcurrent statements.
visit_DoConcurrent will do the following:
- will trap all do concurrent loops
- identify privateandsharedvariables
- create function like _omp_fnand populate it
- create function _GOMP_parallel( BindC ) with required arguments
- add a new subroutine call to _GOMP_parallelin body
Execution and Testing
To test it in integration test, we can generate executable using
lfortran -c a.f90 -o a.o
and then link libopenmp.dylib, LFortran runtime library and execute it. We can follow symbolics there.