
Date: <2024-10-29 Tue>

#pragma omp flush
A simple experiment to figure out how flush works

1. Experiment

Let's look at the following source code, compiled with the flags -O3 -fopenmp.

#include <stdio.h>

int main() {
    int i = 1;
    #pragma omp parallel
    {
        i++;
        #pragma omp flush
        i*=i;
        i--;
        printf("%d\n", i);
    }
    return 0;
}

1.1. Without flush

If we omit the #pragma omp flush directive, we see that the variable i is loaded from memory into the eax register once, some computations are performed on registers (imul, sub), and then the value is written back to memory with another mov. [godbolt.org]

.LC0:
        .string "%d\n"
main._omp_fn.0:
        mov     eax, DWORD PTR [rdi]
        lea     esi, [rax+1]
        xor     eax, eax
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf
main:
        sub     rsp, 24
        xor     ecx, ecx
        xor     edx, edx
        mov     edi, OFFSET FLAT:main._omp_fn.0
        lea     rsi, [rsp+12]
        mov     DWORD PTR [rsp+12], 1
        call    GOMP_parallel
        xor     eax, eax
        add     rsp, 24
        ret

For the ARM architecture, the assembly for the parallel block is as follows [godbolt.org]:

main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r1, [r3]
        adds    r1, r1, #1
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf

Here too, the steps are the same:

  • ldr r1, [r3] : first i (at [r3]) is loaded from memory to a register r1
  • then computation is performed using the registers
  • finally, the result is stored back to memory str r1, [r3]
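
In C terms, the outlined region without the flush behaves roughly like the sketch below (the function name is hypothetical; the comments map each statement to the x86-64 instructions above). The key point is that i lives in a register for the whole region, so other threads can never observe the intermediate values.

```c
#include <stdio.h>

/* Hypothetical sketch of what main._omp_fn.0 effectively does
 * without the flush: one load, register-only arithmetic, one store. */
void omp_fn_without_flush(int *ip) {
    int r = *ip;   /* single load:  mov eax, DWORD PTR [rdi]  */
    r = r + 1;     /* i++           (lea esi, [rax+1])        */
    r = r * r;     /* i *= i        (imul esi, esi)           */
    r = r - 1;     /* i--           (sub esi, 1)              */
    *ip = r;       /* single store: mov DWORD PTR [rdi], esi  */
    printf("%d\n", r);
}
```

Calling it with i = 1 stores and prints 3, matching what a single thread of the original program computes.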

1.2. With flush

With the #pragma omp flush directive we see that the addition on variable i is performed directly in memory (add DWORD PTR [rdi], 1), and then an atomic or operation (atomic because of the lock prefix) is performed on the stack location [rsp]. Then, as usual, the variable i is fetched, the remaining computations are performed (imul, sub), and the value is written back to memory. [godbolt.org]

main._omp_fn.0:
        add     DWORD PTR [rdi], 1
        xor     eax, eax
        lock or QWORD PTR [rsp], 0
        mov     esi, DWORD PTR [rdi]
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf

For the ARM architecture, the assembly does similar work [godbolt.org]:

main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r2, [r3]
        adds    r2, r2, #1
        str     r2, [r3]
        dmb     ish
        ldr     r1, [r3]
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf

  • ldr r2, [r3] : first, i is fetched from memory into register r2
  • adds r2, r2, #1 : the addition is performed in r2
  • str r2, [r3] : the result is saved back to memory
  • dmb ish : a memory barrier instruction that prevents the CPU from reordering memory accesses across it
  • ldr r1, [r3] : the value of i is fetched from memory again
  • the remaining computation (multiplication and subtraction) is performed
  • and the result is stored back to memory again.
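
The same region with the flush can be sketched in C like this (the function name is hypothetical; __atomic_thread_fence(__ATOMIC_SEQ_CST) is a GCC/Clang builtin that compiles to a full barrier, which is what the lock or on x86-64 and the dmb ish on ARM provide):

```c
#include <stdio.h>

/* Hypothetical sketch of the same region with the flush: the increment
 * is committed to memory, a full barrier runs, and i is re-loaded. */
void omp_fn_with_flush(int *ip) {
    *ip += 1;                                /* add DWORD PTR [rdi], 1   */
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* lock or [rsp], 0 / dmb ish */
    int r = *ip;                             /* re-load: mov esi, [rdi]  */
    r = r * r;                               /* imul esi, esi            */
    r = r - 1;                               /* sub esi, 1               */
    *ip = r;                                 /* mov DWORD PTR [rdi], esi */
    printf("%d\n", r);
}
```

Single-threaded, the result is the same as before (i = 1 still ends up as 3); the difference is that the store, the barrier, and the re-load are now visible in the generated code.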

2. Analysis

The above experiment shows that #pragma omp flush causes two changes:

  • before the flush, the values modified in registers are written back to memory.
  • after the flush, the variables are reloaded from memory into registers (if the subsequent instructions require them)

After the store, the CPU's cache-coherence mechanism is responsible for making the value consistent across the different cores' caches and memory.
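
These two effects, write-back before and re-load after, are what the classic flush handshake between threads relies on. A hypothetical sketch (the function name is mine; the pattern is adapted from the well-known OpenMP producer/consumer example):

```c
/* Hypothetical sketch: one section publishes data and raises a flag,
 * the other spins on the flag and then re-reads the data. The flushes
 * force the write-back / re-load at the right points. */
int run_handshake(void) {
    int data = 0, flag = 0, seen = 0;
    #pragma omp parallel sections shared(data, flag, seen)
    {
        #pragma omp section      /* producer */
        {
            data = 42;
            #pragma omp flush(data)   /* publish data before the flag */
            flag = 1;
            #pragma omp flush(flag)
        }
        #pragma omp section      /* consumer */
        {
            int f;
            do {
                #pragma omp flush(flag)  /* re-read flag from memory */
                f = flag;
            } while (!f);
            #pragma omp flush(data)      /* re-read data after seeing the flag */
            seen = data;
        }
    }
    return seen;   /* 42 once the handshake completes */
}
```

Without the flushes, the compiler would be free to keep flag in a register in the consumer and spin forever, exactly as the single-load assembly in section 1.1 suggests.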

I am not sure why the atomic or operation (lock or QWORD PTR [rsp], 0) on x86-64 and the dmb ish instruction on ARM are required between the first store and the second load. I guess it is to prevent out-of-order and speculative execution (link1, arm docs).

Reasoning:

  • due to the out-of-order and speculative execution done by modern CPUs, a load instruction later in the code may be performed in parallel with, or even before, earlier instructions.
  • this speculative execution can also cause the cache to be updated in a different order than one would expect from reading the code
  • a memory barrier (e.g. dmb on ARM, lock or on x86-64) prevents the CPU from reordering memory accesses across it

But still, when accessing the same memory location, the CPU, I assume, should not reorder the store and the subsequent load to that location.

  • Maybe it does not wait for the load before executing further instructions because it had just written to it? I guess not.
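
One plausible explanation (my assumption, not something the assembly proves): the barrier is not about ordering the store and load to i itself, but about ordering them relative to other memory locations and other threads. Because of store buffers, x86-64 and ARM both allow a store to become visible to other cores after a later load of a different location; this is the classic "store buffering" litmus test, and a full barrier is what forbids it. The sketch below (names are mine) runs that test with fences in place, using pthreads and the GCC __atomic_thread_fence builtin:

```c
/* Sketch of the store-buffering litmus test with full barriers.
 * Without fences, both threads may read 0 (each load overtakes the
 * other thread's store). With a full fence, like lock or / dmb ish,
 * between the store and the load, that outcome is forbidden. */
#include <pthread.h>
#include <stddef.h>

static volatile int x, y, r1, r2;

static void *thread_a(void *arg) {
    (void)arg;
    x = 1;
    __atomic_thread_fence(__ATOMIC_SEQ_CST); /* drain store buffer */
    r1 = y;
    return NULL;
}

static void *thread_b(void *arg) {
    (void)arg;
    y = 1;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    r2 = x;
    return NULL;
}

/* Returns how often both threads read 0; with the fences, always 0. */
int count_both_zero(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; i++) {
        x = y = r1 = r2 = 0;
        pthread_t a, b;
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            both_zero++;
    }
    return both_zero;
}
```

If this reading is right, the lock or / dmb ish after the flush is there to guarantee that the freshly stored i is visible to other threads before any subsequent loads execute, not to order the store and load of i within one thread.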

You can send your feedback and queries here