2024-10-29 (Edited: 2025-05-16)

#pragma omp flush
A simple experiment to figure out how flush works

1. Experiment

Let's look at the following source code, compiled with the flags -O3 -fopenmp:

#include <stdio.h>

int main() {
    int i = 1;           // declared outside the parallel region, so shared
    #pragma omp parallel
    {
        i++;             // unsynchronized update: an intentional data race,
                         // since we only care about the generated code
        #pragma omp flush
        i *= i;
        i--;
        printf("%d\n", i);
    }
    return 0;
}

1.1. Without flush

If we omit the #pragma omp flush directive, we see that the variable i is loaded from memory into the eax register once, some computations are performed in registers (imul, sub), and then the value is written back to memory with another mov. [godbolt.org]

.LC0:
        .string "%d\n"
main._omp_fn.0:
        mov     eax, DWORD PTR [rdi]
        lea     esi, [rax+1]
        xor     eax, eax
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf
main:
        sub     rsp, 24
        xor     ecx, ecx
        xor     edx, edx
        mov     edi, OFFSET FLAT:main._omp_fn.0
        lea     rsi, [rsp+12]
        mov     DWORD PTR [rsp+12], 1
        call    GOMP_parallel
        xor     eax, eax
        add     rsp, 24
        ret

For the Arm architecture, the assembly for the parallel block is as follows [godbolt.org]:

main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r1, [r3]
        adds    r1, r1, #1
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf

Here too, the steps are the same:

  • ldr r1, [r3] : first, i (at [r3]) is loaded from memory into register r1
  • then the computation is performed entirely in registers
  • finally, the result is stored back to memory with str r1, [r3]

1.2. With flush

With the #pragma omp flush directive, we see that the increment of i is performed directly in memory (add DWORD PTR [rdi], 1), followed by an atomic or of zero into the stack (lock or QWORD PTR [rsp], 0); this locked no-op read-modify-write acts as a full memory fence on x86. Then, as before, i is fetched, the remaining computations are performed (imul, sub), and the value is written back to memory. [godbolt.org]

main._omp_fn.0:
        add     DWORD PTR [rdi], 1
        xor     eax, eax
        lock or QWORD PTR [rsp], 0
        mov     esi, DWORD PTR [rdi]
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf

For the Arm architecture, the assembly does similar work [godbolt.org]:

main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r2, [r3]
        adds    r2, r2, #1
        str     r2, [r3]
        dmb     ish
        ldr     r1, [r3]
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf

  • ldr r2, [r3] : first, i is fetched from memory into register r2
  • adds r2, r2, #1 : the addition is performed in r2
  • str r2, [r3] : the result is saved back to memory
  • dmb ish : a data memory barrier that prevents the CPU from reordering memory accesses across it; this is the fence the flush inserts (see the sketch after this list)
  • ldr r1, [r3] : the value of i is fetched from memory again
  • the remaining computation (multiplication and subtraction) is performed in registers
  • and the result is stored back to memory once more.
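
The fence that flush inserts can be reproduced without OpenMP using GCC's __sync_synchronize() builtin, which emits the same kind of full barrier (dmb ish on Arm; mfence or a lock-prefixed no-op on x86, depending on the compiler version). Here is a minimal single-threaded sketch of the store/fence/reload pattern; the builtin is my stand-in for the pragma, not necessarily what GCC expands it to:

#include <stdio.h>

int i = 1; // global, so the fence's memory clobber applies to it

int main() {
    i++;                  // the barrier below forces this update out to memory
    __sync_synchronize(); // full barrier: dmb ish on Arm; mfence or a
                          // lock-prefixed no-op on x86
    i *= i;               // i must be reloaded from memory after the barrier
    i--;
    printf("%d\n", i);
    return 0;
}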

2. Analysis

The above experiment shows that #pragma omp flush causes two changes:

  • before the flush, variables modified in registers are written back to memory;
  • after the flush, variables are reloaded from memory into registers (if the subsequent instructions use them); the handshake sketch below shows the pattern this enables.
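
This store-then-reload behaviour is exactly what the classic producer-consumer handshake relies on. A minimal sketch of that pattern (the variable names and the num_threads(2) choice are mine; production code should also use atomics to avoid the data race on flag):

#include <stdio.h>
#include <omp.h>

int main() {
    int data = 0, flag = 0;
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {   // producer
            data = 42;
            #pragma omp flush              // push data out to memory first
            flag = 1;
            #pragma omp flush              // then make flag visible
        } else {                           // consumer
            do {
                #pragma omp flush          // force flag to be re-read
            } while (flag == 0);
            #pragma omp flush              // force data to be re-read
            printf("%d\n", data);          // prints 42
        }
    }
    return 0;
}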

After the mov, the CPU's cache coherence mechanisms are responsible for making the value consistent among the different cores' caches and the memory. However, certain microarchitectural optimizations get in the way of this simple picture, which is why the stores and reloads alone are not enough and memory barriers are also needed.

A CPU always sees its own operations as if they happened in program order. The need for memory barriers arises in shared-memory multiprocessor (SMP) systems. SMPs have cache coherency protocols that maintain a coherent view of memory among the processor caches, but as an optimization on top of those protocols, CPUs also have store queues and invalidate queues. It is really the store queues and invalidate queues that create the need for memory barriers.

Stores are placed in the store queue if the cache block is either not in the cache, or is in the cache but in the shared state. The store queue saves the CPU from stalling while a store completes. For subsequent loads, the CPU forwards data from the store queue, so it still sees its own operations in program order; other CPUs, however, can see the writes in a different order. A write memory barrier ensures that all writes before it have completed before any later writes proceed, so other CPUs will observe the writes preceding the barrier before the writes that follow it.

When a CPU needs to write to a block, it needs exclusive access, so it sends an invalidate message to all other CPUs and waits for their acknowledgements before updating the value. A cache might be too busy to handle an invalidate message right away, so as a further optimization the microarchitecture can have an invalidate queue: incoming invalidate messages are acknowledged immediately and placed in the queue, but the corresponding cache lines are not invalidated yet. The CPU can thus be oblivious to the fact that a load from its cache is actually stale. A read memory barrier ensures that all pending invalidates in the queue are processed before any subsequent load is allowed to complete, which guarantees that the CPU reads up-to-date data.

Take an example from Memory Barriers: A Hardware View for Software Hackers [pg. 8], using the MESI protocol (Modified, Exclusive, Shared, Invalid):

// Assume that initially b resides in CPU 0
// and a resides in CPU 1
void foo(void) { // Run by CPU 0
  a = 1;
  write_barrier(); // Otherwise b will be written while a remains in store queue
  b = 1;
}

void bar(void) { // Run by CPU 1
  while (b == 0) continue;
  // Otherwise invalidate for a might still be in invalidate queue
  // and the cache would have old value of a, still marked as valid
  read_barrier();
  assert(a == 1);
}
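
For comparison (my addition, not from the paper), the same ordering can be expressed with C11 atomics, where a release store stands in for write_barrier() and an acquire load stands in for read_barrier():

#include <stdatomic.h>
#include <assert.h>

int a = 0;
atomic_int b = 0;

void foo(void) { // Run by CPU 0
    a = 1;
    // Release store: the store to a is ordered before the store to b,
    // playing the role of write_barrier() above.
    atomic_store_explicit(&b, 1, memory_order_release);
}

void bar(void) { // Run by CPU 1
    // Acquire load: once b == 1 is observed, the store to a is also
    // visible, playing the role of read_barrier() above.
    while (atomic_load_explicit(&b, memory_order_acquire) == 0)
        continue;
    assert(a == 1); // guaranteed to hold
}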

You can send your feedback and queries here