#pragma omp flush
A simple experiment to figure out how flush works
1. Experiment
Let's look at the following source code, compiled with the flags `-O3 -fopenmp`.
```c
#include <stdio.h>

int main() {
    int i = 1;
    #pragma omp parallel
    {
        i++;
        #pragma omp flush
        i *= i;
        i--;
        printf("%d\n", i);
    }
    return 0;
}
```
1.1. Without flush
If we omit the `#pragma omp flush` directive, we see that the variable `i` is loaded from memory into the `eax` register once, some computations are performed in registers (`imul`, `sub`), and then the value is written back to memory with another `mov`. [godbolt.org]
```asm
.LC0:
        .string "%d\n"
main._omp_fn.0:
        mov     eax, DWORD PTR [rdi]
        lea     esi, [rax+1]
        xor     eax, eax
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf
main:
        sub     rsp, 24
        xor     ecx, ecx
        xor     edx, edx
        mov     edi, OFFSET FLAT:main._omp_fn.0
        lea     rsi, [rsp+12]
        mov     DWORD PTR [rsp+12], 1
        call    GOMP_parallel
        xor     eax, eax
        add     rsp, 24
        ret
```
For the `arm` architecture, the assembly instructions for the parallel block are as follows [godbolt.org]:
```asm
main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r1, [r3]
        adds    r1, r1, #1
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf
```
Here too, the steps are the same:
- `ldr r1, [r3]`: first `i` (at `[r3]`) is loaded from memory into register `r1`
- then the computation is performed using the registers
- finally, the result is stored back to memory with `str r1, [r3]`
1.2. With Flush
With the `#pragma omp flush` directive, we see that the addition on variable `i` is performed directly in memory (`add DWORD PTR [rdi], 1`), and then an atomic `or` operation (atomic because of the `lock` prefix) is performed on the stack location `[rsp]`. Then, as usual, the variable `i` is fetched again, the remaining computations are performed (`imul`, `sub`), and the value is written back to memory. [godbolt.org]
```asm
main._omp_fn.0:
        add     DWORD PTR [rdi], 1
        xor     eax, eax
        lock or QWORD PTR [rsp], 0
        mov     esi, DWORD PTR [rdi]
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf
```
For the `arm` architecture, the assembly does similar work [godbolt.org]:
```asm
main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r2, [r3]
        adds    r2, r2, #1
        str     r2, [r3]
        dmb     ish
        ldr     r1, [r3]
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf
```
- `ldr r2, [r3]`: first `i` is fetched from memory into register `r2`
- `adds r2, r2, #1`: the addition is performed in `r2`
- `str r2, [r3]`: the result is saved back to memory
- `dmb ish`: a memory barrier instruction that prevents the CPU from reordering memory accesses across it
- `ldr r1, [r3]`: the value of `i` is fetched from memory again
- the remaining computation (multiplication and subtraction) is performed
- and the result is again stored back to memory.
2. Analysis
The above experiment shows that `#pragma omp flush` causes two changes:
- before the flush, variables that were modified in registers are written back to memory.
- after the flush, the variables are loaded back from memory into registers (if the subsequent instructions require them).
After these `mov` operations, the CPU's cache-coherence mechanism is responsible for making the value consistent among the different cores' caches and main memory.
I am not sure why the atomic `or` operation (`lock or QWORD PTR [rsp], 0`) on x86-64 and the `dmb ish` instruction on arm are required between the first store and the second load. I guess it is to prevent out-of-order and speculative execution (link1, arm docs).
Reasoning:
- due to the out-of-order and speculative execution done by modern CPUs, a load instruction later in the code may be performed in parallel with other instructions.
- this speculative execution can also cause the cache to be updated in a different order than expected from reading the code.
- a memory barrier (e.g. `dmb` in arm, `lock or` in x86-64) prevents the CPU from reordering memory operations across it.
But still, when accessing the same memory location, the CPU, I assume, shouldn't reorder the store and the load at that location.
- Maybe it does not wait for the load to complete before executing further instructions because it had just written to that location? I guess not.