#pragma omp flush
A simple experiment to figure out how flush works
1. Experiment
Let's look at the following source code, compiled with the flags -O3 -fopenmp:
#include <stdio.h>

int main() {
    int i = 1;
    #pragma omp parallel
    {
        i++;
        #pragma omp flush
        i *= i;
        i--;
        printf("%d\n", i);
    }
    return 0;
}
1.1. Without flush
If we omit the #pragma omp flush directive, we see that the variable i is loaded from memory into the eax register once, some computations are performed on registers (imul, sub), and then the value is written back to memory with another mov. [godbolt.org]
.LC0:
        .string "%d\n"
main._omp_fn.0:
        mov     eax, DWORD PTR [rdi]
        lea     esi, [rax+1]
        xor     eax, eax
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf
main:
        sub     rsp, 24
        xor     ecx, ecx
        xor     edx, edx
        mov     edi, OFFSET FLAT:main._omp_fn.0
        lea     rsi, [rsp+12]
        mov     DWORD PTR [rsp+12], 1
        call    GOMP_parallel
        xor     eax, eax
        add     rsp, 24
        ret
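For context, main does not execute the region itself: GCC outlines the body of the parallel region into the helper main._omp_fn.0 and hands it to the libgomp runtime via GOMP_parallel. Roughly, in C terms, the lowering looks like this (a sketch inferred from the assembly above; the exact transformation is a GCC internal detail):

#include <stdio.h>

// Sketch of GCC's lowering; the real outlined function is named main._omp_fn.0.
void GOMP_parallel(void (*fn)(void *), void *data,
                   unsigned num_threads, unsigned flags);

static void main_omp_fn_0(void *data) {
    int *i = data;   // rdi holds &i
    int t = *i + 1;  // mov eax, [rdi]; lea esi, [rax+1]
    t = t * t;       // imul esi, esi
    t = t - 1;       // sub esi, 1
    *i = t;          // mov [rdi], esi: a single store, at the end
    printf("%d\n", t);
}

int main(void) {
    int i = 1;
    // fn, data, num_threads = 0 (runtime decides), flags = 0
    GOMP_parallel(main_omp_fn_0, &i, 0, 0);
    return 0;
}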
For the ARM architecture, the assembly for the parallel block is as follows [godbolt.org]:
main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r1, [r3]
        adds    r1, r1, #1
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf
Here too, the steps are the same:
- ldr r1, [r3]: first, i (at [r3]) is loaded from memory into register r1
- then the computation is performed using registers
- finally, the result is stored back to memory with str r1, [r3]
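This caching of a variable in a register is exactly why a naive spin-wait can hang: without a flush, the compiler is free to read a shared variable once and never again. A minimal sketch (flag is a hypothetical variable set to 1 by another thread):

extern int flag;  // hypothetical: set to 1 by another thread

void wait_without_flush(void) {
    // At -O3 the load of flag can be hoisted out of the loop, so this
    // may compile to an infinite loop: flag stays in a register.
    while (flag == 0)
        ;
}

void wait_with_flush(void) {
    while (flag == 0) {
        #pragma omp flush(flag)  // forces flag to be re-read from memory
    }
}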
1.2. With flush
With the #pragma omp flush directive, we see that the increment of variable i is performed directly in memory (add DWORD PTR [rdi], 1), and then an atomic or operation (atomic because of the lock prefix) is performed on the quadword at the top of the stack ([rsp]). The or with 0 changes nothing; the locked instruction is there purely for its effect as a full memory barrier. Then, as before, the variable i is fetched, the remaining computations are performed (imul, sub), and the value is written back to memory. [godbolt.org]
main._omp_fn.0:
        add     DWORD PTR [rdi], 1
        xor     eax, eax
        lock or QWORD PTR [rsp], 0
        mov     esi, DWORD PTR [rdi]
        imul    esi, esi
        sub     esi, 1
        mov     DWORD PTR [rdi], esi
        mov     edi, OFFSET FLAT:.LC0
        jmp     printf
For the ARM architecture, the assembly does similar work [godbolt.org]:
main._omp_fn.0:
        mov     r3, r0
        movw    r0, #:lower16:.LC0
        movt    r0, #:upper16:.LC0
        ldr     r2, [r3]
        adds    r2, r2, #1
        str     r2, [r3]
        dmb     ish
        ldr     r1, [r3]
        mul     r1, r1, r1
        subs    r1, r1, #1
        str     r1, [r3]
        b       printf
- ldr r2, [r3]: first, i is fetched from memory into register r2
- adds r2, r2, #1: the addition is performed in r2
- str r2, [r3]: the result is saved back to memory
- dmb ish: a memory barrier instruction that prevents the CPU from reordering memory accesses across it (a way to get the same barrier from plain C is sketched after this list)
- ldr r1, [r3]: the value of i is fetched from memory again
- the remaining computation (multiplication and subtraction) is performed
- and the result is again stored back to memory
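The barriers we saw (lock or on x86-64, dmb ish on ARM) are what GCC emits for a sequentially consistent fence, so the effect of the flush can be reproduced with a plain compiler fence. A sketch, assuming GCC or Clang (__atomic_thread_fence is their builtin; older GCC versions may emit mfence instead of lock or):

#include <stdio.h>

int i = 1;  // global, so the compiler cannot keep it in a register across the fence

int main(void) {
    i++;
    // Sequentially consistent fence: compiles to "lock or" (or mfence) on
    // x86-64 and "dmb ish" on ARM, like the flush version above.
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
    i *= i;
    i--;
    printf("%d\n", i);
    return 0;
}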
2. Analysis
The above experiment shows that #pragma omp flush causes two changes:
- before the flush, variables modified in registers are written back to memory
- after the flush, variables are loaded from memory into registers again (if the subsequent instructions need them)
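These two effects are what make the classic producer/consumer handshake work; without the flushes, flag could stay in a register or data could be written back too late. A minimal sketch, following the point-to-point pattern from the OpenMP examples (data and flag are our own example names):

#include <stdio.h>
#include <omp.h>

int data = 0;
int flag = 0;

int main(void) {
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {   // producer
            data = 42;
            #pragma omp flush(flag, data)  // write data back before flag
            flag = 1;
            #pragma omp flush(flag)        // write flag back
        } else {                           // consumer
            #pragma omp flush(flag)
            while (flag == 0) {
                #pragma omp flush(flag)    // re-load flag every iteration
            }
            #pragma omp flush(flag, data)  // re-load data after seeing flag
            printf("data = %d\n", data);   // prints data = 42
        }
    }
    return 0;
}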
After the mov (or str) writes the value back to memory, the cache coherence mechanisms of the CPU are responsible for making the value consistent among the different cores' caches and memory. But some optimizations in the microarchitecture come into play, so coherence alone is not quite enough, and we need memory barriers: hence the lock or on x86 and the dmb ish on ARM.
A CPU will always see its own operations as if they happened in program order. The need for memory barriers arises in shared-memory multiprocessor (SMP) systems. SMPs have cache coherency protocols that maintain a coherent view of memory among the processor caches, but as an optimization for those protocols CPUs also have store queues and invalidate queues. It is really the store queues and invalidate queues that create the need for memory barriers.
Stores are placed in the store queue if the cache block is either not in the cache, or is in the cache but in shared state. The store queue saves the CPU from stalling while a store completes. For subsequent loads, the CPU forwards data from the store queue, so the CPU still sees its own operations in program order. But other CPUs can see the writes in a different order. A write memory barrier ensures that all writes before it are completed before any new writes proceed, so other CPUs will see the writes preceding the barrier before the writes that follow it.
When a CPU needs to write to a block, it needs exclusive access, so it sends an invalidate message to all other CPUs and waits for acknowledgements before updating the value. The cache might be too busy to handle an incoming invalidate message right away, so as an optimization the microarchitecture can have an invalidate queue: incoming invalidate messages are acknowledged immediately and placed in the queue, but the line is not yet invalidated in the cache. The CPU can thus be oblivious to the fact that a load from its cache is actually stale. A read memory barrier ensures that all pending invalidates in the queue are processed before any subsequent load is allowed to complete, which guarantees that the CPU reads up-to-date data.
Take an example from Memory Barriers: A Hardware View for Software Hackers [pg. 8], assuming the MESI (Modified, Exclusive, Shared, Invalid) coherence protocol:
// Assume that initially b resides in CPU 0's cache
// and a resides in CPU 1's cache

void foo(void)  // Run by CPU 0
{
    a = 1;
    write_barrier();  // Otherwise b may be written while a remains in the store queue
    b = 1;
}

void bar(void)  // Run by CPU 1
{
    while (b == 0)
        continue;
    // Otherwise the invalidate for a might still be in the invalidate queue
    // and the cache would hold the old value of a, still marked as valid
    read_barrier();
    assert(a == 1);
}
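The same pattern can be written in portable C11; the paper's write_barrier() and read_barrier() roughly correspond to release and acquire fences here. A sketch, keeping the paper's variable names:

#include <stdatomic.h>
#include <assert.h>

int a = 0;
_Atomic int b = 0;

void foo(void)  // Run by CPU 0
{
    a = 1;
    // Write barrier: the store to a may not be reordered after the store to b.
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&b, 1, memory_order_relaxed);
}

void bar(void)  // Run by CPU 1
{
    while (atomic_load_explicit(&b, memory_order_relaxed) == 0)
        continue;
    // Read barrier: the later load of a may not be reordered before the load of b.
    atomic_thread_fence(memory_order_acquire);
    assert(a == 1);
}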