`lock addl $0, (%esp)` is a substitute for `mfence`, not `lfence`.
(`lock add` is generally faster on modern CPUs, especially on Intel Skylake with updated microcode, where `mfence` also acts like `lfence`, blocking out-of-order execution even of instructions on registers. That's why GCC recently switched to using a dummy `lock add` instead of `mfence` when it needs a full barrier.)
The use-case is when you need to block StoreLoad reordering (the only kind that x86's strong memory model allows), but you don't need an atomic RMW operation on a shared variable. https://preshing.com/20120515/memory-reordering-caught-in-the-act/
e.g. assuming aligned `std::atomic<int> a, b`, where the default memory_order is `seq_cst`:

```asm
movl $1, a       # a = 1; atomic for aligned a
# barrier needed here between the seq_cst store and later loads
movl b, %eax     # tmp = b; atomic for aligned b
```
Your options are:
- Do a sequential-consistency store with `xchg`, e.g. `mov $1, %eax` / `xchg %eax, a`, so you don't need a separate barrier; it's part of the store. I think this is the most efficient option on most modern hardware; C++11 compilers other than gcc use `xchg` for seq_cst stores. (See Why does a std::atomic store with sequential consistency use XCHG? re: performance and correctness.)
- Use `mfence` as a barrier. (gcc used `mov` + `mfence` for seq_cst stores, but recently switched to `xchg` for performance.)
- Use `lock addl $0, (%esp)` as a barrier. Any `lock`ed instruction is a full barrier, but this one has no effect on register or memory contents except FLAGS. See Does lock xchg have the same behavior as mfence?

  (Or use some other location, but the stack is almost always private and hot in L1d, so it's a good candidate. Later reloads of whatever was using that space couldn't start until after the atomic RMW anyway, because it's a full barrier.)
You can only use `xchg` as a barrier by folding it into a store, because it unconditionally writes the memory location with a value that doesn't depend on the old value.
When possible, using `xchg` for a seq-cst store is probably best, even though it also reads from the shared location. `mfence` is slower than expected on recent Intel CPUs (Are loads and stores the only instructions that gets reordered?), also blocking out-of-order execution of independent non-memory instructions the same way `lfence` does.
It might even be worth using `lock addl $0, (%esp)/(%rsp)` instead of `mfence` even when `mfence` is available, but I haven't experimented with the downsides. Using `-64(%rsp)` or something might make it less likely to lengthen a data dependency on something hot (a local or a return address), but that can make tools like valgrind unhappy.
`lfence` is never useful for memory ordering unless you're reading from video RAM (or some other WC weakly-ordered region) with MOVNTDQA loads. Serializing out-of-order execution (but not the store buffer) isn't useful for stopping StoreLoad reordering (the only kind that x86's strong memory model allows for normal WB (write-back) memory regions).
The real-world use-cases for `lfence` are blocking out-of-order execution of `rdtsc` when timing very short blocks of code, or Spectre mitigation by blocking speculation through a conditional or indirect branch.
See also When should I use _mm_sfence, _mm_lfence, and _mm_mfence (my answer and @BeeOnRope's answer) for more about why `lfence` isn't useful and when to use each of the barrier instructions. (Or, in mine, the C++ intrinsics when programming in C++ instead of asm.)