Discussion: atomic ops
Mateusz Guzik
2014-10-28 02:52:22 UTC
As was mentioned some time ago, our situation related to atomic ops is
not ideal.

atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
full memory barriers, which is stronger than needed.

Moreover, load is implemented as lock cmpxchg on the var address, so it is
additionally slower, especially when cpus compete.

On amd64 it is sufficient to place a compiler barrier in such cases.
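
(By a compiler barrier I mean roughly the following; this sketch assumes
the x86 __compiler_membar() definition, an empty asm with a memory clobber:)

#define __compiler_membar()	__asm __volatile(" " : : : "memory")

/* then an ordered load on amd64 can be (sketch): */
tmp = var;		/* plain load; x86-TSO already keeps loads ordered */
__compiler_membar();	/* only the compiler needs to be restrained */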

Next, we lack some atomic ops in the first place.

Let's define some useful terms:
smp_wmb - no writes can be reordered past this point
smp_rmb - no reads can be reordered past this point

With this in mind, we lack ops which would guarantee only the following:

1. var = tmp; smp_wmb();
2. tmp = var; smp_rmb();
3. smp_rmb(); tmp = var;
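
A minimal sketch of how these could look on amd64, where x86-TSO means a
compiler barrier is enough (the names are placeholders, not a proposal):

#define store_wmb(var, val)	do { (var) = (val); __compiler_membar(); } while (0)
#define load_rmb(tmp, var)	do { (tmp) = (var); __compiler_membar(); } while (0)
#define rmb_load(tmp, var)	do { __compiler_membar(); (tmp) = (var); } while (0)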

This matters since what we can already use to emulate these is way
heavier than needed on the aforementioned amd64 and most likely other archs.

It is unclear to me whether it makes sense to alter what
atomic_load_acq_* are currently doing.

The simplest thing would be to just introduce the aforementioned macros.

Unfortunately I don't have any ideas for new function names.

I was considering stealing consumer/producer wording instead of acq/rel,
but that does not help with case 1.

Also there is no common header for atomic ops.

I propose adding sys/atomic.h which includes machine/atomic.h. It would
then provide the atomic ops missing from the MD header, implemented using
what is already there.
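
A hypothetical sketch of such a header (the fallback shown is deliberately
over-strong; an MD header could predefine a cheaper version first):

/* sys/sys/atomic.h (hypothetical) */
#ifndef _SYS_ATOMIC_H_
#define _SYS_ATOMIC_H_

#include <machine/atomic.h>

/* MI fallback for an op the MD header does not provide. */
#ifndef atomic_load_rmb_int
#define atomic_load_rmb_int(p)	atomic_load_acq_int(p)	/* stronger than needed */
#endif

#endif /* !_SYS_ATOMIC_H_ */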

For an example where it could be useful see
https://svnweb.freebsd.org/base/head/sys/sys/seq.h?view=markup
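
Roughly, the reader side there wants only read ordering (a sketch modeled
on that header; the fde_seq/fde_file field names from the fget_unlocked
consumer are an assumption here):

seq_t seq;
struct file *fp;

for (;;) {
        seq = seq_read(&fde->fde_seq);          /* spins while a write is in progress */
        fp = fde->fde_file;                     /* snapshot the protected field */
        if (seq_consistent(&fde->fde_seq, seq))
                break;                          /* no write raced us; fp is usable */
}

Both the counter loads and the snapshot load only need read ordering, i.e.
the smp_rmb()-style guarantee above, not a full barrier.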

Comments?

And yes, I know that:
- atomic_load_acq_rmb_int is a terrible name and I'm trying to get rid
of it
- seq_consistent misses a read memory barrier, but in the worst case this
will result in a spurious ENOTCAPABLE being returned. The security problem
of circumventing capabilities is plugged since seq is properly re-checked
before we return
--
Mateusz Guzik <mjguzik gmail.com>
Attilio Rao
2014-10-28 13:18:41 UTC
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic ops is
not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so it is
additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are not
stronger than needed.
FreeBSD has different semantics than Linux. We historically enforce a
full barrier on _acq() and _rel() rather than just a read and write
barrier, hence we need a different implementation than Linux.
There is code that relies on this property, like the locking
primitives (release a mutex, for instance).

In short: optimizing the implementation for performance is fine and
due. Changing the semantics is not fine, unless you have reviewed and
fixed all the uses of _rel() and _acq().
Post by Mateusz Guzik
On amd64 it is sufficient to place a compiler barrier in such cases.
Next, we lack some atomic ops in the first place.
smp_wmb - no writes can be reordered past this point
smp_rmb - no reads can be reordered past this point
1. var = tmp; smp_wmb();
2. tmp = var; smp_rmb();
3. smp_rmb(); tmp = var;
This matters since what we can already use to emulate these is way
heavier than needed on the aforementioned amd64 and most likely other archs.
I can see the value of such barriers in case you want to just
synchronize operations with regard to reads or writes.
I also believe that on the newest Intel processors (for which we should
optimize) rmb() and wmb() got significantly faster than mb(). However
the most interesting case would be for arm and mips, I assume. That's
where you would see a bigger perf difference if you optimize the
membar paths.

Last time I looked into it, in the FreeBSD kernel the Linux-ish
rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
handling of 16-bit operands and implementation of "faster" bus
barriers.
Initially I had thought about just confining the smp_*() in a Linux
compat layer and fixing the other 2 in this way: for 16-bit operands,
just pad to 32 bits, as the C11 standard also does. For the bus
barriers, just grow more versions to actually include the rmb()/wmb()
scheme within.
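
For reference, a sketch of the 16-bit padding idea, assuming little-endian
and a hypothetical helper name (built on the existing atomic_cmpset_32()):

static __inline void
atomic_add_16(volatile uint16_t *p, uint16_t v)
{
        volatile uint32_t *p32;
        uint32_t old, new;
        int shift;

        /* The aligned 32-bit word containing *p, and *p's bit offset in it. */
        p32 = (volatile uint32_t *)((uintptr_t)p & ~(uintptr_t)3);
        shift = ((uintptr_t)p & 2) * 8;         /* 0 or 16 (little-endian) */
        do {
                old = *p32;
                new = (old & ~(0xffffU << shift)) |
                    (((((old >> shift) & 0xffffU) + v) & 0xffffU) << shift);
        } while (atomic_cmpset_32(p32, old, new) == 0);
}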

At this point, I understand we may want to instead support the
concept of write-only or read-only barriers. This means that if we want
to keep the concept tied to the current _acq()/_rel() scheme we will
end up with a KPI explosion.

I'm not the one making the call here, but for a faster and more
granular approach, possibly we can end up using smp_rmb() and
smp_wmb() directly. As I said, I'm not the one making the call.

Attilio
--
Peace can only be achieved by understanding - A. Einstein
Andrew Turner
2014-10-28 14:25:10 UTC
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic ops
is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so it
is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are not
stronger than needed.
FreeBSD has different semantics than Linux. We historically enforce a
full barrier on _acq() and _rel() rather than just a read and write
barrier, hence we need a different implementation than Linux.
There is code that relies on this property, like the locking
primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has added
support for load-acquire and store-release atomic instructions. For the
use in atomic instructions we can assume these only operate on the
address passed to them.

It is unlikely we will use them in the 32-bit port; however, I would like
to know the expected semantics of these atomic functions to make sure
we get them correct in the arm64 port. I have been advised by one of
the ARM Linux kernel maintainers of the problems they have found using
these instructions but have yet to determine what our atomic functions
guarantee.

Andrew
Attilio Rao
2014-10-28 14:33:06 UTC
Post by Andrew Turner
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic ops
is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so it
is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are not
stronger than needed.
FreeBSD has different semantics than Linux. We historically enforce a
full barrier on _acq() and _rel() rather than just a read and write
barrier, hence we need a different implementation than Linux.
There is code that relies on this property, like the locking
primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has added
support for load-acquire and store-release atomic instructions. For the
use in atomic instructions we can assume these only operate on the
address passed to them.
It is unlikely we will use them in the 32-bit port; however, I would like
to know the expected semantics of these atomic functions to make sure
we get them correct in the arm64 port. I have been advised by one of
the ARM Linux kernel maintainers of the problems they have found using
these instructions but have yet to determine what our atomic functions
guarantee.
For FreeBSD the "reference doc" is atomic(9).
It clearly states:

The second variant of each operation includes a read memory barrier.
This barrier ensures that the effects of this operation are completed
before the effects of any later data accesses. As a result, the
operation is said to have acquire semantics as it acquires a pseudo-lock
requiring further operations to wait until it has completed. To denote
this, the suffix ``_acq'' is inserted into the function name immediately
prior to the ``_<type>'' suffix. For example, to subtract two integers
ensuring that any later writes will happen after the subtraction is
performed, use atomic_subtract_acq_int().

The third variant of each operation includes a write memory barrier.
This ensures that all effects of all previous data accesses are completed
before this operation takes place. As a result, the operation is said to
have release semantics as it releases any pending data accesses to be
completed before its operation is performed. To denote this, the suffix
``_rel'' is inserted into the function name immediately prior to the
``_<type>'' suffix. For example, to add two long integers ensuring that
all previous writes will happen first, use atomic_add_rel_long().

The bottom line of all this is that the read memory barrier ensures that
the effects of the operation you are making (the load, in the case of
atomic_load_acq_int(), for example) are completed before any later
data accesses. "Data accesses" covers *all* operations,
including reads, writes, etc. This is very different from what Linux
assumes for its rmb() barrier, for example, which just orders loads. So
for FreeBSD there is no _acq -> rmb() analogy and there is no _rel ->
wmb() analogy.
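
Concretely, the pattern these semantics are sized for is the lock cookie
(a sketch, not the real mutex code):

/* Acquire: nothing in the critical section can move before this. */
while (atomic_cmpset_acq_int(&lk->lk_lock, 0, 1) == 0)
        cpu_spinwait();

/* ... critical section: both loads and stores are held inside ... */

/* Release: all prior loads and stores complete before the store. */
atomic_store_rel_int(&lk->lk_lock, 0);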

This must be kept well in mind when trying to optimize the atomic_*()
operations.

Attilio
--
Peace can only be achieved by understanding - A. Einstein
Andrew Turner
2014-10-28 17:53:18 UTC
On Tue, 28 Oct 2014 15:33:06 +0100
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic
ops is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
provide full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so
it is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are
not stronger than needed.
FreeBSD has different semantics than Linux. We historically
enforce a full barrier on _acq() and _rel() rather than just a
read and write barrier, hence we need a different implementation
than Linux. There is code that relies on this property, like the
locking primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has
added support for load-acquire and store-release atomic
instructions. For the use in atomic instructions we can assume
these only operate on the address passed to them.
It is unlikely we will use them in the 32-bit port; however, I would
like to know the expected semantics of these atomic functions to
make sure we get them correct in the arm64 port. I have been
advised by one of the ARM Linux kernel maintainers of the problems
they have found using these instructions but have yet to determine
what our atomic functions guarantee.
For FreeBSD the "reference doc" is atomic(9).
There may also be a difference between what it states, how they are
implemented, and what developers assume they do. I'm trying to make
sure I get them correct.
Post by Attilio Rao
The second variant of each operation includes a read memory barrier.
This barrier ensures that the effects of this operation are completed
before the effects of any later data accesses. As a result, the
operation is said to have acquire semantics as it acquires a
pseudo-lock requiring further operations to wait until it has
completed. To denote this, the suffix ``_acq'' is inserted into the
function name immediately prior to the ``_<type>'' suffix. For
example, to subtract two integers ensuring that any later writes will
happen after the subtraction is performed, use
atomic_subtract_acq_int().
It depends on the point we guarantee the acquire barrier to be. On ARMv8
the function will be a load/modify/write sequence. If we use a
load-acquire operation for atomic_subtract_acq_int, for example, for a
pointer P and value to subtract X:

loop:
load-acquire *P to N
perform N = N - X
store-exclusive N to *P
if the store failed goto loop

where N and X are both registers.
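
In A64 that might look like the following (a sketch using the acquire form
of the exclusive load; the register choices are hypothetical):

1:      ldaxr   w8, [x0]        // load-acquire exclusive: N = *P
        sub     w8, w8, w1      // N = N - X
        stxr    w9, w8, [x0]    // store-exclusive; w9 == 0 on success
        cbnz    w9, 1b          // retry if the exclusive monitor was lost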

This will mean no access after this loop will happen before it, but
they may happen within it, e.g. if there was a later access A the
following may be possible:

Load P
Access A
Store P

We know the store will happen as, if it fails (e.g. another processor
accessed *P), the store will have failed and we will iterate over the loop.

The other point is we can guarantee any store-release, and therefore
any prior access, has happened before a later load-acquire even if it's
on another processor.

...
Post by Attilio Rao
The bottom line of all this is that the read memory barrier ensures that
the effects of the operation you are making (the load, in the case of
atomic_load_acq_int(), for example) are completed before any later
data accesses. "Data accesses" covers *all* operations,
including reads, writes, etc. This is very different from what Linux
assumes for its rmb() barrier, for example, which just orders loads. So
for FreeBSD there is no _acq -> rmb() analogy and there is no _rel ->
wmb() analogy.
On ARMv8, using the above pseudo-code, later operations
will not be moved before the load-acquire, but they may happen before
its store. Having discussed this with John Baldwin I don't think this
is a problem due to the nature of the store operation being allowed to
fail if another processor has written its memory.
Post by Attilio Rao
This must be kept well in mind when trying to optimize the atomic_*()
operations.
At this point I'm more interested in getting them correct as they will
be important when I start on SMP support.

Andrew
Attilio Rao
2014-10-28 20:08:27 UTC
Post by Andrew Turner
On Tue, 28 Oct 2014 15:33:06 +0100
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic
ops is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
provide full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so
it is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are
not stronger than needed.
FreeBSD has different semantics than Linux. We historically
enforce a full barrier on _acq() and _rel() rather than just a
read and write barrier, hence we need a different implementation
than Linux. There is code that relies on this property, like the
locking primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has
added support for load-acquire and store-release atomic
instructions. For the use in atomic instructions we can assume
these only operate on the address passed to them.
It is unlikely we will use them in the 32-bit port; however, I would
like to know the expected semantics of these atomic functions to
make sure we get them correct in the arm64 port. I have been
advised by one of the ARM Linux kernel maintainers of the problems
they have found using these instructions but have yet to determine
what our atomic functions guarantee.
For FreeBSD the "reference doc" is atomic(9).
There may also be a difference between what it states, how they are
implemented, and what developers assume they do. I'm trying to make
sure I get them correct.
atomic(9) is our reference, so there should be no difference between
what it states and what all architectures implement.
I can say that x86 follows atomic(9) well. I'm not competent enough to
judge if all the !x86 arches follow it completely.
I can understand that developers may get confused. The FreeBSD scheme
is pretty unique. It comes from the fact that historically the membar
support was made to initially support x86. The super-widespread Linux
design, instead, tried to catch all architectures in its description.
It became very well known and I think it also "pushed" companies
like Intel to invest in improving performance of things like explicit
read/write barriers, etc.
Post by Andrew Turner
Post by Attilio Rao
The second variant of each operation includes a read memory barrier.
This barrier ensures that the effects of this operation are completed
before the effects of any later data accesses. As a result, the
operation is said to have acquire semantics as it acquires a
pseudo-lock requiring further operations to wait until it has
completed. To denote this, the suffix ``_acq'' is inserted into the
function name immediately prior to the ``_<type>'' suffix. For
example, to subtract two integers ensuring that any later writes will
happen after the subtraction is performed, use
atomic_subtract_acq_int().
It depends on the point we guarantee the acquire barrier to be. On ARMv8
the function will be a load/modify/write sequence. If we use a
load-acquire operation for atomic_subtract_acq_int, for example, for a
pointer P and value to subtract X:
loop:
load-acquire *P to N
perform N = N - X
store-exclusive N to *P
if the store failed goto loop
where N and X are both registers.
This will mean no access after this loop will happen before it, but
they may happen within it, e.g. if there was a later access A the
following may be possible:
Load P
Access A
Store P
No, this will be broken in FreeBSD if "Access A" is later.
If "Access A" is prior to the membar it doesn't really matter if it gets
interleaved with any of the operations in the atomic instruction.
Ideally, it could even surpass the Store P itself.
But if "Access A" is later (and you want to implement an _acq()
barrier) then it absolutely cannot get in the middle of the atomic_*
operation.
Post by Andrew Turner
We know the store will happen as, if it fails (e.g. another processor
accessed *P), the store will have failed and we will iterate over the loop.
The other point is we can guarantee any store-release, and therefore
any prior access, has happened before a later load-acquire even if it's
on another processor.
No, we can never make guarantees about the visibility of the operations to other CPUs.
We just make guarantees about how the operations are posted on the system
bus (or how they are locally visible).
Keeping in mind that the FreeBSD model came from x86, you can sense that
some things are sized to the x86 model, which doesn't have any rule or
ordering on global visibility of the operations.
Post by Andrew Turner
...
Post by Attilio Rao
The bottom line of all this is that the read memory barrier ensures that
the effects of the operation you are making (the load, in the case of
atomic_load_acq_int(), for example) are completed before any later
data accesses. "Data accesses" covers *all* operations,
including reads, writes, etc. This is very different from what Linux
assumes for its rmb() barrier, for example, which just orders loads. So
for FreeBSD there is no _acq -> rmb() analogy and there is no _rel ->
wmb() analogy.
On ARMv8, using the above pseudo-code, later operations
will not be moved before the load-acquire, but they may happen before
its store. Having discussed this with John Baldwin I don't think this
is a problem due to the nature of the store operation being allowed to
fail if another processor has written its memory.
Post by Attilio Rao
This must be kept well in mind when trying to optimize the atomic_*()
operations.
At this point I'm more interested in getting them correct as they will
be important when I start on SMP support.
Sure. The thread started as an "optimization of x86" discussion but it refers
to all atomic_* on every architecture FreeBSD supports.

Attilio
--
Peace can only be achieved by understanding - A. Einstein
John Baldwin
2014-10-29 14:59:16 UTC
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 15:33:06 +0100
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic
ops is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
provide full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so
it is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are
not stronger than needed.
FreeBSD has different semantics than Linux. We historically
enforce a full barrier on _acq() and _rel() rather than just a
read and write barrier, hence we need a different implementation
than Linux. There is code that relies on this property, like the
locking primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has
added support for load-acquire and store-release atomic
instructions. For the use in atomic instructions we can assume
these only operate on the address passed to them.
It is unlikely we will use them in the 32-bit port; however, I would
like to know the expected semantics of these atomic functions to
make sure we get them correct in the arm64 port. I have been
advised by one of the ARM Linux kernel maintainers of the problems
they have found using these instructions but have yet to determine
what our atomic functions guarantee.
For FreeBSD the "reference doc" is atomic(9).
There may also be a difference between what it states, how they are
implemented, and what developers assume they do. I'm trying to make
sure I get them correct.
atomic(9) is our reference, so there should be no difference between
what it states and what all architectures implement.
I can say that x86 follows atomic(9) well. I'm not competent enough to
judge if all the !x86 arches follow it completely.
I can understand that developers may get confused. The FreeBSD scheme
is pretty unique. It comes from the fact that historically the membar
support was made to initially support x86. The super-widespread Linux
design, instead, tried to catch all architectures in its description.
It became very well known and I think it also "pushed" companies
like Intel to invest in improving performance of things like explicit
read/write barriers, etc.
Actually, it was designed to support ia64 (and specifically the .acq and
.rel modifiers on the ld, st, and cmpxchg instructions). Some of the
language is wrong (and is my fault) in that they are not "read" and
"write" barriers. They truly are "acquire" and "release". That said,
x86 has stronger barriers than that, partly because on i386 there wasn't
a whole lot of options (though atomic_store_rel on even i386 should just
be a simple store).
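
For reference, the ia64 forms in question look roughly like (a sketch):

        ld4.acq         r8 = [r32]              // 4-byte load, acquire semantics
        st4.rel         [r32] = r9              // 4-byte store, release semantics
        cmpxchg4.acq    r8 = [r32], r9, ar.ccv  // compare-exchange, acquire form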
Post by Attilio Rao
Post by Andrew Turner
Post by Attilio Rao
The second variant of each operation includes a read memory barrier.
This barrier ensures that the effects of this operation are completed
before the effects of any later data accesses. As a result, the
operation is said to have acquire semantics as it acquires a
pseudo-lock requiring further operations to wait until it has
completed. To denote this, the suffix ``_acq'' is inserted into the
function name immediately prior to the ``_<type>'' suffix. For
example, to subtract two integers ensuring that any later writes will
happen after the subtraction is performed, use
atomic_subtract_acq_int().
It depends on the point we guarantee the acquire barrier to be. On ARMv8
the function will be a load/modify/write sequence. If we use a
load-acquire operation for atomic_subtract_acq_int, for example, for a
pointer P and value to subtract X:
loop:
load-acquire *P to N
perform N = N - X
store-exclusive N to *P
if the store failed goto loop
where N and X are both registers.
This will mean no access after this loop will happen before it, but
they may happen within it, e.g. if there was a later access A the
following may be possible:
Load P
Access A
Store P
No, this will be broken in FreeBSD if "Access A" is later.
If "Access A" is prior to the membar it doesn't really matter if it gets
interleaved with any of the operations in the atomic instruction.
Ideally, it could even surpass the Store P itself.
But if "Access A" is later (and you want to implement an _acq()
barrier) then it absolutely cannot get in the middle of the atomic_*
operation.
Eh, that isn't broken. It is subtle however. The reason it isn't broken
is that if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried. If A was accessed during the
atomic op, the load-acquire during the retry will discard that and force A
to be re-accessed. If P is not accessed during the atomic op, then it is
safe to access A during the atomic op itself.
Post by Attilio Rao
Post by Andrew Turner
We know the store will happen as, if it fails (e.g. another processor
accessed *P), the store will have failed and we will iterate over the loop.
The other point is we can guarantee any store-release, and therefore
any prior access, has happened before a later load-acquire even if it's
on another processor.
No, we can never make guarantees about the visibility of the operations to other CPUs.
We just make guarantees about how the operations are posted on the system
bus (or how they are locally visible).
Keeping in mind that the FreeBSD model came from x86, you can sense that
some things are sized to the x86 model, which doesn't have any rule or
ordering on global visibility of the operations.
1) Again, it's actually based on ia64.

2) x86 _does_ have rules on ordering of global visibility in that most
stores (aside from some SSE special cases) will become visible in
program order. Now, you can't force the _timing_ of when the stores
become visible (and this is true in general, in MI code you can't
assume that a barrier is equivalent to a cache flush).

3) In this case I think Andrew is using "armv8" for "we" and you can
depend on architecture-specific semantics to determine the implementation
of atomic(9).
--
John Baldwin
Attilio Rao
2014-10-29 16:33:35 UTC
Post by John Baldwin
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 15:33:06 +0100
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic
ops is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
provide full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so
it is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are
not stronger than needed.
FreeBSD has different semantics than Linux. We historically
enforce a full barrier on _acq() and _rel() rather than just a
read and write barrier, hence we need a different implementation
than Linux. There is code that relies on this property, like the
locking primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has
added support for load-acquire and store-release atomic
instructions. For the use in atomic instructions we can assume
these only operate on the address passed to them.
It is unlikely we will use them in the 32-bit port; however, I would
like to know the expected semantics of these atomic functions to
make sure we get them correct in the arm64 port. I have been
advised by one of the ARM Linux kernel maintainers of the problems
they have found using these instructions but have yet to determine
what our atomic functions guarantee.
For FreeBSD the "reference doc" is atomic(9).
There may also be a difference between what it states, how they are
implemented, and what developers assume they do. I'm trying to make
sure I get them correct.
atomic(9) is our reference, so there should be no difference between
what it states and what all architectures implement.
I can say that x86 follows atomic(9) well. I'm not competent enough to
judge if all the !x86 arches follow it completely.
I can understand that developers may get confused. The FreeBSD scheme
is pretty unique. It comes from the fact that historically the membar
support was made to initially support x86. The super-widespread Linux
design, instead, tried to catch all architectures in its description.
It became very well known and I think it also "pushed" companies
like Intel to invest in improving performance of things like explicit
read/write barriers, etc.
Actually, it was designed to support ia64 (and specifically the .acq and
.rel modifiers on the ld, st, and cmpxchg instructions). Some of the
language is wrong (and is my fault) in that they are not "read" and
"write" barriers. They truly are "acquire" and "release". That said,
x86 has stronger barriers than that, partly because on i386 there wasn't
a whole lot of options (though atomic_store_rel on even i386 should just
be a simple store).
Post by Attilio Rao
Post by Andrew Turner
Post by Attilio Rao
The second variant of each operation includes a read memory barrier.
This barrier ensures that the effects of this operation are completed
before the effects of any later data accesses. As a result, the
operation is said to have acquire semantics as it acquires a
pseudo-lock requiring further operations to wait until it has
completed. To denote this, the suffix ``_acq'' is inserted into the
function name immediately prior to the ``_<type>'' suffix. For
example, to subtract two integers ensuring that any later writes will
happen after the subtraction is performed, use
atomic_subtract_acq_int().
It depends on the point we guarantee the acquire barrier to be. On ARMv8
the function will be a load/modify/write sequence. If we use a
load-acquire operation for atomic_subtract_acq_int, for example, for a
pointer P and value to subtract X:
loop:
load-acquire *P to N
perform N = N - X
store-exclusive N to *P
if the store failed goto loop
where N and X are both registers.
This will mean no access after this loop will happen before it, but
they may happen within it, e.g. if there was a later access A the
following may be possible:
Load P
Access A
Store P
No, this will be broken in FreeBSD if "Access A" is later.
If "Access A" is prior to the membar it doesn't really matter if it gets
interleaved with any of the operations in the atomic instruction.
Ideally, it could even surpass the Store P itself.
But if "Access A" is later (and you want to implement an _acq()
barrier) then it absolutely cannot get in the middle of the atomic_*
operation.
Eh, that isn't broken. It is subtle however. The reason it isn't broken
is that if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried. If A was accessed during the
atomic op, the load-acquire during the retry will discard that and force A
to be re-accessed. If P is not accessed during the atomic op, then it is
safe to access A during the atomic op itself.
This is specific to armv8, which I know 0 about. Good to know.
From a general point of view the description didn't seem ok.
Post by John Baldwin
Post by Attilio Rao
Post by Andrew Turner
We know the store will happen as, if it fails (e.g. another processor
accessed *P), the store will have failed and we will iterate over the loop.
The other point is we can guarantee any store-release, and therefore
any prior access, has happened before a later load-acquire even if it's
on another processor.
No, we can never make guarantees about the visibility of the operations to other CPUs.
We just make guarantees about how the operations are posted on the system
bus (or how they are locally visible).
Keeping in mind that the FreeBSD model came from x86, you can sense that
some things are sized to the x86 model, which doesn't have any rule or
ordering on global visibility of the operations.
1) Again, it's actually based on ia64.
2) x86 _does_ have rules on ordering of global visibility in that most
stores (aside from some SSE special cases) will become visible in
program order. Now, you can't force the _timing_ of when the stores
become visible (and this is true in general, in MI code you can't
assume that a barrier is equivalent to a cache flush).
Yes, this is what I mean. You can't have guarantees on the global
timing of the memory accesses.

Attilio
--
Peace can only be achieved by understanding - A. Einstein
Ian Lepore
2014-10-29 16:58:15 UTC
Post by John Baldwin
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 15:33:06 +0100
Post by Attilio Rao
Post by Andrew Turner
On Tue, 28 Oct 2014 14:18:41 +0100
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic
ops is not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64)
provide full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so
it is additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are
not stronger than needed.
FreeBSD has different semantics than Linux. We historically
enforce a full barrier on _acq() and _rel() rather than just a
read and write barrier, hence we need a different implementation
than Linux. There is code that relies on this property, like the
locking primitives (release a mutex, for instance).
On 32-bit ARM prior to ARMv8 (i.e. all chips we currently support)
there are only full barriers. On both 32- and 64-bit ARMv8, ARM has
added support for load-acquire and store-release atomic
instructions. For the use in atomic instructions we can assume
these only operate on the address passed to them.
It is unlikely we will use them in the 32-bit port; however, I would
like to know the expected semantics of these atomic functions to
make sure we get them correct in the arm64 port. I have been
advised by one of the ARM Linux kernel maintainers of the problems
they have found using these instructions but have yet to determine
what our atomic functions guarantee.
For FreeBSD the "reference doc" is atomic(9).
There may also be a difference between what it states, how they are
implemented, and what developers assume they do. I'm trying to make
sure I get them correct.
atomic(9) is our reference, so there should be no difference between
what it states and what all architectures implement.
I can say that x86 follows atomic(9) well. I'm not competent enough to
judge if all the !x86 arches follow it completely.
I can understand that developers may get confused. The FreeBSD scheme
is pretty unique. It comes from the fact that historically the membar
support was made to initially support x86. The super-widespread Linux
design, instead, tried to catch all architectures in its description.
It became very well known and I think it also "pushed" companies
like Intel to invest in improving performance of things like explicit
read/write barriers, etc.
Actually, it was designed to support ia64 (and specifically the .acq and
.rel modifiers on the ld, st, and cmpxchg instructions). Some of the
language is wrong (and is my fault) in that they are not "read" and
"write" barriers. They truly are "acquire" and "release". That said,
x86 has stronger barriers than that, partly because on i386 there wasn't
a whole lot of options (though atomic_store_rel on even i386 should just
be a simple store).
Post by Attilio Rao
Post by Andrew Turner
Post by Attilio Rao
The second variant of each operation includes a read memory barrier.
This barrier ensures that the effects of this operation are completed
before the effects of any later data accesses. As a result, the
operation is said to have acquire semantics as it acquires a
pseudo-lock requiring further operations to wait until it has
completed. To denote this, the suffix ``_acq'' is inserted into the
function name immediately prior to the ``_<type>'' suffix. For
example, to subtract two integers ensuring that any later writes will
happen after the subtraction is performed, use
atomic_subtract_acq_int().
It depends on the point we guarantee the acquire barrier to be. On ARMv8
the function will be a load/modify/write sequence. If we use a
load-acquire operation for atomic_subtract_acq_int, for example, for a
pointer P and value to subtract X:
loop:
load-acquire *P to N
perform N = N - X
store-exclusive N to *P
if the store failed goto loop
where N and X are both registers.
This will mean no access after this loop will happen before it, but
they may happen within it, e.g. if there was a later access A the
following may be possible:
Load P
Access A
Store P
No, this will be broken in FreeBSD if "Access A" is later.
If "Access A" is prior to the membar it doesn't really matter if it gets
interleaved with any of the operations in the atomic instruction.
Ideally, it could even surpass the Store P itself.
But if "Access A" is later (and you want to implement an _acq()
barrier) then it absolutely cannot get in the middle of the atomic_*
operation.
Eh, that isn't broken. It is subtle however. The reason it isn't broken
is that if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried. If A was accessed during the
atomic op, the load-acquire during the retry will discard that and force A
to be re-accessed. If P is not accessed during the atomic op, then it is
safe to access A during the atomic op itself.
I'm not sure I completely agree with all of this.

First, for

if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried

The term 'access' needs to be changed to 'store'. Other read accesses
to P will not cause the store-exclusive to fail.

Next, when we consider 'Access A' I'm not sure it's true that the access
will replay if the store-exclusive fails and the operation loops. The
access to A may have been a prefetch, even a prefetch for data on a
predicted upcoming execution branch which may or may not end up being
taken.

I think the only thing that makes an ldrex/strex sequence safe for use
in implementing synchronization primitives is to insert a 'dmb' after
the acquire loop (after the strex succeeds), and 'dsb' before the
release loop (dsb is required for SMP, dmb might be good enough on UP).
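
In other words, something like this for the acquire side (an armv7 sketch,
GNU as syntax, '@' starts a comment):

1:      ldrex   r2, [r0]        @ load-exclusive; no implicit barrier here
        sub     r2, r2, r1
        strex   r3, r2, [r0]    @ store-exclusive; r3 == 0 on success
        cmp     r3, #0
        bne     1b
        dmb                     @ the explicit acquire barrier after the loop

and a dsb (or dmb on UP) issued before the corresponding release sequence.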

Looking into this has made me realize our current armv6/7 atomics are
incorrect in this regard. Guess I'll see about fixing them up Real Soon
Now. :)

-- Ian
Post by John Baldwin
Post by Attilio Rao
Post by Andrew Turner
We know the store will happen as, if it fails (e.g. another processor
accessed *P), the store will have failed and we will iterate over the loop.
The other point is we can guarantee any store-release, and therefore
any prior access, has happened before a later load-acquire even if it's
on another processor.
No, we can never make guarantees about the visibility of the operations to other CPUs.
We just make guarantees about how the operations are posted on the system
bus (or how they are locally visible).
Keeping in mind that the FreeBSD model came from x86, you can sense that
some things are sized to the x86 model, which doesn't have any rule or
ordering on global visibility of the operations.
1) Again, it's actually based on ia64.
2) x86 _does_ have rules on ordering of global visibility in that most
stores (aside from some SSE special cases) will become visible in
program order. Now, you can't force the _timing_ of when the stores
become visible (and this is true in general, in MI code you can't
assume that a barrier is equivalent to a cache flush).
3) In this case I think Andrew is using "armv8" for "we" and you can
depend on architecture-specific semantics to determine the implementation
of atomic(9).
John Baldwin
2014-10-29 17:35:57 UTC
Post by Ian Lepore
Post by John Baldwin
Eh, that isn't broken. It is subtle however. The reason it isn't broken
is that if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried. If A was accessed during the
atomic op, the load-acquire during the retry will discard that and force A
to be re-accessed. If P is not accessed during the atomic op, then it is
safe to access A during the atomic op itself.
I'm not sure I completely agree with all of this.
First, for
if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried
The term 'access' needs to be changed to 'store'. Other read accesses
to P will not cause the store-exclusive to fail.
Correct, though for the places where acquire is used I believe that is ok.
Certainly for lock cookies it is ok. It's writes to the lock cookie that
would invalidate 'A'.
Post by Ian Lepore
Next, when we consider 'Access A' I'm not sure it's true that the access
will replay if the store-exclusive fails and the operation loops. The
access to A may have been a prefetch, even a prefetch for data on a
predicted upcoming execution branch which may or may not end up being
taken.
I think the only thing that makes an ldrex/strex sequence safe for use
in implementing synchronization primitives is to insert a 'dmb' after
the acquire loop (after the strex succeeds), and 'dsb' before the
release loop (dsb is required for SMP, dmb might be good enough on UP).
Looking into this has made me realize our current armv6/7 atomics are
incorrect in this regard. Guess I'll see about fixing them up Real Soon
Now. :)
I'm not actually sure either, but it would be surprising to me otherwise.
Presumably there is nothing magic about a branch. Either the load-acquire
is an acquire barrier or it isn't. Namely, suppose you had this sequence:

load-acquire P
access A (prefetch)
load-acquire Q
load A

Would you expect the prefetch to satisfy the load or should the load-acquire
on Q discard that? Having a branch after a failing conditional store back
to the load acquire should work similarly. It has to discard anything that
was prefetched or it isn't an actual load-acquire.

That is, consider:

1:
load-acquire P
access A (prefetch)
conditional-store P
branch-if-fail 1b
load A

In the case that the branch fails, the sequence of operations is:

load-acquire P
access A (prefetch)
conditional-store P
branch
load-acquire P

That should be equivalent to the first sequence above unless the branch
instruction has the magical property of disabling memory barriers on the
instruction after a branch (which would be insane).
--
John Baldwin
Ian Lepore
2014-10-29 18:03:50 UTC
Post by John Baldwin
Post by Ian Lepore
Post by John Baldwin
Eh, that isn't broken. It is subtle however. The reason it isn't broken
is that if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried. If A was accessed during the
atomic op, the load-acquire during the retry will discard that and force A
to be re-accessed. If P is not accessed during the atomic op, then it is
safe to access A during the atomic op itself.
I'm not sure I completely agree with all of this.
First, for
if any access to P occurs after the 'load P', then the store will
fail and the load-acquire will be retried
The term 'access' needs to be changed to 'store'. Other read accesses
to P will not cause the store-exclusive to fail.
Correct, though for the places where acquire is used I believe that is ok.
Certainly for lock cookies it is ok. It's writes to the lock cookie that
would invalidate 'A'.
Post by Ian Lepore
Next, when we consider 'Access A' I'm not sure it's true that the access
will replay if the store-exclusive fails and the operation loops. The
access to A may have been a prefetch, even a prefetch for data on a
predicted upcoming execution branch which may or may not end up being
taken.
I think the only thing that makes an ldrex/strex sequence safe for use
in implementing synchronization primitives is to insert a 'dmb' after
the acquire loop (after the strex succeeds), and 'dsb' before the
release loop (dsb is required for SMP, dmb might be good enough on UP).
Looking into this has made me realize our current armv6/7 atomics are
incorrect in this regard. Guess I'll see about fixing them up Real Soon
Now. :)
I'm not actually sure either, but it would be surprising to me otherwise.
Presumably there is nothing magic about a branch. Either the load-acquire
is an acquire barrier or it isn't. Namely, suppose you had this sequence:
load-acquire P
access A (prefetch)
load-acquire Q
load A
Would you expect the prefetch to satisfy the load or should the load-acquire
on Q discard that? Having a branch after a failing conditional store back
to the load acquire should work similarly. It has to discard anything that
was prefetched or it isn't an actual load-acquire.
That is, consider:
1:
load-acquire P
access A (prefetch)
conditional-store P
branch-if-fail 1b
load A
In the case that the branch fails, the sequence of operations is:
load-acquire P
access A (prefetch)
conditional-store P
branch
load-acquire P
That should be equivalent to the first sequence above unless the branch
instruction has the magical property of disabling memory barriers on the
instruction after a branch (which would be insane).
I hadn't realized it when I wrote that, but Andy was speaking in the
context of armv8, which has a true load-acquire instruction. In our
current code (armv6 and 7) we need the explicit dmb/dsb barriers to get
the same effect. (It turns out we do have barriers, I misspoke earlier,
but some of our dmb need to be dsb.)

-- Ian
John Baldwin
2014-10-29 18:13:18 UTC
Post by Ian Lepore
I hadn't realized it when I wrote that, but Andy was speaking in the
context of armv8, which has a true load-acquire instruction. In our
current code (armv6 and 7) we need the explicit dmb/dsb barriers to get
the same effect. (It turns out we do have barriers, I misspoke earlier,
but some of our dmb need to be dsb.)
Ah, ok. Fair enough. :)
--
John Baldwin
Andrew Turner
2014-10-30 18:10:48 UTC
On Wed, 29 Oct 2014 13:35:57 -0400
Post by John Baldwin
Post by Ian Lepore
Next, when we consider 'Access A' I'm not sure it's true that the
access will replay if the store-exclusive fails and the operation
loops. The access to A may have been a prefetch, even a prefetch
for data on a predicted upcoming execution branch which may or may
not end up being taken.
I think the only thing that makes an ldrex/strex sequence safe for
use in implementing synchronization primitives is to insert a 'dmb'
after the acquire loop (after the strex succeeds), and 'dsb' before
the release loop (dsb is required for SMP, dmb might be good enough
on UP).
Looking into this has made me realize our current armv6/7 atomics
are incorrect in this regard. Guess I'll see about fixing them up
Real Soon Now. :)
I'm not actually sure either, but it would be surprising to me
otherwise. Presumably there is nothing magic about a branch. Either
the load-acquire is an acquire barrier or it isn't. Namely, suppose
you had this sequence:
load-acquire P
access A (prefetch)
load-acquire Q
load A
Would you expect the prefetch to satisfy the load or should the
load-acquire on Q discard that? Having a branch after a failing
conditional store back to the load acquire should work similarly. It
has to discard anything that was prefetched or it isn't an actual
load-acquire.
I have checked with someone at ARM. The prefetch should not be
considered an access with regard to the barrier and it could be moved
before it as it will only load data into the cache. The barrier only
deals with loading data into the core, i.e. if it was part of the
prefetch it will be loaded from the cache no earlier than the
load-acquire. The cache coherency protocol ensures the data will be up
to date while the barrier will ensure the ordering of the load of A.

In the above example the prefetch of A will not be thrown away but the
data in the cache may change between the prefetch and load A if another
core has written to A. If this is the case the load will be of the new
data.

Andrew
John Baldwin
2014-10-30 19:03:13 UTC
Post by Andrew Turner
On Wed, 29 Oct 2014 13:35:57 -0400
Post by John Baldwin
Post by Ian Lepore
Next, when we consider 'Access A' I'm not sure it's true that the
access will replay if the store-exclusive fails and the operation
loops. The access to A may have been a prefetch, even a prefetch
for data on a predicted upcoming execution branch which may or may
not end up being taken.
I think the only thing that makes an ldrex/strex sequence safe for
use in implementing synchronization primitives is to insert a 'dmb'
after the acquire loop (after the strex succeeds), and 'dsb' before
the release loop (dsb is required for SMP, dmb might be good enough
on UP).
Looking into this has made me realize our current armv6/7 atomics
are incorrect in this regard. Guess I'll see about fixing them up
Real Soon Now. :)
I'm not actually sure either, but it would be surprising to me
otherwise. Presumably there is nothing magic about a branch. Either
the load-acquire is an acquire barrier or it isn't. Namely, suppose
you had this sequence:
load-acquire P
access A (prefetch)
load-acquire Q
load A
Would you expect the prefetch to satisfy the load or should the
load-acquire on Q discard that? Having a branch after a failing
conditional store back to the load acquire should work similarly. It
has to discard anything that was prefetched or it isn't an actual
load-acquire.
I have checked with someone at ARM. The prefetch should not be
considered an access with regard to the barrier and it could be moved
before it as it will only load data into the cache. The barrier only
deals with loading data into the core, i.e. if it was part of the
prefetch it will be loaded from the cache no earlier than the
load-acquire. The cache coherency protocol ensures the data will be up
to date while the barrier will ensure the ordering of the load of A.
In the above example the prefetch of A will not be thrown away but the
data in the cache may change between the prefetch and load A if another
core has written to A. If this is the case the load will be of the new
data.
That is sufficient for what atomic(9)'s _acq wants, yes.
--
John Baldwin
Mateusz Guzik
2014-10-29 19:04:59 UTC
Post by Attilio Rao
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic ops is
not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
full memory barriers, which is stronger than needed.
Moreover, load is implemented as lock cmpxchg on the var address, so it is
additionally slower, especially when cpus compete.
I already explained this once privately: full memory barriers are not
stronger than needed.
FreeBSD has different semantics than Linux. We historically enforce a
full barrier on _acq() and _rel() rather than just a read and write
barrier, hence we need a different implementation than Linux.
There is code that relies on this property, like the locking
primitives (release a mutex, for instance).
I mean stronger than needed in some cases; a popular one is fget_unlocked,
and we provide no "lightest sufficient" barrier (which would also be
cheaper).

Another case which benefits greatly is sys/sys/seq.h. As noted in some
other thread, using load_acq as it is destroys performance.

I don't dispute the need for full barriers, although it is unclear which
current consumers of load_acq actually need a full barrier.
Post by Attilio Rao
In short: optimizing the implementation for performance is fine and
due. Changing the semantics is not fine, unless you have reviewed and
fixed all the uses of _rel() and _acq().
Post by Mateusz Guzik
On amd64 it is sufficient to place a compiler barrier in such cases.
Next, we lack some atomic ops in the first place.
smp_wmb - no writes can be reordered past this point
smp_rmb - no reads can be reordered past this point
1. var = tmp; smp_wmb();
2. tmp = var; smp_rmb();
3. smp_rmb(); tmp = var;
This matters since what we can already use to emulate these is way
heavier than needed on the aforementioned amd64 and most likely other archs.
I can see the value of such barriers in case you want to just
synchronize operations with regard to reads or writes.
I also believe that on the newest Intel processors (for which we should
optimize) rmb() and wmb() got significantly faster than mb(). However
the most interesting case would be for arm and mips, I assume. That's
where you would see a bigger perf difference if you optimize the
membar paths.
Last time I looked into it, in the FreeBSD kernel the Linux-ish
rmb()/wmb()/etc. were used primarily in 3 places: Linux-derived code,
handling of 16-bit operands and implementation of "faster" bus
barriers.
Initially I had thought about just confining the smp_*() in a Linux
compat layer and fixing the other 2 in this way: for 16-bit operands,
just pad to 32 bits, as the C11 standard also does. For the bus
barriers, just grow more versions to actually include the rmb()/wmb()
scheme within.
At this point, I understand we may want to instead support the
concept of write-only or read-only barriers. This means that if we want
to keep the concept tied to the current _acq()/_rel() scheme we will
end up with a KPI explosion.
I'm not the one making the call here, but for a faster and more
granular approach, possibly we can end up using smp_rmb() and
smp_wmb() directly. As I said, I'm not the one making the call.
Well, I don't know the original motivation for expressing stuff with
_load_acq and _store_rel.

Anyway, maybe we could do something along these lines (expressing intent,
not actual code):

mb_producer_start(p, v) { *p = v; smp_wmb(); }
mb_producer(p, v) { smp_wmb(); *p = v; }
mb_producer_end(p, v) { mb_producer(p, v); }

type mb_consumer(p) { var = *p; smp_rmb(); return (var); }
type mb_consumer_start(p) { return (mb_consumer(p)); }
type mb_consumer_end(p) { smp_rmb(); return (*p); }
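
For instance, a seqlock-style pair could then be written as (still intent,
not actual code; sp->seq is even when no write is in progress):

/* writer */
mb_producer_start(&sp->seq, sp->seq + 1);       /* odd: write in progress */
sp->value = v;
mb_producer_end(&sp->seq, sp->seq + 1);         /* writes visible, then even */

/* reader */
do {
        seq = mb_consumer_start(&sp->seq);
        val = sp->value;
} while (seq != mb_consumer_end(&sp->seq) || (seq & 1) != 0);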
--
Mateusz Guzik <mjguzik gmail.com>
Konstantin Belousov
2014-10-28 13:42:54 UTC
Post by Mateusz Guzik
As was mentioned some time ago, our situation related to atomic ops is
not ideal.
atomic_load_acq_* and atomic_store_rel_* (at least on amd64) provide
full memory barriers, which is stronger than needed.
x86 atomic_store_rel() does not establish any cpu barrier, due to the
already provided guarantees of the architecture.
Post by Mateusz Guzik
Moreover, load is implemented as lock cmpxchg on the var address, so it is
additionally slower, especially when cpus compete.
On amd64 it is sufficient to place a compiler barrier in such cases.
Next, we lack some atomic ops in the first place.
smp_wmb - no writes can be reordered past this point
smp_rmb - no reads can be reordered past this point
1. var = tmp; smp_wmb();
2. tmp = var; smp_rmb();
3. smp_rmb(); tmp = var;
This matters since what we can already use to emulate these is way
heavier than needed on the aforementioned amd64 and most likely other archs.
It is unclear to me whether it makes sense to alter what
atomic_load_acq_* are currently doing.
I still think that our loads/stores, compared with the classic definition
of the operations, are ordered, i.e. what is called sequentially consistent
in the C standard. I have no idea if we want this property, or whether it
is really used. kern_intr.c (ab)uses load in this way.
Post by Mateusz Guzik
The simplest thing would be to just introduce the aforementioned macros.
Unfortunately I don't have any ideas for new function names.
I was considering stealing consumer/producer wording instead of acq/rel,
but that does not help with case 1.
Also there is no common header for atomic ops.
I propose adding sys/atomic.h which includes machine/atomic.h. It would
then provide the atomic ops missing from the MD header, implemented using
what is already there.
For an example where it could be useful see
https://svnweb.freebsd.org/base/head/sys/sys/seq.h?view=markup
Comments?
- atomic_load_acq_rmb_int is a terrible name and I'm trying to get rid
of it
- seq_consistent misses a read memory barrier, but in the worst case this
will result in a spurious ENOTCAPABLE being returned. The security problem
of circumventing capabilities is plugged since seq is properly re-checked
before we return
--
Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik
2015-04-09 06:14:49 UTC
On Tue, Oct 28, 2014 at 03:52:22AM +0100, Mateusz Guzik wrote:
[scratching old content so that I hopefully re-state it nicer]

I would like to revive the discussion about memory barriers provided in
the kernel.

The kernel (at least on amd64) lacks lightweight barriers providing only
the following guarantees:
- all writes are completed prior to a given point
- all reads are completed prior to a given point

On amd64 such barriers require only a compiler barrier, and as such
obviously beat the currently used operations like load_acq (which uses
cmpxchg).

An example consumer which would benefit greatly from such barriers is
seq.h:
https://svnweb.freebsd.org/base/head/sys/sys/seq.h?view=markup

_load_acq on amd64 provides a full barrier, and it was noted we should not
change that in order not to break possible 3rd-party consumers.
Also I don't see any alternative naming convention trying to stick to
this scheme that we could use.

As such I propose stealing the naming from Linux and introducing smp_wmb
and smp_rmb macros providing the aforementioned functionality.

So for amd64 this would be:
#define smp_wmb() __compiler_membar()
#define smp_rmb() __compiler_membar()
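
Other architectures could start from a conservative MI fallback and be
tuned later, e.g. (a sketch, assuming the MD header provides the existing
wmb()/rmb() macros):

#ifndef smp_wmb
#define smp_wmb()       wmb()   /* stronger than required is always safe */
#endif
#ifndef smp_rmb
#define smp_rmb()       rmb()
#endif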

Any objections?

I'm happy to talk to arch maintainers in order to get relevant
implementations for all architectures.
--
Mateusz Guzik <mjguzik gmail.com>
Alan Cox
2015-04-09 08:17:01 UTC
Post by Mateusz Guzik
[scratching old content so that I hopefully re-state it nicer]
I would like to revive the discussion about memory barriers provided in
the kernel.
The kernel (at least on amd64) lacks lightweight barriers providing only
the following guarantees:
- all writes are completed prior to a given point
- all reads are completed prior to a given point
On amd64 such barriers require only a compiler barrier, and as such
obviously beat the currently used operations like load_acq (which uses
cmpxchg).
An example consumer which would benefit greatly from such barriers is
seq.h:
https://svnweb.freebsd.org/base/head/sys/sys/seq.h?view=markup
_load_acq on amd64 provides a full barrier, and it was noted we should not
change that in order not to break possible 3rd-party consumers.
Also I don't see any alternative naming convention trying to stick to
this scheme that we could use.
As such I propose stealing the naming from Linux and introducing smp_wmb
and smp_rmb macros providing the aforementioned functionality.
So for amd64 this would be:
#define smp_wmb() __compiler_membar()
#define smp_rmb() __compiler_membar()
Any objections?
I'm happy to talk to arch maintainers in order to get relevant
implementations for all architectures.
How about stealing from C11's stdatomic.h instead of Linux? C11's model
for expressing memory access ordering requirements is, like our
atomic.h, inspired by the release consistency model. And stdatomic.h
has an operation, atomic_thread_fence(), that allows you to express the
need for acquire and/or release ordering at some point in your program
without an associated memory access.
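
For illustration, the cases from the start of the thread map roughly onto
C11 as follows (a sketch; these are the C11-idiomatic counterparts, not
exact equivalents of the Linux macros):

#include <stdatomic.h>

_Atomic int var;
int tmp;

/* 1. var = tmp; smp_wmb();  ->  a release store */
atomic_store_explicit(&var, tmp, memory_order_release);

/* 2. tmp = var; smp_rmb();  ->  an acquire load */
tmp = atomic_load_explicit(&var, memory_order_acquire);

/* 3. ordering at a point with no associated access */
atomic_thread_fence(memory_order_acquire);

On amd64 all of these compile down to plain loads/stores plus compiler-level
ordering, which matches the "compiler barrier only" goal.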
Warner Losh
2015-04-09 13:06:03 UTC
Post by Mateusz Guzik
[scratching old content so that I hopefully re-state it nicer]
I would like to revive the discussion about memory barriers provided in
the kernel.
The kernel (at least on amd64) lacks lightweight barriers providing only
the following guarantees:
- all writes are completed prior to a given point
- all reads are completed prior to a given point
What does “completed” mean? That’s a very crappy definition since it
could mean a range of things:
(1) The CPU has pushed the writes to cache, but the cache is coherent across the whole CPU complex.
(2) The cache has flushed it to the memory controller / PCIe bridge
(3) The memory has actually been updated / The PCIe bridge has pushed the write to the card.
(4) The card has completed its transaction.

Which one is it? What’s its purpose? An efficient implementation requires
the definition be towards the start of the list. Convenient programming of
certain devices, however, requires the end.


Warner
