Discussion: RFC: setting performance_cx_lowest=C2 in -HEAD to avoid lock contention on many-CPU boxes
Adrian Chadd
2015-04-25 16:31:31 UTC
Hi!

I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80-thread box this is
costly.

I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think this is still a good
default for both servers and desktops.
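
For concreteness, here's a minimal userland sketch of poking the same knob
by hand. It assumes the usual sysctl names (dev.cpu.0.cx_supported and the
global hw.acpi.cpu.cx_lowest, which is what the performance_cx_lowest /
economy_cx_lowest rc.conf settings ultimately feed via power_profile);
double-check those names against your tree, and the write needs root:

/* Sketch: show the advertised Cx states and cap the lowest one at C2. */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    char buf[256];
    size_t len = sizeof(buf);
    const char *lowest = "C2";  /* the proposed default */

    /* Which Cx states does ACPI advertise for cpu0? */
    if (sysctlbyname("dev.cpu.0.cx_supported", buf, &len, NULL, 0) == -1)
        err(1, "dev.cpu.0.cx_supported");
    printf("cx_supported: %s\n", buf);

    /* Cap all CPUs at ACPI C2 (assumed global knob; needs root). */
    if (sysctlbyname("hw.acpi.cpu.cx_lowest", NULL, NULL,
        lowest, strlen(lowest)) == -1)
        err(1, "hw.acpi.cpu.cx_lowest");
    return (0);
}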

If no-one has a problem with this then I'll do it after the weekend.

Thanks!



-adrian
K. Macy
2015-04-25 17:18:50 UTC
Perhaps use an arbitrary cutoff - say <= 8 cores - where the
cx_lowest=C3. This serialization isn't going to hurt on systems with
more modest core counts.
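
For illustration, a hypothetical userland version of that cutoff, using
hw.ncpu as the count (the threshold of 8 is just the example number above,
not a measured value):

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

int
main(void)
{
    int ncpu;
    size_t len = sizeof(ncpu);

    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) == -1)
        err(1, "hw.ncpu");

    /* Small boxes keep C3; many-CPU boxes drop to C2 to dodge the lock. */
    printf("suggested cx_lowest: %s (ncpu=%d)\n",
        ncpu <= 8 ? "C3" : "C2", ncpu);
    return (0);
}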
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
Thanks!
-adrian
Adrian Chadd
2015-04-25 17:37:40 UTC
Post by K. Macy
Perhaps use an arbitrary cutoff - say <= 8 cores - where the
cx_lowest=C3. This serialization isn't going to hurt on systems with
more modest core counts.
Maybe. I bet it's a function of the idle-state entry rate and the core
count - so maybe at 8 cores it'll hurt, but only if it's entering idle
at a high rate. E.g., if it's taking a hell of a lot of interrupts but
not maxing out the CPU.




-adrian
Adrian Chadd
2015-04-25 17:44:16 UTC
Oh, the other thing, which I just mentioned to kip in IRC - all of the
Intel laptops I've tested (and that's a long list) don't enter CPU C7
if the power is plugged in.

Ie:

* power in, ACPI C2 -> CPU C6
* power in, ACPI C3 -> CPU C6
* battery, ACPI C2 -> CPU C6
* battery, ACPI C3 -> CPU C7

Having performance_cx_lowest=C3 is thus effectively a no-op on the
devices where it would matter, so it's okay to just flip it to C2.
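
For anyone who wants to repeat that check: one way is to sample the
per-core C-state residency MSRs through cpuctl(4) and see which counters
advance while the machine idles. The MSR numbers below (0x3fd for core C6,
0x3fe for core C7) are the usual Intel ones, but verify them against the
SDM for the part in question; the cpuctl module has to be loaded and this
needs root:

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/cpuctl.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t
rdmsr(int fd, int msr)
{
    cpuctl_msr_args_t args;

    args.msr = msr;
    if (ioctl(fd, CPUCTL_RDMSR, &args) == -1)
        err(1, "CPUCTL_RDMSR %#x", msr);
    return (args.data);
}

int
main(void)
{
    int fd;
    uint64_t c6a, c6b, c7a, c7b;

    if ((fd = open("/dev/cpuctl0", O_RDONLY)) == -1)
        err(1, "/dev/cpuctl0");

    c6a = rdmsr(fd, 0x3fd);     /* core C6 residency (assumed MSR number) */
    c7a = rdmsr(fd, 0x3fe);     /* core C7 residency (assumed MSR number) */
    sleep(5);                   /* let the core idle */
    c6b = rdmsr(fd, 0x3fd);
    c7b = rdmsr(fd, 0x3fe);

    printf("core C6 delta %ju, core C7 delta %ju\n",
        (uintmax_t)(c6b - c6a), (uintmax_t)(c7b - c7a));
    close(fd);
    return (0);
}

If the C7 counter never advances on AC power, that matches the behaviour
described above.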



-adrian
K. Macy
2015-04-25 18:02:09 UTC
Um, don't you care more about power savings when it's on battery?
Post by Adrian Chadd
Oh the other thing, which I just mentioned to kip in IRC - all of the
intel laptops I've tested (and that's a long list) don't enter CPU C7
if the power is plugged in.
* power in, ACPI C2 -> CPU C6
* power in, ACPI C3 - CPU C6
* battery - ACPI C2 -> CPU C6
* battery -> ACPI C3 -> CPU C7
So having performance_cx_lowest=C3 is effectively a no-op on the
devices that it'd matter on, so it's okay to just flip it to C2.
-adrian
Adrian Chadd
2015-04-25 18:05:45 UTC
Post by K. Macy
Um, don't you care more about power savings when it's on battery?
Right - that's why I'm changing performance_cx_lowest, and not
economy_cx_lowest.

performance == AC power
economy == battery
Post by K. Macy
Post by Adrian Chadd
Oh the other thing, which I just mentioned to kip in IRC - all of the
intel laptops I've tested (and that's a long list) don't enter CPU C7
if the power is plugged in.
* power in, ACPI C2 -> CPU C6
* power in, ACPI C3 - CPU C6
* battery - ACPI C2 -> CPU C6
* battery -> ACPI C3 -> CPU C7
So having performance_cx_lowest=C3 is effectively a no-op on the
devices that it'd matter on, so it's okay to just flip it to C2.
-adrian
Davide Italiano
2015-04-25 18:18:05 UTC
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
This sounds to me like just a way to hide a problem.
Very few people nowadays run on NUMA machines, and they can tune the
machine as they like when they do testing.
If there's a lock contention problem, it needs to be fixed and not
hidden under another default.
Also, as already noted, this is a problem on 80-core machines but
probably not on a 2-core Atom. I think you need to understand the
factors better and come up with a more sensible relation. In other
words, your bet needs to be proven before changing a default in a way
that is useful for few but can impact many.
--
Davide

"There are no solved problems; there are only problems that are more
or less solved" -- Henri Poincare
Adrian Chadd
2015-04-25 18:45:10 UTC
Post by Davide Italiano
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
This sounds to me just a way to hide a problem.
Very few people nowaday run on NUMA and they can tune the machine as
they like when they do testing.
If there's a lock contention problem, it needs to be fixed and not
hidden under another default.
The lock contention problem is inside ACPI and how it's designed/implemented.
We're not going to be able to easily make the ACPI locking "better", as
we're constrained by how ACPI implements things in the shared ACPICA code.
Post by Davide Italiano
Also, as already noted this is a problem on 80-core machines but
probably not on a 2-core Atom. I think you need to understand factors
better and come up with a more sensible relation. In other words, your
bet needs to be proven before changing a default useful for frew that
can impact many.
I've just described the differences in behaviour. I've checked the
C-states on all the Intel servers too - with power plugged in, ACPI C2
and ACPI C3 both still result in entering the CPU C6 state, not CPU C7 -
so it's not going to result in worse behaviour.

For reference, "all" being the following list:

* Westmere-EX
* Nehalem
* Sandybridge
* Sandybridge mobile
* Sandybridge Xeon
* Ivybridge mobile
* Ivybridge Xeon
* Haswell mobile
* Haswell
* Haswell Xeon
* Haswell Xeon v3



-adrian
John Baldwin
2015-04-28 13:35:10 UTC
Post by Adrian Chadd
Post by Davide Italiano
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
This sounds to me just a way to hide a problem.
Very few people nowaday run on NUMA and they can tune the machine as
they like when they do testing.
If there's a lock contention problem, it needs to be fixed and not
hidden under another default.
The lock contention problem is inside ACPI and how it's designed/implemented.
We're not going to easily be able to make ACPI lock "better" as we're
constrained by how ACPI implements things in the shared ACPICA code.
Is the contention actually harmful? Note that this only happens when the
CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle
stuff uses heuristics to only drop into deeper sleep states if the CPU has
recently been idle "more", so that if you are relatively busy you will only
go into C1 instead. (I think this latter part might have changed since
eventtimers came in; it looks like we now choose the idle state based on
how long until the next timer interrupt?)
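
As a rough illustration of that kind of selection (names and numbers
invented; this is not the actual acpi_cpu.c logic):

#include <stdint.h>
#include <stdio.h>

struct cx_state {
    const char *name;
    uint32_t trans_lat_us;  /* ACPI-reported transition latency */
};

static const struct cx_state cx_states[] = {
    { "C1", 1 },
    { "C2", 50 },
    { "C3", 150 },
};

/* Deepest allowed state whose latency (with a 2x margin) fits the sleep. */
static int
cx_select(uint64_t sleep_us, int cx_lowest)
{
    int i, best = 0;

    for (i = 0; i <= cx_lowest &&
        i < (int)(sizeof(cx_states) / sizeof(cx_states[0])); i++)
        if (cx_states[i].trans_lat_us * 2 <= sleep_us)
            best = i;
    return (best);
}

int
main(void)
{
    printf("20us until next event   -> %s\n", cx_states[cx_select(20, 2)].name);
    printf("1000us until next event -> %s\n", cx_states[cx_select(1000, 2)].name);
    return (0);
}

The point being: a busy box taking frequent interrupts mostly lands in C1
anyway, so the C3 path (and its lock) is hit hardest when the machine is
largely idle.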

If the only consequence of this is that it adds noise to profiling, then hack
your profiling results to ignore this lock. I think that is a better tradeoff
than sacrificing power gains to reduce noise in profiling output.

Alternatively, your machine may be better off using cpu_idle_mwait. There
are already CPUs that only advertise deeper sleep states for use with
mwait but not ACPI, so we may well end up defaulting to mwait instead of
ACPI for certain CPUs anyway.
--
John Baldwin
Konstantin Belousov
2015-04-28 14:13:02 UTC
Post by John Baldwin
Post by Adrian Chadd
Post by Davide Italiano
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
This sounds to me just a way to hide a problem.
Very few people nowaday run on NUMA and they can tune the machine as
they like when they do testing.
If there's a lock contention problem, it needs to be fixed and not
hidden under another default.
The lock contention problem is inside ACPI and how it's designed/implemented.
We're not going to easily be able to make ACPI lock "better" as we're
constrained by how ACPI implements things in the shared ACPICA code.
Is the contention actually harmful? Note that this only happens when the
CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle
stuff uses hueristics to only drop into deeper sleep states if the CPU has
recently been idle "more" so that if you are relatively busy you will only go
into C1 instead. (I think this latter might have changed since eventtimers
came in, it looks like we now choose the idle state based on how long until
the next timer interrupt?)
You have to spin, waiting for other cores, to get the right to reduce the
power state.
Post by John Baldwin
If the only consequence of this is that it adds noise to profiling, then hack
your profiling results to ignore this lock. I think that is a better tradeoff
than sacraficing power gains to reduce noise in profiling output.
I suspect that it adds latency, since interrupts cannot stop the wait for
the ACPI lock. Also, it probably increases the power usage, since the CPU
has to spend more time contending for the lock instead of sleeping.
Post by John Baldwin
Alternatively, your machine may be better off using cpu_idle_mwait. There
are already CPUs now that only advertise deeper sleep states for use with
mwait but not ACPI, so we may certainly end up with defaulting to mwait
instead of ACPI for certain CPUs anyway.
cpu_idle_mwait is quite useless; it only enters C1, which should be
almost the same as hlt. mwait for C1 might reduce the wakeup latency,
but it definitely would not reduce power consumption on par with the
higher Cx states.

That said, I think that for non-laptop usage, limiting the lowest state
to C2 is fine. For Haswells, Intel's recommendation to BIOS writers is to
limit the announced states to C2 (eliminating the BM avoidance entirely).
Internally, ACPI C2 is mapped to CPU C6, or maybe even C7.
John Baldwin
2015-04-28 14:23:33 UTC
Post by Konstantin Belousov
Post by John Baldwin
Post by Adrian Chadd
Post by Davide Italiano
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
This sounds to me just a way to hide a problem.
Very few people nowaday run on NUMA and they can tune the machine as
they like when they do testing.
If there's a lock contention problem, it needs to be fixed and not
hidden under another default.
The lock contention problem is inside ACPI and how it's designed/implemented.
We're not going to easily be able to make ACPI lock "better" as we're
constrained by how ACPI implements things in the shared ACPICA code.
Is the contention actually harmful? Note that this only happens when the
CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle
stuff uses hueristics to only drop into deeper sleep states if the CPU has
recently been idle "more" so that if you are relatively busy you will only go
into C1 instead. (I think this latter might have changed since eventtimers
came in, it looks like we now choose the idle state based on how long until
the next timer interrupt?)
You have to spin, waiting other cores, to get the right to reduce the
power state.
Yes, normally spinning wouldn't do that, but the CPU idle hooks run with
interrupts disabled. We could perhaps fix that, though ACPI doesn't quite
have what we would want (a single op that would disable interrupts after
grabbing the lock, do the test-and-set of the bit in question, and return
its old value, leaving interrupts disabled after dropping the lock).
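
Roughly the shape of the primitive being described, with userland stubs
standing in for the ACPICA lock, the register bit and interrupt control
(nothing like this exists in ACPICA today; it's purely to show the
ordering):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_flag acpi_hw_lock = ATOMIC_FLAG_INIT; /* stand-in for the ACPI hardware lock */
static unsigned int pm2_control;                    /* stand-in for the ARB_DIS/BM register */

static void
intr_disable(void)
{
    /* Real code would save flags and disable interrupts here; this is a stub. */
}

/* Returns the bit's old value; interrupts stay disabled on return. */
static bool
acpi_testandset_locked(unsigned int bit)
{
    bool old;

    /* Spin for the lock while interrupts are still enabled... */
    while (atomic_flag_test_and_set(&acpi_hw_lock))
        ;
    intr_disable();             /* ...then shut them off */
    old = (pm2_control & bit) != 0;
    pm2_control |= bit;
    atomic_flag_clear(&acpi_hw_lock);
    return (old);               /* caller proceeds straight into idle */
}

int
main(void)
{
    bool a = acpi_testandset_locked(1);
    bool b = acpi_testandset_locked(1);

    printf("first call: %d, second call: %d\n", a, b);
    return (0);
}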

However, I would still like to know if the contention here is actually
harmful in some measurable way aside from showing up in profiling output.
Post by Konstantin Belousov
Post by John Baldwin
Alternatively, your machine may be better off using cpu_idle_mwait. There
are already CPUs now that only advertise deeper sleep states for use with
mwait but not ACPI, so we may certainly end up with defaulting to mwait
instead of ACPI for certain CPUs anyway.
cpu_idle_mwait is quite useless, it only enters C1, which should be
almost the same as hlt. mwait for C1 might reduce latency of waking up,
but definitely would not reduce power consumption on par with higher Cx.
Mmm, it was your pending patch I was thinking of. Don't you use mwait with
the hints to use deeper sleep states in your change?
Post by Konstantin Belousov
That said, I think that for non-laptop usage, limiting lowest state to C2
is fine. For Haswells, Intel recommendation for BIOS writers is to
limit the announced states to C2 (eliminating the BM avoidance at all).
Internally ACPI C2 is mapped to CPU C6 or might be even C7.
The problem, of course, is detecting non-laptops. :-/ In my own crude
measurements, based on the power draw numbers from the BMC on recent
SuperMicro X9 boards for SandyBridge servers, most of the gain you get is
from C2; C3 doesn't add much once you are able to do C2. Also of note is
the comment above the busmaster register in question about USB. I'm not
sure that is still true anymore. If it were, systems would never go into
C3, in which case this would be a moot point and there would be no need
to enable C3.
--
John Baldwin
Konstantin Belousov
2015-04-28 16:55:03 UTC
Post by John Baldwin
Post by Konstantin Belousov
Post by John Baldwin
Post by Adrian Chadd
Post by Davide Italiano
Post by Adrian Chadd
Hi!
I've been doing some NUMA testing on large boxes and I've found that
there's lock contention in the ACPI path. It's due to my change a
while ago to start using sleep states above ACPI C1 by default. The
ACPI C3 state involves a bunch of register fiddling in the ACPI sleep
path that grabs a serialiser lock, and on an 80 thread box this is
costly.
I'd like to drop performance_cx_lowest to C2 in -HEAD. ACPI C2 state
doesn't require the same register fiddling (to disable bus mastering,
if I'm reading it right) and so it doesn't enter that particular
serialised path. I've verified on Westmere-EX, Sandybridge, Ivybridge
and Haswell boxes that ACPI C2 does let one drop down into a deeper
CPU sleep state (C6 on each of these). I think is still a good default
for both servers and desktops.
If no-one has a problem with this then I'll do it after the weekend.
This sounds to me just a way to hide a problem.
Very few people nowaday run on NUMA and they can tune the machine as
they like when they do testing.
If there's a lock contention problem, it needs to be fixed and not
hidden under another default.
The lock contention problem is inside ACPI and how it's designed/implemented.
We're not going to easily be able to make ACPI lock "better" as we're
constrained by how ACPI implements things in the shared ACPICA code.
Is the contention actually harmful? Note that this only happens when the
CPUs are idle, not when doing actual work. In addition, IIRC, the ACPI idle
stuff uses hueristics to only drop into deeper sleep states if the CPU has
recently been idle "more" so that if you are relatively busy you will only go
into C1 instead. (I think this latter might have changed since eventtimers
came in, it looks like we now choose the idle state based on how long until
the next timer interrupt?)
You have to spin, waiting other cores, to get the right to reduce the
power state.
Yes, normally spinning wouldn't do that, but the cpu idle hooks run with
interrupts disabled. We could fix that perhaps though Acpi doesn't quite
have what we would want (a single op that would disable interrupts after
grabbing the lock, do the test and set of the bit in question and return
its old value leaving interrupts disabled after dropping the lock).
However, I would still like to know if the contention here is actually
harmful in some measurable way aside from showing up in profiling output.
I think Adrian could run intel pmc on his box with C2 and C3 and
compare the power reports.
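
One low-tech way to get that comparison without extra tooling, assuming an
Intel CPU with RAPL and the cpuctl(4) interface (the MSR numbers 0x606
RAPL_POWER_UNIT and 0x611 PKG_ENERGY_STATUS are the usual ones - check the
SDM; run it once with cx_lowest=C2 and once with C3 on an idle box):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/cpuctl.h>
#include <err.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint64_t
rdmsr(int fd, int msr)
{
    cpuctl_msr_args_t args;

    args.msr = msr;
    if (ioctl(fd, CPUCTL_RDMSR, &args) == -1)
        err(1, "CPUCTL_RDMSR %#x", msr);
    return (args.data);
}

int
main(void)
{
    int fd;
    double unit;
    uint64_t e0, e1;

    if ((fd = open("/dev/cpuctl0", O_RDONLY)) == -1)
        err(1, "/dev/cpuctl0");

    /* Energy unit is 1/2^N joules; N lives in bits 12:8 of RAPL_POWER_UNIT. */
    unit = 1.0 / (double)(1u << ((rdmsr(fd, 0x606) >> 8) & 0x1f));
    e0 = rdmsr(fd, 0x611) & 0xffffffff;     /* 32-bit counter; ignores wrap */
    sleep(10);
    e1 = rdmsr(fd, 0x611) & 0xffffffff;
    printf("average package power: %.2f W\n", (double)(e1 - e0) * unit / 10.0);
    close(fd);
    return (0);
}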
Post by John Baldwin
Post by Konstantin Belousov
Post by John Baldwin
Alternatively, your machine may be better off using cpu_idle_mwait. There
are already CPUs now that only advertise deeper sleep states for use with
mwait but not ACPI, so we may certainly end up with defaulting to mwait
instead of ACPI for certain CPUs anyway.
cpu_idle_mwait is quite useless, it only enters C1, which should be
almost the same as hlt. mwait for C1 might reduce latency of waking up,
but definitely would not reduce power consumption on par with higher Cx.
Mmm, it was your pending patch I was thinking of. Don't you use mwait with
the hints to use deeper sleep states in your change?
Only in the acpi idle method. It is not safe to blindly enter states
higher than C1 with mwait.

Intel wrote a driver for Linux which does not rely on the ACPI _CST
tables for this. The driver has hard-coded tables for cores >= Nehalem
which specify the supported states, their latency and cache behaviour.
This is what I tried to mention in the original mail. If we wrote such
a driver (and ripped the tables from Linux), we could allow deeper
states in cpu_idle_mwait. But I remember that avg did not like the
approach, and I agree that it is not maintainable if you are not Intel.
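
For reference, the rough shape of such a table, loosely modelled on what
intel_idle carries for Sandy Bridge; the hint values and latencies below
are illustrative only and would have to come from Intel's documentation
per microarchitecture:

#include <stdint.h>
#include <stdio.h>

struct mwait_cstate {
    const char *name;
    uint32_t mwait_hint;    /* EAX hint passed to MWAIT */
    uint32_t exit_lat_us;   /* worst-case exit latency */
    int flushes_cache;      /* whether caches are flushed in this state */
};

/* Example entries, roughly Sandy Bridge shaped (values illustrative). */
static const struct mwait_cstate snb_cstates[] = {
    { "C1", 0x00,   2, 0 },
    { "C3", 0x10,  80, 1 },
    { "C6", 0x20, 104, 1 },
    { "C7", 0x30, 109, 1 },
};

int
main(void)
{
    size_t i;

    for (i = 0; i < sizeof(snb_cstates) / sizeof(snb_cstates[0]); i++)
        printf("%-3s hint=%#x lat=%uus cache_flush=%d\n",
            snb_cstates[i].name, snb_cstates[i].mwait_hint,
            snb_cstates[i].exit_lat_us, snb_cstates[i].flushes_cache);
    return (0);
}

Maintaining something like that per microarchitecture forever is exactly
the objection avg raised.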
Post by John Baldwin
Post by Konstantin Belousov
That said, I think that for non-laptop usage, limiting lowest state to C2
is fine. For Haswells, Intel recommendation for BIOS writers is to
limit the announced states to C2 (eliminating the BM avoidance at all).
Internally ACPI C2 is mapped to CPU C6 or might be even C7.
The problem of course is detecting non-laptops. :-/ In my own crude
measurements based on the power draw numbers in the BMC on recent
SuperMicro X9 boards for SandyBridge servers, most of the gain you get is
from C2; C3 doesn't add much difference once you are able to do C2. Also of
note is the comment above the busmaster register in question about USB. I'm
not sure if that is still true anymore. If it were, systems would never go
into C3 in which case this would be a moot point and there would be no need to
enable C3.
I remember that turbo boost requires C3, and non-trivially deep package
C-states on older CPUs also require C3. This is an argument against
Adrian's change, but I think it is not applicable to newer processors.
