Discussion: ULE steal_idle questions
Don Lewis
2017-08-23 15:04:49 UTC
I've been looking at the steal_idle code in tdq_idled() and found some
things that puzzle me.

Consider a machine with three CPUs:
A, which is idle
B, which is busy running a thread
C, which is busy running a thread and has another thread in queue
It would seem to make sense that the tdq_load values for these three
CPUs would be 0, 1, and 2 respectively in order to select the best CPU
to run a new thread.

If so, then why do we pass thresh=1 to sched_highest() in the code that
implements steal_idle? That value is used to set cs_limit which is used
in this comparison in cpu_search:
    if (match & CPU_SEARCH_HIGHEST)
        if (tdq->tdq_load >= hgroup.cs_limit &&
That would seem to make CPU B a candidate for stealing a thread from.
Ignoring CPU C for the moment, that shouldn't happen if the thread is
running, but even if it was possible, it would just make CPU B go idle,
which isn't terribly helpful in terms of load balancing and would just
thrash the caches. The same comparison is repeated in tdq_idled() after
a candidate CPU has been chosen:
    if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
        tdq_unlock_pair(tdq, steal);
        continue;
    }

It looks to me like there is an off-by-one error here, and there is a
similar problem in the code that implements kern.sched.balance.
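
To make the off-by-one concrete, here is a minimal illustration (not kernel
code, just the cpu_search() comparison applied to the example loads above):

    /* Illustration only: the A/B/C example loads from above. */
    int load_B = 1;      /* one running thread, nothing queued */
    int load_C = 2;      /* one running thread plus one queued thread */
    int cs_limit = 1;    /* thresh=1 as passed by the steal_idle code */

    int b_is_candidate = (load_B >= cs_limit);  /* 1: B passes, although its
                                                   only "load" can never be
                                                   stolen */
    int c_is_candidate = (load_C >= cs_limit);  /* 1: C passes and really has
                                                   a queued thread to take */
    /* With cs_limit = 2 (thresh=2), only CPU C would qualify. */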


The reason I ask is that I've been debugging random segfaults and other
strange errors on my Ryzen machine and the problems mostly go away if I
either disable kern.sched.steal_idle and kern.sched.balance, or if I
leave kern.sched.steal_idle enabled and hack the code to change the
value of thresh from 1 to 2. See
<https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029> for the gory
details. I don't know if my CPU has what AMD calls the "performance
marginality issue".
Andriy Gapon
2017-08-23 19:26:36 UTC
Post by Don Lewis
I've been looking at the steal_idle code in tdq_idled() and found some
things that puzzle me.
A, which is idle
B, which is busy running a thread
C, which is busy running a thread and has another thread in queue
It would seem to make sense that the tdq_load values for these three
CPUs would be 0, 1, and 2 respectively in order to select the best CPU
to run a new thread.
If so, then why do we pass thresh=1 to sched_highest() in the code that
implements steal_idle? That value is used to set cs_limit which is used
if (match & CPU_SEARCH_HIGHEST)
if (tdq->tdq_load >= hgroup.cs_limit &&
That would seem to make CPU B a candidate for stealing a thread from.
Ignoring CPU C for the moment, that shouldn't happen if the thread is
running, but even if it was possible, it would just make CPU B go idle,
which isn't terribly helpful in terms of load balancing and would just
thrash the caches. The same comparison is repeated in tdq_idled() after
if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
tdq_unlock_pair(tdq, steal);
continue;
}
It looks to me like there is an off-by-one error here, and there is a
similar problem in the code that implements kern.sched.balance.
I agree with your analysis. I had the same questions as well.
I think that the tdq_transferable check is what saves the code from
running into any problems. But it indeed would make sense for the code
to understand that tdq_load includes a currently running, never
transferable thread as well.
Post by Don Lewis
The reason I ask is that I've been debugging random segfaults and other
strange errors on my Ryzen machine and the problems mostly go away if I
either disable kern.sched.steal_idle and kern_sched.balance, or if I
leave kern_sched.steal_idle enabled and hack the code to change the
value of thresh from 1 to 2. See
<https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029> for the gory
details. I don't know if my CPU has what AMD calls the "performance
marginality issue".
I have been following your experiments and it's interesting that
"massaging" the CPU in certain ways makes it a bit happier. But
certainly the fault is with the CPU as the code is trouble-free on many
different architectures including x86, and various processors from both
Intel and AMD [with earlier CPU families].
--
Andriy Gapon
Don Lewis
2017-08-23 20:58:42 UTC
Post by Andriy Gapon
Post by Don Lewis
I've been looking at the steal_idle code in tdq_idled() and found some
things that puzzle me.
A, which is idle
B, which is busy running a thread
C, which is busy running a thread and has another thread in queue
It would seem to make sense that the tdq_load values for these three
CPUs would be 0, 1, and 2 respectively in order to select the best CPU
to run a new thread.
If so, then why do we pass thresh=1 to sched_highest() in the code that
implements steal_idle? That value is used to set cs_limit which is used
if (match & CPU_SEARCH_HIGHEST)
if (tdq->tdq_load >= hgroup.cs_limit &&
That would seem to make CPU B a candidate for stealing a thread from.
Ignoring CPU C for the moment, that shouldn't happen if the thread is
running, but even if it was possible, it would just make CPU B go idle,
which isn't terribly helpful in terms of load balancing and would just
thrash the caches. The same comparison is repeated in tdq_idled() after
if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
tdq_unlock_pair(tdq, steal);
continue;
}
It looks to me like there is an off-by-one error here, and there is a
similar problem in the code that implements kern.sched.balance.
I agree with your analysis. I had the same questions as well.
I think that the tdq_transferable check is what saves the code from
running into any problems. But it indeed would make sense for the code
to understand that tdq_load includes a currently running, never
transferable thread as well.
Yes, I think the steal attempt will fail the tdq_transferable check, but
at the cost of an unnecessary tdq_lock_pair()/tdq_unlock_pair() and
another loop iteration, including another expensive sched_highest()
call. Consider the case of a close to fully loaded system where all of
the other CPUs each have one running thread. The current code will try
each of the other CPUs, calling tdq_lock_pair(), failing the
tdq_transferable check, calling tdq_unlock_pair(), then restarting the
loop with that CPU removed from mask, all of this with interrupts
disabled. The proper thing to do in this case would be to just go into
the idle state.
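
For reference, the loop in question looks roughly like this (paraphrased
from the current tdq_idled(), details elided, so not a verbatim copy):

    spinlock_enter();               /* interrupts off for the entire walk */
    for (cg = tdq->tdq_cg; cg != NULL; ) {
        thresh = (cg->cg_flags & CG_FLAG_THREAD) ? 1 : steal_thresh;
        cpu = sched_highest(cg, mask, thresh);  /* scans every CPU in cg */
        if (cpu == -1) {
            cg = cg->cg_parent;                 /* widen to the next level */
            continue;
        }
        steal = TDQ_CPU(cpu);
        CPU_CLR(cpu, &mask);
        tdq_lock_pair(tdq, steal);
        if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
            tdq_unlock_pair(tdq, steal);
            continue;           /* rescan the same group minus this CPU */
        }
        if (tdq->tdq_load == 0 && tdq_move(steal, tdq) == 0) {
            tdq_unlock_pair(tdq, steal);
            continue;           /* likewise after a failed tdq_move() */
        }
        spinlock_exit();
        TDQ_UNLOCK(steal);
        mi_switch(SW_VOL | SWT_IDLE, NULL);
        thread_unlock(curthread);
        return (0);
    }
    spinlock_exit();
    return (1);

In the fully loaded scenario above, every iteration goes through the
lock/unlock dance and another sched_highest() scan before finally giving up.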
Post by Andriy Gapon
Post by Don Lewis
The reason I ask is that I've been debugging random segfaults and other
strange errors on my Ryzen machine and the problems mostly go away if I
either disable kern.sched.steal_idle and kern_sched.balance, or if I
leave kern_sched.steal_idle enabled and hack the code to change the
value of thresh from 1 to 2. See
<https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029> for the gory
details. I don't know if my CPU has what AMD calls the "performance
marginality issue".
I have been following your experiments and it's interesting that
"massaging" the CPU in certain ways makes it a bit happier. But
certainly the fault is with the CPU as the code is trouble-free on many
different architectures including x86, and various processors from both
Intel and AMD [with earlier CPU families].
The results of my experiments so far are pointing to the time spent
looping in tdq_idled() as at least one cause of the problems. If I
restrict tdq_idled() to look at just a single core, things are happy. If
I then set steal_thresh to 1 so that it has to loop, then things get
unhappy. If I allow tdq_idled() to look at both the current core and
current CCX with thresh at the CCX level set to 1 so that the code
loops, things are unhappy. If I set thresh to 2 so that the code does
not loop unnecessarily, things get happy again. If I allow tdq_idled()
to look at the entire topology (so that there are now three calls to
sched_highest()), things get unhappy again.

There is a known issue with executing IRET on one SMT thread when the
other SMT thread on that core is "busy", but I don't know how that would
affect things. Perhaps that can be fixed in microcode.

sched_highest() looks like it is really expensive in terms of CPU
cycles. On Ryzen, if we can't find a suitable thread on the current CCX
to transfer and step up to the chip level, sched_highest() will
recalculate the load on the current CCX even though we have already
rejected it. Things get worse when using cpuset because if tdq_move()
fails due to cpuset (or other) restrictions, then we call
sched_highest() all over again to redo all the calculations, but with
the previously chosen CPU removed from the potential choices. This will
get even worse with Threadripper and beyond. Even if it doesn't cause
obvious breakage, it is bad for interrupt latency.

I'm not convinced that the ghc and go problems are Ryzen bugs and not
bugs in the code for those two ports. I've never seen build failures
for those on my FX-8320E, but the increased number of threads on Ryzen
might expose some latent problems.
Konstantin Belousov
2017-08-24 13:00:36 UTC
Post by Don Lewis
I'm not convinced that the ghc and go problems are Ryzen bugs and not
bugs in the code for those two ports. I've never seen build failures
for those on my FX-8320E, but the increased number of threads on Ryzen
might expose some latent problems.
Could you post the verbose dmesg from Ryzen's boot somewhere ?
Don Lewis
2017-08-24 15:52:17 UTC
Post by Konstantin Belousov
Post by Don Lewis
I'm not convinced that the ghc and go problems are Ryzen bugs and not
bugs in the code for those two ports. I've never seen build failures
for those on my FX-8320E, but the increased number of threads on Ryzen
might expose some latent problems.
Could you post the verbose dmesg from Ryzen's boot somewhere ?
https://people.freebsd.org/~truckman/ryzen-dmesg.boot
Don Lewis
2017-08-24 16:41:03 UTC
Aside from the Ryzen problem, I think the steal_idle code should be
re-written so that it doesn't block interrupts for so long. In its
current state, interrupt latency increases with the number of cores and
the complexity of the topology.

What I'm thinking is that we should set a flag at the start of the
search for a thread to steal. If we are preempted by another, higher
priority thread, that thread will clear the flag. Next we start the
loop to search up the hierarchy. Once we find a candidate CPU:

    steal = TDQ_CPU(cpu);
    CPU_CLR(cpu, &mask);
    tdq_lock_pair(tdq, steal);
    if (tdq->tdq_load != 0) {
        goto out;       /* exit the loop and switch to the new thread */
    }
    if (flag was cleared) {
        tdq_unlock_pair(tdq, steal);
        goto restart;   /* restart the search */
    }
    if (steal->tdq_load < thresh || steal->tdq_transferable == 0 ||
        tdq_move(steal, tdq) == 0) {
        tdq_unlock_pair(tdq, steal);
        continue;
    }
out:
    TDQ_UNLOCK(steal);
    clear flag;
    mi_switch(SW_VOL | SWT_IDLE, NULL);
    thread_unlock(curthread);
    return (0);

And we also have to clear the flag if we did not find a thread to steal.
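
One way to get that flag without adding a new field would be to reuse the
per-tdq switch counters, much like the test that sched_idletd() already
does; a rough sketch, assuming tdq_switchcnt/tdq_oldswitchcnt keep their
current meaning:

    restart:
        /* Snapshot the switch counters before starting the search. */
        switchcnt = tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt;
        /* ... run the search loop above; after tdq_lock_pair(tdq, steal): */
        if (switchcnt != tdq->tdq_switchcnt + tdq->tdq_oldswitchcnt) {
            /* Something ran on this CPU while we were searching, so the
             * load data we collected may be stale; unlock and start over. */
            tdq_unlock_pair(tdq, steal);
            goto restart;
        }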
Don Lewis
2017-08-24 19:25:13 UTC
Post by Andriy Gapon
Post by Don Lewis
I've been looking at the steal_idle code in tdq_idled() and found some
things that puzzle me.
A, which is idle
B, which is busy running a thread
C, which is busy running a thread and has another thread in queue
It would seem to make sense that the tdq_load values for these three
CPUs would be 0, 1, and 2 respectively in order to select the best CPU
to run a new thread.
If so, then why do we pass thresh=1 to sched_highest() in the code that
implements steal_idle? That value is used to set cs_limit which is used
if (match & CPU_SEARCH_HIGHEST)
if (tdq->tdq_load >= hgroup.cs_limit &&
That would seem to make CPU B a candidate for stealing a thread from.
Ignoring CPU C for the moment, that shouldn't happen if the thread is
running, but even if it was possible, it would just make CPU B go idle,
which isn't terribly helpful in terms of load balancing and would just
thrash the caches. The same comparison is repeated in tdq_idled() after
if (steal->tdq_load < thresh || steal->tdq_transferable == 0) {
tdq_unlock_pair(tdq, steal);
continue;
}
It looks to me like there is an off-by-one error here, and there is a
similar problem in the code that implements kern.sched.balance.
I agree with your analysis. I had the same questions as well.
I think that the tdq_transferable check is what saves the code from
running into any problems. But it indeed would make sense for the code
to understand that tdq_load includes a currently running, never
transferable thread as well.
Things aren't quite as bad as I initially thought. cpu_search() does
look at tdq_transferable so sched_highest() should not return a cpu that
does not have a transferable thread at the time it was examined, so in
most cases the unnecessary lock/unlock shouldn't happen. The extra
check after the lock will catch the case where tdq_transferable went to
zero between when it was examined by cpu_search() and when we actually
grabbed the lock. Using a larger thresh value for SMT threads is still
a no-op, though.
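
For reference, the test in cpu_search() is roughly the following
(paraphrased; only the first two lines are the ones quoted above, the rest
is from memory):

    if (match & CPU_SEARCH_HIGHEST)
        if (tdq->tdq_load >= hgroup.cs_limit &&
            tdq->tdq_transferable &&
            CPU_ISSET(cpu, &hgroup.cs_mask)) {
                /* record this CPU as the current best "highest" candidate */
        }

so a CPU with no transferable thread at scan time is skipped, and only the
thresh part of the comparison suffers from the off-by-one.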
Don Lewis
2017-08-25 18:24:10 UTC
Post by Don Lewis
Aside from the Ryzen problem, I think the steal_idle code should be
re-written so that it doesn't block interrupts for so long. In its
current state, interrupt latence increases with the number of cores and
the complexity of the topology.
What I'm thinking is that we should set a flag at the start of the
search for a thread to steal. If we are preempted by another, higher
priority thread, that thread will clear the flag. Next we start the
steal = TDQ_CPU(cpu);
CPU_CLR(cpu, &mask);
tdq_lock_pair(tdq, steal);
if (tdq->tdq_load != 0) {
goto out; /* to exit loop and switch to the new thread */
}
if (flag was cleared) {
tdq_unlock_pair(tdq, steal);
goto restart; /* restart the search */
}
if (steal->tdq_load < thresh || steal->tdq_transferable == 0 ||
tdq_move(steal, tdq) == 0) {
tdq_unlock_pair(tdq, steal);
continue;
}
TDQ_UNLOCK(steal);
clear flag;
mi_switch(SW_VOL | SWT_IDLE, NULL);
thread_unlock(curthread);
return (0);
And we also have to clear the flag if we did not find a thread to steal.
I've implemented something like this and added a bunch of counters to it
to get a better understanding of its behavior. Instead of adding a flag
to detect preemption, I used the same switchcnt test as is used by
sched_idletd(). These are the results of a ~9 hour poudriere run:

kern.sched.steal.none: 9971668 # no threads were stolen
kern.sched.steal.fail: 23709 # unable to steal from cpu=sched_highest()
kern.sched.steal.level2: 191839 # somewhere on this chip
kern.sched.steal.level1: 557659 # a core on this CCX
kern.sched.steal.level0: 4555426 # the other SMT thread on this core
kern.sched.steal.restart: 404 # preemption detected so restart the search
kern.sched.steal.call: 15276638 # of times tdq_idled() called

There are a few surprises here.

One is the number of failed moves. I don't know if the load on the
source CPU fell below thresh, tdq_transferable went to zero, or if
tdq_move() failed. I also wonder if the failures are evenly distributed
across CPUs. It is possible that these failures are concentrated on CPU
0, which handles most interrupts. If interrupts don't affect switchcnt,
then the data collected by sched_highest() could be a bit stale and we
would not know it.

Something else that I did not expect is how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem. This is true even right after the system has booted and no
significant load has been applied. My best guess is that because of
affinity, both the parent and child processes run on the same CPU after
fork(), and if a number of processes are forked() in quick succession,
the run queue of that CPU can get really long. Forcing a thread
migration in exec() might be a good solution.
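
A purely hypothetical sketch of the exec()-time migration idea
(sched_exec() is the existing MI scheduler hook called during execve(); the
body below is only illustrative, sched_pickcpu()'s flags argument is
assumed, and sched_migrate_to() is a made-up stand-in for the real requeue
path):

    void
    sched_exec(void)
    {
        struct thread *td = curthread;
        int cpu;

        thread_lock(td);
        if (THREAD_CAN_MIGRATE(td)) {
            /* Re-run CPU selection instead of keeping the fork()-inherited
             * affinity for the parent's CPU. */
            cpu = sched_pickcpu(td, 0);
            if (cpu != td->td_oncpu)
                sched_migrate_to(td, cpu);      /* hypothetical helper */
        }
        thread_unlock(td);
    }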
Bruce Evans
2017-08-26 00:28:50 UTC
Post by Don Lewis
...
Something else that I did not expect is the how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem. This is true even right after the system has booted and no
significant load has been applied. My best guess is that because of
affinity, both the parent and child processes run on the same CPU after
fork(), and if a number of processes are forked() in quick succession,
the run queue of that CPU can get really long. Forcing a thread
migration in exec() might be a good solution.
Since you are trying a lot of combinations, maybe you can tell us which
ones work best. SCHED_4BSD works better for me on an old 2-core system.
SCHED_ULE works better on a not-so-old 4x2 core (Haswell) system, but I
don't like it due to its complexity. It makes differences of at most
+-2%, except that when mistuned it can give -5% for real time (but
better for CPU and presumably power).

For SCHED_4BSD, I wrote fancy tuning for fork/exec and sometimes get
everything to line up for a 3% improvement (803 seconds instead of 823
on the old system, with -current much slower at 840+ and old versions
of ULE before steal_idle taking 890+). This is very resource (mainly
cache associativity?) dependent and my tuning makes little difference
on the newer system. SCHED_ULE still has bugfeatures which tend to
help large builds by reducing context switching, e.g., by bogusly
clamping all CPU-bound threads to nearly maximal priority.

Bruce
Don Lewis
2017-08-26 17:50:16 UTC
Post by Don Lewis
Post by Don Lewis
Aside from the Ryzen problem, I think the steal_idle code should be
re-written so that it doesn't block interrupts for so long. In its
current state, interrupt latence increases with the number of cores and
the complexity of the topology.
What I'm thinking is that we should set a flag at the start of the
search for a thread to steal. If we are preempted by another, higher
priority thread, that thread will clear the flag. Next we start the
steal = TDQ_CPU(cpu);
CPU_CLR(cpu, &mask);
tdq_lock_pair(tdq, steal);
if (tdq->tdq_load != 0) {
goto out; /* to exit loop and switch to the new thread */
}
if (flag was cleared) {
tdq_unlock_pair(tdq, steal);
goto restart; /* restart the search */
}
if (steal->tdq_load < thresh || steal->tdq_transferable == 0 ||
tdq_move(steal, tdq) == 0) {
tdq_unlock_pair(tdq, steal);
continue;
}
TDQ_UNLOCK(steal);
clear flag;
mi_switch(SW_VOL | SWT_IDLE, NULL);
thread_unlock(curthread);
return (0);
And we also have to clear the flag if we did not find a thread to steal.
I've implemented something like this and added a bunch of counters to it
to get a better understanding of its behavior. Instead of adding a flag
to detect preemption, I used the same switchcnt test as is used by
kern.sched.steal.none: 9971668 # no threads were stolen
kern.sched.steal.fail: 23709 # unable to steal from cpu=sched_highest()
kern.sched.steal.level2: 191839 # somewhere on this chip
kern.sched.steal.level1: 557659 # a core on this CCX
kern.sched.steal.level0: 4555426 # the other SMT thread on this core
kern.sched.steal.restart: 404 # preemption detected so restart the search
kern.sched.steal.call: 15276638 # of times tdq_idled() called
There are a few surprises here.
One is the number of failed moves. I don't know if the load on the
source CPU fell below thresh, tdq_transferable went to zero, or if
tdq_move() failed. I also wonder if the failures are evenly distributed
across CPUs. It is possible that these failures are concentrated on CPU
0, which handles most interrupts. If interrupts don't affect switchcnt,
then the data collected by sched_highest() could be a bit stale and we
would not know it.
Most of the above failed moves were due to either tdq_load dropping
below the threshold or tdq_transferable going to zero. These are evenly
distributed across the CPUs that we want to steal from. I did not bin
the results by which CPU this code was running on. Actual failures of
tdq_move() are bursty and not evenly distributed across CPUs.

I've created this review for my changes:
https://reviews.freebsd.org/D12130
Rodney W. Grimes
2017-08-26 18:12:02 UTC
Post by Bruce Evans
Post by Don Lewis
...
Something else that I did not expect is the how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem. This is true even right after the system has booted and no
significant load has been applied. My best guess is that because of
affinity, both the parent and child processes run on the same CPU after
fork(), and if a number of processes are forked() in quick succession,
the run queue of that CPU can get really long. Forcing a thread
migration in exec() might be a good solution.
Since you are trying a lot of combinations, maybe you can tell us which
ones work best. SCHED_4BSD works better for me on an old 2-core system.
SCHED_ULE works better on a not-so old 4x2 core (Haswell) system, but I
don't like it due to its complexity. It makes differences of at most
+-2% except when mistuned it can give -5% for real time (but better for
CPU and presumably power).
For SCHED_4BSD, I wrote fancy tuning for fork/exec and sometimes get
everything to like up for a 3% improvement (803 seconds instead of 823
on the old system, with -current much slower at 840+ and old versions
of ULE before steal_idle taking 890+). This is very resource (mainly
cache associativity?) dependent and my tuning makes little difference
on the newer system. SCHED_ULE still has bugfeatures which tend to
help large builds by reducing context switching, e.g., by bogusly
clamping all CPU-bound threads to nearly maximal priority.
That last bugfeature is probably what makes current systems'
interactive performance tank rather badly when under heavy
loads. Would it be hard to fix?
--
Rod Grimes ***@freebsd.org
Ian Lepore
2017-08-26 18:18:14 UTC
Post by Rodney W. Grimes
Post by Bruce Evans
Post by Don Lewis
...
Something else that I did not expect is the how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem.  This is true even right after the system has booted and no
significant load has been applied.  My best guess is that because of
affinity, both the parent and child processes run on the same CPU after
fork(), and if a number of processes are forked() in quick succession,
the run queue of that CPU can get really long.  Forcing a thread
migration in exec() might be a good solution.
Since you are trying a lot of combinations, maybe you can tell us which
ones work best.  SCHED_4BSD works better for me on an old 2-core system.
SCHED_ULE works better on a not-so old 4x2 core (Haswell) system, but I
don't like it due to its complexity.  It makes differences of at most
+-2% except when mistuned it can give -5% for real time (but better for
CPU and presumably power).
For SCHED_4BSD, I wrote fancy tuning for fork/exec and sometimes get
everything to like up for a 3% improvement (803 seconds instead of 823
on the old system, with -current much slower at 840+ and old versions
of ULE before steal_idle taking 890+).  This is very resource (mainly
cache associativity?) dependent and my tuning makes little difference
on the newer system.  SCHED_ULE still has bugfeatures which tend to
help large builds by reducing context switching, e.g., by bogusly
clamping all CPU-bound threads to nearly maximal priority.
That last bugfeature is probably what makes current systems
interactive performance tank rather badly when under heavy
loads.  Would it be hard to fix?
I would second that sentiment... as time goes on, heavily loaded
systems seem to become less and less interactive-friendly.  Also,
running heavy-load jobs such as builds with nice, even -n 20,
doesn't seem to make any noticeable difference in terms of making
un-nice'd processes more responsive (not sure there's any relationship
in the underlying causes of that, though).

-- Ian
Bruce Evans
2017-08-27 06:32:43 UTC
Post by Ian Lepore
Post by Rodney W. Grimes
[... context mostly lost to mangling of spaces to \xa0's]
on the newer system.  SCHED_ULE still has bugfeatures which tend to
Oops, I meant SCHED_4BSD.
Post by Ian Lepore
Post by Rodney W. Grimes
help large builds by reducing context switching, e.g., by bogusly
clamping all CPU-bound threads to nearly maximal priority.
That last bugfeature is probably what makes current systems
interactive performance tank rather badly when under heavy
loads.  Would it be hard to fix?
I fix it in some of my versions of SCHED_4BSD. It rarely matters.
I even turn off PREEMPTION and IPI_PREEMPTION on SMP systems to
favour large builds with fewer context switches and don't notice
interactivity problems. This depends on the shell not running the
build and not starting too many CPU hogs so that it stays at numerically
low priority.
Post by Ian Lepore
I would second that sentiment... as time goes on, heavily loaded
systems seem to become less and less interactive-friendly.  Also,
running the heavy-load jobs such as builds with nice, even -n 20,
doesn't seem to make any noticible difference in terms of making un-
nice'd processes more responsive (not sure there's any relationship in
the underlying causes of that, though).
niceness is quite broken in both SCHED_4BSD and SCHED_ULE. It was
partly fixed in SCHED_4BSD in ~1999, but re-broken soon after (it
is difficult to map a large dynamic range of CPU usage counts (estcpu)
into the user priority range). The niceness sub-range wants to be more
than it was (81), but even 81 didn't fit and caused bugs. Fixes
reduced it to 41, where it barely does anything. ULE intentionally
copied some bugs from this (like ensuring that nice -20 processes
never run in competition with nice --20 processes).
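
(For reference, the mapping being described is roughly the resetpriority()
calculation for timesharing threads, paraphrased from memory; the 81-vs-41
numbers correspond to NICE_WEIGHT being 2 vs 1 across the 40-step nice
range:)

    newpriority = PUSER +
        estcpu / INVERSE_ESTCPU_WEIGHT +        /* decayed CPU usage */
        NICE_WEIGHT * (p_nice - PRIO_MIN);      /* nice contribution */
    newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
        PRI_MAX_TIMESHARE);
    /* With NICE_WEIGHT == 2 the nice term spans 81 priority slots, which
     * did not fit; with NICE_WEIGHT == 1 it spans only 41, so nice barely
     * changes anything. */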

This is fixed in SCHED_4BSD in some of my versions, but I only use
nice to test the fix. The implementation uses virtual ticks, with
nice'd processes being charged more. This is almost perfectly fair,
with the relative CPU allocation following a table, but Linux in 2004
somehow does better (except for no way to change the policy) using
an apparently much simpler algorithm.

Bruce
Don Lewis
2017-08-26 18:29:29 UTC
Post by Rodney W. Grimes
Post by Bruce Evans
Post by Don Lewis
...
Something else that I did not expect is the how frequently threads are
stolen from the other SMT thread on the same core, even though I
increased steal_thresh from 2 to 3 to account for the off-by-one
problem. This is true even right after the system has booted and no
significant load has been applied. My best guess is that because of
affinity, both the parent and child processes run on the same CPU after
fork(), and if a number of processes are forked() in quick succession,
the run queue of that CPU can get really long. Forcing a thread
migration in exec() might be a good solution.
Since you are trying a lot of combinations, maybe you can tell us which
ones work best. SCHED_4BSD works better for me on an old 2-core system.
SCHED_ULE works better on a not-so old 4x2 core (Haswell) system, but I
don't like it due to its complexity. It makes differences of at most
+-2% except when mistuned it can give -5% for real time (but better for
CPU and presumably power).
For SCHED_4BSD, I wrote fancy tuning for fork/exec and sometimes get
everything to like up for a 3% improvement (803 seconds instead of 823
on the old system, with -current much slower at 840+ and old versions
of ULE before steal_idle taking 890+). This is very resource (mainly
cache associativity?) dependent and my tuning makes little difference
on the newer system. SCHED_ULE still has bugfeatures which tend to
help large builds by reducing context switching, e.g., by bogusly
clamping all CPU-bound threads to nearly maximal priority.
That last bugfeature is probably what makes current systems
interactive performance tank rather badly when under heavy
loads. Would it be hard to fix?
I actually haven't noticed that problem on my package build boxes. I've
experienced decent interactive performance even when the load average is
in the 60 to 80 range. I also have poudriere configured to use tmpfs
and the only issue I run into is when it starts getting heavily into
swap (like 20G) and I leave my session idle for a while, which lets my
shell and sshd get swapped out. Then it takes them a while to wake up
again. Once they are paged in, then things feel snappy again. This is
remote access, so I can't comment on what X11 feels like.
Konstantin Belousov
2017-08-26 18:46:50 UTC
Post by Don Lewis
I actually haven't noticed that problem on my package build boxes. I've
experienced decent interactive performance even when the load average is
in the 60 to 80 range. I also have poudriere configured to use tmpfs
and the only issue I run into is when it starts getting heavily into
swap (like 20G) and I leave my session idle for a while, which lets my
shell and sshd get swapped out. Then it takes them a while to wake up
again. Once they are paged in, then things feel snappy again. This is
remote access, so I can't comment on what X11 feels like.
I believe what people complain about is the following scenario:
they have some interactive, long-lived process, say firefox or mplayer.
The process' threads consume CPU cycles, so the ULE interactivity
detection logic actually classifies the threads as non-interactive.

This is not much of a problem until a parallel build starts, where the
toolchain processes are typically short-lived. This gets them
classified as interactive, and their dynamic priorities are lower than
the priority of the long-lived threads which are interactive by user
perception.

I have not analyzed the KTR dumps, but this explanation more or less
coincides with the system sluggishness when attempting to use mplayer
while a heavily oversubscribed build (e.g. make -j 10 on a 4-core x
2-SMT machine) is running.
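
The classification comes from the sleep-versus-run ratio; roughly,
paraphrasing sched_interact_score() in sched_ule.c, a thread is treated as
interactive when its score is below the sched_interact threshold (30 by
default):

    if (ts->ts_slptime > ts->ts_runtime) {
        /* Mostly sleeping: score lands in 0..SCHED_INTERACT_HALF. */
        div = max(1, ts->ts_slptime / SCHED_INTERACT_HALF);
        score = ts->ts_runtime / div;
    } else if (ts->ts_runtime > ts->ts_slptime) {
        /* Mostly running: score lands in SCHED_INTERACT_HALF..MAX. */
        div = max(1, ts->ts_runtime / SCHED_INTERACT_HALF);
        score = SCHED_INTERACT_HALF +
            (SCHED_INTERACT_HALF - ts->ts_slptime / div);
    } else
        score = SCHED_INTERACT_HALF;
    /* Short-lived toolchain processes that spent most of their existence
     * waiting on fork/exec/I/O score low (interactive); a long-running
     * firefox or mplayer that has accumulated runtime scores high. */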
Don Lewis
2017-08-26 19:47:40 UTC
Post by Konstantin Belousov
Post by Don Lewis
I actually haven't noticed that problem on my package build boxes. I've
experienced decent interactive performance even when the load average is
in the 60 to 80 range. I also have poudriere configured to use tmpfs
and the only issue I run into is when it starts getting heavily into
swap (like 20G) and I leave my session idle for a while, which lets my
shell and sshd get swapped out. Then it takes them a while to wake up
again. Once they are paged in, then things feel snappy again. This is
remote access, so I can't comment on what X11 feels like.
they have some interactive long living process, say firefox or mplayer.
The process' threads consume CPU cycles, so the ULE interactivity
detection logic actually classifies the threads as non-interactive.
This is not much problematic until a parallel build starts where
toolchain processes are typically short-lived. This makes them
classified as interactive, and their dynamic priority are lower than the
priority of long-lived threads which are interactive by user perception.
I did not analyzed the KTR dumps but this explanation more or less
coincides with the system slugginess when attempt to use mplayer while
heavily oversubscribed build (e.g. make -j 10 on 4 cores x 2 SMT
machine) is started.
I can believe that. I keep an excessive number of tabs open in firefox
and it would frequently get into a state where it would consume 100% of a
CPU core. Very recent versions of firefox are a lot better.

Xorg is another possible victim. I've just noticed that when certain
windows have mouse focus (firefox being one, wish-based apps are
another), the Xorg %CPU goes to 80%-90%. I think this crept in with
the latest MATE upgrade. If Xorg is treated as non-interactive, then
the desktop experience is going to be less than optimal if there is
competing load.
Don Lewis
2017-08-26 19:58:37 UTC
Post by Don Lewis
Post by Konstantin Belousov
Post by Don Lewis
I actually haven't noticed that problem on my package build boxes. I've
experienced decent interactive performance even when the load average is
in the 60 to 80 range. I also have poudriere configured to use tmpfs
and the only issue I run into is when it starts getting heavily into
swap (like 20G) and I leave my session idle for a while, which lets my
shell and sshd get swapped out. Then it takes them a while to wake up
again. Once they are paged in, then things feel snappy again. This is
remote access, so I can't comment on what X11 feels like.
they have some interactive long living process, say firefox or mplayer.
The process' threads consume CPU cycles, so the ULE interactivity
detection logic actually classifies the threads as non-interactive.
This is not much problematic until a parallel build starts where
toolchain processes are typically short-lived. This makes them
classified as interactive, and their dynamic priority are lower than the
priority of long-lived threads which are interactive by user perception.
I did not analyzed the KTR dumps but this explanation more or less
coincides with the system slugginess when attempt to use mplayer while
heavily oversubscribed build (e.g. make -j 10 on 4 cores x 2 SMT
machine) is started.
I can believe that. I keep an excessive number of tabs open in firefox
and it would frequenty get into a state where it would consume 100% of a
CPU core. Very recent versions of firefox are a lot better.
Xorg is another possible victim. I've just noticed that when certain
windows have mouse focus (firefox being one, wish-based apps are
another) that the Xorg %CPU goes to 80%-90%. I think this crept in with
the lastest MATE upgrade. If Xorg is treated as non-interactive, then
the desktop experience is going to be less than optimal if there is
competing load.
I've got poudriere running right now on my primary package build box.
The priorities of the compiler processes are currently in the range of
74-96.

On my desktop, firefox is running at priority 24. Xorg when it is not
being a CPU hog gets all the way down to priority 20. When the mouse is
pointing to one of the windows that makes it go nuts, then it gets all
the way up to priority 98.
Jan Bramkamp
2017-08-30 12:24:30 UTC
Post by Don Lewis
Post by Don Lewis
Post by Konstantin Belousov
Post by Don Lewis
I actually haven't noticed that problem on my package build boxes. I've
experienced decent interactive performance even when the load average is
in the 60 to 80 range. I also have poudriere configured to use tmpfs
and the only issue I run into is when it starts getting heavily into
swap (like 20G) and I leave my session idle for a while, which lets my
shell and sshd get swapped out. Then it takes them a while to wake up
again. Once they are paged in, then things feel snappy again. This is
remote access, so I can't comment on what X11 feels like.
they have some interactive long living process, say firefox or mplayer.
The process' threads consume CPU cycles, so the ULE interactivity
detection logic actually classifies the threads as non-interactive.
This is not much problematic until a parallel build starts where
toolchain processes are typically short-lived. This makes them
classified as interactive, and their dynamic priority are lower than the
priority of long-lived threads which are interactive by user perception.
I did not analyzed the KTR dumps but this explanation more or less
coincides with the system slugginess when attempt to use mplayer while
heavily oversubscribed build (e.g. make -j 10 on 4 cores x 2 SMT
machine) is started.
I can believe that. I keep an excessive number of tabs open in firefox
and it would frequenty get into a state where it would consume 100% of a
CPU core. Very recent versions of firefox are a lot better.
Xorg is another possible victim. I've just noticed that when certain
windows have mouse focus (firefox being one, wish-based apps are
another) that the Xorg %CPU goes to 80%-90%. I think this crept in with
the lastest MATE upgrade. If Xorg is treated as non-interactive, then
the desktop experience is going to be less than optimal if there is
competing load.
I've got poudriere running right now on my primary package build box.
The priorties of the compiler processes are currently in the range of
74-96.
On my desktop, firefox is running at priority 24. Xorg when it is not
being a CPU hog gets all the way down to priority 20. When the mouse is
pointing to one of the windows that makes it go nuts, then it gets all
the way up to priority 98.
On my old desktop (AMD Phenom II X6 1060T), which doubled as a
poudriere compile server, I wrapped Xorg and moused with cpuset and
rtprio. It solved my problem, but it felt wrong.
