Discussion:
Starting APs earlier during boot
John Baldwin
2016-02-16 20:50:22 UTC
Permalink
Currently the kernel bootstraps the non-boot processors fairly early in the
SI_SUB_CPU SYSINIT. The APs then spin waiting to be "released". We currently
release the APs as one of the last steps at SI_SUB_SMP. On the one hand this
removes much of the need for synchronization while SYSINITs are running since
SYSINITs basically assume they are single-threaded. However, it also enforces
some odd quirks. Several places that deal with per-CPU resources have to
split initialization up so that the BSP init happens in one SYSINIT and the
initialization of the APs happens in a second SYSINIT at SI_SUB_SMP.

Another issue that is becoming more prominent on x86 (and probably will also
affect other platforms if it isn't already) is that to support working
interrupts for interrupt config hooks we bind all interrupts to the BSP during
boot and only distribute them among other CPUs near the end at SI_SUB_SMP.
This is especially problematic with drivers for modern hardware allocating
num(CPUs) interrupts (hoping to use one per CPU). On x86 we have about 190
IDT vectors available for device interrupts, so in theory we should be able to
tolerate a lot of drivers doing this (e.g. 60 drivers could allocate 3
interrupts for every CPU and we should still be fine). However, if you have,
say, 32 cores in a system, then you can only handle about 5 drivers doing
this before you run out of vectors on CPU 0.

Longer term we would also like to eventually have most drivers attach in the
same environment during boot as during post-boot. Right now post-boot is
quite different as all CPUs are running, interrupts work, etc. One of the
goals of multipass support for new-bus is to help us get there by probing
enough hardware to get timers working and starting the scheduler before
probing the rest of the devices. That goal isn't quite realized yet.

However, we can run a slightly simpler version of our scheduler before
timers are working. In fact, sleep/wakeup work just fine fairly early (we
allocate the necessary structures at SI_SUB_KMEM which is before the APs
are even started). Once idle threads are created and ready we could in
theory let the APs startup and run other threads. You just don't have working
timeouts. OTOH, you can sort of simulate timeouts if you modify the scheduler
to yield the CPU instead of blocking the thread for a sleep with a timeout.
The effect would be for threads that do sleeps with a timeout to fall back to
polling before timers are working. In practice, all of the early kernel
threads use sleeps without timeouts when idle so this doesn't really matter.

I've implemented these changes and tested them for x86. For x86 at least
AP startup needed some bits of the interrupt infrastructure in place, so
I moved SI_SUB_SMP up to after SI_SUB_INTR but before SI_SUB_SOFTINTR. I
modified the *sleep() and cv_*wait*() routines to not always bail if cold
is true. Instead, sleeps without a timeout are permitted to sleep
"normally". Sleeps with a timeout drop their interlock and yield the
CPU (but remain runnable). Since APs are now fully running this means
interrupts are now routed to all CPUs from the get go removing the need for
the post-boot shuffle. This also resolves the issue of running out of IDT
vectors on the boot CPU.

I believe that adapting other platforms to this change should be relatively
simple, but we should do that before committing the full patch. I do think
that some parts of the patch (such as the changes to the sleep routines, and
using SI_SUB_LAST instead of SI_SUB_SMP as a catch-all SYSINIT) can be
committed now without breaking anything.

However, I'd like feedback on the general idea and if it is acceptable I'd
like to coordinate testing with other platforms so this can go into the
tree.

The current changes are in the 'ap_startup' branch at github/bsdjhb/freebsd.
You can view them here:

https://github.com/bsdjhb/freebsd/compare/master...bsdjhb:ap_startup
--
John Baldwin
Julian Elischer
2016-02-17 05:33:15 UTC
Permalink
Post by John Baldwin
Currently the kernel bootstraps the non-boot processors fairly early in the
SI_SUB_CPU SYSINIT. The APs then spin waiting to be "released". We currently
release the APs as one of the last steps at SI_SUB_SMP. On the one hand this
removes much of the need for synchronization while SYSINITs are running since
SYSINITs basically assume they are single-threaded. However, it also enforces
some odd quirks. Several places that deal with per-CPU resources have to
split initialization up so that the BSP init happens in one SYSINIT and the
initialization of the APs happens in a second SYSINIT at SI_SUB_SMP.
Another issue that is becoming more prominent on x86 (and probably will also
[...]

what is the goal? cleaner code? faster boot?
Warner Losh
2016-02-17 06:23:17 UTC
Permalink
Post by Julian Elischer
Post by John Baldwin
Currently the kernel bootstraps the non-boot processors fairly early in the
SI_SUB_CPU SYSINIT. The APs then spin waiting to be "released". We currently
release the APs as one of the last steps at SI_SUB_SMP. On the one hand this
removes much of the need for synchronization while SYSINITs are running since
SYSINITs basically assume they are single-threaded. However, it also enforces
some odd quirks. Several places that deal with per-CPU resources have to
split initialization up so that the BSP init happens in one SYSINIT and the
initialization of the APs happens in a second SYSINIT at SI_SUB_SMP.
Another issue that is becoming more prominent on x86 (and probably will also
[...]
what is the goal? cleaner code? faster boot?
Two goals were in his original email.

(1) Start APs earlier so we can avoid issues with interrupt allocation
(we're currently hitting limits of 160 interrupts when only one CPU is active).
(2) Make allocations more regular between drivers attached at startup and
drivers loaded later. Right now some drivers defer a lot of work so that
they can allocate things at a time when all the resources are available.
This helps make that code more regular and actually the same in both cases.

It has little to do with a faster boot, though it might enable parallel
newbus tree enumeration if that ever gets properly locked.

Warner
Poul-Henning Kamp
2016-02-17 09:46:51 UTC
Permalink
--------
Post by Warner Losh
Post by Julian Elischer
what is the goal? cleaner code? faster boot?
Two goals were in his original email.
And I hope that in the longer term we also aim to configure I/O
in parallel ?
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Warner Losh
2016-02-17 16:14:34 UTC
Permalink
Post by Poul-Henning Kamp
--------
Post by Warner Losh
Post by Julian Elischer
what is the goal? cleaner code? faster boot?
Two goals were in his original email.
And I hope that in the longer term we also aim to configure I/O
in parallel ?
What do you mean by 'configure I/O in parallel?'

Warner
Poul-Henning Kamp
2016-02-17 19:15:56 UTC
Permalink
--------
Post by Warner Losh
Post by Poul-Henning Kamp
And I hope that in the longer term we also aim to configure I/O
in parallel ?
What do you mean by 'configure I/O in parallel?'
probe/attach device drivers in parallel to speed up boot.
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
John Baldwin
2016-02-17 18:19:43 UTC
Permalink
Post by Poul-Henning Kamp
--------
Post by Warner Losh
Post by Julian Elischer
what is the goal? cleaner code? faster boot?
Two goals were in his original email.
And I hope that in the longer term we also aim to configure I/O
in parallel ?
I'm a bit leery of doing this fully parallel. In particular, users currently
depend on the behavior of deterministic names in new-bus (so em0 is always em0
and not sometimes em1). OTOH, I think that we could eventually allow drivers
to start doing some of the background scans sooner and only harvest the
results at the interrupt config hooks instead of starting the scans and
timers at the interrupt config hook (and this is a step towards that). From
what I understand, most of our boot time start up delay isn't the new-bus
device probe but userland startup. Nevertheless, I think the changes I've
proposed here are a prerequisite for even thinking about possibly making
device probe more parallel.
--
John Baldwin
Konstantin Belousov
2016-02-17 09:42:41 UTC
Permalink
Post by John Baldwin
[...]
However, we can run a slightly simpler version of our scheduler before
timers are working. In fact, sleep/wakeup work just fine fairly early (we
allocate the necessary structures at SI_SUB_KMEM which is before the APs
are even started). Once idle threads are created and ready we could in
theory let the APs startup and run other threads. You just don't have working
timeouts. OTOH, you can sort of simulate timeouts if you modify the scheduler
to yield the CPU instead of blocking the thread for a sleep with a timeout.
The effect would be for threads that do sleeps with a timeout to fall back to
polling before timers are working. In practice, all of the early kernel
threads use sleeps without timeouts when idle so this doesn't really matter.
I understand that timeouts can be somewhat simulated this way.

But I do not quite understand how generic scheduling can work without
(timer) interrupts. Suppose that we have two threads 1 and 2 of the same
priority, both runnable, and due to some event thread 2 preempted thread
1. If thread 2 just runs without calling the preempt functions like
msleep, what would guarantee that thread 1 eventually gets its CPU slice?

E.g. there might be no interrupts set up yet, and if the idle thread gets
on CPU on a UP machine, the whole boot process could deadlock.
Post by John Baldwin
[...]
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-arch
Poul-Henning Kamp
2016-02-17 09:45:43 UTC
Permalink
--------
Post by Konstantin Belousov
E.g. there might be no interrupts set up yet, and if the idle thread gets
on CPU on a UP machine, the whole boot process could deadlock.
idle_thread:

	while (!interrupts_setup_done())
		yield();
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
John Baldwin
2016-02-17 17:00:26 UTC
Permalink
Post by Konstantin Belousov
Post by John Baldwin
[...]
I understand that timeouts can be somewhat simulated this way.
But I do not quite understand how generic scheduling can work without
(timer) interrupts. Suppose that we have two threads 1 and 2 of the same
priority, both runnable, and due to some event thread 2 preempted thread
1. If thread 2 just runs without calling the preempt functions like
msleep, what would guarantee that thread 1 eventually gets its CPU slice?
Nothing, but the only thread we have that does that during early startup is
thread0 (which is what should be running as it is the one that makes
progress toward getting timers set up). Currently the sleep calls just always
return, which means if any thread gets on the CPU it never yields and is
stuck forever. My changes make it so that only a CPU-bound thread is stuck
forever, and we only have one such thread before timers are running: thread0.
Post by Konstantin Belousov
E.g. there might be no interrupts set up yet, and if the idle thread gets
on CPU on a UP machine, the whole boot process could deadlock.
The idle threads are special as they yield explicitly if there are any
runnable threads on the run queues.
--
John Baldwin
John Baldwin
2016-03-18 19:02:30 UTC
Permalink
Post by John Baldwin
[...]
However, we can run a slightly simpler version of our scheduler before
timers are working. In fact, sleep/wakeup work just fine fairly early (we
allocate the necessary structures at SI_SUB_KMEM which is before the APs
are even started). Once idle threads are created and ready we could in
theory let the APs startup and run other threads. You just don't have working
timeouts. OTOH, you can sort of simulate timeouts if you modify the scheduler
to yield the CPU instead of blocking the thread for a sleep with a timeout.
The effect would be for threads that do sleeps with a timeout to fall back to
polling before timers are working. In practice, all of the early kernel
threads use sleeps without timeouts when idle so this doesn't really matter.
After some more testing, I've simplified the early scheduler a bit. It no
longer tries to simulate timeouts by just keeping the thread runnable. Instead,
a sleep with a timeout just panics. However, it does still permit infinite
sleeps (those without a timeout). Some code that uses a timeout really wants
a timeout (note that pause() has a hack to fall back to DELAY() internally if
cold is true for this reason). My feeling is that any kthreads that need
working timeouts should defer their startup until SI_SUB_KICK_SCHEDULER.
Post by John Baldwin
However, I'd like feedback on the general idea and if it is acceptable I'd
like to coordinate testing with other platforms so this can go into the
tree.
I don't think I've seen any objections? This does need more testing. I will
update the patch to add a new EARLY_AP_STARTUP kernel option so this can be
committed (but not yet enabled) allowing for easier testing (and allowing
other platforms to catch up to x86).
Post by John Baldwin
The current changes are in the 'ap_startup' branch at github/bsdjhb/freebsd.
https://github.com/bsdjhb/freebsd/compare/master...bsdjhb:ap_startup
--
John Baldwin
K. Macy
2016-03-18 19:37:24 UTC
Permalink
So none of these changes have been committed yet?

I'm hitting hangs in USB on boot with recent HEAD and, without having
investigated, had thought this might be what exposed the problem.

Thanks.

-M
Post by John Baldwin
[...]
K. Macy
2016-03-19 02:02:42 UTC
Permalink
Post by K. Macy
So none of these changes have been committed yet?
I'm hitting hangs in USB on boot with recent HEAD and, without having
investigated, had thought this might be what exposed the problem.
Never mind. It's yet another ZFS namespace deadlock.

-M
Post by K. Macy
[...]
John Baldwin
2016-03-21 22:34:40 UTC
Permalink
Post by K. Macy
So none of these changes have been committed yet?
I'm hitting hangs in USB on boot with recent HEAD and, without having
investigated, had thought this might be what exposed the problem.
Thanks.
I've committed some cosmetic ones (e.g., moving some SYSINITs from SI_SUB_SMP
to SI_SUB_LAST), but nothing that should change actual behavior yet.
Post by K. Macy
[...]
--
John Baldwin
John Baldwin
2016-04-22 19:33:49 UTC
Permalink
Post by John Baldwin
Currently the kernel bootstraps the non-boot processors fairly early in the
SI_SUB_CPU SYSINIT. The APs then spin waiting to be "released". We currently
release the APs as one of the last steps at SI_SUB_SMP. On the one hand this
removes much of the need for synchronization while SYSINITs are running since
SYSINITs basically assume they are single-threaded. However, it also enforces
some odd quirks. Several places that deal with per-CPU resources have to
split initialization up so that the BSP init happens in one SYSINIT and the
initialization of the APs happens in a second SYSINIT at SI_SUB_SMP.
I've posted a review of the final set of changes here:

https://reviews.freebsd.org/D6069

To permit a smoother transition, the earlier startup is temporarily controlled
by a kernel option (EARLY_AP_STARTUP) so that it can be easily disabled if there
are regressions and to allow individual platforms to have time to port over.
My plan is to enable this by default on x86 once it is in the tree. I would
like to have all platforms cut over for 11.0.
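For anyone who wants to test, the option named above would be enabled with the usual kernel-config syntax (the option name is from this mail; the example file path is just illustrative):

```
# in a kernel configuration file, e.g. sys/amd64/conf/TESTKERNEL
options 	EARLY_AP_STARTUP
```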
--
John Baldwin