freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?

Discussion:

Adrian Chadd

2015-08-20 22:15:08 UTC

Hi!

This has started happening on -HEAD recently. No, I don't have any
more details yet than "recently."

Whenever I get an NMI panic (and getting an NMI is a separate issue,
sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
enter ddb. This is ... sub-optimal. Has anyone seen this? Does anyone
have any ideas?

-adrian

Ryan Stone

2015-08-21 14:23:36 UTC

I have seen similar behaviour before. The problem is that every CPU
receives an NMI concurrently. As I recall, one of them gets some kind of
pseudo-spinlock and tries to stop the other CPUs with an NMI. However,
because they are already in an NMI handler, they don't get the second NMI
and don't stop properly.

The case that I saw actually had to do with a panic triggered by an NMI,
not entering the debugger, but I believe that both cases use
stop_cpus_hard() under the hood and have a similar issue.

(I also recall seeing the exact situation that you describe while
originally developing SR-IOV on an alpha version of the Fortville hardware
and firmware with a very buggy SR-IOV implementation. I've never seen it
on ixgbe before, although I haven't used SR-IOV there very much at all.)
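
The failure described above can be modeled with a small, self-contained C sketch. This is only a toy model, not FreeBSD code: `struct model_cpu`, `model_send_stop_nmi()`, `model_stop_others()` and `model_run()` are hypothetical names invented for illustration, and the only behavior assumed is the one stated in this post, namely that a CPU already inside an NMI handler does not take a second NMI.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAXCPU 64

/* Hypothetical per-CPU state for this model; not a FreeBSD structure. */
struct model_cpu {
	bool in_nmi;	/* CPU is inside an NMI handler; new NMIs are not taken */
	bool stopped;	/* CPU reached the stop handler and parked itself */
};

/*
 * Model delivery of a "stop" NMI to one CPU.  A CPU that is already
 * running an NMI handler does not take the new NMI, so it never
 * reaches the stop handler.
 */
static bool
model_send_stop_nmi(struct model_cpu *cpu)
{
	if (cpu->in_nmi)
		return (false);
	cpu->in_nmi = true;
	cpu->stopped = true;
	return (true);
}

/*
 * Model the CPU that won the race trying to stop every other CPU.
 * Returns the number of CPUs that failed to stop.
 */
static int
model_stop_others(struct model_cpu *cpus, int ncpu, int self)
{
	int failed = 0;

	assert(ncpu > 0 && ncpu <= MODEL_MAXCPU);
	assert(self >= 0 && self < ncpu);
	for (int i = 0; i < ncpu; i++) {
		if (i == self)
			continue;
		if (!model_send_stop_nmi(&cpus[i]))
			failed++;	/* analogous to one "failed to stop cpu" message */
	}
	return (failed);
}

/*
 * Build a scenario with CPU 0 as the winner and report how many CPUs
 * fail to stop.  If nmi_hit_all_cpus is true, every CPU is already in
 * its NMI handler when the stop request is sent.
 */
static int
model_run(int ncpu, bool nmi_hit_all_cpus)
{
	struct model_cpu cpus[MODEL_MAXCPU] = {{ false, false }};

	assert(ncpu > 0 && ncpu <= MODEL_MAXCPU);
	for (int i = 0; i < ncpu; i++)
		cpus[i].in_nmi = (i == 0) || nmi_hit_all_cpus;
	return (model_stop_others(cpus, ncpu, 0));
}
```

In this model, `model_run(4, true)` returns 3 (every other CPU fails to stop), while `model_run(4, false)` returns 0, which matches the difference between an NMI that hits all CPUs at once and one that hits a single CPU.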

_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-arch

Eric van Gyzen

2015-08-21 15:19:47 UTC

I mentioned this to Adrian, but I'll mention it here for everyone else's benefit.

Ryan is exactly right. There was a thread a while ago, with a proposed patch from Kostik:

https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html

As I recall, Scott Long also ran into this a few months ago.

It happens for any NMI: entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.

Eric

Adrian Chadd

2015-08-21 15:25:27 UTC

Ah, cool. I'll give it a whirl.

I'm a little worried about having all of the other cores spinning in
this case (mostly thermal; the machines get VERY LOUD when the CPUs
are spinning.)

-a

Eric van Gyzen

2015-08-21 15:41:31 UTC

Spinning is probably the only safe option in NMI context, since the NMI could have arrived at literally any time in any context (e.g. holding a spin lock, interrupts disabled). :-/
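
As a rough illustration of what "spinning" could look like, here is a minimal C11 sketch in which every CPU that takes the NMI races for a gate, the winner goes on to the debugger or panic path, and the losers park themselves in a busy-wait loop that takes no locks and never sleeps. This is not Kostik's patch or any FreeBSD code: `struct nmi_gate`, the `gate_*()` functions and `model_cpu_pause()` are hypothetical names, the 32-CPU limit is an arbitrary choice for the bitmask, and the `pause` hint is only assumed to be available on x86 with a GCC-compatible compiler.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define GATE_NOCPU	(-1)

/* Spin-wait hint; assumed x86 + GCC/Clang inline asm, otherwise a no-op. */
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#define model_cpu_pause()	__asm__ __volatile__("pause")
#else
#define model_cpu_pause()	((void)0)
#endif

/* Hypothetical gate shared by all CPUs; not a FreeBSD structure. */
struct nmi_gate {
	atomic_int	owner;		/* CPU that won the race, or GATE_NOCPU */
	atomic_uint	stopped_mask;	/* one bit per CPU parked in its handler */
};

static void
gate_init(struct nmi_gate *g)
{
	atomic_init(&g->owner, GATE_NOCPU);
	atomic_init(&g->stopped_mask, 0);
}

/*
 * Called by each CPU from its own NMI handler.  Exactly one caller
 * wins and returns true; every other caller records itself as stopped
 * and returns false, without needing a second NMI to be delivered.
 */
static bool
gate_enter(struct nmi_gate *g, int cpu)
{
	int expected = GATE_NOCPU;

	assert(cpu >= 0 && cpu < 32);
	if (atomic_compare_exchange_strong(&g->owner, &expected, cpu))
		return (true);
	atomic_fetch_or(&g->stopped_mask, 1u << cpu);
	return (false);
}

/* Losers wait here: a pure busy-wait with no locks and no sleeping. */
static void
gate_spin_until_released(struct nmi_gate *g)
{
	while (atomic_load(&g->owner) != GATE_NOCPU)
		model_cpu_pause();
}

/* The winner lets the parked CPUs go once it is done. */
static void
gate_release(struct nmi_gate *g, int cpu)
{
	assert(atomic_load(&g->owner) == cpu);
	atomic_store(&g->stopped_mask, 0);
	atomic_store(&g->owner, GATE_NOCPU);
}
```

In this sketch a loser would call `gate_enter()` and then `gate_spin_until_released()`, while the winner would do its work and call `gate_release()`. The pause hint only makes the loop somewhat cheaper; it is not offered as an answer to the thermal concern.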

Eric

Scott Long via freebsd-arch

2015-08-21 15:31:13 UTC

I might have a fix for this. I’ll check the Netflix repo and see if it’s something that is ready to go upstream to FreeBSD.

Scott