freebsd-head: suddenly NMI panics lead to ddb being unable to stop CPUs?
Adrian Chadd
2015-08-20 22:15:08 UTC
Hi!

This has started happening on -HEAD recently. No, I don't have any
more details yet than "recently."

Whenever I get an NMI panic (and getting an NMI is a separate issue,
sigh) I get a slew of "failed to stop cpu" messages, and all CPUs
enter ddb. This is .. sub-optimal. Has anyone seen this? Does anyone
have any ideas?


-adrian
Ryan Stone
2015-08-21 14:23:36 UTC
I have seen similar behaviour before. The problem is that every CPU
receives an NMI concurrently. As I recall, one of them gets some kind of
pseudo-spinlock and tries to stop the other CPUs with an NMI. However,
because they are already in an NMI handler, they don't get the second NMI
and don't stop properly.
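
A toy user-space model of that sequence may make the failure easier to see. It is only a sketch, not kernel code: `struct cpu`, `send_stop_nmi()`, `park()` and `simulate()` are hypothetical names, and the model assumes (as a simplification) that a CPU already inside an NMI handler never runs the handler for a second NMI. The `losers_poll` flag illustrates one possible remedy, where CPUs that lose the race check a pending-stop mask while they spin; it is not necessarily what any actual patch does.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4

struct cpu {
    bool in_nmi;   /* inside an NMI handler; a further NMI is not delivered */
    bool stopped;  /* has parked itself in the stop handler */
};

static struct cpu cpus[NCPU];
static unsigned stop_pending;  /* bit i set: CPU i has been asked to stop */

/* Park a CPU and clear its pending-stop bit. */
static void park(int id)
{
    cpus[id].stopped = true;
    stop_pending &= ~(1u << id);
}

/* Ask a CPU to stop via NMI. The stop handler only runs if the
 * target is not already inside an NMI handler (model assumption). */
static void send_stop_nmi(int id)
{
    stop_pending |= 1u << id;
    if (!cpus[id].in_nmi)
        park(id);
}

/* Returns how many CPUs the lock winner failed to stop. */
static int simulate(bool losers_poll)
{
    int owner = -1, failed = 0;

    stop_pending = 0;

    /* Every CPU takes the original NMI at once; the first to reach
     * the pseudo-spinlock wins it. */
    for (int i = 0; i < NCPU; i++) {
        cpus[i].in_nmi = true;
        cpus[i].stopped = false;
        if (owner == -1)
            owner = i;
    }

    /* The winner tries to stop everyone else with a second NMI. */
    for (int i = 0; i < NCPU; i++)
        if (i != owner)
            send_stop_nmi(i);

    /* The losers spin inside their NMI handlers. With losers_poll
     * set, they notice the pending stop request and park themselves. */
    for (int i = 0; i < NCPU; i++)
        if (i != owner && losers_poll && (stop_pending & (1u << i)))
            park(i);

    for (int i = 0; i < NCPU; i++)
        if (i != owner && !cpus[i].stopped)
            failed++;

    assert(failed >= 0 && failed < NCPU);
    return failed;
}
```

In this model `simulate(false)` leaves every CPU other than the winner unstopped, which corresponds to the slew of "failed to stop cpu" messages, while `simulate(true)` stops all of them.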

The case that I saw actually had to do with a panic triggered by an NMI,
not entering the debugger, but I believe that both cases use
stop_cpus_hard() under the hood and have a similar issue.

(I also recall seeing the exact situation that you describe while
originally developing SR-IOV on an alpha version of the Fortville hardware
and firmware with a very buggy SR-IOV implementation. I've never seen it
on ixgbe before, although I haven't used SR-IOV there very much at all.)
Eric van Gyzen
2015-08-21 15:19:47 UTC
I mentioned this to Adrian, but I'll mention it here for everyone else's benefit.

Ryan is exactly right. There was a thread a while ago, with a proposed patch from Kostik:

https://lists.freebsd.org/pipermail/freebsd-arch/2014-July/015584.html

As I recall, Scott Long also ran into this a few months ago.

It happens for any NMI: entering the debugger, a PCI Parity or System Error, a hardware watchdog timeout, and probably other sources I'm not remembering.

Eric
Adrian Chadd
2015-08-21 15:25:27 UTC
Ah, cool. I'll give it a whirl.

I'm a little worried about having all of the other cores spinning in
this case (mostly thermal; the machines get VERY LOUD when the CPUs
are spinning..)


-a
Eric van Gyzen
2015-08-21 15:41:31 UTC
Spinning is probably the only safe option in NMI context, since the NMI could have arrived at literally any time in any context (e.g. holding a spin lock, interrupts disabled). :-/
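
For illustration, a minimal user-space sketch of such a spin-wait: it takes no locks and never sleeps, so it does not depend on whatever state was interrupted. The names `spin_hint()` and `spin_until_released()` are hypothetical, and the `max_spins` bound exists only so the example terminates. On x86 the hint is the PAUSE instruction (in the FreeBSD kernel the analogous helper is `cpu_spinwait()`), which may ease the power draw of a tight loop somewhat; whether that helps with the fan noise is untested.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin-wait hint; compiles to nothing where no hint is known. */
static inline void spin_hint(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_ia32_pause();
#endif
}

/* Spin until *released becomes true or max_spins iterations pass.
 * Returns the number of iterations spent waiting. */
static unsigned spin_until_released(atomic_bool *released, unsigned max_spins)
{
    unsigned spins = 0;

    while (!atomic_load_explicit(released, memory_order_acquire) &&
        spins < max_spins) {
        spin_hint();
        spins++;
    }
    return spins;
}
```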

Eric
Scott Long via freebsd-arch
2015-08-21 15:31:13 UTC
I might have a fix for this. I’ll check the Netflix repo and see if it’s something that is ready to go upstream to FreeBSD.

Scott