x86: finding interrupts that aren't being accounted for?
Adrian Chadd
2015-04-06 07:21:29 UTC
Hi,

I have an .. odd problem on a Lenovo X230.

I just put a very old wifi card (Intel 3945) into the ExpressCard
(PCIe) slot. Now, we don't have any pcie-hp support in -HEAD just yet,
but I wasn't expecting the system to crawl to a halt.

When I unplug it, everything returns to normal.

Other cards don't do this.

So, I figured it may be interrupt spam - but vmstat -ia shows no
interrupts going crazy.

pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 5 doesn't register anything
either - only a handful of background samples.

However, /counter/ mode pmc tells a different story - pmcstat -s
CPU_CLK_UNHALTED_CORE -w 1 shows all four cores going at 110% when the
card is inserted, with brief periods of idle. Once I remove the card,
the counters go back down to zero.

My working theory is: something is chewing CPU and it's likely
interrupts, but if it is, it's something far, far earlier than the x86
interrupt C code, which counts interrupts and spurious events.

So - has anyone diagnosed this stuff on FreeBSD/x86 before? I was kind
of hoping we'd at least get accurate statistics about spurious
interrupts, and if we don't, I'd like to understand why.

Thanks!


-adrian
John Baldwin
2015-04-06 19:18:23 UTC
SMM? Perhaps SMM doesn't hide itself from PMC counters (but it can hide itself
from samples).

If it is SMM there's not really anything you can do about it. Try getting a
KTR_SCHED trace and looking at it in schedgraph. When I've seen SMM issues in
the past, they show up as a hole in the graph where nothing happens in the system.
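
The "hole in the graph" check can also be approximated numerically. The sketch below is only an illustration: it assumes the event timestamps have already been extracted from the trace into a list of seconds (the extraction step is not shown and the trace format is not assumed), and `find_gaps` and its `threshold` parameter are hypothetical names, not part of schedgraph or any FreeBSD tool.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[float], threshold: float) -> List[Tuple[float, float]]:
    """Return (start, end) pairs where no event was logged for longer than threshold.

    timestamps: event times in seconds, assumed to be extracted from a
                scheduler trace beforehand (hypothetical input).
    threshold:  minimum silent interval, in seconds, to report as a hole.
    """
    ordered = sorted(timestamps)
    gaps = []
    # Walk consecutive pairs of events and record any silent stretch
    # that is longer than the threshold.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > threshold:
            gaps.append((prev, cur))
    return gaps
```

A stretch where every CPU stops logging scheduler events at once is the pattern described above; a reasonable threshold depends on how busy the system normally is.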

In your case you could perhaps be getting PCI errors that are triggering the
SMM handler. Perhaps compare pciconf -le before and after to see if there are
any changes.
--
John Baldwin
Adrian Chadd
2015-04-06 20:38:28 UTC
Hm, ok. Can we extract PCIe errors yet?



-adrian
Rui Paulo
2015-04-06 21:15:13 UTC
Yes, check pciconf.

--
Rui Paulo
Adrian Chadd
2015-04-06 21:16:23 UTC
I'll try, but the system is pretty unusable whilst the card is plugged in...

Thanks!



-a
John Baldwin
2015-04-06 21:28:02 UTC
PCI errors latch. You can run 'pciconf -le' after you yank the card back out.
I would just do this:

'pciconf -le > before'
<insert card and yank it back out>
'pciconf -le > after'

Compare before and after using something like 'kompare'.
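
If no graphical diff tool is at hand, the same comparison can be sketched in a few lines of Python. This is a minimal example, assuming the two files are named `before` and `after` as in the commands above; `changed_lines` and `compare_files` are hypothetical helper names, and nothing here interprets the pciconf output itself.

```python
import difflib
from typing import List


def changed_lines(before: str, after: str) -> List[str]:
    """Return only the lines that differ between two saved outputs.

    Lines prefixed with "- " appear only in `before`;
    lines prefixed with "+ " appear only in `after`.
    """
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]


def compare_files(before_path: str = "before", after_path: str = "after") -> List[str]:
    """Read the two saved outputs from disk and return their differences."""
    with open(before_path) as f_before, open(after_path) as f_after:
        return changed_lines(f_before.read(), f_after.read())
```

Calling `compare_files()` in the directory that holds both files returns an empty list if nothing latched, and otherwise lists the lines that changed.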
--
John Baldwin
Rui Paulo
2015-04-06 19:01:45 UTC
Post by Adrian Chadd
pmcstat -S CPU_CLK_UNHALTED_CORE -T -w 5 doesn't register anything
either - only a handful of background samples.
However, /counter/ mode pmc tells a different story - pmcstat -s
CPU_CLK_UNHALTED_CORE -w 1 shows all four cores going at 110% when the
card is inserted, with brief periods of idle. Once I remove the card,
the counters go back down to zero.
If the cores are being used, you should be getting some samples as to where
the PC is. pmcstat doesn't show that? How about DTrace?
--
Rui Paulo
Adrian Chadd
2015-04-06 20:37:44 UTC
Nope, nothing. Nothing at all shows up in the sample based checks.


-adrian