Discussion: Network card interrupt handling
Sean Bruno
2015-08-26 16:30:48 UTC
We've been diagnosing what appeared to be out of order processing in
the network stack this week only to find out that the network card
driver was shoveling bits to us out of order (em).

This *seems* to be due to a design choice where the driver is allowed
to assert a "soft interrupt" to the h/w device while real interrupts
are disabled. This allows a fake "em_msix_rx" to be started *while*
"em_handle_que" is running from the taskqueue. We've isolated and
worked around this by setting our processing_limit in the driver to
-1. This means that *most* packet processing is now handled in the
MSI-X handler instead of being deferred. Some periodic interference
is still detectable via em_local_timer(), which causes one of these
"fake" interrupt assertions in the normal (card is *not* hung) case.

Both functions use identical code for a start. Both end up down
inside of em_rxeof() to process packets. Both drop the RX lock prior
to handing the data up the network stack.

This means that the em_handle_que running from the taskqueue will be
preempted. DTrace confirms that this allows out-of-order processing
to occur at times and generates a lot of resets.
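(The window is the usual unlock-around-if_input pattern; very roughly,
with illustrative helper names:)

/* Sketch of the drop-lock-before-if_input pattern (illustrative only). */
static void
rx_drain_sketch(struct rx_ring *rxr, struct ifnet *ifp)
{
        struct mbuf *m;

        EM_RX_LOCK(rxr);
        while ((m = dequeue_completed_packet(rxr)) != NULL) {  /* hypothetical */
                EM_RX_UNLOCK(rxr);
                /*
                 * <-- another context (e.g. the "fake" em_msix_rx assertion)
                 *     can take the RX lock right here and hand newer packets
                 *     up the stack before this one.
                 */
                (*ifp->if_input)(ifp, m);
                EM_RX_LOCK(rxr);
        }
        EM_RX_UNLOCK(rxr);
}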

The reason I'm bringing this up on -arch and not on -net is that this
is a common design pattern in some of the Ethernet drivers. We've
done preliminary tests on a patch that moves *all* processing of RX
packets to the rx_task taskqueue, which means that em_handle_que is
now the only path to get packets processed.

<stable10 diff>
https://people.freebsd.org/~sbruno/em_interupt_to_taskqueue.diff
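The shape of the change is roughly the following (a sketch, not the actual
diff; it assumes the RX vector is wired up as a filter and that the ring
carries the usual taskqueue/task members):

/* MSI-X RX vector: don't touch the ring here, just kick the taskqueue. */
static int
msix_rx_sketch(void *arg)
{
        struct rx_ring *rxr = arg;

        taskqueue_enqueue(rxr->tq, &rxr->rx_task);
        return (FILTER_HANDLED);
}

/*
 * em_handle_que (the rx_task handler) then becomes the single, serialized
 * consumer of the ring for its queue.
 */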

My sense is that this is a slightly "better" method to handle the
packets but removes some immediacy from packet processing since all
processing is deferred. However, all packet processing is now
serialized per queue, which I think is the proper implementation.

Am I smoking "le dope" here or is this the way forward?

sean
Jack Vogel
2015-08-26 16:36:06 UTC
I recall actually trying something like this once myself, Sean, but if
memory serves, the performance was poor enough that I decided against
pursuing it. Still, maybe it deserves further investigation.

Jack
John-Mark Gurney
2015-08-28 18:48:00 UTC
Post by Sean Bruno
[snip]
I have a better question: for MSI-X we have a dedicated interrupt
thread to do the processing, so why are we even doing any moderation
in this case? It's not any different than spinning in the task queue..

How about the attached patch that just disables taskqueue processing
for MSI-X RX interrupts, and does all processing in the interrupt
thread?
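
In other words, something along these lines (sketch only; the helpers are
made-up names for whatever drains the ring and re-enables the vector):

/* MSI-X RX handled entirely in the dedicated ithread (sketch). */
static void
msix_rx_ithread_sketch(void *arg)
{
        struct rx_ring *rxr = arg;

        drain_rx_ring(rxr);                     /* hypothetical: no count limit */
        reenable_queue_interrupt(rxr);          /* hypothetical */
}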

Do you need to add the rx_task to em_local_timer? If so, I would look
at setting a flag in _rxeof that says processing is happening... In the
taskqueue case, when it sees this flag set it just exits; for the
interrupt filter case we'd need to be more careful (possibly set a flag
that the taskqueue will inspect, causing it to stop processing the rx
queue)...
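
Roughly what I mean, for the taskqueue side (sketch only; rx_active would
be a new volatile u_int in the ring structure, and the helper name is
made up):

/* Sketch of an "rx already in progress" guard for the taskqueue handler. */
static void
handle_que_sketch(void *context, int pending)
{
        struct rx_ring *rxr = context;

        /* Someone else (ISR or local-timer path) is already in _rxeof: bail. */
        if (atomic_cmpset_int(&rxr->rx_active, 0, 1) == 0)
                return;
        drain_rx_ring(rxr);                             /* hypothetical helper */
        atomic_store_rel_int(&rxr->rx_active, 0);
}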
Post by Sean Bruno
[snip]
I think you discovered an interesting issue..

btw, since you're hacking on em a lot, interested in fixing em's
jumbo frames so it doesn't use 9k clusters, but instead page-sized
clusters?
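
(Background for anyone following along: a 9k cluster is a single
physically contiguous multi-page MJUM9BYTES allocation, which tends to
fail under memory fragmentation; page-sized MJUMPAGESIZE clusters chained
across several descriptors avoid that. Very roughly:)

struct mbuf *m;

/* 9k way: one big physically contiguous cluster per jumbo frame. */
m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUM9BYTES);

/*
 * Page-sized way: each buffer is a single page, and a jumbo frame is
 * spread across several rx descriptors with the mbufs chained together.
 */
m = m_getjcl(M_NOWAIT, MT_DATA, M_PKTHDR, MJUMPAGESIZE);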
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
Jack Vogel
2015-08-28 19:04:32 UTC
The reason the extra handling was added into the local timer
was to chase hangs in the past; the thought was that an interrupt
may have been missed. Flags sound like a nice idea, but there is
the possibility of a race condition where something still gets missed.

It was quite a few years ago, but there was a time when the em
driver was having very intermittent hangs, in fact Sean may have
been one of the victims, and this stuff was an attempt to solve that.

Every time I looked at the em driver it just cried out to be thoroughly
cleaned up or rewritten, but the regression testing for doing that
would be a pain, too.

In any case, it's no longer my job, and I'm glad Sean is giving it the
attention he is :)

Jack
Adrian Chadd
2015-08-28 19:41:36 UTC
[snip]

Well, the other big reason for doing it deferred like this is to avoid
network based deadlocks because you're being fed packets faster than
you can handle them. If you never yield, you stop other NIC
processing.

People used to do run-to-completion and then complained when this
happened, so polling was a thing.

So - I'm all for doing it with a fast interrupt handler and a fast
taskqueue. As long as we don't run things to completion, and instead
re-schedule the taskqueue (so other things on that core get network
processing), then I'm okay.
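
i.e. something like this (rough sketch, made-up helper names):

/* "Do a budget's worth, then get back in line" (sketch). */
static void
rx_task_sketch(void *context, int pending)
{
        struct rx_ring *rxr = context;

        if (rx_clean_budget(rxr, 64)) {         /* hypothetical: true == more work */
                /*
                 * More work left: re-enqueue ourselves instead of looping,
                 * so other handlers sharing this thread/core get a turn.
                 */
                taskqueue_enqueue(rxr->tq, &rxr->rx_task);
        } else {
                reenable_queue_interrupt(rxr);  /* hypothetical */
        }
}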

(I kinda want us to have NAPI at some point...)



-adrian
John-Mark Gurney
2015-08-31 00:00:03 UTC
Post by Adrian Chadd
[snip]
Well, the other big reason for doing it deferred like this is to avoid
network based deadlocks because you're being fed packets faster than
you can handle them. If you never yield, you stop other NIC
processing.
You snipped the part where I asked: isn't the interrupt thread just the
same interruptible context as the task queue? Maybe the priority is
different, but that can be adjusted to be the same and still save the
context switch...

There is no break/moderation in the taskqueue, as it'll just enqueue
itself, and when the task queue breaks out, it'll just immediately run
itself, since it has a dedicated thread to itself... So, looks like
you get the same spinning behavior...
Post by Adrian Chadd
People used to do run-to-completion and then complained when this
happened, so polling was a thing.
Maybe when using PCI shared interrupts, but we are talking about PCIe
MSI-X unshared interrupts.
Post by Adrian Chadd
So - I'm all for doing it with a fast interrupt handler and a fast
taskqueue. As long as we don't run things to completion and
re-schedule the taskqueue (so other things on that core get network
processing) then I'm okay.
(I kinda want us to have NAPI at some point...)
--
John-Mark Gurney Voice: +1 415 225 5579

"All that I will do, has been done, All that I have, has not."
Adrian Chadd
2015-08-31 00:24:48 UTC
Post by John-Mark Gurney
[snip]
Well, try it and see what happens. You can still get network livelock
and starvation of other interfaces with ridiculously high pps if you
never yield. :P



-adrian
John Baldwin
2015-08-31 21:18:17 UTC
Post by John-Mark Gurney
[snip]
There is no break/moderation in the taskqueue, as it'll just enqueue
itself, and when the task queue breaks out, it'll just immediately run
itself, since it has a dedicated thread to itself... So, looks like
you get the same spinning behavior...
Yes, that is true, and it is why all the interrupt moderation stuff in the NIC
drivers that I've seen has always been pointless. All it does is add
extra overhead, since you waste time on extra context switches back to
yourself in between servicing packets. It does not permit any other
NICs to run at all. (One of the goals of my other patches that I
mentioned is to make it possible for multiple devices to share ithreads
even when using discrete interrupts (e.g. MSI), so that the yielding
would actually give other devices a chance to run; currently
it is all just a waste of CPU cycles.)

If you think this actually helps, I challenge you to capture a KTR_SCHED
trace of it ever working as intended.
--
John Baldwin
Sean Bruno
2015-08-31 15:49:21 UTC
Post by John-Mark Gurney
I have a better question, for MSI-X, we have a dedicated interrupt
thread to do the processing, so why are we even doing any
moderation in this case? It's not any different than spinning in
the task queue..
How about the attached patch that just disables taskqueue
processing for MSI-X RX interrupts, and does all processing in the
interrupt thread?
This is another design that I had thought of. For em(4), when using
separate ISR threads for *each* rx queue and *each* tx queue, I think
that doing processing in the interrupt thread is the right thing to do.

I'm unsure what the correct thing to do is when tx/rx are combined into
a single handler, though (igb/ix for example). This would lead to
possible starvation, as Adrian has pointed out. There is nothing
stopping us from breaking the queues apart into separate tx/rx threads
of execution for these drivers. em(4) was my little science project
to see what the behavior would be.
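
For the separate-vector case that's basically one bus_setup_intr() per
queue, each with its own ithread handler, roughly like this (sketch,
error handling trimmed; the handler and field names are placeholders):

/* One ithread handler per RX queue and per TX queue (sketch fragment). */
error = bus_setup_intr(dev, rxr->res, INTR_TYPE_NET | INTR_MPSAFE,
    NULL, handle_rx_queue, rxr, &rxr->tag);             /* placeholder handler */
if (error == 0)
        error = bus_setup_intr(dev, txr->res, INTR_TYPE_NET | INTR_MPSAFE,
            NULL, handle_tx_queue, txr, &txr->tag);     /* placeholder handler */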
Post by John-Mark Gurney
Do you need to add the rx_task to the em_local_timer? If so, then
I would look at setting a flag in the _rxeof that says that
processing is happening... and in the case of the taskqueue, when
it sees this flag set, it just exits, while for the interrupt
filter case, we'd need to be more careful (possibly set a flag that
the taskqueue will inspect, and cause it to stop processing the rx
queue)...
^^ I'll ponder this a bit further today and comment after coffee.
Post by John-Mark Gurney
btw, since you're hacking on em a lot, interested in fixing em's
jumbo frames so it doesn't use 9k clusters, but instead page-sized
clusters?
Uh ... hrm. I can look into it, but I would need more details, as I'm
pretty ignorant of what you're referring to. Ping me off list and
I'll take a look (jumbo frames are out of scope for $dayjob).

sean
John Baldwin
2015-08-28 17:38:36 UTC
Post by Sean Bruno
[snip]
The reason I'm bringing this up on -arch and not on -net is that this
is a common design pattern in some of the Ethernet drivers. We've
done preliminary tests on a patch that moves *all* processing of RX
packets to the rx_task taskqueue, which means that em_handle_que is
now the only path to get packets processed.
It is only a common pattern in the Intel drivers. :-/ We (collectively)
spent quite a while fixing this in ixgbe and igb. Longer (hopefully more
like medium) term I have an update to the interrupt API I want to push in
that allows drivers to manually schedule interrupt handlers using an
'hwi' API to replace the manual taskqueues. This also ensures that
the handler that dequeues packets is only ever running in an ithread
context and never concurrently.
--
John Baldwin
K. Macy
2015-08-29 01:25:53 UTC
Post by John Baldwin
Post by Sean Bruno
[snip]
It is only a common pattern in the Intel drivers. :-/ We (collectively)
spent quite a while fixing this in ixgbe and igb. Longer (hopefully more
like medium) term I have an update to the interrupt API I want to push in
that allows drivers to manually schedule interrupt handlers using an
'hwi' API to replace the manual taskqueues. This also ensures that
the handler that dequeues packets is only ever running in an ithread
context and never concurrently.
Jeff has a generalization of the net_task infrastructure used at Nokia
called grouptaskq that I've used for iflib. That does essentially what you
refer to. I've converted ixl and am currently about to test an ixgbe
conversion. I anticipate converting mlxen and all Intel drivers, as well as the
remaining drivers with device-specific code in netmap. The one catch is
finding someone who will publicly admit to owning re hardware so that I can
buy it from him and test my changes.

Cheers.
Garrett Cooper
2015-08-29 01:52:06 UTC
Post by K. Macy
[snip]
I have 2 re NICs in my fileserver at home (Asus went cheap on some of their MBs a while back), but the cards shouldn't cost more than $15 + shipping (look for "Realtek 8169" on Google).

HTH!
-NGie
Garrett Cooper
2015-08-29 06:27:02 UTC

Post by Garrett Cooper
I have 2 re NICs in my fileserver at home (Asus went cheap on some of their MBs a while back), but the cards shouldn't cost more than $15 + shipping (look for "Realtek 8169" on Google).
QEMU also emulates re(4), depending on what NIC you ask for at boot.
Cheers,
-NGie
John Baldwin
2015-08-31 21:41:14 UTC
Post by K. Macy
[snip]
Note that the ithread changes I refer to are for all devices (not just
network interfaces) and fix some other issues as well (e.g. INTR_FILTER is
always enabled and races with tearing down filters are closed; it also uses
a more thread_lock()-friendly state for idle ithreads; and it allows us
to experiment with sharing ithreads among devices as well as having multiple
threads service a queue of interrupt handlers if desired). It may be that
this will make your life easier, since you might be able to reuse the new
primitives more directly rather than bypassing ithreads. I've posted the
changes to arch@ a few different times over the past several years but just
haven't pushed them in. (They aren't perfect in that I don't yet have
APIs for changing the plumbing around, due to a lack of use cases to build
the APIs from.)
--
John Baldwin
Hooman Fazaeli
2015-08-30 21:21:32 UTC
Post by Sean Bruno
[snip]
Which versions of the driver have this problem?
--
Best regards
Hooman Fazaeli