Discussion:
numa and taskqueues
Emeric POUPON
2017-05-19 07:13:51 UTC
Hello,

I have posted a review to boost IPsec performance when very few flows are involved: https://reviews.freebsd.org/D10680 (reviews would be appreciated, by the way!)
The idea is to dispatch the crypto jobs using a taskqueue (with as many threads as CPUs); details are in the review.

However, this does not scale well on multi-socket architectures (e.g. 2*6 cores): a lot of time is wasted in the locks.

For testing purposes, I created as many taskqueues as there are domains and I modified the taskqueue_start_threads function to accept a cpuset_t mask.
The idea here is to stay on the same domain both to dispatch the crypto jobs and to notify the crypto users back.
This gives quite good performance, so it seems to be a promising approach.

Now the question is: how can I make the taskqueues "domain aware"?
Do I have to add some logic in crypto(9) or could this be abstracted in some other part of the kernel?
Another annoying part is the kprocs used by the return queues. We would also have to bind them to a single domain. How?

What do you think?

Emeric
Adrian Chadd
2017-05-19 15:41:13 UTC
Hi,

I've been worried about the trend of creating ncpu*taskqueue or
ndomain*taskqueue for things unless we really need the priority /
preemption behaviour. Otherwise we will just end up with a lot of
pcpu/pdomain taskqueues that sit idle and/or compete inefficiently.

Anyway - I think it'd be nice to have domain-aware and pcpu-aware
taskqueues so we can eventually migrate to a taskqueue group model of
"one top-level thing for net processing" for devices to share, etc,
etc. But for the short term, just prototype it with some thin API in
crypto that wraps the taskqueue / kproc work so it gets done, then
push that work out for review/evaluation. If it does indeed work the
way you intend, we can try to use it as a template for a higher-level,
shared taskqueue thing.

Thanks,


-adrian
Post by Emeric POUPON
Hello,
I have posted a review to boost IPsec performance when very few flows are involved: https://reviews.freebsd.org/D10680 (reviews would be appreciated, by the way!)
The idea is to dispatch the crypto jobs using a taskqueue (with as many threads as CPUs); details are in the review.
However, this does not scale well on multi-socket architectures (e.g. 2*6 cores): a lot of time is wasted in the locks.
For testing purposes, I created as many taskqueues as there are domains and I modified the taskqueue_start_threads function to accept a cpuset_t mask.
The idea here is to stay on the same domain both to dispatch the crypto jobs and to notify the crypto users back.
This gives quite good performance, so it seems to be a promising approach.
Now the question is: how can I make the taskqueues "domain aware"?
Do I have to add some logic in crypto(9) or could this be abstracted in some other part of the kernel?
Another annoying part is the kprocs used by the return queues. We would also have to bind them to a single domain. How?
What do you think?
Emeric
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-arch
Emeric POUPON
2017-05-30 10:56:56 UTC
Hi,
Post by Adrian Chadd
Anyway - I think it'd be nice to have domain-aware and pcpu-aware
taskqueues so we can eventually migrate to a taskqueue group model of
"one top-level thing for net processing" for devices to share, etc,
etc. But for the short term, just prototype it with some thin API in
crypto that wraps the taskqueue / kproc work so it gets done, then
push that work out for review/evaluation. If it does indeed work the
way you intend, we can try to use it as a template for a higher-level,
shared taskqueue thing.
It looks like it is somewhat mandatory to modify the taskqueue API to pin threads to the
correct CPUs. The logic to decide which CPUs need to be set is another story, which can indeed first
be implemented in crypto(9).

By the way:
1/ do you have some pointers on domain enumeration and other NUMA-related code?
2/ about https://reviews.freebsd.org/D10680, I think it would be great to have this committed as a first step.
Since it seems to be stuck, maybe I can add more people on this. Any suggestions?

Emeric
Adrian Chadd
2017-05-30 14:46:29 UTC
Post by Emeric POUPON
Hi,
Post by Adrian Chadd
Anyway - I think it'd be nice to have domain-aware and pcpu-aware
taskqueues so we can eventually migrate to a taskqueue group model of
"one top-level thing for net processing" for devices to share, etc,
etc. But for the short term, just prototype it with some thin API in
crypto that wraps the taskqueue / kproc work so it gets done, then
push that work out for review/evaluation. If it does indeed work the
way you intend, we can try to use it as a template for a higher-level,
shared taskqueue thing.
It looks like it is somewhat mandatory to modify the taskqueue API to pin threads to the
correct CPUs. The logic to decide which CPUs need to be set is another story, which can indeed first
be implemented in crypto(9).
1/ do you have some pointers on domain enumeration and other numa related code?
Sorry, I'm a bit too busy with other things to dive in right now :(
Post by Emeric POUPON
2/ about https://reviews.freebsd.org/D10680, I think it would be great to have this committed as a first step.
Since it seems to be stuck, maybe I can add more people on this. Any suggestions?
Well, what's with the ~8% performance decrease? Do you know what's
going on? For a "we're parallelising IPSEC operations" change, seeing it get
slower with more flows is a bit concerning.

Thanks,




-adrian
Emeric POUPON
2017-05-30 14:46:01 UTC
Hi,
Post by Adrian Chadd
Post by Emeric POUPON
2/ about https://reviews.freebsd.org/D10680, I think it would be great to have
this committed as a first step.
Since it seems to be stuck, maybe I can add more people on this. Any suggestions?
Well, what's with the ~8% performance decrease? Do you know what's
going on? For a "we're parallelising IPSEC operations" change, seeing it get
slower with more flows is a bit concerning.
Thanks,
Actually, there is a performance boost only when few flows are involved.
That's why this is not activated by default and a sysctl is provided to enable the feature.

To sum up, the more distinct flows you process (both ciphered and unciphered), the more network queues are hit and the more CPUs are driven from ipsec.
In this case, we indeed notice a loss, most likely due to the extra queueing/reordering performed.
Adrian Chadd
2017-05-30 18:26:26 UTC
Post by Emeric POUPON
Hi,
Post by Adrian Chadd
Post by Emeric POUPON
2/ about https://reviews.freebsd.org/D10680, I think it would be great to have
this committed as a first step.
Since it seems to be stuck, maybe I can add more people on this. Any suggestions?
Well, what's with the ~8% performance decrease? Do you know what's
going on? For a "we're parallelising IPSEC operations" change, seeing it get
slower with more flows is a bit concerning.
Thanks,
Actually, there is a performance boost only when few flows are involved.
That's why this is not activated by default and a sysctl is provided to enable the feature.
To sum up, the more distinct flows you process (both ciphered and unciphered), the more network queues are hit and the more CPUs are driven from ipsec.
In this case, we indeed notice a loss, most likely due to the extra queueing/reordering performed.
Can you dig into that a bit more? Do you know exactly what's going on?
eg, is it a "lock contention" problem? Is it a "stuff is context
switching, thus latency" problem? etc, etc.



-adrian
Emeric POUPON
2017-05-31 13:53:21 UTC
Post by Adrian Chadd
Post by Emeric POUPON
Actually, there is a performance boost only when few flows are involved.
That's why this is not activated by default and a sysctl is provided to enable the feature.
To sum up, the more distinct flows you process (both ciphered and unciphered),
the more network queues are hit and the more CPUs are driven from
ipsec.
In this case, we indeed notice a loss, most likely due to the extra
queueing/reordering performed.
Can you dig into that a bit more? Do you know exactly what's going on?
eg, is it a "lock contention" problem? Is it a "stuff is context
switching, thus latency" problem? etc, etc.
Unfortunately, I cannot tell you the exact reason right now.
I am sure there is no lock contention involved, though (except, of course, when several domains are involved).
Did you expect such a feature to be enabled by default?

Emeric
Adrian Chadd
2017-05-31 17:30:41 UTC
Post by Emeric POUPON
Post by Adrian Chadd
Post by Emeric POUPON
Actually, there is a performance boost only when few flows are involved.
That's why this is not activated by default and a sysctl is provided to enable the feature.
To sum up, the more distinct flows you process (both ciphered and unciphered),
the more network queues are hit and the more CPUs are driven from
ipsec.
In this case, we indeed notice a loss, most likely due to the extra
queueing/reordering performed.
Can you dig into that a bit more? Do you know exactly what's going on?
eg, is it a "lock contention" problem? Is it a "stuff is context
switching, thus latency" problem? etc, etc.
Unfortunately, I cannot tell you the exact reason right now.
I am sure there is no lock contention involved, though (except, of course, when several domains are involved).
Did you expect such a feature to be enabled by default?
Well, I'd really like to get to the bottom of these. :-P



-adrian
