Gleb Smirnoff
2015-12-05 05:29:40 UTC
Hi,
[first paragraph for arch subscribers, To: recipients may skip]
This patch is kinda a prerequisite for the non-blocking sendfile(2),
which was jointly developed by NGINX and Netflix in 2014 and has been
running in Netflix production for a year, serving 35% of all
North American (US, Canada, Mexico) Internet traffic.
Technically, the new sendfile(2) doesn't require the new
vm_pager_get_pages() KPI. We currently run it on the old KPI. However,
kib@ suggested that we are abusing the KPI, carefully exploiting its
edge cases. To address this criticism, back in spring I proposed a KPI
where vm_pager_get_pages() takes an all-or-none approach to the array of
pages. Again, kib@ wasn't satisfied, since for "the main user" of
vm_pager_get_pages(), vm_fault(), the all-or-none approach isn't optimal.
The problem was slowly debated through the summer. Then in October
jeff@ suggested yet another extension of the KPI, which I have
implemented and describe below.
[for those interested in the new sendfile(2), skip to the last paragraph,
for those willing to review the new pager KPI, read on]
The new KPI offers this prototype for vm_pager_get_pages():
int
vm_pager_get_pages(vm_object_t object, vm_page_t pages[], int count,
    int *rbehind, int *rahead);
Here "count" is the number of pages in the array. The rbehind
and rahead pointers, if not NULL, specify how many pages the caller is
willing to let the pager pre-cache, if the pager can.
The pager doesn't promise to do any read behind or read ahead. If it does,
then the pager alone is responsible for grabbing, busying, unbusying and
queueing these pages. It also writes the actual amounts of completed
read ahead and read behind back through the pointers.
The pager promises to page in "count" pages or fail. The pager expects the
pages to be busied, and returns them busied. For multi-page requests,
the pager demands that the region is a valid region that exists in
the pager, which can be checked by a preceding call to vm_pager_haspage().
For single-page requests, there is no such demand.
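To make the contract above concrete, here is a minimal user-space sketch. It is not the kernel code and not the patch itself: struct toy_page, struct toy_object, toy_pager_get_pages() and the TOY_PAGER_* codes are all hypothetical stand-ins, and the sketch assumes the pages in the array are contiguous. It only models the in/out behaviour of rbehind and rahead and the "all pages or fail, pages stay busied" rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes, not the kernel's VM_PAGER_* values. */
#define TOY_PAGER_OK   0
#define TOY_PAGER_FAIL 1

/* Hypothetical stand-in for a VM page; not the kernel's struct vm_page. */
struct toy_page {
	int  pindex;	/* page index within the object */
	bool busy;	/* the caller busies the page before the call */
	bool valid;	/* set by the pager once the page is filled */
};

/* Hypothetical backing store that holds pages [0, npages). */
struct toy_object {
	int npages;
};

static int
toy_imin(int a, int b)
{
	return (a < b ? a : b);
}

/*
 * Page in all "count" pages or fail.  The pages arrive busied and are
 * returned busied in both cases.  On success, the read behind/ahead
 * the toy pager could actually do is written back through the pointers.
 */
int
toy_pager_get_pages(struct toy_object *obj, struct toy_page *pages[],
    int count, int *rbehind, int *rahead)
{
	int first, last, i;

	assert(count > 0);
	for (i = 0; i < count; i++) {
		assert(pages[i]->busy);
		/* Assumption of this sketch: the run is contiguous. */
		assert(pages[i]->pindex == pages[0]->pindex + i);
	}
	first = pages[0]->pindex;
	last = pages[count - 1]->pindex;

	/* All or nothing: a region outside the store fails as a whole. */
	if (first < 0 || last >= obj->npages)
		return (TOY_PAGER_FAIL);

	for (i = 0; i < count; i++)
		pages[i]->valid = true;

	/* Clamp the caller's wishes to what exists around the run. */
	if (rbehind != NULL) {
		assert(*rbehind >= 0);
		*rbehind = toy_imin(*rbehind, first);
	}
	if (rahead != NULL) {
		assert(*rahead >= 0);
		*rahead = toy_imin(*rahead, obj->npages - 1 - last);
	}
	return (TOY_PAGER_OK);
}
```

A caller that passes NULL for both pointers asks for no read behind or read ahead at all. A caller that passes values reads them back afterwards to learn what really happened.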
The net result is a win for both vm_fault() and the new sendfile().
vm_fault() no longer needs to do a preparatory vm_pager_haspage(),
which removes one I/O operation. The logic for read ahead/behind,
which is strongly UFS/EXT-centric, moves into vnode_pager.c, so
we no longer do useless operations when taking a fault on ZFS.
vm_fault() now knows precisely how much read ahead happened
when it updates the fs.entry->next_read index. This reduces the number of
hard faults by a tiny fraction (measured during a world build).
The new sendfile() gets a stronger KPI, one that doesn't unbusy pages
that sendfile() needs to keep busied.
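The point about next_read can be illustrated with a toy calculation. This is only a sketch under assumptions: struct toy_entry and toy_update_next_read() are hypothetical, and treating next_read as "the page index right after the last page brought in" is my simplification for illustration, not a description of the actual vm_fault() code.

```c
#include <assert.h>

/* Hypothetical stand-in for the map entry's sequential-read cursor. */
struct toy_entry {
	long next_read;	/* assumed: first page index not yet read */
};

/*
 * Advance the cursor using the read ahead the pager reported back,
 * rather than the amount the fault handler originally asked for.
 */
void
toy_update_next_read(struct toy_entry *entry, long fault_pindex,
    int actual_rahead)
{
	assert(actual_rahead >= 0);
	entry->next_read = fault_pindex + actual_rahead + 1;
}
```

If the fault handler asked for 8 pages of read ahead at page 100 but the pager managed only 3, the cursor lands on 104 and not 109, so the next fault in the sequence is judged against what is really resident.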
Also, the new KPI removes some ugly edges. E.g., since the old
KPI used to unbusy and free pages in the array in case of an
error, the pages could not be wired. However, there are places in
the kernel where we want to page in into a wired page. These places
simply violated the assumption, relying on the lack of errors in the
pager. Moreover, the swap pager has a special function to skip
wired pages while doing the freeing sweep, to avoid hitting an
assertion. That means passing wired pages to the swap pager is kinda
OK, while to any other pager it is not. So we end up with
vm_pager_get_pages() not being pager agnostic, although it is
designed to be so. Now this is fixed.
Peter, if you can, please try the patch in your tests. I already
did that, but you are always better at this :)
[the new sendfile]
As already mentioned, Netflix runs the new sendfile(2) in production,
and it is one of the key components that allow us to serve over
80 Gbit/s from a single box. We strongly want to contribute this
code and see it in FreeBSD 11.0-RELEASE. I believe many FreeBSD
users who run it as a content server also want that. Although the
code was production ready back in 2014, it is still not in head.
The reason is the drama with the vm_pager_get_pages() KPI. I was very
patient during the whole of 2015. Sometimes I waited several weeks for
feedback from the guys in "To:". I was gentle enough not to commit
anything to sys/vm without a review. Now we've got only 2 months left
before the 11.0-RELEASE cycle. And since I want the new sendfile to be
there in 11.0, I'm going to push it strongly, putting aside all my
patience and gentleness. I won't accept any more dislikes of the KPI,
since this is the third round of compromises from my side. I will wait
only one week for pre-commit reviews, and after that all reviews and
adjustments are post-commit.
--
Totus tuus, Glebius.