Discussion:
new vm_pager_get_pages() KPI, round 3
Gleb Smirnoff
2015-12-05 05:29:40 UTC
Hi,

[first paragraph for arch subscribers, To: recipients may skip]

This patch is kind of a prerequisite for the non-blocking sendfile(2)
that was jointly developed by NGINX and Netflix in 2014 and has been
running in Netflix production for a year, serving 35% of all Internet
traffic in North America (US, Canada, Mexico).
Technically, the new sendfile(2) doesn't require the new
vm_pager_get_pages() KPI; we currently run it on the old KPI. However,
kib@ pointed out that we are abusing the KPI by carefully relying on
its edge cases. To address this criticism, back in spring I suggested
a KPI where vm_pager_get_pages() offers an all-or-none approach to the
array of pages. Again, kib@ wasn't satisfied, since for "the main
user" of vm_pager_get_pages(), vm_fault(), an all-or-none approach
isn't optimal. The problem was slowly debated through the summer, and
then in October jeff@ suggested yet another extension of the KPI,
which I have implemented and describe below.

[for those interested in new sendfile(2), skip to the last paragraph,
for those willing to review new pager KPI, read on]

The new KPI offers this prototype for vm_pager_get_pages():

int
vm_pager_get_pages(vm_object_t object, vm_page_t pages[], int count,
    int *rbehind, int *rahead);

Where "count" stands for number of pages in the array. The rbehind
and rahead if not NULL specify how many pages the caller is willing to
allow the pager to pre-cache, if the pager can.

The pager doesn't promise to do any read behind or read ahead. If it
does, then the pager alone is responsible for grabbing, busying,
unbusying and queueing these pages. It also writes the actual amounts
of completed read behind and read ahead back through the pointers.

The pager promises to page in "count" pages or fail. It expects the
pages to be busied, and returns them busied. For multi-page requests,
the pager demands that the region is valid and exists in the pager,
which can be checked by a preceding call to vm_pager_haspage(). For
single-page requests there is no such demand.
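
For illustration, a minimal sketch of a multi-page consumer of the new
KPI; the "object", "pindex" and "npages" names are assumed context,
locking and error handling are abbreviated, and the read-ahead limit
is arbitrary:

        vm_page_t ma[16];
        int after, before, count, error, i, rahead, rbehind;

        VM_OBJECT_WLOCK(object);
        if (!vm_pager_has_page(object, pindex, &before, &after)) {
                VM_OBJECT_WUNLOCK(object);
                return (EIO);
        }
        /* For count > 1 the request must fit the approved region. */
        count = MIN(MIN(npages, (int)nitems(ma)), after + 1);
        for (i = 0; i < count; i++)
                ma[i] = vm_page_grab(object, pindex + i, VM_ALLOC_NORMAL);

        rbehind = 0;    /* no read behind wanted */
        rahead = 16;    /* willing to accept up to 16 extra pages */
        error = vm_pager_get_pages(object, ma, count, &rbehind, &rahead);
        if (error == VM_PAGER_OK) {
                /*
                 * All "count" pages are valid and still busy.  rahead
                 * now holds the number of extra pages the pager really
                 * brought in; those were unbusied and queued by the
                 * pager itself.
                 */
        }
        VM_OBJECT_WUNLOCK(object);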

The net result is a win both for vm_fault() and for the new sendfile().

The vm_fault() no longer needs to do a preparatory vm_pager_haspage(),
which removes one I/O operation. The logic for read ahead/behind,
which is strongly UFS/EXT-centric, moves into vnode_pager.c, so we no
longer do useless operations when taking a fault on ZFS.

The vm_fault() now knows precisely how much read ahead happened when
it updates the fs.entry->next_read index. This reduces the number of
hard faults by a tiny fraction (measured during buildworld).
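
For example, the single-page fault path can now look roughly like this
(a sketch only; the constants and exact field names are illustrative,
not the committed code):

        int rahead, rbehind, rv;

        rbehind = VM_FAULT_READ_BEHIND;
        rahead = VM_FAULT_READ_AHEAD_MAX;
        rv = vm_pager_get_pages(fs.object, &fs.m, 1, &rbehind, &rahead);
        if (rv == VM_PAGER_OK)
                /* Record exactly how far the completed read ahead went. */
                fs.entry->next_read = fs.pindex + rahead + 1;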

The new sendfile() gets a stronger KPI that doesn't unbusy the pages
that sendfile() needs to keep busied.
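
Roughly, the sendfile() page-in path becomes (a simplified sketch;
sfio_attach() is a hypothetical stand-in for whatever hands the busy
pages over to the socket code, and "obj", "pindex", "npages" and
"sfio" are assumed context):

        vm_page_t pa[16];
        int error, i;

        /* Assumes npages <= nitems(pa) and the object is locked. */
        for (i = 0; i < npages; i++)
                pa[i] = vm_page_grab(obj, pindex + i, VM_ALLOC_NORMAL);
        error = vm_pager_get_pages(obj, pa, npages, NULL, NULL);
        if (error != VM_PAGER_OK)
                goto fail;      /* the pages are still ours to unbusy */
        for (i = 0; i < npages; i++)
                sfio_attach(sfio, pa[i]); /* kept busy until socket I/O completes */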

Also, the new KPI removes some ugly edges. E.g., since the old KPI
used to unbusy and free the pages in the array in case of an error,
the pages could not be wired. However, there are places in the kernel
where we want to page in to a wired page. These places simply
violated the assumption, relying on the absence of errors in the
pager. Moreover, the swap pager has a special function to skip wired
pages while doing the freeing sweep, to avoid hitting an assertion.
That means passing wired pages to the swap pager is kind of OK, while
to any other pager it is not. So we ended up with vm_pager_get_pages()
not being pager-agnostic, although it was designed to be. Now this is
fixed.

Peter, if you can, please try the patch in your tests. I already
did that, but you are always better at this :)

[the new sendfile]
As already mentioned, Netflix runs the new sendfile(2) in production,
and it is one of the key components that allow us to serve over
80 Gbit/s from a single box. We strongly want to contribute this
code and see it in FreeBSD 11.0-RELEASE. I believe many FreeBSD
users who run it as a content server also want that. Although the
code was production ready back in 2014, it is still not in head.
The reason is the drama around the vm_pager_get_pages() KPI. I was
very patient during the whole of 2015. Sometimes I waited for feedback
from the guys in "To:" for several weeks. I was very careful not to
commit anything to sys/vm without a review. Now we've got only two
months left before the 11.0-RELEASE cycle, and since I want the new
sendfile to be there in 11.0, I'm going to push for it strongly,
setting aside all my patience and gentleness. I won't accept any more
dislikes of the KPI, since this is the third round of compromises
from my side. I will wait only one week for pre-commit reviews; after
that, all reviews and adjustments are post-commit.
--
Totus tuus, Glebius.
Peter Holm
2015-12-08 08:55:54 UTC
Post by Gleb Smirnoff
Hi,
[first paragraph for arch subscribers, To: recipients may skip]
This patch is kind of a prerequisite for the non-blocking sendfile(2)
that was jointly developed by NGINX and Netflix in 2014 and has been
running in Netflix production for a year, serving 35% of all Internet
traffic in North America (US, Canada, Mexico).
Technically, the new sendfile(2) doesn't require the new
vm_pager_get_pages() KPI; we currently run it on the old KPI. However,
kib@ pointed out that we are abusing the KPI by carefully relying on
its edge cases. To address this criticism, back in spring I suggested
a KPI where vm_pager_get_pages() offers an all-or-none approach to the
array of pages. Again, kib@ wasn't satisfied, since for "the main
user" of vm_pager_get_pages(), vm_fault(), an all-or-none approach
isn't optimal. The problem was slowly debated through the summer, and
then in October jeff@ suggested yet another extension of the KPI,
which I have implemented and describe below.
[for those interested in new sendfile(2), skip to the last paragraph,
for those willing to review new pager KPI, read on]
The new KPI offers this prototype for vm_pager_get_pages():
int
vm_pager_get_pages(vm_object_t object, vm_page_t pages[], int count,
    int *rbehind, int *rahead);
Where "count" stands for number of pages in the array. The rbehind
and rahead if not NULL specify how many pages the caller is willing to
allow the pager to pre-cache, if the pager can.
The pager doesn't promise to do any read behind or read ahead. If it
does, then the pager alone is responsible for grabbing, busying,
unbusying and queueing these pages. It also writes the actual amounts
of completed read behind and read ahead back through the pointers.
The pager promises to page in "count" pages or fail. It expects the
pages to be busied, and returns them busied. For multi-page requests,
the pager demands that the region is valid and exists in the pager,
which can be checked by a preceding call to vm_pager_haspage(). For
single-page requests there is no such demand.
The net result is a win both for vm_fault() and for the new sendfile().
The vm_fault() no longer needs to do a preparatory vm_pager_haspage(),
which removes one I/O operation. The logic for read ahead/behind,
which is strongly UFS/EXT-centric, moves into vnode_pager.c, so we no
longer do useless operations when taking a fault on ZFS.
The vm_fault() now knows precisely how much read ahead happened when
it updates the fs.entry->next_read index. This reduces the number of
hard faults by a tiny fraction (measured during buildworld).
The new sendfile() gets a stronger KPI that doesn't unbusy the pages
that sendfile() needs to keep busied.
Also, the new KPI removes some ugly edges. E.g., since the old KPI
used to unbusy and free the pages in the array in case of an error,
the pages could not be wired. However, there are places in the kernel
where we want to page in to a wired page. These places simply
violated the assumption, relying on the absence of errors in the
pager. Moreover, the swap pager has a special function to skip wired
pages while doing the freeing sweep, to avoid hitting an assertion.
That means passing wired pages to the swap pager is kind of OK, while
to any other pager it is not. So we ended up with vm_pager_get_pages()
not being pager-agnostic, although it was designed to be. Now this is
fixed.
Peter, if you can, please try the patch in your tests. I already
did that, but you are always better at this :)
I ran all the tests that I have on amd64 / 24 CPUs and 64GB.
Ran a "-j 25" buildworld on amd64 / 24 CPUs and 2GB.
Ran a buildkernel on i386 / 1 CPU and 256MB.

No problems seen.

- Peter
Gleb Smirnoff
2015-12-08 16:31:35 UTC
On Tue, Dec 08, 2015 at 09:55:54AM +0100, Peter Holm wrote:
P> > Peter, if you can, please try the patch in your tests. I already
P> > did that, but you are always better at this :)
P> >
P>
P> I ran all the tests that I have on amd64 / 24 CPUs and 64GB.
P> Ran a "-j 25" buildworld on amd64 / 24 CPUs and 2GB.
P> Ran a buildkernel on i386 / 1 CPU and 256MB.
P>
P> No problems seen.

Thanks a lot, Peter!
--
Totus tuus, Glebius.
Konstantin Belousov
2015-12-14 11:13:35 UTC
Post by Gleb Smirnoff
Hi,
[first paragraph for arch subscribers, To: recipients may skip]
This patch is kind of a prerequisite for the non-blocking sendfile(2)
that was jointly developed by NGINX and Netflix in 2014 and has been
running in Netflix production for a year, serving 35% of all Internet
traffic in North America (US, Canada, Mexico).
Technically, the new sendfile(2) doesn't require the new
vm_pager_get_pages() KPI; we currently run it on the old KPI. However,
kib@ pointed out that we are abusing the KPI by carefully relying on
its edge cases. To address this criticism, back in spring I suggested
a KPI where vm_pager_get_pages() offers an all-or-none approach to the
array of pages. Again, kib@ wasn't satisfied, since for "the main
user" of vm_pager_get_pages(), vm_fault(), an all-or-none approach
isn't optimal. The problem was slowly debated through the summer, and
then in October jeff@ suggested yet another extension of the KPI,
which I have implemented and describe below.
[for those interested in new sendfile(2), skip to the last paragraph,
for those willing to review new pager KPI, read on]
The new KPI offers this prototype for vm_pager_get_pages():
int
vm_pager_get_pages(vm_object_t object, vm_page_t pages[], int count,
    int *rbehind, int *rahead);
Where "count" stands for number of pages in the array. The rbehind
and rahead if not NULL specify how many pages the caller is willing to
allow the pager to pre-cache, if the pager can.
The pager doesn't promise to do any read behind or read ahead. If it
does, then the pager alone is responsible for grabbing, busying,
unbusying and queueing these pages. It also writes the actual amounts
of completed read behind and read ahead back through the pointers.
The pager promises to page in "count" pages or fail. It expects the
pages to be busied, and returns them busied. For multi-page requests,
the pager demands that the region is valid and exists in the pager,
which can be checked by a preceding call to vm_pager_haspage(). For
single-page requests there is no such demand.
I fail to understand how the case of count > 1 and non-contiguous blocks
in the non-readahead case is handled by the new vnode_pager_generic_getpages().

I do not understand how a hole somewhere in the requested range is handled.
The code has a comment that a hole must not appear in the range.

Both issues mean that vm_pager_has_page() still must be called before
the pagein for count > 1 use. E.g. exec_map_first_page() uses the
*after value returned from has_pages() to calculate count, and that
value is advisory, not a contract. The same issue prevents converting
GEM and TTM (and probably md) to the count > 1 KPI.

The same is true for the swap pager, and this prevents the removal of
the loop in vm_thread_swapin().

The code assumes that a partially valid page may only appear in the
last position of the page run for the local pager, which again
requires pre-validation of the vm_pager_get_pages() call on the
caller side.

Overall this is not the KPI that was discussed. It seemingly does not
change the semantics for the count == 1 case, but it is not what it
should be for count > 1. As discussed, the new vm_pager_get_pages()
was supposed to just work for any count, doing the loop over the
non-contiguous ranges or short reads, and guaranteeing that all
existing (or hole-filled) pages are read until EOF is met. This KPI
was supposed to:
- fix my complaints about short reads
- avoid the excessive VOP_BMAP() call from has_pages() before
  get_pages()
- allow removing the loops from all current get_pages() consumers,
  i.e. vm_thread_swapin(), GEM/TTM, the image activator
- supposedly serve your needs for sendfile(2)
in one go. What is present in your patch might only satisfy the last
item of the list.
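
To illustrate what I mean by looping inside the pager, a rough sketch
(illustrative only: do_contig_read() and zero_fill_page() are made-up
placeholders, EOF handling is omitted, and a vnode pager would use
VOP_BMAP() rather than vm_pager_has_page() to find the contiguous
runs):

        for (i = 0; i < count; ) {
                if (!vm_pager_has_page(obj, ma[i]->pindex, NULL, &after)) {
                        zero_fill_page(ma[i]);  /* a hole in the object */
                        i++;
                        continue;
                }
                run = MIN(count - i, after + 1);
                if ((error = do_contig_read(obj, &ma[i], run)) != 0)
                        return (error);
                i += run;
        }
        return (VM_PAGER_OK);
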
Gleb Smirnoff
2015-12-14 17:50:46 UTC
On Mon, Dec 14, 2015 at 01:13:35PM +0200, Konstantin Belousov wrote:
K> I fail to understand how the case of count > 1 and non-contiguous blocks
K> in the non-readahead case is handled by the new vnode_pager_generic_getpages().
K>
K> I do not understand how a hole somewhere in the requested range is handled.
K> The code has a comment that a hole must not appear in the range.
K>
K> Both issues mean that vm_pager_has_page() still must be called before
K> the pagein for count > 1 use. E.g. exec_map_first_page() uses the *after
K> value returned from has_pages() to calculate count, and that value is
K> advisory, not a contract. The same issue prevents converting GEM and TTM
K> (and probably md) to the count > 1 KPI.

The *after and *before are now not advisory, but a contract. Consumers
who want to utilize count > 1 must precede the call to
vm_pager_get_pages() with a call to vm_pager_haspage(). Only the
region approved by vm_pager_haspage(), or a smaller one, will succeed.

K> The same is true for the swap pager, and this prevents the removal of
K> the loop in vm_thread_swapin().
K>
K> The code assumes that a partially valid page may only appear in the last
K> position of the page run for the local pager, which again requires
K> pre-validation of the vm_pager_get_pages() call on the caller side.

Yes, asking to page in over an already valid page risks data corruption.

K> Overall this is not the KPI that was discussed. It seemingly does not
K> change the semantics for the count == 1 case, but it is not what it
K> should be for count > 1. As discussed, the new vm_pager_get_pages() was
K> supposed to just work for any count, doing the loop over the
K> non-contiguous ranges or short reads, and guaranteeing that all existing
K> (or hole-filled) pages are read until EOF is met. This KPI was supposed to:
K> - fix my complaints about short reads

I will not accept your complaints about short reads. The get-pages KPI
is not a complement to VOP_READ(), nor to the read(2) syscall. If an
underlying filesystem has problems here, it must deal with them on its
own, doing multiple I/Os per VOP_GETPAGES().

K> - avoid the excessive VOP_BMAP() call from has_pages() before get_pages()

It is now avoided for count == 1.

K> - allow removing the loops from all current get_pages() consumers,
K>   i.e. vm_thread_swapin(), GEM/TTM, the image activator

This wasn't discussed at all. I like the idea; it can be done later.
--
Totus tuus, Glebius.
Konstantin Belousov
2015-12-16 13:42:20 UTC
Post by Gleb Smirnoff
K> I fail to understand how the case of count > 1 and non-contiguous blocks
K> in the non-readahead case is handled by the new vnode_pager_generic_getpages().
K>
K> I do not understand how a hole somewhere in the requested range is handled.
K> The code has a comment that a hole must not appear in the range.
K>
K> Both issues mean that vm_pager_has_page() still must be called before
K> the pagein for count > 1 use. E.g. exec_map_first_page() uses the *after
K> value returned from has_pages() to calculate count, and that value is
K> advisory, not a contract. The same issue prevents converting GEM and TTM
K> (and probably md) to the count > 1 KPI.
The *after and *before are now not advisory, but a contract. Consumers
who want to utilize count > 1 must precede the call to
vm_pager_get_pages() with a call to vm_pager_haspage(). Only the
region approved by vm_pager_haspage(), or a smaller one, will succeed.
Let me repeat what I discussed:

int vm_pager_get_pages(vm_object_t obj, vm_page_t *mrun, int count,
    int before, int after);

mrun contains exactly count contiguous busy pages owned by obj. All
mrun pages must be returned busy. All mrun pages must be returned
valid, or the pager must return an error.

Advisory: up to 'before' pages before mrun[0] and up to 'after' pages
after mrun[count - 1] might be read by the pager, if it succeeds in
finding the pages on the queues, or allocates them, and no pages
outside the mrun are busy, etc. The pager is allowed to ignore the
hints for any reason; a typical cause for a VOP_BMAP()-backed pager
would be the end of the contiguous run on disk. The additional pages
are not owned by the calling thread after vm_pager_get_pages()
returns.
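
Spelled out as a header comment, that contract would read roughly like
this (a sketch of the above, not committed code):

        /*
         * mrun[] holds exactly "count" contiguous busy pages owned by
         * "obj".  On return all of them are still busy and either all
         * valid, or an error is returned.
         *
         * "before"/"after" are advisory: the pager may additionally
         * read up to that many pages around the run if it can find or
         * allocate them and they are not busy; it may ignore the hints
         * for any reason, e.g. the end of the contiguous run on disk
         * for a VOP_BMAP()-backed pager.  The extra pages are not
         * owned by the calling thread once the call returns.
         */
        int vm_pager_get_pages(vm_object_t obj, vm_page_t *mrun, int count,
            int before, int after);
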
Post by Gleb Smirnoff
K> The same is true for the swap pager, and this prevents the removal of
K> the loop in vm_thread_swapin().
K>
K> The code assumes that a partially valid page may only appear in the last
K> position of the page run for the local pager, which again requires
K> pre-validation of the vm_pager_get_pages() call on the caller side.
Yes, asking to page in over an already valid page risks data corruption.
This is an internal pager detail. In the interface I discussed, the
pages returned should be valid. If they are already valid at call
time, it is the pager's job to avoid corruption by not doing I/O.