Discussion:
more strict KPI for vm_pager_get_pages()
Gleb Smirnoff
2015-04-30 14:24:08 UTC
Permalink
Hi!

The reason to write down this patch emerges from the
projects/sendfile branch, where vm_pager_get_pages() is
used in the sendfile(2) system call. Although the new
sendfile works flawlessly, it makes some assumptions
about vnode_pager that theoretically may not be valid,
however always hold in our current code.

Going deeper in the problem I have found more important
points, which yielded in the suggested patch. To start,
let me display the current KPI assumptions:

1) vm_pager_get_pages() works on an array of consequtive
array of pages. Pindex of (n+1)-th pages must be pindex
of n-th + 1. One page is special, it is called reqpage.
2) vm_pager_get_pages() guarantees to swapin only the reqpage,
and may skip or fail other pages for different reasons, that
may vary from pager to pager.
3) There also is function vm_pager_has_page(), which reports
availability of a page at given index in the pager, and also
provides hints on how many consequtive pages before this one
and after this one can be swapped in in single pager request.
Most pagers return zeros in these hints. The vnode pager for
UFS returns a strong promise, that one can later utilize in
vm_pager_get_pages().
4) All pages must be busied on enter. On exit only reqpage
will be left busied. The KPI doesn't guarantee that rest
of the pages is still in place. The pager usually calls
vm_page_readahead_finish() on them, which can either free,
or put the page on active/inactive queue, using quite
a strange approach to choose a queue.
5) The pages must not be wired, since vm_page_free() may be
called on them. However, this is violated by several
consumers of KPI, relying on lack of errors in the pager.
Moreover, the swap pager has a special function to skip
wired pages, while doing the sweep, to avoid this problem.
So, passing wired pages to swapper is OK, while to the
reset is not.
6) Pagers may replace a page in the object with a new one.
The sg_pager actually does that. To protect from this
event, consumers of vm_pager_get_pages() always run
vm_page_lookup() over the array of pages to relookup the pages.
However, not all consumers do this.

Speaking of pagers and their consumers:
- 11 consumers request array of size 1, a single page
- 3 consumers actually request array

My suggestion is to change the KPI assumptions to the following:

1) There is no reqpage. All pages are entered busied, all pages
are returned busied and validated. If pager fails to validate
all pages it must return error.
2) The consumer (not the pager!) is to decide what to do with the
pages: vm_page_active, vm_page_deactivate, vm_page_flash or just
vm_page_free them. The consumer also unbusies pages, if it
wants to. The consumer is free to wire pages before the call.
3) Consumers must first query the pager via vm_pager_has_page(),
and use the after/before hints to limit the size of the
requested pages array.
4) In case if pager replaces pages, it must also update the array,
so that consumer doesn't need to do relookup.

Doing this sweep, I also noticed that all pagers have a copy-pasted
code of zeroing invalid regions of partially valid pages. Also,
many pagers got a set of assertions copy and pasted from each
other. So, I decided to un-inline the vm_pager_get_pages(), bring
it to the vm_pager.c file and gather all these copy-pastes
into one place.

The suggested patch is attached. As expected, it simplifies and
removes quite a lot of code.

Right now it is tested on UFS only, testing NFS and ZFS is on my list.
There is one panic known, but it seems unrelated, and Peter pho@ says
that once it has been seen before.
--
Totus tuus, Glebius.
Konstantin Belousov
2015-05-04 08:24:26 UTC
Permalink
Post by Gleb Smirnoff
Hi!
The reason to write down this patch emerges from the
projects/sendfile branch, where vm_pager_get_pages() is
used in the sendfile(2) system call. Although the new
sendfile works flawlessly, it makes some assumptions
about vnode_pager that theoretically may not be valid,
however always hold in our current code.
Going deeper in the problem I have found more important
points, which yielded in the suggested patch. To start,
1) vm_pager_get_pages() works on an array of consequtive
array of pages. Pindex of (n+1)-th pages must be pindex
of n-th + 1. One page is special, it is called reqpage.
2) vm_pager_get_pages() guarantees to swapin only the reqpage,
and may skip or fail other pages for different reasons, that
may vary from pager to pager.
3) There also is function vm_pager_has_page(), which reports
availability of a page at given index in the pager, and also
provides hints on how many consequtive pages before this one
and after this one can be swapped in in single pager request.
Most pagers return zeros in these hints. The vnode pager for
UFS returns a strong promise, that one can later utilize in
vm_pager_get_pages().
4) All pages must be busied on enter. On exit only reqpage
will be left busied. The KPI doesn't guarantee that rest
of the pages is still in place. The pager usually calls
vm_page_readahead_finish() on them, which can either free,
or put the page on active/inactive queue, using quite
a strange approach to choose a queue.
5) The pages must not be wired, since vm_page_free() may be
called on them. However, this is violated by several
consumers of KPI, relying on lack of errors in the pager.
Moreover, the swap pager has a special function to skip
wired pages, while doing the sweep, to avoid this problem.
So, passing wired pages to swapper is OK, while to the
reset is not.
6) Pagers may replace a page in the object with a new one.
The sg_pager actually does that. To protect from this
event, consumers of vm_pager_get_pages() always run
vm_page_lookup() over the array of pages to relookup the pages.
However, not all consumers do this.
- 11 consumers request array of size 1, a single page
- 3 consumers actually request array
1) There is no reqpage. All pages are entered busied, all pages
are returned busied and validated. If pager fails to validate
all pages it must return error.
2) The consumer (not the pager!) is to decide what to do with the
pages: vm_page_active, vm_page_deactivate, vm_page_flash or just
vm_page_free them. The consumer also unbusies pages, if it
wants to. The consumer is free to wire pages before the call.
3) Consumers must first query the pager via vm_pager_has_page(),
and use the after/before hints to limit the size of the
requested pages array.
4) In case if pager replaces pages, it must also update the array,
so that consumer doesn't need to do relookup.
Doing this sweep, I also noticed that all pagers have a copy-pasted
code of zeroing invalid regions of partially valid pages. Also,
many pagers got a set of assertions copy and pasted from each
other. So, I decided to un-inline the vm_pager_get_pages(), bring
it to the vm_pager.c file and gather all these copy-pastes
into one place.
The suggested patch is attached. As expected, it simplifies and
removes quite a lot of code.
Right now it is tested on UFS only, testing NFS and ZFS is on my list.
that once it has been seen before.
Below is the summary of my part of the internal discussion about the changes.

Traditionally, Unix allows the filesystems to perform the short reads.
Most fundamental change in the patch removes this freedom from the
filesystem implementation, and I think that only local filesystems could
be compliant with the proposed strictness.

IMO, the response from vm_pager_haspages() is only advisory, since
filesystem might not control the external entities which are the source
of the required data.
Gleb Smirnoff
2015-05-04 09:11:37 UTC
Permalink
On Mon, May 04, 2015 at 11:24:26AM +0300, Konstantin Belousov wrote:
K> Below is the summary of my part of the internal discussion about the changes.

Quite short. Is it truncated?

K> Traditionally, Unix allows the filesystems to perform the short reads.
K> Most fundamental change in the patch removes this freedom from the
K> filesystem implementation, and I think that only local filesystems could
K> be compliant with the proposed strictness.
K>
K> IMO, the response from vm_pager_haspages() is only advisory, since
K> filesystem might not control the external entities which are the source
K> of the required data.

That's why remote filesystems use vop_stdbmap() (or similar), which
always return zeroes for "after" and "before" hints.
--
Totus tuus, Glebius.
Konstantin Belousov
2015-05-04 09:51:16 UTC
Permalink
Post by Gleb Smirnoff
K> Below is the summary of my part of the internal discussion about the changes.
Quite short. Is it truncated?
No. IMO, I pointed out the most important point about the patch. If
other changes in the patch are unrelated, they must be extracted and
discussed (and committed) separately. Due to the fundamental nature of
the code being changed, the extra work to make it easier to bisect and
detect regressions worth it.
Post by Gleb Smirnoff
--
Totus tuus, Glebius.
K> Traditionally, Unix allows the filesystems to perform the short reads.
K> Most fundamental change in the patch removes this freedom from the
K> filesystem implementation, and I think that only local filesystems could
K> be compliant with the proposed strictness.
K>
K> IMO, the response from vm_pager_haspages() is only advisory, since
K> filesystem might not control the external entities which are the source
K> of the required data.
That's why remote filesystems use vop_stdbmap() (or similar), which
always return zeroes for "after" and "before" hints.
Which precludes useful optimizations, at all, in the future.
Gleb Smirnoff
2015-05-04 18:38:52 UTC
Permalink
On Mon, May 04, 2015 at 12:51:16PM +0300, Konstantin Belousov wrote:
K> > On Mon, May 04, 2015 at 11:24:26AM +0300, Konstantin Belousov wrote:
K> > K> Below is the summary of my part of the internal discussion about the changes.
K> >
K> > Quite short. Is it truncated?
K> No. IMO, I pointed out the most important point about the patch. If
K> other changes in the patch are unrelated, they must be extracted and
K> discussed (and committed) separately. Due to the fundamental nature of
K> the code being changed, the extra work to make it easier to bisect and
K> detect regressions worth it.

Of course, I'm not going to commit it as one. Some parts can be separated.

K> > K> Traditionally, Unix allows the filesystems to perform the short reads.
K> > K> Most fundamental change in the patch removes this freedom from the
K> > K> filesystem implementation, and I think that only local filesystems could
K> > K> be compliant with the proposed strictness.
K> > K>
K> > K> IMO, the response from vm_pager_haspages() is only advisory, since
K> > K> filesystem might not control the external entities which are the source
K> > K> of the required data.
K> >
K> > That's why remote filesystems use vop_stdbmap() (or similar), which
K> > always return zeroes for "after" and "before" hints.
K>
K> Which precludes useful optimizations, at all, in the future.

If in the future there appears a new consumer, who is fine with partial
success, and if in the future there would appear a pager, that may fail
to satisfy its own hints, then we will simply:

1) Relax assertions at the end of vm_pager_get_pages() and thus allow
partial success.
2) Let the new consumer scan returned pages and check their 'valid' bits.
3) All current consumers, who prefer all or none result, would still look at
the return value, as they do now. No change for them.

So, I'd insist that patch doesn't preclude future optimizations. Instead,
the current 'one page valid & busy, all pages unbusied' approach blocks
different optimizations. The consumer usually has better idea on what to
do with the pages.

P.S. Meanwhile Peter pho@ reported 32 hour successful test run with the patch.
--
Totus tuus, Glebius.
Julian Elischer
2015-05-04 09:50:39 UTC
Permalink
Post by Konstantin Belousov
Post by Gleb Smirnoff
Hi!
The reason to write down this patch emerges from the
projects/sendfile branch, where vm_pager_get_pages() is
used in the sendfile(2) system call. Although the new
sendfile works flawlessly, it makes some assumptions
about vnode_pager that theoretically may not be valid,
however always hold in our current code.
Going deeper in the problem I have found more important
points, which yielded in the suggested patch. To start,
1) vm_pager_get_pages() works on an array of consequtive
array of pages. Pindex of (n+1)-th pages must be pindex
of n-th + 1. One page is special, it is called reqpage.
2) vm_pager_get_pages() guarantees to swapin only the reqpage,
and may skip or fail other pages for different reasons, that
may vary from pager to pager.
3) There also is function vm_pager_has_page(), which reports
availability of a page at given index in the pager, and also
provides hints on how many consequtive pages before this one
and after this one can be swapped in in single pager request.
Most pagers return zeros in these hints. The vnode pager for
UFS returns a strong promise, that one can later utilize in
vm_pager_get_pages().
4) All pages must be busied on enter. On exit only reqpage
will be left busied. The KPI doesn't guarantee that rest
of the pages is still in place. The pager usually calls
vm_page_readahead_finish() on them, which can either free,
or put the page on active/inactive queue, using quite
a strange approach to choose a queue.
5) The pages must not be wired, since vm_page_free() may be
called on them. However, this is violated by several
consumers of KPI, relying on lack of errors in the pager.
Moreover, the swap pager has a special function to skip
wired pages, while doing the sweep, to avoid this problem.
So, passing wired pages to swapper is OK, while to the
reset is not.
6) Pagers may replace a page in the object with a new one.
The sg_pager actually does that. To protect from this
event, consumers of vm_pager_get_pages() always run
vm_page_lookup() over the array of pages to relookup the pages.
However, not all consumers do this.
- 11 consumers request array of size 1, a single page
- 3 consumers actually request array
1) There is no reqpage. All pages are entered busied, all pages
are returned busied and validated. If pager fails to validate
all pages it must return error.
2) The consumer (not the pager!) is to decide what to do with the
pages: vm_page_active, vm_page_deactivate, vm_page_flash or just
vm_page_free them. The consumer also unbusies pages, if it
wants to. The consumer is free to wire pages before the call.
3) Consumers must first query the pager via vm_pager_has_page(),
and use the after/before hints to limit the size of the
requested pages array.
4) In case if pager replaces pages, it must also update the array,
so that consumer doesn't need to do relookup.
Doing this sweep, I also noticed that all pagers have a copy-pasted
code of zeroing invalid regions of partially valid pages. Also,
many pagers got a set of assertions copy and pasted from each
other. So, I decided to un-inline the vm_pager_get_pages(), bring
it to the vm_pager.c file and gather all these copy-pastes
into one place.
The suggested patch is attached. As expected, it simplifies and
removes quite a lot of code.
Right now it is tested on UFS only, testing NFS and ZFS is on my list.
that once it has been seen before.
Below is the summary of my part of the internal discussion about the changes.
Traditionally, Unix allows the filesystems to perform the short reads.
Most fundamental change in the patch removes this freedom from the
filesystem implementation, and I think that only local filesystems could
be compliant with the proposed strictness.
IMO, the response from vm_pager_haspages() is only advisory, since
filesystem might not control the external entities which are the source
of the required data.
Also since the backing object is not locked, a truncate() may be performed
between the operations making the prior return information invalid.
Certainly in remote filesystems, possibly on local ones too.
Post by Konstantin Belousov
_______________________________________________
http://lists.freebsd.org/mailman/listinfo/freebsd-arch
Gleb Smirnoff
2015-05-04 09:54:59 UTC
Permalink
On Mon, May 04, 2015 at 05:50:39PM +0800, Julian Elischer wrote:
J> > IMO, the response from vm_pager_haspages() is only advisory, since
J> > filesystem might not control the external entities which are the source
J> > of the required data.
J>
J> Also since the backing object is not locked, a truncate() may be performed
J> between the operations making the prior return information invalid.
J> Certainly in remote filesystems, possibly on local ones too.

Of course, the object shouldn't be unlocked between vm_pager_haspage()
and vm_pager_get_pages(). All current consumers do not unlock.
--
Totus tuus, Glebius.
Gleb Smirnoff
2015-05-06 11:45:49 UTC
Permalink
Hi!

I'm splitting the patch into a serie. This is step 1:

Pagers are responsible to update the array of pages in
case if they replace pages in an object. All array entries
must be valid, if pager returns VM_PAGER_OK.

Note: the only pager that replaces pages is sg_pager, and
it does that correctly.
--
Totus tuus, Glebius.
Alan Cox
2015-05-06 16:59:51 UTC
Permalink
Post by Gleb Smirnoff
Hi!
Pagers are responsible to update the array of pages in
case if they replace pages in an object. All array entries
must be valid, if pager returns VM_PAGER_OK.
Note: the only pager that replaces pages is sg_pager, and
it does that correctly.
No, look again, e.g., device_pager.
Gleb Smirnoff
2015-05-06 17:16:11 UTC
Permalink
On Wed, May 06, 2015 at 11:59:51AM -0500, Alan Cox wrote:
A> > Pagers are responsible to update the array of pages in
A> > case if they replace pages in an object. All array entries
A> > must be valid, if pager returns VM_PAGER_OK.
A> >
A> > Note: the only pager that replaces pages is sg_pager, and
A> > it does that correctly.
A> >
A>
A> No, look again, e.g., device_pager.

Yes, now I see it, it goes trough cdev op.

But it appears also to work correctly - it updates the page in the array.
--
Totus tuus, Glebius.
Gleb Smirnoff
2015-06-12 20:44:42 UTC
Permalink
Alan,

[cc arch@ back]

On Fri, Jun 12, 2015 at 02:29:21PM -0500, Alan Cox wrote:
A> I'm fine with changing the rules or expectations concerning the
A> *requested* page. In fact, there are inconsistencies among the callers
A> in whether they believe that the requested page could disappear.
A> Specifically, some callers (e.g., kern_exec.c) handle the possibility of
A> the requested page disappearing and treat its disappearance as an I/O
A> error, while other callers (e.g., tmpfs_subr.c) would crash if the
A> requested disappeared. Since we also expect the requested page to
A> remain busy until we return to the caller, I don't see the point in
A> handling the possibility that the requested page disappeared. In other
A> words, error or no error, the request page remains allocated and busy;
A> moreover, we guarantee that the array entry references the correct page.
A>
A> On the other hand, I'm not ready to make a guarantee about the state of
A> the array entries for the non-request pages. In the general case, the
A> I/O completion handlers will unbusy and place the pages in a paging
A> queue. So, in principle, they could be reclaimed before control was
A> returned to vm_pager_get_pages()'s caller, and even if the pager updated
A> the array entries, they would no longer be valid. For example, the page
A> daemon could reclaim them, or another thread could simply decide to free
A> them for some arbitrary reason.
A>
A> In a nutshell, I'm fine with all of the changes except the one to
A> vm_thread_swapin(). The change to vm_thread_swapin() is only safe
A> because the pages have been wired and the pages are used in a particular
A> way, i.e., the implementation of a thread stack.

Replying to the last two paragraphs:

Yes, and this lack of guarantee is the inconsistency, that I'd like to
address. The patch committed is only a first step of a bigger
vm_pager_get_pages KPI strictening.

Let's get back to the patch that started this topic:

https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html
--
Totus tuus, Glebius.
Alan Cox
2015-06-12 22:42:12 UTC
Permalink
Post by Gleb Smirnoff
Alan,
A> I'm fine with changing the rules or expectations concerning the
A> *requested* page. In fact, there are inconsistencies among the callers
A> in whether they believe that the requested page could disappear.
A> Specifically, some callers (e.g., kern_exec.c) handle the possibility of
A> the requested page disappearing and treat its disappearance as an I/O
A> error, while other callers (e.g., tmpfs_subr.c) would crash if the
A> requested disappeared. Since we also expect the requested page to
A> remain busy until we return to the caller, I don't see the point in
A> handling the possibility that the requested page disappeared. In other
A> words, error or no error, the request page remains allocated and busy;
A> moreover, we guarantee that the array entry references the correct page.
A>
A> On the other hand, I'm not ready to make a guarantee about the state of
A> the array entries for the non-request pages. In the general case, the
A> I/O completion handlers will unbusy and place the pages in a paging
A> queue. So, in principle, they could be reclaimed before control was
A> returned to vm_pager_get_pages()'s caller, and even if the pager updated
A> the array entries, they would no longer be valid. For example, the page
A> daemon could reclaim them, or another thread could simply decide to free
A> them for some arbitrary reason.
A>
A> In a nutshell, I'm fine with all of the changes except the one to
A> vm_thread_swapin(). The change to vm_thread_swapin() is only safe
A> because the pages have been wired and the pages are used in a particular
A> way, i.e., the implementation of a thread stack.
Yes, and this lack of guarantee is the inconsistency, that I'd like to
address. The patch committed is only a first step of a bigger
vm_pager_get_pages KPI strictening.
https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html
I'm not sure that I understand what inconsistency you're referring to
here. That the request page is handled differently from the non-request
pages?

Again, I'm happy with the changes to the handling of the request page.
However, I'm still on the fence about the other proposed changes, and I
feel like the change to vm_thread_swapin() in the patch we are
discussing is qualitatively different from the other changes in that
same patch. In particular, it is the only part of that patch that
touches non-request pages. As such, I didn't think it belongs.
Gleb Smirnoff
2015-06-15 11:31:00 UTC
Permalink
Alan,

On Fri, Jun 12, 2015 at 05:42:12PM -0500, Alan Cox wrote:
A> > A> On the other hand, I'm not ready to make a guarantee about the state of
A> > A> the array entries for the non-request pages. In the general case, the
A> > A> I/O completion handlers will unbusy and place the pages in a paging
A> > A> queue. So, in principle, they could be reclaimed before control was
A> > A> returned to vm_pager_get_pages()'s caller, and even if the pager updated
A> > A> the array entries, they would no longer be valid. For example, the page
A> > A> daemon could reclaim them, or another thread could simply decide to free
A> > A> them for some arbitrary reason.
A> > A>
A> > A> In a nutshell, I'm fine with all of the changes except the one to
A> > A> vm_thread_swapin(). The change to vm_thread_swapin() is only safe
A> > A> because the pages have been wired and the pages are used in a particular
A> > A> way, i.e., the implementation of a thread stack.
A> >
A> > Replying to the last two paragraphs:
A> >
A> > Yes, and this lack of guarantee is the inconsistency, that I'd like to
A> > address. The patch committed is only a first step of a bigger
A> > vm_pager_get_pages KPI strictening.
A> >
A> > Let's get back to the patch that started this topic:
A> >
A> > https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html
A>
A> I'm not sure that I understand what inconsistency you're referring to
A> here. That the request page is handled differently from the non-request
A> pages?
A>
A> Again, I'm happy with the changes to the handling of the request page.
A> However, I'm still on the fence about the other proposed changes, and I
A> feel like the change to vm_thread_swapin() in the patch we are
A> discussing is qualitatively different from the other changes in that
A> same patch. In particular, it is the only part of that patch that
A> touches non-request pages. As such, I didn't think it belongs.

The vm_thread_swapin() is different from other consumers of
vm_pager_get_pages() since it is interested in non-request pages as much
as it is interested in the request page. The vm_thread_swapin() is rewritten
to utilize the new KPI. Before rewrite, it could request for pages, that
couldn't be read. The older KPI allowed that, since succeeding only with the
reqpage meant success. The new KPI would fail on such request. Now, if we
want to do multiple page request to vm_pager_get_pages(), we first need to
determine how many pages can we ask for via vm_pager_has_page().
--
Totus tuus, Glebius.
Gleb Smirnoff
2015-08-07 13:38:44 UTC
Permalink
Hi!

This is followup on older email:

https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html

The preparatory commits were already checked in, and I'm going
to push next week the rest, since the whole story is already
several months due.

Planned changes:

o vm_pager_get_pages() accepts array of pages, and treats all pages
equally. Notion of reqpage goes away.
o The array span validity must be checked before with vm_pager_has_page().
o All pages must be xbusied on enter.
o All pages will be left xbusied on exit. This closes possible races,
allows to pass in wired pages (for any pager). And it leaves
the caller to decide what to do with pages: vm_page_active,
vm_page_deactivate, vm_page_flash or just vm_page_free them.

The patch has been tested by me and pho@ with his stress2 test.

I know, that there are two comments from kib@ on the patch.

1) There could be a user of KPI who would be fine with partial success.

My answer: right now there is none, and if one emerges, the code can
be easily adopted to return VM_PAGER_ERROR, but still mark validated
pages as valid. The user of KPI then can scan the array and take valid
pages. So, the patch doesn't put any obstacles on appearance of such
user.

2) Filesystems can do short reads by design, and thus fail to validate
the entire array.

My answer: yes, that's true. By design NFS, SMBFS and FUSE should be
able to return short reads. However, the VOP_GETPAGES methods for all
three FSes right now do not have any code that would support that. So,
it looks like there is an open issue with these filesystems, not related
to my patch. When this issue is addressed in any of aforementioned FSes,
the VOP_GETPAGES should be fixed to do several I/Os in case of short
reads.
--
Totus tuus, Glebius.
Konstantin Belousov
2015-08-08 08:41:21 UTC
Permalink
Post by Gleb Smirnoff
Hi!
https://lists.freebsd.org/pipermail/freebsd-arch/2015-April/017154.html
The preparatory commits were already checked in, and I'm going
to push next week the rest, since the whole story is already
several months due.
o vm_pager_get_pages() accepts array of pages, and treats all pages
equally. Notion of reqpage goes away.
o The array span validity must be checked before with vm_pager_has_page().
o All pages must be xbusied on enter.
o All pages will be left xbusied on exit. This closes possible races,
allows to pass in wired pages (for any pager). And it leaves
the caller to decide what to do with pages: vm_page_active,
vm_page_deactivate, vm_page_flash or just vm_page_free them.
These were not comments, but objections.
Post by Gleb Smirnoff
1) There could be a user of KPI who would be fine with partial success.
My answer: right now there is none, and if one emerges, the code can
be easily adopted to return VM_PAGER_ERROR, but still mark validated
pages as valid. The user of KPI then can scan the array and take valid
pages. So, the patch doesn't put any obstacles on appearance of such
user.
The vm_fault.c is already fine with the partial success, it only cares
that the requested page was validated and no error from pager is
returned.
Post by Gleb Smirnoff
2) Filesystems can do short reads by design, and thus fail to validate
the entire array.
My answer: yes, that's true. By design NFS, SMBFS and FUSE should be
able to return short reads. However, the VOP_GETPAGES methods for all
three FSes right now do not have any code that would support that. So,
it looks like there is an open issue with these filesystems, not related
to my patch. When this issue is addressed in any of aforementioned FSes,
the VOP_GETPAGES should be fixed to do several I/Os in case of short
reads.
And this is a bug in the networking filesystems (most likely). Rick was
asked about NFS, but he did not responded. You are proposing to make the
bug a part of the interface.

I object against this change. It is wrong philosophically, and it
encodes the incomplete or accidental behaviour of several filesystems at
the interface level.

In fact, you are taking some rather secondary feature of the current
interface, that the non-requested pages are made busy for the duration
of the paging request, to the level of the fundamental property (would
the non-mreq pages be not busied, proposed change immediately causes
applications segfault on parallel file truncation, or makes user data
corrupted, for example).

All this rototilling is because you do not want to code the proper FSA
in your reworked sendfile patch.

I object against the change and against the reasoning behind it.
Gleb Smirnoff
2015-08-10 14:30:31 UTC
Permalink
Konstantin,

On Sat, Aug 08, 2015 at 11:41:21AM +0300, Konstantin Belousov wrote:
K> > 1) There could be a user of KPI who would be fine with partial success.
K> >
K> > My answer: right now there is none, and if one emerges, the code can
K> > be easily adopted to return VM_PAGER_ERROR, but still mark validated
K> > pages as valid. The user of KPI then can scan the array and take valid
K> > pages. So, the patch doesn't put any obstacles on appearance of such
K> > user.
K> The vm_fault.c is already fine with the partial success, it only cares
K> that the requested page was validated and no error from pager is
K> returned.

vm_fault.c can be easily adopted to partial success, but I don't see
a good reason to do this, since right now no pager does return partial
success.

K> > 2) Filesystems can do short reads by design, and thus fail to validate
K> > the entire array.
K> >
K> > My answer: yes, that's true. By design NFS, SMBFS and FUSE should be
K> > able to return short reads. However, the VOP_GETPAGES methods for all
K> > three FSes right now do not have any code that would support that. So,
K> > it looks like there is an open issue with these filesystems, not related
K> > to my patch. When this issue is addressed in any of aforementioned FSes,
K> > the VOP_GETPAGES should be fixed to do several I/Os in case of short
K> > reads.
K>
K> And this is a bug in the networking filesystems (most likely). Rick was
K> asked about NFS, but he did not responded. You are proposing to make the
K> bug a part of the interface.

Yes, this is likely a bug in current filesystems. Note that the bug is
already here, and it is not a part of my patch. Also note that the bug
has no evidence, we only modeled it in a thought experiment.

And I don't propose to make a bug part of interface. I propose that if FS
maintainers find time to analyze this issue, then they should fix it,
hiding internal peculiarities of their FS into its implementation, instead
of escalating it up to the pager interface.

K> I object against this change. It is wrong philosophically, and it
K> encodes the incomplete or accidental behaviour of several filesystems at
K> the interface level.

I object to your objection. The current KPI is wrong philosophically.
It is wrong that caller passes an array of locked entities (busied pages)
and gets partially unlocked array on return. This doesn't allow to write
robust code safe from race conditions without introducting workarounds.

K> In fact, you are taking some rather secondary feature of the current
K> interface, that the non-requested pages are made busy for the duration
K> of the paging request, to the level of the fundamental property (would
K> the non-mreq pages be not busied, proposed change immediately causes
K> applications segfault on parallel file truncation, or makes user data
K> corrupted, for example).

I really don't get this paragraph. "would the non-mreq pages be not busied" --
but this isn't the case. They are busied and this is intentional.

K> All this rototilling is because you do not want to code the proper FSA
K> in your reworked sendfile patch.

You see, a proper FSA can be built around even more brain damaged
interface. Alternatively, I can just say that if the array KPI is
so fragile, I will do pages one by one in a busy loop. So, yes, the
"problem" is solvable without touching KPI, but the solution isn't right
one.
--
Totus tuus, Glebius.
Loading...