Discussion:
Refactoring asynchronous I/O
John Baldwin
2016-01-27 01:39:03 UTC
Permalink
You may have noticed some small cleanups and fixes going into the AIO code
recently. I have been working on a minor overhaul of the AIO code, and the
recent changes have been aimed at whittling the diff down to the truly meaty
changes so that they are easier to review. I think things are far enough
along to start on the meaty bits.

The current AIO code is a bit creaky and not very extensible. It forces
all requests down via the existing fo_read/fo_write file ops, so every
request is inherently synchronous even if the underlying file descriptor
could support async operation. This also makes cancellation more fragile,
as you can't cancel a job that is stuck sleeping in one of the AIO daemons.

The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill-suited
to this type of workflow. Various efforts in the past have tried either
page flipping (the old ZERO_COPY_SOCKETS code, which required custom ti(4)
firmware), which only works if you can get things lined up "just right"
(e.g. page-aligned and page-sized buffers, custom firmware on your NIC,
etc.), or introducing new APIs that replace read/write (such as IO-Lite).
The primary issue with read(), of course, is that the NIC DMAs data to one
place and later userland comes along and tells the kernel where it wants the
data. The issue with introducing a new API like IO-Lite is convincing
software to use it. However, aio_read() is an existing API that can be used
to queue user buffers in advance. In particular, you can use two buffers to
ping-pong, similar to the BPF zero-copy code, where you queue two buffers at
the start and requeue each completed buffer after consuming its data. In
theory the Chelsio driver can "peek" into the request queue for a socket and
schedule the pending requests for zero-copy receive. However, doing that
requires a way to let the driver "claim" certain requests and support
cancelling them, completing them, etc.
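
To make the ping-pong usage concrete, here is a minimal userland sketch (not
from the patch; the buffer count/size and the use of aio_waitcomplete() are
just one way to drive it) that keeps two receive buffers queued on a socket
and requeues each one after its data is consumed:

#include <sys/types.h>
#include <aio.h>
#include <err.h>
#include <string.h>

#define NBUFS   2
#define BUFSZ   (128 * 1024)

/*
 * Keep two reads queued on 'sock' at all times and requeue each buffer
 * as soon as its data has been consumed (the BPF-style ping-pong).
 */
static void
ping_pong_receive(int sock)
{
        static char bufs[NBUFS][BUFSZ];
        struct aiocb cbs[NBUFS], *done;
        ssize_t n;
        int i;

        memset(cbs, 0, sizeof(cbs));
        for (i = 0; i < NBUFS; i++) {
                cbs[i].aio_fildes = sock;
                cbs[i].aio_buf = bufs[i];
                cbs[i].aio_nbytes = BUFSZ;
                if (aio_read(&cbs[i]) != 0)
                        err(1, "aio_read");
        }

        for (;;) {
                /* FreeBSD-specific: wait for any queued job to finish. */
                n = aio_waitcomplete(&done, NULL);
                if (n == -1)
                        err(1, "aio_waitcomplete");
                if (n == 0)
                        break;          /* EOF */
                /* ... consume n bytes of data via done->aio_buf ... */
                if (aio_read(done) != 0)        /* requeue the same buffer */
                        err(1, "aio_read requeue");
        }
}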

To facilitate this use case I decided to rework the AIO code to use a model
closer to the I/O Request Packets (IRPs) that Windows drivers use. In
particular, when a Windows driver decides to queue a request so that it can
be completed later, it has to install a cancel routine that is responsible
for cancelling a queued request.

To this end, I have reworked the AIO code as follows:

1) File descriptor types are now the "drivers" for AIO requests rather than
the AIO code. When an AIO request for an fd is queued (via aio_read/write,
etc.), a new file op (fo_aio_queue()) is called to queue or handle the
request. This method is responsible for completing the request or
queueing it to be completed later. Currently, a default implementation of
this method is provided that queues the job to the existing AIO daemon pool
for fo_read/fo_write, but file types can override that with more specific
behavior if desired. (A rough sketch of how this fits together follows the
list.)

2) The AIO code is now a library of calls for manipulating AIO requests.
Functions to manage cancel routines, mark AIO requests as cancelled or
completed, and schedule handler functions to run in an AIO daemon context
are provided.

3) Operations that choose to queue an AIO request while waiting for a
suitable resource to service it (CPU time, data to arrive on a socket,
etc.) are required to install a cancel routine to handle cancellation of
a request due to aio_cancel() or the exit or exec of the owning process.
This allows the type-specific queueing logic to be self-contained
without the AIO code having to know about all the possible queue states
of an AIO request.
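
As a rough sketch of how these pieces fit together (this is not code from
the branch: struct kaiocb, the cancel helpers aio_set_cancel_function() /
aio_cancel(), and all foo_* names are placeholders or guesses, and I'm
assuming the set-cancel helper fails if the job has already been cancelled):

/* Kernel-side sketch; headers and locking omitted for brevity. */

struct foo_softc;                               /* hypothetical per-instance state */
static bool     foo_ready(struct foo_softc *);  /* hypothetical helpers ... */
static int      foo_do_io(struct foo_softc *, struct kaiocb *, long *);
static void     foo_enqueue_job(struct foo_softc *, struct kaiocb *);
static void     foo_dequeue_job(struct kaiocb *);

static void
foo_aio_cancel(struct kaiocb *job)
{
        /* Pull the job off foo's private queue, then report the cancel. */
        foo_dequeue_job(job);
        aio_cancel(job);
}

static int
foo_aio_queue(struct file *fp, struct kaiocb *job)
{
        struct foo_softc *sc = fp->f_data;
        long done;
        int error;

        if (foo_ready(sc)) {
                /* Fast path: service the request now and complete it. */
                error = foo_do_io(sc, job, &done);
                aio_complete(job, error == 0 ? done : -1, error);
                return (0);
        }

        /*
         * Slow path: park the job until the resource is ready.  Installing
         * a cancel routine is what lets aio_cancel(2), exit, or exec reclaim
         * the job without the core AIO code knowing about foo's queue.
         */
        if (!aio_set_cancel_function(job, foo_aio_cancel)) {
                /* The request was cancelled before we could queue it. */
                aio_cancel(job);
                return (0);
        }
        foo_enqueue_job(sc, job);
        return (0);
}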

In my current implementation I use the "default" fo_aio_queue method for most
file types. However, sockets now use a private pool of AIO kprocs, and those
kprocs service sockets (rather than individual jobs). This means that when a
socket becomes ready for either read or write, it queues a task for that
socket buffer to the socket AIO daemon pool. That task will complete as many
requests as possible for that socket buffer (ensuring that there are no
concurrent AIO operations on a given socket). It is also able to use
MSG_NOWAIT to avoid blocking, even for blocking sockets.
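
Condensed, the object-centric handling looks something like this sketch
(again not the actual code; aside from aio_complete(), every name here is a
placeholder):

/* Taskqueue handler run in the socket AIO daemon pool (sketch only). */
static void
soaio_process_sb(void *context, int pending)
{
        struct sockbuf *sb = context;   /* socket buffer that became ready */
        struct kaiocb *job;
        long done;
        int error;

        /*
         * Drain queued requests until one would block; the I/O is issued
         * non-blocking so this daemon never sleeps on the socket itself.
         */
        while ((job = soaio_next_job(sb)) != NULL) {            /* placeholder */
                error = soaio_try_io(sb, job, &done);           /* placeholder */
                if (error == EWOULDBLOCK) {
                        /* Put it back and wait for the next readiness upcall. */
                        soaio_requeue_job(sb, job);             /* placeholder */
                        break;
                }
                aio_complete(job, error == 0 ? done : -1, error);
        }
}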

One thing I have not yet done is move the physio fast-path out of vfs_aio.c
and into the devfs-specific fileops, but that can easily be done with a
custom fo_aio_queue op for the devfs file ops.

I believe that this approach also permits other file types to provide
better-suited AIO handling where appropriate.

For the Chelsio use case I have added a protocol hook to allow a given
protocol to claim AIO requests instead of letting them be handled by the
generic socket AIO fileop. This ends up being a very small change, and
the Chelsio-specific logic can live in the TOM module using the AIO library
calls to service the AIO requests.
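
Shape-wise, the hook is something like the following sketch (the
pru_aio_queue name and the zero-means-claimed convention are guesses, not
the actual patch):

/* Sketch of the socket fileop giving the protocol first refusal. */
static int
soo_aio_queue(struct file *fp, struct kaiocb *job)
{
        struct socket *so = fp->f_data;

        /* Let the protocol (e.g. the Chelsio TOM module) claim the job. */
        if (so->so_proto->pr_usrreqs->pru_aio_queue != NULL &&
            so->so_proto->pr_usrreqs->pru_aio_queue(so, job) == 0)
                return (0);

        /* Otherwise fall back to the generic socket AIO handling. */
        return (soo_aio_queue_generic(fp, job));        /* placeholder name */
}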

My current WIP (not including the Chelsio bits, they need to be forward
ported from an earlier prototype) is available on the 'aio_rework' branch
of freebsd in my github space:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework

Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.

I'd like to hear what people think of this design. It needs some additional
cleanup before it is a commit candidate (and I'll see if I can't split it up
some more if we go this route).
--
John Baldwin
Adrian Chadd
2016-01-27 03:17:33 UTC
Permalink
On 26 January 2016 at 17:39, John Baldwin <***@freebsd.org> wrote:

[snip]
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
I'd like to hear what people think of this design. It needs some additional
cleanup before it is a commit candidate (and I'll see if I can't split it up
some more if we go this route).
So this doesn't change the direct dispatch read/write to a block device, right?
That strategy path is pretty damned cheap.

Also, why's it become mandatory? I thought we had support for optional
fileops...


-a
John Baldwin
2016-01-27 17:23:56 UTC
Permalink
Post by Adrian Chadd
[snip]
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
I'd like to hear what people think of this design. It needs some additional
cleanup before it is a commit candidate (and I'll see if I can't split it up
some more if we go this route).
So this doesn't change the direct dispatch read/write to a block device, right?
That strategy path is pretty damned cheap.
No, that is the physio code, which could be moved to be devfs-specific.
However, exposing the AIO queueing point to the fileops directly allows
other fileops to implement direct dispatch without having to expose
internals to the AIO code.
Post by Adrian Chadd
Also, why's it become mandatory? I thought we had support for optional
fileops...
Some fileops are optional in that not all file descriptor types implement
them (e.g. not every descriptor can be monitored by kevent() or mmap()ed);
however, we do not support kldloading a fileop. If you leave soo_aio_queue()
in sys_socket.c then the kernel would not link without vfs_aio.c, since the
socket aio routine needs things like aio_complete(), etc.

You could perhaps split the system calls themselves out of vfs_aio.c into a
separate sys_aio.c and make that loadable, but that would be fairly small.
It also seems fairly pointless. The module is loadable because it was
experimental and unfinished when it was first implemented. The current
version is certainly not perfect, but it is much further along than the
original code from 1997. I know folks have used the current code in
production.
--
John Baldwin
Slawa Olhovchenkov
2016-01-27 10:52:05 UTC
Permalink
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
John Baldwin
2016-01-27 17:52:12 UTC
Permalink
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
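
For reference, that workaround looks roughly like this sketch (illustrative
only; the NFS path is a placeholder): a helper thread does the potentially
slow open()/fstat() so the main thread isn't stalled, which is exactly the
gap a hypothetical aio_open()/aio_stat() would close:

#include <sys/stat.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct open_req {
        const char      *path;          /* file to open and stat */
        int             fd;             /* result: open fd, or -1 on error */
        struct stat     sb;             /* result: stat data */
};

/* Helper thread: do the possibly-slow open()+fstat() off the main thread. */
static void *
open_worker(void *arg)
{
        struct open_req *req = arg;

        req->fd = open(req->path, O_RDONLY);
        if (req->fd >= 0 && fstat(req->fd, &req->sb) != 0) {
                close(req->fd);
                req->fd = -1;
        }
        return (NULL);
}

int
main(void)
{
        struct open_req req = { .path = "/mnt/nfs/somefile" };  /* placeholder */
        pthread_t tid;

        pthread_create(&tid, NULL, open_worker, &req);
        /* ... the main thread keeps doing other work here ... */
        pthread_join(tid, NULL);
        if (req.fd >= 0)
                printf("opened, size %jd\n", (intmax_t)req.sb.st_size);
        return (0);
}
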
--
John Baldwin
Slawa Olhovchenkov
2016-01-27 21:04:20 UTC
Permalink
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
John Baldwin
2016-01-27 21:16:37 UTC
Permalink
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
Mmmm. We have an aio_mlock(). aio_open() would require more of a special
case like aio_mlock(). It's still doable, but it would not go via the
fileop, yes. fstat could go via the fileop, but a path-based stat would
be akin to aio_open().
--
John Baldwin
Slawa Olhovchenkov
2016-01-27 23:18:17 UTC
Permalink
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
Mmmm. We have an aio_mlock(). aio_open() would require more of a special
case like aio_mlock(). It's still doable, but it would not go via the
fileop, yes. fstat could go via the fileop, but a path-based stat would
be akin to aio_open().
aio_rename() would require yet more special handling.
As far as I can see, this can't be packed into the current structures (aiocb
and perhaps sigevent). I don't see space for multiple paths, and I don't see
space for returning an fd.

Some semantics would need to change (disallow some notifications, for
example, so that only SIGEV_THREAD is allowed? How would you pass
information about which aio operation was called?).

Also, there may be some problems inside the kernel for fd-less operations?
John Baldwin
2016-01-28 00:44:28 UTC
Permalink
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
Mmmm. We have an aio_mlock(). aio_open() would require more of a special
case like aio_mlock(). It's still doable, but it would not go via the
fileop, yes. fstat could go via the fileop, but a path-based stat would
be akin to aio_open().
aio_rename() would require yet more special handling.
As far as I can see, this can't be packed into the current structures (aiocb
and perhaps sigevent). I don't see space for multiple paths, and I don't see
space for returning an fd.
Some semantics would need to change (disallow some notifications, for
example, so that only SIGEV_THREAD is allowed? How would you pass
information about which aio operation was called?).
Also, there may be some problems inside the kernel for fd-less operations?
The kernel side of aiocb is free to hold additional information, and you
could always pass the extra arguments directly, e.g.

aio_rename(aiocb, from, to)

(Alternatively, you could define aio_buf as pointing to a structure
that holds the arguments.)
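
As a sketch of that second option (entirely hypothetical; no such structure
or syscall exists), aio_buf would carry an argument block that the kernel
side copies in:

#include <aio.h>
#include <string.h>

/* Hypothetical argument block that an aio_rename() could pass via aio_buf. */
struct aio_rename_args {
        const char      *from;
        const char      *to;
};

static void
queue_rename_sketch(struct aiocb *cb)
{
        static struct aio_rename_args args = {
                .from = "/tmp/oldname",         /* placeholder paths */
                .to = "/tmp/newname",
        };

        memset(cb, 0, sizeof(*cb));
        cb->aio_buf = &args;    /* the kernel side would copyin() this block */
        /* a hypothetical aio_rename(cb) syscall would take it from here */
}
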
--
John Baldwin
Jilles Tjoelker
2016-01-31 23:02:14 UTC
Permalink
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
--
Jilles Tjoelker
John Baldwin
2016-02-01 20:14:28 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
Mmm, I don't currently fix this for pipes, but my changes do fix this for
sockets (right now, if you queue multiple reads for a socket, both are woken
up when data arrives, and if the data only satisfies the first read request,
the second will block an AIO daemon forever).

However, having fo_aio_queue() should make this fixable for pipes as they
could use their own queueing logic and handling function. It may be that
pipes need to work more like sockets (where the handling is object-centric
rather than request-centric, so a pipe would queue a task when it was ready
and would drain any requests queued to that pipe). Pipes could either share
the same AIO daemon pool as sockets or use a private pool. I'd probably like
to avoid an explosion of daemon pools though.

I considered having a single AIO pool, but it's kind of messy to keep the
per-process job limits in the request-centric pool while also permitting
object-centric handlers on the internal job queue. OTOH, if enough backends
end up using object-centric handlers then the job limits might end up
meaningless and we could just drop them.
--
John Baldwin
John Baldwin
2016-02-05 20:21:52 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.

Does this seem reasonable or is a trivial aio.ko not worth it?
--
John Baldwin
Jilles Tjoelker
2016-02-05 21:32:12 UTC
Permalink
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.

I don't expect a full AIO implementation for all kinds of file any time
soon, so putting AIO syscalls into a kernel module will still leave the
standard system without AIO for a long time, while still paying for the
footprint of the AIO code.

All of this would not be a problem if PCATCH worked in AIO daemons
(treat aio_cancel like a pending signal in a user process) but the last
time I looked this appeared quite hard to fix.
--
Jilles Tjoelker
John Baldwin
2016-02-08 19:22:25 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
Post by Jilles Tjoelker
All of this would not be a problem if PCATCH worked in AIO daemons
(treat aio_cancel like a pending signal in a user process) but the last
time I looked this appeared quite hard to fix.
The cancel routines approach is what I'm doing to try to resolve this,
but it means handling the case explicitly in the various backends.
--
John Baldwin
Jilles Tjoelker
2016-02-10 22:58:44 UTC
Permalink
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
It does not seem that complicated to me. A check can be inserted before
placing a job into aio_jobs (in aio_queue_file() in the patched code).
If the code can detect some safe cases, they could be allowed;
otherwise, the AIO daemons will not be used. I'm assuming that the code
in the new fo_aio_queue file ops behaves properly and will not cause AIO
daemons to get stuck.
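
As a minimal sketch of that gate (assuming it sits near the top of
aio_queue_file(), that a hypothetical aio_file_is_safe() whitelists the
vetted cases, and borrowing the vfs.aio.enable_unsafe sysctl name that comes
up later in the thread; the exact error value is also discussed later):

static int unsafe_aio_enabled = 0;
SYSCTL_DECL(_vfs_aio);
SYSCTL_INT(_vfs_aio, OID_AUTO, enable_unsafe, CTLFLAG_RW,
    &unsafe_aio_enabled, 0,
    "Permit AIO requests that may tie up an AIO daemon indefinitely");

static bool     aio_file_is_safe(struct file *);        /* hypothetical helper */

/* Called from aio_queue_file() before a job is placed on aio_jobs. */
static int
aio_check_safety(struct file *fp)
{
        /* Vetted kinds of file (e.g. regular vnodes) are always allowed. */
        if (aio_file_is_safe(fp))
                return (0);
        /* Everything else is refused unless the admin has opted in. */
        return (unsafe_aio_enabled ? 0 : EINVAL);
}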

Note that this means there is no easy way to do kernel AIO on a kind of
file without specific support in the code for that kind of file. I think
the risk of a stuck process (that may even cause shutdown to hang
indefinitely) is bad enough to forbid it by default.
Post by John Baldwin
Post by Jilles Tjoelker
All of this would not be a problem if PCATCH worked in AIO daemons
(treat aio_cancel like a pending signal in a user process) but the last
time I looked this appeared quite hard to fix.
The cancel routines approach is what I'm doing to try to resolve this,
but it means handling the case explicitly in the various backends.
OK.
--
Jilles Tjoelker
John Baldwin
2016-02-10 23:19:12 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
It does not seem that complicated to me. A check can be inserted before
placing a job into aio_jobs (in aio_queue_file() in the patched code).
If the code can detect some safe cases, they could be allowed;
otherwise, the AIO daemons will not be used. I'm assuming that the code
in the new fo_aio_queue file ops behaves properly and will not cause AIO
daemons to get stuck.
Note that this means there is no easy way to do kernel AIO on a kind of
file without specific support in the code for that kind of file. I think
the risk of a stuck process (that may even cause shutdown to hang
indefinitely) is bad enough to forbid it by default.
Ok. One issue is that some software may assume that aio will "work" if
modfind("aio") works and might be surprised by it not working. I'm not sure
how real that is. I don't plan on merging this to 10 so we will have some
testing time in 11 to figure that out. If it does prove problematic we can
revert to splitting the syscalls out into aio.ko. For now I think the two
cases to enable by default would be sockets (which use a fo_aio_queue method)
and physio. We could let the VFS_AIO option change the sysctl's default
value to allow all AIO for compat, though that seems a bit clumsy as the
name doesn't really make sense for that.

(I also still don't like the vfs_aio.c filename as aio is not VFS-specific.)
--
John Baldwin
John Baldwin
2016-02-11 00:21:30 UTC
Permalink
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
It does not seem that complicated to me. A check can be inserted before
placing a job into aio_jobs (in aio_queue_file() in the patched code).
If the code can detect some safe cases, they could be allowed;
otherwise, the AIO daemons will not be used. I'm assuming that the code
in the new fo_aio_queue file ops behaves properly and will not cause AIO
daemons to get stuck.
Note that this means there is no easy way to do kernel AIO on a kind of
file without specific support in the code for that kind of file. I think
the risk of a stuck process (that may even cause shutdown to hang
indefinitely) is bad enough to forbid it by default.
Ok. One issue is that some software may assume that aio will "work" if
modfind("aio") works and might be surprised by it not working. I'm not sure
how real that is. I don't plan on merging this to 10 so we will have some
testing time in 11 to figure that out. If it does prove problematic we can
revert to splitting the syscalls out into aio.ko. For now I think the two
cases to enable by default would be sockets (which use a fo_aio_queue method)
and physio. We could let the VFS_AIO option change the sysctl's default
value to allow all AIO for compat, though that seems a bit clumsy as the
name doesn't really make sense for that.
(I also still don't like the vfs_aio.c filename as aio is not VFS-specific.)
I've pushed a new version with a vfs.aio.enable_unsafe sysctl (defaults to
off). If you think this is ok then I'll work on cleaning up some of the
module bits (removing the unload support mostly). I figure marking the
syscalls as STD and removing all the helper stuff can be done as a followup
commit to reduce extra noise in this diff (which is large enough on its own).

https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework

(Also, for the initial commit I will leave out the socket protocol hook. That
can be added later.)
--
John Baldwin
Jilles Tjoelker
2016-02-11 22:32:38 UTC
Permalink
Post by John Baldwin
I've pushed a new version with a vfs.aio.enable_unsafe sysctl
(defaults to off). If you think this is ok then I'll work on cleaning
up some of the module bits (removing the unload support mostly). I
figure marking the syscalls as STD and removing all the helper stuff
can be done as a followup commit to reduce extra noise in this diff
(which is large enough on its own).
https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework
(Also, for the initial commit I will leave out the socket protocol
hook. That can be added later.)
Looks good to me, except that the error should be [ENOTSUP] instead of
the overloaded [EINVAL].
--
Jilles Tjoelker
John Baldwin
2016-02-16 01:10:19 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
I've pushed a new version with a vfs.aio.enable_unsafe sysctl
(defaults to off). If you think this is ok then I'll work on cleaning
up some of the module bits (removing the unload support mostly). I
figure marking the syscalls as STD and removing all the helper stuff
can be done as a followup commit to reduce extra noise in this diff
(which is large enough on its own).
https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework
(Also, for the initial commit I will leave out the socket protocol
hook. That can be added later.)
Looks good to me, except that the error should be [ENOTSUP] instead of
the overloaded [EINVAL].
Ok. I've done some further cleanups and posted a commit candidate at
https://reviews.freebsd.org/D5289 for anyone else who is interested.
--
John Baldwin