Discussion:
Refactoring asynchronous I/O
John Baldwin
2016-01-27 01:39:03 UTC
Permalink
You may have noticed some small cleanups and fixes going into the AIO code
recently. I have been working on a minor overhaul of the AIO code, and the
recent changes have been aimed at whittling the diff down to the truly meaty
changes so that they are easier to review. I think things are far enough
along to start on the meaty bits.

The current AIO code is a bit creaky and not very extensible. It forces
all requests down via the existing fo_read/fo_write file ops, so every
request is inherently synchronous even if the underlying file descriptor
could support async operation. This also makes cancellation more fragile,
as you can't cancel a job that is stuck sleeping in one of the AIO daemons.

The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill-suited
to this type of workflow. Various efforts in the past have tried either
page flipping (the old ZERO_COPY_SOCKETS code, which required custom ti(4)
firmware), which only works if you can get things lined up "just right"
(e.g. page-aligned and page-sized buffers, custom firmware on your NIC,
etc.), or introducing new APIs that replace read/write (such as IO-Lite).
The primary issue with read(), of course, is that the NIC DMAs data to one
place and later userland comes along and tells the kernel where it wants the
data. The issue with introducing a new API like IO-Lite is convincing
software to use it. However, aio_read() is an existing API that can be used
to queue user buffers in advance. In particular, you can use two buffers to
ping-pong, similar to the BPF zero-copy code, where you queue two buffers at
the start and requeue each completed buffer after consuming its data. In
theory the Chelsio driver can "peek" into the request queue for a socket and
schedule the pending requests for zero-copy receive. However, doing that
requires a way to let the driver "claim" certain requests and support
cancelling them, completing them, etc.
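
To make the ping-pong usage concrete, here is a minimal userland sketch (not
from the patch; the buffer count/size and the use of aio_waitcomplete() are
just one way to drive it) that keeps two receive buffers queued on a socket
and requeues each one after its data is consumed:

#include <sys/types.h>
#include <aio.h>
#include <err.h>
#include <string.h>

#define NBUFS   2
#define BUFSZ   (128 * 1024)

/*
 * Keep two reads queued on 'sock' at all times and requeue each buffer
 * as soon as its data has been consumed (the BPF-style ping-pong).
 */
static void
ping_pong_receive(int sock)
{
        static char bufs[NBUFS][BUFSZ];
        struct aiocb cbs[NBUFS], *done;
        ssize_t n;
        int i;

        memset(cbs, 0, sizeof(cbs));
        for (i = 0; i < NBUFS; i++) {
                cbs[i].aio_fildes = sock;
                cbs[i].aio_buf = bufs[i];
                cbs[i].aio_nbytes = BUFSZ;
                if (aio_read(&cbs[i]) != 0)
                        err(1, "aio_read");
        }

        for (;;) {
                /* FreeBSD-specific: wait for any queued job to finish. */
                n = aio_waitcomplete(&done, NULL);
                if (n == -1)
                        err(1, "aio_waitcomplete");
                if (n == 0)
                        break;          /* EOF */
                /* ... consume n bytes of data via done->aio_buf ... */
                if (aio_read(done) != 0)        /* requeue the same buffer */
                        err(1, "aio_read requeue");
        }
}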

To facilitate this use case I decided to rework the AIO code to use a model
closer to the I/O Request Packets (IRPs) that Windows drivers use. In
particular, when a Windows driver decides to queue a request so that it can
be completed later, it has to install a cancel routine that is responsible
for cancelling a queued request.

To this end, I have reworked the AIO code as follows:

1) File descriptor types are now the "drivers" for AIO requests rather than
the AIO code. When an AIO request for an fd is queued (via aio_read/write,
etc.), a new file op (fo_aio_queue()) is called to queue or handle the
request. This method is responsible for completing the request or
queueing it to be completed later. Currently, a default implementation of
this method is provided that queues the job to the existing AIO daemon pool
for fo_read/fo_write, but file types can override that with more specific
behavior if desired. (A rough sketch of how this fits together follows the
list.)

2) The AIO code is now a library of calls for manipulating AIO requests.
Functions to manage cancel routines, mark AIO requests as cancelled or
completed, and schedule handler functions to run in an AIO daemon context
are provided.

3) Operations that choose to queue an AIO request while waiting for a
suitable resource to service it (CPU time, data to arrive on a socket,
etc.) are required to install a cancel routine to handle cancellation of
a request due to aio_cancel() or the exit or exec of the owning process.
This allows the type-specific queueing logic to be self-contained
without the AIO code having to know about all the possible queue states
of an AIO request.
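
As a rough sketch of how these pieces fit together (this is not code from
the branch: struct kaiocb, the cancel helpers aio_set_cancel_function() /
aio_cancel(), and all foo_* names are placeholders or guesses, and I'm
assuming the set-cancel helper fails if the job has already been cancelled):

/* Kernel-side sketch; headers and locking omitted for brevity. */

struct foo_softc;                               /* hypothetical per-instance state */
static bool     foo_ready(struct foo_softc *);  /* hypothetical helpers ... */
static int      foo_do_io(struct foo_softc *, struct kaiocb *, long *);
static void     foo_enqueue_job(struct foo_softc *, struct kaiocb *);
static void     foo_dequeue_job(struct kaiocb *);

static void
foo_aio_cancel(struct kaiocb *job)
{
        /* Pull the job off foo's private queue, then report the cancel. */
        foo_dequeue_job(job);
        aio_cancel(job);
}

static int
foo_aio_queue(struct file *fp, struct kaiocb *job)
{
        struct foo_softc *sc = fp->f_data;
        long done;
        int error;

        if (foo_ready(sc)) {
                /* Fast path: service the request now and complete it. */
                error = foo_do_io(sc, job, &done);
                aio_complete(job, error == 0 ? done : -1, error);
                return (0);
        }

        /*
         * Slow path: park the job until the resource is ready.  Installing
         * a cancel routine is what lets aio_cancel(2), exit, or exec reclaim
         * the job without the core AIO code knowing about foo's queue.
         */
        if (!aio_set_cancel_function(job, foo_aio_cancel)) {
                /* The request was cancelled before we could queue it. */
                aio_cancel(job);
                return (0);
        }
        foo_enqueue_job(sc, job);
        return (0);
}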

In my current implementation I use the "default" fo_aio_queue method for most
file types. However, sockets now use a private pool of AIO kprocs, and those
kprocs service sockets (rather than individual jobs). This means that when a
socket becomes ready for either read or write, it queues a task for that
socket buffer to the socket AIO daemon pool. That task will complete as many
requests as possible for that socket buffer (ensuring that there are no
concurrent AIO operations on a given socket). It is also able to use
MSG_NOWAIT to avoid blocking, even for blocking sockets.
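
Condensed, the object-centric handling looks something like this sketch
(again not the actual code; aside from aio_complete(), every name here is a
placeholder):

/* Taskqueue handler run in the socket AIO daemon pool (sketch only). */
static void
soaio_process_sb(void *context, int pending)
{
        struct sockbuf *sb = context;   /* socket buffer that became ready */
        struct kaiocb *job;
        long done;
        int error;

        /*
         * Drain queued requests until one would block; the I/O is issued
         * non-blocking so this daemon never sleeps on the socket itself.
         */
        while ((job = soaio_next_job(sb)) != NULL) {            /* placeholder */
                error = soaio_try_io(sb, job, &done);           /* placeholder */
                if (error == EWOULDBLOCK) {
                        /* Put it back and wait for the next readiness upcall. */
                        soaio_requeue_job(sb, job);             /* placeholder */
                        break;
                }
                aio_complete(job, error == 0 ? done : -1, error);
        }
}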

One thing I have not yet done is move the physio fast-path out of vfs_aio.c
and into the devfs-specific fileops, but that can easily be done with a
custom fo_aio_queue op for the devfs file ops.

I believe that this approach also permits other file types to provide
better-suited AIO handling where appropriate.

For the Chelsio use case I have added a protocol hook to allow a given
protocol to claim AIO requests instead of letting them be handled by the
generic socket AIO fileop. This ends up being a very small change, and
the Chelsio-specific logic can live in the TOM module using the AIO library
calls to service the AIO requests.
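
Shape-wise, the hook is something like the following sketch (the
pru_aio_queue name and the zero-means-claimed convention are guesses, not
the actual patch):

/* Sketch of the socket fileop giving the protocol first refusal. */
static int
soo_aio_queue(struct file *fp, struct kaiocb *job)
{
        struct socket *so = fp->f_data;

        /* Let the protocol (e.g. the Chelsio TOM module) claim the job. */
        if (so->so_proto->pr_usrreqs->pru_aio_queue != NULL &&
            so->so_proto->pr_usrreqs->pru_aio_queue(so, job) == 0)
                return (0);

        /* Otherwise fall back to the generic socket AIO handling. */
        return (soo_aio_queue_generic(fp, job));        /* placeholder name */
}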

My current WIP (not including the Chelsio bits, they need to be forward
ported from an earlier prototype) is available on the 'aio_rework' branch
of freebsd in my github space:

https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework

Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.

I'd like to hear what people think of this design. It needs some additional
cleanup before it is a commit candidate (and I'll see if I can't split it up
some more if we go this route).
--
John Baldwin
Adrian Chadd
2016-01-27 03:17:33 UTC
Permalink
On 26 January 2016 at 17:39, John Baldwin <***@freebsd.org> wrote:

[snip]
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
I'd like to hear what people think of this design. It needs some additional
cleanup before it is a commit candidate (and I'll see if I can't split it up
some more if we go this route).
So this doesn't change the direct dispatch read/write to a block device, right?
That strategy path is pretty damned cheap.

Also, why's it become mandatory? I thought we had support for optional
fileops...


-a
John Baldwin
2016-01-27 17:23:56 UTC
Permalink
Post by Adrian Chadd
[snip]
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
I'd like to hear what people think of this design. It needs some additional
cleanup before it is a commit candidate (and I'll see if I can't split it up
some more if we go this route).
So this doesn't change the direct dispatch read/write to a block device, right?
That strategy path is pretty damned cheap.
No, that is the physio code, which could be moved to be devfs-specific.
However, exposing the AIO queueing point to the fileops directly allows
other fileops to implement direct dispatch without having to expose
internals to the AIO code.
Post by Adrian Chadd
Also, why's it become mandatory? I thought we had support for optional
fileops...
Some fileops are optional in that not all file descriptor types implement
them (e.g. not every descriptor can be monitored by kevent() or mmap()ed);
however, we do not support kldloading a fileop. If you leave soo_aio_queue()
in sys_socket.c then the kernel would not link without vfs_aio.c, since the
socket aio routine needs things like aio_complete(), etc.

You could perhaps split the system calls themselves out of vfs_aio.c into a
separate sys_aio.c and make that loadable, but that would be fairly small.
It also seems fairly pointless. The module is loadable because it was
experimental and unfinished when it was first implemented. The current
version is certainly not perfect, but it is much further along than the
original code from 1997. I know folks have used the current code in
production.
--
John Baldwin
Slawa Olhovchenkov
2016-01-27 10:52:05 UTC
Permalink
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
John Baldwin
2016-01-27 17:52:12 UTC
Permalink
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
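
For reference, that workaround looks roughly like this sketch (illustrative
only; the NFS path is a placeholder): a helper thread does the potentially
slow open()/fstat() so the main thread isn't stalled, which is exactly the
gap a hypothetical aio_open()/aio_stat() would close:

#include <sys/stat.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct open_req {
        const char      *path;          /* file to open and stat */
        int             fd;             /* result: open fd, or -1 on error */
        struct stat     sb;             /* result: stat data */
};

/* Helper thread: do the possibly-slow open()+fstat() off the main thread. */
static void *
open_worker(void *arg)
{
        struct open_req *req = arg;

        req->fd = open(req->path, O_RDONLY);
        if (req->fd >= 0 && fstat(req->fd, &req->sb) != 0) {
                close(req->fd);
                req->fd = -1;
        }
        return (NULL);
}

int
main(void)
{
        struct open_req req = { .path = "/mnt/nfs/somefile" };  /* placeholder */
        pthread_t tid;

        pthread_create(&tid, NULL, open_worker, &req);
        /* ... the main thread keeps doing other work here ... */
        pthread_join(tid, NULL);
        if (req.fd >= 0)
                printf("opened, size %jd\n", (intmax_t)req.sb.st_size);
        return (0);
}
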
--
John Baldwin
Slawa Olhovchenkov
2016-01-27 21:04:20 UTC
Permalink
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
John Baldwin
2016-01-27 21:16:37 UTC
Permalink
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
Mmmm. We have an aio_mlock(). aio_open() would require more of a special
case like aio_mlock(). It's still doable, but it would not go via the
fileop, yes. fstat could go via the fileop, but a path-based stat would
be akin to aio_open().
--
John Baldwin
Slawa Olhovchenkov
2016-01-27 23:18:17 UTC
Permalink
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
Mmmm. We have an aio_mlock(). aio_open() would require more of a special
case like aio_mlock(). It's still doable, but it would not go via the
fileop, yes. fstat could go via the fileop, but a path-based stat would
be akin to aio_open().
aio_rename() would require yet more special handling.
As far as I can see, this can't be packed into the current structures (aiocb
and perhaps sigevent). I don't see space for multiple paths, and I don't see
space for returning an fd.

Some semantics would need to change (disallow some notifications, for
example, so that only SIGEV_THREAD is allowed? How would you pass
information about which aio operation was called?).

Also, there may be some problems inside the kernel for fd-less operations?
John Baldwin
2016-01-28 00:44:28 UTC
Permalink
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
Post by Slawa Olhovchenkov
Post by John Baldwin
The original motivation for my changes is to support efficient zero-copy
receive for TOE using Chelsio T4/T5 adapters. However, read() is ill
I understand this is not what you are working on, but: what about
(theoretical) async open/close/unlink/etc.?
Implementing more asynchronous operations is orthogonal to this. It
would perhaps be a bit simpler to implement these in the new model
since most of the logic would live in a vnode-specific aio_queue
method in vfs_vnops.c. However, the current AIO approach is to add a
new system call for each async system call (e.g. aio_open()). You
would then create an internal LIO opcode (e.g. LIO_OPEN). The vnode
aio hook would then have to support LIO_OPEN requests and return the
opened fd via aio_complete(). Async stat / open might be nice for
network filesystems in particular. I've known of programs forking
separate threads just to do open/fstat of NFS files to achieve the
equivalent of aio_open() / aio_stat().
Some problems exist for open()/unlink()/rename()/etc. -- you can't use
fd-related semantics.
Mmmm. We have an aio_mlock(). aio_open() would require more of a special
case like aio_mlock(). It's still doable, but it would not go via the
fileop, yes. fstat could go via the fileop, but a path-based stat would
be akin to aio_open().
aio_rename() would require yet more special handling.
As far as I can see, this can't be packed into the current structures (aiocb
and perhaps sigevent). I don't see space for multiple paths, and I don't see
space for returning an fd.
Some semantics would need to change (disallow some notifications, for
example, so that only SIGEV_THREAD is allowed? How would you pass
information about which aio operation was called?).
Also, there may be some problems inside the kernel for fd-less operations?
The kernel side of aiocb is free to hold additional information, and you
could always pass the extra arguments directly, e.g.

aio_rename(aiocb, from, to)

(Alternatively, you could define aio_buf as pointing to a structure
that holds the arguments.)
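
As a sketch of that second option (entirely hypothetical; no such structure
or syscall exists), aio_buf would carry an argument block that the kernel
side copies in:

#include <aio.h>
#include <string.h>

/* Hypothetical argument block that an aio_rename() could pass via aio_buf. */
struct aio_rename_args {
        const char      *from;
        const char      *to;
};

static void
queue_rename_sketch(struct aiocb *cb)
{
        static struct aio_rename_args args = {
                .from = "/tmp/oldname",         /* placeholder paths */
                .to = "/tmp/newname",
        };

        memset(cb, 0, sizeof(*cb));
        cb->aio_buf = &args;    /* the kernel side would copyin() this block */
        /* a hypothetical aio_rename(cb) syscall would take it from here */
}
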
--
John Baldwin
Jilles Tjoelker
2016-01-31 23:02:14 UTC
Permalink
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
--
Jilles Tjoelker
John Baldwin
2016-02-01 20:14:28 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
Mmm, I don't currently fix this for pipes, but my changes do fix this for
sockets (right now, if you queue multiple reads for a socket, both are woken
up when data arrives, and if the data only satisfies the first read request,
the second will block an AIO daemon forever).

However, having fo_aio_queue() should make this fixable for pipes as they
could use their own queueing logic and handling function. It may be that
pipes need to work more like sockets (where the handling is object-centric
rather than request-centric, so a pipe would queue a task when it was ready
and would drain any requests queued to that pipe). Pipes could either share
the same AIO daemon pool as sockets or use a private pool. I'd probably like
to avoid an explosion of daemon pools though.

I considered having a single AIO pool, but it's kind of messy to keep the
per-process job limits in the request-centric pool while also permitting
object-centric handlers on the internal job queue. OTOH, if enough backends
end up using object-centric handlers then the job limits might end up
meaningless and we could just drop them.
--
John Baldwin
John Baldwin
2016-02-05 20:21:52 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that the AIO code
now becomes mandatory (rather than optional). We could perhaps make the
system calls continue to be optional if people really need that, but the guts
of the code will now need to always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.

Does this seem reasonable or is a trivial aio.ko not worth it?
--
John Baldwin
Jilles Tjoelker
2016-02-05 21:32:12 UTC
Permalink
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.

I don't expect a full AIO implementation for all kinds of file any time
soon, so putting AIO syscalls into a kernel module will still leave the
standard system without AIO for a long time, while still paying for the
footprint of the AIO code.

All of this would not be a problem if PCATCH worked in AIO daemons
(treat aio_cancel like a pending signal in a user process) but the last
time I looked this appeared quite hard to fix.
--
Jilles Tjoelker
John Baldwin
2016-02-08 19:22:25 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
Post by Jilles Tjoelker
All of this would not be a problem if PCATCH worked in AIO daemons
(treat aio_cancel like a pending signal in a user process) but the last
time I looked this appeared quite hard to fix.
The cancel routines approach is what I'm doing to try to resolve this,
but it means handling the case explicitly in the various backends.
--
John Baldwin
Jilles Tjoelker
2016-02-10 22:58:44 UTC
Permalink
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
It does not seem that complicated to me. A check can be inserted before
placing a job into aio_jobs (in aio_queue_file() in the patched code).
If the code can detect some safe cases, they could be allowed;
otherwise, the AIO daemons will not be used. I'm assuming that the code
in the new fo_aio_queue file ops behaves properly and will not cause AIO
daemons to get stuck.
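
As a minimal sketch of that gate (assuming it sits near the top of
aio_queue_file(), that a hypothetical aio_file_is_safe() whitelists the
vetted cases, and borrowing the vfs.aio.enable_unsafe sysctl name that comes
up later in the thread; the exact error value is also discussed later):

static int unsafe_aio_enabled = 0;
SYSCTL_DECL(_vfs_aio);
SYSCTL_INT(_vfs_aio, OID_AUTO, enable_unsafe, CTLFLAG_RW,
    &unsafe_aio_enabled, 0,
    "Permit AIO requests that may tie up an AIO daemon indefinitely");

static bool     aio_file_is_safe(struct file *);        /* hypothetical helper */

/* Called from aio_queue_file() before a job is placed on aio_jobs. */
static int
aio_check_safety(struct file *fp)
{
        /* Vetted kinds of file (e.g. regular vnodes) are always allowed. */
        if (aio_file_is_safe(fp))
                return (0);
        /* Everything else is refused unless the admin has opted in. */
        return (unsafe_aio_enabled ? 0 : EINVAL);
}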

Note that this means there is no easy way to do kernel AIO on a kind of
file without specific support in the code for that kind of file. I think
the risk of a stuck process (that may even cause shutdown to hang
indefinitely) is bad enough to forbid it by default.
Post by John Baldwin
Post by Jilles Tjoelker
All of this would not be a problem if PCATCH worked in AIO daemons
(treat aio_cancel like a pending signal in a user process) but the last
time I looked this appeared quite hard to fix.
The cancel routines approach is what I'm doing to try to resolve this,
but it means handling the case explicitly in the various backends.
OK.
--
Jilles Tjoelker
John Baldwin
2016-02-10 23:19:12 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
It does not seem that complicated to me. A check can be inserted before
placing a job into aio_jobs (in aio_queue_file() in the patched code).
If the code can detect some safe cases, they could be allowed;
otherwise, the AIO daemons will not be used. I'm assuming that the code
in the new fo_aio_queue file ops behaves properly and will not cause AIO
daemons to get stuck.
Note that this means there is no easy way to do kernel AIO on a kind of
file without specific support in the code for that kind of file. I think
the risk of a stuck process (that may even cause shutdown to hang
indefinitely) is bad enough to forbid it by default.
Ok. One issue is that some software may assume that aio will "work" if
modfind("aio") works and might be surprised by it not working. I'm not sure
how real that is. I don't plan on merging this to 10 so we will have some
testing time in 11 to figure that out. If it does prove problematic we can
revert to splitting the syscalls out into aio.ko. For now I think the two
cases to enable by default would be sockets (which use a fo_aio_queue method)
and physio. We could let the VFS_AIO option change the sysctl's default
value to allow all AIO for compat, though that seems a bit clumsy as the
name doesn't really make sense for that.

(I also still don't like the vfs_aio.c filename as aio is not VFS-specific.)
--
John Baldwin
John Baldwin
2016-02-11 00:21:30 UTC
Permalink
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Post by Jilles Tjoelker
Post by John Baldwin
Note that binding the AIO support to a new fileop does mean that
the AIO code now becomes mandatory (rather than optional). We
could perhaps make the system calls continue to be optional if
people really need that, but the guts of the code will now need to
always be in the kernel.
Enabling this by default is OK with me as long as the easy ways to get a
stuck process are at least disabled by default. Currently, a process
gets stuck forever if it has an AIO request from or to a pipe that will
never complete. An AIO daemon should not be allowed to perform an
unbounded sleep such as for a pipe (NFS server should be OK).
One thing I could do is split vfs_aio.c into two files: kern_aio.c that
holds the "library" such as aio_aqueue() / aio_complete(), etc. and a
sys_aio.c that is just the system calls. kern_aio.c would be standard,
but sys_aio.c could still be optional and aio.ko would contain it.
This would still make AIO optional, though aio.ko would be fairly small,
so not having it probably wouldn't save much in terms of size.
Does this seem reasonable or is a trivial aio.ko not worth it?
It is one possible option. Another option is to refuse AIO for kinds of
file that have not been vetted to avoid blocking an AIO daemon for too
long. "Kinds of file" is about the code that will be executed to do I/O,
so that, for example, /dev/klog and /dev/ttyv0 are different kinds.
Depending on whether this restriction breaks existing code, it may need
to be a sysctl.
This doesn't seem to be very easy to implement, especially if it should
vary by character device. It sounds like what you would do today is
to allow mlock() and AIO on sockets and maybe the fast physio path
and disable everything else by default for now? This would be slightly
more feasible than something more fine-grained. A sysctl would allow
"full" AIO use that would disable these checks?
It does not seem that complicated to me. A check can be inserted before
placing a job into aio_jobs (in aio_queue_file() in the patched code).
If the code can detect some safe cases, they could be allowed;
otherwise, the AIO daemons will not be used. I'm assuming that the code
in the new fo_aio_queue file ops behaves properly and will not cause AIO
daemons to get stuck.
Note that this means there is no easy way to do kernel AIO on a kind of
file without specific support in the code for that kind of file. I think
the risk of a stuck process (that may even cause shutdown to hang
indefinitely) is bad enough to forbid it by default.
Ok. One issue is that some software may assume that aio will "work" if
modfind("aio") works and might be surprised by it not working. I'm not sure
how real that is. I don't plan on merging this to 10 so we will have some
testing time in 11 to figure that out. If it does prove problematic we can
revert to splitting the syscalls out into aio.ko. For now I think the two
cases to enable by default would be sockets (which use a fo_aio_queue method)
and physio. We could let the VFS_AIO option change the sysctl's default
value to allow all AIO for compat, though that seems a bit clumsy as the
name doesn't really make sense for that.
(I also still don't like the vfs_aio.c filename as aio is not VFS-specific.)
I've pushed a new version with a vfs.aio.enable_unsafe sysctl (defaults to
off). If you think this is ok then I'll work on cleaning up some of the
module bits (removing the unload support mostly). I figure marking the
syscalls as STD and removing all the helper stuff can be done as a followup
commit to reduce extra noise in this diff (which is large enough on its own).

https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework

(Also, for the initial commit I will leave out the socket protocol hook. That
can be added later.)
--
John Baldwin
Jilles Tjoelker
2016-02-11 22:32:38 UTC
Permalink
Post by John Baldwin
I've pushed a new version with a vfs.aio.enable_unsafe sysctl
(defaults to off). If you think this is ok then I'll work on cleaning
up some of the module bits (removing the unload support mostly). I
figure marking the syscalls as STD and removing all the helper stuff
can be done as a followup commit to reduce extra noise in this diff
(which is large enough on its own).
https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework
(Also, for the initial commit I will leave out the socket protocol
hook. That can be added later.)
Looks good to me, except that the error should be [ENOTSUP] instead of
the overloaded [EINVAL].
--
Jilles Tjoelker
John Baldwin
2016-02-16 01:10:19 UTC
Permalink
Post by Jilles Tjoelker
Post by John Baldwin
I've pushed a new version with a vfs.aio.enable_unsafe sysctl
(defaults to off). If you think this is ok then I'll work on cleaning
up some of the module bits (removing the unload support mostly). I
figure marking the syscalls as STD and removing all the helper stuff
can be done as a followup commit to reduce extra noise in this diff
(which is large enough on its own).
https://github.com/freebsd/freebsd/compare/master...bsdjhb:aio_rework
(Also, for the initial commit I will leave out the socket protocol
hook. That can be added later.)
Looks good to me, except that the error should be [ENOTSUP] instead of
the overloaded [EINVAL].
Ok. I've done some further cleanups and posted a commit candidate at
https://reviews.freebsd.org/D5289 for anyone else who is interested.
--
John Baldwin