geom->access problem and workaround

Discussion:

Andriy Gapon

2018-03-12 13:17:48 UTC

According to Poul-Henning (phk@), the principal author of GEOM, a GEOM class's
access method was intended to be a light-weight operation involving mostly
access counts. That is, it should be (have been) close in spirit to what
g_access() function does. The method is only called from g_access and it is
always done under the GEOM topology lock (like with most GEOM "control plane"
methods). The lock ensures that the method and the function operate on a
consistent state of the topology and all geoms in it.

In reality, many classes have their access method do a lot more than just
checking and modifying access bits. And often, what the method does is
incompatible with the topology lock.

Some examples.
g_mirror_access() has to drop and reacquire the topology lock to avoid a LOR
(deadlock) because the method needs to use the class's internal sc_lock.

zvol_geom_access() also has to drop and reacquire the topology lock when it
interacts with ZFS internals involving many locks. The main issue here is that
ZFS is both above the GEOM when ZFS uses GEOM for the storage access and it is
"below" the GEOM when ZFS is accessed through the ZVOL provider.

g_disk_access() -> daopen(). In this case the topology lock is never dropped,
but the operation issues multiple SCSI commands and waits for their completion.
So, if something goes wrong and takes a long time to complete then the whole
topology will be frozen for all that time.
[Perhaps doing the lock dance would be a better alternative]

But, of course, dropping the lock does not come free.
It opens races where two (at least) sets of incompatible access counts may get
granted. Or a special action, that should be done only on a first access to a
geom, could be executed more than once.

Bringing everything to conformance with the original design would be an ideal
solution, but it will take a lot of work both in the individual nonconforming
classes and in at least some of their consumers. It seems to require moving all
the complex operations from access methods to the GEOM "data plane". E.g, doing
those things upon the first I/O operation. Or having a new special BIO_GETATTR
(kind of) operation that could be executed after g_access() but before the
actual I/O is allowed.

I am proposing an interim solution, so really a workaround, for the problem of
dropping the topology lock:

https://reviews.freebsd.org/D14533

That workaround cannot guarantee, of course, the complete stability of the
topology, but it prevents concurrent calls to access methods.
The idea is very simple. Before calling a geom's access method the geom is
marked with a special flag unless the flag is already set in which case the code
waits until the flag is cleared. The flag is cleared after the call, of course.
The topology lock is released while waiting for the flag.

I think that having this new flag may help to get more visibility into the problem.

P.S.
The workaround does not help daopen() at all.

--
Andriy Gapon

Warner Losh

2018-03-12 17:11:40 UTC

Permalink

Post by Andriy Gapon
access method was intended to be a light-weight operation involving mostly
access counts. That is, it should be (have been) close in spirit to what
g_access() function does. The method is only called from g_access and it is
always done under the GEOM topology lock (like with most GEOM "control plane"
methods). The lock ensures that the method and the function operate on a
consistent state of the topology and all geoms in it.
In reality, many classes have their access method do a lot more than just
checking and modifying access bits. And often, what the method does is
incompatible with the topology lock.
Some examples.
g_mirror_access() has to drop and reacquire the topology lock to avoid a LOR
(deadlock) because the method needs to use the class's internal sc_lock.
zvol_geom_access() also has to drop and reacquire the topology lock when it
interacts with ZFS internals involving many locks. The main issue here is that
ZFS is both above the GEOM when ZFS uses GEOM for the storage access and it is
"below" the GEOM when ZFS is accessed through the ZVOL provider.
g_disk_access() -> daopen(). In this case the topology lock is never dropped,
but the operation issues multiple SCSI commands and waits for their completion.
So, if something goes wrong and takes a long time to complete then the whole
topology will be frozen for all that time.
[Perhaps doing the lock dance would be a better alternative]
But, of course, dropping the lock does not come free.
It opens races where two (at least) sets of incompatible access counts may get
granted. Or a special action, that should be done only on a first access to a
geom, could be executed more than once.
Bringing everything to conformance with the original design would be an ideal
solution, but it will take a lot of work both in the individual nonconforming
classes and in at least some of their consumers. It seems to require moving all
the complex operations from access methods to the GEOM "data plane". E.g, doing
those things upon the first I/O operation. Or having a new special BIO_GETATTR
(kind of) operation that could be executed after g_access() but before the
actual I/O is allowed.
I am proposing an interim solution, so really a workaround, for the problem of
https://reviews.freebsd.org/D14533
That workaround cannot guarantee, of course, the complete stability of the
topology, but it prevents concurrent calls to access methods.
The idea is very simple. Before calling a geom's access method the geom is
marked with a special flag unless the flag is already set in which case the code
waits until the flag is cleared. The flag is cleared after the call, of course.
The topology lock is released while waiting for the flag.
I think that having this new flag may help to get more visibility into the problem.
P.S.
The workaround does not help daopen() at all.

The storage layer generally doesn't expect higher-level locks around calls
to it, and feels that it's free to sleep in the open routine for resources
to become available. This is true across most 'open' routines (eg, tty will
wait for the right signals, etc). In a world of removable media, I'm not
sure that one can avoid this.

But I'm not sure that calling open on the underlying device is at all
compatible with the design goal of access being cheap. I think you can't
have both: either you open the device, and cope with the fact that open may
sleep, or it looks like you'll have broken code. Once we've updated the
access counts, we can drop the topology lock to call open. If it succeeds,
all is good. If it fails, then we have to reacquire it to "unaccess" the
device after the failure... However, that doesn't help with the concurrent
attempts to do first open for the device. g_disk_access will still have
issues of sleeping indefinitely, which of course can lead to deadlock in
complicated geometry situations (I say of course, but I'm not 100% sure).

The whole reason that daopen may (but not always) sleep is that it may need
to do I/O to the device to get it's media-status / size / SN if it' s a
removable device... Just like with the RO flag, we'd want the open routine
to fail if it can't reasonably access the device.

Warner

Poul-Henning Kamp

2018-03-12 18:07:02 UTC

Permalink

--------

Post by Warner Losh
The storage layer generally doesn't expect higher-level locks around calls
to it, and feels that it's free to sleep in the open routine for resources
to become available. This is true across most 'open' routines (eg, tty will
wait for the right signals, etc). In a world of removable media, I'm not
sure that one can avoid this.

The original intent was that we would.

Things would probably have been clearer if I had called it
g_reserve() instead of g_access().

Removable media state was supposed to be a job for the driver(s
background polling), and the geom event queue was supposed to
do what needed to be done as a result of the g_access() calls.

The primary reason is that messing around with the geom topology
is a global operation, in order to keep things simple[1], and we
really don't want global locks held for any amount of time and
certainly not for mechanical-movement / failure-retry kinds of time.

The secondary reason was to be able to present a consistent
and precise view of the system *without* opening devices, so
that disk maintenance tools would not spin up all disks,
rattle all drawers and bang all doors before telling you
what you wanted to know.

Post by Warner Losh
But I'm not sure that calling open on the underlying device is at all
compatible with the design goal of access being cheap. I think you can't
have both: either you open the device, and cope with the fact that open may
sleep, or it looks like you'll have broken code. Once we've updated the
access counts, we can drop the topology lock to call open.

So this is where it gets slightly tricky: When you open /dev/foobar,
do you open the media or only a drivemechanism that *may* hold a media ?

For any normal "hard-disk", there is no difference.

But for floppies, CDROMs, ZIP drives, WORM drives, Robots-with-ATA-disks
and other interesting hardware, which were relevant when GEOM was
designed, and to some extent still are, you only open the drive,
and will have to find out next if it has a media in it or not.

In particular CDROMs forced this design decision, because the ioctls
to open & close the tray on CDROM drives operated on the media access
device node, and too many ports knew about that.

Compare that with a tape-changer, which has one device node for the
robotic parts and another for (each of) the tape drive(s).

If we want to have an architectural sound way to do slow operations
before any "user-I/O" is initiated, the right way to do so is to
define new BIO_OPEN and BIO_CLOSE operation, and insist via asserts
than all BIO_{READ|WRITE|DELETE} are wrapped in these.

BIO_GETATTR should probably not require a BIO_OPEN/BIO_CLOSE.

Poul-Henning

[1] The alternative would be to have different sub-trees, each of
which can be locked individually, but that requires a LOT of
housekeeping and class-complexity in order to find out what
those sub-trees actually are.

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

Andriy Gapon

2018-03-23 10:38:44 UTC

Permalink

Post by Poul-Henning Kamp
If we want to have an architectural sound way to do slow operations
before any "user-I/O" is initiated, the right way to do so is to
define new BIO_OPEN and BIO_CLOSE operation, and insist via asserts
than all BIO_{READ|WRITE|DELETE} are wrapped in these.

In support of this proposal I want to add another example.
The problem is not only with doing I/O in access, but also with doing blocking
I/O in the normal I/O path.
The following happened when a userland program tried to read from a CD-ROM while
its tray was open:

panic: sleepq_add: td 0xfffff80008e1c000 to sleep on wchan 0xfffff801e58b8048
with sleeping prohibited
cpuid = 0
curthread: 0xfffff80008e1c000
time = 1521652565
KDB: stack backtrace:
db_trace_self_wrapper() at 0xffffffff804670bb = db_trace_self_wrapper+0x2b/frame
0xfffffe07e1886750
kdb_backtrace() at 0xffffffff806d4c79 = kdb_backtrace+0x39/frame 0xfffffe07e1886800
vpanic() at 0xffffffff80699da6 = vpanic+0x166/frame 0xfffffe07e1886840
kassert_panic() at 0xffffffff80699adb = kassert_panic+0x16b/frame 0xfffffe07e18868b0
sleepq_add() at 0xffffffff806e1902 = sleepq_add+0x342/frame 0xfffffe07e1886910
_sleep() at 0xffffffff806a41f7 = _sleep+0x2b7/frame 0xfffffe07e18869b0
cam_periph_ccbwait() at 0xffffffff802b53ad = cam_periph_ccbwait+0x5d/frame
0xfffffe07e18869e0
cam_periph_runccb() at 0xffffffff802b51d8 = cam_periph_runccb+0xb8/frame
0xfffffe07e1886a40
cdrunccb() at 0xffffffff802d52fc = cdrunccb+0x3c/frame 0xfffffe07e1886a60
cdprevent() at 0xffffffff802d4beb = cdprevent+0x7b/frame 0xfffffe07e1886aa0
cdcheckmedia() at 0xffffffff802d48b2 = cdcheckmedia+0x22/frame 0xfffffe07e1886af0
cdstrategy() at 0xffffffff802d2a36 = cdstrategy+0x56/frame 0xfffffe07e1886b20
g_disk_start() at 0xffffffff80608330 = g_disk_start+0x170/frame 0xfffffe07e1886b70
g_io_schedule_down() at 0xffffffff8060c1db = g_io_schedule_down+0x1eb/frame
0xfffffe07e1886b90
g_down_procbody() at 0xffffffff8060cbdd = g_down_procbody+0x6d/frame
0xfffffe07e1886ba0
fork_exit() at 0xffffffff8065e410 = fork_exit+0xd0/frame 0xfffffe07e1886bf0
fork_trampoline() at 0xffffffff808e4aee = fork_trampoline+0xe/frame
0xfffffe07e1886bf0

But maybe I am overgeneralizing.
Maybe it's just silly that cdstrategy tries perform media manipulations instead
of just failing a request. Apparently the device was already opened, but the
prevent-allow bit was changed by something else (e.g. via the pass device) and
that allowed to get the tray opened.

What do you think?

--
Andriy Gapon

Poul-Henning Kamp

2018-03-23 09:30:03 UTC

Permalink

--------

Post by Andriy Gapon

What do you think?

I don't see that changing anything...

GEOM rests on a set of assumptions, if you violate them, you get panics.

Warner Losh

2018-03-23 13:26:19 UTC

Permalink

Post by Poul-Henning Kamp
--------

Post by Andriy Gapon

What do you think?

I don't see that changing anything...
GEOM rests on a set of assumptions, if you violate them, you get panics.

I agree. The cd panic is a problem in the cd driver where it bogusly uses
runccb in a context that can't sleep.

Warner

Warner Losh

2018-03-23 13:31:14 UTC

Permalink

Post by Andriy Gapon

cdstrategy shouldn't be sleeping. It can start I/O, but it can't wait for
it to finish. strategy has been a no-sleep-zone since at least v6. The fact
it's waiting for prevent is the bug here.

Waner

Andriy Gapon

2018-03-27 14:09:06 UTC

Permalink

Post by Andriy Gapon

I guess that what you suggest is that cdstrategy should place the incoming
request on a queue and then start a state machine for issuing management
commands and handling their completions. After the state machine goes back to
the normal state only then the driver would start processing queued I/O requests.
Something like that?

--
Andriy Gapon

Warner Losh

2018-03-27 17:17:08 UTC

Permalink

Post by Andriy Gapon

0xfffff801e58b8048

Post by Andriy Gapon
with sleeping prohibited
cdstrategy shouldn't be sleeping. It can start I/O, but it can't wait for it to
finish. strategy has been a no-sleep-zone since at least v6. The fact it's
waiting for prevent is the bug here.

I guess that what you suggest is that cdstrategy should place the incoming
request on a queue and then start a state machine for issuing management
commands and handling their completions. After the state machine goes back
to
the normal state only then the driver would start processing queued I/O
requests.
Something like that?

Yes. Something like that. If we have to restart the discovery state
machine, that has to freeze the queues until that is done.

Warner