Discussion:
Robust mutexes implementation
Konstantin Belousov
2016-05-05 13:10:29 UTC
I implemented robust mutexes for our libthr. A robust mutex is
guaranteed to be cleared by the system upon either thread or process
owner termination while the mutex is held. The next mutex locker is
then notified about inconsistent mutex state and can execute (or
abandon) corrective actions.
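
For reference, the consumer-side POSIX API looks like this (a minimal
sketch; error handling trimmed):

#include <errno.h>
#include <pthread.h>

static pthread_mutex_t m;

static void
init_robust(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(&m, &attr);
	pthread_mutexattr_destroy(&attr);
}

static void
lock_robust(void)
{

	if (pthread_mutex_lock(&m) == EOWNERDEAD) {
		/* The previous owner died holding the lock; repair the
		 * protected state, then either mark the mutex usable
		 * again with pthread_mutex_consistent() or unlock it to
		 * make it permanently unrecoverable. */
		pthread_mutex_consistent(&m);
	}
}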

The patch mostly consists of small changes here and there, adding
necessary checks for the inconsistent and abandoned conditions to the
existing paths. Additionally, the thread exit handler was extended
to iterate over the userspace-maintained list of owned robust mutexes,
unlocking each of them and marking it as terminated.

The list of owned robust mutexes cannot be maintained atomically in
sync with the mutex lock state (this is possible in the kernel, but too
expensive). Instead, for the duration of a lock or unlock operation,
the current mutex is remembered in a special slot that is also checked
by the kernel at thread termination.
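
A sketch of the resulting lock-path bookkeeping (illustrative only:
umutex_lock() stands in for the real lock primitive, and the slot is
the inact_mtx field mentioned later in this thread):

static void
robust_lock(struct pthread *curthread, struct pthread_mutex *m)
{

	/* Publish the mutex being operated on; the kernel checks this
	 * slot at thread termination to cover the window below, where
	 * the lock state and the list disagree. */
	curthread->inact_mtx = (uintptr_t)&m->m_lock;
	umutex_lock(&m->m_lock);		/* take the lock word */

	/* Push the mutex onto the userspace-maintained list. */
	m->m_lock.m_rb_lnk = curthread->robust_list;
	curthread->robust_list = (uintptr_t)&m->m_lock;
	curthread->inact_mtx = 0;	/* list is consistent again */
}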

The kernel must know the per-thread location of the heads of the
robust mutex lists and of the current active mutex slot. Initially I
tried to extend the TCBs with this data, so only a single syscall at
threading library initialization would be needed: the kernel knows the
location of the TCB for any thread, and the syscall would pass the
offsets. Unfortunately, on some architectures the size of the TCB is
part of the fixed ABI and cannot be changed. Instead, when a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs the kernel about the location of the list heads.
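
The registration then looks roughly like this (a sketch; the parameter
layout and struct name are illustrative, and the operation is
introduced below as UMTX_OP_ROBUST and later renamed
UMTX_OP_ROBUST_LISTS):

struct robust_lists_params {
	uintptr_t	robust_list_offset;	 /* head: shared mutexes */
	uintptr_t	robust_priv_list_offset; /* head: private mutexes */
	uintptr_t	robust_inact_offset;	 /* in-progress slot */
};

static void
register_robust_lists(struct pthread *curthread)
{
	struct robust_lists_params rb;

	rb.robust_list_offset = (uintptr_t)&curthread->robust_list;
	rb.robust_priv_list_offset = (uintptr_t)&curthread->priv_robust_list;
	rb.robust_inact_offset = (uintptr_t)&curthread->inact_mtx;
	_umtx_op(NULL, UMTX_OP_ROBUST_LISTS, sizeof(rb), &rb, NULL);
}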

The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust. I do not understand the reasoning behind this
POSIX requirement.

The patch passes all glibc tests for robust mutexes I found in the nptl/
directory. See https://github.com/kostikbel/glibc-robust-tests .

The patch is available at https://kib.kiev.ua/kib/pshared/robust.1.patch
(beware of the self-signed root certificate in the chain). The work was
sponsored by The FreeBSD Foundation.

Unrelated things in the patch:

1. Style. Since I had to re-read the whole of sys/kern/kern_umtx.c,
lib/libthr/thread/thr_umtx.h and lib/libthr/thread/thr_umtx.c, I
started fixing the numerous style violations in these files, which
actually made my eyes bleed.

2. A fix for the proper use of tdfind() in umtxq_sleep_pi() for shared
PI mutexes.

3. Removal of the struct pthread_mutex m_owner field. I cannot see
why it is useful. The only optimization it provides is the
possibility to avoid clearing the UMUTEX_CONTESTED bit when reading
m_lock.m_owner. The disadvantage of having this duplicated field
is that the kernel does not know about pthread_mutex, so it cannot fix
the duplicated value. Overall it is less work to clear UMUTEX_CONTESTED
when checking the owner than to try to handle inconsistencies.

I added the PMUTEX_OWNER_ID() macro to simplify code.

4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode's page.
Apparently, there is real code around which expects the following
to work:
- mmap a file, create a shared mutex in the mapping;
- the process exits;
- another process starts, mmaps the same file and expects that the
previously initialized mutex is still usable.

The knob changes the lifetime of such a shared off-page from
'destroy on last unmap' to either 'until the vnode is reclaimed' or
'until pthread_mutex_destroy is called', whichever comes first.
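
The expectation amounts to something like the following (a sketch;
error handling omitted, and the file must already be large enough to
back the mutex):

#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t *
map_shared_mutex(const char *path, int do_init)
{
	pthread_mutexattr_t attr;
	pthread_mutex_t *m;
	int fd;

	fd = open(path, O_RDWR);
	m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE, MAP_SHARED,
	    fd, 0);
	close(fd);
	if (do_init) {		/* first process only */
		pthread_mutexattr_init(&attr);
		pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
		pthread_mutex_init(m, &attr);
		pthread_mutexattr_destroy(&attr);
	}
	return (m);	/* a later process expects to lock it directly */
}

Without the knob, the off-page object backing the mutex is destroyed on
the last unmap, so the second process does not see the previously
initialized state.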
Martin Simmons
2016-05-05 17:20:35 UTC
There is a potential bug in enqueue_mutex when it tests m1 == NULL and m1 !=
NULL. These tests only work because m_lock is the first slot in struct
pthread_mutex and hence 0 in curthread->robust_list is converted to NULL
(rather than a negative value).
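
For illustration (this is not the patch code), the layout-independent
form needs an explicit empty-list check before the conversion:

struct pthread_mutex *m1;

if (curthread->robust_list == 0)
	m1 = NULL;
else
	m1 = __containerof((void *)curthread->robust_list,
	    struct pthread_mutex, m_lock);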

Also, is it safe to assume memory ordering between the assignments of
m->m_lock.m_rb_lnk and curthread->robust_list? Maybe it is OK because the
kernel will never read curthread->robust_list until after the CPU executing
enqueue_mutex has passed a memory barrier?

__Martin
Konstantin Belousov
2016-05-05 18:58:10 UTC
Post by Martin Simmons
There is a potential bug in enqueue_mutex when it tests m1 == NULL and m1 !=
NULL. These tests only work because m_lock is the first slot in struct
pthread_mutex and hence 0 in curthread->robust_list is converted to NULL
(rather than a negative value).
Yes. In fact I wrote the __containerof stuff only later; the initial
version of the patch relied on the fact that m_lock is at the
beginning of pthread_mutex.

I rewrote the code in enqueue, and there is one more similar check in
dequeue_mutex().

Updated patch is at https://kib.kiev.ua/kib/pshared/robust.2.patch .
It also fixes the missing userland split between non-robust PP and
robust PP queues.
Post by Martin Simmons
Also, is it safe to assume memory ordering between the assignments of
m->m_lock.m_rb_lnk and curthread->robust_list? Maybe it is OK because the
kernel will never read curthread->robust_list until after the CPU executing
enqueue_mutex has passed a memory barrier?
Inter-CPU ordering (I suppose you mean this, because you mention
barriers) only matters when we consider multi-threaded interaction.
In the case of dequeue_mutex() paired with the corresponding
enqueue_mutex(), the calls occur in the same thread, and a thread is
always self-consistent.

WRT possible reordering, we in fact only care that the enqueue writes
do not become visible before the lock is obtained, and that the dequeue
finishes for external observers before the lock is released. This is
ensured by the umutex lock semantics, which have the necessary acquire
barrier on lock and release barrier on unlock.
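
In lock-word terms (a sketch using FreeBSD atomic(9) primitives; the
lockword variable and tid stand in for m_lock.m_owner and the owner
TID):

#include <machine/atomic.h>
#include <sys/umtx.h>

static volatile uint32_t lockword;	/* stands in for m_lock.m_owner */

static void
locked_section(uint32_t tid)
{

	/* Acquire: the enqueue writes below cannot become visible
	 * before the lock word is owned. */
	while (atomic_cmpset_acq_32(&lockword, UMUTEX_UNOWNED, tid) == 0)
		;	/* spin (the real code sleeps in the kernel) */

	/* ... enqueue writes, critical section, dequeue writes ... */

	/* Release: the dequeue writes above are visible to other CPUs
	 * before the lock word reads as unowned. */
	atomic_store_rel_32(&lockword, UMUTEX_UNOWNED);
}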
Martin Simmons
2016-05-06 13:00:48 UTC
Post by Konstantin Belousov
Yes. In fact I wrote the __containerof stuff only later; the initial
version of the patch relied on the fact that m_lock is at the
beginning of pthread_mutex.
I rewrote the code in enqueue, and there is one more similar check in
dequeue_mutex().
Updated patch is at https://kib.kiev.ua/kib/pshared/robust.2.patch .
OK.
Post by Konstantin Belousov
Post by Martin Simmons
Also, is it safe to assume memory ordering between the assignments of
m->m_lock.m_rb_lnk and curthread->robust_list? Maybe it is OK because the
kernel will never read curthread->robust_list until after the CPU executing
enqueue_mutex has passed a memory barrier?
Inter-CPU ordering (I suppose you mean this, because you mention
barriers) only matters when we consider multi-threaded interaction.
In the case of dequeue_mutex() paired with the corresponding
enqueue_mutex(), the calls occur in the same thread, and a thread is
always self-consistent.
Agreed.
Post by Konstantin Belousov
WRT possible reordering, we in fact only care that the enqueue writes
do not become visible before the lock is obtained, and that the dequeue
finishes for external observers before the lock is released. This is
ensured by the umutex lock semantics, which have the necessary acquire
barrier on lock and release barrier on unlock.
I meant the case where CPU 1 claims the lock, executing this from
enqueue_mutex:

m->m_lock.m_rb_lnk = 0; /* A */
*rl = (uintptr_t)&m->m_lock; /* B */

and then the thread dies and CPU 2 cleans up the dead thread executing this
from umtx_handle_rb:

error = copyin((void *)rbp, &m, sizeof(m)); /* C */
*rb_list = m.m_rb_lnk; /* D */

where rbp is the value stored into *rl at B by CPU 1.

What ensures that CPU 1's store at A is seen by CPU 2 when it reads the value
at D? I'm hoping this is covered by something that I don't understand, but I
was worried by the comment "consistent even in the case of thread termination
at arbitrary moment" above enqueue_mutex.

__Martin
Konstantin Belousov
2016-05-06 13:20:44 UTC
Post by Martin Simmons
Post by Konstantin Belousov
WRT possible reordering, we in fact only care that the enqueue writes
do not become visible before the lock is obtained, and that the dequeue
finishes for external observers before the lock is released. This is
ensured by the umutex lock semantics, which have the necessary acquire
barrier on lock and release barrier on unlock.
I meant the case where CPU 1 claims the lock, executing this from
m->m_lock.m_rb_lnk = 0; /* A */
*rl = (uintptr_t)&m->m_lock; /* B */
and then the thread dies and CPU 2 cleans up the dead thread executing this
error = copyin((void *)rbp, &m, sizeof(m)); /* C */
*rb_list = m.m_rb_lnk; /* D */
where rbp is the value stored into *rl at B by CPU 1.
What ensures that CPU 1's store at A is seen by CPU 2 when it reads the value
at D? I'm hoping this is covered by something that I don't understand, but I
was worried by the comment "consistent even in the case of thread termination
at arbitrary moment" above enqueue_mutex.
The cleanup is performed by the same thread which did the lock, in
kernel mode. A thread can only die by explicit kernel action, and this
action is executed by the dying thread.

In fact, there is one case (or two) where the cleanup is performed by
a thread other than the lock owner. Namely, when we execve(2) (or
_exit(2) the whole process, but the mechanics are the same), we first
put the process into the so-called single-threading state, where all
threads other than the one executing the syscall are put to sleep at a
safe moment. Then the old address space is destroyed and the new image
is activated. Only after that are the other threads killed. It is done
this way to allow an error return, where we need to keep the threads
around on a failed execve.

And, right before the old address space is destroyed, the robust
mutexes of all threads are terminated, since we need access to the
usermode VA to operate on the locks. But there, the single-threading
mechanics contain the necessary barriers to ensure visibility in the
right order. In essence, there are so many locks/unlocks with barrier
semantics during single-threading that this is not a problem.
Martin Simmons
2016-05-06 14:43:46 UTC
Post by Konstantin Belousov
The cleanup is performed by the same thread which did the lock, in
kernel mode. A thread can only die by explicit kernel action, and this
action is executed by the dying thread.
Thanks, that's the part I didn't know about.

__Martin
Jilles Tjoelker
2016-05-06 23:30:11 UTC
Post by Konstantin Belousov
I implemented robust mutexes for our libthr. A robust mutex is
guaranteed to be cleared by the system upon either thread or process
owner termination while the mutex is held. The next mutex locker is
then notified about inconsistent mutex state and can execute (or
abandon) corrective actions.
The patch mostly consists of small changes here and there, adding
necessary checks for the inconsistent and abandoned conditions to the
existing paths. Additionally, the thread exit handler was extended
to iterate over the userspace-maintained list of owned robust mutexes,
unlocking each of them and marking it as terminated.
The list of owned robust mutexes cannot be maintained atomically in
sync with the mutex lock state (this is possible in the kernel, but too
expensive). Instead, for the duration of a lock or unlock operation,
the current mutex is remembered in a special slot that is also checked
by the kernel at thread termination.
The kernel must know the per-thread location of the heads of the
robust mutex lists and of the current active mutex slot. Initially I
tried to extend the TCBs with this data, so only a single syscall at
threading library initialization would be needed: the kernel knows the
location of the TCB for any thread, and the syscall would pass the
offsets. Unfortunately, on some architectures the size of the TCB is
part of the fixed ABI and cannot be changed. Instead, when a thread
touches a robust mutex for the first time, a new umtx op syscall is
issued which informs the kernel about the location of the list heads.
The umtx sleep queues for PP and PI mutexes are split between
non-robust and robust. I do not understand the reasoning behind this
POSIX requirement.
The patch passes all glibc tests for robust mutexes I found in the nptl/
directory. See https://github.com/kostikbel/glibc-robust-tests .
The patch is available at https://kib.kiev.ua/kib/pshared/robust.1.patch
(beware of the self-signed root certificate in the chain). The work was
sponsored by The FreeBSD Foundation.
1. Style. Since I had to re-read the whole of sys/kern/kern_umtx.c,
lib/libthr/thread/thr_umtx.h and lib/libthr/thread/thr_umtx.c, I
started fixing the numerous style violations in these files, which
actually made my eyes bleed.
2. A fix for the proper use of tdfind() in umtxq_sleep_pi() for shared
PI mutexes.
3. Removal of the struct pthread_mutex m_owner field. I cannot see
why it is useful. The only optimization it provides is the
possibility to avoid clearing the UMUTEX_CONTESTED bit when reading
m_lock.m_owner. The disadvantage of having this duplicated field
is that the kernel does not know about pthread_mutex, so it cannot fix
the duplicated value. Overall it is less work to clear UMUTEX_CONTESTED
when checking the owner than to try to handle inconsistencies.
I added the PMUTEX_OWNER_ID() macro to simplify code.
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode's page.
Apparently, there is real code around which expects the following
to work:
- mmap a file, create a shared mutex in the mapping;
- the process exits;
- another process starts, mmaps the same file and expects that the
previously initialized mutex is still usable.
The knob changes the lifetime of such a shared off-page from
'destroy on last unmap' to either 'until the vnode is reclaimed' or
'until pthread_mutex_destroy is called', whichever comes first.
The 'until the vnode is reclaimed' bit sounds like a recipe for
hard-to-reproduce bugs.

I do think it is related to the robust mutex patch, though.

Without robust mutexes and assuming threads do not unmap the memory
while having a mutex locked or while waiting on a condition variable, it
is sufficient to create the off-page mutex/condvar automatically in its
initial state when pthread_mutex_lock() or pthread_cond_*wait() are
called and no off-page object exists.

With robust mutexes, we need to store somewhere whether the next thread
should receive [EOWNERDEAD] or not, and this should persist even if no
processes have the memory mapped. This might be done by replacing
THR_PSHARED_PTR with a different value in the pthread_mutex_t. I'm not
sure I like that memory write being done from the kernel though.

Below is a review of https://kib.kiev.ua/kib/pshared/robust.2.patch
Post by Konstantin Belousov
diff --git a/lib/libc/gen/Symbol.map b/lib/libc/gen/Symbol.map
It is not necessary to export _pthread_mutex_consistent,
_pthread_mutexattr_getrobust and _pthread_mutexattr_setrobust (under
FBSDprivate_1.0 symbol version). They are not used by name outside the
DSO, only via the jump table.

The same thing is true of many other FBSDprivate_1.0 symbols but there
is a difference between adding new unnecessary exports and removing
existing unnecessary exports.
Post by Konstantin Belousov
+ if ((error2 == 0 || error2 == EOWNERDEAD) && cancel)
_thr_testcancel(curthread);
I don't think [EOWNERDEAD] should be swept under the carpet like this.
The cancellation cleanup handler will use the protected state and
unlock the mutex without making the state consistent, and the state
will become unrecoverable.
Post by Konstantin Belousov
+void
+_mutex_leave_robust(struct pthread *curthread, struct pthread_mutex *m)
+{
+
+ if (!is_robust_mutex(m))
+ return;
This accesses the mutex after writing a value to the lock
word which allows other threads to lock it. A use after free may result,
since it is valid for another thread to lock, unlock and destroy the
mutex (assuming the mutex is not used otherwise later).

Memory ordering may permit the load of m->m_lock.m_flags to be moved
before the actual unlock, so this issue may not actually appear.

Given that the overhead of a system call on every robust mutex unlock is
not desired, the kernel's unlock of a terminated thread's mutexes will
unavoidably have this use after free. However, for non-robust mutexes
the previous guarantees should be kept.
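
A sketch of the problematic ordering (illustrative names;
dequeue_robust_mutex() is a placeholder):

static void
mutex_unlock_sketch(struct pthread *curthread, struct pthread_mutex *m)
{

	/* Release store: from this point other threads may lock,
	 * unlock and even destroy and free the mutex. */
	atomic_store_rel_32((volatile uint32_t *)&m->m_lock.m_owner,
	    UMUTEX_UNOWNED);

	/* This load may touch freed memory; reading the robust flag
	 * before the release store would avoid the access. */
	if ((m->m_lock.m_flags & UMUTEX_ROBUST) != 0)
		dequeue_robust_mutex(curthread, m);
}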
Post by Konstantin Belousov
+ int defered, error;
Typo, should be "deferred".
Post by Konstantin Belousov
+.It Bq Er EOWNERDEAD
+The argument
+.Fa mutex
+points to the robust mutex and the previous owning thread terminated
+while holding the mutex lock.
+The lock was granted to the caller and it is up to the new owner
+to make the state consistent.
"points to a robust mutex".

Perhaps add a See .Xr pthread_mutex_consistent 3 here.
Post by Konstantin Belousov
diff --git a/share/man/man3/pthread_mutex_consistent.3 b/share/man/man3/pthread_mutex_consistent.3
This man page should mention that pthread_mutex_consistent() can be
called when a mutex lock or condition variable wait failed with
[EOWNERDEAD].
Post by Konstantin Belousov
@@ -37,7 +37,11 @@ struct umutex {
+ __uintptr_t m_rb_lnk; /* Robust linkage */
Although Linux also stores the robust list nodes in the mutexes like
this, I think it increases the chance of strange memory corruption.
Making the robust list an array in the thread's memory would be more
reliable. If the maximum number of owned robust mutexes can be small,
this can have a fixed size; otherwise, it needs to grow as needed (which
does add an allocation that may fail to the pthread_mutex_lock path,
bad).
Post by Konstantin Belousov
+ * The umutex.m_lock values and bits. The m_owner is the word which
+ * serves as the lock. It's high bit is the contention indicator,
+ * rest of bits records the owner TID. TIDs values start with PID_MAX
+ * + 2 and ends by INT32_MAX, the low range [1..PID_MAX] is guaranteed
+ * to be useable as the special markers.
Typo, "It's" should be "Its" and "ends" should be "end".

Bruce Evans would probably complain about comma splice (two times).
Post by Konstantin Belousov
+#define UMTX_OP_ROBUST 26
The name is rather vague. Perhaps this can be something like
UMTX_OP_INIT_ROBUST.
--
Jilles Tjoelker
Konstantin Belousov
2016-05-07 16:59:56 UTC
Post by Jilles Tjoelker
Post by Konstantin Belousov
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode's page.
Apparently, there is real code around which expects the following
to work:
- mmap a file, create a shared mutex in the mapping;
- the process exits;
- another process starts, mmaps the same file and expects that the
previously initialized mutex is still usable.
The knob changes the lifetime of such a shared off-page from
'destroy on last unmap' to either 'until the vnode is reclaimed' or
'until pthread_mutex_destroy is called', whichever comes first.
The 'until the vnode is reclaimed' bit sounds like a recipe for
hard-to-reproduce bugs.
I do think it is related to the robust mutex patch, though.
Without robust mutexes and assuming threads do not unmap the memory
while having a mutex locked or while waiting on a condition variable, it
is sufficient to create the off-page mutex/condvar automatically in its
initial state when pthread_mutex_lock() or pthread_cond_*wait() are
called and no off-page object exists.
With robust mutexes, we need to store somewhere whether the next thread
should receive [EOWNERDEAD] or not, and this should persist even if no
processes have the memory mapped. This might be done by replacing
THR_PSHARED_PTR with a different value in the pthread_mutex_t. I'm not
sure I like that memory write being done from the kernel though.
In principle, I agree with this. Note that if we go with something
like THR_OWNERDEAD_PTR, the kernel write to set the value would not be
much different from the kernel write to unlock a robust mutex with
inlined lock structures.

Still, I would prefer not to implement this now. For the local
purposes, the knob was enough, and the default value will be
'disabled'.
Post by Jilles Tjoelker
Below is a review of https://kib.kiev.ua/kib/pshared/robust.2.patch
Post by Konstantin Belousov
diff --git a/lib/libc/gen/Symbol.map b/lib/libc/gen/Symbol.map
It is not necessary to export _pthread_mutex_consistent,
_pthread_mutexattr_getrobust and _pthread_mutexattr_setrobust (under
FBSDprivate_1.0 symbol version). They are not used by name outside the
DSO, only via the jump table.
Removed from the export.
Post by Jilles Tjoelker
The same thing is true of many other FBSDprivate_1.0 symbols but there
is a difference between adding new unnecessary exports and removing
existing unnecessary exports.
Post by Konstantin Belousov
+ if ((error2 == 0 || error2 == EOWNERDEAD) && cancel)
_thr_testcancel(curthread);
I don't think [EOWNERDEAD] should be swept under the carpet like this.
The cancellation cleanup handler will use the protected state and
unlock the mutex without making the state consistent, and the state
will become unrecoverable.
So your argument there is to return [EOWNERDEAD] and not cancel, am I
right? I reused part of your text as the comment.
Post by Jilles Tjoelker
Post by Konstantin Belousov
+void
+_mutex_leave_robust(struct pthread *curthread, struct pthread_mutex *m)
+{
+
+ if (!is_robust_mutex(m))
+ return;
This accesses the mutex after writing a value to the lock
word which allows other threads to lock it. A use after free may result,
since it is valid for another thread to lock, unlock and destroy the
mutex (assuming the mutex is not used otherwise later).
Memory ordering may permit the load of m->m_lock.m_flags to be moved
before the actual unlock, so this issue may not actually appear.
Given that the overhead of a system call on every robust mutex unlock is
not desired, the kernel's unlock of a terminated thread's mutexes will
unavoidably have this use after free. However, for non-robust mutexes
the previous guarantees should be kept.
I agree that this is a bug, and agree that the kernel accesses to
curthread->inact_mtx are potentially unsafe. I also did not want to
issue a syscall for the unlock of a robust mutex, as you noted.

I fixed the bug with the is_robust_mutex() test in _mutex_leave_robust()
by caching the robust status.

I was indeed worried by the kernel access issue you mentioned, but the
kernel is immune to 'bad' memory accesses. What bothered me is the ill
ABA situation, where the lock memory is freed and repurposed, and then
the lock word is written with one of two specific values which give
the same state as a locked mutex. This would cause the kernel to
'unlock' it (but not to follow the invalid m_rb_lnk).

But for this to happen, we must have a situation where a thread is
terminated before mutex_unlock_common() reaches the
_mutex_leave_robust() call. This is async thread termination, which
must then be either process termination (including execve()), or a
call to thr_exit() from a signal handler in our thread (including
async cancellation).

I am sure that the thr_exit() situation is non-conforming, so the only
concern is process exit, and then a shared robust mutex, because for a
private mutex only the state of the exiting process is affected. I can
verify in umtx_handle_rb(), for instance, that for a
USYNC_PROCESS_SHARED object the underlying memory is backed by the umtx
shm page. This would close the race.

But this would interfere with libthr2, if that ever happens.
Post by Jilles Tjoelker
Post by Konstantin Belousov
+ int defered, error;
Typo, should be "deferred".
I also changed PMUTEX_FLAG_DEFERED.
Post by Jilles Tjoelker
Post by Konstantin Belousov
+.It Bq Er EOWNERDEAD
+The argument
+.Fa mutex
+points to the robust mutex and the previous owning thread terminated
+while holding the mutex lock.
+The lock was granted to the caller and it is up to the new owner
+to make the state consistent.
"points to a robust mutex".
Perhaps add a See .Xr pthread_mutex_consistent 3 here.
Both done.
Post by Jilles Tjoelker
Post by Konstantin Belousov
diff --git a/share/man/man3/pthread_mutex_consistent.3 b/share/man/man3/pthread_mutex_consistent.3
This man page should mention that pthread_mutex_consistent() can be
called when a mutex lock or condition variable wait failed with
[EOWNERDEAD].
I took introductory text from the POSIX page for the function.
Post by Jilles Tjoelker
Post by Konstantin Belousov
@@ -37,7 +37,11 @@ struct umutex {
+ __uintptr_t m_rb_lnk; /* Robust linkage */
Although Linux also stores the robust list nodes in the mutexes like
this, I think it increases the chance of strange memory corruption.
Making the robust list an array in the thread's memory would be more
reliable. If the maximum number of owned robust mutexes can be small,
this can have a fixed size; otherwise, it needs to grow as needed (which
does add an allocation that may fail to the pthread_mutex_lock path,
bad).
I gave this proposal some thought.

I very much dislike the idea of calling the memory allocator on the
lock, and especially the trylock, path. The latter could need to
obtain allocator locks, which would (sometimes partially) defeat the
purpose of trylock.

I can use mmap(2) directly there, similarly to how
pthread_setspecific() was changed recently, which would avoid calling
the userspace allocator. Still, the problems of an additional syscall,
the resulting ENOMEM, and the time to copy the current list of owned
robust mutexes to the grown location remain (I do not see that using
chunked allocations is reasonable, since it would be the same list as
the current m_rb_lnk, only at a different level).

I prefer to keep the robust linked list for these reasons.

In fact, the deciding argument would be actual application usage of
the robustness. When writing the patch, I thought about when and how I
would use the feature, and did not see compelling arguments to even
try to use it. My stumbling block is the recovery of user data
consistency: for instance, I tried to write a loop which would
increment a shared variable N times, and I was not able to come up
with any simple recovery mechanism for an aborted iteration, except
writing the iteration in assembly and keeping a parallel tick variable
which enumerates each iteration action.
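
Roughly, the scheme that did not pan out looks like this (a sketch; the
stores also need compiler and store-order guarantees, hence the remark
about assembly):

#include <pthread.h>

struct shared {
	pthread_mutex_t	m;	/* robust, process-shared */
	volatile int	counter;
	volatile int	tick;	/* mirrors counter after each step */
};

static void
iteration(struct shared *s)
{

	s->counter++;		/* the owner may die between ... */
	s->tick = s->counter;	/* ... these two stores */
}

static void
recover(struct shared *s)
{

	/* Roll back the half-done iteration left by a dead owner. */
	if (s->tick != s->counter)
		s->counter = s->tick;
	pthread_mutex_consistent(&s->m);
}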
Post by Jilles Tjoelker
Post by Konstantin Belousov
+ * The umutex.m_lock values and bits. The m_owner is the word which
+ * serves as the lock. It's high bit is the contention indicator,
+ * rest of bits records the owner TID. TIDs values start with PID_MAX
+ * + 2 and ends by INT32_MAX, the low range [1..PID_MAX] is guaranteed
+ * to be useable as the special markers.
Typo, "It's" should be "Its" and "ends" should be "end".
Bruce Evans would probably complain about comma splice (two times).
I tried to reword this.
Post by Jilles Tjoelker
Post by Konstantin Belousov
+#define UMTX_OP_ROBUST 26
The name is rather vague. Perhaps this can be something like
UMTX_OP_INIT_ROBUST.
I renamed this to UMTX_OP_ROBUST_LISTS, together with the parameters
structure.

Updated patch is at https://kib.kiev.ua/kib/pshared/robust.3.patch .
I did not add the check for umtx shm into umtx_handle_rb() yet; I am
waiting for your opinion.
Jilles Tjoelker
2016-05-08 12:52:22 UTC
Post by Konstantin Belousov
Post by Jilles Tjoelker
Post by Konstantin Belousov
4. The sysctl kern.ipc.umtx_vnode_persistent is added, which controls
the lifetime of the shared mutex associated with a vnode's page.
Apparently, there is real code around which expects the following
to work:
- mmap a file, create a shared mutex in the mapping;
- the process exits;
- another process starts, mmaps the same file and expects that the
previously initialized mutex is still usable.
The knob changes the lifetime of such a shared off-page from
'destroy on last unmap' to either 'until the vnode is reclaimed' or
'until pthread_mutex_destroy is called', whichever comes first.
The 'until the vnode is reclaimed' bit sounds like a recipe for
hard-to-reproduce bugs.
I do think it is related to the robust mutex patch, though.
Without robust mutexes and assuming threads do not unmap the memory
while having a mutex locked or while waiting on a condition variable, it
is sufficient to create the off-page mutex/condvar automatically in its
initial state when pthread_mutex_lock() or pthread_cond_*wait() are
called and no off-page object exists.
With robust mutexes, we need to store somewhere whether the next thread
should receive [EOWNERDEAD] or not, and this should persist even if no
processes have the memory mapped. This might be done by replacing
THR_PSHARED_PTR with a different value in the pthread_mutex_t. I'm not
sure I like that memory write being done from the kernel though.
In principle, I agree with this. Note that if we go with something
like THR_OWNERDEAD_PTR, the kernel write to set the value would not be
much different from the kernel write to unlock a robust mutex with
inlined lock structures.
Still, I would prefer not to implement this now. For the local
purposes, the knob was enough, and the default value will be
'disabled'.
OK. The patch still initializes umtx_shm_vnobj_persistent to 1 though.
There is also a leak if umtx_shm_vnobj_persistent is toggled from 1 to 0
while an unmapped object with an off-page is active.

I forgot two issues in my analysis.

Firstly, the automatic re-initialization (from r297185, in fact) allows
the next thread to lock the mutex even if the previous thread terminated
while holding the lock. Normally, one would expect that non-robust
mutexes cause the lock attempt to hang (or fail if it's a trylock),
except if a thread ID is reused (in which case the lock attempt may
instead succeed recursively or fail).

Secondly, the initialization attributes are lost. If the next thread
after last unmap sees THR_PSHARED_PTR, it has no idea whether it should
be robust or not, which type it should have and whether it should be
PP/PI. A mutex that was originally initialized as non-robust should not
be re-initialized as robust since the code will not handle the
[EOWNERDEAD] and [ENOTRECOVERABLE] errors (in the worst case, errors are
completely ignored and it is as if the mutex did not exist).

Special-casing robust to store in the THR_PSHARED_PTR location seems
strange since the other attributes are still lost.

All this is POSIX-compliant since POSIX specifies that the state of
synchronization objects becomes undefined on last unmap, and our
implementation fundamentally depends on that possibility. Unfortunately,
Linux and Solaris do not need the possibility. The automatic
re-initialization and umtx_vnode_persistent are just hacks that make
certain applications almost always work (but not always, and in such
cases it will be hard to debug).

Another issue with umtx_vnode_persistent is that it can hide high memory
usage. Filling up a page with pthread_mutex_t will create many pages
full of actual mutexes. This memory usage is only visible as long as it
is still mapped somewhere.

Apart from that, umtx_vnode_persistent can (at least conceptually) work
fully reliably for shared memory objects and tmpfs files, which do not
have persistent storage.
Post by Konstantin Belousov
Post by Jilles Tjoelker
Below is a review of https://kib.kiev.ua/kib/pshared/robust.2.patch
Post by Konstantin Belousov
+ if ((error2 == 0 || error2 == EOWNERDEAD) && cancel)
_thr_testcancel(curthread);
I don't think [EOWNERDEAD] should be swept under the carpet like this.
The cancellation cleanup handler will use the protected state and
unlock the mutex without making the state consistent, and the state
will become unrecoverable.
So your argument there is to return [EOWNERDEAD] and not cancel, am I
right? I reused part of your text as the comment.
Yes.
Post by Konstantin Belousov
Post by Jilles Tjoelker
Post by Konstantin Belousov
+void
+_mutex_leave_robust(struct pthread *curthread, struct pthread_mutex *m)
+{
+
+ if (!is_robust_mutex(m))
+ return;
This accesses the mutex after writing a value to the lock
word which allows other threads to lock it. A use after free may result,
since it is valid for another thread to lock, unlock and destroy the
mutex (assuming the mutex is not used otherwise later).
Memory ordering may permit the load of m->m_lock.m_flags to be moved
before the actual unlock, so this issue may not actually appear.
Given that the overhead of a system call on every robust mutex unlock is
not desired, the kernel's unlock of a terminated thread's mutexes will
unavoidably have this use after free. However, for non-robust mutexes
the previous guarantees should be kept.
I agree that this is a bug, and agree that the kernel accesses to
curthread->inact_mtx are potentially unsafe. I also did not want to
issue a syscall for the unlock of a robust mutex, as you noted.
I fixed the bug with the is_robust_mutex() test in _mutex_leave_robust()
by caching the robust status.
I was indeed worried by the kernel access issue you mentioned, but the
kernel is immune to 'bad' memory accesses. What bothered me is the ill
ABA situation, where the lock memory is freed and repurposed, and then
the lock word is written with one of two specific values which give
the same state as a locked mutex. This would cause the kernel to
'unlock' it (but not to follow the invalid m_rb_lnk).
But for this to happen, we must have a situation where a thread is
terminated before mutex_unlock_common() reaches the
_mutex_leave_robust() call. This is async thread termination, which
must then be either process termination (including execve()), or a
call to thr_exit() from a signal handler in our thread (including
async cancellation).
I am sure that the thr_exit() situation is non-conforming, so the only
concern is process exit, and then a shared robust mutex, because for a
private mutex only the state of the exiting process is affected. I can
verify in umtx_handle_rb(), for instance, that for a
USYNC_PROCESS_SHARED object the underlying memory is backed by the umtx
shm page. This would close the race.
But this would interfere with libthr2, if that ever happens.
Hmm, libthr2 or non-standard synchronization primitive implementations
seem like a good reason not to check for the umtx shm page.

However, the existing checks can be made stricter. The umtx_handle_rb()
from robust.3.patch will use m_rb_lnk with no validation at all except
that it is a valid pointer. However, if the UMUTEX_ROBUST flag is not
set, the mutex should not have been in this list at all and it is
probably safer to ignore m_rb_lnk.
Post by Konstantin Belousov
Post by Jilles Tjoelker
Post by Konstantin Belousov
@@ -37,7 +37,11 @@ struct umutex {
+ __uintptr_t m_rb_lnk; /* Robust linkage */
Although Linux also stores the robust list nodes in the mutexes like
this, I think it increases the chance of strange memory corruption.
Making the robust list an array in the thread's memory would be more
reliable. If the maximum number of owned robust mutexes can be small,
this can have a fixed size; otherwise, it needs to grow as needed (which
does add an allocation that may fail to the pthread_mutex_lock path,
bad).
I gave this proposal some thought.
I very much dislike the idea of calling the memory allocator on the
lock, and especially the trylock, path. The latter could need to
obtain allocator locks, which would (sometimes partially) defeat the
purpose of trylock.
I can use mmap(2) directly there, similarly to how
pthread_setspecific() was changed recently, which would avoid calling
the userspace allocator. Still, the problems of an additional syscall,
the resulting ENOMEM, and the time to copy the current list of owned
robust mutexes to the grown location remain (I do not see that using
chunked allocations is reasonable, since it would be the same list as
the current m_rb_lnk, only at a different level).
I prefer to keep the robust linked list for these reasons.
There is a difference between chunked allocations and the current
m_rb_lnk in that the list would reside in local memory, not vulnerable
to other processes scribbling over it. This is probably not a major
issue since sharing a mutex already allows threads to block each other
indefinitely.

The X server has used shared memory across privilege levels for years
but not with anything like mutexes and condition variables. It uses X
protocol messages and xshmfence (custom synchronization object that does
not block the X server).
Post by Konstantin Belousov
In fact, the deciding argument would be actual application usage of
the robustness. When writing the patch, I thought about when and how I
would use the feature, and did not see compelling arguments to even
try to use it. My stumbling block is the recovery of user data
consistency: for instance, I tried to write a loop which would
increment a shared variable N times, and I was not able to come up
with any simple recovery mechanism for an aborted iteration, except
writing the iteration in assembly and keeping a parallel tick variable
which enumerates each iteration action.
One way of using this is to never call pthread_mutex_consistent() and
only use it to report the problem and avoid hangs and crashes. A
slightly extended version would tear down the whole structure including
shared memory and all involved processes and rebuild it.

When calling pthread_mutex_consistent(), some data loss will almost
unavoidably occur. In some cases, it may be possible to re-initialize
just the state protected by the mutex. Alternatively, a thread that
modifies the state may make a copy of it before modifying it (some
thread-local StoreStore fences are needed); the [EOWNERDEAD] handling
can then put back the copy if it exists. This way, robust mutexes can be
used to make updates transactional.
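
A sketch of that copy-based scheme (illustrative; struct data and the
modify callback are placeholders):

#include <errno.h>
#include <pthread.h>

struct data { int payload; };

struct shared {
	pthread_mutex_t	m;		/* robust */
	struct data	d;		/* the protected state */
	struct data	backup;
	volatile int	backup_valid;	/* set before d is modified */
};

static void
update(struct shared *s, void (*modify)(struct data *))
{

	if (pthread_mutex_lock(&s->m) == EOWNERDEAD) {
		if (s->backup_valid)
			s->d = s->backup;	/* undo the dead owner */
		pthread_mutex_consistent(&s->m);
	}
	s->backup = s->d;
	s->backup_valid = 1;	/* a StoreStore fence is needed here */
	modify(&s->d);
	s->backup_valid = 0;
	pthread_mutex_unlock(&s->m);
}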

Approaches based on message passing will be easier to get right and more
reliable but they might be too slow.
Post by Konstantin Belousov
Updated patch is at https://kib.kiev.ua/kib/pshared/robust.3.patch .
I did not add the check for umtx shm into umtx_handle_rb() yet; I am
waiting for your opinion.
--
Jilles Tjoelker
Konstantin Belousov
2016-05-09 02:51:07 UTC
Post by Jilles Tjoelker
OK. The patch still initializes umtx_shm_vnobj_persistent to 1 though.
There is also a leak if umtx_shm_vnobj_persistent is toggled from 1 to 0
while an unmapped object with an off-page is active.
I forgot two issues in my analysis.
Firstly, the automatic re-initialization (from r297185, in fact) allows
the next thread to lock the mutex even if the previous thread terminated
while holding the lock. Normally, one would expect that non-robust
mutexes cause the lock attempt to hang (or fail if it's a trylock),
except if a thread ID is reused (in which case the lock attempt may
instead succeed recursively or fail).
Secondly, the initialization attributes are lost. If the next thread
after last unmap sees THR_PSHARED_PTR, it has no idea whether it should
be robust or not, which type it should have and whether it should be
PP/PI. A mutex that was originally initialized as non-robust should not
be re-initialized as robust since the code will not handle the
[EOWNERDEAD] and [ENOTRECOVERABLE] errors (in the worst case, errors are
completely ignored and it is as if the mutex did not exist).
Special-casing robust to store in the THR_PSHARED_PTR location seems
strange since the other attributes are still lost.
Yes, these are all good points.
Post by Jilles Tjoelker
All this is POSIX-compliant since POSIX specifies that the state of
synchronization objects becomes undefined on last unmap, and our
implementation fundamentally depends on that possibility. Unfortunately,
Could you please point me to the exact place in the standard where
this is allowed?
Post by Jilles Tjoelker
Linux and Solaris do not need the possibility. The automatic
re-initialization and umtx_vnode_persistent are just hacks that make
certain applications almost always work (but not always, and in such
cases it will be hard to debug).
Another issue with umtx_vnode_persistent is that it can hide high memory
usage. Filling up a page with pthread_mutex_t will create many pages
full of actual mutexes. This memory usage is only visible as long as it
is still mapped somewhere.
There is already a resource limit on the number of pshared locks per
uid, RLIMIT_UMTXP. When it is exceeded, the user gets a somewhat
obscure failure mode, but excessive memory consumption is not allowed.
And I think that vmstat -o would give enough info to diagnose it,
except that users must know about it and be qualified enough to
interpret the output.
Post by Jilles Tjoelker
Apart from that, umtx_vnode_persistent can (at least conceptually) work
fully reliably for shared memory objects and tmpfs files, which do not
have persistent storage.
I changed the default for umtx_vnode_persistent to 0 in the published
patch.
Post by Jilles Tjoelker
Hmm, libthr2 or non-standard synchronization primitive implementations
seem like a good reason not to check for the umtx shm page.
However, the existing checks can be made stricter. The umtx_handle_rb()
from robust.3.patch will use m_rb_lnk with no validation at all except
that it is a valid pointer. However, if the UMUTEX_ROBUST flag is not
set, the mutex should not have been in this list at all and it is
probably safer to ignore m_rb_lnk.
OK, I changed the code to consider the lack of UMUTEX_ROBUST as a
stopper for the list walk. Also, I stop the walk if the mutex is not
owned by the current thread, except when the mutex was stored in the
inact slot. The same batch of changes hopefully fixes the list walk
for COMPAT32 on big-endian machines.
Post by Jilles Tjoelker
There is a difference between chunked allocations and the current
m_rb_lnk in that the list would reside in local memory, not vulnerable
to other processes scribbling over it. This is probably not a major
issue since sharing a mutex already allows threads to block each other
indefinitely.
I would readily delegate the chunked array to some future
reimplementation if not for the ABI issue. Still, I do not like it.

Current updates to the patch https://kib.kiev.ua/kib/pshared/robust.4.patch
Jilles Tjoelker
2016-05-13 15:37:16 UTC
Post by Jilles Tjoelker
OK. The patch still initializes umtx_shm_vnobj_persistent to 1 though.
There is also a leak if umtx_shm_vnobj_persistent is toggled from 1 to 0
while an unmapped object with an off-page is active.
[snip]
Post by Jilles Tjoelker
All this is POSIX-compliant since POSIX specifies that the state of
synchronization objects becomes undefined on last unmap, and our
implementation fundamentally depends on that possibility. Unfortunately,
Could you please point me to the exact place in the standard where
this is allowed?
The mmap() page in POSIX.1-2008tc1 XSH 3 has:

] The state of synchronization objects such as mutexes, semaphores,
] barriers, and conditional variables placed in shared memory mapped
] with MAP_SHARED becomes undefined when the last region in any process
] containing the synchronization object is unmapped.

This is new in issue 7 (SUSv4):

] Austin Group Interpretations 1003.1-2001 #078 and #079 are applied,
] clarifying page alignment requirements and adding a note about the
] state of synchronization objects becoming undefined when a shared
] region is unmapped.
Post by Jilles Tjoelker
Linux and Solaris do not need the possibility. The automatic
re-initialization and umtx_vnode_persistent are just hacks that make
certain applications almost always work (but not always, and in such
cases it will be hard to debug).
Another issue with umtx_vnode_persistent is that it can hide high memory
usage. Filling up a page with pthread_mutex_t will create many pages
full of actual mutexes. This memory usage is only visible as long as it
is still mapped somewhere.
There is already a resource limit on the number of pshared locks per
uid, RLIMIT_UMTXP. When it is exceeded, the user gets a somewhat
obscure failure mode, but excessive memory consumption is not allowed.
And I think that vmstat -o would give enough info to diagnose it,
except that users must know about it and be qualified enough to
interpret the output.
Hmm, OK.
Post by Jilles Tjoelker
Apart from that, umtx_vnode_persistent can (at least conceptually) work
fully reliably for shared memory objects and tmpfs files, which do not
have persistent storage.
I changed the default for umtx_vnode_persistent to 0 in the published
patch.
OK.
Post by Jilles Tjoelker
Hmm, libthr2 or non-standard synchronization primitive implementations
seem like a good reason not to check for the umtx shm page.
However, the existing checks can be made stricter. The umtx_handle_rb()
from robust.3.patch will use m_rb_lnk with no validation at all except
that it is a valid pointer. However, if the UMUTEX_ROBUST flag is not
set, the mutex should not have been in this list at all and it is
probably safer to ignore m_rb_lnk.
OK, I changed the code to consider the lack of UMUTEX_ROBUST as a
stopper for the list walk. Also, I stop the walk if the mutex is not
owned by the current thread, except when the mutex was stored in the
inact slot. The same batch of changes hopefully fixes the list walk
for COMPAT32 on big-endian machines.
OK.
Post by Jilles Tjoelker
There is a difference between chunked allocations and the current
m_rb_lnk in that the list would reside in local memory, not vulnerable
to other processes scribbling over it. This is probably not a major
issue since sharing a mutex already allows threads to block each other
indefinitely.
I would readily delegate the chunked array to some future
reimplementation if not for the ABI issue. Still, I do not like it.
An array only works well for this if you know beforehand how long it
needs to be, and I don't think we can do this since Linux's limit is so
high that an array would waste a lot of memory.

The existence of some limit is, however, unavoidable and it could be
considered a bug that pthread_mutex_lock() for a robust mutex returns
success even if it will not fulfill its promise to do the EOWNERDEAD
thing.
Current updates to the patch https://kib.kiev.ua/kib/pshared/robust.4.patch
--
Jilles Tjoelker
Konstantin Belousov
2016-05-13 20:19:16 UTC
Post by Jilles Tjoelker
] The state of synchronization objects such as mutexes, semaphores,
] barriers, and conditional variables placed in shared memory mapped
] with MAP_SHARED becomes undefined when the last region in any process
] containing the synchronization object is unmapped.
] Austin Group Interpretations 1003.1-2001 #078 and #079 are applied,
] clarifying page alignment requirements and adding a note about the
] state of synchronization objects becoming undefined when a shared
] region is unmapped.
Very interesting, thanks.

BTW, is there a chance of the Austin Group being notified of, and
possibly adopting, the MAP_EXCL flag?
Post by Jilles Tjoelker
Post by Konstantin Belousov
Current updates to the patch https://kib.kiev.ua/kib/pshared/robust.4.patch
Do you have any further notes, or do you want to give the patch more time?
If not, are you fine with a 'Reviewed by' attribution?

Thanks.
Jilles Tjoelker
2016-05-13 22:34:34 UTC
Post by Konstantin Belousov
Post by Jilles Tjoelker
] The state of synchronization objects such as mutexes, semaphores,
] barriers, and conditional variables placed in shared memory mapped
] with MAP_SHARED becomes undefined when the last region in any process
] containing the synchronization object is unmapped.
] Austin Group Interpretations 1003.1-2001 #078 and #079 are applied,
] clarifying page alignment requirements and adding a note about the
] state of synchronization objects becoming undefined when a shared
] region is unmapped.
Very interesting, thanks.
BTW, is there a chance of the Austin Group being notified of, and
possibly adopting, the MAP_EXCL flag?
You could send a message to the mailing list. MAP_EXCL is not used in
the FreeBSD base but Debian code search shows Chromium and Hurd ftpfs
mentioning it in comments.
Post by Konstantin Belousov
Post by Jilles Tjoelker
Post by Konstantin Belousov
Current updates to the patch
https://kib.kiev.ua/kib/pshared/robust.4.patch
Do you have any further notes, or do you want to give the patch more time?
If not, are you fine with a 'Reviewed by' attribution?
I don't trust the if (umtx_shm_vnobj_persistent) in
sys/vm/vnode_pager.c:

+ if (umtx_shm_vnobj_persistent)
+ umtx_shm_object_terminated(obj);

If umtx_shm_vnobj_persistent is turned off via sysctl between the check
in sys/vm/vm_object.c and this one, don't we have a leak?

'Reviewed by: jilles' is fine otherwise.
--
Jilles Tjoelker
Konstantin Belousov
2016-05-14 10:10:44 UTC
Post by Jilles Tjoelker
Post by Konstantin Belousov
BTW, is there a chance of Austin Group get notified of, and possibly
adopting, MAP_EXCL flag ?
You could send a message to the mailing list. MAP_EXCL is not used in
the FreeBSD base but Debian code search shows Chromium and Hurd ftpfs
mentioning it in comments.
It is not clear if they reference the FreeBSD flag, or just came to
this semi-obvious idea independently. Anyway, it is not used because
the feature is not present anywhere except FreeBSD.

I do not want to present this on the Austin Group list as a stranger
with bad language.
Post by Jilles Tjoelker
I don't trust the if (umtx_shm_vnobj_persistent) in
+ if (umtx_shm_vnobj_persistent)
+ umtx_shm_object_terminated(obj);
If umtx_shm_vnobj_persistent is turned off via sysctl between the check
in sys/vm/vm_object.c and this one, don't we have a leak?
Removed the test in vnode_destroy_vobject(), doing the umtx
termination unconditionally on reclaim.
Post by Jilles Tjoelker
'Reviewed by: jilles' is fine otherwise.
Updated patch is at
https://www.kib.kiev.ua/kib/pshared/robust.5.patch

P.S. I will document UMUTEX_ROBUST/UMUTEX_RB_* and UMTX_OP_ROBUST_LISTS
after the feature is committed.
