Discussion: zero-cost SDT probes
Mark Johnston
2015-11-22 02:45:42 UTC
Hi,

For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
probe site expands to this:

if (func_ptr != NULL)
func_ptr(<probe args>);

When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to

func(<probe args>);

When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
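
To make the comparison concrete, a rough sketch of the two expansions
(the names and argument lists here are simplified and illustrative, not
the actual macros):

/* Sketch only; the real SDT_PROBE macros take provider/module/name/args. */
#include <sys/types.h>

/* Current scheme: per-probe function pointer, tested at every probe site. */
extern void (*sdt_probe_func)(uintptr_t);
#define SDT_PROBE_CURRENT(arg0) do {                            \
        if (sdt_probe_func != NULL)                             \
                sdt_probe_func((uintptr_t)(arg0));              \
} while (0)

/*
 * Zero-cost scheme: unconditional call to an externally-defined stub.
 * At boot the call instruction is overwritten with NOPs; enabling the
 * probe replaces one NOP with a breakpoint.
 */
extern void __dtrace_sdt_example(uintptr_t);
#define SDT_PROBE_ZEROCOST(arg0)                                \
        __dtrace_sdt_example((uintptr_t)(arg0))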

I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.

The main difficulty is in figuring out where the probe sites actually
are once the kernel is running. In my patch, a probe site is a call to
an externally-defined function which is defined in an
automatically-generated C file. At link time, we first perform a partial
link of all the kernel's object files. Then, a script uses the relocations
against the still-undefined probe functions to generate
1) stub functions for the probes, so that the kernel can actually be
linked, and
2) a linker set containing the offsets of each probe site relative to
the beginning of the text section.
The result is linked with the partially-linked kernel to generate the
final kernel file.

During boot, we iterate over the linker set, using the offsets plus the
address of btext to overwrite probe sites with NOPs. SDT probes in kernel
modules are handled differently (and more simply): the kernel linker just
has special handling for relocations against symbols named __dtrace_sdt_*;
this is how illumos/Solaris implements all of this.
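
As a sketch of the boot-time pass for the statically-linked kernel (the
entry layout and helper names below are illustrative, not exactly what
the patch uses):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/linker_set.h>

extern char btext[];

/* Illustrative layout of a linker-set entry describing one probe site. */
struct sdt_probe_site {
        uint32_t        spd_offset;     /* offset of the call site from btext */
        uint32_t        spd_probeid;    /* which SDT probe this site belongs to */
};
SET_DECLARE(sdt_probe_site_set, struct sdt_probe_site);

static void
sdt_patch_probe_sites(void)
{
        struct sdt_probe_site **site;
        uint8_t *patchpoint;

        SET_FOREACH(site, sdt_probe_site_set) {
                patchpoint = (uint8_t *)btext + (*site)->spd_offset;
                /*
                 * Replace the 5-byte call with NOPs; this assumes the
                 * kernel text is still writable at this point in boot.
                 */
                memset(patchpoint, 0x90, 5);
        }
}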

My uncertainty revolves around the use of relocations in the
partially-linked kernel to determine the address of probe sites in the
running kernel. With the GNU ld in base, this happens to work because
the final link doesn't modify the text section. Is this something I can
rely upon? Will this assumption be false with the advent of lld and LTO?
Are there other, cleaner ways to implement what I described above?

Thanks,
-Mark

[1] https://people.freebsd.org/~markj/patches/sdt-zerocost/
Artem Belevich
2015-11-22 04:41:13 UTC
Post by Mark Johnston
Hi,
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
I wonder how much overhead that currently adds. Do you have any
benchmark numbers comparing the performance of no SDT, the current SDT
implementation, and the "zero-cost" one?

--Artem

Post by Mark Johnston
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
The main difficulty is in figuring out where the probe sites actually
are once the kernel is running. In my patch, a probe site is a call to
an externally-defined function which is defined in an
automatically-generated C file. At link time, we first perform a partial
link of all the kernel's object files. Then, a script uses the relocations
against the still-undefined probe functions to generate
1) stub functions for the probes, so that the kernel can actually be
linked, and
2) a linker set containing the offsets of each probe site relative to
the beginning of the text section.
The result is linked with the partially-linked kernel to generate the
final kernel file.
During boot, we iterate over the linker set, using the offsets plus the
address of btext to overwrite probe sites with NOPs. SDT probes in kernel
modules are handled differently (and more simply): the kernel linker just
has special handling for relocations against symbols named __dtrace_sdt_*;
this is how illumos/Solaris implements all of this.
My uncertainty revolves around the use of relocations in the
partially-linked kernel to determine the address of probe sites in the
running kernel. With the GNU ld in base, this happens to work because
the final link doesn't modify the text section. Is this something I can
rely upon? Will this assumption be false with the advent of lld and LTO?
Are there other, cleaner ways to implement what I described above?
Thanks,
-Mark
[1] https://people.freebsd.org/~markj/patches/sdt-zerocost/
Simon J. Gerraty
2015-11-22 06:29:37 UTC
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
Would it be feasible to compile the probes into the kernel
as active calls to a registrar function?
That would eliminate all the complexity of finding PC's
though you'd probably need to pass extra args to convey the point of the
probe?

It would hurt boot time a little too - each probe point would make a
call to register itself (and get overwritten with nops as a reward) but
very simple?
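
Roughly something like this, as a sketch with invented names, where the
site identifies itself via its return address:

#include <sys/types.h>

void sdt_record_site(u_int, uintptr_t);         /* invented helper */
void sdt_nop_out_call(uintptr_t);               /* invented helper */

/* Sketch of a self-registering probe site. */
void
sdt_register_site(u_int probe_id)
{
        /*
         * The return address points just past the call instruction at
         * the probe site, which identifies the site uniquely.
         */
        uintptr_t ret = (uintptr_t)__builtin_return_address(0);

        sdt_record_site(probe_id, ret);
        sdt_nop_out_call(ret);          /* patch the call at the site to NOPs */
}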
Oliver Pinter
2015-11-22 12:12:54 UTC
Post by Simon J. Gerraty
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
Would it be feasible to compile the probes into the kernel
as active calls to a registrar function?
That would eliminate all the complexity of finding PC's
though you'd probably need to pass extra args to convey the point of the
probe?
It would hurt boot time a little too - each probe point would make a
call to register itself (and get overwritten with nops as a reward) but
very simple?
In opBSD I already have a similar mechanism for SMAP:
https://github.com/opntr/opBSD/commits/op/gsoc2014/master .
Mark Johnston
2015-11-23 00:15:11 UTC
Post by Simon J. Gerraty
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
Would it be feasible to compile the probes into the kernel
as active calls to a registrar function?
That would eliminate all the complexity of finding PC's
though you'd probably need to pass extra args to convey the point of the
probe?
It would hurt boot time a little too - each probe point would make a
call to register itself (and get overwritten with nops as a reward) but
very simple?
I considered such an approach but didn't pursue it for a few reasons:
- We'd have to pass a unique probe site identifier as an argument, which
requires at least one extra instruction at the probe site.
- If the probe site is a tail call, how can the registrar find the
correct caller?
- If a probe site isn't patched until multiple CPUs have started, how do
we safely overwrite the call site in the face of the possibility that
another thread is executing the call at the same time? When it comes
to enabling or disabling a probe, we only need to write a single byte,
but overwriting multiple bytes seems unsafe.

I think the last point could possibly be addressed by overwriting the
first byte of the call with a breakpoint before overwriting the rest of
the call site with NOPs, using the breakpoint handler to fix up any
threads that reached the probe site as it was being modified. But this
detracts a bit from the simplicity of the approach.
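
Sketched out, with invented names (a real version would also need to
synchronize the instruction stream on other CPUs between the steps,
e.g. with an IPI):

#include <sys/types.h>

/* Sketch: NOP out a live 5-byte call site without stopping other CPUs. */
static void
sdt_nop_out_call_site(volatile uint8_t *site)
{
        int i;

        /*
         * 1. Plant a breakpoint on the first byte, so a thread that
         *    reaches the site while we rewrite it traps into our handler
         *    instead of executing a half-modified instruction.
         */
        site[0] = 0xcc;                 /* int3; single aligned byte store */
        /* 2. Rewrite the remaining four bytes of the call with NOPs. */
        for (i = 1; i < 5; i++)
                site[i] = 0x90;
        /* 3. Finally replace the breakpoint itself with a NOP. */
        site[0] = 0x90;
}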
Slawa Olhovchenkov
2015-11-22 10:50:31 UTC
Post by Mark Johnston
Hi,
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
I am experimenting with the overhead of DTrace probes in userspace.
The total execution time of the program without any probes is about 3% less
than that of the program with probes compiled in but not enabled.
With enabled probes (conditional probes too), each probe adds about 0.7us.
I placed the DTrace probe inside the inner loop, the worst case.

=====
#include <stdio.h>
#include "probes.h"

long primes[1000000] = { 3 };
long primecount = 1;

int main(int argc, char **argv)
{
        long divisor = 0;
        long currentprime = 5;
        long isprime = 1;

        while (currentprime < 1000000)
        {
                isprime = 1;
                PRIMES_PRIMECALC_START(currentprime);
                for (divisor = 0; divisor < primecount; divisor++)
                {
                        PRIMES_PRIMECALC_ITER(divisor);
                        if (currentprime % primes[divisor] == 0)
                        {
                                isprime = 0;
                                break;
                        }
                }
                PRIMES_PRIMECALC_DONE(currentprime, isprime);
                if (isprime)
                {
                        primes[primecount++] = currentprime;
                        PRIMES_PRIMECALC_TABLESIZE(primecount);
                        // printf("%ld is a prime\n", currentprime);
                }
                currentprime = currentprime + 2;
        }
        printf("Total %ld primes\n", primecount);

        return 0;
}
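
For reference, one way to build the "without any probes" baseline from the
same source is to stub the generated macros out, e.g. (a sketch, not the
real dtrace -h header):

/* probes_stub.h (sketch): no-op stand-ins for the generated probe macros,
 * used to build the baseline without any probes from the same source. */
#define PRIMES_PRIMECALC_START(p)               do { } while (0)
#define PRIMES_PRIMECALC_ITER(d)                do { } while (0)
#define PRIMES_PRIMECALC_DONE(p, isp)           do { } while (0)
#define PRIMES_PRIMECALC_TABLESIZE(n)           do { } while (0)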
Jilles Tjoelker
2015-11-22 16:44:46 UTC
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
I have not run any benchmarks but I expect that this removes only a
small part of the overhead of disabled probes. Saving and restoring
caller-save registers and setting up parameters certainly increases code
size and I-cache use. On the other hand, a branch that is always or
never taken will generally cost at most 2 cycles.

Avoiding this overhead would require not generating an ABI function call
but a point where the probe parameters can be calculated from the
registers and stack frame (like how a debugger prints local variables,
but with a guarantee that "optimized out" will not happen). This
requires compiler changes, though, and DTrace has generally not used
DWARF-like debug information.

For a fairer comparison, the five NOPs should be changed to one or two
longer NOPs, since many CPUs decode at most 3 or 4 instructions per
cycle. Some examples of longer NOPs are in
contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
X86AsmBackend::writeNopData(). The two-byte NOP 0x66, 0x90 works on any
x86 CPU.
--
Jilles Tjoelker
Mark Johnston
2015-11-22 23:59:03 UTC
Post by Jilles Tjoelker
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
I have not run any benchmarks but I expect that this removes only a
small part of the overhead of disabled probes. Saving and restoring
caller-save registers and setting up parameters certainly increases code
size and I-cache use. On the other hand, a branch that is always or
never taken will generally cost at most 2 cycles.
I've done some microbenchmarks using the lockstat probes on a Xeon
E5-2630 with SMT disabled. They just read the TSC and acquire/release a
lock in a loop, so there's no contention. In general I see at most a small
difference between the old and new SDT implementations and a kernel with
KDTRACE_HOOKS off altogether. For example, in my test a mtx lock/unlock
pair takes 52 cycles on average without probes; with probes, it's 54
cycles with both SDT implementations. rw read locks are 77 cycles
without probes, 79 with. rw write locks and sx exclusive locks don't
appear to show any differences, and sx shared locks show the same
timings without KDTRACE_HOOKS and with the new SDT implementation; the
current implementation adds a cycle per acquire/release pair.
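
For reference, the measurement loop is essentially the following (a
simplified sketch, not the exact benchmark code; bench_mtx is assumed to
have been initialized with mtx_init()):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <machine/cpufunc.h>

#define BENCH_ITERS     1000000

static struct mtx bench_mtx;    /* assumed initialized with mtx_init() */

/* Average TSC cycles per mtx lock/unlock pair, no contention. */
static uint64_t
bench_mtx_cycles_per_pair(void)
{
        uint64_t start, end;
        int i;

        start = rdtsc();
        for (i = 0; i < BENCH_ITERS; i++) {
                mtx_lock(&bench_mtx);
                mtx_unlock(&bench_mtx);
        }
        end = rdtsc();
        return ((end - start) / BENCH_ITERS);
}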

None of this takes into account the cache effects of these probes. One
advantage of the proposed implementation is that we eliminate the data
access required to test if the probe is enabled in the first place. I'm
also a bit uncertain about the I-cache impact. My understanding is that
a fetch of an instruction will load the entire cache line containing
that instruction. So unless the argument-marshalling instructions for a
probe site span at least one cache line, won't they all be loaded
anyway?

Consider the disassemblies for __mtx_lock_flags() here:
https://people.freebsd.org/~markj/__mtx_lock_flags_disas.txt
Based on what I said above and assuming a 64-byte cache line size, I'd
expect all instructions between 0xffffffff806d1328 and 0xffffffff806d134e
to be loaded regardless of whether or not the branch is taken. Is that not
the case?

I'll also add that with this change the size of the kernel text shrinks
a fair bit: from 8425096 bytes to 7983496 bytes with a custom MINIMAL-like
kernel with lock inlining.

Finally, I should have noted in my first post that this work has other
motivations beyond possible performance improvements. In particular,
recording call sites allows us to finally fill in the function component
of SDT probes automatically. For example, with this work it becomes
possible to enable the udp:::receive probe in udp6_receive(), but not
the one in udp_receive(). Generally, DTrace probes that correspond to a
specific instruction are said to be "anchored"; DTrace implements various
bytecode operations differently depending on whether the probe is
anchored, and SDT probes are expected to be, but with the current
implementation they're not. As a result, some operations, such as
stack(), do not work correctly with SDT probes. r288363 is a workaround
for this problem; the change I proposed is a real solution. This is also
a step towards fixing lockstat(1)'s caller identification when locks are
not inlined.
Post by Jilles Tjoelker
Avoiding this overhead would require not generating an ABI function call
but a point where the probe parameters can be calculated from the
registers and stack frame (like how a debugger prints local variables,
but with a guarantee that "optimized out" will not happen). This
requires compiler changes, though, and DTrace has generally not used
DWARF-like debug information.
Integrating DWARF information into libdtrace has been something I've been
slowly working on, with the goal of being able to place probes on
arbitrary instructions instead of just function boundaries. But as you
point out, compiler support is needed for any of this to be reliably
useful for SDT.
Post by Jilles Tjoelker
For a fairer comparison, the five NOPs should be changed to one or two
longer NOPs, since many CPUs decode at most 3 or 4 instructions per
cycle. Some examples of longer NOPs are in
contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
X86AsmBackend::writeNopData(). The two-byte NOP 0x66, 0x90 works on any
x86 CPU.
I'll try that, thanks. On amd64 at least, I think we'd have to use two
NOPs: a single-byte NOP that can be overwritten when the probe is
enabled, and then a four-byte NOP.
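
Concretely, the patched probe site would then be filled with something
like this (0x0f 0x1f 0x40 0x00 is one of the standard multi-byte NOP
encodings):

#include <sys/types.h>

/* Sketch: 5-byte fill for a patched call site on amd64. */
static const uint8_t sdt_patch_nops[] = {
        0x90,                           /* 1-byte NOP: overwritten with int3
                                           (0xcc) when the probe is enabled */
        0x0f, 0x1f, 0x40, 0x00,         /* 4-byte NOP: nopl 0x0(%rax) */
};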
Bruce Evans
2015-11-23 03:48:14 UTC
Post by Mark Johnston
Post by Jilles Tjoelker
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
I have not run any benchmarks but I expect that this removes only a
small part of the overhead of disabled probes. Saving and restoring
caller-save registers and setting up parameters certainly increases code
size and I-cache use. On the other hand, a branch that is always or
never taken will generally cost at most 2 cycles.
High resolution kernel profiling (which adds a function call and return to
every function without engrotting the source code with explicit calls)
has surprisingly little overhead when not in use. This depends partly
on it not using -finstrument-functions. -finstrument-functions produces
bloated code to pass 2 args (function pointer and frame pointer) in a
portable way.
(Unfortunately, -mprofiler-epilogue was broken (turned into a no-op)
in gcc-4.2.1 and is more broken (unsupported) in clang. It is
regression-tested by configuring LINT with -pp, but the test is
broken for gcc-4.2.1 by accepting -mprofiler-epilogue without actually
supporting it, and for clang by ifdefing the addition of
-mprofiler-epilogue. I normally don't notice the bug since I use
old versions of FreeBSD with a working gcc (3.3.3), and recently
fixed 4.2.1. The fix is fragile, and I just noticed that it doesn't
work for functions returning a value that don't have an explicit
return statement (this is permitted for main()). Old versions of
FreeBSD used -finstrument-functions during a previous round of
compiler breakage.)

-pg and -mprofiler-epilogue produce calls to mcount and .mexitcount
for every function (and most trap handlers and some jumps). These
functions execute "cmpl $ENABLE,enable_flag; jne done; ... done: ret $0"
when the feature is not enable. The branch is easy to predict since
there are only 2 instances of it. Scattered instances in callers might
bust the branch target cache. So might the scattered calls to .mcount
and .mexitcount. Calls are similar to unconditional branches so they
are easy to predict, but I think old x86's (ones not too old to have a
branch target cache) don't handle them well.

High resolution kernel profiling doesn't support SMP, except in my
version where it is too slow to use with SMP (due to enormous lock
contention). Low resolution kernel profiling has bad locking for
the SMP case. The bad locking first gives the enormous lock
contention even when profiling is not enabled, since it is done
before checking the flag. It also gives deadlock in ddb and some
other traps. In the non-SMP case, similar wrapping before checking
the flag makes the low-res case a smaller pessimization.

A common function checking the flags wouldn't work so well if func_ptr or
<probe args> depends on the call site. But if you can figure out
everything from the call site address then it could work similarly
to the breakpoint instruction.

Kernel profiling could use similar nop/breakpoint methods. It is
much easier to patch since there are only 2 functions and 2 args.
It already determines the args from the frame. This is the main
difference between it and -finstrument-functions.
Post by Mark Johnston
I've done some microbenchmarks using the lockstat probes on a Xeon
E5-2630 with SMT disabled. They just read the TSC and acquire/release a
lock in a loop, so there's no contention. In general I see at most a small
difference between the old and new SDT implementations and a kernel with
KDTRACE_HOOKS off altogether. For example, in my test a mtx lock/unlock
pair takes 52 cycles on average without probes; with probes, it's 54
cycles with both SDT implementations. rw read locks are 77 cycles
without probes, 79 with. rw write locks and sx exclusive locks don't
appear to show any differences, and sx shared locks show the same
timings without KDTRACE_HOOKS and with the new SDT implementation; the
current implementation adds a cycle per acquire/release pair.
When high resolution profiling is enabled, it takes similar times. It
also uses the TSC (directly and not slowed down by synchronization) and
this use is a major difference between it and low resolution profiling.
Overheads on haswell: UP: 60 cycles for .mcount and 40 cycles for
.mexitcount; SMP: 88 cycles for .mcount and 60 cycles for .mexitcount.
IIRC, rdtsc takes 24 cycles on haswell -- more than half of the UP
.mexitcount time; synchronizing it takes another 12-24 cycles; it was
much slower on older Intel x86 (65 cycles on freefall with its previous
CPU) and much faster on older amd x86 (9 cycles on Athlon XP). This
slowness makes it not very useful to optimize other things in anything
that uses the TSC.

I can believe a mere 2 cycle difference between the highly optimized
version and the simple version. It seems hardly worth the effort to
optimize. Arg pushing and tests and/or calls are fast, especially
when they don't do anything. It might be possible to optimize them
better to reduce dependencies, so that they take zero time if they
can be executed in parallel. The 100-200 extra cycles per function
given by enabled kernel profiling make a difference of about 100%
(twice as slow). With giant locking, this scales with the number
of CPUs in the kernel (800%) (8-9 times slower) with 8 such CPUs.
Post by Mark Johnston
None of this takes into account the cache effects of these probes. One
advantage of the proposed implementation is that we eliminate the data
access required to test if the probe is enabled in the first place. I'm
Accesses to the same variable cost little.
Post by Mark Johnston
also a bit uncertain about the I-cache impact. My understanding is that
a fetch of an instruction will load the entire cache line containing
that instruction. So unless the argument-marshalling instructions for a
probe site span at least one cache line, won't they all be loaded
anyway?
This can be moved out of the way. I don't like __predict_*(), but one
thing it can do right is this. The generated code for an inline flags
test should look something like:

cmpl $ENABLE,enable_flag
je slow_path
back:
# main path here
ret
slow_path:
cmpl $NULL,func_ptr
je back
# load func_ptr and args
# ...
jmp back
Post by Mark Johnston
Post by Jilles Tjoelker
Post by Mark Johnston
if (func_ptr != NULL)
func_ptr(<probe args>);
With everything non-inline, arg loading is kept out of the way without
really trying:

call decoder
# main path here
ret

The call can be replaced by nops or a trap instruction + nops (or start
like that). 'decoder' starts with the flags test. Loading args is
difficult if they are local in the caller's frame. Oops, if they were
in registers, then they would have to be saved in the frame for the
call to decoder, and that is almost as slow as passing them. So the
inline flags method may even be better than the no-op method -- it
allows the compiler to do the register saving only in the slow path.
In fact, I think you can do everything with this method and no complex
decoding:

# The following is identical to the above except for this comment.
# To optimise this, replace the next 2 instructions by nops or
# a trap instruction and nops. This gives only a time optimization.
# Not large, but easy to do. When not enabled, space is wasted
# far away where it doesn't mess up the cache.
cmpl $ENABLE,enable_flag
je slow_path
back:
# main path here
ret
slow_path:
cmpl $NULL,func_ptr
je back
# load func_ptr and args
# ...
jmp back

High resolution profiling deals with this partially as follows:
- calling .mcount and .mexitcount tends to require more saving than
necessary. gcc is not smart about placing these calls in the best
position. The main thing done wrong is frame pointer handling.
- but it is arranged that .mexitcount is called without saving the
return register(s). The callee preserves them if necessary (only
necessary if profiling is enabled).
Post by Mark Johnston
https://people.freebsd.org/~markj/__mtx_lock_flags_disas.txt
Based on what I said above and assuming a 64-byte cache line size, I'd
expect all instructions between 0xffffffff806d1328 and 0xffffffff806d134e
to be loaded regardless of whether or not the branch is taken. Is that not
the case?
They are large and ugly already. It probably doesn't matter for them.
Lots of instructions can be loaded (and executed speculatively) during
the slow locked instruction. Perhaps during the many earlier instructions.
The instructions at the branch target are more important. I think the
previous branch is usually taken (for no contention) and the branch at
...1328 is unimportant. Both branches are into another cache line and
there is no padding to align the branch targets because such padding
would not be useful for the target arch (or the optimizations are not
complete). I think it is only important that enough instructions in
the branch target are in the same cache line as the target. That usually
happens accidentally when the target is the function epilogue.
Post by Mark Johnston
I'll also add that with this change the size of the kernel text shrinks
a fair bit: from 8425096 bytes to 7983496 bytes with a custom MINIMAL-like
kernel with lock inlining.
Kernel profiling also bloats the kernel a lot. I was happy to get back
to the lesser bloat given by .mcount/.mexitcount instead of
__cyg_profile_name_too_long_to_remember(). But the text space needed is
small compared with the data space which is allocated at runtime (the
allocation is stupid and allocates space even if profiling is never used).
Post by Mark Johnston
Finally, I should have noted in my first post that this work has other
motivations beyond possible performance improvements. In particular,
recording call sites allows us to finally fill in the function component
of SDT probes automatically. For example, with this work it becomes
possible to enable the udp:::receive probe in udp6_receive(), but not
the one in udp_receive().
Patching inline "cmpl $ENABLE,enable flag" also allows this very easily,
unless you want to vary the args for a single call site.

Yet another variation, which is easier to implement since it only requires
modifying data:

call decoder
testl %rax,%rax
je slow_path
# rest as above

'decoder' now just checks a table of enable flags indexed by the call
site and returns true/false. All the args passing remains inline (but
far away). (First check a global enable flag to optimize for the
non-enabled case).
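
In C terms the decoder would be roughly the following (a sketch;
sdt_site_enabled is an invented lookup):

#include <sys/types.h>

extern volatile int sdt_any_enabled;    /* global flag, checked first */
int sdt_site_enabled(uintptr_t);        /* invented per-call-site lookup */

/* Return nonzero if the probe at the calling site is enabled. */
int
sdt_decoder(void)
{

        if (__predict_true(sdt_any_enabled == 0))
                return (0);
        return (sdt_site_enabled((uintptr_t)__builtin_return_address(0)));
}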
Post by Mark Johnston
Post by Jilles Tjoelker
Avoiding this overhead would require not generating an ABI function call
but a point where the probe parameters can be calculated from the
registers and stack frame (like how a debugger prints local variables,
but with a guarantee that "optimized out" will not happen). This
requires compiler changes, though, and DTrace has generally not used
DWARF-like debug information.
I use gcc -O1 -fno-inline-functions-called-once, and usually i386 and
-march=i386, to prevent bogus optimizations. Maximal compiler
optimizations rarely gain as much as 1% in kernel code, even in
micro-benchmarks. Compilers like to produce large code that gives
negative optimizations except in loops, and kernels don't have many
loops.

The above flags make ddb and stack traces almost correct on i386 for
public functions. Static functions are often called with args in
registers even on i386. To debug these, or to make them available
using "call", they must be declared with a regparm attribute (cdecl
only does enough for public functions where it is the default anyway).
Post by Mark Johnston
Post by Jilles Tjoelker
For a fairer comparison, the five NOPs should be changed to one or two
longer NOPs, since many CPUs decode at most 3 or 4 instructions per
cycle. Some examples of longer NOPs are in
contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
X86AsmBackend::writeNopData(). The two-byte NOP 0x66, 0x90 works on any
x86 CPU.
I'll try that, thanks. On amd64 at least, I think we'd have to use two
NOPs: a single-byte NOP that can be overwritten when the probe is
enabled, and then a four-byte NOP.
I think some arches need more than 1 byte for a trap/breakpoint instruction.

I just remembered what the "ret $0" in .mcount does. It is because
plain ret doesn't work so well at a branch target, even when the target
is aligned. IIRC, "nop; ret" is just as good a fix on the arches where
plain ret is slower, but "ret $0" is better on older in-order arches
where nop is not so special. Extra nops or larger nop instructions
should also be sprinkled for alignment. It is too hard to know the
best number to use in asm code. Compilers generate magic amounts
depending on the arch and how far away the instruction pointer is from
an alignment boundary (so as to not align if too far away). I don't
trust compilers to get this right either, even if the arch matches
exactly. In the too-far-away cases, it is likely that a small change
nearby moves closer and thus gives faster code.

Bruce
Mark Johnston
2015-11-25 00:07:22 UTC
Post by Bruce Evans
Post by Mark Johnston
Post by Jilles Tjoelker
Post by Mark Johnston
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
I have not run any benchmarks but I expect that this removes only a
small part of the overhead of disabled probes. Saving and restoring
caller-save registers and setting up parameters certainly increases code
size and I-cache use. On the other hand, a branch that is always or
never taken will generally cost at most 2 cycles.
High resolution kernel profiling (which adds a function call and return to
every function without engrotting the source code with explicit calls)
has surprisingly little overhead when not in use. This depends partly
on it not using -finstrument-functions. -finstrument-functions produces
bloated code to pass 2 args (function pointer and frame pointer) in a
portable way.
(Unfortunately, -mprofiler-epilogue was broken (turned into a no-op)
in gcc-4.2.1 and is more broken (unsupported) in clang. It is
regression-tested by configuring LINT with -pp, but the test is
broken for gcc-4.2.1 by accepting -mprofiler-epilogue without actually
supporting it, and for clang by ifdefing the addition of
-mprofiler-epilogue. I normally don't notice the bug since I use
old versions of FreeBSD with a working gcc (3.3.3), and recently
fixed 4.2.1. The fix is fragile, and I just noticed that it doesn't
work for functions returning a value that don't have an explicit
return statement (this is permitted for main()). Old versions of
FreeBSD used -finstrument-functions during a previous round of
compiler breakage.)
-pg and -mprofiler-epilogue produce calls to mcount and .mexitcount
for every function (and most trap handlers and some jumps). These
functions execute "cmpl $ENABLE,enable_flag; jne done; ... done: ret $0"
when the feature is not enable. The branch is easy to predict since
there are only 2 instances of it. Scattered instances in callers might
bust the branch target cache. So might the scattered calls to .mcount
and .mexitcount. Calls are similar to unconditional branches so they
are easy to predict, but I think old x86's (ones not too old to have a
branch target cache) don't handle them well.
High resolution kernel profiling doesn't support SMP, except in my
version where it is too slow to use with SMP (due to enormous lock
contention). Low resolution kernel profiling has bad locking for
the SMP case. The bad locking first gives the enormous lock
contention even when profiling is not enabled, since it is done
before checking the flag. It also gives deadlock in ddb and some
other traps. In the non-SMP case, similar wrapping before checking
the flag makes the low-res case a smaller pessimization.
A common function checking the flags wouldn't work so well if func_ptr or
<probe args> depends on the call site. But if you can figure out
everything from the call site address then it could work similarly
to the breakpoint instruction.
Kernel profiling could use similar nop/breakpoint methods. It is
much easier to patch since there are only 2 functions and 2 args.
It already determines the args from the frame. This is the main
difference between it and -finstrument-functions.
Post by Mark Johnston
I've done some microbenchmarks using the lockstat probes on a Xeon
E5-2630 with SMT disabled. They just read the TSC and acquire/release a
lock in a loop, so there's no contention. In general I see at most a small
difference between the old and new SDT implementations and a kernel with
KDTRACE_HOOKS off altogether. For example, in my test a mtx lock/unlock
pair takes 52 cycles on average without probes; with probes, it's 54
cycles with both SDT implementations. rw read locks are 77 cycles
without probes, 79 with. rw write locks and sx exclusive locks don't
appear to show any differences, and sx shared locks show the same
timings without KDTRACE_HOOKS and with the new SDT implementation; the
current implementation adds a cycle per acquire/release pair.
When high resolution profiling is enabled, it takes similar times. It
also uses the TSC (directly and not slowed down by synchronization) and
this use is a major difference between it and low resolution profiling.
Overheads on haswell: UP: 60 cycles for .mcount and 40 cycles for
.mexitcount; SMP: 88 cycles for .mcount and 60 cycles for .mexitcount.
IIRC, rdtsc takes 24 cycles on haswell -- more than half of the UP
.mexitcount time; synchronizing it takes another 12-24 cycles; it was
much slower on older Intel x86 (65 cycles on freefall with its previous
CPU) and much faster on older amd x86 (9 cycles on Athlon XP). This
slowness makes it not very useful to optimize other things in anything
that uses the TSC.
I can believe a mere 2 cycle difference between the highly optimized
version and the simple version. It seems hardly worth the effort to
optimize. Arg pushing and tests and/or calls are fast, especially
when they don't do anything. It might be possible to optimize them
better to reduce dependencies, so that they take zero time if they
can be executed in parallel. The 100-200 extra cycles per function
given by enabled kernel profiling make a difference of about 100%
(twice as slow). With giant locking, this scales with the number
of CPUs in the kernel (800%) (8-9 times slower) with 8 such CPUs.
Post by Mark Johnston
None of this takes into account the cache effects of these probes. One
advantage of the proposed implementation is that we eliminate the data
access required to test if the probe is enabled in the first place. I'm
Accesses to the same variable cost little.
Note that the current SDT implementation uses a different variable for
each probe, so the resulting cache pollution gets worse as probes are
added to various paths in the kernel.

The current and proposed implementations allow one to have KLDs create
and register probes automatically when they're loaded, which can't
really be accomplished with a single variable and static set of flags.
Post by Bruce Evans
[...]
In fact, I think you can do everything with this method and no complex
decoding:
# The following is identical to the above except for this comment.
# To optimise this, replace the next 2 instructions by nops or
# a trap instruction and nops. This gives only a time optimization.
# Not large, but easy to do. When not enabled, space is wasted
# far away where it doesn't mess up the cache.
cmpl $ENABLE,enable_flag
je slow_path
back:
# main path here
ret
slow_path:
cmpl $NULL,func_ptr
je back
# load func_ptr and args
# ...
jmp back
High resolution profiling deals with this partially as follows:
- calling .mcount and .mexitcount tends to require more saving than
necessary. gcc is not smart about placing these calls in the best
position. The main thing done wrong is frame pointer handling.
- but it is arranged that .mexitcount is called without saving the
return register(s). The callee preserves them if necessary (only
necessary if profiling is enabled).
Post by Mark Johnston
https://people.freebsd.org/~markj/__mtx_lock_flags_disas.txt
Based on what I said above and assuming a 64-byte cache line size, I'd
expect all instructions between 0xffffffff806d1328 and 0xffffffff806d134e
to be loaded regardless of whether or not the branch is taken. Is that not
the case?
They are large and ugly already. It probably doesn't matter for them.
Lots of instructions can be loaded (and executed speculatively) during
the slow locked instruction. Perhaps during the many earlier instructions.
The instructions at the branch target are more important. I think the
previous branch is usually taken (for no contention) and the branch at
...1328 is unimportant. Both branches are into another cache line and
there is no padding to align the branch targets because such padding
would not be useful for the target arch (or the optimizations are not
complete). I think it is only important that enough instructions in
the branch target are in the same cache line as the target. That usually
happens accidentally when the target is the function epilogue.
Post by Mark Johnston
I'll also add that with this change the size of the kernel text shrinks
a fair bit: from 8425096 bytes to 7983496 bytes with a custom MINIMAL-like
kernel with lock inlining.
Kernel profiling also bloats the kernel a lot. I was happy to get back
to the lesser bloat given by .mcount/.mexitcount instead of
__cyg_profile_name_too_long_to_remember(). But the text space needed is
small compared with the data space which is allocated at runtime (the
allocation is stupid and allocates space even if profiling is never used).
My approach also has this problem. The call sites are stored in a linker
set and end up taking up 16 bytes per probe site. There is also some
runtime-allocated memory for SDT probes, but this is small relative to
everything else. Overall, the kernel's size doesn't end up changing
significantly.
Post by Bruce Evans
Post by Mark Johnston
Finally, I should have noted in my first post that this work has other
motivations beyond possible performance improvements. In particular,
recording call sites allows us to finally fill in the function component
of SDT probes automatically. For example, with this work it becomes
possible to enable the udp:::receive probe in udp6_receive(), but not
the one in udp_receive().
Patching inline "cmpl $ENABLE,enable flag" also allows this very easily,
unless you want to vary the args for a single call site.
It's easy if you know the address of the test, but how can one figure
that out programmatically? The reason that my implementation is able to
distinguish between the probes in udp_receive() and udp6_receive() is
because it uses relocations against the probe stubs to discover the
probe sites. This would seem to suggest an approach where a probe macro
expands to

if (__predict_false(sdt_enabled))
<per-probe dtrace stub>(<probe args>);

That is, use a single flag variable that is set when _any_ SDT probe is
enabled to determine whether to enter the slow path, which is hopefully
kept out of the I-cache in most cases when the probes are not enabled.
The downside is that enabling any probe will pessimize the code for all
of them, but not in a way that's worse than what we have today.
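
A sketch of what that expansion could look like (names illustrative; this
is not the existing SDT_PROBE macro):

/* Sketch: gate every probe site on one global enable flag. */
extern volatile int sdt_enabled;        /* nonzero while any SDT probe is on */

#define SDT_PROBE_SKETCH(stub, arg0) do {                       \
        if (__predict_false(sdt_enabled))                       \
                stub(arg0);                                     \
} while (0)

/* Usage (illustrative): SDT_PROBE_SKETCH(__dtrace_sdt_udp_receive, m); */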
Post by Bruce Evans
Yet another variation, which is easier to implement since it only requires
call decoder
testl %rax,%rax
je slow_path
# rest as above
'decoder' now just checks a table of enable flags indexed by the call
site and returns true/false. All the args passing remains inline (but
far away). (First check a global enable flag to optimize for the
non-enabled case).
I think this has a similar cache pollution problem. And it seems
difficult to handle the case where a KLD load/unload may add or remove
entries from the table, unless I'm misunderstanding your proposal.
Post by Bruce Evans
Post by Mark Johnston
Post by Jilles Tjoelker
Avoiding this overhead would require not generating an ABI function call
but a point where the probe parameters can be calculated from the
registers and stack frame (like how a debugger prints local variables,
but with a guarantee that "optimized out" will not happen). This
requires compiler changes, though, and DTrace has generally not used
DWARF-like debug information.
I use gcc -O1 -fno-inline-functions-called-once, and usually i386 and
-march=i386, to prevent bogus optimizations. Maximal compiler
optimizations rarely gain as much as 1% in kernel code, even in
micro-benchmarks. Compilers like to produce large code that gives
negative optimizations except in loops, and kernels don't have many
loops.
The above flags make ddb and stack traces almost correct on i386 for
public functions. Static functions are often called with args in
registers even on i386. To debug these, or to make them available
using "call", they must be declared with a regparm attribute (cdecl
only does enough for public functions where it is the default anyway).
Post by Mark Johnston
Post by Jilles Tjoelker
For a fairer comparison, the five NOPs should be changed to one or two
longer NOPs, since many CPUs decode at most 3 or 4 instructions per
cycle. Some examples of longer NOPs are in
contrib/llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
X86AsmBackend::writeNopData(). The two-byte NOP 0x66, 0x90 works on any
x86 CPU.
I'll try that, thanks. On amd64 at least, I think we'd have to use two
NOPs: a single-byte NOP that can be overwritten when the probe is
enabled, and then a four-byte NOP.
I think some arches need more than 1 byte for a trap/breakpoint instruction.
I jut remembered what the "ret $0" in .mcount does. It is because
plain ret doesn't work so well at a branch target, even when the target
is aligned. IIRC, "nop; ret" is just as good a fix on the arches where
plain ret is slower, but "ret $0" is better on older in-order arches
where nop is not so special. Extra nops or larger nop instructions
should also be sprinkled for alignment. It is too hard to know the
best number to use in asm code. Compilers generate magic amounts
depending on the arch and how far away the instruction pointer is from
an alignment boundary (so as to not align if too far away). I don't
trust compilers to get this right either, even if the arch matches
exactly. In the too-far-away cases, it is likely that a small change
nearby moves closer and thus gives faster code.
Bruce
Ryan Stone
2015-11-23 01:10:36 UTC
Post by Jilles Tjoelker
I have not run any benchmarks but I expect that this removes only a
small part of the overhead of disabled probes. Saving and restoring
caller-save registers and setting up parameters certainly increases code
size and I-cache use. On the other hand, a branch that is always or
never taken will generally cost at most 2 cycles.
The original Solaris implementation side-stepped this by trying to place
SDT probes next to existing function calls to minimize this overhead. I
don't think that we in FreeBSD have been nearly as careful about this. It
would be a good project for somebody to go through the existing SDT probes
and see if they could be relocated slightly to produce the same semantics
but less overhead.

Post by Jilles Tjoelker
Avoiding this overhead would require not generating an ABI function call
but a point where the probe parameters can be calculated from the
registers and stack frame (like how a debugger prints local variables,
but with a guarantee that "optimized out" will not happen). This
requires compiler changes, though, and DTrace has generally not used
DWARF-like debug information.
Compiler support would be nice but is obviously a lot more complicated.
I've long thought that a DTrace probe that expanded to something like the
following would be ideal:

jmp skip_dtrace
# load arguments
int 3
skip_dtrace:


But in order to implement something like that, you'd need support from both
the compiler and the linker.
Konstantin Belousov
2015-11-23 11:35:11 UTC
Post by Mark Johnston
Hi,
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
The main difficulty is in figuring out where the probe sites actually
are once the kernel is running. In my patch, a probe site is a call to
an externally-defined function which is defined in an
automatically-generated C file. At link time, we first perform a partial
link of all the kernel's object files. Then, a script uses the relocations
against the still-undefined probe functions to generate
1) stub functions for the probes, so that the kernel can actually be
linked, and
2) a linker set containing the offsets of each probe site relative to
the beginning of the text section.
The result is linked with the partially-linked kernel to generate the
final kernel file.
During boot, we iterate over the linker set, using the offsets plus the
address of btext to overwrite probe sites with NOPs. SDT probes in kernel
modules are handled differently (and more simply): the kernel linker just
has special handling for relocations against symbols named __dtrace_sdt_*;
this is how illumos/Solaris implements all of this.
My uncertainty revolves around the use of relocations in the
partially-linked kernel to determine the address of probe sites in the
running kernel. With the GNU ld in base, this happens to work because
the final link doesn't modify the text section. Is this something I can
rely upon? Will this assumption be false with the advent of lld and LTO?
Are there other, cleaner ways to implement what I described above?
You could consider using a cheap instruction which is conditionally
converted into the trap, instead. E.g., you could have a global page frame
allocated in KVA and, for normal operation, keep the page mapped and
backed by a scratch page. The probe would be a volatile read from
the page.

When probes are activated, the page is unmapped, which converts the read
into the page fault. This is similar to the write barriers implemented
in some garbage collectors.

There are two issues with this scheme:
- The cost of a probe is relatively large, even if the low-level trap
handler is further modified to recognize the probes by the special
address being accessed.
- The arguments passed to the probes should be put into some predefined
place, e.g. somewhere in *curthread, since the trap handler cannot fetch
them using the ABI conventions.

As I mentioned above, this scheme is used by several language runtime
implementations, but there GC pauses are rare, and the slightly larger
cost of even stopping the mutator is justified even by a negligible
cost reduction for the normal flow. I am not sure whether this approach is
worth the complications and overhead for probes.
Mark Johnston
2015-11-25 00:11:36 UTC
Post by Konstantin Belousov
Post by Mark Johnston
Hi,
For the past while I've been experimenting with various ways to
implement "zero-cost" SDT DTrace probes. Basically, at the moment an SDT
if (func_ptr != NULL)
func_ptr(<probe args>);
When the probe is enabled, func_ptr is set to dtrace_probe(); otherwise
it's NULL. With zero-cost probes, the SDT_PROBE macros expand to
func(<probe args>);
When the kernel is running, each probe site has been overwritten with
NOPs. When a probe is enabled, one of the NOPs is overwritten with a
breakpoint, and the handler uses the PC to figure out which probe fired.
This approach has the benefit of incurring less overhead when the probe
is not enabled; it's more complicated to implement though, which is why
this hasn't already been done.
I have a working implementation of this for amd64 and i386[1]. Before
adding support for the other arches, I'd like to get some idea as to
whether the approach described below is sound and acceptable.
The main difficulty is in figuring out where the probe sites actually
are once the kernel is running. In my patch, a probe site is a call to
an externally-defined function which is defined in an
automatically-generated C file. At link time, we first perform a partial
link of all the kernel's object files. Then, a script uses the relocations
against the still-undefined probe functions to generate
1) stub functions for the probes, so that the kernel can actually be
linked, and
2) a linker set containing the offsets of each probe site relative to
the beginning of the text section.
The result is linked with the partially-linked kernel to generate the
final kernel file.
During boot, we iterate over the linker set, using the offsets plus the
address of btext to overwrite probe sites with NOPs. SDT probes in kernel
modules are handled differently (and more simply): the kernel linker just
has special handling for relocations against symbols named __dtrace_sdt_*;
this is how illumos/Solaris implements all of this.
My uncertainty revolves around the use of relocations in the
partially-linked kernel to determine the address of probe sites in the
running kernel. With the GNU ld in base, this happens to work because
the final link doesn't modify the text section. Is this something I can
rely upon? Will this assumption be false with the advent of lld and LTO?
Are there other, cleaner ways to implement what I described above?
You could consider using a cheap instruction which is conditionally
converted into the trap, instead. E.g., you could have a global page frame
allocated in KVA and, for normal operation, keep the page mapped and
backed by a scratch page. The probe would be a volatile read from
the page.
When probes are activated, the page is unmapped, which converts the read
into the page fault. This is similar to the write barriers implemented
in some garbage collectors.
There are two issues with this scheme:
- The cost of a probe is relatively large, even if the low-level trap
handler is further modified to recognize the probes by the special
address being accessed.
- The arguments passed to the probes should be put into some predefined
place, e.g. somewhere in *curthread, since the trap handler cannot fetch
them using the ABI conventions.
As I mentioned above, this scheme is used by several language runtime
implementations, but there GC pauses are rare, and the slightly larger
cost of even stopping the mutator is justified even by a negligible
cost reduction for the normal flow. I am not sure whether this approach is
worth the complications and overhead for probes.
If I understood correctly, each probe site would require a separate page
in KVA to be able to enable and disable individual probes in the manner
that I described in a previous reply. Today, a kernel with lock inlining
has thousands of probe sites; wouldn't the requirement of allocating KVA
for each of them be prohibitive on 32-bit architectures?
Konstantin Belousov
2015-11-25 13:15:33 UTC
Post by Mark Johnston
If I understood correctly, each probe site would require a separate page
in KVA to be able to enable and disable individual probes in the manner
that I described in a previous reply. Today, a kernel with lock inlining
has thousands of probe sites; wouldn't the requirement of allocating KVA
for each of them be prohibitive on 32-bit architectures?
Several variations of the approach allow controlling each probe site
individually, while still avoiding jumps and reducing cache consumption.
And, of course, the biggest advantage is avoiding the need to change the
text at runtime.

E.g., you could have a byte allocated somewhere for each probe, with the
usual boolean values true/false for the enabled/disabled state. Also,
somewhere, you have two KVA pages allocated, say, starting at address p:
the first page is mapped, the second page is not. The pages are shared
between all probes. Then, the following code sequence would trigger the
page fault only for an enabled probe:
movzbl this_probe_enable_byte, %eax
movl (p + PAGE_SIZE - 4)(%eax), %eax
This approach is quite portable and can be expressed in C.
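
For example (a sketch; here the enable byte holds 0 or 4, so the enabled
case's 4-byte load lands in the unmapped second page):

#include <sys/param.h>

extern volatile char *sdt_fault_pages;          /* address p: first page mapped,
                                                   second page unmapped */
extern unsigned char this_probe_enable;         /* 0 = disabled, 4 = enabled */

/* Disabled: reads the last 4 bytes of the mapped page, harmless.
 * Enabled: the read falls into the unmapped page and faults. */
#define SDT_PROBE_SITE_CHECK()                                          \
        (void)*(volatile int *)(sdt_fault_pages + PAGE_SIZE - 4 +       \
            this_probe_enable)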

If the expected count of probes is in the thousands, as you mentioned, then
you would pay only a few KB of memory for the enable control bytes.

Another variant is possible with the use of the INTO instruction, which
has relatively low latency when not trapping, according to the Agner
Fog tables.
Mark Johnston
2015-11-25 23:25:24 UTC
Post by Konstantin Belousov
Post by Mark Johnston
If I understood correctly, each probe site would require a separate page
in KVA to be able to enable and disable individual probes in the manner
that I described in a previous reply. Today, a kernel with lock inlining
has thousands of probe sites; wouldn't the requirement of allocating KVA
for each of them be prohibitive on 32-bit architectures?
Several variations of the approach allow controlling each probe site
individually, while still avoiding jumps and reducing cache consumption.
And, of course, the biggest advantage is avoiding the need to change the
text at runtime.
E.g., you could have a byte allocated somewhere for each probe, with the
usual boolean values true/false for the enabled/disabled state. Also,
somewhere, you have two KVA pages allocated, say, starting at address p:
the first page is mapped, the second page is not. The pages are shared
between all probes. Then, the following code sequence would trigger the
page fault only for an enabled probe:
movzbl this_probe_enable_byte, %eax
movl (p + PAGE_SIZE - 4)(%eax), %eax
This approach is quite portable and can be expressed in C.
If the expected count of probes is in the thousands, as you mentioned, then
you would pay only a few KB of memory for the enable control bytes.
Another variant is possible with the use of the INTO instruction, which
has relatively low latency when not trapping, according to the Agner
Fog tables.
I see. I think this could be made to work, but there's still the
complication of passing arguments to the probe. Copying them into some
block in curthread is one way to do this, but it seems more expensive
than the standard calling convention on amd64 at least.
Andriy Gapon
2015-11-26 07:30:51 UTC
Post by Mark Johnston
Post by Konstantin Belousov
Several variations of the approach allow to control each probe site
individually, while still avoiding jumps and reducing the cache consumption.
And, of course, the biggest advantage is avoiding the need to change the
text at runtime.
[snip]
Post by Mark Johnston
I see. I think this could be made to work, but there's still the
complication of passing arguments to the probe. Copying them into some
block in curthread is one way to do this, but it seems more expensive
than the standard calling convention on amd64 at least.
Besides, the FBT probes are not going anywhere, and they require run-time
text modification.
--
Andriy Gapon