Discussion:
__read_mostly in the kernel
Mateusz Guzik
2016-11-27 21:25:03 UTC
Permalink
One of the standard problems in mpsafe kernels is false sharing.

The somewhat standard way of combating part of it, for frequently read
and rarely (if ever) modified variables, is an annotation which puts
them in a dedicated part of the binary; the somewhat standard name
for the macro doing the work is __read_mostly.

The FreeBSD kernel still does not have it and I think it's long overdue.

Now, I don't know how to do it nicely in the linker script, in
particular I don't know how to get the cache line size.

For testing purposes I hacked up the crap version below and verified it
works fine. Would be nice if someone with ld-clue fixed that up. I don't
care about the header either.

I just want the macro. :)

diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))

+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
.bss :
diff --git a/sys/sys/param.h b/sys/sys/param.h
index cf38985..dcd9526 100644
--- a/sys/sys/param.h
+++ b/sys/sys/param.h
@@ -360,4 +360,8 @@ __END_DECLS
*/
#define __PAST_END(array, offset) (((__typeof__(*(array)) *)(array))[offset])

+#ifndef __read_mostly
+#define __read_mostly
+#endif
+
#endif /* _SYS_PARAM_H_ */
--
Mateusz Guzik <mjguzik gmail.com>
Hans Petter Selasky
2016-11-28 07:20:21 UTC
Permalink
Post by Mateusz Guzik
One of the standard problems in mpsafe kernels is false sharing.
The somewhat standard way of combating parts of it for frequently read
and rarely (if ever) modified variables is an annotation which puts
them in a dedicated part of the binary and the somewhat standard name
for a macro doing the work is __read_mostly.
The FreeBSD kernel still does not have it and I think it's long overdue.
Now, I don't know how to do it nicely in the linker script, in
particular I don't know how get the cache line size.
For testing purposes I hacked up a crap version below and verified it
works fine.
I hacked up the following crap version. Would be nice if someone with
ld-clue fixed that up. I don't care about the header either.
I just want the macro. :)
diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
diff --git a/sys/sys/param.h b/sys/sys/param.h
index cf38985..dcd9526 100644
--- a/sys/sys/param.h
+++ b/sys/sys/param.h
@@ -360,4 +360,8 @@ __END_DECLS
*/
#define __PAST_END(array, offset) (((__typeof__(*(array)) *)(array))[offset])
+#ifndef __read_mostly
+#define __read_mostly
+#endif
+
#endif /* _SYS_PARAM_H_ */
Hi,

You might need to patch linuxkpi (/sys/compat/linuxkpi/) too, which
also defines this macro.

--HPS
Mateusz Guzik
2016-11-28 07:23:54 UTC
Permalink
Post by Hans Petter Selasky
Post by Mateusz Guzik
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
You might needs to patch the linuxkpi /sys/compat/linuxkpi/ too,
which also defines this macro.
Oh, indeed:
sys/dev/drm2/drm_os_freebsd.h:#define __read_mostly

Thanks,
--
Mateusz Guzik <mjguzik gmail.com>
Ed Schouten
2016-11-28 08:30:47 UTC
Permalink
Hi Mateusz,
Post by Mateusz Guzik
The somewhat standard way of combating parts of it for frequently read
and rarely (if ever) modified variables is an annotation which puts
them in a dedicated part of the binary and the somewhat standard name
for a macro doing the work is __read_mostly.
Out of curiosity, what is the advantage of doing this?
--
Ed Schouten <***@nuxi.nl>
Nuxi, 's-Hertogenbosch, the Netherlands
KvK-nr.: 62051717
Anton Yuzhaninov
2016-11-28 15:16:22 UTC
Permalink
Post by Ed Schouten
Post by Mateusz Guzik
The somewhat standard way of combating parts of it for frequently read
and rarely (if ever) modified variables is an annotation which puts
them in a dedicated part of the binary and the somewhat standard name
for a macro doing the work is __read_mostly.
Out of curiosity, what is the advantage of doing this?
If a variable which is read often shares a CPU cache line with a variable
that is updated often, we get performance degradation: each time the
second variable is updated, the first (read-mostly) one is unnecessarily
flushed from the CPU cache.

Some data structures are already aligned to the cache line size to avoid
false sharing, but moving read-only / read-mostly variables to a separate
section avoids false sharing without spending memory on alignment (and
alignment does not always help against false sharing, AFAIK).
Adrian Chadd
2016-11-28 16:27:10 UTC
Permalink
.. and it makes it easy to see (a) whether a variable has it, and (b)
whether a variable doesn't - just use grep.


-a
Post by Ed Schouten
Post by Mateusz Guzik
The somewhat standard way of combating parts of it for frequently read
and rarely (if ever) modified variables is an annotation which puts
them in a dedicated part of the binary and the somewhat standard name
for a macro doing the work is __read_mostly.
Out of curiosity, what is the advantage of doing this?
If a variable which is read often shares a CPU cache line with a variable
that is updated often, we get performance degradation: each time the second
variable is updated, the first (read-mostly) one is unnecessarily flushed
from the CPU cache.
Some data structures are already aligned to the cache line size to avoid
false sharing, but moving read-only / read-mostly variables to a separate
section avoids false sharing without spending memory on alignment (and
alignment does not always help against false sharing, AFAIK).
_______________________________________________
https://lists.freebsd.org/mailman/listinfo/freebsd-arch
Mateusz Guzik
2016-11-30 01:10:34 UTC
Permalink
I officially threaten to commit this by Friday.

__read_mostly-ification of certain vars will follow.

stable/11 and stable/10 will get the #define __read_mostly bit for
convenience, but the feature itself will likely only get MFCed to 11.
Post by Mateusz Guzik
One of the standard problems in mpsafe kernels is false sharing.
The somewhat standard way of combating parts of it for frequently read
and rarely (if ever) modified variables is an annotation which puts
them in a dedicated part of the binary and the somewhat standard name
for a macro doing the work is __read_mostly.
The FreeBSD kernel still does not have it and I think it's long overdue.
Now, I don't know how to do it nicely in the linker script, in
particular I don't know how get the cache line size.
For testing purposes I hacked up a crap version below and verified it
works fine.
I hacked up the following crap version. Would be nice if someone with
ld-clue fixed that up. I don't care about the header either.
I just want the macro. :)
diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
diff --git a/sys/sys/param.h b/sys/sys/param.h
index cf38985..dcd9526 100644
--- a/sys/sys/param.h
+++ b/sys/sys/param.h
@@ -360,4 +360,8 @@ __END_DECLS
*/
#define __PAST_END(array, offset) (((__typeof__(*(array)) *)(array))[offset])
+#ifndef __read_mostly
+#define __read_mostly
+#endif
+
#endif /* _SYS_PARAM_H_ */
--
Mateusz Guzik <mjguzik gmail.com>
--
Mateusz Guzik <mjguzik gmail.com>
Bruce Evans
2016-11-30 21:05:54 UTC
Permalink
Post by Mateusz Guzik
I officially threaten to commit this by Friday.
It has too high a density of style bugs to commit.
Post by Mateusz Guzik
Post by Mateusz Guzik
diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
1. Hard-coded gccism: __attribute__().
1a. Non-use of the FreeBSD macro __section() whose purpose is to make it
easy to avoid using __attribute__() for precisely what it is used for
here.
2. Misplaced definition. Such definitions go in <sys/cdefs.h>. This one
has no dependencies on amd64 except possibly for bugs elsewhere, but
is only in an amd64 header.

__section() has no dependencies on anything, but its contents may
be MD. Here the contents are not very MD. <sys/cdefs.h> already
refers to the magic sections ".gnu.warning." # sym without even
using its own macro (it uses hard-coded __asm__() and also doesn't
even use the preferred FreeBSD spelling __asm()). It isn't clear
what prevents these hard-coded gccisms from being syntax errors in
the lint case where __section() is defined as empty to avoid syntax
errors.

According to userland tests, section statements like the above and the
ones in <sys/cdefs.h> don't need any linker support to work, since they
create sections as necessary.

So the above definition in <sys/cdefs.h> should work almost perfectly for
all arches, even without linker support. Style bug (1) is smaller if
it is there.
Post by Mateusz Guzik
Post by Mateusz Guzik
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
For arches without this linker support, the variables would be grouped
but not aligned so much.

Aligning the subsection seems to be useless anyway. This only aligns
the first variable in the subsection. Most variables are still only
aligned according to their natural or specified alignment. This is
rarely as large as 64. But I think variables in the subsection can
be more aligned than the subsection. If they had to be (as in a.out),
then it is the responsibility of the linker to align the subsection
to more than the default if a single variable in the subsection needs
more than the default.
Post by Mateusz Guzik
Post by Mateusz Guzik
diff --git a/sys/sys/param.h b/sys/sys/param.h
index cf38985..dcd9526 100644
--- a/sys/sys/param.h
+++ b/sys/sys/param.h
@@ -360,4 +360,8 @@ __END_DECLS
*/
#define __PAST_END(array, offset) (((__typeof__(*(array)) *)(array))[offset])
+#ifndef __read_mostly
+#define __read_mostly
+#endif
+
#endif /* _SYS_PARAM_H_ */
Since the definition was misplaced in only the amd64 param.h, you needed
this hack to define it for other arches. The general definition would
be misplaced here too, like most definitions in this file. __read_mostly()
is not a system parameter. Fortunately, it also doesn't use any system
parameters, so it can go in <sys/cdefs.h> or better <sys/systm.h> if
kernel-only. It is close to needing a parameter for alignment.

Bruce
Mateusz Guzik
2016-12-29 11:45:55 UTC
Permalink
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
1. Hard-coded gccism: __attribute__().
1a. Non-use of the FreeBSD macro __section() whose purpose is to make it
easy to avoid using __attribute__() for precisely what it is used for
here.
2. Misplaced definition. Such definitions go in <sys/cdefs.h>. This one
has no dependencies on amd64 except possibly for bugs elsewhere, but
is only in an amd64 header.
[..]
Post by Bruce Evans
According to userland tests, section statements like the above and the
ones in <sys/cdefs.h> don't need any linker support to work, since they
create sections as necessary.
So the above definition in <sys/cdefs.h> should almost perfectly for
all arches, even without linker support. Style bug (1) is smaller if
it is there.
I wanted to avoid providing the definition for archs which don't have
the linker script bit, and this was the only header I found which is MD
and effectively always included.

Indeed it seems the section is harmless even without the explicit
support.
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
For arches without this linker support, the variables would be grouped
but not aligned so much.
Aligning the subsection seems to be useless anyway. This only aligns
the first variable in the subsection. Most variables are still only
aligned according to their natural or specified alignment. This is
rarely as large as 64. But I think variables in the subsection can
be more aligned than the subsection. If they had to be (as in a.out),
then it is the responsibility of the linker to align the subsection
to more than the default if a single variable in the subsection needs
more than the default.
With the intended use, grouping side by side is beneficial: as the vars
in question are supposed to be rarely modified, there is no problem with
them sharing a cache line. Making them all use dedicated lines would
only waste memory.

That said, what about the patch below? I also grepped the tree and found
two surprises, handled in the patch.

diff --git a/sys/compat/linuxkpi/common/include/linux/compiler.h b/sys/compat/linuxkpi/common/include/linux/compiler.h
index c780abc..c32f1fa 100644
--- a/sys/compat/linuxkpi/common/include/linux/compiler.h
+++ b/sys/compat/linuxkpi/common/include/linux/compiler.h
@@ -67,7 +67,6 @@
#define typeof(x) __typeof(x)

#define uninitialized_var(x) x = x
-#define __read_mostly __attribute__((__section__(".data.read_mostly")))
#define __always_unused __unused
#define __must_check __result_use_check

diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..d87d607 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data.read_mostly :
+ {
+ *(.data.read_mostly)
+ }
+ . = ALIGN(64);
_edata = .; PROVIDE (edata = .);
__bss_start = .;
.bss :
diff --git a/sys/dev/drm2/drm_os_freebsd.h b/sys/dev/drm2/drm_os_freebsd.h
index dc01c6a..11c9feb 100644
--- a/sys/dev/drm2/drm_os_freebsd.h
+++ b/sys/dev/drm2/drm_os_freebsd.h
@@ -80,7 +80,6 @@ typedef void irqreturn_t;

#define __init
#define __exit
-#define __read_mostly

#define BUILD_BUG_ON(x) CTASSERT(!(x))
#define BUILD_BUG_ON_NOT_POWER_OF_2(x)
diff --git a/sys/sys/systm.h b/sys/sys/systm.h
index a1ce9b4..5f646ff 100644
--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -445,4 +445,6 @@ extern void (*softdep_ast_cleanup)(void);

void counted_warning(unsigned *counter, const char *msg);

+#define __read_mostly __section(".data.read_mostly")
+
#endif /* !_SYS_SYSTM_H_ */
--
Mateusz Guzik <mjguzik gmail.com>
Mateusz Guzik
2016-12-29 15:31:38 UTC
Permalink
Post by Mateusz Guzik
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
1. Hard-coded gccism: __attribute__().
1a. Non-use of the FreeBSD macro __section() whose purpose is to make it
easy to avoid using __attribute__() for precisely what it is used for
here.
2. Misplaced definition. Such definitions go in <sys/cdefs.h>. This one
has no dependencies on amd64 except possibly for bugs elsewhere, but
is only in an amd64 header.
[..]
Post by Bruce Evans
According to userland tests, section statements like the above and the
ones in <sys/cdefs.h> don't need any linker support to work, since they
create sections as necessary.
So the above definition in <sys/cdefs.h> should almost perfectly for
all arches, even without linker support. Style bug (1) is smaller if
it is there.
I wanted to avoid providing the definition for archs which don't have
the linker script bit and this was the only header I found which is md
and effectively always included.
Indeed it seems the section is harmless even without the explicit
support.
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
For arches without this linker support, the variables would be grouped
but not aligned so much.
Aligning the subsection seems to be useless anyway. This only aligns
the first variable in the subsection. Most variables are still only
aligned according to their natural or specified alignment. This is
rarely as large as 64. But I think variables in the subsection can
be more aligned than the subsection. If they had to be (as in a.out),
then it is the responsibility of the linker to align the subsection
to more than the default if a single variable in the subsection needs
more than the default.
With the indended use grouping side-by-side is beneficial - as the vars
in question are supposed to be rarely modified, there is no problem with
them sharing a cache line. Making them all use dedicated lines would
only waste memory.
That said, what about the patch below. I also grepped the tree and found
2 surprises, handled in the patch.
Scratch the previous patch. I extended it with __exclusive_cache_line
(happy to hear ideas for a better name), to be used for something which
has to be alone in the cache line, e.g. an atomic counter.

diff --git a/sys/compat/linuxkpi/common/include/linux/compiler.h b/sys/compat/linuxkpi/common/include/linux/compiler.h
index c780abc..c32f1fa 100644
--- a/sys/compat/linuxkpi/common/include/linux/compiler.h
+++ b/sys/compat/linuxkpi/common/include/linux/compiler.h
@@ -67,7 +67,6 @@
#define typeof(x) __typeof(x)

#define uninitialized_var(x) x = x
-#define __read_mostly __attribute__((__section__(".data.read_mostly")))
#define __always_unused __unused
#define __must_check __result_use_check

diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..45685a4 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -150,6 +150,17 @@ SECTIONS
*(.data .data.* .gnu.linkonce.d.*)
KEEP (*(.gnu.linkonce.d.*personality*))
}
+ . = ALIGN(64);
+ .data.read_mostly :
+ {
+ *(.data.read_mostly)
+ }
+ . = ALIGN(64);
+ .data.exclusive_cache_line :
+ {
+ *(.data.exclusive_cache_line)
+ }
+ . = ALIGN(64);
.data1 : { *(.data1) }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
diff --git a/sys/dev/drm2/drm_os_freebsd.h b/sys/dev/drm2/drm_os_freebsd.h
index dc01c6a..11c9feb 100644
--- a/sys/dev/drm2/drm_os_freebsd.h
+++ b/sys/dev/drm2/drm_os_freebsd.h
@@ -80,7 +80,6 @@ typedef void irqreturn_t;

#define __init
#define __exit
-#define __read_mostly

#define BUILD_BUG_ON(x) CTASSERT(!(x))
#define BUILD_BUG_ON_NOT_POWER_OF_2(x)
diff --git a/sys/sys/systm.h b/sys/sys/systm.h
index a1ce9b4..719e063 100644
--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -445,4 +445,8 @@ extern void (*softdep_ast_cleanup)(void);

void counted_warning(unsigned *counter, const char *msg);

+#define __read_mostly __section(".data.read_mostly")
+#define __exclusive_cache_line __aligned(CACHE_LINE_SIZE) \
+ __section(".data.exclusive_cache_line")
+
#endif /* !_SYS_SYSTM_H_ */
--
Mateusz Guzik <mjguzik gmail.com>
Bruce Evans
2017-01-13 02:10:33 UTC
Permalink
[Re-sent to reach freebsd-arch.]

[This reply is late, so I quoted everything.]
Post by Mateusz Guzik
Post by Mateusz Guzik
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/amd64/include/param.h b/sys/amd64/include/param.h
index a619e395..ab66e79 100644
--- a/sys/amd64/include/param.h
+++ b/sys/amd64/include/param.h
@@ -152,4 +152,6 @@
#define INKERNEL(va) (((va) >= DMAP_MIN_ADDRESS && (va) < DMAP_MAX_ADDRESS) \
|| ((va) >= VM_MIN_KERNEL_ADDRESS && (va) < VM_MAX_KERNEL_ADDRESS))
+#define __read_mostly __attribute__((__section__(".data.read_mostly")))
+
#endif /* !_AMD64_INCLUDE_PARAM_H_ */
1. Hard-coded gccism: __attribute__().
1a. Non-use of the FreeBSD macro __section() whose purpose is to make it
easy to avoid using __attribute__() for precisely what it is used for
here.
2. Misplaced definition. Such definitions go in <sys/cdefs.h>. This one
has no dependencies on amd64 except possibly for bugs elsewhere, but
is only in an amd64 header.
[..]
Post by Bruce Evans
According to userland tests, section statements like the above and the
ones in <sys/cdefs.h> don't need any linker support to work, since they
create sections as necessary.
So the above definition in <sys/cdefs.h> should almost perfectly for
all arches, even without linker support. Style bug (1) is smaller if
it is there.
I wanted to avoid providing the definition for archs which don't have
the linker script bit and this was the only header I found which is md
and effectively always included.
Indeed it seems the section is harmless even without the explicit
support.
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..ae98447 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -151,6 +151,11 @@ SECTIONS
KEEP (*(.gnu.linkonce.d.*personality*))
}
.data1 : { *(.data1) }
+ .data_read_mostly :
+ {
+ *(.data.read_mostly)
+ . = ALIGN(64);
+ }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
For arches without this linker support, the variables would be grouped
but not aligned so much.
Aligning the subsection seems to be useless anyway. This only aligns
the first variable in the subsection. Most variables are still only
aligned according to their natural or specified alignment. This is
rarely as large as 64. But I think variables in the subsection can
be more aligned than the subsection. If they had to be (as in a.out),
then it is the responsibility of the linker to align the subsection
to more than the default if a single variable in the subsection needs
more than the default.
With the indended use grouping side-by-side is beneficial - as the vars
in question are supposed to be rarely modified, there is no problem with
them sharing a cache line. Making them all use dedicated lines would
only waste memory.
That said, what about the patch below. I also grepped the tree and found
2 surprises, handled in the patch.
Scratch the previous patch. I extended it with __exclusive_cache_line
(happy with ideas for a better name) - to be used for something which
has to be alone in the cacheline, e.g. an atomic counter.
diff --git a/sys/compat/linuxkpi/common/include/linux/compiler.h b/sys/compat/linuxkpi/common/include/linux/compiler.h
index c780abc..c32f1fa 100644
--- a/sys/compat/linuxkpi/common/include/linux/compiler.h
+++ b/sys/compat/linuxkpi/common/include/linux/compiler.h
@@ -67,7 +67,6 @@
#define typeof(x) __typeof(x)
#define uninitialized_var(x) x = x
-#define __read_mostly __attribute__((__section__(".data.read_mostly")))
#define __always_unused __unused
#define __must_check __result_use_check
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..45685a4 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -150,6 +150,17 @@ SECTIONS
*(.data .data.* .gnu.linkonce.d.*)
KEEP (*(.gnu.linkonce.d.*personality*))
}
+ . = ALIGN(64);
+ .data.read_mostly :
+ {
+ *(.data.read_mostly)
+ }
+ . = ALIGN(64);
+ .data.exclusive_cache_line :
+ {
+ *(.data.exclusive_cache_line)
+ }
+ . = ALIGN(64);
.data1 : { *(.data1) }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
I don't really understand this. Is it still needed? I forget if you
need separate sections to keep things together for more than the
purpose of aligning them. I think objects aligned to the cache line
size or larger can live almost anywhere in the address space without
making much difference to cache effectiveness on most arches including
amd64.
Post by Mateusz Guzik
diff --git a/sys/dev/drm2/drm_os_freebsd.h b/sys/dev/drm2/drm_os_freebsd.h
index dc01c6a..11c9feb 100644
--- a/sys/dev/drm2/drm_os_freebsd.h
+++ b/sys/dev/drm2/drm_os_freebsd.h
@@ -80,7 +80,6 @@ typedef void irqreturn_t;
#define __init
#define __exit
-#define __read_mostly
#define BUILD_BUG_ON(x) CTASSERT(!(x))
#define BUILD_BUG_ON_NOT_POWER_OF_2(x)
diff --git a/sys/sys/systm.h b/sys/sys/systm.h
index a1ce9b4..719e063 100644
--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -445,4 +445,8 @@ extern void (*softdep_ast_cleanup)(void);
void counted_warning(unsigned *counter, const char *msg);
+#define __read_mostly __section(".data.read_mostly")
+#define __exclusive_cache_line __aligned(CACHE_LINE_SIZE) \
+ __section(".data.exclusive_cache_line")
+
#endif /* !_SYS_SYSTM_H_ */
systm.h is better for the kernel, but will we want this for userland?

systm.h is randomly ordered, and appending to it with the new definitions
inversely ordered improves its randomness. Similarly for style. Indentation
for multi-line #define's is more random now.

Bruce
Mateusz Guzik
2017-01-16 12:46:53 UTC
Permalink
Post by Bruce Evans
Post by Mateusz Guzik
Scratch the previous patch. I extended it with __exclusive_cache_line
(happy with ideas for a better name) - to be used for something which
has to be alone in the cacheline, e.g. an atomic counter.
diff --git a/sys/compat/linuxkpi/common/include/linux/compiler.h b/sys/compat/linuxkpi/common/include/linux/compiler.h
index c780abc..c32f1fa 100644
--- a/sys/compat/linuxkpi/common/include/linux/compiler.h
+++ b/sys/compat/linuxkpi/common/include/linux/compiler.h
@@ -67,7 +67,6 @@
#define typeof(x) __typeof(x)
#define uninitialized_var(x) x = x
-#define __read_mostly __attribute__((__section__(".data.read_mostly")))
#define __always_unused __unused
#define __must_check __result_use_check
diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..45685a4 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -150,6 +150,17 @@ SECTIONS
*(.data .data.* .gnu.linkonce.d.*)
KEEP (*(.gnu.linkonce.d.*personality*))
}
+ . = ALIGN(64);
+ .data.read_mostly :
+ {
+ *(.data.read_mostly)
+ }
+ . = ALIGN(64);
+ .data.exclusive_cache_line :
+ {
+ *(.data.exclusive_cache_line)
+ }
+ . = ALIGN(64);
.data1 : { *(.data1) }
_edata = .; PROVIDE (edata = .);
__bss_start = .;
I don't really understand this. Is it still needed? I forget if you
need separate sections to keep things together for more than the
purpose of aligning them. I think objects aligned to the cache line
size or larger can live almost anywhere in the address space without
making much difference to cache effectiveness on most arches including
amd64.
Without the above, while the annotated variables themselves are aligned,
unaligned ones can end up placed just after them in the same cache line.
Post by Bruce Evans
Post by Mateusz Guzik
diff --git a/sys/dev/drm2/drm_os_freebsd.h b/sys/dev/drm2/drm_os_freebsd.h
index dc01c6a..11c9feb 100644
--- a/sys/dev/drm2/drm_os_freebsd.h
+++ b/sys/dev/drm2/drm_os_freebsd.h
@@ -80,7 +80,6 @@ typedef void irqreturn_t;
#define __init
#define __exit
-#define __read_mostly
#define BUILD_BUG_ON(x) CTASSERT(!(x))
#define BUILD_BUG_ON_NOT_POWER_OF_2(x)
diff --git a/sys/sys/systm.h b/sys/sys/systm.h
index a1ce9b4..719e063 100644
--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -445,4 +445,8 @@ extern void (*softdep_ast_cleanup)(void);
void counted_warning(unsigned *counter, const char *msg);
+#define __read_mostly __section(".data.read_mostly")
+#define __exclusive_cache_line __aligned(CACHE_LINE_SIZE) \
+ __section(".data.exclusive_cache_line")
+
#endif /* !_SYS_SYSTM_H_ */
systm.h is better for the kernel, but will we want this for userland?
imho userspace may have its own variants, which this should not interfere
with.
Post by Bruce Evans
systm.h is randomly ordered, and appending to it with the new definitions
inversely ordered improves its randomness. Similarly for style. Indentation
for multi-line #define's is more random now.
Where do you propose moving this? I'm completely indifferent about the
location. Fixed the indentation below.

diff --git a/sys/compat/linuxkpi/common/include/linux/compiler.h b/sys/compat/linuxkpi/common/include/linux/compiler.h
index c780abc..c32f1fa 100644
--- a/sys/compat/linuxkpi/common/include/linux/compiler.h
+++ b/sys/compat/linuxkpi/common/include/linux/compiler.h
@@ -67,7 +67,6 @@
#define typeof(x) __typeof(x)

#define uninitialized_var(x) x = x
-#define __read_mostly __attribute__((__section__(".data.read_mostly")))
#define __always_unused __unused
#define __must_check __result_use_check

diff --git a/sys/conf/ldscript.amd64 b/sys/conf/ldscript.amd64
index 5d86b03..e64a957 100644
--- a/sys/conf/ldscript.amd64
+++ b/sys/conf/ldscript.amd64
@@ -145,6 +145,17 @@ SECTIONS
.got : { *(.got) }
. = DATA_SEGMENT_RELRO_END (24, .);
.got.plt : { *(.got.plt) }
+ . = ALIGN(64);
+ .data.read_mostly :
+ {
+ *(.data.read_mostly)
+ }
+ . = ALIGN(64);
+ .data.exclusive_cache_line :
+ {
+ *(.data.exclusive_cache_line)
+ }
+ . = ALIGN(64);
.data :
{
*(.data .data.* .gnu.linkonce.d.*)
diff --git a/sys/dev/drm2/drm_os_freebsd.h b/sys/dev/drm2/drm_os_freebsd.h
index dc01c6a..11c9feb 100644
--- a/sys/dev/drm2/drm_os_freebsd.h
+++ b/sys/dev/drm2/drm_os_freebsd.h
@@ -80,7 +80,6 @@ typedef void irqreturn_t;

#define __init
#define __exit
-#define __read_mostly

#define BUILD_BUG_ON(x) CTASSERT(!(x))
#define BUILD_BUG_ON_NOT_POWER_OF_2(x)
diff --git a/sys/sys/systm.h b/sys/sys/systm.h
index a1ce9b4..418998e 100644
--- a/sys/sys/systm.h
+++ b/sys/sys/systm.h
@@ -445,4 +445,8 @@ extern void (*softdep_ast_cleanup)(void);

void counted_warning(unsigned *counter, const char *msg);

+#define __read_mostly __section(".data.read_mostly")
+#define __exclusive_cache_line __aligned(CACHE_LINE_SIZE) \
+ __section(".data.exclusive_cache_line")
+
#endif /* !_SYS_SYSTM_H_ */
--
Mateusz Guzik <mjguzik gmail.com>