watchdog end-user interface

Discussion:

Andriy Gapon

2016-10-19 11:09:42 UTC

I know that there are people thinking about improving our watchdog
infrastructure. Maybe it's time to discuss some ideas in public.

I would like to start with discussing the end-user, or rather administrative,
interface to the watchdog system.

watchdogd has always had a -t timeout option.
Not too long ago it also grew a handful of new options:
--softtimeout
--softtimeout-action action
--pretimeout timeout
--pretimeout-action action

I want to question if those options really belong to watchdogd.
When a watchdog timer expires, that results in a system-wide action (like a
system reset). To me, that implies that there should be a single system-wide
configuration point. And I am not sure that the daemon is the best choice for it.

Personally I would prefer a sysctl interface for the following reasons:
- easier to change the configuration
- easier to query current values
- easier to signal that the value actually set may differ from the requested value

In my opinion, watchdogd should only be concerned with running a check action
and patting the dog(s). And, by extension, WDIOCPATPAT should not re-configure
the timeout; it should just reload the timers.
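
As an illustration of the split proposed above, here is a toy Python model. It is not FreeBSD code: the class, its method names, and the granularity parameter are all hypothetical. Configuring returns the value actually programmed, which may differ from the request, and patting only reloads the timer.

```python
class Watchdog:
    """Toy model: configuration and patting are separate operations."""

    def __init__(self, granularity_ns):
        self.granularity_ns = granularity_ns  # hypothetical hardware step
        self.timeout_ns = 0                   # 0 means disarmed
        self.deadline_ns = None

    def configure(self, requested_ns, now_ns):
        """Set the timeout and return the value actually programmed."""
        steps = -(-requested_ns // self.granularity_ns)  # ceiling division
        self.timeout_ns = steps * self.granularity_ns
        self.deadline_ns = now_ns + self.timeout_ns if self.timeout_ns else None
        return self.timeout_ns

    def pat(self, now_ns):
        """Reload the timer from the configured timeout; never changes it."""
        if self.timeout_ns:
            self.deadline_ns = now_ns + self.timeout_ns

    def expired(self, now_ns):
        return self.deadline_ns is not None and now_ns >= self.deadline_ns
```

With a one-second granularity, a request for 2.5 seconds would be reported back as 3 seconds, and later pats would keep using that value.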

--
Andriy Gapon

Poul-Henning Kamp

2016-10-19 11:18:18 UTC

--------

Post by Andriy Gapon
I want to question if those options really belong to watchdogd.
When a watchdog timer expires that results in a system-wide action (like a
system reset). To me, that implies that there should be a single system-wide
configuration point. And I am not sure that the daemon is the best choice for it.

The reason I originally put it in a daemon was to have the watchdog
implicitly test the kernel's ability to schedule trivial processes.

It used to be the case, and may still be, that there are deadlocks where
the kernel twiddles its thumbs but userland makes no progress.
Typical triggers for this are disk-I/O errors, corrupt filesystems,
memory overcommit etc.

A kernel-only watchdog patter would not trigger in that case.
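
The point can be sketched with a toy patter loop in Python. The function and parameter names are made up for illustration and this is not how watchdogd is implemented. It only shows that the pat comes from a scheduled userland process after a check, so a stalled or failing userland means no pat.

```python
def run_patter(check, pat, sleep, iterations):
    """Pat the dog only after a userland check succeeds.

    check, pat and sleep are caller-supplied callables (all hypothetical).
    Returns the number of pats delivered.
    """
    pats = 0
    for _ in range(iterations):
        if check():   # this only runs if the process actually gets scheduled
            pat()
            pats += 1
        sleep()
    return pats
```

If the check starts failing, or the loop never gets to run, the pats stop and the timer is left to expire.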

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

Andriy Gapon

2016-10-19 11:20:46 UTC

Post by Poul-Henning Kamp
--------

The reason I originally put it in a daemon, was to have the watchdog
implicitly test the kernel's ability to schedule trivial processes.
It used to be, and may still be so that, there are deadlocks where
the kernel was twiddling its thumbs but userland did not progress.
Typical triggers for this are disk-I/O errors, corrupt filesystems,
memory overcommit etc.
A kernel-only watchdog patter would not trigger in that case.

I addressed this further down my post.
Just in case, I think that watchdogd should do the pat-pats as it does now.
But that's different from setting the timeout.

--
Andriy Gapon

Alfred Perlstein

2016-10-19 21:31:48 UTC

Post by Poul-Henning Kamp
--------

The reason I originally put it in a daemon, was to have the watchdog
implicitly test the kernel's ability to schedule trivial processes.
It used to be, and may still be so that, there are deadlocks where
the kernel was twiddling its thumbs but userland did not progress.
Typical triggers for this are disk-I/O errors, corrupt filesystems,
memory overcommit etc.
A kernel-only watchdog patter would not trigger in that case.

Exactly.

-Alfred

Alfred Perlstein

2016-10-19 21:32:29 UTC

Post by Andriy Gapon
I know that there are people thinking about improving our watchdog
infrastructure. Maybe it's time to discuss some ideas in public.
I would like to start with discussing the end-user, or rather administrative,
interface to the watchdog system.
watchdogd always had -t timeout option.
--softtimeout
--softtimeout-action action
--pretimeout timeout
--pretimeout-action action
I want to question if those options really belong to watchdogd.
When a watchdog timer expires that results in a system-wide action (like a
system reset). To me, that implies that there should be a single system-wide
configuration point. And I am not sure that the daemon is the best choice for it.
- easier to change the configuration
- easier to query current values
- easier to signal that a value getting set may be different from a requested value
In my opinion, watchdogd should only be concerned with running a check action
and patting the dog(s). And, by extension, WDIOCPATPAT should not re-configure
the timeout, it should just reload the timers.

Please look at the Linux interface for watchdogs, it is pretty good and
could/should be ported to us.

-Alfred

Ngie Cooper

2016-10-19 21:47:25 UTC

On Wed, Oct 19, 2016 at 2:32 PM, Alfred Perlstein <***@freebsd.org> wrote:
...

Post by Alfred Perlstein
Please look at the Linux interface for watchdogs, it is pretty good and
could/should be ported to us.

We (Isilon) also have a software watchdog implementation (in lieu of
IPMI+watchdogd) to make sure "userspace processes are making
progress".

It would be nice if there was a generalized software watchdog
subsystem available in FreeBSD -- I think your suggestion to follow
Linux's example may be a good investigative step to avoid reinventing
the wheel too much.

Thanks!
-Ngie

Andriy Gapon

2016-10-20 07:14:35 UTC

Post by Ngie Cooper
...

Post by Alfred Perlstein
Please look at the Linux interface for watchdogs, it is pretty good and
could/should be ported to us.

We (Isilon) also have a software watchdog implementation (in lieu of
IPMI+watchdogd) to make sure "userspace processes are making
progress".

Please tell me more about this. It seems that there could be different
definitions of 'software watchdog' and different expectations of what it should do.

For example, we have had SW_WATCHDOG in the tree for ages.
It's a watchdog driver that's driven by clock interrupts and its logic is
implemented in software. In the current implementation there is only one
timeout action - a panic.

Not too long ago Alfred added another software watchdog that's driven by
callouts. To me it's quite similar to SW_WATCHDOG, but it has configurable
timeout actions: printf, log, panic, debugger.

So, I wonder how Isilon's software watchdog is different from the above two.

--
Andriy Gapon

Andriy Gapon

2016-10-20 07:06:56 UTC

Post by Alfred Perlstein

Please look at the Linux interface for watchdogs, it is pretty good and
could/should be ported to us.

That's not what I actually wanted to discuss.

Anyway, I had looked at it and I didn't find it a good model.

I don't like that each watchdog driver creates its own device entry.
I prefer the FreeBSD model where all drivers can work in concert.
If the most popular Linux watchdog daemon is used, then you would need multiple
instances of it (watchdog or wd_keepalive) to use multiple drivers.

I don't like the one-second resolution. It should be enough for everybody and,
hey, it's better than our power-of-two resolution in the most used range.
But I think that we could be even better.
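
To make the resolution comparison concrete, here is a small Python sketch. It assumes the timeout is programmed as the smallest power of two in nanoseconds that covers the request, which is one reading of "power-of-two resolution" above. The function name is hypothetical.

```python
def pow2_timeout(requested_s):
    """Return (n, 2**n) where 2**n ns is the smallest power of two
    that is not shorter than requested_s whole seconds."""
    ns = requested_s * 1_000_000_000
    n = max(ns - 1, 0).bit_length()  # ceil(log2(ns)) for ns >= 1
    return n, 1 << n
```

Under that assumption a 10 second request becomes 2^34 ns, about 17.2 seconds, and the next available step after that is about 34.4 seconds.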

I do not like this (typical of Linux, I'd say):
"The Linux watchdog API is a rather ad-hoc construction and different
drivers implement different, and sometimes incompatible, parts of it."

There can be different opinions and different people can work towards different
goals. Personally, I do not have a goal of having a Linux-like watchdog interface
on FreeBSD.

References:
https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.txt
https://www.kernel.org/doc/Documentation/watchdog/watchdog-kernel-api.txt
http://linux.die.net/man/5/watchdog.conf
https://sourceforge.net/p/watchdog/code/ci/master/tree/src/keep_alive.c?format=raw
http://www.sat.dundee.ac.uk/psc/watchdog/Linux-Watchdog.html

--
Andriy Gapon

Alfred Perlstein

2016-10-20 07:30:19 UTC

Post by Andriy Gapon

Post by Alfred Perlstein

Please look at the Linux interface for watchdogs, it is pretty good and
could/should be ported to us.

That's not what I actually wanted to discuss.
Anyway, I had looked at it and I didn't find it a good model.
I don't like that each watchdog driver creates its own device entry.
I prefer the FreeBSD model where all drivers can work in concert.
If the most popular Linux watchdog daemon is used, then you would need multiple
instances of it (watchdog or wd_keepalive) to use multiple drivers.
I don't like the seconds resolution. It should be enough for everybody and,
hey, it's better than our power-of-two resolution in the most used range.
But I think that we could be even better.
The Linux watchdog API is a rather ad-hoc construction and different
drivers implement different, and sometimes incompatible, parts of it.
There can be different opinions and different people can work towards different
goal. Personally, I do not have a goal of having Linux-like watchdog interface
on FreeBSD.
https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.txt
https://www.kernel.org/doc/Documentation/watchdog/watchdog-kernel-api.txt
http://linux.die.net/man/5/watchdog.conf
https://sourceforge.net/p/watchdog/code/ci/master/tree/src/keep_alive.c?format=raw
http://www.sat.dundee.ac.uk/psc/watchdog/Linux-Watchdog.html

¯\_(ツ)_/¯