What is a watchdog timeout and what causes it?

Shamino · April 4, 2022, 3:56pm

I can’t help with your specific situation, but I can provide a bit of background information, in case it helps.

A “watchdog timer” is a mechanism used to detect when a process or thread has hung, whether because of a deadlock, an infinite loop, or some other critical problem.

There are many different mechanisms for implementing this. A basic one (often used in embedded systems and OS kernels) is based on a hardware timer:

The system starts a hardware timer scheduled to generate an interrupt at some low frequency (maybe every 10 seconds, or every minute or five minutes).
The software that is being monitored will periodically reset the timer’s count back to zero.
If the software does not reset the timer, then when the interrupt occurs, the interrupt handler will trigger an action (alert the user, kill the process, restart the device, etc.).

The idea is that if the software fails to reset the timer on schedule, it is assumed that the monitored software has failed and therefore needs to be restarted.
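To make the reset-or-fire behavior concrete, here is a minimal software sketch of the scheme described above. It is not how any particular kernel or device implements its watchdog: a real one relies on a hardware timer and an interrupt handler, while this stand-in uses Python's `threading.Timer`. The `Watchdog` class and its `start`, `kick`, and `stop` methods are hypothetical names chosen for illustration.

```python
import threading
import time


class Watchdog:
    """Software stand-in for a hardware watchdog timer (hypothetical class)."""

    def __init__(self, timeout, on_expire):
        self.timeout = timeout        # seconds allowed between resets
        self.on_expire = on_expire    # action to run if the timer expires
        self._timer = None
        self._lock = threading.Lock()

    def start(self):
        """Arm the timer; on_expire runs if nothing resets it in time."""
        with self._lock:
            self._arm()

    def kick(self):
        """Called by the monitored code to push the deadline back."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._arm()

    def stop(self):
        """Disarm the timer entirely."""
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
                self._timer = None

    def _arm(self):
        # A fresh one-shot timer plays the role of "count reset to zero".
        self._timer = threading.Timer(self.timeout, self.on_expire)
        self._timer.daemon = True
        self._timer.start()
```

The monitored code would call `kick()` from its main loop. If the loop stalls for longer than `timeout` seconds, `on_expire` runs, and that callback is where an alert, a kill, or a restart would go.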

There are many other similar mechanisms that may be used to perform this kind of monitoring. For instance:

A minimum-priority thread periodically updating a global variable, which is monitored by a maximum-priority thread. If the variable doesn’t get updated, then some other thread (or group of threads) has been consuming 100% of the CPU and is probably hung in an infinite loop (or some other kind of CPU-hogging failure mode).
Many threads update their own respective global variables, which are monitored by a maximum-priority thread. If any one doesn’t get updated, then that thread is assumed to have failed.
Many processes update checkpoint files as they run. If a separate monitoring process sees that their checkpoints don’t update, it assumes that the corresponding processes have failed.
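The per-thread variant above can be sketched roughly as follows. This is only an illustration under stated assumptions: `HeartbeatMonitor`, `beat`, and `stalled` are hypothetical names, thread priorities are not modeled, and the `clock` parameter exists only so the logic can be exercised without real waiting.

```python
import threading
import time


class HeartbeatMonitor:
    """Records each worker's last check-in time (hypothetical class)."""

    def __init__(self, max_silence, clock=time.monotonic):
        self.max_silence = max_silence  # seconds a worker may stay silent
        self.clock = clock              # injectable time source (assumption)
        self._last_seen = {}
        self._lock = threading.Lock()

    def beat(self, name):
        """Called by a worker to record that it is still making progress."""
        with self._lock:
            self._last_seen[name] = self.clock()

    def stalled(self):
        """Return the sorted names of workers silent for too long."""
        now = self.clock()
        with self._lock:
            return sorted(
                name
                for name, seen in self._last_seen.items()
                if now - seen > self.max_silence
            )
```

Each worker thread would call `beat("its-name")` as it runs, and a separate monitoring thread would poll `stalled()` and act on whatever names come back. The checkpoint-file approach follows the same pattern, with file modification times standing in for the in-memory timestamps.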

In your case, if the OS is restarting itself in response to a watchdog, this strongly implies that some part of the kernel (maybe a device driver or extension) is supposed to be periodically updating a global watchdog variable or timer and it is failing to do so. Depending on what has failed, there may be no possible recovery other than to restart the OS itself, hence the system crash you are observing.

As for what might be causing these crashes and what you can do to prevent them, I’m going to have to leave that discussion to others. The articles already shared here are a good starting point. Beyond that, a detailed read of the logs might help identify the source of the problem, leading to a possible solution. Or maybe not.