What is a watchdog timeout and what causes it?

My MacBook Pro (late 2013) may be a bit long in the tooth, but it can run Big Sur and does most of what I want; and Apple diagnostics says it’s clean. Except it often crashes (restarts automatically), complains it’s run out of space when it hasn’t (sometimes it’s just the Finder that’s loaded) and it forgets stuff (like the setting that allows my Apple Watch to wake it). But the crashes are worst, and they always say that they’re watchdog timeouts. Can anyone explain what, if anything, I can do about this? Asking the Apple Discussions forum has not helped.

A web search for ‘mac watchdog timeout’ turned up the article below, which gives several tips:

Did you try zapping the PRAM/NVRAM/SMC?

This is a problem experienced by many users and over several versions of macOS. There are over 45 pages of comments here.

Wish I had a better answer.

Thanks to all who replied. I do understand that the problem has a long history and that Apple has failed to satisfy those who’ve experienced it, which seems very strange. I’ve tried most things, including zapping the PRAM, running Apple diagnostics, using the machine without the additional monitor etc. I brought it up here because I see that there are some very knowledgeable contributors to Tidbits Talk and I hoped that there might be one who could say what actually happens inside the OS when the problem occurs - as an ex software developer, I’m always looking for someone who sees things like this as a debugging challenge.

Anyway, since reading yet more stuff, I am interested in the more recent suggestions that the Apple Photos app may be part of the problem, as I use Photos extensively myself. However, it doesn’t seem that there is any definitive cure known, so I may just have to live with the issue until I can afford an M1 Mac - but even these may suffer from the same fault if Apple themselves can’t get to the bottom of it.

I can’t help with your specific situation, but I can provide a bit of background information, in case it helps.

See also Watchdog timer - Wikipedia

A “watchdog timer” is a mechanism used to detect when a process or thread has hung, whether because of a deadlock, an infinite loop, or some other critical problem.

There are many different mechanisms for implementing this. A basic one (often used in embedded systems and OS kernels) is based on a hardware timer:

  • The system starts a hardware timer scheduled to generate an interrupt at some low frequency (maybe every 10 seconds, or every minute or five minutes).
  • The software that is being monitored will periodically reset the timer’s count back to zero.
  • If the software does not reset the timer, then when the interrupt occurs, the interrupt handler will trigger an action (alert the user, kill the process, restart the device, etc.)

The idea is that if the software fails to reset the timer on schedule, it is assumed that the monitored software has failed and therefore needs to be restarted.
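To make the steps above concrete, here is a minimal Python sketch of the same idea. It is only a simulation: there is no real hardware timer, the "ticks" are driven by a loop, and the names (SimulatedWatchdog, kick, tick, run) are made up for illustration. It is not how macOS or any particular kernel implements its watchdog.

```python
class SimulatedWatchdog:
    """Counter-based watchdog driven by explicit ticks (stand-in for a hardware timer)."""

    def __init__(self, timeout_ticks, on_timeout):
        self.timeout_ticks = timeout_ticks  # ticks allowed without a reset
        self.on_timeout = on_timeout        # action to take when the timer expires
        self.count = 0
        self.fired = False

    def kick(self):
        """Called by the monitored software to reset the count back to zero."""
        self.count = 0

    def tick(self):
        """Called once per timer period; fires the timeout action on expiry."""
        if self.fired:
            return
        self.count += 1
        if self.count >= self.timeout_ticks:
            self.fired = True
            self.on_timeout()


def run(healthy_ticks, total_ticks, timeout_ticks=3):
    """Simulate software that resets the watchdog for `healthy_ticks`, then hangs.

    Returns the tick at which the watchdog fired, or None if it never fired.
    """
    events = []
    dog = SimulatedWatchdog(timeout_ticks, lambda: events.append("restart"))
    for t in range(total_ticks):
        if t < healthy_ticks:
            dog.kick()  # healthy software resets the timer on schedule
        dog.tick()
        if dog.fired:
            return t
    return None


if __name__ == "__main__":
    print(run(healthy_ticks=5, total_ticks=20))   # software hangs after tick 4
    print(run(healthy_ticks=10, total_ticks=10))  # software never hangs
```

Note that the watchdog only learns that nobody reset it in time. It has no idea why, which is why the usual response is something blunt like a restart.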

There are many other similar mechanisms that may be used to perform this kind of monitoring. For instance:

  • A minimum-priority thread periodically updating a global variable, which is monitored by a maximum-priority thread. If the variable doesn’t get updated, then some other thread (or group of threads) has been consuming 100% of the CPU and is probably hung in an infinite loop (or some other kind of CPU-hogging failure mode).
  • Many threads update their own respective global variables, which are monitored by a maximum-priority thread. If any one doesn’t get updated, then that thread is assumed to have failed.
  • Many processes update checkpoint files as they run. If a separate monitoring process sees that their checkpoints don’t update, it assumes that the corresponding processes have failed.
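The per-thread variant in that list can be sketched along the following lines. This is again only an illustration under assumptions: HeartbeatMonitor, beat, stalled and the worker names are hypothetical, and timestamps are passed in by hand so the example behaves the same on every run. A real monitor would read a clock and run in its own high-priority thread.

```python
class HeartbeatMonitor:
    """Tracks the last time each worker reported in (hypothetical illustration)."""

    def __init__(self, max_silence):
        self.max_silence = max_silence  # longest allowed gap between heartbeats
        self.last_seen = {}             # worker name -> time of last heartbeat

    def beat(self, name, now):
        """Called by a worker to record that it is still making progress."""
        self.last_seen[name] = now

    def stalled(self, now):
        """Return the names of workers that have been silent for too long."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_silence=5.0)
    monitor.beat("render", now=0.0)
    monitor.beat("network", now=0.0)
    monitor.beat("render", now=4.0)   # "render" keeps reporting in
    print(monitor.stalled(now=6.0))   # "network" has been silent for 6 time units
```

Because each worker has its own entry, this kind of monitor can say which worker went silent, unlike a single shared timer that only knows something failed to reset it.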

In your case, if the OS is restarting itself in response to a watchdog, this strongly implies that some part of the kernel (maybe a device driver or extension) is supposed to be periodically updating a global watchdog variable or timer and it is failing to do so. Depending on what has failed, there may be no possible recovery other than to restart the OS itself, hence the system crash you are observing.

As for what might be causing these crashes and what you can do to prevent them, I’m going to have to leave that discussion to others. The articles already shared here are a good starting point. Beyond that, a detailed read of the logs might help identify the source of the problem, leading to a possible solution. Or maybe not.

Thanks, that was helpful! I’m pretty sure now who the culprit was in the recent ones I’ve had. What I don’t understand is whether it would have been that much more work for the crash report to identify the process. At least none of mine do.