minix/kernel/watchdog.c

The NMI watchdog is a very useful feature for debugging locked-up kernels. It is of limited use on a single CPU, but it can still detect a deadlock between the kernel and a system task, or a runaway loop.

If the kernel locks up, timer interrupts do not occur (as all interrupts are disabled in kernel mode). The only chance is to interrupt the kernel with a non-maskable interrupt.

This patch generates NMIs using performance counters. It uses the most widely available performance counters. As performance counters are highly model-specific, this patch is not guaranteed to work on every machine. Unfortunately this is also true for KVM :-/ On the other hand, adding this feature for other models is not extremely difficult, and the framework hopefully makes it easy enough.

Depending on the frequency of the CPU, an NMI is generated at most about once every 0.5s. If the CPU's speed is less than 2GHz, it is generated at most once every 1s. In general an NMI is generated much less often, as the performance counter counts down only while the CPU is not idle. Therefore the overhead of this feature is fairly minimal even if the load is high.

Upon detecting that the kernel is locked up, the kernel dumps the state of the kernel registers and panics.

The local APIC must be enabled for the watchdog to work.

The code is _always_ compiled in; however, it is only enabled if watchdog=<non-zero> is set in the boot monitor.

One corner case is serial console debugging. As dumping a lot of output to the serial link may take a long time, the watchdog does not detect lockups during this time, as doing so would result in too many false positives.

10 NMIs have to be handled before a lockup is detected. This means somewhere between ~5s and 10s.

Another corner case is that the watchdog is enabled only after paging is enabled, as it would be pure madness to try to get it right any earlier.
2010-01-16 21:53:55 +01:00
/*
 * This is the arch-independent part of the NMI watchdog implementation. It is
 * used to detect kernel lockups and to help debugging. Each architecture must
 * add its own low-level code that triggers the periodic checks.
 */
#include "watchdog.h"

unsigned watchdog_local_timer_ticks;
struct arch_watchdog *watchdog;
int watchdog_enabled;

void nmi_watchdog_handler(struct nmi_frame * frame)
{
	/* FIXME this should be CPU local */
	static unsigned no_ticks;
	static unsigned last_tick_count = (unsigned) -1;

	/*
	 * When debugging on a serial console, printing sometimes takes a lot
	 * of time while the kernel is certainly not locked up. We don't want
	 * to report a lockup in such a situation.
	 */
	if (serial_debug_active)
		goto reset_and_continue;

	if (last_tick_count != watchdog_local_timer_ticks) {
		/*
		 * Any progress clears the count of silent NMIs, so that only
		 * consecutive silent NMIs can add up to a lockup.
		 */
		if (no_ticks > 0) {
			kprintf("watchdog : kernel unlocked\n");
			no_ticks = 0;
		}
		/* we are still ticking, everything seems good */
		last_tick_count = watchdog_local_timer_ticks;
		goto reset_and_continue;
	}

	/*
	 * If watchdog_local_timer_ticks hasn't changed since last time, give
	 * it some more time, and trigger the watchdog alarm only if it is
	 * still dead.
	 */
	if (++no_ticks < 10) {
		if (no_ticks == 1)
			kprintf("WARNING watchdog : possible kernel lockup\n");
		goto reset_and_continue;
	}

	arch_watchdog_lockup(frame);

reset_and_continue:
	if (watchdog->reinit)
		watchdog->reinit(cpuid);
}