X server crash recovery

Wed Oct 3 14:12:32 PDT 2007

Hi,

 I have been experimenting with a crash recovery patch; it is still very
simple. Basically, it calls setjmp() in dix/dispatch.c:Dispatch(), and
hw/xfree86/common/xf86Events.c:xf86SigHandler() then calls longjmp() to
exit the main loop.
 The signal handler also reuses the Sun code that forks and executes
pstack, but in my test case it tries to run gstack instead. (Mandriva
doesn't have pstack as an alias for the gstack script, but it is
basically an "attach gdb and print a backtrace" script anyway, and the
resulting backtrace is frequently more helpful than the current one.)
It still runs the glibc backtrace as well, i.e. it prints two
backtraces if possible.
 Also, before calling longjmp(), it calls
xf86ProcessActionEvent(ACTION_TERMINATE), i.e. it tries to simulate a
"Ctrl+Alt+Backspace" and do a "normal server exit", which hopefully will
properly restore the console, kill clients, etc.
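
 A minimal, standalone sketch of the jump-out-of-the-main-loop idea, not
the actual patch. It uses POSIX sigsetjmp()/siglongjmp() rather than
plain setjmp()/longjmp() so that the signal mask is restored after the
jump. The names dispatch_with_recovery(), crash_handler(),
healthy_loop() and crashing_loop() are hypothetical stand-ins for
Dispatch(), xf86SigHandler() and the server main loop:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf recovery_point;             /* set in the main loop    */
static volatile sig_atomic_t recovery_armed;  /* 1 while a jump is valid */
static volatile sig_atomic_t caught_signal;   /* signal that was caught  */

/* Stand-in for xf86SigHandler(): jump back to the main loop, once. */
static void crash_handler(int signo)
{
    if (recovery_armed) {
        recovery_armed = 0;
        caught_signal = signo;
        siglongjmp(recovery_point, 1);
    }
    /* No recovery point, or a second crash: use the default action. */
    signal(signo, SIG_DFL);
    raise(signo);
}

/* Stand-in for Dispatch(): returns 0 if the loop ended normally, or the
 * signal number if the loop was abandoned through the recovery point. */
int dispatch_with_recovery(void (*main_loop)(void))
{
    struct sigaction sa;

    assert(main_loop != NULL);
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGSEGV, &sa, NULL);

    if (sigsetjmp(recovery_point, 1) == 0) {
        recovery_armed = 1;
        main_loop();
        recovery_armed = 0;
        return 0;
    }
    /* Reached through siglongjmp(): a clean shutdown would start here. */
    return caught_signal;
}

/* Hypothetical main loops used to exercise the sketch. */
void healthy_loop(void) { }
void crashing_loop(void) { raise(SIGSEGV); }  /* like "kill -11" */
```

 Calling dispatch_with_recovery(crashing_loop) returns SIGSEGV instead
of killing the process, which is the point where the real patch would
go on to attempt the normal server exit.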

 It should probably check whether the crash was caused by signal 11
(SIGSEGV). I ran some tests like:
X :1 & sleep 5; kill -11 `pidof X`
I may need to try killing it at random points to make sure it properly
restores the console. The exit code may also need some special
handling, such as a flag other than DE_TERMINATE, so that code can tell
whether the server is just exiting or has crashed, because the patch
simply reverts to the current behavior if the server crashes again
while attempting to exit cleanly.
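
 One way the extra flag could look, as a sketch only. DE_CRASHED is a
hypothetical name that does not exist in the server, the bit values are
chosen for this example, and dispatch_exception is a local stand-in for
the server's dispatchException variable:

```c
#include <assert.h>

/* Illustrative stand-ins for the dispatch exception bits. The values
 * are chosen for this sketch; DE_CRASHED is hypothetical. */
#define DE_RESET     (1 << 0)
#define DE_TERMINATE (1 << 1)
#define DE_CRASHED   (1 << 2)

static int dispatch_exception;  /* stand-in for dispatchException */

/* Normal termination request, e.g. Ctrl+Alt+Backspace. */
void request_terminate(void)
{
    dispatch_exception |= DE_TERMINATE;
}

/* Termination requested from the crash path: also record the reason. */
void request_crash_exit(void)
{
    dispatch_exception |= DE_TERMINATE | DE_CRASHED;
}

/* Lets exit code tell a plain exit (0) from an exit after a crash (1). */
int exiting_after_crash(void)
{
    assert(dispatch_exception & DE_TERMINATE);
    return (dispatch_exception & DE_CRASHED) != 0;
}
```

 Keeping DE_TERMINATE set in both cases means existing checks for a
terminating server keep working, and only code that cares about the
crash case needs to look at the extra bit.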

 Another important patch, which I have only thought about so far because
there are several ways to implement it, is some kind of "anti infinite
loop". This should probably be done in the input code, or by having
some external way to send a signal to the X server.
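
 A rough sketch of the "external signal" variant, under the assumption
that SIGUSR1 is free to use for this purpose (an arbitrary choice for
the example). All function names are hypothetical; stuck_driver()
stands in for code spinning in an infinite loop, and a forked child
plays the role of an external "kill -USR1":

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static sigjmp_buf escape_point;
static volatile sig_atomic_t spinning;

static void escape_handler(int signo)
{
    (void)signo;
    siglongjmp(escape_point, 1);
}

/* Stand-in for a driver stuck in an infinite loop. */
static void stuck_driver(void)
{
    for (;;)
        spinning = 1;  /* volatile store keeps the loop from vanishing */
}

/* Runs work(); returns 1 if SIGUSR1 interrupted it, 0 if it returned. */
int run_until_signalled(void (*work)(void))
{
    struct sigaction sa;
    sigset_t usr1, saved;
    int interrupted;

    assert(work != NULL);
    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &usr1, &saved);  /* hold it until armed */

    sa.sa_handler = escape_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);

    if (sigsetjmp(escape_point, 1) == 0) {
        sigprocmask(SIG_UNBLOCK, &usr1, NULL);
        work();
        interrupted = 0;
    } else {
        interrupted = 1;
    }
    sigprocmask(SIG_SETMASK, &saved, NULL);
    return interrupted;
}

/* Demo: the child acts as the external sender of the signal. */
int demo_external_kill(void)
{
    sigset_t usr1;
    pid_t child;
    int interrupted;

    sigemptyset(&usr1);
    sigaddset(&usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &usr1, NULL);  /* no delivery before ready */

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        kill(getppid(), SIGUSR1);
        _exit(0);
    }
    interrupted = run_until_signalled(stuck_driver);
    waitpid(child, NULL, 0);
    return interrupted;
}
```

 The signal is kept blocked until the jump buffer is valid, so an early
signal cannot jump through an uninitialized escape_point. This only
helps when someone can still send the signal, e.g. from a remote login.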

 It is not 100% certain that this brings back a usable system (it may
crash again when trying to restore the console, it may not be possible
to interpret keyboard input at certain moments, or it may have managed
to crash the kernel due to some buggy kernel module), but I believe
that in over 90% of the cases it would restore the computer to a usable
state, instead of immediately requiring a hard reset unless the user
can connect from a remote host and try to restart the X server.

 Since most people use some DM, they usually aren't as affected by a
"random" crash, as a new server will be started, but switching to a
virtual console will be impossible. I am more concerned about having
some way to "kill" the server from the keyboard when something, like a
driver, is spinning in an infinite loop.

 The proper way would be to have a multithreaded X server, or the server
in the kernel, with drivers that can be loaded, unloaded or restarted,
so that a very reliable crash recovery system could be implemented
(video hardware reset included), but I don't think something like this
will happen very soon...

 Any suggestions for interpreting keyboard events or sending some 
signal, without major changes to the current code, and as portable as 
possible? At least for Linux and *BSD where the server has full control 
of the hardware. I would like to see something like this upstream...

Paulo