X server crash recovery
Paulo Cesar Pereira de Andrade
pcpa at mandriva.com.br
Wed Oct 3 14:12:32 PDT 2007
Hi,

I experimented with a crash recovery patch; it is still very simple.
Basically, it calls setjmp() in dix/dispatch.c:Dispatch(), and
hw/xfree86/common/xf86Events.c:xf86SigHandler() calls longjmp() to exit
the main loop.

The signal handler also reuses the Sun code that forks and executes
pstack, but in my test case it tries to run gstack instead. (Mandriva
doesn't have pstack as an alias for the gstack script, but gstack is
basically an "attach gdb and print a backtrace" script anyway, and the
resulting backtrace is frequently more helpful than the current one.)
It still runs the glibc backtrace, i.e. it prints two backtraces if
possible.
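
A rough sketch of the "fork, run a stack tool on our own pid, then also
print the glibc backtrace" sequence could look like the following. The
function names are hypothetical, the tool name is passed in as a
parameter (for example "gstack") instead of being hard-coded, and the
code assumes glibc's <execinfo.h> is available.

```c
#include <execinfo.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write the decimal form of `value` at the end of buf and return a
 * pointer to its first digit. Hand-rolled because snprintf() is not
 * async-signal-safe. */
static char *pid_to_string(long value, char *buf, size_t size)
{
    size_t pos = size - 1;

    buf[pos] = '\0';
    do {
        buf[--pos] = (char)('0' + value % 10);
        value /= 10;
    } while (value > 0 && pos > 0);
    return &buf[pos];
}

/*
 * Fork and run `tool <pid-of-this-process>`, then wait for it.
 * Returns the tool's exit status, 127 if it could not be executed,
 * or -1 if fork()/waitpid() failed or the tool died from a signal.
 */
int run_stack_tool(const char *tool)
{
    char buf[32];
    char *pid_arg = pid_to_string((long)getpid(), buf, sizeof(buf));
    pid_t child = fork();
    int status;

    if (child < 0)
        return -1;
    if (child == 0) {
        /* Assumption: the tool is found through PATH. */
        execlp(tool, tool, pid_arg, (char *)NULL);
        _exit(127);  /* exec failed, e.g. the tool is not installed */
    }
    if (waitpid(child, &status, 0) != child || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Print the in-process glibc backtrace to stderr; returns the number
 * of frames found. */
int print_glibc_backtrace(void)
{
    void *frames[32];
    int count = backtrace(frames, 32);

    backtrace_symbols_fd(frames, count, STDERR_FILENO);
    return count;
}
```

A handler would call run_stack_tool("gstack") first and
print_glibc_backtrace() second, so that the second backtrace is still
printed when the external tool is missing.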

Also, before calling longjmp(), it calls
xf86ProcessActionEvent(ACTION_TERMINATE), i.e. it tries to simulate a
"Ctrl+Alt+Backspace" and do a normal server exit, which hopefully will
properly restore the console, kill clients, etc.

Probably, it should check whether the crash was caused by signal 11
(SIGSEGV); I ran some tests like:

  X :1 & sleep 5; kill -11 `pidof X`

I may need to try killing it at random points to make sure it properly
restores the console.

The exit code may need some special handling, such as a flag other than
DE_TERMINATE, so that the code can tell whether the server is just
exiting or has crashed. As it stands, the patch simply reverts to the
current behavior if the server crashes again while attempting to exit
cleanly.

Another important patch, which I have only considered so far because
there are several ways to implement it, is some kind of "anti infinite
loop" mechanism. It should probably be done in the input code, or there
should be some external way to send a signal to the X server.

This is not 100% certain to bring the system back to a usable state:
the server may crash again when trying to restore the console, it may
not be possible to interpret keyboard input at certain moments, or a
buggy kernel module may have crashed the kernel. But I believe that in
over 90% of the cases it would restore the computer to a usable state,
instead of immediately requiring a hard reset unless the user can
connect from a remote host and try to restart the X server.

Since most people use some DM (display manager), they usually aren't as
affected by a "random" crash, because a new server will be started, but
switching to a virtual console will be impossible. I am more concerned
about some way to "kill" the server from the keyboard when something,
such as a driver, is spinning in an infinite loop.

The proper way would be to have a multi-threaded X server, or the
server in the kernel, with drivers that can be loaded, unloaded, or
restarted, so that a very reliable crash recovery system could be
implemented (video hardware reset included), but I don't think
something like this will happen very soon...

Any suggestions for interpreting keyboard events or sending some
signal, without major changes to the current code, and as portably as
possible? At least for Linux and *BSD, where the server has full control
of the hardware. I would like to see something like this upstream...

Paulo