XOrg freeze that affects a lot of people

Sat Mar 26 01:01:21 PST 2005

Charles Goodwin writes:
 > On Mon, 2005-03-21 at 13:27 -0500, Michel Dänzer wrote:
 > > buggy drivers will always be able to kill the system beyond recovery.
 > 
 > Yes, undoubtedly in some scenarios, but not always.  In this scenario,
 > the system is still functioning (can ssh in, background processes still
 > running, X can be killed and restarted, Alt-SysRq rumoured to work - if
 > I'd known about it before I downgraded my nvidia-drivers, I would verify
 > this).
 > 
 > > In particular, as has been pointed out already in this thread, if
 > > Alt-SysRq doesn't work, there's no way for software to recover from 
 > > that.
 > 
 > And at no point in the thread has it been claimed that Alt-SysRq does
 > not work.  In fact, somebody else has said they could Alt-SysRq and
 > reboot.  Also things can be recovered via ssh.  (Does that count?
 > Perhaps not?)  Additionally, people have already suggested that the
 > situation could plausibly be recovered by some form of watchdog process
 > within XOrg which could be used to realise when a request made to a
 > driver has failed.

The Alt-SysRq solution is not an option for the general user.
People want to keep their GUI system running. Being able to recover
the system as a whole is of little use to them if they lose their
running apps, possibly together with a lot of work they have done in
these apps.

Unfortunately I'm quite pessimistic about making the drivers
so rock solid that we will not see these problems any more in the future.
I've been in this business for too long and have seen the same problems
resurfacing too many times.
There are too many unknowns:
1. There is a great variety of chipsets integrated into hardware
   by even more OEMs together with a great variety of additional
   components.
2. To be competitive in their markets these OEMs have to drive the
   hardware to its limits.
3. This hardware runs in too many different systems which, despite
   all conformance tests, may exhibit slight differences in signal
   handling which can (but do not always have to) cause occasional
   lockups.
4. The operating system environments differ too much: even on Linux
   we see a lot of customized kernels with sets of features which may
   affect the operation of the hardware all the way down at the bus level.
5. The chips themselves: We know too little about the internals of a chip
   to understand which situations may cause the blit engine to hang,
   for example.
   Furthermore we would need tools to analyze the situation post mortem
   on systems not immediately accessible to us. Such tools would ideally
   be able to obtain the hardware status (possibly together with the
   error status) for further analysis.
   Close cooperation with hardware manufacturers is required here.
 > 
 > > Michel, 'this is off topic' nay-sayer
 > 
 > If the topic is redefined as 'how can XOrg survive or gracefully handle
 > a non-fatal driver bug' is it then on-topic?
 > 

Yes, this would be the second best solution:
It used to be good practice to detect a potential lockup situation,
time out the operation and kick the hardware to get it working again.
In cases where the system sits in a livelock and the Xserver can
be terminated and restarted without problems this may actually be
feasible.
There are two problems with that:
1. Doing this used to be far easier with the simple 2D engines
   of the past.
2. If we have to do this too often it will seriously degrade
   performance, as the driver will be occupied trying to pry loose the
   hardware most of the time.
   Real problems will go by unnoticed for too long and instead of
   taking measures in the driver to prevent these lockups people
   may start relying on such mechanisms.
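
To illustrate the time-out-and-kick approach, here is a rough sketch in
Python rather than driver code. Everything in it is hypothetical:
SimulatedEngine, wait_idle and run_ops are made-up names standing in for
a real engine and its busy/reset registers, and the timeout values are
arbitrary. Every forced reset is recorded, so that lockups do not go by
unnoticed just because they were recovered.

```python
import time


class SimulatedEngine:
    """Hypothetical stand-in for a blit engine; not a real driver API."""

    def __init__(self, hang_on=()):
        self.hang_on = set(hang_on)  # operation numbers that will hang
        self.op = 0
        self.hung = False
        self.busy_until = 0.0

    def submit(self):
        # Start the next operation; it either finishes in ~1 ms or hangs.
        self.op += 1
        self.hung = self.op in self.hang_on
        self.busy_until = time.monotonic() + 0.001

    def is_busy(self):
        return self.hung or time.monotonic() < self.busy_until

    def reset(self):
        # "Kick" the engine back into an idle state.
        self.hung = False
        self.busy_until = 0.0


def wait_idle(engine, timeout=0.05, poll=0.001):
    """Return True if the engine went idle by itself, False if it was reset."""
    deadline = time.monotonic() + timeout
    while engine.is_busy():
        if time.monotonic() >= deadline:
            engine.reset()
            return False
        time.sleep(poll)
    return True


def run_ops(engine, count, timeout=0.05):
    """Submit `count` operations; return the numbers of those that needed a reset."""
    recoveries = []
    for _ in range(count):
        engine.submit()
        if not wait_idle(engine, timeout):
            recoveries.append(engine.op)  # keep a record for later analysis
    return recoveries
```

For example, run_ops(SimulatedEngine(hang_on={3}), 5) would time out on
the third operation, reset the engine and carry on with the remaining
two. The returned list is where a driver could hang a warning or a
counter, so that frequent recoveries get reported instead of silently
eating performance.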

There is a tradeoff either way. For the user it would be far more
acceptable to live with very occasional slowdowns than with a system
that is entirely hosed, making him lose several hours' worth of work.

Egbert.