Radeon lockup on 3.8.5-201.fc18.x86_64

Andy Lutomirski luto at amacapital.net
Tue Apr 23 12:31:02 PDT 2013

On Tue, Apr 23, 2013 at 10:15 AM, Michel Dänzer <michel at daenzer.net> wrote:
> On Tue, 2013-04-23 at 10:08 -0700, Andy Lutomirski wrote:
>> On Mon, Apr 22, 2013 at 10:55 PM, Michel Dänzer <michel at daenzer.net> wrote:
>> > On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote:
>> >
>> >> I'm not convinced there's an actual hang.  40 seconds is a long time,
>> >> and I've only ever seen this when clicking something, and when this
>> >> happens, the screen goes blank immediately (not after a 40 second
>> >> delay).
>> >
>> > Hmm, now that you mention this, I notice in your original report it
>> > claims that the CP stalled for 'more than 5102593msec', which is clearly
>> > bogus. Looks like something's wrong with the lockup detection.
>> > Did this start after a kernel update or something like that?
>> It's recent.  It may have been when F18 switched from 3.7 to 3.8.
> Can you reproduce it with an upstream kernel? Can you bisect? I realize
> it'll probably take a long time, but unless someone has an idea which
> change might have introduced the problem...

Yuck.  I can try, but it takes days to reproduce this, so it will take
forever (and may end up with a wrong answer if I get lucky and don't

>> I think there are bugs in the lockup detection and in the lockup
>> recovery.  Firefox, in particular, is *really* slow afterwards.  Are
>> interrupts possibly getting dropped or misconfigured during the reset?
> Let's not get ahead of ourselves and focus on the lockup detection issue
> for now.

I don't understand the r600_gpu_check_soft_reset code, but could this
be the sequence of events that triggers it?

1. radeon_ring_is_lockup is called just as the very last command on
the ring completes, so last_rptr gets set to the rptr.
2. Nothing happens for a while (i.e. > lockup_timeout).  rptr doesn't change.
3. A very slightly slow operation starts.
4. radeon_ring_is_lockup gets called before that command completes.

radeon_ring_test_lockup will not detect a jiffies wrap-around (because
there wasn't one), rptr will equal last_rptr (because there hasn't
been any progress since last time), and the elapsed time will be
really long, because the function hasn't been called for a long time.
So a lockup gets detected, even though nothing's wrong.
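
To make that sequence concrete, here is a minimal userspace sketch of the
check as I understand it. It is a model, not the driver's code: struct
ring_model, the helper names, the millisecond clock and the 10000 ms
timeout are all made up for illustration, and only the three decisions
described above (wrap-around, rptr progress, elapsed time) are kept.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the tracking state; not the driver's struct. */
struct ring_model {
    uint32_t last_rptr;        /* read pointer recorded at the last update */
    uint64_t last_activity_ms; /* time of the last update */
};

static void ring_lockup_update(struct ring_model *r, uint32_t rptr,
                               uint64_t now_ms)
{
    r->last_rptr = rptr;
    r->last_activity_ms = now_ms;
}

/* Simplified check: returns true when a lockup would be reported. */
static bool ring_test_lockup(struct ring_model *r, uint32_t rptr,
                             uint64_t now_ms, uint64_t timeout_ms)
{
    assert(timeout_ms > 0);

    if (now_ms <= r->last_activity_ms) {
        /* clock went backwards: treat as wrap-around, start over */
        ring_lockup_update(r, rptr, now_ms);
        return false;
    }
    if (rptr != r->last_rptr) {
        /* the ring made progress since the last update */
        ring_lockup_update(r, rptr, now_ms);
        return false;
    }
    /* no progress: report a lockup once the timeout has elapsed */
    return now_ms - r->last_activity_ms >= timeout_ms;
}

/* Replays steps 1-4 and returns what the check reports at step 4. */
static bool replay_idle_then_busy(uint64_t idle_ms, uint64_t timeout_ms)
{
    struct ring_model r = { 0, 0 };
    uint32_t rptr = 64;   /* arbitrary position of the last command */
    uint64_t now = 1000;  /* arbitrary start time */

    /* 1. check runs just as the last command completes */
    ring_test_lockup(&r, rptr, now, timeout_ms);
    /* 2. nothing happens for idle_ms; rptr does not move */
    now += idle_ms;
    /* 3. a slightly slow command starts; rptr still has not moved */
    now += 1;
    /* 4. check runs again before that command completes */
    return ring_test_lockup(&r, rptr, now, timeout_ms);
}
```

In this model, replay_idle_then_busy(60000, 10000) reports a lockup even
though the ring was only idle, while replay_idle_then_busy(100, 10000)
does not.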

There's a comment above radeon_ring_test_lockup that says:

 * A possible false positivie is if we get call after while and last_cp_rptr ==
 * the current CP rptr, even if it's unlikely it might happen. To avoid this
 * if the elapsed time since last call is bigger than 2 second than we return
 * false and update the tracking information. Due to this the caller must call
 * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
 * the fencing code should be cautious about that.

but the corresponding code doesn't appear to exist anywhere.
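
For reference, here is a rough sketch of what I would expect that guard
to look like, again as a userspace model rather than a patch. Everything
in it is an assumption for illustration: struct guarded_ring, the
last_call_ms field, the STALE_CALL_MS constant and the monotonic
millisecond clock do not come from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* The "2 second" window from the comment; the name is hypothetical. */
#define STALE_CALL_MS 2000

/* Hypothetical model of the tracking state; not the driver's struct. */
struct guarded_ring {
    uint32_t last_rptr;        /* read pointer recorded at the last update */
    uint64_t last_activity_ms; /* time of the last update */
    uint64_t last_call_ms;     /* time the check last ran */
};

/*
 * Returns true when a lockup would be reported. Assumes now_ms comes
 * from a monotonic clock, so wrap-around is not modelled here.
 */
static bool guarded_test_lockup(struct guarded_ring *r, uint32_t rptr,
                                uint64_t now_ms, uint64_t timeout_ms)
{
    uint64_t since_call = now_ms - r->last_call_ms;

    r->last_call_ms = now_ms;

    if (rptr != r->last_rptr || since_call > STALE_CALL_MS) {
        /*
         * Either the ring made progress, or nobody has looked at it
         * for too long to draw a conclusion: refresh the tracking
         * information and do not report a lockup.
         */
        r->last_rptr = rptr;
        r->last_activity_ms = now_ms;
        return false;
    }
    /* polled recently and still no progress: apply the timeout */
    return now_ms - r->last_activity_ms >= timeout_ms;
}
```

With this guard, the first call after a long idle period only resets the
tracking state. A real stall is still reported, but only once the caller
has kept polling at intervals of at most 2 seconds for the full timeout,
which is the constraint the comment places on the fencing code.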

Also, and unrelatedly, I revoke my comment about gmail issues being
fixed with hyperz off.  Gmail still draws incorrectly.  This may or
may not have anything to do with the radeon driver.

