Radeon lockup on 3.8.5-201.fc18.x86_64

Mon Apr 22 22:55:04 PDT 2013

On Mon, 2013-04-22 at 16:19 -0700, Andy Lutomirski wrote: 
> On Thu, Apr 18, 2013 at 2:12 PM, Alex Deucher <alexdeucher at gmail.com> wrote:
> > On Thu, Apr 18, 2013 at 5:11 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> >> On Mon, Apr 8, 2013 at 7:01 AM, Alex Deucher <alexdeucher at gmail.com> wrote:
> >>> On Fri, Apr 5, 2013 at 5:11 PM, Andy Lutomirski <luto at amacapital.net> wrote:
> >>>> Every day or so, I'll click something and my screens go blank for a
> >>>> second or two.  dmesg complains about a lockup, and afterwards
> >>>> everything is painfully slow.  (Even switching focus to other emacs
> >>>> windows takes a second or two.)  Once this happens, if I restart X, I
> >>>> get a blank screen, although the mouse still works and I can switch
> >>>> VTs and use the console.
> >>>
> >>> Try disabling hyperZ.  Set env var R600_HYPERZ=0 (mesa 9.1) or
> >>> R600_DEBUG=nohyperz (mesa git).
> >>
> >> It lasted longer.  I have both of those environment variables set on
> >> the Xorg process but not on clients.  Do I need them everywhere?
> >
> > For anything that uses the 3D driver.
> 
> This didn't appear to fix it, although it may have fixed some
> graphical glitches in gmail's compose window.

Seems rather unlikely that's directly related to HyperZ, but who knows.
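
For reference, since the variables have to reach every process that uses
the 3D driver, one way to be sure a given client sees them is to launch
it with both set in its environment. A minimal Python sketch, assuming
only the two variables named above; the helper names are made up for
illustration, and it overwrites any existing R600_DEBUG value:

```python
import os
import subprocess


def hyperz_disabled_env(base=None):
    """Return a copy of the environment with HyperZ turned off.

    Sets both spellings mentioned in this thread: R600_HYPERZ=0
    (mesa 9.1) and R600_DEBUG=nohyperz (mesa git). Any existing
    R600_DEBUG value is replaced, which is a simplification.
    """
    env = dict(os.environ if base is None else base)
    env["R600_HYPERZ"] = "0"
    env["R600_DEBUG"] = "nohyperz"
    return env


def run_without_hyperz(argv):
    """Run a 3D client (argv list) with HyperZ disabled; return its exit code."""
    return subprocess.call(argv, env=hyperz_disabled_env())


# Hypothetical usage: run_without_hyperz(["glxgears"])
```

Setting the variables in the session's login environment instead would
cover all clients at once.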

> [350788.530966] radeon 0000:08:00.0: GPU lockup CP stall for more than 40769msec
> [350788.530970] radeon 0000:08:00.0: GPU lockup (waiting for 0x000000000000178f last fence id 0x000000000000178e)
> [350788.532047] radeon 0000:08:00.0: Saved 103 dwords of commands on ring 0.
> [350788.532051] radeon 0000:08:00.0: GPU softreset: 0x00000003
> [350788.547792] radeon 0000:08:00.0:   GRBM_STATUS               = 0xA0003828
> [350788.547794] radeon 0000:08:00.0:   GRBM_STATUS_SE0           = 0x00000007
> [350788.547797] radeon 0000:08:00.0:   GRBM_STATUS_SE1           = 0x00000007
> [350788.547799] radeon 0000:08:00.0:   SRBM_STATUS               = 0x200000C0
> [350788.547802] radeon 0000:08:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
> [350788.547805] radeon 0000:08:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
> [350788.547807] radeon 0000:08:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000004
> [350788.547810] radeon 0000:08:00.0:   R_008680_CP_STAT          = 0x80008647
> [350788.547811] radeon 0000:08:00.0:   GRBM_SOFT_RESET=0x00007F6B
> [350788.547866] radeon 0000:08:00.0:   GRBM_STATUS               = 0x00003828
> [350788.547869] radeon 0000:08:00.0:   GRBM_STATUS_SE0           = 0x00000007
> [350788.547872] radeon 0000:08:00.0:   GRBM_STATUS_SE1           = 0x00000007
> [350788.547874] radeon 0000:08:00.0:   SRBM_STATUS               = 0x200000C0
> [350788.547877] radeon 0000:08:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
> [350788.547879] radeon 0000:08:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
> [350788.547882] radeon 0000:08:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
> [350788.547884] radeon 0000:08:00.0:   R_008680_CP_STAT          = 0x00000000
> [350788.565361] radeon 0000:08:00.0: GPU reset succeeded, trying to resume
> [350788.583801] [drm] probing gen 2 caps for device 8086:1d1a = 2/0
> [350788.583807] [drm] enabling PCIE gen 2 link speeds, disable with radeon.pcie_gen2=0
> [350788.590840] [drm] PCIE GART of 512M enabled (table at 0x0000000000040000).
> [350788.590976] radeon 0000:08:00.0: WB enabled
> [350788.590978] radeon 0000:08:00.0: fence driver on ring 0 use gpu addr 0x0000000040000c00 and cpu addr 0xffff880442f58c00
> [350788.590979] radeon 0000:08:00.0: fence driver on ring 3 use gpu addr 0x0000000040000c0c and cpu addr 0xffff880442f58c0c
> [350788.607480] [drm] ring test on 0 succeeded in 2 usecs
> [350788.607560] [drm] ring test on 3 succeeded in 1 usecs
> [350788.615053] [drm] ib test on ring 0 succeeded in 0 usecs
> [350788.615133] [drm] ib test on ring 3 succeeded in 1 usecs
> 
> I'm not convinced there's an actual hang.  40 seconds is a long time,
> and I've only ever seen this when clicking something, and when this
> happens, the screen goes blank immediately (not after a 40 second
> delay).

Hmm, now that you mention this, I notice that your original report
claims the CP stalled for 'more than 5102593msec' (about 85 minutes),
which is clearly bogus. Looks like something's wrong with the lockup
detection. Did this start after a kernel update or something like that?
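
If you want to check older logs for similarly bogus stall times, something
like the following could pull the reported durations out of saved dmesg
text. This is only a rough sketch: the regular expression is based on the
message format quoted above, the function names are made up, and the
10-minute threshold for "implausible" is an arbitrary assumption, not
anything the driver defines.

```python
import re

# Matches the duration in lines such as
# "radeon 0000:08:00.0: GPU lockup CP stall for more than 40769msec"
STALL_RE = re.compile(r"GPU lockup CP stall for more than (\d+)\s*msec")


def stall_durations_ms(log_text):
    """Return every reported CP stall duration, in milliseconds."""
    return [int(m.group(1)) for m in STALL_RE.finditer(log_text)]


def implausible_stalls(log_text, limit_ms=600_000):
    """Return the durations above limit_ms (assumed cutoff: 10 minutes)."""
    return [ms for ms in stall_durations_ms(log_text) if ms > limit_ms]


# The first line is copied from this thread. The second contains only the
# message fragment with the figure from the original report, since the
# full log line is not quoted here.
sample = (
    "[350788.530966] radeon 0000:08:00.0: "
    "GPU lockup CP stall for more than 40769msec\n"
    "GPU lockup CP stall for more than 5102593msec\n"
)

for ms in stall_durations_ms(sample):
    print("%d msec = %.1f s" % (ms, ms / 1000.0))
```

Anything reported by implausible_stalls() would point at the detection
logic rather than at a real hang of that length.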

-- 
Earthling Michel Dänzer           |                   http://www.amd.com
Libre software enthusiast         |          Debian, X and DRI developer