[PATCH] radeon: Fix a false positive lockup after 10s of inactivity

Thu Jun 13 14:22:00 PDT 2013

On Wed, Jun 12, 2013 at 6:56 AM, Jerome Glisse <j.glisse at gmail.com> wrote:
> On Wed, Jun 12, 2013 at 6:26 AM, Michel Dänzer <michel at daenzer.net> wrote:
>> On Die, 2013-06-11 at 16:23 -0700, Andy Lutomirski wrote:
>>> If the device is idle for over ten seconds, then the next attempt to do
>>> anything can race with the lockup detector and cause a bogus lockup
>>> to be detected.
>>>
>>> Oddly, the situation is well-described in the lockup detector's comments
>>> and a fix is even described.  This patch implements that fix (and corrects
>>> some typos in the description).
>>>
>>> My system has been stable for about a week running this code.  Without this,
>>> my screen would go blank every now and then and, when it came back, everything
>>> would be remarkably slow (the latter is a separate bug).
>>>
>>> Signed-off-by: Andy Lutomirski <luto at amacapital.net>
>>
>> [...]
>>
>>> diff --git a/drivers/gpu/drm/radeon/radeon_ring.c b/drivers/gpu/drm/radeon/radeon_ring.c
>>> index 1ef5eaa..fb7b3ea 100644
>>> --- a/drivers/gpu/drm/radeon/radeon_ring.c
>>> +++ b/drivers/gpu/drm/radeon/radeon_ring.c
>>> @@ -547,12 +547,12 @@ void radeon_ring_lockup_update(struct radeon_ring *ring)
>>>   * have CP rptr to a different value of jiffies wrap around which will force
>>>   * initialization of the lockup tracking informations.
>>>   *
>>> - * A possible false positivie is if we get call after while and last_cp_rptr ==
>>> - * the current CP rptr, even if it's unlikely it might happen. To avoid this
>>> - * if the elapsed time since last call is bigger than 2 second than we return
>>> - * false and update the tracking information. Due to this the caller must call
>>> - * radeon_ring_test_lockup several time in less than 2sec for lockup to be reported
>>> - * the fencing code should be cautious about that.
>>> + * A possible false positive is if we get called after a while and
>>> + * last_cp_rptr == the current CP rptr, even if it's unlikely it might
>>> + * happen. To avoid this if the elapsed time since the last call is bigger
>>> + * than 2 second then we return false and update the tracking
>>> + * information. Due to this the caller must call radeon_ring_test_lockup
>>> + * more frequently than once every 2s when waiting.
>>
>> Is it guaranteed that radeon_ring_test_lockup will be called more often
>> than every 2s when waiting? If not, this change might prevent a real
>> lockup from being detected?
>
> Yes it will if you wait for a fence, because the fence timeout wait is
> way smaller than 2sec so radeon_ring_is_lockup get call several time,
> which call radeon_ring_force_activity and then
> radeon_ring_test_lockup.
>
> This also means it very very very unlikely (see below for the likely
> case) to have a wrap around that give last rptr same as current one.
>
> The likely case is when you have something like a long compute, then
> nothing is lockup but you keep filling ring with
> radeon_ring_force_activity but the cp is still stuck on the ib of the
> compute stuff so rptr does not progress.
>
>> Either way, I wonder if there might not be a simpler solution to the
>> problem, e.g. by updating last_activity when submitting commands to a
>> previously empty ring.
>
> Maybe but i still don't think it should matter.
>
> Andy can you test (without your patch) and see if it helps with your issue :
> http://people.freedesktop.org/~glisse/0001-drm-radeon-update-lockup-tracking-when-scheduling-in.patch

Testing now.  I'll report back in a couple of days.

I don't think that long computes have anything to do with it.  The
bogus lockups happen when I look away from my computer for a while and
then click something.  I thing the graphics are usually completely
idle when this happens.

AFAIK I've never run an OpenCL or similar application on this system.

--Andy