[Mesa-dev] [PATCH] RFC: Extend IMG_context_priority with NV_context_priority_realtime

Ben Widawsky ben at bwidawsk.net
Fri Apr 6 18:51:21 UTC 2018


On 18-03-31 12:00:16, Chris Wilson wrote:
>Quoting Kenneth Graunke (2018-03-30 19:20:57)
>> On Friday, March 30, 2018 7:40:13 AM PDT Chris Wilson wrote:
>> > For i915, we are proposing to use a quality-of-service parameter in
>> > addition to that of just a priority that usurps everyone. Due to our HW,
>> > preemption may not be immediate and will be forced to wait until an
>> > uncooperative process hits an arbitration point. To prevent that unduly
>> > impacting the privileged RealTime context, we back up the preemption
>> > request with a timeout to reset the GPU and forcibly evict the GPU hog
>> > in order to execute the new context.
>>
>> I am strongly against exposing this in general.  Performing a GPU reset
>> in the middle of a batch can completely screw up whatever application
>> was running.  If the application is using robustness extensions, we may
>> be forced to return GL_DEVICE_LOST, causing the application to have to
>> recreate their entire GL context and start over.  If not, we may try to
>> let them limp on(*) - and hope they didn't get too badly damaged by some
>> of their commands not executing, or executing twice (if the kernel tries
>> to resubmit it).  But it may very well cause the app to misrender, or
>> even crash.
>
>Yes, I think the revulsion has been universal. However, as a
>quality-of-service guarantee, I can understand the appeal. The
>difference is that instead of allowing a DoS for 6s or so as we
>currently allow, we allow that to be specified by the context. As it
>does allow one context to impact another, I want it locked down to
>privileged processes. I have been using CAP_SYS_ADMIN, as the potential
>for harm is even greater than exploiting the weak scheduler by
>changing priority.
>

I'm not terribly worried about this on our hardware for 3d. Today, there is
exactly one case I can think of where this would happen: a sufficiently
long-running shader on a sufficiently large triangle.

The concern I have is about compute, where I think we don't do preemption
nearly as well.

>> This seems like a crazy plan to me.  Scheduling has never been allowed
>> to just kill random processes.
>
>That's not strictly true: processes have resource limits and are killed
>if they exceed them. On the CPU, preemption is much better, so the
>issue of unyielding code is pretty much limited to the kernel, where
>we can run the NMI watchdog to kill broken code.
>
>> If you ever hit that case, then your
>> customers will see random application crashes, glitches, GPU hangs,
>> and be pretty unhappy with the result.  And not because something was
>> broken, but because somebody was impatient and an app was a bit slow.
>
>Yes, that is their decision. Kill random apps so that their
>uber-critical interface updates the clock.
>
>> If you have work that is so mission critical, maybe you shouldn't run it
>> on the same machine as one that runs applications which you care so
>> little about that you're willing to watch them crash and burn.  Don't
>> run the entertainment system on the flight computer, so to speak.
>
>You are not the first to say that ;)
>-Chris
