[Intel-xe] [PATCH 1/3] drm/suballoc: Introduce a generic suballocation manager
Thomas Hellström
thomas.hellstrom at linux.intel.com
Wed Feb 22 15:58:45 UTC 2023
On 2/22/23 15:20, Christian König wrote:
> Am 22.02.23 um 14:54 schrieb Thomas Hellström:
>> Hi,
>>
>> On 2/22/23 12:39, Christian König wrote:
>>> Hi Thomas,
>>>
>>> Am 22.02.23 um 12:00 schrieb Thomas Hellström:
>>>> Hi, Christian,
>>>>
>>>> So I resurrected Maarten's previous patch series around this (the
>>>> amdgpu suballocator), slightly modified the code to match the API
>>>> of this patch series, re-introduced the per-allocation alignment as
>>>> per a previous review comment from you on that series, made
>>>> checkpatch.pl pass mostly (except for pre-existing style problems),
>>>> and added / fixed some comments. No memory corruption seen so far
>>>> in limited Xe testing.
>>>>
>>>> To move this forward I suggest starting with that as a common drm
>>>> suballocator. I'll post the series later today. We can follow up
>>>> with potential simplifications if needed.
>>>>
>>>> I also made a kunit test reporting some timing information.
>>>> Will post that as a follow up. Some interesting preliminary
>>>> conclusions:
>>>>
>>>> * drm_mm is per se not a CPU hog. If the rb-tree processing is
>>>> disabled and the EVICT algorithm is changed from MRU to ring-like
>>>> LRU traversal, it's more or less just as fast as the ring
>>>> suballocator.
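For reference, this is roughly what "using drm_mm as the suballocator"
means in the test; the wrapper names below are hypothetical, not the
actual kunit code. Each suballocation is simply a drm_mm_node inserted
into a range manager spanning the suballocation buffer:

#include <drm/drm_mm.h>
#include <linux/limits.h>

/* One-time setup: manage [0, size) of the suballocation buffer. */
static void hyp_sa_init(struct drm_mm *mm, u64 size)
{
	drm_mm_init(mm, 0, size);
}

static int hyp_sa_alloc(struct drm_mm *mm, struct drm_mm_node *node,
			u64 size, u64 align)
{
	/*
	 * DRM_MM_INSERT_EVICT prefers the most recently freed hole,
	 * i.e. the MRU behaviour mentioned above.
	 */
	return drm_mm_insert_node_in_range(mm, node, size, align, 0,
					   0, U64_MAX, DRM_MM_INSERT_EVICT);
}

static void hyp_sa_free(struct drm_mm_node *node)
{
	drm_mm_remove_node(node);
}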
>>>>
>>>> * With a single ring, and the suballocation buffer never completely
>>>> filled (no sleeps), the amdgpu suballocator is a bit faster per
>>>> allocation / free (around 250 ns instead of 350 ns). Allocation is
>>>> slightly slower with the amdgpu one and freeing is faster, mostly
>>>> due to the locking overhead incurred when setting up the fence
>>>> callbacks and to avoiding irq-disabled processing in the one I
>>>> proposed.
>>>
>>> For some more realistic numbers, try to signal the fence from another
>>> CPU. Alternatively, you can invalidate all the CPU read cache lines
>>> touched by the fence callback so that they need to be read in again
>>> from the allocating CPU.
>>
>> Fences are signalled using hrtimer-driven fake "ring"s, so they
>> should probably be distributed among CPUs in a fairly realistic way.
>> But anyway, I agree that results obtained from that kunit test can
>> and should be challenged before we actually use them for improvements.
>
> I would double check that. My expectation is that hrtimers execute by
> default on the CPU from which they are started.
Hmm, since I'm not using the _PINNED hrtimer flag I'd expect them to be
more distributed, but you're right, they weren't; only rather few timer
expiries came from other CPUs. With signalling forced onto other CPUs,
the figures are around 500 ns for the amdgpu variant and around 900 ns
for the fence-callback one. Still, sleeping starts at around 50-75%
fill with the amdgpu variant.
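To illustrate what the fake "ring"s do, and one way to force cross-CPU
signalling, here's a minimal sketch (hypothetical names, not the actual
kunit code): the hrtimer normally expires on the CPU it was armed on,
so arming it from a work item queued on another CPU with
queue_work_on() makes the fence signal from that CPU.

#include <linux/dma-fence.h>
#include <linux/hrtimer.h>
#include <linux/workqueue.h>

struct fake_ring {
	struct hrtimer timer;
	struct work_struct arm_work;
	struct dma_fence *next_fence;	/* fence to signal on expiry */
	ktime_t delay;
};

static enum hrtimer_restart fake_ring_expire(struct hrtimer *t)
{
	struct fake_ring *ring = container_of(t, struct fake_ring, timer);

	/* Runs on the CPU the timer was armed on. */
	dma_fence_signal(ring->next_fence);
	return HRTIMER_NORESTART;
}

static void fake_ring_arm(struct work_struct *work)
{
	struct fake_ring *ring = container_of(work, struct fake_ring,
					      arm_work);

	hrtimer_start(&ring->timer, ring->delay, HRTIMER_MODE_REL);
}

static void fake_ring_init(struct fake_ring *ring, ktime_t delay)
{
	hrtimer_init(&ring->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	ring->timer.function = fake_ring_expire;
	INIT_WORK(&ring->arm_work, fake_ring_arm);
	ring->delay = delay;
}

/* Arm the timer from @cpu so that the fence signals from that CPU. */
static void fake_ring_signal_from_cpu(struct fake_ring *ring, int cpu)
{
	queue_work_on(cpu, system_wq, &ring->arm_work);
}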
>
>>
>>>
>>>>
>>>> * With multiple rings and varying allocation sizes and signalling
>>>> times creating fragmentation, the picture becomes different, as the
>>>> amdgpu allocator starts to sleep/throttle already at around 50%-75%
>>>> fill; the one I proposed between 75% and 90% fill. Once that
>>>> happens, the CPU cost of putting a thread to sleep and waking it up
>>>> should really overshadow the above numbers.
>>>>
>>>> So it's really a tradeoff, where IMO code size and maintainability
>>>> should also play a role.
>>>>
>>>> Also, I looked at the history of the amdgpu allocator, originating
>>>> back in Radeon around 2012, but couldn't find any commits mentioning
>>>> fence callbacks or problems with them. Could you point me to that
>>>> discussion?
>>>
>>> Uff that was ~10 years ago. I don't think I can find that again.
>>
>> OK, fair enough. But what was the objective reasoning against using
>> fence callbacks for this sort of stuff, was it unforeseen locking
>> problems, caching issues or something else?
>
> Well, cache line bouncing is one major problem. Also take a look at
> the discussion about using list_head in interrupt handlers; that
> should be easy to find on LWN.
>
> The allocator usually manages enough memory that it never has to wait
> for anything; only in extreme cases like GPU resets do we actually
> wait for allocations to be freed.
I guess this varies with the application, but it can be remedied by
just adding more managed memory if needed.
/Thomas
>
> So the only cache line which is accessed from more than one CPU
> should be the one holding the signaled flag of the fence.
>
> By moving list work into the interrupt handler, you get at least
> three cache lines which start to bounce between different CPUs.
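To make the comparison concrete, here's a rough sketch of the two
schemes being discussed (hypothetical structs and names, not the actual
drm_suballoc code): freeing from the fence callback runs list and
wakeup work on the signalling CPU, whereas polling only shares the
fence's signaled bit.

#include <linux/dma-fence.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct hyp_sa_manager {
	spinlock_t lock;
	struct list_head busy;		/* in-flight suballocations */
	wait_queue_head_t wq;		/* throttled allocators sleep here */
};

struct hyp_suballoc {
	struct hyp_sa_manager *mgr;
	struct dma_fence *fence;
	struct dma_fence_cb cb;
	struct list_head node;
};

/* Scheme A: free from the fence callback, on the signalling CPU. */
static void hyp_suballoc_free_cb(struct dma_fence *fence,
				 struct dma_fence_cb *cb)
{
	struct hyp_suballoc *sa = container_of(cb, struct hyp_suballoc, cb);
	unsigned long flags;

	/*
	 * The fence's cb_list, this lock and the busy list were last
	 * written by the allocating CPU, so those cache lines bounce
	 * over to the signalling CPU here.
	 */
	spin_lock_irqsave(&sa->mgr->lock, flags);
	list_del(&sa->node);
	spin_unlock_irqrestore(&sa->mgr->lock, flags);
	wake_up(&sa->mgr->wq);
}

static void hyp_suballoc_arm(struct hyp_suballoc *sa)
{
	/* If the fence already signaled, free synchronously instead. */
	if (dma_fence_add_callback(sa->fence, &sa->cb, hyp_suballoc_free_cb))
		hyp_suballoc_free_cb(sa->fence, &sa->cb);
}

/* Scheme B: only poll the signaled bit; reclaim from the allocating CPU. */
static bool hyp_suballoc_idle(struct hyp_suballoc *sa)
{
	return dma_fence_is_signaled(sa->fence);
}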
>
> Regards,
> Christian.
>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>>>
>>>
>>> Regards,
>>> Christian.
>>>
>>>>
>>>> Thanks,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> On 2/17/23 14:51, Thomas Hellström wrote:
>>>>>
>>>>> On 2/17/23 14:18, Christian König wrote:
>>>>>> Am 17.02.23 um 14:10 schrieb Thomas Hellström:
>>>>>>> [SNIP]
>>>>>>>>>>>
>>>>>>>>>>> Any chance you could do a quick performance comparison? If
>>>>>>>>>>> not, anything against merging this without the amd / radeon
>>>>>>>>>>> changes until we can land a simpler allocator?
>>>>>>>>>>
>>>>>>>>>> Only if you can stick the allocator inside Xe and not drm,
>>>>>>>>>> cause this seems to be for a different use case than the
>>>>>>>>>> allocators inside radeon/amdgpu.
>>>>>>>>>
>>>>>>>>> Hmm, no, it's allocating in a ring-like fashion as well. Let me
>>>>>>>>> put together a unit test for benchmarking. I think it would be
>>>>>>>>> a failure for the community to end up with three separate
>>>>>>>>> suballocators doing the exact same thing for the same problem,
>>>>>>>>> really.
>>>>>>>>
>>>>>>>> Well exactly that's the point. Those allocators aren't the same
>>>>>>>> because they handle different problems.
>>>>>>>>
>>>>>>>> The allocator in radeon is simpler because it only had to deal
>>>>>>>> with a limited number of fence timelines. The one in amdgpu is
>>>>>>>> a bit more complex because it has to handle more fence
>>>>>>>> timelines.
>>>>>>>>
>>>>>>>> We could take the one from amdgpu and use it for radeon and
>>>>>>>> others as well, but the allocator proposed here doesn't even
>>>>>>>> remotely match the requirements.
>>>>>>>
>>>>>>> But again, what *are* those missing requirements exactly? What
>>>>>>> is the pathological case you see for the current code?
>>>>>>
>>>>>> Well, very low CPU overhead and not doing anything in a callback.
>>>>>
>>>>> Well, dma_fence_wait_any() will IIRC register callbacks on all
>>>>> affected fences, although admittedly there is no actual allocator
>>>>> processing in them.
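For context, this is the kind of call I mean (a minimal sketch with a
hypothetical wrapper name, not the actual Xe code): the wait
temporarily attaches a callback to every fence in the array, but those
callbacks only wake the waiter, they don't do allocator work.

#include <linux/dma-fence.h>
#include <linux/sched.h>

/*
 * Block until any of the fences guarding a suballocation signals.
 * Returns the remaining timeout (> 0) on success with *first set to
 * the index of the signaled fence, or a negative error code such as
 * -ERESTARTSYS.
 */
static long hyp_wait_for_a_hole(struct dma_fence **fences, u32 count,
				u32 *first)
{
	return dma_fence_wait_any_timeout(fences, count, true,
					  MAX_SCHEDULE_TIMEOUT, first);
}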
>>>>>
>>>>>>
>>>>>>>
>>>>>>> From what I can tell the amdgpu suballocator introduces
>>>>>>> excessive complexity to coalesce waits for fences from the same
>>>>>>> contexts, whereas the present code just frees from the fence
>>>>>>> callback if the fence wasn't already signaled.
>>>>>>
>>>>>> And this is exactly the design we had previously, which we
>>>>>> removed after Dave stumbled over tons of problems with it.
>>>>>
>>>>> So is the worry that those problems have spilled over into this
>>>>> code then? It's been pretty extensively tested. Or is it that you
>>>>> should never really use dma-fence callbacks at all?
>>>>>
>>>>>>
>>>>>>> The fence signalling code that fires that callback is typically
>>>>>>> run anyway on scheduler fences.
>>>>>>>
>>>>>>> The reason we had for not using the amdgpu suballocator as
>>>>>>> originally planned was that this complexity made it very hard
>>>>>>> for us to understand it and to fix issues we had with it.
>>>>>>
>>>>>> Well, what are those problems? The idea is actually not that
>>>>>> hard to understand.
>>>>>
>>>>> We hit memory corruption, and we spent substantially more time
>>>>> trying to debug it than to put together this patch, while never
>>>>> really understanding what happened, nor why you don't see that
>>>>> with amdgpu.
>>>>>
>>>>>>
>>>>>> We could simplify it massively at the cost of only waiting for
>>>>>> the oldest fence, if that helps.
>>>>>
>>>>> Let me grab the latest version from amdgpu and give it a try
>>>>> again, but yes I think that to make it common code we'll need it
>>>>> simpler (and my personal wish would be to separate the allocator
>>>>> functionality a bit more from the fence waiting, which I guess
>>>>> should be OK if the fence waiting is vastly simplified).
>>>>>
>>>>> /Thomas
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Christian.
>>>>>>
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Thomas
>>>>>>
>>>
>