<div dir="ltr">Hi Jason,<div><br></div><div>I've been wrestling with the sync problems in Wayland some time ago, but only with regards to 3D drivers.</div><div><br></div><div>The guarantee given by the GL/GLES spec is limited to a single graphics context. If the same buffer is accessed by 2 contexts the outcome is unspecified. The cross-context and cross-process synchronisation is not guaranteed. It happens to work on Mesa, because the read/write locking is implemented in the kernel space, but it didn't work on Broadcom driver, which has read-write interlocks in user space.</div><div><br></div><div> A Vulkan client makes it even worse because of conflicting requirements: Vulkan's vkQueuePresentKHR() passes in a number of semaphores but disallows waiting. Wayland WSI requires wl_surface_commit() to be called from vkQueuePresentKHR() which does require a wait, unless a synchronisation primitive representing Vulkan samaphores is passed between Vulkan client and the compositor.</div><div><br></div><div>The most troublesome part was Wayland buffer release mechanism, as it only involves a CPU signalling over Wayland IPC, without any 3D driver involvement. The choices were: explicit synchronisation extension or a buffer copy in the compositor (i.e. compositor textures from the copy, so the client can re-write the original), or some implicit synchronisation in kernel space (but that wasn't an option in Broadcom driver).</div><div><br></div><div>With regards to V4L2, I believe it could easily work the same way as 3D drivers, i.e. pass a buffer+fence pair to the next stage. The encode always succeeds, but for capture or decode, the main problem is the uncertain outcome, I believe? If we're fine with rendering or displaying an occasional broken frame, then buffer+fence pair would work too. The broken frame will go into the pipeline, but application can drain the pipeline and start over once the capture works again.<br></div><div><br></div><div>To answer some points raised by Laurent (although I'm unfamiliar with the camera drivers):</div><div><br></div><div>> you don't know until capture complete in which buffer the frame has been captured<br></div><div>Surely you do, you only don't know in advance if the capture will be successful</div><div><br></div><div>> but if an error occurs during capture, they can be recycled internally and put to the back of the queue.<br></div><div>That would have to change in order to use explicit synchronisation. Every started capture becomes immediately available as a buffer+fence pair. Fence is signalled once the capture is finished (successfully or otherwise). The buffer must not be reused until it's released, possibly with another fence - in that case the buffer must not be reused until the release fence is signalled. </div><div><br></div><div>Cheers,</div><div>Tomek</div><div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 16 Mar 2020 at 10:20, Laurent Pinchart <<a href="mailto:laurent.pinchart@ideasonboard.com" target="_blank">laurent.pinchart@ideasonboard.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Wed, Mar 11, 2020 at 04:18:55PM -0400, Nicolas Dufresne wrote:<br>
> (I know I'm going to be spammed by so many mailing list ...)<br>
> <br>
> On Wednesday, 11 March 2020 at 14:21 -0500, Jason Ekstrand wrote:<br>
> > On Wed, Mar 11, 2020 at 12:31 PM Jason Ekstrand <<a href="mailto:jason@jlekstrand.net" target="_blank">jason@jlekstrand.net</a>> wrote:<br>
> > > All,<br>
> > > <br>
> > > Sorry for casting such a broad net with this one. I'm sure most people<br>
> > > who reply will get at least one mailing list rejection. However, this<br>
> > > is an issue that affects a LOT of components and that's why it's<br>
> > > thorny to begin with. Please pardon the length of this e-mail as<br>
> > > well; I promise there's a concrete point/proposal at the end.<br>
> > > <br>
> > > <br>
> > > Explicit synchronization is the future of graphics and media. At<br>
> > > least, that seems to be the consensus among all the graphics people<br>
> > > I've talked to. I had a chat with one of the lead Android graphics<br>
> > > engineers recently who told me that doing explicit sync from the start<br>
> > > was one of the best engineering decisions Android ever made. It's<br>
> > > also the direction being taken by more modern APIs such as Vulkan.<br>
> > > <br>
> > > <br>
> > > ## What are implicit and explicit synchronization?<br>
> > > <br>
> > > For those that aren't familiar with this space, GPUs, media encoders,<br>
> > > etc. are massively parallel and synchronization of some form is<br>
> > > required to ensure that everything happens in the right order and<br>
> > > avoid data races. Implicit synchronization is when bits of work (3D,<br>
> > > compute, video encode, etc.) are implicitly synchronized based on the absolute<br>
> > > CPU-time order in which API calls occur. Explicit synchronization is<br>
> > > when the client (whatever that means in any given context) provides<br>
> > > the dependency graph explicitly via some sort of synchronization<br>
> > > primitives. If you're still confused, consider the following<br>
> > > examples:<br>
> > > <br>
> > > With OpenGL and EGL, almost everything is implicit sync. Say you have<br>
> > > two OpenGL contexts sharing an image where one writes to it and the<br>
> > > other textures from it. The way the OpenGL spec works, the client has<br>
> > > to make the API calls to render to the image before (in CPU time) it<br>
> > > makes the API calls which texture from the image. As long as it does<br>
> > > this (and maybe inserts a glFlush?), the driver will ensure that the<br>
> > > rendering completes before the texturing happens and you get correct<br>
> > > contents.<br>
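
(To make that ordering concrete, a tiny sketch of the two-context case - just a fragment for illustration, where 'tex', 'fbo_targeting_tex', draw_scene() and draw_quad() are hypothetical; the only guarantee comes from the CPU-side call order plus the driver's implicit sync.)

  /* Context A (producer): render into the shared texture 'tex'. */
  eglMakeCurrent(dpy, surf_a, surf_a, ctx_a);
  glBindFramebuffer(GL_FRAMEBUFFER, fbo_targeting_tex);
  draw_scene();
  glFlush();                      /* make sure the rendering is submitted */

  /* Context B (consumer): these calls must come later in CPU time. */
  eglMakeCurrent(dpy, surf_b, surf_b, ctx_b);
  glBindTexture(GL_TEXTURE_2D, tex);
  draw_quad();                    /* textures from 'tex' */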
> > > <br>
> > > Implicit synchronization can also happen across processes. Wayland,<br>
> > > for instance, is currently built on implicit sync where the client<br>
> > > does their rendering and then does a hand-off (via wl_surface::commit)<br>
> > > to tell the compositor it's done at which point the compositor can now<br>
> > > texture from the surface. The hand-off ensures that the client's<br>
> > > OpenGL API calls happen before the server's OpenGL API calls.<br>
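
(For readers who haven't seen the hand-off in protocol terms, it boils down to the sketch below; with EGL this is effectively what eglSwapBuffers() does under the hood. 'surface', 'client_buffer', 'width' and 'height' are assumed to exist.)

  /* Client-side hand-off using the core Wayland protocol directly. */
  wl_surface_attach(surface, client_buffer, 0, 0);
  wl_surface_damage(surface, 0, 0, width, height);
  wl_surface_commit(surface);   /* from here on the compositor may texture from it */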
> > > <br>
> > > A good example of explicit synchronization is the Vulkan API. There,<br>
> > > a client (or multiple clients) can simultaneously build command<br>
> > > buffers in different threads where one of those command buffers<br>
> > > renders to an image and the other textures from it and then submit<br>
> > > both of them at the same time with instructions to the driver for<br>
> > > which order to execute them in. The execution order is described via<br>
> > > the VkSemaphore primitive. With the new VK_KHR_timeline_semaphore<br>
> > > extension, you can even submit the work which does the texturing<br>
> > > BEFORE the work which does the rendering and the driver will sort it<br>
> > > out.<br>
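
(A minimal sketch of that explicit ordering, assuming 'queue', 'sem', 'render_cb' and 'texture_cb' already exist; the two submissions are ordered by the semaphore, not by CPU call order.)

  VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;

  const VkSubmitInfo render_submit = {
          .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
          .commandBufferCount = 1,
          .pCommandBuffers = &render_cb,        /* writes the image */
          .signalSemaphoreCount = 1,
          .pSignalSemaphores = &sem,
  };
  const VkSubmitInfo texture_submit = {
          .sType = VK_STRUCTURE_TYPE_SUBMIT_INFO,
          .waitSemaphoreCount = 1,
          .pWaitSemaphores = &sem,              /* waits for the rendering */
          .pWaitDstStageMask = &wait_stage,
          .commandBufferCount = 1,
          .pCommandBuffers = &texture_cb,       /* reads the image */
  };

  vkQueueSubmit(queue, 1, &render_submit, VK_NULL_HANDLE);
  vkQueueSubmit(queue, 1, &texture_submit, VK_NULL_HANDLE);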
> > > <br>
> > > The #1 problem with implicit synchronization (which explicit solves)<br>
> > > is that it leads to a lot of over-synchronization both in client space<br>
> > > and in driver/device space. The client has to synchronize a lot more<br>
> > > because it has to ensure that the API calls happen in a particular<br>
> > > order. The driver/device have to synchronize a lot more because they<br>
> > > never know what is going to end up being a synchronization point as an<br>
> > > API call on another thread/process may occur at any time. As we move<br>
> > > to more and more multi-threaded programming this synchronization (on<br>
> > > the client-side especially) becomes more and more painful.<br>
> > > <br>
> > > <br>
> > > ## Current status in Linux<br>
> > > <br>
> > > Implicit synchronization in Linux works via the kernel's internal<br>
> > > dma_buf and dma_fence data structures. A dma_fence is a tiny object<br>
> > > which represents the "done" status for some bit of work. Typically,<br>
> > > dma_fences are created as a by-product of someone submitting some bit<br>
> > > of work (say, 3D rendering) to the kernel. The dma_buf object has a<br>
> > > set of dma_fences on it representing shared (read) and exclusive<br>
> > > (write) access to the object. When work is submitted which, for<br>
> > > instance renders to the dma_buf, it's queued waiting on all the fences<br>
> > > on the dma_buf, and a dma_fence is created representing the end of<br>
> > > said rendering work and installed as the dma_buf's exclusive<br>
> > > fence. This way, the kernel can manage all its internal queues (3D<br>
> > > rendering, display, video encode, etc.) and know which things to<br>
> > > submit in what order.<br>
> > > <br>
> > > For the last few years, we've had sync_file in the kernel and it's<br>
> > > plumbed into some drivers. A sync_file is just a wrapper around a<br>
> > > single dma_fence. A sync_file is typically created as a by-product of<br>
> > > submitting work (3D, compute, etc.) to the kernel and is signaled when<br>
> > > that work completes. When a sync_file is created, it is guaranteed by<br>
> > > the kernel that it will become signaled in finite time and, once it's<br>
> > > signaled, it remains signaled for the rest of time. A sync_file is<br>
> > > represented in UAPIs as a file descriptor and can be used with normal<br>
> > > file APIs such as dup(). It can be passed into another UAPI which<br>
> > > does some bit of queued work and the submitted work will wait for the<br>
> > > sync_file to be triggered before executing. A sync_file also supports<br>
> > > poll() if you want to wait on it manually.<br>
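
Right - and waiting on a sync_file manually from userspace really is just a poll() on the fd. A minimal sketch (hypothetical helper, untested):

  #include <errno.h>
  #include <poll.h>

  /* Block until the fence behind a sync_file fd signals, or until
   * 'timeout_ms' expires.  Returns 0 when signalled, -1 on timeout/error. */
  static int sync_file_wait(int fd, int timeout_ms)
  {
          struct pollfd pfd = { .fd = fd, .events = POLLIN };
          int ret;

          do {
                  ret = poll(&pfd, 1, timeout_ms);
          } while (ret < 0 && errno == EINTR);

          return ret == 1 ? 0 : -1;
  }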
> > > <br>
> > > Unfortunately, sync_file is not broadly used and not all kernel GPU<br>
> > > drivers support it. Here's a very quick overview of my understanding<br>
> > > of the status of various components (I don't know the status of<br>
> > > anything in the media world):<br>
> > > <br>
> > > - Vulkan: Explicit synchronization all the way but we have to go<br>
> > > implicit as soon as we interact with a window-system. Vulkan has APIs<br>
> > > to import/export sync_files to/from its VkSemaphore and VkFence<br>
> > > synchronization primitives.<br>
> > > - OpenGL: Implicit all the way. There are some EGL extensions to<br>
> > > enable some forms of explicit sync via sync_file but OpenGL itself is<br>
> > > still implicit.<br>
> > > - Wayland: Currently depends on implicit sync in the kernel (accessed<br>
> > > via EGL/OpenGL). There is an unstable extension to allow passing<br>
> > > sync_files around but it's questionable how useful it is right now<br>
> > > (more on that later).<br>
> > > - X11: With present, it has these "explicit" fence objects but<br>
> > > they're always a shmfence which lets the X server and client do a<br>
> > > userspace CPU-side hand-off without going over the socket (and<br>
> > > round-tripping through the kernel). However, the only thing that<br>
> > > fence does is order the OpenGL API calls in the client and server and<br>
> > > the real synchronization is still implicit.<br>
> > > - linux/i915/gem: Fully supports using sync_file or syncobj for explicit<br>
> > > sync.<br>
> > > - linux/amdgpu: Supports sync_file and syncobj but it still<br>
> > > implicitly syncs sometimes due to its internal memory residency<br>
> > > handling which can lead to over-synchronization.<br>
> > > - KMS: Implicit sync all the way. There are no KMS APIs which take<br>
> > > explicit sync primitives.<br>
> > <br>
> > Correction: Apparently, I missed some things. If you use atomic, KMS<br>
> > does have explicit in- and out-fences. Non-atomic users (e.g. X11)<br>
> > are still in trouble but most Wayland compositors use atomic these<br>
> > days.<br>
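
For reference, with atomic the fences are plumbed through the IN_FENCE_FD plane property and the OUT_FENCE_PTR CRTC property. A rough sketch (property IDs assumed to have been looked up beforehand, error handling omitted):

  #include <stdint.h>
  #include <xf86drm.h>
  #include <xf86drmMode.h>

  /* One atomic commit with an explicit in-fence and out-fence.
   * 'fb_prop' and 'in_fence_prop' are the plane's FB_ID and IN_FENCE_FD
   * property IDs, 'out_fence_prop' the CRTC's OUT_FENCE_PTR property ID,
   * all looked up elsewhere via drmModeObjectGetProperties(). */
  static int flip_with_fences(int drm_fd, uint32_t crtc_id, uint32_t plane_id,
                              uint32_t fb_prop, uint32_t in_fence_prop,
                              uint32_t out_fence_prop, uint32_t fb_id,
                              int in_fence_fd, int *out_fence_fd)
  {
          drmModeAtomicReq *req = drmModeAtomicAlloc();
          int ret;

          *out_fence_fd = -1;
          drmModeAtomicAddProperty(req, plane_id, fb_prop, fb_id);
          drmModeAtomicAddProperty(req, plane_id, in_fence_prop, in_fence_fd);
          drmModeAtomicAddProperty(req, crtc_id, out_fence_prop,
                                   (uint64_t)(uintptr_t)out_fence_fd);

          /* KMS waits for in_fence_fd before scanning out fb_id; the kernel
           * writes back a sync_file fd that signals once this flip completes. */
          ret = drmModeAtomicCommit(drm_fd, req, DRM_MODE_ATOMIC_NONBLOCK, NULL);
          drmModeAtomicFree(req);
          return ret;
  }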
> > <br>
> > > - v4l: ???<br>
> > > - gstreamer: ???<br>
> > > - Media APIs such as vaapi etc.: ???<br>
> <br>
> GStreamer is a consumer of V4L2, VAAPI and other stuff. Asynchronous buffer<br>
> synchronisation is something we already do with GL (even if limited). We place a<br>
> GLSync object in the pipeline and attach it to the related GstBuffer. We wait on<br>
> these GLSync objects as late as possible (or supersede the sync if we queue more work<br>
> into the same GL context). That requires a special mode of operation, of course.<br>
> We don't usually like making lazy blocking calls implicit, as it tends to cause<br>
> random issues. If we need to wait, we think it's better to wait in the module<br>
> that is responsible, so in general we try to negotiate and fall back locally<br>
> (it's plugin-based, so this can be really messy otherwise).<br>
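
For anyone unfamiliar with it, the GstGLSyncMeta pattern Nicolas describes looks roughly like this (a sketch from memory, untested; in a real element these calls are marshalled onto the GL thread):

  #include <gst/gl/gl.h>

  /* Producer: record a sync point in the producing GL context and attach
   * it to the buffer before pushing the buffer downstream. */
  static void attach_gl_sync(GstGLContext *context, GstBuffer *buffer)
  {
          GstGLSyncMeta *sync = gst_buffer_add_gl_sync_meta(context, buffer);
          gst_gl_sync_meta_set_sync_point(sync, context);
  }

  /* Consumer: wait as late as possible, in the consuming context, right
   * before the texture is actually used (or supersede it with a new sync
   * point if more work is queued into the same context). */
  static void wait_gl_sync(GstGLContext *context, GstBuffer *buffer)
  {
          GstGLSyncMeta *sync = gst_buffer_get_gl_sync_meta(buffer);
          if (sync)
                  gst_gl_sync_meta_wait(sync, context);
  }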
> <br>
> So basically this problem needs to be solved in V4L2, VAAPI and other lower-<br>
> level APIs first. We need an API that provides us these fences (in or out), and then<br>
> we can consider using them. For V4L2 there was an attempt, but it was a bit of<br>
> a misfit. Your proposal could work (it would need to be tested, I guess), but it does not<br>
> solve some of the other issues that were discussed. Notably for camera capture, where<br>
> the HW timestamp is captured at about the same time the frame is ready. But the<br>
> timestamp is not part of the payload, so you need an entire API to asynchronously<br>
> deliver that metadata. It's the biggest pain point I've found; such an API would<br>
> be quite invasive, or, if made really generic, might just never be adopted widely<br>
> enough.<br>
<br>
Another issue is that V4L2 doesn't offer any guarantee on job ordering.<br>
When you queue multiple buffers for camera capture for instance, you<br>
don't know until the capture completes in which buffer the frame has been<br>
captured. In the normal case buffers are processed in sequence, but if<br>
an error occurs during capture, they can be recycled internally and put<br>
to the back of the queue. Unless I'm mistaken, this problem also exists<br>
with stateful codecs. And if you don't know in advance which buffer you<br>
will receive from the device, the usefulness of fences becomes very<br>
questionable :-)<br>
<br>
> There are other elements that would implement fencing, notably kmssink, but no<br>
> one has actually dared porting it to atomic KMS, so clearly there is very little<br>
> community interest. glimagesink could clearly benefit. Right now, if we import a<br>
> DMABuf and that DMABuf is used for rendering, an implicit fence is attached<br>
> of which we are unaware. Philipp Zabel is working on a patch so that V4L2 QBUF would<br>
> wait, but waiting in QBUF is not allowed if O_NONBLOCK was set (which GStreamer<br>
> uses), so the operation will just fail where it worked before (breaking<br>
> userspace). If it were an explicit fence, we could handle that in GStreamer<br>
> cleanly, as we do for new APIs.<br>
> <br>
> > > ## Chicken and egg problems<br>
> > > <br>
> > > Ok, this is where it starts getting depressing. I made the claim<br>
> > > above that Wayland has an explicit synchronization protocol that's of<br>
> > > questionable usefulness. I would claim that basically any bit of<br>
> > > plumbing we do through window systems is currently of questionable<br>
> > > usefulness. Why?<br>
> > > <br>
> > > From my perspective, as a Vulkan driver developer, I have to deal with<br>
> > > the fact that Vulkan is an explicit sync API but Wayland and X11<br>
> > > aren't. Unfortunately, the Wayland extension solves zero problems for<br>
> > > me because I can't really use it unless it's implemented in all of the<br>
> > > compositors. Until every Wayland compositor I care about my users<br>
> > > being able to use (which is basically all of them) supports the<br>
> > > extension, I have to continue to carry around my pile of hacks to keep<br>
> > > implicit sync and Vulkan working nicely together.<br>
> > > <br>
> > > From the perspective of a Wayland compositor (I used to play in this<br>
> > > space), they'd love to implement the new explicit sync extension but<br>
> > > can't. Sure, they could wire up the extension, but the moment they go<br>
> > > to flip a client buffer to the screen directly, they discover that KMS<br>
> > > doesn't support any explicit sync APIs.<br>
> > <br>
> > As per the above correction, Wayland compositors aren't nearly as bad<br>
> > off as I initially thought. There may still be weird screen capture<br>
> > cases but the normal cases of compositing and displaying via<br>
> > KMS/atomic should be in reasonably good shape.<br>
> > <br>
> > > So, yes, they can technically<br>
> > > implement the extension assuming the EGL stack they're running on has<br>
> > > the sync_file extensions but any client buffers which come in using<br>
> > > the explicit sync Wayland extension have to be composited and can't be<br>
> > > scanned out directly. As a 3D driver developer, I absolutely don't<br>
> > > want compositors doing that because my users will complain about<br>
> > > performance issues due to the extra blit.<br>
> > > <br>
> > > Ok, so let's say we get KMS wired up with explicit sync. That solves<br>
> > > all our problems, right? It does, right up until someone decides that<br>
> > > they want to screen capture their Wayland session via some hardware<br>
> > > media encoder that doesn't support explicit sync. Now we have to<br>
> > > plumb it all the way through the media stack, gstreamer, etc. Great,<br>
> > > so let's do that! Oh, but gstreamer won't want to plumb it through<br>
> > > until they're guaranteed that they can use explicit sync when<br>
> > > displaying on X11 or Wayland. Are you seeing the problem?<br>
> > > <br>
> > > To make matters worse, since most things are doing implicit<br>
> > > synchronization today, it's really easy to get your explicit<br>
> > > synchronization wrong and never notice. If you forget to pass a<br>
> > > sync_file into one place (say you never notice KMS doesn't support<br>
> > > them), it will probably work anyway thanks to all the implicit sync<br>
> > > that's going on elsewhere.<br>
> > > <br>
> > > So, clearly, we all need to go write piles of code that we can't<br>
> > > actually properly test until everyone else has written their piece and<br>
> > > then we use explicit sync if and only if all components support it.<br>
> > > Really? We're going to do multiple years of development and then just<br>
> > > hope it works when we finally flip the switch? That doesn't sound<br>
> > > like a good plan to me.<br>
> > > <br>
> > > <br>
> > > ## A proposal: Implicit and explicit sync together<br>
> > > <br>
> > > How to solve all these chicken-and-egg problems is something I've been<br>
> > > giving quite a bit of thought (and talking with many others about) in<br>
> > > the last couple of years. One motivation for this is that we have to<br>
> > > deal with a mismatch in Vulkan. Another motivation is that I'm<br>
> > > becoming increasingly unhappy with the way that synchronization,<br>
> > > memory residency, and command submission are inherently intertwined in<br>
> > > i915 and would like to break things apart. Towards that end, I have<br>
> > > an actual proposal.<br>
> > > <br>
> > > A couple weeks ago, I sent a series of patches to the dri-devel<br>
> > > mailing list which adds a pair of new ioctls to dma-buf which allow<br>
> > > userspace to manually import or export a sync_file from a dma-buf.<br>
> > > The idea is that something like a Wayland compositor can switch to<br>
> > > 100% explicit sync internally once the ioctls are available. If it gets<br>
> > > buffers in from a client that doesn't use the explicit sync extension,<br>
> > > it can pull a sync_file from the dma-buf and use that exactly as it<br>
> > > would a sync_file passed via the explicit sync extension. When it<br>
> > > goes to scan out a user buffer and discovers that KMS doesn't accept<br>
> > > sync_files (or if it tries to use that pesky media encoder no one has<br>
> > > converted), it can take its sync_file for display and stuff it into<br>
> > > the dma-buf before handing it to KMS.<br>
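
To make that concrete, I'd expect compositor-side usage to look roughly like this (struct and ioctl names as I understand them from the RFC, so purely illustrative - they may well change before anything lands):

  #include <linux/dma-buf.h>
  #include <sys/ioctl.h>

  /* Client buffer arrived without the explicit-sync extension: pull the
   * implicit fences out of the dma-buf as a sync_file and use that as the
   * acquire fence for compositing (we only read, so wait for writers). */
  static int export_acquire_fence(int dmabuf_fd)
  {
          struct dma_buf_export_sync_file arg = {
                  .flags = DMA_BUF_SYNC_READ,
                  .fd = -1,
          };

          if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_EXPORT_SYNC_FILE, &arg) < 0)
                  return -1;
          return arg.fd;
  }

  /* KMS (or the unconverted media encoder) can't take a sync_file: stuff
   * the client's render-complete fence back into the dma-buf as a write
   * fence, so implicit-sync readers like KMS still wait on it. */
  static int import_display_fence(int dmabuf_fd, int render_done_fd)
  {
          struct dma_buf_import_sync_file arg = {
                  .flags = DMA_BUF_SYNC_WRITE,
                  .fd = render_done_fd,
          };

          return ioctl(dmabuf_fd, DMA_BUF_IOCTL_IMPORT_SYNC_FILE, &arg);
  }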
> > > <br>
> > > Along with the kernel patches, I've also implemented support for this<br>
> > > in the Vulkan WSI code used by ANV and RADV. With those patches, the<br>
> > > only requirement on the Vulkan drivers is that you be able to export<br>
> > > any VkSemaphore as a sync_file and temporarily import a sync_file into<br>
> > > any VkFence or VkSemaphore. As long as that works, the core Vulkan<br>
> > > driver only ever sees explicit synchronization via sync_file. The WSI<br>
> > > code uses these new ioctls to translate the implicit sync of X11 and<br>
> > > Wayland to the explicit sync the Vulkan driver wants.<br>
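
For reference, the export/import the WSI code relies on is just the standard external-fd path, roughly as below (a sketch, error handling omitted; in real code the KHR entry points are resolved with vkGetDeviceProcAddr()):

  #include <vulkan/vulkan.h>

  /* Export a VkSemaphore that will be signalled by submitted work as a
   * sync_file fd, for handing to the implicit-sync side. */
  static int semaphore_to_sync_file(VkDevice dev, VkSemaphore sem)
  {
          const VkSemaphoreGetFdInfoKHR info = {
                  .sType = VK_STRUCTURE_TYPE_SEMAPHORE_GET_FD_INFO_KHR,
                  .semaphore = sem,
                  .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
          };
          int fd = -1;

          vkGetSemaphoreFdKHR(dev, &info, &fd);
          return fd;
  }

  /* Temporarily import a sync_file into a VkSemaphore, so the next queue
   * submission that waits on the semaphore waits on the implicit fence. */
  static void sync_file_to_semaphore(VkDevice dev, VkSemaphore sem, int fd)
  {
          const VkImportSemaphoreFdInfoKHR info = {
                  .sType = VK_STRUCTURE_TYPE_IMPORT_SEMAPHORE_FD_INFO_KHR,
                  .semaphore = sem,
                  .flags = VK_SEMAPHORE_IMPORT_TEMPORARY_BIT,
                  .handleType = VK_EXTERNAL_SEMAPHORE_HANDLE_TYPE_SYNC_FD_BIT,
                  .fd = fd,
          };

          vkImportSemaphoreFdKHR(dev, &info);
  }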
> > > <br>
> > > I'm hoping (and here's where I want a sanity check) that a simple API<br>
> > > like this will allow us to finally start moving the Linux ecosystem<br>
> > > over to explicit synchronization one piece at a time in a way that's<br>
> > > actually correct. (No Wayland explicit sync with compositors hoping<br>
> > > KMS magically works even though it doesn't have a sync_file API.)<br>
> > > Once some pieces in the ecosystem start moving, there will be<br>
> > > motivation to start moving others and maybe we can actually build the<br>
> > > momentum to get most everything converted.<br>
> > > <br>
> > > For reference, you can find the kernel RFC patches and mesa MR here:<br>
> > > <br>
> > > <a href="https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html" rel="noreferrer" target="_blank">https://lists.freedesktop.org/archives/dri-devel/2020-March/258833.html</a><br>
> > > <br>
> > > <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037" rel="noreferrer" target="_blank">https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/4037</a><br>
> > > <br>
> > > At this point, I welcome your thoughts, comments, objections, and<br>
> > > maybe even help/review. :-)<br>
> > > <br>
> > > --Jason Ekstrand<br>
> <br>
<br>
-- <br>
Regards,<br>
<br>
Laurent Pinchart<br>
</blockquote></div>