[PATCH 08/27] habanalabs: add info when FD released while device still in use
Tomer Tayar
ttayar at habana.ai
Fri Feb 17 11:34:39 UTC 2023
On Thu, Feb 16, 2023 at 17:05 Stanislaw Gruszka <stanislaw.gruszka at linux.intel.com> wrote:
> On Thu, Feb 16, 2023 at 04:21:48PM +0200, Oded Gabbay wrote:
> > On Thu, Feb 16, 2023 at 2:25 PM Stanislaw Gruszka
> > <stanislaw.gruszka at linux.intel.com> wrote:
> > >
> > > On Sun, Feb 12, 2023 at 10:44:35PM +0200, Oded Gabbay wrote:
> > > > From: Tomer Tayar <ttayar at habana.ai>
> > > >
> > > > When user closes the device file descriptor, it is checked whether the
> > > > device is still in use, and a message is printed if it is.
> > >
> > > I guess this is only for debugging your user-space component?
> > > Because kernel driver should not make any assumption about
> > > user-space behavior. Closing whenever user wants should be
> > > no problem.
> > First of all, indeed the user can close the device whenever it wants.
> > We don't limit him, but we do need to track the device state, because
> > we can't allow a new user to acquire the device until it is idle (due
> > to h/w limitations).
> > Therefore, this print is not so much for debug, as it is for letting
> > the user know the device wasn't idle after he closed it, and
> > therefore, we are going to reset it to make it idle.
> > So, it is a notification that is important imo.
>
> This sounds like something that should be handled in open() with -EAGAIN
> error with eventual message in dmesg. But you know best what info
> is needed by user-space :-)
Because of the reset in this case and the cleanup it involves, this info will no longer be available at the next open().
> > > > +static void print_device_in_use_info(struct hl_device *hdev, const char *message)
> > > > +{
> > > > +	u32 active_cs_num, dmabuf_export_cnt;
> > > > +	char buf[64], *buf_ptr = buf;
> > > > +	size_t buf_size = sizeof(buf);
> > > > +	bool unknown_reason = true;
> > > > +
> > > > +	active_cs_num = hl_get_active_cs_num(hdev);
> > > > +	if (active_cs_num) {
> > > > +		unknown_reason = false;
> > > > +		compose_device_in_use_info(&buf_ptr, &buf_size, " [%u active CS]", active_cs_num);
> > > > +	}
> > > > +
> > > > +	dmabuf_export_cnt = atomic_read(&hdev->dmabuf_export_cnt);
> > > > +	if (dmabuf_export_cnt) {
> > > > +		unknown_reason = false;
> > > > +		compose_device_in_use_info(&buf_ptr, &buf_size, " [%u exported dma-buf]",
> > > > +					   dmabuf_export_cnt);
> > > > +	}
> > > > +
> > > > +	if (unknown_reason)
> > > > +		compose_device_in_use_info(&buf_ptr, &buf_size, " [unknown reason]");
> > > > +
> > > > +	dev_notice(hdev->dev, "%s%s\n", message, buf);
> > >
> > > why not print counters directly, i.e. "active cs count %u, dmabuf export count %u" ?
> > Because we wanted to print the specific reason, or unknown reason, and
> > not print all the possible counters in one line, because most of the
> > time most of the counters will be 0.
> > We plan to add more reasons so this helper simplifies the code.
>
> Ok, just replace compose_device_in_use_info() with snprintf().
> I don't think you need custom implementation of snprintf().
compose_device_in_use_info() was added so that the snprintf() return value and the advancing of the buffer pointer are handled in a single place.
However, you are correct that it is too much here, as the local buffer is sized to be large enough for the longest possible print.
We will remove compose_device_in_use_info() and use snprintf() directly.
Thanks!
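
For illustration only, here is a minimal userspace sketch of what composing the reasons with direct snprintf() calls might look like. It is not the actual driver change: format_in_use_info() and its parameters are hypothetical names, the counters are passed in as plain integers instead of being read from struct hl_device, and the clamping of the offset is an extra safety assumption that the fixed 64-byte buffer in the patch would normally not need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for the driver helper. Appends one
 * bracketed reason per nonzero counter by calling snprintf() directly,
 * or " [unknown reason]" when every counter is zero.
 * Returns the length of the resulting string (excluding the NUL).
 */
static size_t format_in_use_info(char *buf, size_t size,
				 unsigned int active_cs_num,
				 unsigned int dmabuf_export_cnt)
{
	bool unknown_reason = true;
	size_t off = 0;
	int n;

	assert(buf != NULL && size > 0);
	buf[0] = '\0';

	if (active_cs_num) {
		unknown_reason = false;
		n = snprintf(buf + off, size - off, " [%u active CS]",
			     active_cs_num);
		/* snprintf() returns the untruncated length, so clamp it */
		if (n > 0)
			off = ((size_t)n < size - off) ? off + (size_t)n : size - 1;
	}

	if (dmabuf_export_cnt) {
		unknown_reason = false;
		n = snprintf(buf + off, size - off, " [%u exported dma-buf]",
			     dmabuf_export_cnt);
		if (n > 0)
			off = ((size_t)n < size - off) ? off + (size_t)n : size - 1;
	}

	if (unknown_reason) {
		n = snprintf(buf + off, size - off, " [unknown reason]");
		if (n > 0)
			off = ((size_t)n < size - off) ? off + (size_t)n : size - 1;
	}

	return off;
}
```

In kernel code the clamping could likely be avoided by using scnprintf(), which returns the number of characters actually written rather than the untruncated length.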
> > > > + print_device_in_use_info(hdev, "User process closed FD but device still in use");
> > > > hl_device_reset(hdev, HL_DRV_RESET_HARD);
> > >
> > > You really need reset here ?
> > Yes, our h/w requires that we reset the device after the user closed
> > it. If the device is not idle after the user closed it, we hard reset
> > it.
> > If it is idle, we do a more graceful reset.
>
> Hmm, ok.
>
> Regards
> Stanislaw