amdgpu didn't start with pci=nocrs parameter, get error "Fatal error during GPU init"

Christian König ckoenig.leichtzumerken at gmail.com
Mon Feb 27 10:22:40 UTC 2023


On 24.02.23 at 17:21, Mikhail Gavrilov wrote:
> On Fri, Feb 24, 2023 at 8:31 PM Christian König
> <ckoenig.leichtzumerken at gmail.com> wrote:
>> Sorry I totally missed that you attached the full dmesg to your original
>> mail.
>>
>> Yeah, the driver did fail gracefully. But then X doesn't come up and
>> then gdm just dies.
> Are you sure that these messages should be present when the driver
> fails gracefully?

Unfortunately, yes. We could clean up the error handling a bit more so that you don't 
run into a BUG() assertion, but what essentially happens here is that we 
completely fail to talk to the hardware.

In this situation we can't even re-enable VESA or the text console anymore.

Regards,
Christian.

>
> turning off the locking correctness validator.
> CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G             L -------  ---  6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1
> Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022
> Call Trace:
>   <TASK>
>   dump_stack_lvl+0x57/0x90
>   register_lock_class+0x47d/0x490
>   __lock_acquire+0x74/0x21f0
>   ? lock_release+0x155/0x450
>   lock_acquire+0xd2/0x320
>   ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
>   ? lock_is_held_type+0xce/0x120
>   _raw_spin_lock_irqsave+0x4d/0xa0
>   ? amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
>   amdgpu_irq_disable_all+0x37/0xf0 [amdgpu]
>   amdgpu_device_fini_hw+0x43/0x2c0 [amdgpu]
>   amdgpu_driver_load_kms+0xe8/0x190 [amdgpu]
>   amdgpu_pci_probe+0x140/0x420 [amdgpu]
>   local_pci_probe+0x41/0x90
>   pci_device_probe+0xc3/0x230
>   really_probe+0x1b6/0x410
>   __driver_probe_device+0x78/0x170
>   driver_probe_device+0x1f/0x90
>   __driver_attach+0xd2/0x1c0
>   ? __pfx___driver_attach+0x10/0x10
>   bus_for_each_dev+0x8a/0xd0
>   bus_add_driver+0x141/0x230
>   driver_register+0x77/0x120
>   ? __pfx_init_module+0x10/0x10 [amdgpu]
>   do_one_initcall+0x6e/0x350
>   do_init_module+0x4a/0x220
>   __do_sys_init_module+0x192/0x1c0
>   do_syscall_64+0x5b/0x80
>   ? asm_exc_page_fault+0x22/0x30
>   ? lockdep_hardirqs_on+0x7d/0x100
>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fd58cfcb1be
> Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01
> RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
> RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be
> RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010
> RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0
> R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670
> R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0
>   </TASK>
> amdgpu: probe of 0000:03:00.0 failed with error -12
> amdgpu 0000:08:00.0: enabling device (0006 -> 0007)
> [drm] initializing kernel modesetting (RENOIR 0x1002:0x1638 0x1043:0x16C2 0xC4).
>
>
> list_add corruption. prev->next should be next (ffffffffc0940328), but was 0000000000000000. (prev=ffff8c9b734062b0).
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:30!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 14 PID: 470 Comm: (udev-worker) Tainted: G             L -------  ---  6.3.0-0.rc0.20230222git5b7c4cabbb65.3.fc39.x86_64+debug #1
> Hardware name: ASUSTeK COMPUTER INC. ROG Strix G513QY_G513QY/G513QY, BIOS G513QY.320 09/07/2022
> RIP: 0010:__list_add_valid+0x74/0x90
> Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
> 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
> 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
> RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246
> RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0
> R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48
> R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000
> FS:  00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0
> PKRU: 55555554
> Call Trace:
>   <TASK>
>   ttm_device_init+0x184/0x1c0 [ttm]
>   amdgpu_ttm_init+0xb8/0x610 [amdgpu]
>   ? _printk+0x60/0x80
>   gmc_v9_0_sw_init+0x4a3/0x7c0 [amdgpu]
>   amdgpu_device_init+0x14e5/0x2520 [amdgpu]
>   amdgpu_driver_load_kms+0x15/0x190 [amdgpu]
>   amdgpu_pci_probe+0x140/0x420 [amdgpu]
>   local_pci_probe+0x41/0x90
>   pci_device_probe+0xc3/0x230
>   really_probe+0x1b6/0x410
>   __driver_probe_device+0x78/0x170
>   driver_probe_device+0x1f/0x90
>   __driver_attach+0xd2/0x1c0
>   ? __pfx___driver_attach+0x10/0x10
>   bus_for_each_dev+0x8a/0xd0
>   bus_add_driver+0x141/0x230
>   driver_register+0x77/0x120
>   ? __pfx_init_module+0x10/0x10 [amdgpu]
>   do_one_initcall+0x6e/0x350
>   do_init_module+0x4a/0x220
>   __do_sys_init_module+0x192/0x1c0
>   do_syscall_64+0x5b/0x80
>   ? asm_exc_page_fault+0x22/0x30
>   ? lockdep_hardirqs_on+0x7d/0x100
>   entry_SYSCALL_64_after_hwframe+0x72/0xdc
> RIP: 0033:0x7fd58cfcb1be
> Code: 48 8b 0d 4d 0c 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f
> 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 8b 0d 1a 0c 0c 00 f7 d8 64 89 01 48
> RSP: 002b:00007ffd1d1065d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
> RAX: ffffffffffffffda RBX: 000055b0b5aa6d70 RCX: 00007fd58cfcb1be
> RDX: 000055b0b5a96670 RSI: 00000000016b6156 RDI: 00007fd589392010
> RBP: 00007ffd1d106690 R08: 000055b0b5a93bd0 R09: 00000000016b6ff0
> R10: 000055b5eea2c333 R11: 0000000000000246 R12: 000055b0b5a96670
> R13: 0000000000020000 R14: 000055b0b5a9c170 R15: 000055b0b5aa58a0
>   </TASK>
> Modules linked in: amdgpu(+) drm_ttm_helper hid_asus ttm asus_wmi
> iommu_v2 crct10dif_pclmul ledtrig_audio drm_buddy crc32_pclmul
> sparse_keymap gpu_sched crc32c_intel polyval_clmulni platform_profile
> hid_multitouch polyval_generic drm_display_helper nvme rfkill
> ucsi_acpi ghash_clmulni_intel nvme_core typec_ucsi serio_raw
> sp5100_tco ccp sha512_ssse3 r8169 cec typec nvme_common i2c_hid_acpi
> video i2c_hid wmi ip6_tables ip_tables fuse
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:__list_add_valid+0x74/0x90
> Code: 8d ff 0f 0b 48 89 c1 48 c7 c7 a0 3d b3 99 e8 a3 ed 8d ff 0f 0b
> 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 f8 3d b3 99 e8 8c ed 8d ff <0f> 0b
> 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 50 3e b3 99 e8 75 ed 8d
> RSP: 0018:ffffa50f81aafa00 EFLAGS: 00010246
> RAX: 0000000000000075 RBX: ffff8c9b734062b0 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000027 RDI: 00000000ffffffff
> RBP: ffff8c9b734062b0 R08: 0000000000000000 R09: ffffa50f81aaf8a0
> R10: 0000000000000003 R11: ffff8caa1d2fffe8 R12: ffff8c9b7c0a5e48
> R13: 0000000000000000 R14: ffffffffc13a6d20 R15: 0000000000000000
> FS:  00007fd58c6a5940(0000) GS:ffff8ca9d9a00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000055b0b5a955e0 CR3: 000000017e860000 CR4: 0000000000750ee0
> PKRU: 55555554
> (udev-worker) (470) used greatest stack depth: 12416 bytes left
>
> I thought that failing gracefully meant falling back to SVGA mode and
> showing the desktop with software rendering (exactly what happens when
> I blacklist the amdgpu driver). Currently the boot process gets stuck
> and the local console is unavailable.
>
>



More information about the dri-devel mailing list