Bad DMA from CEDAR card
Benjamin Herrenschmidt
benh at kernel.crashing.org
Tue Oct 23 00:42:10 PDT 2012
Hi Folks !
I've been tracking a problem with a CEDAR card (FirePro 2270 X1
"server") on a POWER server.
The cool thing is that our PCI-E bridge's fancy error handling kicks in
and freezes all traffic to that device as soon as the error is
detected. By sprinkling some test code around, here is what I have
found:
- The initial symptom is that the system reports an "EEH" (our error
handling system) error on the card at boot. Under our normal firmware &
Linux, that pretty much causes the card to be taken out.
- With some hand-made debug code, I have got some more details about the
error, which is a bad DMA write from the card to an address that isn't
mapped in our IOMMU (i.e. something that wasn't the result of a
dma_map_sg, dma_alloc_coherent, etc...).
- The error seems to always happen while ATOM is executing from the
"atom_enable_crtc" table called from:
[c000001f7e206fb0] [c000000000440148] .atombios_enable_crtc+0x38/0x50
[c000001f7e207030] [c000000000441784] .atombios_crtc_dpms+0x104/0x1a0
[c000001f7e2070c0] [c000000000441d68] .atombios_crtc_disable+0x28/0x170
[c000001f7e207190] [c0000000003e6ef4] .drm_helper_disable_unused_functions+0x144/0x230
[c000001f7e207230] [c0000000003e69ec] .drm_fb_helper_initial_config+0x5c/0x310
[c000001f7e207340] [c00000000046862c] .radeon_fbdev_init+0xdc/0x190
[c000001f7e2073e0] [c0000000004620c0] .radeon_modeset_init+0x740/0xc90
[c000001f7e2074b0] [c000000000438cfc] .radeon_driver_load_kms+0x14c/0x1a0
[c000001f7e207550] [c0000000003f6e14] .drm_get_pci_dev+0x1c4/0x2e0
[c000001f7e207600] [c00000000079aea8] .radeon_pci_probe+0xc4/0xe8
[c000001f7e207690] [c000000000378d60] .pci_device_probe+0x1a0/0x1c0
[c000001f7e207740] [c0000000004db054] .driver_probe_device+0xe4/0x2c0
[c000001f7e2077e0] [c0000000004db33c] .__driver_attach+0x10c/0x110
[c000001f7e207870] [c0000000004d8b2c] .bus_for_each_dev+0x8c/0xe0
[c000001f7e207920] [c0000000004daaa8] .driver_attach+0x28/0x40
[c000001f7e2079a0] [c0000000004da3d8] .bus_add_driver+0x228/0x300
[c000001f7e207a40] [c0000000004db8b0] .driver_register+0xa0/0x1e0
[c000001f7e207ae0] [c000000000378eb8] .__pci_register_driver+0x48/0x60
[c000001f7e207b60] [c0000000003f709c] .drm_pci_init+0x16c/0x1a0
[c000001f7e207c10] [c0000000009eae5c] .radeon_init+0xb4/0xd0
The error is detected in an atom_op_jump() that loops forever because of
the isolation: once the device is frozen, all MMIO reads return
ffffffff. The actual error might have happened slightly earlier (see
below).
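To illustrate why an isolated device can make a polling loop spin, and one way a loop can notice, here is a minimal user-space sketch. It is not the radeon or EEH code: struct fake_dev, fake_mmio_read() and poll_until_clear() are hypothetical names invented for the illustration, and the "register" is simulated. It only assumes what is stated above, that reads from a frozen device come back as all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a device register; not a radeon or EEH API. */
struct fake_dev {
    bool frozen;      /* true once the bridge has isolated the device */
    uint32_t status;  /* value the register returns while healthy */
};

/* Simulated MMIO read: an isolated device reads back as all ones. */
static uint32_t fake_mmio_read(const struct fake_dev *dev)
{
    return dev->frozen ? 0xffffffffu : dev->status;
}

enum poll_result { POLL_OK, POLL_TIMEOUT, POLL_DEVICE_GONE };

/*
 * Wait for (reg & mask) == 0, but stop early if the read returns all
 * ones, and never read more than max_iter times.
 */
static enum poll_result poll_until_clear(const struct fake_dev *dev,
                                         uint32_t mask, unsigned max_iter)
{
    for (unsigned i = 0; i < max_iter; i++) {
        uint32_t v = fake_mmio_read(dev);

        if (v == 0xffffffffu)
            return POLL_DEVICE_GONE;
        if ((v & mask) == 0)
            return POLL_OK;
    }
    return POLL_TIMEOUT;
}
```

Without the all-ones test and the iteration bound, a wait on a busy bit would never end once the device is frozen, since every masked bit reads as set. All ones can also be a legitimate register value on some hardware, so a real check would need to confirm the freeze through the platform's error handling rather than trust the value alone.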
- From the backtrace, it seems to be trying to *disable* CRTCs (I would
have understood if it was trying to incorrectly enable one which is
sourcing pixels from the wrong address...)
- I've added delays at various stages of
radeon_modeset_init() and radeon_fbdev_init(), and the error still
appears to be fairly well localized to the execution of that table, so
it looks like it's not some stray DMA that happens to hit at that moment
due to some timing, but something specifically triggered by that table
execution.
- I've turned on atom_debug right before the call to
drm_fb_helper_initial_config() in radeon_fbdev_init() and added a freeze
check between each op, and here's the result. I don't really have the
brain cycles to try to parse that right now :-) I used to be able to,
but heh, that was a long time ago... that's where I'm handing you the
hot potato, hoping it will make some obvious sense :-)
As you can see, it's not doing much before the failure:
>> execute D7E2 (len 24, WS 0, PS 4)
SET_ATI_PORT @ 0xD7E8
port: 0 (MM)
CLEAR_REG @ 0xD7EB
dst:
AND_REG @ 0xD7EF
dst:
src:
dst:
OR_REG @ 0xD7F4
dst:
src:
dst:
EOT @ 0xD7F9
<<
>> execute D7CA (len 24, WS 0, PS 4)
SET_ATI_PORT @ 0xD7D0
port: 0 (MM)
CLEAR_REG @ 0xD7D3
dst:
AND_REG @ 0xD7D7
dst:
src:
dst:
OR_REG @ 0xD7DC
dst:
src:
dst:
EOT @ 0xD7E1
<<
>> execute BADE (len 25, WS 0, PS 0)
0001:01:00.0: EEH freeze detected, fstate=3 pcierr=9
[ here we have detected the freeze; the stuff below is from my own
diagnostic code. I will decode some of it later if it's of use.
Basically it says the freeze occurred due to a DMA error to an unmapped
DMA address, though I am not 100% sure of the actual DMA address: I
think what it gives me is actually the address of the iommu "PTE" that
had the valid bit clear, and I need to do some work to turn that back
into a page or something ... Unfortunately the packet hasn't been
captured in the TLP header capture of the AER function.
]
PHB 1 diagnostic data:
brdgCtl = 0x00000002
portStatusReg = 0x00000000
rootCmplxStatus = 0x00000000
busAgentStatus = 0x00000000
deviceStatus = 0x0000000f
slotStatus = 0x016003c0
linkStatus = 0xa0120008
devCmdStatus = 0x00100147
devSecStatus = 0x00000000
rootErrorStatus = 0x00000000
uncorrErrorStatus = 0x00000000
corrErrorStatus = 0x00000000
tlpHdr1 = 0x00000000
tlpHdr2 = 0x00000000
tlpHdr3 = 0x00000000
tlpHdr4 = 0x00000000
sourceId = 0x00000000
errorClass = 0x0000000000000000
correlator = 0x0000000000000000
p7iocPlssr = 0x0000001c00000000
p7iocCsr = 0x0000000000000000
lemFir = 0x0200000004000000
lemErrorMask = 0x1249a1147f500f2c
lemWOF = 0x0000000000000000
phbErrorStatus = 0x0000000000000000
phbFirstErrorStatus = 0x0000000000000000
phbErrorLog0 = 0x0000000000000000
phbErrorLog1 = 0x0000000000000000
mmioErrorStatus = 0x0200000000000000
mmioFirstErrorStatus = 0x0200000000000000
mmioErrorLog0 = 0x02040070004a3da1
mmioErrorLog1 = 0x98006e9800000000
dma0ErrorStatus = 0x0000000000004000
dma0FirstErrorStatus = 0x0000000000004000
dma0ErrorLog0 = 0x160007dbe0000002
dma0ErrorLog1 = 0x1500000000000002
dma1ErrorStatus = 0x0000000000000000
dma1FirstErrorStatus = 0x0000000000000000
dma1ErrorLog0 = 0x0000000000000000
dma1ErrorLog1 = 0x0000000000000000
PE[ 2] PESTA = 0x8000302500000000
PESTB = 0x8000001f6f800000
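The note before the dump says the logged value is probably the address of the IOMMU table entry rather than the DMA address itself. Going from one to the other is simple arithmetic once the table layout is known. The sketch below shows that arithmetic only; it assumes a flat table of 8-byte entries, one per 4 KiB IOMMU page, and the table base and DMA window start are hypothetical parameters. None of these values is taken from the dump above, and the function name is made up.

```c
#include <stdint.h>

#define TCE_ENTRY_SIZE   8u   /* assumption: one 64-bit entry per page */
#define IOMMU_PAGE_SHIFT 12u  /* assumption: 4 KiB IOMMU pages */

/*
 * Convert the address of a table entry into the DMA (bus) address of
 * the page that entry would translate.
 *   pte_addr     - address of the entry that had the valid bit clear
 *   table_base   - address of entry 0 of the same table
 *   window_start - DMA address covered by entry 0
 */
static uint64_t pte_addr_to_dma_addr(uint64_t pte_addr,
                                     uint64_t table_base,
                                     uint64_t window_start)
{
    uint64_t index = (pte_addr - table_base) / TCE_ENTRY_SIZE;

    return window_start + (index << IOMMU_PAGE_SHIFT);
}
```

For example, with a table at 0x1000 and a window starting at DMA address 0, the entry at 0x1028 is index 5 and corresponds to DMA address 0x5000.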
foo !
------------[ cut here ]------------
WARNING: at arch/powerpc/platforms/powernv/pci.c:312
Modules linked in:
NIP: c00000000003ffe0 LR: c00000000003ffdc CTR: 0000000030009e5c
REGS: c000001f7e206ad0 TRAP: 0700 Not tainted (3.7.0-rc2-00006-gae38062-dirty)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI> CR: 28002084 XER: 20000000
SOFTE: 1
CFAR: c0000000007946ec
TASK = c000001f71340000[1] 'swapper/0' THREAD: c000001f7e204000 CPU: 57
GPR00: c00000000003ffdc c000001f7e206d50 c000000000b591a0 0000000000000005
GPR04: 0000000000000000 00000000000002c2 9000000000009032 c0000000008e5e80
GPR08: c000001f7e206d88 0000000080000039 0000000000000000 000000000000e086
GPR12: 0000000028002082 c00000000ff4ab00 c00000000000b410 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 c00000000091c9f0 c000001f7e206e78 c000001f7e206e70
GPR24: 000000000000bade 0000000000000019 0000000000000000 0000000000000002
GPR28: 0000000000000002 c000001f7130a800 c000000000aa4300 c000000000ce5800
NIP [c00000000003ffe0] .pnv_pci_check_eeh+0x130/0x150
LR [c00000000003ffdc] .pnv_pci_check_eeh+0x12c/0x150
Call Trace:
[c000001f7e206d50] [c00000000003ffdc] .pnv_pci_check_eeh+0x12c/0x150 (unreliable)
[c000001f7e206e00] [c00000000044b03c] .atom_execute_table_locked+0x1ac/0x380
[c000001f7e206f10] [c00000000044e724] .atom_execute_table+0x54/0x80
[c000001f7e206fb0] [c0000000004400b8] .atombios_enable_crtc_memreq+0x38/0x50
[c000001f7e207030] [c0000000004417bc] .atombios_crtc_dpms+0x17c/0x1a0
[c000001f7e2070c0] [c000000000441d28] .atombios_crtc_disable+0x28/0x170
[c000001f7e207190] [c0000000003e6eb4] .drm_helper_disable_unused_functions+0x144/0x230
[c000001f7e207230] [c0000000003e69ac] .drm_fb_helper_initial_config+0x5c/0x310
[c000001f7e207340] [c0000000004686ac] .radeon_fbdev_init+0x13c/0x1e0
[c000001f7e2073e0] [c0000000004620e0] .radeon_modeset_init+0x740/0xc90
[c000001f7e2074b0] [c000000000438cbc] .radeon_driver_load_kms+0x14c/0x1a0
[c000001f7e207550] [c0000000003f6dd4] .drm_get_pci_dev+0x1c4/0x2e0
[c000001f7e207600] [c00000000079af18] .radeon_pci_probe+0xc4/0xe8
[c000001f7e207690] [c000000000378d20] .pci_device_probe+0x1a0/0x1c0
[c000001f7e207740] [c0000000004db0c4] .driver_probe_device+0xe4/0x2c0
[c000001f7e2077e0] [c0000000004db3ac] .__driver_attach+0x10c/0x110
[c000001f7e207870] [c0000000004d8b9c] .bus_for_each_dev+0x8c/0xe0
[c000001f7e207920] [c0000000004dab18] .driver_attach+0x28/0x40
[c000001f7e2079a0] [c0000000004da448] .bus_add_driver+0x228/0x300
[c000001f7e207a40] [c0000000004db920] .driver_register+0xa0/0x1e0
[c000001f7e207ae0] [c000000000378e78] .__pci_register_driver+0x48/0x60
[c000001f7e207b60] [c0000000003f705c] .drm_pci_init+0x16c/0x1a0
[c000001f7e207c10] [c0000000009eae5c] .radeon_init+0xb4/0xd0
[c000001f7e207c90] [c00000000000ac04] .do_one_initcall+0x64/0x1e0
[c000001f7e207d50] [c00000000000b62c] .kernel_init+0x21c/0x3e0
[c000001f7e207e30] [c000000000009b0c] .ret_from_kernel_thread+0x5c/0x64
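The "freeze check between each op" approach used for the debug output above can be sketched in isolation as follows. This is a toy interpreter, not the ATOM one: struct vm, op_step(), check_frozen() and run_table() are hypothetical names, and the freeze is simulated by a counter rather than read from the bridge. The only point is that checking after every op narrows the error down to the op that preceded the first positive check.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy interpreter state; the freeze is simulated, not read from hardware. */
struct vm {
    int executed;   /* number of ops run so far */
    bool frozen;    /* set when the simulated device gets isolated */
    int freeze_at;  /* simulate a freeze caused by the Nth op (0 = never) */
};

typedef void (*op_fn)(struct vm *);

/* A dummy op; a real one would touch device registers. */
static void op_step(struct vm *vm)
{
    vm->executed++;
    if (vm->executed == vm->freeze_at)
        vm->frozen = true;
}

/* Stand-in for a platform call that asks whether the device is frozen. */
static bool check_frozen(const struct vm *vm)
{
    return vm->frozen;
}

/*
 * Run n ops, checking for a freeze after each one. Returns the index of
 * the op after which the freeze was first seen, or -1 if none was seen.
 */
static int run_table(struct vm *vm, const op_fn *ops, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        ops[i](vm);
        if (check_frozen(vm))
            return (int)i;
    }
    return -1;
}
```

A check like this only bounds when the freeze became visible. If the offending DMA is issued asynchronously by the device, the op that caused it may be earlier than the one after which the check first fires.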
Cheers,
Ben.