Bad DMA from CEDAR card

Tue Oct 23 00:42:10 PDT 2012

Hi Folks !

I've been tracking a problem with a CEDAR card (FirePro 2270 X1
"server") on a POWER server.

The cool thing is that our PCI-E bridge's fancy error handling kicks in
and freezes all traffic to that device as soon as the error is
detected. By sprinkling some test code around, here is what I have
found:

 - The initial symptom is that the system reports an "EEH" (our error
handling system) error on the card at boot. Under our normal firmware &
Linux, that pretty much causes the card to be taken out. 

 - With some hand-made debug code, I got some more details about the
error: it is a bad DMA write from the card to an address that isn't
mapped in our IOMMU (i.e. something that wasn't the result of a
dma_map_sg, dma_alloc_coherent, etc...).

 - The error seems to always happen while ATOM is executing from the
"atom_enable_crtc" table called from:

[c000001f7e206fb0] [c000000000440148] .atombios_enable_crtc+0x38/0x50
[c000001f7e207030] [c000000000441784] .atombios_crtc_dpms+0x104/0x1a0
[c000001f7e2070c0] [c000000000441d68] .atombios_crtc_disable+0x28/0x170
[c000001f7e207190] [c0000000003e6ef4] .drm_helper_disable_unused_functions+0x144/0x230
[c000001f7e207230] [c0000000003e69ec] .drm_fb_helper_initial_config+0x5c/0x310
[c000001f7e207340] [c00000000046862c] .radeon_fbdev_init+0xdc/0x190
[c000001f7e2073e0] [c0000000004620c0] .radeon_modeset_init+0x740/0xc90
[c000001f7e2074b0] [c000000000438cfc] .radeon_driver_load_kms+0x14c/0x1a0
[c000001f7e207550] [c0000000003f6e14] .drm_get_pci_dev+0x1c4/0x2e0
[c000001f7e207600] [c00000000079aea8] .radeon_pci_probe+0xc4/0xe8
[c000001f7e207690] [c000000000378d60] .pci_device_probe+0x1a0/0x1c0
[c000001f7e207740] [c0000000004db054] .driver_probe_device+0xe4/0x2c0
[c000001f7e2077e0] [c0000000004db33c] .__driver_attach+0x10c/0x110
[c000001f7e207870] [c0000000004d8b2c] .bus_for_each_dev+0x8c/0xe0
[c000001f7e207920] [c0000000004daaa8] .driver_attach+0x28/0x40
[c000001f7e2079a0] [c0000000004da3d8] .bus_add_driver+0x228/0x300
[c000001f7e207a40] [c0000000004db8b0] .driver_register+0xa0/0x1e0
[c000001f7e207ae0] [c000000000378eb8] .__pci_register_driver+0x48/0x60
[c000001f7e207b60] [c0000000003f709c] .drm_pci_init+0x16c/0x1a0
[c000001f7e207c10] [c0000000009eae5c] .radeon_init+0xb4/0xd0

The error is detected in an atom_op_jump() that loops forever because of
the isolation, which means that all MMIO reads return ffffffff. The
actual error might have happened slightly earlier (see below).

 - From the backtrace, it seems to be trying to *disable* CRTCs (I would
have understood if it were incorrectly enabling one that is sourcing
pixels from the wrong address...)

 - I've added various delays at all sorts of stages in
radeon_modeset_init() and radeon_fbdev_init(), and the error still
appears to be fairly well localized to the execution of that table. So
it doesn't look like some stray DMA that happens to hit at that moment
due to timing, but rather something specifically triggered by that
table's execution.

 - I've turned on atom_debug right before the call to
drm_fb_helper_initial_config() in radeon_fbdev_init() and added a freeze
check between each op, and here's the result. I don't really have the
brain cycles to try to parse that right now :-) I used to be able to,
but heh, that was a long time ago... so this is where I hand you the hot
potato, hoping it will make some obvious sense :-)
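
To make the two mechanisms above concrete, here is a minimal user-space
sketch in C. It is not the radeon/ATOM code: every name in it
(poll_until_idle, run_ops_checked, the demo_* helpers) is hypothetical
and only models the idea. The first function shows why a polling jump
never exits once the device is isolated (an all-ones read keeps every
"busy" bit set) and how a bounded poll with an all-ones check would
escape. The second models a per-op freeze check that reports which op
the freeze was first observed after.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MMIO_ALL_ONES 0xffffffffu

/* Hypothetical MMIO accessor; an isolated device reads back all ones. */
typedef uint32_t (*mmio_read_fn)(uint32_t reg);

enum poll_result { POLL_DONE, POLL_TIMEOUT, POLL_DEVICE_GONE };

/*
 * Poll until (reg & busy_mask) == 0. An unbounded version of this loop
 * spins forever on a frozen device, because 0xffffffff & busy_mask is
 * never zero. Here we give up after max_iters reads, or as soon as the
 * read returns all ones. Note that all ones can be a legitimate value
 * on a healthy device, so a real check would confirm with the platform.
 */
static enum poll_result poll_until_idle(mmio_read_fn rd, uint32_t reg,
                                        uint32_t busy_mask,
                                        unsigned int max_iters)
{
    for (unsigned int i = 0; i < max_iters; i++) {
        uint32_t v = rd(reg);

        if (v == MMIO_ALL_ONES)
            return POLL_DEVICE_GONE;
        if ((v & busy_mask) == 0)
            return POLL_DONE;
    }
    return POLL_TIMEOUT;
}

typedef void (*op_fn)(void);

/*
 * Run ops in order and call the (hypothetical) frozen() hook after each.
 * Returns the index of the op after which the freeze was first seen,
 * or -1 if all ops ran without a freeze being reported.
 */
static int run_ops_checked(const op_fn *ops, size_t n_ops,
                           bool (*frozen)(void))
{
    for (size_t i = 0; i < n_ops; i++) {
        ops[i]();
        if (frozen())
            return (int)i;
    }
    return -1;
}

/* Demo stand-ins, for illustration only. */
static uint32_t demo_read_frozen(uint32_t reg) { (void)reg; return MMIO_ALL_ONES; }
static uint32_t demo_read_idle(uint32_t reg)   { (void)reg; return 0; }
static uint32_t demo_read_busy(uint32_t reg)   { (void)reg; return 0x1; }

static int demo_ops_run;
static void demo_op(void)     { demo_ops_run++; }
static bool demo_frozen(void) { return demo_ops_run >= 2; }
```

The freeze is only noticed at the first check after it happened, so the
index returned is an upper bound on where the bad access occurred, which
matches the caveat that the actual error might have happened slightly
earlier.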

As you can see, it's not doing much before the failure:

>> execute D7E2 (len 24, WS 0, PS 4)
   SET_ATI_PORT @ 0xD7E8
      port: 0 (MM)
   CLEAR_REG @ 0xD7EB
      dst: 
   AND_REG @ 0xD7EF
      dst: 
      src: 
      dst: 
   OR_REG @ 0xD7F4
      dst: 
      src: 
      dst: 
   EOT @ 0xD7F9
<<
>> execute D7CA (len 24, WS 0, PS 4)
   SET_ATI_PORT @ 0xD7D0
      port: 0 (MM)
   CLEAR_REG @ 0xD7D3
      dst: 
   AND_REG @ 0xD7D7
      dst: 
      src: 
      dst: 
   OR_REG @ 0xD7DC
      dst: 
      src: 
      dst: 
   EOT @ 0xD7E1
<<
>> execute BADE (len 25, WS 0, PS 0)
0001:01:00.0: EEH freeze detected, fstate=3 pcierr=9

[ Here we have detected the freeze. The stuff below comes from my own
  diagnostic code; I will decode some of it later if it's of use.
  Basically it says the freeze occurred due to a DMA error to an
  unmapped DMA address, though I am not 100% sure of the actual DMA
  address. I think what it gives me is actually the address of the
  iommu "PTE" that had the valid bit clear, so I need to do some work to
  turn that back into a page or something... Unfortunately, the packet
  hasn't been captured in the TLP header capture of the AER function.
 ]

PHB 1 diagnostic data:
  brdgCtl              = 0x00000002
  portStatusReg        = 0x00000000
  rootCmplxStatus      = 0x00000000
  busAgentStatus       = 0x00000000
  deviceStatus         = 0x0000000f
  slotStatus           = 0x016003c0
  linkStatus           = 0xa0120008
  devCmdStatus         = 0x00100147
  devSecStatus         = 0x00000000
  rootErrorStatus      = 0x00000000
  uncorrErrorStatus    = 0x00000000
  corrErrorStatus      = 0x00000000
  tlpHdr1              = 0x00000000
  tlpHdr2              = 0x00000000
  tlpHdr3              = 0x00000000
  tlpHdr4              = 0x00000000
  sourceId             = 0x00000000
  errorClass           = 0x0000000000000000
  correlator           = 0x0000000000000000
  p7iocPlssr           = 0x0000001c00000000
  p7iocCsr             = 0x0000000000000000
  lemFir               = 0x0200000004000000
  lemErrorMask         = 0x1249a1147f500f2c
  lemWOF               = 0x0000000000000000
  phbErrorStatus       = 0x0000000000000000
  phbFirstErrorStatus  = 0x0000000000000000
  phbErrorLog0         = 0x0000000000000000
  phbErrorLog1         = 0x0000000000000000
  mmioErrorStatus      = 0x0200000000000000
  mmioFirstErrorStatus = 0x0200000000000000
  mmioErrorLog0        = 0x02040070004a3da1
  mmioErrorLog1        = 0x98006e9800000000
  dma0ErrorStatus      = 0x0000000000004000
  dma0FirstErrorStatus = 0x0000000000004000
  dma0ErrorLog0        = 0x160007dbe0000002
  dma0ErrorLog1        = 0x1500000000000002
  dma1ErrorStatus      = 0x0000000000000000
  dma1FirstErrorStatus = 0x0000000000000000
  dma1ErrorLog0        = 0x0000000000000000
  dma1ErrorLog1        = 0x0000000000000000
  PE[  2] PESTA        = 0x8000302500000000
          PESTB        = 0x8000001f6f800000
foo !
------------[ cut here ]------------
WARNING: at arch/powerpc/platforms/powernv/pci.c:312
Modules linked in:
NIP: c00000000003ffe0 LR: c00000000003ffdc CTR: 0000000030009e5c
REGS: c000001f7e206ad0 TRAP: 0700   Not tainted  (3.7.0-rc2-00006-gae38062-dirty)
MSR: 9000000000029032 <SF,HV,EE,ME,IR,DR,RI>  CR: 28002084  XER: 20000000
SOFTE: 1
CFAR: c0000000007946ec
TASK = c000001f71340000[1] 'swapper/0' THREAD: c000001f7e204000 CPU: 57
GPR00: c00000000003ffdc c000001f7e206d50 c000000000b591a0 0000000000000005 
GPR04: 0000000000000000 00000000000002c2 9000000000009032 c0000000008e5e80 
GPR08: c000001f7e206d88 0000000080000039 0000000000000000 000000000000e086 
GPR12: 0000000028002082 c00000000ff4ab00 c00000000000b410 0000000000000000 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
GPR20: 0000000000000000 c00000000091c9f0 c000001f7e206e78 c000001f7e206e70 
GPR24: 000000000000bade 0000000000000019 0000000000000000 0000000000000002 
GPR28: 0000000000000002 c000001f7130a800 c000000000aa4300 c000000000ce5800 
NIP [c00000000003ffe0] .pnv_pci_check_eeh+0x130/0x150
LR [c00000000003ffdc] .pnv_pci_check_eeh+0x12c/0x150
Call Trace:
[c000001f7e206d50] [c00000000003ffdc] .pnv_pci_check_eeh+0x12c/0x150 (unreliable)
[c000001f7e206e00] [c00000000044b03c] .atom_execute_table_locked+0x1ac/0x380
[c000001f7e206f10] [c00000000044e724] .atom_execute_table+0x54/0x80
[c000001f7e206fb0] [c0000000004400b8] .atombios_enable_crtc_memreq+0x38/0x50
[c000001f7e207030] [c0000000004417bc] .atombios_crtc_dpms+0x17c/0x1a0
[c000001f7e2070c0] [c000000000441d28] .atombios_crtc_disable+0x28/0x170
[c000001f7e207190] [c0000000003e6eb4] .drm_helper_disable_unused_functions+0x144/0x230
[c000001f7e207230] [c0000000003e69ac] .drm_fb_helper_initial_config+0x5c/0x310
[c000001f7e207340] [c0000000004686ac] .radeon_fbdev_init+0x13c/0x1e0
[c000001f7e2073e0] [c0000000004620e0] .radeon_modeset_init+0x740/0xc90
[c000001f7e2074b0] [c000000000438cbc] .radeon_driver_load_kms+0x14c/0x1a0
[c000001f7e207550] [c0000000003f6dd4] .drm_get_pci_dev+0x1c4/0x2e0
[c000001f7e207600] [c00000000079af18] .radeon_pci_probe+0xc4/0xe8
[c000001f7e207690] [c000000000378d20] .pci_device_probe+0x1a0/0x1c0
[c000001f7e207740] [c0000000004db0c4] .driver_probe_device+0xe4/0x2c0
[c000001f7e2077e0] [c0000000004db3ac] .__driver_attach+0x10c/0x110
[c000001f7e207870] [c0000000004d8b9c] .bus_for_each_dev+0x8c/0xe0
[c000001f7e207920] [c0000000004dab18] .driver_attach+0x28/0x40
[c000001f7e2079a0] [c0000000004da448] .bus_add_driver+0x228/0x300
[c000001f7e207a40] [c0000000004db920] .driver_register+0xa0/0x1e0
[c000001f7e207ae0] [c000000000378e78] .__pci_register_driver+0x48/0x60
[c000001f7e207b60] [c0000000003f705c] .drm_pci_init+0x16c/0x1a0
[c000001f7e207c10] [c0000000009eae5c] .radeon_init+0xb4/0xd0
[c000001f7e207c90] [c00000000000ac04] .do_one_initcall+0x64/0x1e0
[c000001f7e207d50] [c00000000000b62c] .kernel_init+0x21c/0x3e0
[c000001f7e207e30] [c000000000009b0c] .ret_from_kernel_thread+0x5c/0x64
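
On the bracketed note above about turning the address of the iommu "PTE"
back into a page: a minimal sketch of that arithmetic in C, under
assumptions that are NOT confirmed by the dump. It assumes a flat,
single-level table of 8-byte entries, 4K IOMMU pages, and a DMA window
starting at a known bus address. The function name, its parameters and
both constants are hypothetical, and nothing here decodes the actual
dma0ErrorLog values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: 8-byte entries, each mapping one 4K IOMMU page. */
#define PTE_ENTRY_SIZE   8u
#define IOMMU_PAGE_SHIFT 12u

/*
 * Convert the address of a table entry into the DMA (bus) address of
 * the page that entry would translate. Returns false if pte_addr does
 * not point at an entry inside the table.
 */
static bool pte_addr_to_dma_addr(uint64_t pte_addr, uint64_t table_base,
                                 uint64_t table_size, uint64_t window_base,
                                 uint64_t *dma_addr)
{
    if (pte_addr < table_base || pte_addr - table_base >= table_size)
        return false;
    if ((pte_addr - table_base) % PTE_ENTRY_SIZE != 0)
        return false;

    uint64_t index = (pte_addr - table_base) / PTE_ENTRY_SIZE;

    *dma_addr = window_base + (index << IOMMU_PAGE_SHIFT);
    return true;
}
```

With a table at 0x1000 and a window starting at bus address 0, the entry
at 0x1018 is index 3 and corresponds to DMA address 0x3000. Knowing the
page would at least tell whether the bad write landed just outside a
buffer the driver had mapped or somewhere unrelated.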

Cheers,
Ben.