Bad DMA from CEDAR card

Tue Oct 23 05:31:55 PDT 2012

On Tue, 2012-10-23 at 21:45 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2012-10-23 at 18:54 +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2012-10-23 at 18:42 +1100, Benjamin Herrenschmidt wrote:
> > > 
> > > As you can see, it's not doing much before the failure:
> > 
> > All right, that debug output is bad, it's missing a bunch of stuff,
> > due to a bad log level (the printk(KERN_DEBUG) in the atom debug
> > stuff doesn't work anymore on new kernels btw)
> 
> More data: I've done a bit of AtomDis under Dave's instructions and
> improved my tracing, and what it looks like is we run those 3 tables in
> that order:

And more :-)

 .../...

> I don't know (yet) whether anything happens in between that doesn't go
> via ATOM, in which case that wouldn't be traced. That's the next thing
> to check (including interrupts though we shouldn't be getting any at
> this stage afaik).

So I think it's in between. From what I can tell, the error happens
somewhere inside the call to drm_vblank_pre_modeset() from
atombios_crtc_dpms().

This is actually a bit nasty, but basically that pre_modeset() calls
drm_vblank_get() which enables vblanks.

Now, it doesn't look like it's actually taking any interrupt ... Well,
assuming this is indeed using the irq handler in evergreen.c, which
doesn't appear to be called. I might have confused my ASICs and
missed something specific to CEDAR here though.

Here's what I've traced so far:

evergreen_irq_set: vblank 0
evergreen_irq_set: hpd 1
evergreen_irq_set: hpd 2
evergreen_irq_set: hpd 3
evergreen_irq_set: hpd 4
0001:01:00.0: EEH freeze detected, fstate=3 pcierr=9 msg: irqset 2

What that output means is that it called evergreen_irq_set() which
enables vblank0 (and various hpd's but those were already enabled), and
the freeze is detected at the tracepoint "irqset 2" that I added in
there.

This point is basically right at the end of evergreen_irq_set(), where I
do a 500ms delay and check for freeze. A previous trace point right
before writing to CP_INT_CNTL didn't show any freeze.

Now the interrupt being an MSI, it's a memory store ... I had a vague
memory of one of you guys mentioning address limitations to 40-bit or so
in the radeon, though I thought that shouldn't affect MSIs, right?
Well ...

Our 64-bit MSIs are actually using all 64 address bits. If the
radeon doesn't do that properly and crops the address bits, the MSIs are
going to hit wrong, right in the middle of nowhere, possibly some DMA
space.
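
To make the failure mode concrete, here is a minimal user-space sketch
of what address cropping would do to an MSI target. The 40-bit width is
only the vaguely remembered figure from above, and truncate_addr(),
msi_addr_survives() and the sample addresses are made up for
illustration; nothing here is taken from the radeon driver or from the
platform code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: the device keeps only the low `bits` bits of the
 * address it is told to write to and silently drops the rest.
 */
static uint64_t truncate_addr(uint64_t addr, unsigned bits)
{
    if (bits >= 64)
        return addr;                      /* nothing is cropped */
    return addr & ((UINT64_C(1) << bits) - 1);
}

/*
 * Returns 1 if the MSI write would still land on the programmed
 * address, 0 if the cropped address differs (the write goes astray).
 */
static int msi_addr_survives(uint64_t addr, unsigned bits)
{
    return truncate_addr(addr, bits) == addr;
}
```

With a made-up MSI address of 0x10000000fee00000 and 40 address bits,
truncate_addr() returns 0xfee00000, so the store lands somewhere other
than where the platform expects it. An address that fits in 32 bits
passes through unchanged, which matches the observation that follows.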

So I hacked my platform code to force it to only hand out 32-bit MSI
addresses and guess what ? ... the problem seems to be gone. Ouch.

That's really nasty. Supporting only a subset of the PCI address space
for DMA was already fairly nasty to begin with, but not doing the full
MSI addresses looks like a clear violation of the PCIe spec :-(

I'll do some more tests tomorrow to confirm whether that is the problem
or not. If it is, we'll need some kind of quirk to indicate that the
card supports only MSI32 and not MSI64, or something along those lines.
Guys, go shoot your HW engineers please.
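
For the sake of discussion, the quirk idea could look roughly like the
user-space model below. This is only a sketch of the decision logic,
not kernel code: struct msi_quirk, device_is_msi32_only(),
pick_msi_addr() and the 0xffff device ID are all hypothetical
placeholders. Only the 0x1002 vendor ID (ATI/AMD) is real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical quirk entry: a device that must get 32-bit MSI addresses. */
struct msi_quirk {
    uint16_t vendor;
    uint16_t device;
};

/* 0x1002 is the ATI/AMD vendor ID; 0xffff is a placeholder, not a real CEDAR ID. */
static const struct msi_quirk msi32_only[] = {
    { 0x1002, 0xffff },
};

/* Returns 1 if the device is listed as unable to handle 64-bit MSI addresses. */
static int device_is_msi32_only(uint16_t vendor, uint16_t device)
{
    size_t i;

    for (i = 0; i < sizeof(msi32_only) / sizeof(msi32_only[0]); i++) {
        if (msi32_only[i].vendor == vendor &&
            msi32_only[i].device == device)
            return 1;
    }
    return 0;
}

/*
 * Pick the MSI address to program: the 32-bit one for quirked devices,
 * the full 64-bit one for everything else.
 */
static uint64_t pick_msi_addr(uint16_t vendor, uint16_t device,
                              uint32_t addr32, uint64_t addr64)
{
    return device_is_msi32_only(vendor, device) ? (uint64_t)addr32 : addr64;
}
```

The point of keeping the check separate from the address choice is that
the platform MSI setup code would only need to ask one question per
device, whatever form the real quirk ends up taking.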

Cheers,
Ben.