Bad DMA from CEDAR card

Tue Oct 23 06:34:45 PDT 2012

On Tue, Oct 23, 2012 at 8:31 AM, Benjamin Herrenschmidt
<benh at kernel.crashing.org> wrote:
> On Tue, 2012-10-23 at 21:45 +1100, Benjamin Herrenschmidt wrote:
>> On Tue, 2012-10-23 at 18:54 +1100, Benjamin Herrenschmidt wrote:
>> > On Tue, 2012-10-23 at 18:42 +1100, Benjamin Herrenschmidt wrote:
>> > >
>> > > As you can see, it's not doing much before the failure:
>> >
>> > All right, that debug output is bad, it's missing a bunch of stuff,
>> > due to a bad log level (the printk(KERN_DEBUG) in the atom debug
>> > stuff doesn't work anymore on new kernels btw)
>>
>> More data: I've done a bit of AtomDis under Dave's instructions and
>> improved my tracing, and what it looks like is we run those 3 tables in
>> that order:
>
> And more :-)
>
>  .../...
>
>> I don't know (yet) whether anything happens in between that doesn't go
>> via ATOM, in which case that wouldn't be traced. That's the next thing
>> to check (including interrupts though we shouldn't be getting any at
>> this stage afaik).
>
> So I think it's in between. From what I can tell, the error happens
> somewhere inside the call to drm_vblank_pre_modeset() from
> atombios_crtc_dpms().
>
> This is actually a bit nasty, but basically that pre_modeset() calls
> drm_vblank_get() which enables vblanks.
>
> Now, it doesn't look like it's actually taking any interrupt ... Well
> assuming this is indeed using the irq handler in evergreen.c, which
> doesn't appear to be called, but I might have confused my ASICs and
> missed something specific to CEDAR here.
>
> Here's what I've traced so far:
>
> evergreen_irq_set: vblank 0
> evergreen_irq_set: hpd 1
> evergreen_irq_set: hpd 2
> evergreen_irq_set: hpd 3
> evergreen_irq_set: hpd 4
> 0001:01:00.0: EEH freeze detected, fstate=3 pcierr=9 msg: irqset 2
>
> What that output means is that it called evergreen_irq_set() which
> enables vblank0 (and various hpd's but those were already enabled), and
> the freeze is detected at the tracepoint "irqset 2" that I added in
> there.
>
> This point is basically right at the end of evergreen_irq_set(), where I
> do a 500ms delay and check for freeze. A previous trace point right
> before writing to CP_INT_CNTL didn't show any freeze.
>
> Now the interrupt being an MSI, it's a memory store ... I had a vague
> memory of one of you guys mentioning address limitations to 40-bit or so
> in the radeon, though I thought that shouldn't affect MSIs right ?
> Well ...
>
> Our 64-bit MSIs are actually using all 64 address bits. If the
> radeon doesn't do that properly and crops the address bits, the MSIs are
> going to hit wrong, right in the middle of nowhere, possibly some DMA
> space.
>
> So I hacked my platform code to force it to only hand out 32-bit MSI
> addresses and guess what ? ... the problem seems to be gone. Ouch.
>
> That's really nasty. Supporting only a subset of the PCI address space
> for DMA was already fairly nasty to begin with, but not doing the full
> MSI addresses looks like a clear violation of the PCIe spec :-(
>
> I'll do some more tests tomorrow to confirm whether that is the problem
> or not at which point, if it is, we'll need some kind of quirk to
> indicate that it supports only MSI32 and not MSI64 or something along
> those lines. Guys, go shoot your HW engineers please.
>

Well, we only support a 40 bit DMA mask, so I suspect MSIs are limited
to 40 bits as well.

Alex
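
For illustration only, here is a minimal sketch of the failure mode discussed in this thread: a device that emits only the low 40 address bits will write a 64-bit MSI to the wrong place whenever the target address has any higher bit set. The helper names (addr_mask, truncated_addr, msi_addr_fits) are hypothetical and are not taken from the radeon driver or from any platform code; the 40-bit width is the DMA mask mentioned above, and whether the hardware really applies it to MSIs is still only a suspicion at this point.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: mask with the low `bits` bits set (bits <= 64). */
static uint64_t addr_mask(unsigned bits)
{
    /* Shifting a 64-bit value by 64 is undefined in C, so special-case it. */
    return bits >= 64 ? UINT64_MAX : ((UINT64_C(1) << bits) - 1);
}

/* Address a device would actually emit if it drops everything above `bits`. */
static uint64_t truncated_addr(uint64_t addr, unsigned bits)
{
    return addr & addr_mask(bits);
}

/* True if the MSI target address survives truncation unchanged. */
static bool msi_addr_fits(uint64_t addr, unsigned bits)
{
    return truncated_addr(addr, bits) == addr;
}
```

With a 40-bit width, msi_addr_fits() accepts any address below 1 << 40 (so every 32-bit MSI address passes) and rejects an address with a high bit set, such as the made-up value 0x1000000000000000. That matches the observation that handing out only 32-bit MSI addresses makes the problem go away.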