Initial attempts at i965 text batching

Wed Dec 19 08:12:02 PST 2007

So a long time ago I reported that with my i965 I could get about
290,000 glyphs/sec. from "x11perf -aa10text" by using the NoAccel
option to the X server, and similar performance with XAA, but only
95,000 glyphs/sec. with EXA due to the synchronous compositing bug in
the driver.

Since then, Dave Airlie rewrote the driver to use batch buffers,
which completely eliminated all of the syncs. By design, his work was
"functional, not performant" as it would go through all the effort of
allocating a new batch buffer, initializing all device state, and
emitting the batch for every compositing operation.

Needless to say, that's more work than we really want to do, and it
showed: performance fell to the range of 1,000 to 10,000
glyphs/sec.

Since then, I've rewritten parts of the driver to attempt to take
advantage of the batch buffers by actually batching up as much as
possible. General device state is only initialized once, then
surface-specific state is initialized on a per-batch basis within a
buffer object.
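
To make the idea concrete, here is a minimal sketch of that structure in
C. It is not the driver code: batch_t, emit_composite, batch_flush and
OPS_PER_BATCH are all made-up names, and the "work" is reduced to
counters so the amortization is easy to see. It also assumes one
surface-state setup per batch, which is my reading of the description
above.

```c
#define OPS_PER_BATCH 16        /* hypothetical knob, not a driver constant */

typedef struct {
    int device_state_ready;     /* general device state already emitted? */
    int ops_in_batch;           /* composite ops queued in the current batch */
    int state_inits;            /* counters below exist only for illustration */
    int surface_setups;
    int flushes;
} batch_t;

/* Submit whatever is queued; an empty batch is a no-op. */
static void batch_flush(batch_t *b)
{
    if (b->ops_in_batch == 0)
        return;
    b->flushes++;
    b->ops_in_batch = 0;
}

/* Queue one composite operation, paying setup costs as rarely as possible. */
static void emit_composite(batch_t *b)
{
    if (!b->device_state_ready) {   /* once for the lifetime of the batch_t */
        b->state_inits++;
        b->device_state_ready = 1;
    }
    if (b->ops_in_batch == 0)       /* once per batch (assumption) */
        b->surface_setups++;

    b->ops_in_batch++;
    if (b->ops_in_batch == OPS_PER_BATCH)
        batch_flush(b);
}
```

With these definitions, 40 calls to emit_composite followed by a final
batch_flush do one device-state init, three surface-state setups and
three flushes, rather than 40 of each.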

My work is available in the master branch of my personal
xf86-video-intel repository:

	http://cgit.freedesktop.org/~cworth/xf86-video-intel/log/

This work required some changes to the drm interface which Dave kindly
provided here, (in a hacked form---a cleaner version merged together
with Keith's recent work will come soon):

	http://cgit.freedesktop.org/~airlied/drm/log/?h=i965-hack-drm

So both of those are required for anyone that wants to experiment with
this.

As for performance, initially batching seems to help a lot, but we hit
a ceiling sooner than I would like:

	Ops/batch	Glyphs/sec.
	----------	-----------
	1		 10,000
	2		 20,000
	4		 37,000
	8		 67,000
	16		110,000
	32		120,000
	64		120,000
	128		120,000
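
To put numbers on how quickly the scaling falls off, the table can be
turned into a speedup relative to the 1 op/batch row and a fraction of
ideal linear scaling. The snippet below is just arithmetic on the
figures above; the array and function names are only for illustration.

```c
#include <stddef.h>

/* Figures copied from the table above. */
static const int ops_per_batch[] =
    { 1, 2, 4, 8, 16, 32, 64, 128 };
static const double glyphs_per_sec[] =
    { 10000, 20000, 37000, 67000, 110000, 120000, 120000, 120000 };

/* Speedup of row i relative to the 1 op/batch row. */
static double speedup(size_t i)
{
    return glyphs_per_sec[i] / glyphs_per_sec[0];
}

/* Fraction of perfect linear scaling that row i achieves. */
static double efficiency(size_t i)
{
    return speedup(i) / ops_per_batch[i];
}
```

At 16 ops/batch the speedup is 11x, or about 69% of linear. At 128
ops/batch it is 12x, or about 9% of linear.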

For people that saw an earlier version of this table, I should
explain two differences:

	* Earlier, it stopped at 64 since it started crashing after
          that. This was easy to work around by increasing the BATCH_SZ
          value. Clearly there's some missing error-checking around
          that value.

	* Previously, it looked like things kept improving all the way
          to 64 ops/batch. That was because that version was
          unconditionally allocating a maximally large buffer object
          for the surface state, (so the allocation overhead hit
          every case). Here, the surface state buffer object is
          allocated at the appropriate size, so the smaller batching
          cases improve and we hit the 120,000 glyphs/sec. ceiling
          earlier.
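
As for the missing error-checking around BATCH_SZ, one plausible shape
for it is a reserve step that flushes when the next chunk would not
fit. This is a sketch under assumptions, not what the driver does:
BATCH_SZ is the name from the driver, but the value here is a
placeholder, and batch_buf, batch_submit and batch_reserve are invented
for illustration.

```c
#include <stddef.h>

#define BATCH_SZ 4096           /* placeholder value, not the driver's */

typedef struct {
    unsigned char data[BATCH_SZ];
    size_t used;                /* bytes already filled in */
    int submits;                /* counter, for illustration only */
} batch_buf;

/* Hand the current contents to the hardware (here: just count and reset). */
static void batch_submit(batch_buf *b)
{
    if (b->used == 0)
        return;
    b->submits++;
    b->used = 0;
}

/*
 * Reserve nbytes in the batch, submitting first if they would not fit.
 * Returns NULL if the request can never fit, instead of overrunning.
 */
static unsigned char *batch_reserve(batch_buf *b, size_t nbytes)
{
    unsigned char *p;

    if (nbytes > BATCH_SZ)
        return NULL;
    if (nbytes > BATCH_SZ - b->used)
        batch_submit(b);

    p = b->data + b->used;
    b->used += nbytes;
    return p;
}
```

The point is only that an oversized batch should turn into an early
submit or a reported error rather than a crash.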

I'll be looking into why things aren't faster than this, but first
I'll need to get oprofile working on my system again.[*]

-Carl

[*] Right now opreport is complaining with:

opreport: error while loading shared libraries: libbfd-2.18.so: cannot
open shared object file: No such file or directory

Does that mean anything to anybody? I'm doing a general system upgrade
now to see if that helps.