[cairo] Pixman glyph performance, and beyond!
sandmann at daimi.au.dk
Fri Oct 23 05:28:44 PDT 2009
> > The latest incarnation of that work is the 'flags' branch here:
> > http://cgit.freedesktop.org/~sandmann/pixman/log/?h=flags
> > which contains several optimizations in this area.
> > It might be worthwhile rerunning the benchmark against that branch,
> > though I suspect there is still some overhead. Almost anything will
> > show up when the images are as small as glyphs are.
> Very effective, Søren, it eliminated the get_fast_path() overhead entirely:
> 32.84% [.] sse2_composite_add_n_8888_8888_ca
> 17.13% [.] sse2_composite_over_n_8888_8888_ca
> 15.98% [.] pixman_image_composite
> 5.78% [.] pixman_blt_sse2
> 5.40% [.] _pixman_image_validate
> 3.98% [.] pixman_compute_composite_region32
> 2.12% [.] pixman_fill_sse2
> It looks like it's been absorbed into pixman_image_composite(), but the
> runtime improved by over 10% -- indicative that the lookup overhead was
> eliminated. Though there is still around 25% to be recovered.
Some of that 25% might be recoverable from _pixman_image_validate()
by precomputing the flags, but other than that, I don't think there is
all that much more to be gained without a better interface for glyph
> > However, I do agree that glyph compositing needs to become much faster
> > in both X and cairo, but I think that a better way would be to move
> > the Render glyph management code into pixman and expose a new
> > pixman_glyph_set_t
> > along with something like a pixman_composite_glyphs() similar to how
> > Render works. This would allow both cairo and X to become
> > substantially faster, while sharing glyph caching code.
> > For spans, I still think that a polygon image type in pixman is the
> > way to go, since again this would benefit both X and cairo. There
> > could certainly be a call to convert it into spans if that is useful
> > to other cairo backends, so that we wouldn't need to have two
> > rasterizers.
> I'm actually not so convinced that this is the direction that pixman should
> be going in. From my perspective cairo requires specific path -> backend
> geometry converters, and a polygon rasteriser with a span line interface
> has quickly become the default method for pushing masks around. Whereas
> traps have been relegated to mostly handling boxes, aside from when the
> most efficient wire request we have available is CompositeTraps. (Has
> anyone else noticed that the RLE mask for curved geometry is often an
> order of magnitude smaller than the equivalent set of trapezoids,
> almost as small as the original path?)
One of the major reasons for adding a polygon image to pixman is that
it would make a PictPolygon render picture possible, thereby finally
eliminating the use of trapezoids, while allowing the X server to do
one-pass compositing of geometry, at least in software.
I am not proposing to tessellate the polygons in pixman, but to
composite them directly scanline by scanline. Using Tor's clever trick
of representing scanlines as accumulation buffers, the operation
(source IN polygon) OVER dest
can be implemented by
    /* Generate polygon scanline accumulation buffer */
    for each subpixel scanline
        for each active edge
            add delta to one pixel in accumulation buffer

    /* Composite the scanline */
    uint8_t m = 0;
    for each pixel i:
        m += polygon[i];
        s = load_source();
        d = load_dest();
        d = composite (s, m, d);
which is very SIMD-able and allows us to not allocate and zero-fill a
potentially large temporary mask. I think this would be faster than
the arrays of spans. This would benefit at least the image backend
too, but in any case, there is a clear need to do better than
trapezoids for uploading geometry to X.
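As a rough illustration of the loop above, here is a minimal C sketch of one scanline of (source IN polygon) OVER dest. Everything in it is an assumption made for the example, not existing pixman code: the names acc_add_span(), composite_scanline() and scale_argb(), the choice of 15 sub-scanlines of weight 17 each (so that full coverage is 255), premultiplied ARGB32 pixels, and edges that fall on whole-pixel boundaries (so there is no horizontal antialiasing).

```c
#include <assert.h>
#include <stdint.h>

#define SUBROWS       15   /* assumed sub-scanlines per pixel row */
#define SUBROW_WEIGHT 17   /* 15 * 17 == 255, i.e. full coverage */

/* Multiply two 8-bit values read as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Scale all four channels of a premultiplied ARGB pixel by m / 255. */
static uint32_t scale_argb(uint32_t p, uint8_t m)
{
    return ((uint32_t)mul_un8((uint8_t)(p >> 24), m) << 24) |
           ((uint32_t)mul_un8((uint8_t)(p >> 16), m) << 16) |
           ((uint32_t)mul_un8((uint8_t)(p >> 8), m) << 8) |
            (uint32_t)mul_un8((uint8_t)p, m);
}

/* Record one sub-scanline span [x0, x1): add the weight where coverage
 * starts and subtract it where coverage ends. acc needs width + 1 entries. */
static void acc_add_span(int16_t *acc, int x0, int x1)
{
    assert(x0 >= 0 && x0 <= x1);
    acc[x0] += SUBROW_WEIGHT;
    acc[x1] -= SUBROW_WEIGHT;
}

/* One scanline of (src IN polygon) OVER dst. The running sum m turns the
 * deltas into per-pixel coverage, so no separate mask is ever stored. */
static void composite_scanline(const int16_t *acc, const uint32_t *src,
                               uint32_t *dst, int width)
{
    int m = 0;
    for (int i = 0; i < width; i++) {
        m += acc[i];
        uint8_t cov = (uint8_t)(m < 0 ? 0 : (m > 255 ? 255 : m));
        uint32_t s = scale_argb(src[i], cov);          /* src IN mask */
        uint8_t inv_alpha = (uint8_t)(255 - (s >> 24));
        dst[i] = s + scale_argb(dst[i], inv_alpha);    /* OVER */
    }
}
```

The only per-scanline state is the delta buffer, which is one entry longer than the scanline and has to be cleared before the next row.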
It's less convenient for shaders because of the horizontal prefix sum,
which is why there could be an
interface to generate spans for the cairo GPU backends, or a
callback-based one like the one in current cairo. DDX drivers could
use this as well, as could pixman GPU backends.
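To make the span idea concrete, here is a small C sketch that walks one accumulation buffer and emits run-length spans of constant coverage. The span_t layout and the acc_to_spans() name are invented for this example; they are not the cairo span interface or anything that exists in pixman.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical span record: len pixels starting at x share one coverage. */
typedef struct {
    int     x;
    int     len;
    uint8_t coverage;
} span_t;

/* Convert per-pixel coverage deltas into spans, skipping uncovered pixels.
 * Returns the number of spans written, or -1 if out[] is too small. */
static int acc_to_spans(const int16_t *acc, int width,
                        span_t *out, int max_spans)
{
    int n = 0;
    int m = 0;   /* running sum: the horizontal prefix sum */

    assert(width >= 0 && max_spans >= 0);
    for (int i = 0; i < width; i++) {
        m += acc[i];
        uint8_t cov = (uint8_t)(m < 0 ? 0 : (m > 255 ? 255 : m));
        if (cov == 0)
            continue;
        if (n > 0 && out[n - 1].coverage == cov &&
            out[n - 1].x + out[n - 1].len == i) {
            out[n - 1].len++;             /* extend the current run */
        } else {
            if (n == max_spans)
                return -1;
            out[n].x = i;
            out[n].len = 1;
            out[n].coverage = cov;
            n++;
        }
    }
    return n;
}
```

The prefix sum is done once on the CPU here, so a consumer that cannot do it cheaply only sees finished spans.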
> Similarly, I'd rather not add the overhead of an independent layer
> of glyph management.
Simply moving the Render glyph code into pixman would be an immediate
improvement for non-hardware-accelerated X servers since the remaining
overhead in the profile above would essentially disappear.
A little longer term, it would allow storing the glyphs as efficiently
as possible for the CPU in question (since pixman already has details
about the CPU). For example, the SSE2 backend could benefit greatly if
the glyphs were stored in a 16 x n image, possibly sorted by
approximate use frequency. This would improve both cache performance
and allow aligned loads to be used.
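A minimal sketch of what such a 16 x n glyph store could look like in C follows. The glyph_store_t type and its functions are hypothetical names made up for this example, the glyphs are assumed to be a8 and at most 16 pixels wide, and C11 aligned_alloc() is assumed to be available. Sorting by use frequency is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ROW_BYTES 16   /* one 128-bit SSE2 register per glyph row */

/* Hypothetical store: a single 16-byte-wide, n-row a8 image. */
typedef struct {
    uint8_t *rows;      /* 16-byte aligned, ROW_BYTES per row */
    int      n_rows;    /* rows in use */
    int      capacity;  /* rows allocated */
} glyph_store_t;

/* Returns nonzero on success. */
static int glyph_store_init(glyph_store_t *s, int capacity_rows)
{
    assert(capacity_rows > 0);
    s->rows = aligned_alloc(ROW_BYTES, (size_t)capacity_rows * ROW_BYTES);
    s->n_rows = 0;
    s->capacity = s->rows ? capacity_rows : 0;
    return s->rows != NULL;
}

/* Append an a8 glyph, zero-padding each row to 16 bytes so that every row
 * starts on a 16-byte boundary. Returns the index of the glyph's first row,
 * or -1 if the glyph is too wide or the store is full. */
static int glyph_store_add(glyph_store_t *s, const uint8_t *a8,
                           int width, int height, int stride)
{
    if (width > ROW_BYTES || height > s->capacity - s->n_rows)
        return -1;

    int first = s->n_rows;
    for (int y = 0; y < height; y++) {
        uint8_t *row = s->rows + (size_t)(first + y) * ROW_BYTES;
        memset(row, 0, ROW_BYTES);
        memcpy(row, a8 + (size_t)y * stride, (size_t)width);
    }
    s->n_rows += height;
    return first;
}

static void glyph_store_fini(glyph_store_t *s)
{
    free(s->rows);
    s->rows = NULL;
    s->n_rows = s->capacity = 0;
}
```

Because every row is exactly one aligned 16-byte block, a SIMD compositor could load a whole glyph row with a single aligned load, and glyphs that are used together sit next to each other in memory.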
I'm not sure I understand what 'overhead of an independent layer of
glyph management' you see for cairo. For at least the image backend, I
don't see anything that a pixman_glyph_set_t could not provide, and it
seems like a win to share this code with the X server. For other
backends, it may not be as useful, but it won't be harmful either.
> With that bias, I'd prefer that pixman retained its focus on pixel
> manipulation routines and we improve the interfaces for performing
> large sets of similar operations.
> One issue that we will encounter very soon is the pain caused by forcing
> the user to emit cairo_show_glyphs() early for each change in font. This
> can be fixed up in the backends that batch requests and use a
> consolidated glyph atlas (i.e. there is no level state change and so the
> geometry is just accumulated onto the previous operation). [There is
> still substantial overhead from cairo doing the analysis on the extra
> operations.] Similarly we can move away from an immediate mode, direct
> access, pixman - and treat pixman more like a GPU, if it is performant.
Even with glyphs and polygons in pixman, I agree that there is
potentially a lot of benefit to be had from submitting more work to
pixman at a time.
I don't have a clear proposal for what such an interface would look
like, but here are some things that are worth thinking about:
- How well does the interface help if pixman is multithreaded?
- How well does it help if pixman gains GPU backends?
- Can we eliminate temporary images? I.e., if someone does
      <some simple stuff>
  can we do the whole operation in one pass without allocating
  a temporary mask? One way to do this would be to add a new
  image type that would contain pointers to two other images,
  then composite them on the fly when the toplevel image is
  asked to fetch a scanline. This leads to the idea of an
  'expression tree' of images as the way to submit a lot of
- Can it be JIT compiled? There are two quite different
  approaches to JIT compilation:
    - Generate code similar to the current fast paths and
      cache it. This is simpler to get going at first, but
      also fundamentally requires temporary images.
    - Generate one-shot code for a lot of operations at
      once. With the expression tree idea, this might make a
      lot of sense. The compiler could look at the tree and
      generate the code that would produce the least amount
      of memory traffic.
      Shader code could be generated from this as well.
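
To illustrate the image type that composites two other images on the fly, here is a small C sketch of an 'expression tree' node. It is only a sketch under stated assumptions: the image_t layout, the fetch callback and the image_bits() / image_in() constructors are invented for this example, images are a8 only, the only operator is IN, and scanlines are limited to MAX_WIDTH pixels.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_WIDTH 64   /* assumed upper bound on scanline width */

typedef struct image image_t;

/* Hypothetical image node: either a leaf with pixel data, or an
 * operator node pointing at two child images. */
struct image {
    void (*fetch)(const image_t *img, int y, int width, uint8_t *out);
    const uint8_t *bits;     /* leaf: a8 pixels */
    int            stride;   /* leaf: bytes per row */
    const image_t *a, *b;    /* operator node: the two operands */
};

/* Multiply two 8-bit values read as fractions of 255, with rounding. */
static uint8_t mul_un8(uint8_t a, uint8_t b)
{
    unsigned t = (unsigned)a * b + 0x80;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

static void fetch_bits(const image_t *img, int y, int width, uint8_t *out)
{
    memcpy(out, img->bits + (size_t)y * img->stride, (size_t)width);
}

/* Fetch one scanline of (a IN b). Only a scanline-sized scratch buffer
 * is needed, never a whole temporary image. */
static void fetch_in(const image_t *img, int y, int width, uint8_t *out)
{
    uint8_t tmp[MAX_WIDTH];

    assert(width <= MAX_WIDTH);
    img->a->fetch(img->a, y, width, out);
    img->b->fetch(img->b, y, width, tmp);
    for (int i = 0; i < width; i++)
        out[i] = mul_un8(out[i], tmp[i]);
}

static image_t image_bits(const uint8_t *bits, int stride)
{
    image_t img = { fetch_bits, bits, stride, NULL, NULL };
    return img;
}

static image_t image_in(const image_t *a, const image_t *b)
{
    image_t img = { fetch_in, NULL, 0, a, b };
    return img;
}
```

Since an operator node is itself an image, nodes can be nested, and fetching a scanline from the root evaluates the whole tree one row at a time.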