In reviewing perf data while running Cairo’s firefox-fishbowl macro benchmark with the spans compositor, I found three routines of interest:
_cairo_gl_bounded_spans loops over spans objects, issuing emit requests. For this particular test, _cairo_gl_composite_emit_solid_span handles most events. This routine is an optimized shortcut taken for certain geometries (solid filled objects I suppose?) instead of the more expensive _cairo_gl_composite_emit_span call; using the more expensive call would incur a ~24% performance hit by my measurements:
                minimum   median   stddev.  count
    baseline     14.267   14.487    0.71%   6/6 (ref)
    always-solid 14.172   14.462    0.88%   6/6
    always-span  17.770   17.898    0.36%   6/6
So that was definitely a good optimization!
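For anyone wanting to check the arithmetic behind that ~24% figure, here is a minimal sketch of the calculation, using the median column from the table. The helper name `percent_slowdown` is just something I made up for illustration; it isn't part of Cairo or its perf tools.

```c
/* Hypothetical helper (not part of Cairo): percentage by which `other`
 * is slower than `ref`, where both are timings in the same unit. */
static double
percent_slowdown (double ref, double other)
{
    return (other - ref) / ref * 100.0;
}
```

Calling `percent_slowdown (14.487, 17.898)` with the baseline and always-span medians gives roughly 23.5; comparing against the always-solid median (14.462) instead gives a slightly higher number, which is where the "~24%" comes from.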
_cairo_gl_composite_emit_solid_span calls _cairo_gl_composite_prepare_buffer before doing anything else. The only thing this routine does is flush the vertex buffer if the new vertices would not fit; the specific code is:
    if (ctx->vb_offset + n_vertices * ctx->vertex_size > CAIRO_GL_VBO_SIZE)
        _cairo_gl_composite_flush (ctx);
Here we find CAIRO_GL_VBO_SIZE, which effectively controls the frequency of flushing. The larger the VBO, the fewer flush operations. An important clue appears when looking at its definition:
    /* VBO size that we allocate, smaller size means we gotta flush more often,
     * but larger means hogging more memory and can cause trouble for drivers
     * (especially on embedded devices). */
    #define CAIRO_GL_VBO_SIZE (16*1024)
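To get a feel for how strongly the buffer size drives the number of flushes, here is a rough standalone model of the check in _cairo_gl_composite_prepare_buffer. This is not Cairo code: `count_flushes` is a hypothetical helper, and the vertex size and vertices-per-span figures used with it below are assumptions picked purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not Cairo code: count how often a vertex buffer of
 * vbo_size bytes has to be flushed while emitting n_spans spans, each of
 * which needs verts_per_span vertices of vertex_size bytes. */
static size_t
count_flushes (size_t vbo_size, size_t vertex_size,
               size_t verts_per_span, size_t n_spans)
{
    size_t span_bytes = verts_per_span * vertex_size;
    size_t vb_offset = 0;
    size_t flushes = 0;
    size_t i;

    /* A single span must fit into an empty buffer. */
    assert (span_bytes > 0 && span_bytes <= vbo_size);

    for (i = 0; i < n_spans; i++) {
        /* Same shape as the real check: flush if the span won't fit. */
        if (vb_offset + span_bytes > vbo_size) {
            flushes++;
            vb_offset = 0;
        }
        vb_offset += span_bytes;
    }

    /* Data still sitting in the buffer at the end is not counted. */
    return flushes;
}
```

Assuming 6 vertices of 8 bytes per span (48 bytes, an invented figure), emitting 100000 spans in this model takes 293 flushes with a 16k buffer and 18 with a 256k buffer. The real cost of each flush depends on the driver, so this only shows the trend in flush count, not the timing.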
Looking through the git history for this define, at one point it was changed to a larger value, 256k, but was then reduced to 16k because embedded devices had trouble with large VBOs. I’m not sure whether that holds true for the full class of embedded hardware or just certain chips, but clearly there are some trade-offs in picking a size here. Other than the above comment, I couldn’t tell if anyone had looked at the performance impact of this change before, so I gave it a shot myself, testing across a range of different VBO sizes.
This testing, done on an Intel Sandy Bridge (i5-2405S) with HD 3000 graphics, indicates about a 5% performance regression from reducing the VBO size from 256k to 16k. If we opened the throttle up to 1M we’d improve by 8.3% on the firefox-fishbowl test.
VBOs, or Vertex Buffer Objects, are an OpenGL feature that allows storing vertex data in the video card’s high-performance memory instead of system memory, allowing quicker rendering. Since different video cards can obviously have different amounts of memory, I would expect different hardware to have different optimum points, and it’s no surprise that embedded devices might have trouble with larger sizes.
I don’t have access to any embedded devices for testing at the moment, but I did re-run the tests on another system, this one running Ubuntu 12.04 with a Radeon HD 4670 card (RV730XT) and mesa version 8.0.4. Even on this older setup and different hardware, we see the same general trend:
I don’t know if there is a way to programmatically determine the amount of video memory available for VBOs, but the box my video card came in says it has 512M of memory. The Intel HD 3000 is an integrated chip and uses system memory instead of its own separate video memory, so I’m not at all sure how to find out how much it’s using. My guess is that the optimal VBO size might be correlated with the available video memory; one would certainly think that exceeding the video memory size would be counterproductive.
For ease of testing, I converted the CAIRO_GL_VBO_SIZE define into an environment variable, which falls back to 16k as the default if it’s not specified. I’ve posted the patch to my fdo repo.
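For readers who just want the general idea without digging up the patch, the environment-variable lookup can be sketched roughly as follows. This is not the patch itself: the helper names are hypothetical, and reusing `CAIRO_GL_VBO_SIZE` as the variable name, with the value given as a plain byte count, is an assumption on my part.

```c
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_VBO_SIZE (16 * 1024)

/* Hypothetical helper: parse a byte count from a string, falling back
 * to the 16k default when the value is missing or malformed. */
static long
vbo_size_from_string (const char *value)
{
    char *end;
    long size;

    if (value == NULL || *value == '\0')
        return DEFAULT_VBO_SIZE;

    errno = 0;
    size = strtol (value, &end, 10);

    /* Reject overflow, trailing junk, and non-positive sizes. */
    if (errno != 0 || *end != '\0' || size <= 0)
        return DEFAULT_VBO_SIZE;

    return size;
}

/* Hypothetical helper: the variable name is assumed, not taken from
 * the actual patch. */
static long
vbo_size_from_env (void)
{
    return vbo_size_from_string (getenv ("CAIRO_GL_VBO_SIZE"));
}
```

With something like this in place, a run such as `CAIRO_GL_VBO_SIZE=262144 ./benchmark` (where `./benchmark` stands in for whatever test program is being run) would use a 256k buffer, while leaving the variable unset keeps the 16k default.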
A good question is whether increasing the VBO size would have any noticeable real-world effect beyond the fishbowl benchmark. To find out, I ran a number of other benchmarks while varying the VBO size. Most of the benchmarks showed no effect whatsoever. A few, like firefox-paintball and swfdec-giant-steps, showed that small VBO sizes (4k and 8k) incurred some performance loss, but at 16k and higher performance was essentially unchanged. The firefox-canvas benchmark shows a 1-2% benefit from increasing the VBO size to 512k on Intel, but that’s almost noise. On a more positive note, none of the benchmarks show performance regressions from increasing the VBO size.