As I started looking at the performance of Cairo’s spans compositor, one of the first things I wanted was profiling data for some relevant tests. I looked at gprof and valgrind, but there was one big problem: the symbols I wanted to focus on live in an external shared library, libcairo, and I couldn’t find an easy way to get gprof or valgrind to report symbols from linked libraries.
But linux-perf did the trick.
linux-perf, aka “Performance Counters for Linux”, was introduced in kernel 2.6.31 as a low-overhead means of examining performance data from the system and applications. It doesn’t require instrumenting the application, and provides a deeper wealth of information than statistical profilers. It comes with a whole set of subcommands, but the two I needed for cairo were:
$ perf record -g -- ./perf/cairo-perf-trace firefox-fishbowl
$ perf report
The first command invokes the application under the profiler, here running Cairo’s macro performance benchmark with Firefox’s ‘fishbowl’ test trace. This generates a perf.data file, which perf report reads to display a listing of the functions that were sampled, along with each one’s share of the total. The listing is quite comprehensive, showing cairo internal calls, libc calls, mesa calls, and more. It sounds like perf can also profile kernel calls in detail, but I wasn’t interested in that level of detail, so I didn’t enable it.
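Since libcairo is the library of interest here, the report can also be narrowed to a single shared object. The following is only a minimal sketch, assuming the perf.data file from the run above; report_dso is a hypothetical helper name, and the library name should be given the way it appears in the report's shared-object column.

```shell
# Hypothetical helper: show only the symbols from one shared object.
#   $1 = DSO name as printed by perf report (e.g. libcairo.so.2.11200.15)
#   $2 = profile to read (defaults to perf.data in the current directory)
# PERF may be set to point at a different perf binary.
report_dso() {
    "${PERF:-perf}" report --stdio --sort symbol --dsos "$1" -i "${2:-perf.data}"
}
```

Calling report_dso libcairo.so.2.11200.15 then prints a plain-text listing restricted to libcairo, which is easier to grep or paste into a post than the interactive view.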
The output looks like this:
+ 11.56% lt-cairo-perf-t libcairo.so.2.11200.15 [.] _cairo_tor_scan_converter_generate
+ 11.27% lt-cairo-perf-t libc-2.15.so [.] 0x80f31
+ 9.32% lt-cairo-perf-t libcairo.so.2.11200.15 [.] _cairo_gl_composite_emit_solid_span
+ 6.78% lt-cairo-perf-t libcairo.so.2.11200.15 [.] cell_list_render_edge
+ 5.68% lt-cairo-perf-t libcairo-script-interpreter.so.2.11200.15 [.] _csi_hash_table_lookup
+ 5.06% lt-cairo-perf-t libcairo-script-interpreter.so.2.11200.15 [.] _scan_file.5939
+ 3.60% lt-cairo-perf-t libcairo.so.2.11200.15 [.] _cairo_gl_bounded_spans
+ 3.24% lt-cairo-perf-t [kernel.kallsyms] [k] 0xffffffff8103e0aa
+ 2.35% lt-cairo-perf-t libcairo-script-interpreter.so.2.11200.15 [.] _csi_parse_number
+ 2.25% lt-cairo-perf-t libcairo-script-interpreter.so.2.11200.15 [.] csi_file_getc
+ 1.32% lt-cairo-perf-t libcairo.so.2.11200.15 [.] _cairo_gl_composite_prepare_buffer
The three _cairo_gl* calls are what I’m interested in.
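To put a single number on a group of entries like these, the overhead column can be totalled for every symbol sharing a prefix. This is a rough sketch rather than anything perf provides: sum_overhead is a hypothetical helper, and it assumes the one-percentage-per-line layout shown above, with the symbol as the last whitespace-separated field (so symbol names containing spaces would not be matched).

```shell
# Hypothetical helper: total the overhead of all symbols starting with a prefix.
# Reads report lines on stdin in the layout shown in this post:
#   [+] <overhead>% <command> <shared object> [.] <symbol>
#   $1 = symbol prefix, e.g. _cairo_gl
sum_overhead() {
    awk -v prefix="$1" '
        # Drop the leading "+" expand marker if present (fields are re-split).
        { sub(/^[ \t]*[+][ \t]*/, "") }
        # Keep lines whose first field is a percentage and whose last
        # field (the symbol) begins with the requested prefix.
        $1 ~ /^[0-9.]+%$/ && index($NF, prefix) == 1 {
            total += $1 + 0      # "9.32%" converts to the number 9.32
        }
        END { printf "%.2f\n", total }
    '
}
```

Feeding it the listing above with the prefix _cairo_gl adds up 9.32, 3.60 and 1.32, giving 14.24, so the three calls together account for roughly 14% of the samples in this run.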
To visualize the call structure, you can then generate a dot graph using gprof2dot.py:
$ perf script | gprof2dot.py -f perf | dot -Tpng -o dot_graph.png
In this particular case, the dot graph doesn’t give much help. I think it’s because the function calls are invoked indirectly. So we must manually examine the routines to figure out which one calls which. The calls occur in this structure:
This suggests looking at _cairo_gl_bounded_spans for macro-optimizations, and at _cairo_gl_composite_prepare_buffer and below for micro-optimizations. (In my next post I’ll discuss what I actually found.)
The _cairo_tor_scan_converter_generate call obviously looks like a heavy hitter, but it’s outside our area of focus, the cairo-gl backend. Also, taking a peek at the code, it looks (to me) like a low-level routine that writes out pixels, so it is likely perfectly acceptable for it to be so high in the execution list, and it may be tough to find anything worth optimizing in it.