Linker: Performance tracing

After I completed the Rust migration of linker, I was dissatisfied with the runtime performance of the application, so I spent some additional time trying to resolve the performance issues.

To resolve performance issues in a methodical way you need to know where the issues are; otherwise you might optimize parts of the code that are not even the bottleneck.

The first thing I decided to do was to include and configure a library for tracing parts of the application. I went with OpenTelemetry, as I’ve worked with the JVM version earlier and there was an official version available for Rust. To visualize the traces generated by OpenTelemetry I chose Jaeger, also because I’ve had prior experience with it.

I started with six traces covering the different parts of the application. Visualizing the first run was really eye-opening, and it proves the importance of measuring what you are attempting to improve.

  1. collect_and_parse_arguments (65μs)
  2. read_configuration (153μs)
  3. collect_and_filter_source_nodes (678.43ms)
  4. collect_and_filter_target_nodes (56.59ms)
  5. filter (735.3ms)
  6. link_nodes_matching_link_maps (1m 13s)
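The durations above come from OpenTelemetry spans. The core idea of a span, measuring a named block of work, can be sketched with nothing but the standard library; the `traced` helper and the span name below are illustrative stand-ins, not the actual OpenTelemetry API:

```rust
use std::time::Instant;

// Minimal stand-in for a tracing span: time a closure and print the
// elapsed duration under a name. Real OpenTelemetry spans also carry
// context, attributes, and an exporter, but the measurement is the same.
fn traced<T>(name: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let result = f();
    println!("{name}: {:?}", start.elapsed());
    result
}

fn main() {
    let sum = traced("collect_and_filter_source_nodes", || {
        (0..1_000u64).sum::<u64>()
    });
    assert_eq!(sum, 499_500);
}
```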

Before the first trace, my thought was that the bottleneck was most likely in one of the “collect and filter”-methods, but the trace proved me wrong.

The “collect and filter”-methods traverse the source and target directories and create an index of the content, and the “link nodes”-method traverses the index to see if we can create symbolic links for any missing nodes based on the configuration. The first two methods are primarily IO bound, as they traverse the file system, while the last method is CPU bound. Since the application is single-threaded, it’s perhaps not a surprise that the CPU-bound method is slow.
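The IO-bound shape of the “collect and filter”-methods is a recursive directory walk that builds an index. A hypothetical sketch with `std::fs` (the directory layout and `index_dir` helper are made up for illustration):

```rust
use std::collections::BTreeSet;
use std::fs;
use std::path::Path;

// Recursively walk `dir` and record every file name in the index.
fn index_dir(dir: &Path, index: &mut BTreeSet<String>) -> std::io::Result<()> {
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        if path.is_dir() {
            index_dir(&path, index)?;
        } else if let Some(name) = path.file_name().and_then(|n| n.to_str()) {
            index.insert(name.to_string());
        }
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Build a tiny throwaway tree to traverse.
    let root = std::env::temp_dir().join("linker_index_demo");
    fs::create_dir_all(root.join("sub"))?;
    fs::write(root.join("a.txt"), b"")?;
    fs::write(root.join("sub").join("b.txt"), b"")?;

    let mut index = BTreeSet::new();
    index_dir(&root, &mut index)?;
    assert!(index.contains("a.txt") && index.contains("b.txt"));

    fs::remove_dir_all(&root)?;
    Ok(())
}
```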

The next thing I did was to include a data-parallelism library. I went with rayon1, as the API was rather unintrusive and easily allowed me to replace my .iter() calls with .par_iter().
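In my case the change was essentially mechanical, since rayon spreads an iterator’s work over a thread pool. The same shape can be sketched with `std::thread::scope` from the standard library; the `process` function and node values here are made up for illustration:

```rust
use std::thread;

// Hypothetical stand-in for the CPU-bound per-node work.
fn process(node: &u64) -> u64 {
    node * 2
}

fn main() {
    let nodes: Vec<u64> = (0..8).collect();
    let num_threads = 4;
    let chunk_size = (nodes.len() + num_threads - 1) / num_threads;

    // Split the slice into chunks and run each chunk on its own thread,
    // roughly what rayon's par_iter does with a work-stealing pool.
    let results: Vec<u64> = thread::scope(|s| {
        let handles: Vec<_> = nodes
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || chunk.iter().map(process).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    });

    assert_eq!(results, vec![0, 2, 4, 6, 8, 10, 12, 14]);
}
```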

Running the application again, I saw that performance was either the same or worse, which really threw me for a loop. If a single-threaded application is CPU bound, splitting the work across multiple threads should improve performance.

After some time debugging and reading several articles, I decided to try using debian:buster-slim as the base image instead of alpine:latest. This change improved performance drastically.

  1. collect_and_parse_arguments (62μs)
  2. read_configuration (159μs)
  3. collect_and_filter_source_nodes (228.44ms)
  4. collect_and_filter_target_nodes (25.18ms)
  5. filter (10.24ms)
  6. link_nodes_matching_link_maps (9.22s)
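The base-image change itself is a one-line edit in the Dockerfile. A hypothetical sketch of the runtime stage (the build-stage name and binary paths are illustrative):

```dockerfile
# Before: FROM alpine:latest
FROM debian:buster-slim

# Copy the release binary from a hypothetical build stage.
COPY --from=builder /app/target/release/linker /usr/local/bin/linker

ENTRYPOINT ["/usr/local/bin/linker"]
```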

I’ve not yet looked into the reason for the reduced performance on the alpine image, and I don’t want to speculate. However, I do recall running into a similar scenario, or reading about it, before.

Using debian:buster-slim instead of alpine:latest comes at a small cost with regard to image size: it went from ~5MB to ~76MB, which is still a long way from the initial JVM variant at ~350MB. I’ve updated the initial article with the new figures.


  1. Using .par_iter() along with tracing subsequent method calls caused those traces to be registered separately from the application trace. This is most likely due to me not passing the necessary context, but it’s something to be aware of. ↩︎