
feat: Allow reading per-sample tags from fiber-local variable storage #203

Closed
wants to merge 11 commits

Conversation


@dalehamel dalehamel commented Apr 9, 2023

What

This adds support for applying tags to individual stackprof samples. Tags are read from fiber-local variable storage and add extra metadata to each sample collected.

In brief:

profile = StackProf.run(mode: :cpu, tags: [:foo]) do
  math
  StackProf::Tag.with(foo: :bar) do
    math
  end
  math
end

This will result in tags being added to the stackprof output, which can be viewed like so:

irb> StackProf::Tags.from(profile)
=> [{}, {}, {}, {}, {}, {}, {:foo=>"bar"}, {:foo=>"bar"}, {:foo=>"bar"}, {:foo=>"bar"}, {}, {}, {}, {}, {}]

In practice, I hope the utility of this will be to tag samples with useful labels like controller=HomeController, action=Welcome, so that specific parts of application code can be analyzed by selecting the samples that contain them. Tags can be applied as wrappers around existing framework actions, and can potentially help correlate profiles to application domain areas.

In addition to user-specified context, some "built-in" context is also potentially useful. For now, there is only support for reading the current Ruby thread's ID via :thread_id and the current fiber's ID via :fiber_id. In future PRs, other built-ins like "pid", "pthread id" (since it looks like we already store a handle to it?), "thread name", etc. could be added. These "built-in" tags do not count towards the user tag total, StackProf::Tag::MAX_TAGS.
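For instance, a minimal sketch (assuming the API described above, and reusing math from the earlier example as a stand-in for arbitrary work):

profile = StackProf.run(mode: :cpu, tags: [:thread_id, :fiber_id, :foo]) do
  StackProf::Tag.with(foo: :bar) { math }
end
# Each sample's tags now include :thread_id and :fiber_id alongside :foo;
# the built-ins do not count towards StackProf::Tag::MAX_TAGS.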

Why

Although we can already apply tags at the profile level with stackprof's existing metadata field, we cannot tag at the granularity of individual samples. Other profiling formats, such as pprof, support sample-level tags.

With each tag applied to a sample, a new "index" to select on is available for aggregating samples sharing the same tag values. Open source tools like Grafana Phlare / Pyroscope can provide a means of querying the profiles based on these new dimensions.

For example, #194 is one use case served by this, since the gecko format makes threads a first-class citizen. I have included a proof of concept for this use case as well, using some existing golang converter code, but a future PR could generate a "gecko report output" using the tags to map to gecko thread IDs.

How

This is implemented as an optional extension to stackprof, parallel to how it collects frame samples. If the tags argument is not supplied, stackprof's behaviour is unchanged. If tags are supplied, then during stackprof's signal handler it will check the thread whose frames it just buffered for matching fiber-local variables, and buffer whatever it finds.

Stackprof's design already involves periodically interrupting execution in order to collect stack traces. We take advantage of this interrupt to collect the tags at a point when VM state is already "paused" due to either the SIGPROF or SIGALRM interrupt firing, guaranteeing that they should be coherent with the stacktrace; thus there is no "data race" that could lead to tags being mis-attributed to the wrong callchain.

Like stackprof's existing frame buffering, the tag buffering routine must be signal safe. It only copies memory to a buffer, in order to minimize interrupt overhead; the existing postponed job that records the frame buffer then finishes by likewise recording the tags. This is where the buffer is recorded and reset, and heap space is allocated as necessary to maintain the array of tag recording entries. Just as stackprof optimizes for repeated identical callchains, the recording process compares the current tagset with the previously recorded one in order to count repeats and reduce redundancy.

Use cases

I've included converter.zip, extracted from another codebase, which converts stackprof to pprof and pprof to gecko, in order to demonstrate two use cases; those formats already have some native support for the tags we can collect here.

Future PRs could add such output formats to stackprof natively, but to reduce the scope of this PR I've included these PoC converters so that my results below can be externally replicated.

Querying on Tags

Here is a stackprof profile, generated from one of the tests here with tags:

And after converting it to pprof:

We can verify the breakdown of tags with go tool pprof -tags:

$ go tool pprof -tags cpu_profile_with_tags.pb.gz
Main binary filename not available.
 function: Total 287.0
           190.0 (66.20%): slow
            97.0 (33.80%): fast

 thread_id: Total 391.0
            391.0 (  100%): 0x0000000100d4b128

Uploading this to Grafana Phlare (specifically via their brand new /pyroscope/ingest endpoint, which appears to support sample-level tags in pprof), we can query on these tags:

[Screenshots: Grafana Phlare flame graphs filtered by the function tag, with the full stacktraces collapsed]

The callchains match what we expect: chains with slow_function are tagged with function: slow, and chains with fast_function are tagged with function: fast.

Note also that the displayed sample counts match the output from go tool pprof -tags earlier.

We can apply tags within a framework like Rails to make this transparent. For instance, the following snippet (inspired by Pyroscope Ruby's tag auto-instrumentation):

module StackProf
  extend self

  def initialize_rails_hooks
    tag_wrapper = lambda do |ctrl, action|
      StackProf::Tag.with(
        action: ctrl.action_name,
        controller: ctrl.controller_name,
        &action
      )
    end

    ActionController::API.__send__(:around_action, tag_wrapper) if defined? ActionController::API
    ActionController::Base.__send__(:around_action, tag_wrapper) if defined? ActionController::Base
  end
end

StackProf.initialize_rails_hooks

This will result in being able to query on the "action" and "controller" tags, to see where controllers are spending their time. Other frameworks exposing these sorts of hooks, or middlewares, can be used to apply and wrap with tags, potentially including other useful metadata such as request IDs.
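As a sketch of the middleware route (hypothetical, not part of this PR; the request_id tag name is illustrative), a Rack middleware could wrap every request in a tag scope:

# Sketch only: assumes the StackProf::Tag.with API from this PR.
class StackProfTagging
  def initialize(app)
    @app = app
  end

  def call(env)
    response = nil
    # Capture the response explicitly rather than relying on Tag.with
    # returning the block's value, which this PR does not specify.
    StackProf::Tag.with(request_id: env["HTTP_X_REQUEST_ID"].to_s) do
      response = @app.call(env)
    end
    response
  end
end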

Multi-threaded view

Here is a PoC for using these tags to power multi-threaded profile views.

Viewing the resulting profile's frame graph in speedscope, we can see both threads combined:

[Screenshot: speedscope flame graph showing both threads combined]

But viewing the converted gecko format in profiler.firefox.com, we can see them show up as separate threads in the timeline at the top and, based on the sample order, which thread was active:

[Screenshots: profiler.firefox.com timeline showing the two threads separately]

Note that the gecko converter is heavily WIP, so many aspects of the visualization are off (e.g. it shows timestamps even though the profile is represented as just "samples", which is misleading). It is also naive about how much time each thread is active or inactive in the timeline; all time steps are assumed equal.

Implementation

Sample Tags

Sample tags are now stored on the stackprof struct as an array of mappings between string IDs; the IDs are indices into the :tag_strings field, in the order the strings were encountered when reading tag values.

The :sample_tags field is an array of mixed types, effectively a flattened array of 2-tuples: even-indexed elements hold a tag set, and odd-indexed elements hold how many times the preceding tag set repeats:

irb> profile[:sample_tags]
=> [{1=>2}, 16, {1=>2, 3=>4}, 6, {1=>2}, 11, {1=>2, 3=>5}, 14, {1=>2}, 10, {1=>2, 3=>4}, 3, {1=>2}, 8, {1=>2, 3=>5}, 14, {1=>2}, 11, {1=>2, 3=>4}, 4, {1=>2}, 9, {1=>2, 3=>5}, 20, {1=>2}, 14, {1=>2, 3=>4}, 6, {1=>2}, 8, {1=>2}, 3, {1=>2, 3=>5}, 17, {1=>2}, 13, {1=>2, 3=>4}, 6, {1=>2}, 10, {1=>2, 3=>5}, 18]
irb> profile[:tag_strings]
=> ["", "thread_id", "0x0000000100a4b158", "function", "fast", "slow"]

Collection Overhead

The overhead added to each sample is O(T), where T is the number of tags per sample. Currently this has a hardcoded maximum of 16. For tag keys, the maximum string length is 128 bytes, and for tag values, the maximum is 512 bytes. These are arbitrary limits, intended to prevent enormous strings from being used as tags while leaving some room for large unique IDs to be stored. Thus the maximum amount of string data copied to the buffer while buffering tags is 16 * 128 + 16 * 512 bytes, about 10 KB. In practice, far less data than this will usually be copied.

Values are read directly from existing variables, so most of the operations are type and nil checks, and copying data around.
There are a few function calls added for each tag collected, however:

  • Reading the thread ID (optional)
    • Just the integer id of the current fiber / thread is recorded, so it should be very fast.
  • Reading the thread local variable
    • Involves a function call for each one
    • Copying this to a buffer
  • Recording the buffer to the _stackprof struct, accumulating a set of tags for each sample

The majority of the overhead here is in the last step of recording it, as we need to:

  • Process T tags
    • Check if we already have a string ID for this tag, which is an st_lookup, but lets us take advantage of its hash function to check if we've seen the string before and quickly obtain its integer representation.
    • Check for equality with the previous tag from the last sample (done via integer comparisons after getting string ids in above step). It is asserted that the sizes of the tag sets match first, then each value is compared with the logic short-circuiting on the first miss.
    • Accumulate the results, either by pushing a new tagset if there was no match, or incrementing the number of times the tagset repeats.

The following gist shows what the overhead looks like for tag vs frame buffer collection and recording, including a basic diff for collecting the overhead: https://gist.github.com/dalehamel/18a8f291864cee2bb3e34285f6f82c0c

It is a trivial example, but the key points are:

  • Generally, in the test suite at least, buffering tags takes less time than buffering frames; this is the most critical overhead added here, as it happens during the interrupt handler
  • The majority of the overhead is in the recording of the tags, which runs in a postponed job

The overhead will scale with the number of tags and amount of tag data to be read, which the unit tests don't stress very hard.

Storage Overhead

We can assume that many string values will repeat, so rather than store the same string many times, we use an st_table (with its string hashing feature) as a string table to detect whether we have seen a string before. The happy path is that the lookup succeeds and returns the unique integer ID for that string. If it is a string never seen before, we assign it an ID and add it to the struct so future lookups will hit and get that ID. If a sample has lots of unique tag values (request ID, trace ID, span ID), this optimization is somewhat blunted, but it reduces the storage size of tags that are consistent across multiple samples.

Currently, with T = the number of tags per sample and S = the number of samples, the maximum size of the sample_tags field is T * S * sizeof(int) * 2.
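For example, assuming 4-byte integers, 16 tags per sample and 1,000 samples gives a worst case of 16 * 1000 * 4 * 2 = 128,000 bytes (about 125 KB) of tag storage, before the repeat-count optimization described next.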

However, we generally expect that tags will not change from sample to sample, and that tags for contiguous samples will match. This allows us to optimize: if the tagset of the current sample is exactly equal to the tagset of the previous sample, the previous sample's repeats counter is incremented rather than inserting a new tagset.
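A rough Ruby rendering of this recording logic (illustrative only; the real implementation is C in ext/stackprof/stackprof.c and uses an st_table for the string lookups):

# string_ids reserves the empty string at index 0, mirroring the
# :tag_strings convention; tags maps tag names to values.
def record_tagset(sample_tags, string_ids, tags)
  intern = ->(str) { string_ids[str] ||= string_ids.size }
  tagset = tags.to_h { |k, v| [intern.call(k.to_s), intern.call(v.to_s)] }
  if sample_tags[-2] == tagset
    sample_tags[-1] += 1          # same tags as the previous sample: bump the repeat count
  else
    sample_tags.push(tagset, 1)   # otherwise start a new (tagset, repeats) pair
  end
end

string_ids  = { "" => 0 }
sample_tags = []
2.times { record_tagset(sample_tags, string_ids, thread_id: "0x1", function: "fast") }
sample_tags # => [{1=>2, 3=>4}, 2]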

We can decode this with a helper, StackProf::Tags.from(profile), to map to
the tag values, which will be 1:1 per sample:

irb> StackProf::Tags.from(profile)
=> [{:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118"}, {:thread_id=>"0x0000000102ddb118", :function=>"fast"}, {:thread_id=>"0x0000000102ddb118", :function=>"fast"}, {:thread_id=>"0x0000000102ddb118", :function=>"fast"}, {:thread_id=>"0x0000000102ddb118", :function=>"fast"}, {:thread_id=>"0x0000000102ddb118", :function=>"fast"}, ... ]

Because the tags are not stored verbatim, a helper is needed to decode them: the data structure reported by StackProf.results requires further parsing to "decode" it, which the Ruby helper StackProf::Tags.from(profile) provides.

Tag Strings

To support optimizing the storage usage of the tag collection, a single string array is maintained, appended to for each unique new string encountered. This is where the bulk of the added size will be for profiles featuring tags. It is a new field on the stackprof struct which contains all unique tag strings. Note that by convention the first entry is the empty string, "", with an ID of 0, and the numeric string IDs are indices into this 0-indexed array.

Known limitations

My current experience is that the new tests are generally not the source of CI failures, but I have seen them fail in unexpected ways in CI matrix testing, or after many repeated runs with the following script, used to try to "harden" them:

untilfail() { i=0; while "$@"; do i=$((i+1)); echo "attempt $i succeeded"; done; echo "ran $i times before failure"; }
untilfail bundle exec ruby -I lib test/test_stackprof_tags.rb

Here are some of the rare failures that I see in CI which break assertions around the expected tag order in the tests:

  • Sometimes an unexpected thread ID shows up in the tests which breaks the assertions of tag order (as it should)
    • No idea why this happens or which thread it belongs to; it isn't created by the tests. Could it be coming from minitest internally? I cannot figure out where this rogue thread ID comes from
    • Perhaps the test suite could be updated to drop any samples with unknown thread id in order to reduce flakiness when a rogue thread shows up in the samples?
  • Sometimes a sample is missing all tags, even built-ins like :thread_id, even though the profile is configured to collect them
    • This shouldn't be possible, as it should always be able to read the current thread ID afaik
    • Test failure looks like: 1/143 samples did not contain the tag thread_id
    • I think it can also manifest as a sample missing all tags, including those read from the thread-local storage, which can cause a test to fail because an unexpected empty tagset throws off the expected tag order
    • My current theory is some nuance in how postponed jobs / interrupts work: if I set the interval very small, the frequency of this seems to go up, but at the default interval it is rarer. Perhaps some VM state is being hit that we can't read?
  • Due to a race condition, it is possible for values set in the parent to not immediately propagate to the child
    • This shows up as failures in StackProfTagsTest#test_sample_tag_persistence_from_parent, where a sample is taken after the new thread has started, but before the thread local variables are set within the child thread
    • The test suite has been updated to permit either case, to prevent spurious failures due to this race condition.

I'm not sure what the root cause of each of these issues is yet, but I'll keep digging as best I can. The most concerning issue to me is the samples that lack any tags, which I cannot explain yet. However, I think this can be reasonably caveated with the disclaimer that tags are "best effort" and not guaranteed for a sample, but that if tags are collected, they are believed to be accurate.

One major issue with the new tests, however, is that they are substantially slower than the existing tests. The base stackprof suite runs in about 300ms, and the new tag suite runs in about 1500ms. This is due to a few slower tests for more complex cases involving nested threads, which ensure the correct tags are read under different concurrency situations. The majority of the single-threaded tests use mode: :wall, the default, and this makes it easier to collect samples to make assertions on. However, for multi-threaded tests, we currently by default only signal the thread that called stackprof in :wall mode, so these tests aren't compatible with mode: :wall, and thus mode: :cpu is required. In order to make the desired frames appear in the tests, we need to do additional math to guarantee that stackprof will collect a snapshot of the work for us to make assertions on. In general, tests that use mode: :wall are fast, and those with mode: :cpu are slower and more nuanced.

@dalehamel dalehamel changed the title WIP - Stackprof tags feat: Allow reading per-sample tags from fiber-local variable storage Apr 18, 2023
@dalehamel dalehamel marked this pull request as ready for review April 18, 2023 17:47
@dalehamel
Author

As a heads-up in case anyone reviews this before I push the fixes: I am testing this PR internally and finding a few minor bugs, and will push fixes sporadically as I find them. I have two pending fixes for edge cases I've discovered in testing with "real apps" and not just the test suite.

@@ -2,6 +2,7 @@
   require "stackprof/truffleruby"
 else
   require "stackprof/stackprof"
+  require "stackprof/tags"

@itarato itarato May 2, 2023


When using this on a TruffleRuby project, code using StackProf::Tag will fail (the constant is not loaded). If I'm not mistaken we don't need this for TruffleRuby (as it has its own internal profilers); maybe we could stub it with a no-op class and load that in that case.

@dalehamel
Author

Yeah, I think we'll probably need a stub, as usage of the StackProf::Tag helpers is intended to be throughout the codebase, so at least having the receivers available should prevent runtime errors and avoid putting every single call behind an if-truffleruby guard.

@dalehamel
Author

dalehamel commented May 26, 2023

@tenderlove if you have the time, sorry it's a big one.

@tenderlove
Collaborator

I really like this idea, but the patch is really large. 😅

Gecko format supports a "Marker chart", and I've been messing with converting Stackprof profiles to Gecko format. Gecko format has an idea of "markers" and the markers basically span an amount of time (and are specific to a thread). I've made a sample profile here that uses the markers to denote GC events (click "Marker chart").

The markers only need a time (in our case the monotonic clock) for the profile viewer to make sense of them, so could we have a Ruby API that just stores these in an array? Then, when we dump the profile, include those markers?

At least in Gecko format, the names of the markers can be arbitrary, and we can even include a bag of data with them.

Implementation is something like this:

INSTANT = 0
INTERVAL = 1
START = 2
FINISH = 3

def start name, data
  Stackprof.markers << [Thread.current.id,
                        Process.clock_gettime(Process::CLOCK_MONOTONIC),
                        nil,
                        name,
                        data,
                        START]
end

def finish name, data
  Stackprof.markers << [Thread.current.id,
                        Process.clock_gettime(Process::CLOCK_MONOTONIC),
                        nil,
                        name,
                        data,
                        FINISH]
end

Usage is something like this:

  Stackprof.start("Request", { :controller => controller, :action => action })
    ...
  Stackprof.finish("Request", { :controller => controller, :action => action })

I think this would be enough for Gecko. I'm not sure about the other formats (but to be honest, I don't know if I care that much? 😅)

@dalehamel
Author

@tenderlove thanks for taking a look, and yeah sorry for the size. I wanted to show things working end to end which is why I added a bunch of tests. The actual C code changes aren’t as big but I definitely understand not wanting to take on the baggage if not needed.

the approach you’ve outlined sounds like a better way to address this. If the data is ultimately there it should be possible to convert to formats other than gecko too.

I’ll close this and hopefully can build off your work with markers.

@dalehamel dalehamel closed this Jul 12, 2023
@tenderlove
Collaborator

Great, and thanks for opening the PR. I hadn't thought of this, and it's definitely a great feature
