
Program hangs if sample-collection time is > interval #123

Open
2 of 3 tasks
benbuckman opened this issue Nov 6, 2019 · 8 comments

benbuckman commented Nov 6, 2019

Context

We are trying to enable StackProf in production on a REST API service. We would like to use mode: :wall, raw: true, aggregate: true, but it has been challenging to calibrate the interval. The segfault bug described in #81 makes very low/fast sampling rates concerning (especially because our hypothesis about the causes of those segfaults, which we have not verified, centered on overlapping signal handlers). But at an interval of (e.g.) 20ms the data is not particularly useful.

Observed behavior

At lower intervals, however, we encounter a separate bug in which the program doesn't crash but hangs – the CPU is pegged at 100% and the program being profiled does not advance. This specifically seems to occur when the time to capture a single sample is greater than the interval between samples. (The time it takes to capture a sample fluctuates, I assume, based on the size of the stack, CPU load, etc., so it can take a bunch of samples successfully and then hit this race.) What I suspect is occurring – and I am still in the process of debugging and verifying the details – is that it starts piling up jobs with rb_postponed_job_register_one, and by the time it finishes one sample it has started another (and so on forever), blocking any other instructions.
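
For context, this is roughly the shape of the mechanism I'm describing (a simplified sketch, not the exact stackprof source – GC handling and counters are omitted and the names are shortened):

/* Simplified sketch of the mechanism in question (not the exact stackprof
 * source; bookkeeping and GC handling omitted, names shortened). */
static void
sketch_job_handler(void *data)
{
    /* Runs later inside the VM, with the GVL, and walks the whole stack.
     * This is the part whose duration can exceed the sampling interval. */
    stackprof_record_sample();
}

static void
sketch_signal_handler(int sig, siginfo_t *sinfo, void *uctx)
{
    /* Runs on every timer tick, regardless of whether the previous
     * sample has finished being recorded, and defers the real work. */
    rb_postponed_job_register_one(0, sketch_job_handler, (void *)0);
}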

Proposal

I think code needs to be added to stackprof_record_sample or stackprof_record_sample_for_stack that measures the time it took to capture the sample. If that time is >= interval, it needs to handle that somehow. I can think of a few options at this point:

  1. Calculate how many samples' worth of time was borrowed and skip that number of upcoming samples. (This seems complicated and risky.)
  2. Inject fake frames into the results, similar to GC frames, that make it clear (to the final report or flamegraph) that this situation was encountered. (I'm not sure how feasible this would be.)
  3. Internally stop profiling, and when StackProf.results is called, raise an exception that makes it clear that the profiling was stopped due to this situation. (We don't want to raise mid-profiling because that could bubble up anywhere in the program. This seems the easiest of the 3.)
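
To make the third option concrete, here is the rough shape of the check I have in mind. This is only a sketch: the struct fields (interval_usec, sample_overrun) and the wrapper function are hypothetical, and the timer to stop would be ITIMER_REAL or ITIMER_PROF depending on mode.

/* Hypothetical sketch of option 3 -- the fields interval_usec and
 * sample_overrun do not exist in stackprof today. Time the sample; if it
 * ran longer than the configured interval, stop sampling and remember why
 * so StackProf.results can raise instead of returning data. */
static void
sketch_record_sample_with_guard(void)
{
    struct timeval t_begin, t_end;
    gettimeofday(&t_begin, NULL);

    stackprof_record_sample();

    gettimeofday(&t_end, NULL);
    int64_t elapsed_usec =
        (int64_t)(t_end.tv_sec - t_begin.tv_sec) * 1000000 +
        (t_end.tv_usec - t_begin.tv_usec);

    if (elapsed_usec >= _stackprof.interval_usec) {   /* hypothetical field */
        struct itimerval stop = {{0, 0}, {0, 0}};
        setitimer(ITIMER_REAL, &stop, NULL);   /* ITIMER_PROF in :cpu mode */
        _stackprof.running = 0;
        _stackprof.sample_overrun = 1;         /* checked later by results */
    }
}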

Next steps for me

  • Distill this bug down to a simple repro script.
  • Confirm that the bug is in fact caused by a pile-on of signal handlers / postponed jobs.
  • Implement one of the 3 solutions (probably the 3rd) behind a param passed to start or run and confirm that it fixes the bug.

(I'll update this issue as I do these.)

In the meantime, I'd be interested in anyone else's thoughts – have you also encountered this, do you know why it might be occurring, and what fix do you propose?

Thank you!


benbuckman commented Nov 6, 2019

OK, it's fairly easy to repro this. This uses recursion (to a max depth) because that causes the stack size to keep growing, which causes rb_profile_frames and stackprof_record_sample_for_stack to keep doing more work on each iteration.

  def recurse(num = 1, depth = 1)
    if depth == 10000
      num
    else
      recurse(num * 2, depth + 1)
    end
  end

At a high/slow sampling rate, this is fine:

StackProf.run(mode: :wall, interval: 1000) {recurse}

and it prints out some results. (And I see the same with interval: 100.)

But then down to 1 microsecond, this consistently hangs:

StackProf.run(mode: :wall, interval: 1) {recurse}

I assume the exact cutoff where this hangs will depend on context-specific factors like CPU. Using printf to observe the runtime, the amount of time required for stackprof_record_sample_for_stack is highly variable. But I think it only takes one, or a few, of the sample collections running longer than the interval for the program to hang indefinitely.
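
(The printf instrumentation was nothing fancy – roughly the following, inside stackprof_record_sample, wrapped around the existing call. Illustrative only, not committed code; num and timestamp_delta come from the surrounding function.)

/* Illustrative timing instrumentation inside stackprof_record_sample(),
 * wrapped around the existing call to stackprof_record_sample_for_stack. */
struct timeval t0, t1;
gettimeofday(&t0, NULL);

stackprof_record_sample_for_stack(num, timestamp_delta);

gettimeofday(&t1, NULL);
long elapsed_usec = (t1.tv_sec - t0.tv_sec) * 1000000L +
                    (t1.tv_usec - t0.tv_usec);
fprintf(stderr, "duration of stackprof_record_sample_for_stack: %ld usec\n",
        elapsed_usec);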


NickLaMuro commented Nov 7, 2019

@benbuckman So, looking at this per your request over in #122 –

My initial thought after looking through your write-up and suggestions again was "why don't we just use a lock, so that when multiple job handlers are being acted on at the same time, the handler just bails out early" – but it turns out that was already implemented:

static void
stackprof_job_handler(void *data)
{
    static int in_signal_handler = 0;
    if (in_signal_handler) return;
    if (!_stackprof.running) return;

    in_signal_handler++;
    stackprof_record_sample();
    in_signal_handler--;
}

... or at least, I think it is. Not sure if line 551 is causing it to basically reset the counter every time the handler method is fired, but that might just be my lack of experience with C talking here...

If that code is indeed not working as intended, do we know why? Is it because the code isn't working as I described, or is something else at play?

Would a more proper mutex help here?

I will also try reproducing using the method you described above.

Edit: My lack of C experience is showing here. The static extends the lifetime of that variable to the end of the program, so in this case in_signal_handler should be acting like a simple mutex. That said, I'm unsure whether that is enough when the interval gets so low.
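
(For anyone else whose C is rusty, a minimal illustration of that point – this is not stackprof code:)

/* A `static` local keeps its value across calls, so it can act as a
 * simple reentrancy guard for the function it lives in. */
void guarded_work(void)
{
    static int busy = 0;   /* initialized once, lives for the whole program */
    if (busy) return;      /* a re-entrant call bails out early */
    busy = 1;
    /* ... do the real work ... */
    busy = 0;
}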

Edit 2: After re-reading your "working theory", everything above is beside the point, since the lock doesn't prevent work from being piled up right after the handler code and outside of the lock... as you had already described. That said, I think I am able to provide some evidence supporting your theory, so I will do so in my next comment.

@NickLaMuro

Alright, so I think I have some evidence to support @benbuckman's claim that we are creating a situation where the "queue" piles up when the interval gets too small, which probably checks off your second box.

Most of the info for this is in a gist I created:

https://gist.github.com/NickLaMuro/3b54c57e1ffeb21316f38529fa93ea67

But effectively it is a modified form of the reproducer that @benbuckman posted earlier. What I did was basically wrap his recurse method in an iterator and tweak some of the values when running it. The iterator was there to make sure I was getting consistent results between runs that I could go off of, and it also allowed me to kill off the "hanging" process at consistent intervals.

In addition, when the process was killed, I also kept track of how large caller had grown at that point, which helps determine how much was left to "work" on.

I was surprised at how quickly I got consistent results. Once I started seeing somewhat consistent results when the process timed out, I used a rough average to set the depth so that, ideally, it would sometimes fail and sometimes pass. The number I ended up on was 900, and it did exactly what I expected.

I think what this shows, indirectly, with respect to your second "TODO" item:

  • Confirm that the bug is in fact caused by a pile-on of signal handlers / postponed jobs.

is that there is a pileup, and that work is getting completed, though it is hindered specifically by the extra work that stackprof ends up injecting as part of its profiling. I didn't add timestamps to the log output statements (I should have, and probably will shortly), but basically "Test 1" happens instantly (as Ben probably observed as well), whereas when I dialed it in to a depth of 900, it would take nearly the entire 30 seconds to complete.


So after going through this exercise, I am curious whether we should go about this in a similar (though opposite) fashion to what I mentioned in a comment on Ben's previous PR:

So with that said, I wonder if it makes more sense to instead drop the multiplication code all together, and instead just error when the interval size is too large. Maybe at 50 times a second (so 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.

Obviously, in this case, we are going in the complete opposite direction, but at some point the interval we are working with becomes so low that we are better off with an event-based profiler like ruby-prof instead of this. So I guess I am curious whether we should provide a warning (or even the "error" suggestion you provided in the description), or even just some documentation around the limits of too small an interval.
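
If we did go the hard-error route, the guard itself would be tiny. A rough sketch (threshold numbers are placeholders, and `interval` here is assumed to be the VALUE already pulled out of the options hash):

/* Sketch of a guard on the requested interval (numbers are placeholders).
 * Too large and sampling is pointless; too small and the sampler can't
 * keep up with its own work. */
long interval_usec = NUM2LONG(interval);
if (interval_usec < 10 || interval_usec > 1000000) {
    rb_raise(rb_eArgError,
             "interval must be between 10 and 1000000 microseconds");
}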

That said, you mentioned that you had this issue in production. Are you using a more reasonable interval value in that case, or are you trying to push the interval value to something really small? My question would then be: is there something else at play here that is causing Ruby to basically not get any work done in between samples? Or something causing the stack to be incredibly large in your case?

@benbuckman

Thank you @NickLaMuro !

My C knowledge is very weak, but I was able to add a bunch of timestamp diffs to the code – specifically, in stackprof_record_sample – and print them out. So, adding another angle –

  1. With mode: :wall, interval: 200 (microseconds), the flow (in a test, profiling my earlier recurse code) looks like this:
sampling 311 usec since last, 311 since start, with interval 200
duration of stackprof_record_sample: 79 usec with interval 200
sampling 36 usec since last, 426 since start, with interval 200
duration of stackprof_record_sample: 67 usec with interval 200
sampling 136 usec since last, 629 since start, with interval 200
duration of stackprof_record_sample: 51 usec with interval 200

... [a few dozen more lines] ...

sampling 107 usec since last, 17221 since start, with interval 200
duration of stackprof_record_sample: 83 usec with interval 200

(and then it prints out results and exits nicely.)

  2. However, with interval: 50, it's like this:
sampling 117 usec since last, 117 since start, with interval 50
duration of stackprof_record_sample: 187 usec with interval 50
sampling 35 usec since last, 339 since start, with interval 50
duration of stackprof_record_sample: 32 usec with interval 50
sampling 6 usec since last, 377 since start, with interval 50
duration of stackprof_record_sample: 13 usec with interval 50
sampling 27 usec since last, 417 since start, with interval 50
duration of stackprof_record_sample: 19 usec with interval 50
sampling 26 usec since last, 462 since start, with interval 50
duration of stackprof_record_sample: 19 usec with interval 50
sampling 34 usec since last, 515 since start, with interval 50
duration of stackprof_record_sample: 20 usec with interval 50
sampling 48 usec since last, 583 since start, with interval 50
duration of stackprof_record_sample: 29 usec with interval 50
sampling 3 usec since last, 615 since start, with interval 50
duration of stackprof_record_sample: 16 usec with interval 50

... [50k more lines] ...

sampling 7 usec since last, 3657101 since start, with interval 50
duration of stackprof_record_sample: 77 usec with interval 50
sampling 9 usec since last, 3657187 since start, with interval 50
duration of stackprof_record_sample: 69 usec with interval 50
Interrupted. Exiting...

(It was going on forever; I had to SIGTERM the process.)

My interpretation:

  1. On this sample code, it takes ~10..187 microseconds to run stackprof_record_sample.
  2. The static int in_signal_handler mutex is sort of working, insofar as samples don't seem to overlap.
  3. However, samples are being queued up faster than they can be collected, so the sampler hogs the whole process ~forever.


benbuckman commented Nov 7, 2019

One possible solution which seems like it might work is simply moving the in_signal_handler lock from the static locals in the job handlers (which operate on already-enqueued jobs) to the _stackprof global struct, checked in both the job handlers and the signal handler... so that it doesn't enqueue jobs while it's still working on one.

I'm not sure what this would mean for the final results, in terms of lost frames; I'll have to think about that tomorrow. I'll put up a POC-PR with what I've got so far shortly.
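
Roughly what I have in mind (a sketch only – the in_signal_handler field on _stackprof is the proposed addition, and the real handlers do more bookkeeping than shown):

/* Sketch of the proposed change: make the guard a field on the global
 * _stackprof struct so the signal handler can see it too and skip
 * registering new jobs while a sample is still being recorded. */
static void
stackprof_job_handler(void *data)
{
    if (!_stackprof.running) return;

    _stackprof.in_signal_handler++;
    stackprof_record_sample();
    _stackprof.in_signal_handler--;
}

static void
stackprof_signal_handler(int sig, siginfo_t *sinfo, void *ucontext)
{
    /* Still recording the previous sample: drop this tick entirely
     * instead of queuing more work. */
    if (_stackprof.in_signal_handler) return;

    rb_postponed_job_register_one(0, stackprof_job_handler, (void *)0);
}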

Edit: WIP-PR is at #124

@benbuckman

One other note, re

So I guess I am curious whether we should provide a warning (or even the "error" suggestion you provided in the description), or even just some documentation around the limits of too small an interval.
That said, you mentioned that you had this issue in production. Are you using a more reasonable interval value in that case, or are you trying to push the interval value to something really small?

One challenge here is that the safe interval is not constant. It depends on the size of the stack at any given moment and on whether the same frame is repeating (because stackprof_record_sample in raw+aggregate modes does a bunch of analysis/aggregation on the frames). So I don't think it's as simple as setting a minimum number that the code always rejects; it has to be context-aware (e.g. know how long each sample is actually taking).

@tenderlove

Hi, I'll take a look at this too. Of course, any solution to this will result in dropped samples. Is there a reason you need to make the sampling frequency so high? Or is that just to demonstrate the bug? I'm curious because I've never run into this, and I use this tool very frequently.


isobit commented Aug 23, 2021

Hey all, we're still running into this with Ruby 3 despite #150. In our case we're profiling in a separate thread using something like this, which never returns since the thread is never given the opportunity to wake up:

profile = StackProf.run(mode: :cpu, raw: true, interval: 1000) do
  sleep(duration)
end

I'm not sure exactly why the sampling takes so long with our situation, but I've managed to repro this by forcing stackprof_record_sample() to take longer than the sampling interval with some dummy work:

index 98d4857..5f0f462 100644
--- i/ext/stackprof/stackprof.c
+++ w/ext/stackprof/stackprof.c
@@ -578,6 +578,9 @@ stackprof_record_sample_for_stack(int num, int64_t timestamp_delta)
 void
 stackprof_record_sample()
 {
+    volatile unsigned long long i;
+    for (i = 0; i < 1000000000ULL; ++i);
+
     int64_t timestamp_delta = 0;
     int num;
     if (_stackprof.raw) {

My hypothesis is that when the handler takes longer than the SIGPROF timer interval, by the time a SIGPROF handler is done there is already another pending SIGPROF that preempts everything else. It's really the kernel/signal system that's doing the "queuing", since signals that arrive during handling don't interrupt immediately, but rather are blocked and delivered at the next opportunity.

Unfortunately when a process enters this state, it's impossible to recover, since the signal handling code always preempts other work (and seems to be holding the GVL).
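
For anyone who wants to see that kernel-level behaviour in isolation, here is a minimal standalone C program (no Ruby or stackprof involved) whose handler deliberately busy-loops for longer than the timer interval:

/* Standalone demo (not stackprof code): when the handler takes longer
 * than the timer interval, another SIGPROF is already pending by the
 * time it returns, so the main loop barely runs.
 * Build with e.g.: cc -O0 demo.c -o demo */
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks = 0;

static void on_prof(int sig)
{
    /* Busy-loop that (on most machines) takes much longer than 1ms. */
    volatile unsigned long i;
    for (i = 0; i < 50000000UL; i++) {}
    ticks++;
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    /* Fire SIGPROF every 1000 microseconds of CPU time, like :cpu mode. */
    struct itimerval timer = { {0, 1000}, {0, 1000} };
    setitimer(ITIMER_PROF, &timer, NULL);

    /* This "real work" loop makes almost no progress: each time a handler
     * returns, the pending SIGPROF is delivered immediately. */
    for (unsigned long n = 0; ; n++) {
        if (n % 10000000UL == 0)
            printf("main loop reached %lu (handler ran %d times)\n",
                   n, (int)ticks);
    }
}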
