
Use postponed jobs if YJIT is enabled. #180

Merged
1 commit merged into tmm1:master from yjit-head-compat on Jul 18, 2022

Conversation

casperisfine
Contributor

Fix: #179

YJIT doesn't support being interrupted by a signal in random places, and probably never will.

Basically, this turns off #150 if YJIT is enabled.

cc @tenderlove @XrXr @jhawthorn

@casperisfine
Contributor Author

I got some weird failures on 2.3 and 2.4, but I'm not sure it's worth digging into them; I'd be tempted to make 2.5 the minimum tested version.

@casperisfine
Contributor Author

I tested this on our YJIT CI and it does get rid of the VM crashes.

tenderlove merged commit ec245ef into tmm1:master on Jul 18, 2022
tenderlove deleted the yjit-head-compat branch on July 18, 2022 at 15:53
The review comment below refers to this hunk from the diff:

     * See https://github.com/ruby/ruby/commit/0e276dc458f94d9d79a0f7c7669bde84abe80f21
     */
    #if RUBY_API_VERSION_MAJOR < 3
        stackprof_use_postponed_job = 0;

This seems like a bug.
If Ruby < 3.0 (or YJIT is enabled), we want to use a postponed job.
In other words, rb_profile_frames() can only be used from anywhere (i.e. from the signal handler) on Ruby 3.0+ without YJIT.
This is most likely the cause of https://bugs.ruby-lang.org/issues/18967 and the recent CI failures in #182 (comment)

A contributor replied:

Fixed in #186
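For clarity, here is a minimal sketch (in C, in the spirit of the hunk above) of the condition the reviewer describes: sampling directly from the signal handler is only safe on Ruby 3.0+ without YJIT, and every other combination should keep using a postponed job. The function name and the yjit_currently_enabled() helper are placeholders for illustration; the actual #186 patch may detect YJIT differently.

    #include <ruby/version.h>   /* defines RUBY_API_VERSION_MAJOR */

    /* Hypothetical helper: answers "is YJIT enabled in this process?".
     * Stackprof's real detection mechanism may differ. */
    extern int yjit_currently_enabled(void);

    static int stackprof_use_postponed_job = 1;  /* safe default: sample via a postponed job */

    static void configure_sampling_mode(void)
    {
    #if RUBY_API_VERSION_MAJOR >= 3
        if (!yjit_currently_enabled()) {
            /* Ruby 3.0+ without YJIT: rb_profile_frames() may run directly
             * in the signal handler, so postponed jobs are not needed. */
            stackprof_use_postponed_job = 0;
        }
    #endif
        /* Ruby < 3.0, or YJIT enabled: keep using a postponed job. */
    }

Something along these lines would run once at extension initialization; see #186 for what was actually merged.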

@eregon
Contributor

eregon commented May 21, 2023

One surprising side effect of this is wildly different profiles when using YJIT vs not, and some of these differences are not due to JITed code itself but to the use of postponed jobs when YJIT is enabled.

For instance, I'm looking at YARP parse, dump and load performance (on the varint branch).
3.2.1 without YJIT:

$ ruby -Ilib -ryarp -rstackprof/autorun -rbenchmark -e '10.times { p Benchmark.realtime { Dir.glob("top-100-gems/activesupport-7.0.4.3/**/*.rb") { |file| code = File.read(file); YARP.load code, YARP.dump(code, file) } } }'
0.20659038900066662
0.20447865699952672
0.20755467499930091
...
StackProf results dumped at: /tmp/stackprof20230521-28681-srrvhw.dump
$ stackprof /tmp/stackprof20230521-28681-srrvhw.dump
==================================
  Mode: wall(1000)
  Samples: 2006 (0.35% miss rate)
  GC: 144 (7.18%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       273  (13.6%)         273  (13.6%)     YARP.dump
       349  (17.4%)         237  (11.8%)     Class#new
       275  (13.7%)         227  (11.3%)     YARP::Serialize::Loader#load_integer
      1532  (76.4%)         163   (8.1%)     YARP::Serialize::Loader#load_node
       496  (24.7%)         156   (7.8%)     YARP::Serialize::Loader#load_token
        97   (4.8%)          97   (4.8%)     (sweeping)
       503  (25.1%)          79   (3.9%)     YARP::Serialize::Loader#load_location
        79   (3.9%)          79   (3.9%)     StringIO#read
      1517  (75.6%)          58   (2.9%)     Enumerable#map
        58   (2.9%)          58   (2.9%)     StringIO#getbyte
       397  (19.8%)          57   (2.8%)     YARP::Serialize::Loader#load_optional_token
      1862  (92.8%)          54   (2.7%)     Integer#times
      1478  (73.7%)          51   (2.5%)     YARP::Serialize::Loader#load_optional_node
        48   (2.4%)          48   (2.4%)     String#[]
        48   (2.4%)          48   (2.4%)     String#unpack1
        47   (2.3%)          47   (2.3%)     (marking)
        41   (2.0%)          41   (2.0%)     YARP::Location#initialize
        38   (1.9%)          38   (1.9%)     IO.read
        36   (1.8%)          36   (1.8%)     String#force_encoding
        27   (1.3%)          27   (1.3%)     YARP::Token#initialize
      1517  (75.6%)          18   (0.9%)     Enumerator#each
        66   (3.3%)          17   (0.8%)     YARP::Serialize::Loader#load_optional_location
        45   (2.2%)          13   (0.6%)     YARP::Serialize::Loader#load_string
        11   (0.5%)          11   (0.5%)     StringIO#pos
        11   (0.5%)          11   (0.5%)     YARP::CallNode#initialize
         7   (0.3%)           7   (0.3%)     StringIO#pos=
         6   (0.3%)           6   (0.3%)     Integer#<<
      1541  (76.8%)           5   (0.2%)     YARP::Serialize::Loader#load
         5   (0.2%)           5   (0.2%)     YARP::DefNode#initialize
      1861  (92.8%)           4   (0.2%)     Dir.glob

3.2.1 with YJIT:

$ ruby --yjit -Ilib -ryarp -rstackprof/autorun -rbenchmark -e '10.times { p Benchmark.realtime { Dir.glob("top-100-gems/activesupport-7.0.4.3/**/*.rb") { |file| code = File.read(file); YARP.load code, YARP.dump(code, file) } } }'
0.17736748199968133
0.15191541199965286
0.14858805099993333
...
StackProf results dumped at: /tmp/stackprof20230521-28997-ym472m.dump
$ stackprof /tmp/stackprof20230521-28997-ym472m.dump
==================================
  Mode: wall(1000)
  Samples: 1512 (0.40% miss rate)
  GC: 125 (8.27%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       290  (19.2%)         290  (19.2%)     BasicObject#initialize
       144   (9.5%)         144   (9.5%)     String#unpack1
       128   (8.5%)         128   (8.5%)     StringIO#getbyte
       117   (7.7%)         117   (7.7%)     YARP::Location#initialize
       109   (7.2%)         109   (7.2%)     StringIO#read
       102   (6.7%)         102   (6.7%)     String#[]
        93   (6.2%)          93   (6.2%)     (sweeping)
        78   (5.2%)          78   (5.2%)     YARP::Token#initialize
        68   (4.5%)          68   (4.5%)     String#force_encoding
        35   (2.3%)          35   (2.3%)     Integer#==
        34   (2.2%)          34   (2.2%)     IO.read
        32   (2.1%)          32   (2.1%)     (marking)
      1386  (91.7%)          31   (2.1%)     Integer#times
       112   (7.4%)          23   (1.5%)     YARP::Serialize::Loader#load_integer
        23   (1.5%)          23   (1.5%)     Integer#<<
        19   (1.3%)          19   (1.3%)     StringIO#pos
      1029  (68.1%)          18   (1.2%)     Enumerable#map
        16   (1.1%)          16   (1.1%)     YARP::CallNode#initialize
        13   (0.9%)          13   (0.9%)     YARP::StatementsNode#initialize
        10   (0.7%)          10   (0.7%)     YARP::ArgumentsNode#initialize
        10   (0.7%)          10   (0.7%)     YARP.dump
         9   (0.6%)           9   (0.6%)     YARP::LocalVariableReadNode#initialize
       999  (66.1%)           8   (0.5%)     YARP::Serialize::Loader#load_optional_node
       255  (16.9%)           8   (0.5%)     YARP::Serialize::Loader#load_optional_token
       341  (22.6%)           7   (0.5%)     YARP::Serialize::Loader#load_token
         7   (0.5%)           7   (0.5%)     StringIO#pos=
      1029  (68.1%)           6   (0.4%)     Enumerator#each
      1385  (91.6%)           5   (0.3%)     Dir.glob
      1040  (68.8%)           4   (0.3%)     YARP::Serialize::Loader#load_node
         4   (0.3%)           4   (0.3%)     YARP::ConstantReadNode#initialize

YARP.dump (a method implemented in C) had 273 samples (=~ 273 ms) in the first profile, but only 10 samples in the second one. So with YJIT (and therefore postponed jobs) it seems the time spent in C extension methods is misreported, and probably attributed to something else incorrectly.
The time spent in JITed code might also be misattributed.

For instance, 19.2% for BasicObject#initialize seems unlikely and far too high for an empty method; that's probably in part a misattribution of allocation time, i.e., of Class#new, and possibly a misattribution of YARP.dump (that's just a guess, though).

@casperisfine
Contributor Author

Note that historically, stackprof would just always go through postponed jobs.

It's only since #150 that postponed jobs are skipped if you are on Ruby 3+.

So yeah, enabling YJIT introduces some amount of inaccuracy, but I don't see what we can do about it :/
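For readers less familiar with the trade-off being discussed, here is a rough, self-contained sketch of the two sampling strategies, using CRuby's rb_profile_frames() and rb_postponed_job_register_one() APIs. It is illustrative only (the handler names and buffer sizes are made up); it is not stackprof's actual code.

    #include <signal.h>
    #include <ruby/ruby.h>
    #include <ruby/debug.h>

    #define MAX_FRAMES 2048
    static VALUE frames_buffer[MAX_FRAMES];
    static int   lines_buffer[MAX_FRAMES];

    /* Strategy A (what #150 enabled on Ruby 3.0+): walk Ruby frames directly inside
     * the SIGPROF handler. Accurate attribution, but requires the VM to be in a
     * walkable state at any instruction, which YJIT did not guarantee. */
    static void sample_now(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)info; (void)ctx;
        int num = rb_profile_frames(0, MAX_FRAMES, frames_buffer, lines_buffer);
        (void)num; /* ...record the sample here... */
    }

    /* Strategy B (postponed job, the pre-#150 behaviour and what this PR restores
     * under YJIT): the handler only schedules work; the VM runs it at the next safe
     * point, where frame walking is always allowed. The cost is attribution skew,
     * e.g. time in C functions can be credited to whatever runs at the safe point. */
    static void record_sample(void *data)
    {
        (void)data;
        int num = rb_profile_frames(0, MAX_FRAMES, frames_buffer, lines_buffer);
        (void)num; /* ...record the sample here... */
    }

    static void schedule_sample(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)info; (void)ctx;
        rb_postponed_job_register_one(0, record_sample, NULL);
    }

The attribution skew of strategy B is what the profiles above illustrate: YARP.dump, a C method, nearly disappears from the YJIT profile.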

@ivoanjo
Contributor

ivoanjo commented May 22, 2023

In the JVM, there's a ~unofficial API (AsyncGetCallTrace) that profilers such as async-profiler use to sample from the signal handler, even in the presence of the JIT and other awkward VM states (such as during GC).

That's obviously a non-trivial approach, but making rb_profile_frames (or some successor) always usable from the signal handler would help avoid this sampling bias.

@casperisfine
Contributor Author

When I discussed this with the YJIT team, they said it was unlikely this would ever be supported. I only have limited knowledge here, but I get that it's tricky to do.

@ivoanjo
Contributor

ivoanjo commented May 22, 2023

I wonder if it's mostly a performance concern (e.g. limiting the kinds of optimizations that can be done, or the extra cost of having to push the stack frames), or if there are other concerns as well.

E.g. if the main concern is performance, it occurs to me that it would be interesting to make it an option to "go slower, but be more observable". That way we could enable the flag when we really need more visibility, knowing that Ruby would run slower.

@casperisfine
Contributor Author

IIRC it was both, as it both complicated the code and made it slower. That conversation was a long time ago, though.

But overall I agree that Ruby profiling APIs are less than ideal, and some work upstream would be nice.

@eregon
Contributor

eregon commented May 22, 2023

I think the main question is: why is it unsafe to use rb_profile_frames() in a signal handler with YJIT enabled?
Is the problem that rb_profile_frames() can't walk the stack to find the caller?
I think https://bugs.ruby-lang.org/issues/19541 might be related.

@casperisfine
Contributor Author

why is it unsafe to use rb_profile_frames() in a signal handler with YJIT enabled?

The conversation was too long ago, so I'll defer to @tenderlove here.

@tenderlove
Collaborator

@eregon At the time we did this, YJIT didn't "atomically" write Ruby frames. It would push a frame before filling out the frame's memory, and if rb_profile_frames was ever executed when that happened, it would break.

The redmine ticket you're linking to is about C stack frames; rb_profile_frames only walks Ruby frames. I don't think we want to force YJIT (or any JIT) to write stack frames in a certain order, so it seems like maybe we need to make rb_profile_frames JIT-aware.

Regardless, I don't think this PR has anything to do with the redmine issue.
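To make the ordering hazard concrete, here is a small self-contained C sketch of the two push orderings being discussed. The types and function names are purely illustrative; they are not CRuby's or YJIT's actual data structures or generated code.

    #include <stddef.h>

    typedef struct { const void *iseq; const void *pc; } frame_t;

    typedef struct {
        frame_t frames[128];
        size_t  top;   /* number of live frames; a sampler walks frames[0..top) */
    } frame_stack_t;

    /* Hazardous ordering (what YJIT used to do, per the comment above): the frame
     * count is bumped before the slot is initialized, so a signal handler that
     * samples between the two steps reads a half-built frame. */
    static void push_frame_unsafe(frame_stack_t *st, const void *iseq, const void *pc)
    {
        size_t slot = st->top++;       /* frame becomes visible here...           */
        st->frames[slot].iseq = iseq;  /* ...but its fields are filled in after   */
        st->frames[slot].pc   = pc;
    }

    /* Interpreter-style "push last": fully initialize the slot, then publish it,
     * so a sampler never observes uninitialized memory. */
    static void push_frame_safe(frame_stack_t *st, const void *iseq, const void *pc)
    {
        size_t slot = st->top;
        st->frames[slot].iseq = iseq;
        st->frames[slot].pc   = pc;
        st->top = slot + 1;            /* only now does the frame become visible */
    }

Whether YJIT could simply adopt the second ordering, and at what cost, is what the next comments discuss.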

@eregon
Contributor

eregon commented May 23, 2023

It would push a frame before filling out the frame's memory

Is there any performance advantage to doing so? I'm thinking "push a frame" is basically updating some pointer by writing to memory, so performance-wise it might be identical to do it in a different order. Doing it like the interpreter ("push" last) would then probably let rb_profile_frames() work as-is. And in fact the JIT can easily choose the order of writes since it emits the assembly, unlike the interpreter.

But maybe it's about filling the frame lazily and only when needed, e.g., just before a side exit?

The redmine ticket you're linking to is with regard to C stack frames, rb_profile_frames only walks Ruby frames

Ah, my bad, thanks for clarifying.

@ianks
Contributor

ianks commented Jun 16, 2023

Thanks for the information here, y’all. My question is: can we currently trust rb_profile_frames when YJIT is enabled?


Successfully merging this pull request may close these issues.

Incompatible with YJIT 3.2.0