
Parser::Lexer#advance method is too big for JIT compilers #871

Open
eregon opened this issue Oct 13, 2022 · 37 comments

@eregon
Contributor

eregon commented Oct 13, 2022

I have been investigating why parser is so slow on TruffleRuby.
I had a suspicion the generated method is too big, and that's now clearly the case from this analysis: oracle/truffleruby#2290 (comment)
This has been found before; notably I found #524 from a quick search, as well as #248 (comment) and #248 (comment).

To summarize, the Parser::Lexer#advance method is currently 13525 lines long, which is humongous.
I would think most if not all Ruby JITs cannot compile such a big method (or if they do, they compile it in a very suboptimal manner).
So that method has to run in the interpreter, which is typically slower the more advanced the JIT is, due to extra profiling and other overhead.
Hence parser is slow on anything but CRuby: on MJIT it hangs, and on YJIT it's only a bit faster, around 11% (YJIT seems to compile a few blocks of advance, but not all of it). I used ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 10.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'.
So on TruffleRuby we can see it tried to compile the method, saw it was way too big and bailed out, and then always executed advance in the interpreter.

It's easy to reproduce, for example in the parser repo:

$ truffleruby --engine.TraceCompilation -Ilib -rparser/current -e 'loop { Parser::CurrentRuby.parse(File.read("test/helper.rb")) }' |& grep 'Lexer#advance'      
[engine] opt failed id=6106  Lexer#advance                                      |Tier 1|Time  778(-24199569+24200347)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 113880. Graph Size: 150001. Limit: 150000.|Timestamp 24200346847485|Src lexer.rb:11303
[engine] opt failed id=6106  Lexer#advance                                      |Tier 1|Time  635(-24200352+24200987)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 113879. Graph Size: 150001. Limit: 150000.|Timestamp 24200987352654|Src lexer.rb:11303

(The Graph Size: 150001 just means it aborted after reaching the limit)

Interestingly, it does compile when lexing very small and simple files:

$ truffleruby --engine.TraceCompilation -Ilib -rparser/current -e 'loop { Parser::CurrentRuby.parse(File.read("test/test_static_environment.rb")) }' |& grep 'Lexer#advance'
[engine] opt done   id=6106  Lexer#advance                                      |Tier 1|Time  3984(1366+2618)ms|AST 26200|Inlined   0Y  66N|IR   9583/ 42570|CodeSize  224281|Addr 0x7f472e900000|Timestamp 24166598032966|Src lexer.rb:11303

So maybe it's not that far above the limit of what TruffleRuby/Graal can compile.

Of course, the lexer.rb file is generated by Ragel from lexer.rl.
One idea would be to use different Ragel flags to produce smaller method(s).
I tried all the code-style options from --help with ragel 6.10 to see if we can make the method smaller; here are my notes:

ragel --version
Ragel State Machine Compiler version 6.10 March 2017
Copyright (c) 2001-2009 by Adrian Thurston

lexer.rl:
2618 lines

F1: (currently used)
25014 lines
advance: 11302-24827 = 13525 lines
with -L: 11298-23926 = 12628 lines
works

F0:
18145 lines
advance: 11486-17958 = 6472 lines
with -L: 11482-17576 = 6094 lines
fails with undefined local variable or method `_lex_actions' for #<Parser...> (NameError) on CRuby

T0:
10482 lines
advance: 3786-10295 = 6509 lines
with -L: 3782- 9913 = 6131 lines
fails with undefined method `_lex_key_spans' for Parser::Lexer:Class (NoMethodError) on CRuby

T1:
17356 lines
advance: 3602-17169 = 13567 lines
with -L: 3598-16268 = 12670 lines
fails with undefined method `_lex_key_spans' for Parser::Lexer:Class (NoMethodError) on CRuby

G0 --rbx:
43465 lines
advance: 820-43278 = 42458 lines
with -L: 816-42900 = 42084 lines

G1, G2, P: Invalid code style

So the smallest is F0 (and T0), but that's still 6094 lines, which still sounds like a lot for a single method.
But maybe it helps enough for JITs.

Some points:

  1. For some reason only F1 works; F0/T0/T1 fail as noted above.
    Does anyone know why? Maybe it's a bug in Ragel? EDIT: I've found It's time to sizzle #248 (comment)

  2. There is a BlockNode in Truffle which enables automatically splitting huge methods into smaller ones. It is not yet supported on TruffleRuby but would be worth a try. Whether it can split Parser::Lexer#advance into smaller chunks depends on the structure of the code (e.g., it wouldn't work with a huge case/when spanning most of the method, which looks like it is the case here :/). And it would still be compiled less optimally than if it were reasonably-sized methods in the first place.

  3. Have there been other experiments in this area to reduce the size of Parser::Lexer#advance?

  4. Extracting larger semantic actions into methods/lambdas would definitely help. (It's time to sizzle #248 (comment))

  5. Maybe generating C or Java code would be a better fit for this kind of stuff (although Java also has a size limit for compilation, and C compilers can have pretty bad register allocation with huge functions)?

  6. I guess one possibility is of course to wait for the new Ruby parser by @kddnewton, as that will have the lexer and parser written in C (and probably not generated with Ragel, so with much more reasonable function sizes). And possibly we could use it for this gem for recent Ruby versions.

@eregon eregon changed the title The generated Parser::Lexer#advance method is too big for JIT compilers Parser::Lexer#advance method is too big for JIT compilers Oct 13, 2022
@eregon
Contributor Author

eregon commented Oct 14, 2022

  1. Extracting larger semantic actions into methods/lambdas would definitely help.

Maybe we could experiment by doing that as postprocessing (e.g. using parser :D) or tweaking Ragel.
Specifically we could wrap every

when 1
  some
  logic
when 2

in a lambda like

when 1
  -> {
    some
    logic
  }.call
when 2

And possibly only do so when a semantic action is above some threshold of code size.
It's likely not as efficient as manually grouping bigger semantic actions into methods, as calling a lambda passes the frame etc., but OTOH it's easier because we can still access all local variables.
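
For illustration, such a post-processing pass could look roughly like this (purely a sketch: the regexp, the threshold and the demo input are made-up heuristics for Ragel's Ruby output, not actual parser tooling):

THRESHOLD = 20 # only wrap actions longer than this many lines (arbitrary)

# Wrap the body of each sufficiently large `when` branch in `-> { ... }.call`,
# so JITs can compile those bodies as separate units.
def wrap_large_when_bodies(source)
  lines = source.lines
  out   = []
  i     = 0
  while i < lines.size
    line = lines[i]
    out << line
    i += 1
    next unless line =~ /^(\s*)when \d+( then)?\s*$/

    indent = Regexp.last_match(1)
    body   = []
    # collect the branch body: everything until the next clause at the same indent
    while i < lines.size && lines[i] !~ /^#{indent}(when |else\b|end\b)/
      body << lines[i]
      i += 1
    end

    if body.size > THRESHOLD
      # NOTE: real actions using next/break, or introducing locals needed later,
      # would need special handling; this sketch ignores that.
      out << "#{indent}  -> {\n"
      out.concat(body)
      out << "#{indent}  }.call\n"
    else
      out.concat(body)
    end
  end
  out.join
end

demo = ["case state\n", "when 1\n"] +
       ["  x += 1\n"] * 25 +
       ["when 2\n", "  x -= 1\n", "end\n"]
puts wrap_large_when_bodies(demo.join)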

@iliabylich
Collaborator

For some reason only F1 works; F0/T0/T1 fail as noted above.
Does anyone know why? Maybe it's a bug in Ragel? EDIT: I've found #248 (comment)

I don't know; last time I checked (~2 years ago) it worked, but I had a similar number of LOCs.

Have there been other experiments in this area to reduce the size of Parser::Lexer#advance?

No.

Extracting larger semantic actions into methods/lambdas would definitely help.

True, but it will also make it slightly slower on MRI, right?

Maybe generating C or Java code would be a better fit for this kind of stuff (although Java also has a size limit for compilation, and C compilers can have pretty bad register allocation with huge functions)?

Not sure. A long time ago I made this - it's a port of the lexer to C, to make Opal faster (mostly based on the PR you mentioned). I last updated it ~3 years ago, so basically it's no longer maintained. Is it possible to check how it behaves on TruffleRuby?

On C and register allocation optimisation - I think if two locals do not co-exist at the same time, the compiler can assign both of them to the same register. This case/when looks exactly like that scenario to me, so I think it can be optimised.

I guess one possibility is of course to wait for the new Ruby parser by @kddnewton, as that will have the lexer and parser written in C (and probably not generated with Ragel, so with much more reasonable function sizes). And possibly we could use it for this gem for recent Ruby versions.

IIRC the plan is to release it in a year or two (cc @kddnewton) so I don't know if it's ok for you.

@eregon
Contributor Author

eregon commented Oct 14, 2022

True, but it will also make it slightly slower on MRI, right?

It could; we'd have to measure it to know. Probably not much if it's extracted methods.

We could also potentially generate two lexer files; that would make even more sense with the automated lambda approach in #871 (comment). So we'd use the current output on CRuby but the "lambda-ified" version for TruffleRuby, JRuby, etc.
If we use another Ragel mode, I also think two generated lexer files is better: the current setup is tuned for CRuby but unfortunately does not work well on implementations with an optimizing JIT.
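
For illustration, the two-file idea could be wired up with an engine check at require time (a sketch only; the lexer-jit.rb file name and the RUBY_ENGINE test are assumptions, not the gem's actual layout):

# Hypothetical loader: pick the generated lexer variant suited to the
# running Ruby implementation.
if RUBY_ENGINE == 'ruby'
  # CRuby: keep the current, interpreter-friendly output.
  require 'parser/lexer'
else
  # Implementations with an optimizing JIT: load the alternative output
  # generated with JIT-friendlier settings.
  require 'parser/lexer-jit'
end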

Is it possible to check how it behaves on TruffleRuby?

Ah yes I remembered something like that but couldn't find it anymore. I'll try it.

On C and register allocation optimisation

Right, that's correct. It's probably much harder to analyze if it's gotos everywhere, but ultimately it depends on the maximum number of live variables at any point in the function. I think C compilers also have trouble with very large functions (e.g. I could imagine some optimizations are skipped because they are too slow to run on such functions), although at least they are a bit less time-limited than JITs in how long compilation can take.

@kddnewton

The parser side of things in the new project is pretty far off, but the lexer side of things is actually quite a bit closer. Maybe it's worth investigating using the shared library just for the lexing bit?

@iliabylich
Collaborator

No, I think this won't work. In MRI the parser tells the lexer what to do in multiple places, i.e. you can't tokenize without performing reduce actions in the parser 😄. What that means is that the state of the parser is tightly coupled with the state of the lexer, so any other lexer (with a different internal state) won't match the parser (and the other way around), and only this (or a very similar) lexer can work with this parser. Yeah, this is one of the reasons why parse.y should be replaced with something simpler.

@janbiedermann

janbiedermann commented Oct 30, 2022

This issue likewise applies to Opal when the compiler is run in Node: V8 cannot optimize the method because it is too big.

@eregon
Contributor Author

eregon commented Nov 23, 2022

Right, that's correct. It's probably much harder to analyze if it's gotos everywhere, but ultimately it depends on the maximum number of live variables at any point in the function. I think C compilers also have trouble with very large functions (e.g. I could imagine some optimizations are skipped because they are too slow to run on such functions), although at least they are a bit less time-limited than JITs in how long compilation can take.

On that subject, https://sillycross.github.io/2022/11/22/2022-11-22/ makes a good case that C compilers are not able to optimize big switch/case statements well.

Regarding c_lexer, I didn't try it yet, but I suspect that if we execute it on GraalVM LLVM like other C extensions, we'll also just have a C function that is too big to compile (lexer_advance is 8355 lines long). We actually had the same problem with Ripper, which is a huge C function as well: oracle/truffleruby#2767.

I think these huge lexing/parsing functions/methods are a problem regardless of the language. When written in C and compiled AOT, at least it compiles, but it's probably suboptimal (still much better than interpreted, of course).

@iliabylich
Collaborator

I think these huge lexing/parsing functions/methods are a problem regardless of the language

lib-ruby-parser also uses Bison and its main function consists of 8513 LOC and 775 switch branches, but it's very performant (~600K LOC/s), so I'm pretty sure it can potentially be optimised. Could you test it please? Or is it incompatible with TruffleRuby in terms of C APIs?

Anyway, I don't see any other solution, so I guess it's worth a try.

@eregon
Contributor Author

eregon commented Nov 23, 2022

BTW, I noticed in Ripper, the lexing function seems to be parser_yylex and that's only 707 lines long:
https://github.com/oracle/truffleruby/blob/c9c717bf822deb4d5592eeedc05d1842bf72b88d/src/main/c/ripper/ripper.c#L16309-L17016
That's about 10x smaller than Ragel-generated lexers, which is quite an impressive difference.

I tried c_lexer on TruffleRuby (with these fixes) and indeed it's too big:

$ cd parser
$ truffleruby --engine.TraceCompilation -rparser -rc_lexer -e 'loop { Parser::Ruby26WithCLexer.parse(File.read("test/helper.rb")) }' |& grep 'opt failed'
[engine] opt failed id=7661  lexer_advance<OSR@4>                               |Tier 2|Time  599(-7799877+7800475)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 122151. Graph Size: 150001. Limit: 150000.|Timestamp 7800475343459|Src n/a
[engine] opt failed id=7661  lexer_advance<OSR@4>                               |Tier 2|Time  848(-7800509+7801357)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 122151. Graph Size: 150001. Limit: 150000.|Timestamp 7801356791513|Src n/a
[engine] opt failed id=5939  lexer_advance                                      |Tier 1|Time 15409(-7792107+7807517)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 217492. Graph Size: 150001. Limit: 150000.|Timestamp 7807516580461|Src lexer.rl:342
[engine] opt failed id=5939  lexer_advance                                      |Tier 1|Time 11109(-7807533+7818643)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 200583. Graph Size: 150002. Limit: 150000.|Timestamp 7818642852920|Src lexer.rl:342

I'll try lib-ruby-parser. The gem is only distributed as precompiled and that can't be used on TruffleRuby, so I'll try to build it from the repo.

I think a potential solution here is to extract actions into lambdas like in #871 (comment), or do the same in TruffleRuby's parser/translator based on some heuristic. I haven't attempted that yet though, and it seems easier to prototype it with post-processing than changing the TruffleRuby translator for that.

@iliabylich
Collaborator

iliabylich commented Nov 23, 2022

I'll try lib-ruby-parser

I don't think it's an easy task. The library is written in Rust, C bindings are a wrapper around Rust lib (and it's pre-compiled for major platforms in Releases section), Ruby bindings are a wrapper around C bindings, so... you'll have to re-compile the whole chain of libraries into TruffleRuby format.

I think a potential solution here is to extract actions into lambdas like in #871 (comment), or do the same in TruffleRuby's parser/translator based on some heuristic. I haven't attempted that yet though, and it seems easier to prototype it with post-processing than changing the TruffleRuby translator for that.

Many actions in the lexer depend on local variables that are part of this giant switch, so this is possible only for a small subset of trivial actions. IIRC any Ragel DSL construct (like fnext/fgoto) relies on the local variable p (a pointer to the current byte in the buffer), and these lambdas would have to return the new p (and potentially the values of other primitive local variables).
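
For illustration, the "pass p in, get the new p back" shape looks roughly like this (a toy with made-up names such as MiniLexer and on_escape, not the actual parser code):

class MiniLexer
  def initialize(source)
    @source  = source
    @escapes = 0
  end

  # An "extracted action": it receives the Ragel-style position p and returns
  # the updated position, so the caller can do `p = on_escape(p)`.
  def on_escape(p)
    @escapes += 1 # instance state is freely available in the extracted method
    p + 2         # skip the backslash and the escaped character
  end

  def scan
    p = 0
    while p < @source.length
      if @source[p] == '\\'
        p = on_escape(p) # the action hands the new position back
      else
        p += 1
      end
    end
    @escapes
  end
end

puts MiniLexer.new('a\\nb\\tc').scan # => 2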

@eregon
Contributor Author

eregon commented Nov 23, 2022

I don't think it's an easy task. The library is written in Rust, C bindings are a wrapper around Rust lib (and it's pre-compiled for major platforms in Releases section), Ruby bindings are a wrapper around C bindings, so... you'll have to re-compile the whole chain of libraries into TruffleRuby format.

I gave it a quick try now. The main hurdle is the precompiled .a (on Linux). I actually want to run the lib-ruby-parser Rust code natively and not on Sulong (so Sulong doesn't run into these too-big functions, and the Rust code is AOT-compiled). But for that I'd need a separate .so instead of a .a (otherwise we could have globals accessed both natively and by Sulong, which is complicated) that I can link into the C extension .so, so that only the code using the Ruby C API is compiled and run by Sulong and the rest stays native.
I've tried this:

cd c-bindings
ar -x libruby_parser_c-x86_64-unknown-linux-gnu.a
gcc -shared -o libruby_parser_c-x86_64-unknown-linux-gnu.so *.o

But it gives me Polyglot::ForeignException: Cannot convert 0x7f54229fef08 (24 align 8) to native pointer. on TruffleRuby.
It does seem to work fine (for make test) on CRuby though, so it might be a Sulong issue.

Many actions in the lexer depend on local variables that are part of this giant switch, so this is possible only for a small subset of trivial actions. IIRC any Ragel DSL construct (like fnext/fgoto) relies on the local variable p (a pointer to the current byte in the buffer), and these lambdas would have to return the new p (and potentially the values of other primitive local variables).

I don't think that's an issue because lambdas can simply read/write outer variables, as if the lambda was not there.
It could be an issue if an action declares new variables and those must be visible outside, but that sounds very unlikely (and could also be handled by declaring those vars outside the lambda).
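
A tiny self-contained illustration of that point (plain Ruby, nothing parser-specific):

def advance_like
  p  = 0
  ts = nil

  # the wrapped "action": it reads and writes the enclosing locals directly,
  # nothing has to be passed in or returned
  -> {
    ts = p
    p += 3
  }.call

  [p, ts]
end

p advance_like # => [3, 0]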

@iliabylich
Collaborator

I don't think that's an issue because lambdas can simply read/write outer variables, as if the lambda was not there.

Ah, as I understand it, you want to define these lambdas inside the advance method, right? Then of course it should work. But it will slow down the MRI version (because the lambdas are re-created for every token), so I don't think it can be added to parser.

Do you know how much needs to be extracted to make this method suitable for the JIT? There are several big actions that are emitted 16 times 😄

Also, I've just realised that Ragel supports embedding one FSM into another; maybe then it would generate code for them separately. I actually can't google it right now, but I think I've seen it somewhere.

@headius
Contributor

headius commented Dec 13, 2022

This also impacts JRuby, although to a lesser extent.

MRI

$ ruby --yjit -v -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 15.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
ruby 3.1.3p185 (2022-11-24 revision 1a6b16756e) [arm64-darwin22]
0.2480760000180453
0.23679300001822412
0.2410890000173822
0.23989199998322874
0.24011799995787442
0.23913500003982335
0.23782300006132573
0.23797499993816018
0.23840399994514883
0.23862499999813735
0.23821500001940876
0.23765699996147305
0.23704100004397333
0.23872399993706495
0.23942200001329184

JRuby

$ ruby -Xcompile.invokedynamic -v -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 15.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
jruby 9.4.1.0-SNAPSHOT (3.1.0) 2022-12-10 99b930a7dc OpenJDK 64-Bit Server VM 19.0.1+10 on 19.0.1+10 +indy +jit [arm64-darwin]
warning: parser/current is loading parser/ruby31, which recognizes 3.1.3-compliant syntax, but you are running 3.1.0.
Please see https://github.com/whitequark/parser#compatibility-with-ruby-mri.
3.0391266249935143
1.6639934169943444
1.2258779580006376
1.0422158330038656
0.9583483749884181
0.5929004999925382
0.6371372499852441
0.7089660830097273
0.6819529169879388
0.5888017910183407
0.6666723749949597
0.5366863750095945
0.6753377499990165
0.5764439160120673
0.533481499995105

Indeed the advance method does not compile:

$ ruby -Xjit.logging -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 10.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' 2>&1 | grep advance
2022-12-13T10:19:53.398+07:00 [Ruby-0-JIT-1] INFO JITCompiler : could not compile method; passes run: []: Parser::Lexer advance at /Users/headius/work/parser/lib/parser/lexer.rb:11302 because of: "org.jruby.compiler.NotCompilableException: Could not compile org.jruby.internal.runtime.methods.MixedModeIRMethod@6128203e INSTANCE_METHOD advance[/Users/headius/work/parser/lib/parser/lexer.rb:11302]<startup> signature(pre=0,opt=0,post=0,rest=NONE,kwargs=0,kwreq=0,kwrest=-1); instruction count 25005 exceeds threshold of 1000"
2022-12-13T10:19:53.991+07:00 [Ruby-0-JIT-1] INFO JITCompiler : block done jitting: Parser::Lexer advance_CLOSURE_25 at /Users/headius/work/parser/lib/parser/lexer.rb:24610
2022-12-13T10:19:54.259+07:00 [Ruby-0-JIT-1] INFO JITCompiler : block done jitting: Parser::Lexer advance_CLOSURE_26 at /Users/headius/work/parser/lib/parser/lexer.rb:24654
2022-12-13T10:19:56.399+07:00 [Ruby-0-JIT-1] INFO JITCompiler : block done jitting: Parser::Lexer advance_CLOSURE_19 at /Users/headius/work/parser/lib/parser/lexer.rb:23460
2022-12-13T10:19:57.643+07:00 [Ruby-0-JIT-1] INFO JITCompiler : block done jitting: Parser::Lexer advance_CLOSURE_20 at /Users/headius/work/parser/lib/parser/lexer.rb:23487

This "instruction count 25005 exceeds threshold of 1000" is an artificial limit in JRuby, but even if I bump it up to 26000 is still fails because then the JVM bytecode produced is too large. We set the limit this low because we know the JVM itself has limits on how large a method it's willing to JIT to native code.

Years ago I experimented with pulling out when blocks as methods, but many of these blocks have multiple outputs and can't be outlined. Those could be lambdas, but that hamstrings performance by creating too many new lambda objects. In any case, even with outlined methods and lambdas, the method is still extremely large and hard to JIT.

I also experimented with using a tree of case/when, simply chunking the method into pieces. That can probably make it small enough, but it requires post-processing the Ragel output.
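
For illustration, the "tree of case/when" chunking amounts to something like this (a hand-written sketch, not actual Ragel output; the state ranges and method names are made up):

# Instead of one flat case with hundreds of branches in a single method,
# the states are grouped into buckets, each handled by its own small method,
# so each piece can be a separate JIT compilation unit.
def dispatch(state)
  case state / 100
  when 0 then dispatch_0_99(state)
  when 1 then dispatch_100_199(state)
  else        dispatch_rest(state)
  end
end

def dispatch_0_99(state)
  case state
  when 0 then :action_0
  when 1 then :action_1
  # ... up to 99
  else :unknown
  end
end

def dispatch_100_199(state)
  case state
  when 100 then :action_100
  # ... up to 199
  else :unknown
  end
end

def dispatch_rest(state)
  :unknown
end

p dispatch(1)   # => :action_1
p dispatch(100) # => :action_100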

This issue may be more relevant today with CRuby starting to gain JIT capabilities, since this giant method may also confound YJIT.

@eregon
Contributor Author

eregon commented Dec 13, 2022

Ah, as I understand you want to define these lambdas in the advance method, right? Then of course it should work.

Right. Even though these lambdas would be part of the advance method, they are JITed separately and don't really count towards the size of advance, except for the bit of logic that creates each lambda.

But it will slow down MRI version (because lambdas are re-created for every token), so I don't think it can be added to parser.

Yeah, but since that could be an automated process, we could keep the same lexer as today for CRuby/non-JIT implementations and use the lambda-ized (or similar) lexer for the rest, so two files.


Another way to make the lexer method even smaller would be to turn it into some kind of table dispatch, i.e. an Array or Hash mapping Integer state -> lambda. That means we would need to declare all those lambdas in some method and declare the shared variables there. That's clearly more automated rewriting though.
Ruby's case/when actually checks values one by one in order, unless it is optimized by the Ruby implementation, so a dispatch table might even be faster on some Rubies. I think e.g. CRuby uses a Hash for a case/when where all the when clauses are integer literals, which is probably slower than an Array.
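
For illustration, such a dispatch table could look roughly like this (a sketch only; in the real lexer each lambda would correspond to a Ragel state/action and would share advance's locals, as discussed above):

def run_machine(input)
  pos    = 0
  output = []

  # one entry per state; each lambda reads/writes the locals above and
  # returns the next state (-1 to stop)
  actions = [
    -> { output << input[pos]; pos += 1; 1 },     # state 0
    -> { pos += 1; pos < input.length ? 0 : -1 }, # state 1
  ]

  state = 0
  state = actions[state].call while state >= 0
  output
end

p run_machine("abcd") # => ["a", "c"]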

@iliabylich
Collaborator

I've tried extracting the extend_string_escaped action into a separate method (it is the most repetitive and probably the biggest action) and the lexer size went down from 838731 to 703723 bytes. @eregon, your code from #871 (comment) no longer prints any warnings for me locally, so I suppose TruffleRuby will be able to compile it with this tiny change. The change is definitely acceptable, since this action doesn't depend on Ragel local variables. However, it's still pretty slow (it slowly goes from 9s to ~1s to parse test/test_lexer.rb in a loop, approx 4 times slower than MRI). If you think it's enough I could commit this change.

However, it doesn't make any difference on JRuby.

I've also tried extracting part of the lexer into a separate file using the include FsmName "file.rl" syntax, but it doesn't change anything in the final output. I think it simply evals the given file.

@eregon
Contributor Author

eregon commented Dec 13, 2022

@iliabylich Nice, could you share a branch or PR with that change? I'd like to take a closer look at whether it compiles and in which tier, and also try it on bigger files.

@eregon
Contributor Author

eregon commented Dec 15, 2022

@iliabylich I've tried it (with truffleruby-dev and truffleruby-22.3.0) and for me it still fails, even though I do see the size reduction.
Which command did you use?
I used the first command in #871 (comment), i.e.,

truffleruby --engine.TraceCompilation -Ilib -rparser/current -e 'loop { Parser::CurrentRuby.parse(File.read("test/helper.rb")) }' |& grep 'Lexer#advance'

If I increase the limit from 150000 (the default) to 300000 it does compile, but the compilation takes ~20 seconds, which is a big part of why such limits exist:

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=300000 -Ilib -rparser/current -e 'loop { Parser::CurrentRuby.parse(File.read("test/helper.rb")) }' |& grep 'Lexer#advance'
[engine] opt done   id=5366  Lexer#advance                                      |Tier 1|Time 20953(5344+15610)ms|AST 22810|Inlined   0Y 175N|IR  18875/105350|CodeSize  531622|Addr 0x7fddcd67d000|Timestamp 4369494731000|Src lexer.rb:11303
[engine] opt done   id=5366  Lexer#advance                                      |Tier 2|Time 16747(4881+11866)ms|AST 22810|Inlined   0Y 175N|IR  15972/ 95708|CodeSize  403314|Addr 0x7fddcd4df000|Timestamp 4386241461029|Src lexer.rb:11303

However, it's still pretty slow (it slowly goes from 9s to ~1s to parse test/test_lexer.rb in a loop, approx 4 times slower than MRI).

Which command line/script are you using? I guess you're measuring 15 parses at once or so like Charles above?
Here is what I see (on your branch):

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=300000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
3.8716205830005492
3.0478081459996247
2.66262898199966
2.601209966999704
...
2.5871676730002946
2.313523000000714
2.363621733999935
[engine] opt done   id=5463  Lexer#advance                                      |Tier 1|Time 56731(8068+48664)ms|AST 24539|Inlined   0Y 234N|IR  39070/200190|CodeSize  889375|Addr 0x7f2e01325000|Timestamp 5331300135214|Src lexer.rb:11303
0.7702079600003344
0.40126357000008284
0.4373156090005068
...
0.450402135999866
0.4029490379998606
0.34953478000079485
[engine] opt done   id=5463  Lexer#advance                                      |Tier 2|Time 22288(4782+17507)ms|AST 24539|Inlined   0Y 234N|IR  35194/162969|CodeSize  582741|Addr 0x7f2dcaf87000|Timestamp 5353588997678|Src lexer.rb:11303
0.4355955619994347
0.3757834369998818

So there is a clear speedup once it gets compiled in Tier 1 from ~2.5s to ~0.4s and not much difference in Tier 2 (because it doesn't inline anything anyway since it's so big).
Locally I see for CRuby 3.1 (no YJIT):

$ ruby -v -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 15.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'       
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
0.19772192899927177
0.19467738300045312
0.18961978400056978

In any case, your PR is helpful, I suggest merging it.
I'll try with F0 too, to see if it fits in the default JIT limit.

@iliabylich
Collaborator

I ran this (on truffleruby 21.0.0):

$ ruby --engine.TraceCompilation -Ilib -rparser/current -e 'loop { Parser::CurrentRuby.parse(File.read("test/helper.rb")) }' |& grep 'Lexer#advance'
[engine] opt done     Parser::Lexer#advance                                       |AST 23292|Time 10903(2256+8647)ms|Tier 2|Inlined   0Y 186N|IR 43150/101967|CodeSize 405875|Addr 0x12048c000|Src lexer.rb:11303

@eregon
Contributor Author

eregon commented Dec 15, 2022

With -F0, diff (on top of your PR):

diff --git a/Rakefile b/Rakefile
index 74af484..c9e0391 100644
--- a/Rakefile
+++ b/Rakefile
@@ -152,7 +152,7 @@ task :changelog do
 end
 
 rule '.rb' => '.rl' do |t|
-  sh "ragel -F1 -R #{t.source} -o #{t.name}"
+  sh "ragel -F0 -R #{t.source} -o #{t.name}"
 end
 
 rule '.rb' => '.y' do |t|
diff --git a/lib/parser/lexer.rl b/lib/parser/lexer.rl
index cea2fc6..3ac32e8 100644
--- a/lib/parser/lexer.rl
+++ b/lib/parser/lexer.rl
@@ -272,6 +272,7 @@ class Parser::Lexer
     _lex_to_state_actions   = klass.send :_lex_to_state_actions
     _lex_from_state_actions = klass.send :_lex_from_state_actions
     _lex_eof_trans          = klass.send :_lex_eof_trans
+    _lex_actions            = klass.send :_lex_actions
 
     pe = @source_pts.size + 2
     p, eof = @p, pe

It still doesn't fit in the default limit, but the size is already much smaller.
For example, comparing Tier2:

Lexer#advance F1|Tier 2|Time 22288(4782+17507)ms|AST 24539|Inlined   0Y 234N|IR  35194/162969|CodeSize  582741
Lexer#advance F0|Tier 2|Time 26416(7389+19027)ms|AST 15784|Inlined   0Y 192N|IR  29214/123678|CodeSize  458046

AST -36% Calls -18% IR(after) -24% CodeSize -21%

Also it fits in --engine.MaximumGraalGraphSize=200000 unlike F1 which needs --engine.MaximumGraalGraphSize=250000 for the test/test_lexer.rb example.

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=300000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
...
2.447555299000669
2.3657063669998024
2.282051142000455
2.424717626000529
[engine] opt done   id=5451  Lexer#advance                                      |Tier 1|Time 42349(6489+35861)ms|AST 15784|Inlined   0Y 192N|IR  32569/153201|CodeSize  693759|Addr 0x7f4ba358d000|Timestamp 6998886801123|Src lexer.rb:11487
1.6588009919996693
0.41041754100024264
0.5635040609995485
0.3805567110002812
...
0.4246950149999975
0.29191841700048826
0.2717088529998364
0.4588416619999407
[engine] opt done   id=5451  Lexer#advance                                      |Tier 2|Time 26416(7389+19027)ms|AST 15784|Inlined   0Y 192N|IR  29214/123678|CodeSize  458046|Addr 0x7f4ba1dc2000|Timestamp 7025302937162|Src lexer.rb:11487
0.5132588390006276
0.295992022000064
0.57665306600029
0.37548583899933874
0.2837178939998921

I think that's a big enough improvement to actually generate two lib/parser/lexer.rb files, and use F1 for CRuby and F0 otherwise.

CRuby 3.1 seems slightly slower with F0:

  • Without YJIT: 0.182 (F1) vs 0.209 (F0)
  • With YJIT: 0.172 (F1) vs 0.183 (F0)

I'll try -T0 too.

@eregon
Contributor Author

eregon commented Dec 15, 2022

-T0 seems slightly bigger than -F0 and is also significantly slower on TruffleRuby and CRuby.
It also needs a bigger limit.

F0|Tier 1|Time 42349(6489+35861)ms|AST 15784|Inlined   0Y 192N|IR  32569/153201|CodeSize  693759
T0|Tier 1|Time 45334(6752+38583)ms|AST 15868|Inlined   0Y 194N|IR  32971/156085|CodeSize  725279

F0|Tier 2|Time 26416(7389+19027)ms|AST 15784|Inlined   0Y 192N|IR  29214/123678|CodeSize  458046
T0|Tier 2|Time 19638(6044+13594)ms|AST 15868|Inlined   0Y 194N|IR  29584/126201|CodeSize  468297

It seems to compile a bit faster but unsure if significant.

2.6976603209986934
2.7031824049990973
2.666301981000288
[engine] opt done   id=5459  Lexer#advance                                      |Tier 1|Time 45334(6752+38583)ms|AST 15868|Inlined   0Y 194N|IR  32971/156085|CodeSize  725279|Addr 0x7f77fe64d000|Timestamp 8345386888532|Src lexer.rb:3787
1.368835253000725
0.6057424179998634
0.7201194070003112
0.46935323600155243
0.5350453829996695
0.4838732100015477
...
0.41379461600081413
0.40234205599881534
0.4729159090002213
[engine] opt done   id=5459  Lexer#advance                                      |Tier 2|Time 19638(6044+13594)ms|AST 15868|Inlined   0Y 194N|IR  29584/126201|CodeSize  468297|Addr 0x7f77fe48c000|Timestamp 8365025029502|Src lexer.rb:3787
0.46432749400082685
0.6536771670016606
0.37602350000088336
0.49110449100044207

CRuby:

0.3339439399987896
0.3242585309999413
0.328331344999242

Diff for the record:

diff --git a/Rakefile b/Rakefile
index 74af484..9455dce 100644
--- a/Rakefile
+++ b/Rakefile
@@ -152,7 +152,7 @@ task :changelog do
 end
 
 rule '.rb' => '.rl' do |t|
-  sh "ragel -F1 -R #{t.source} -o #{t.name}"
+  sh "ragel -T0 -R #{t.source} -o #{t.name}"
 end
 
 rule '.rb' => '.y' do |t|
diff --git a/lib/parser/lexer.rl b/lib/parser/lexer.rl
index cea2fc6..d0d8db6 100644
--- a/lib/parser/lexer.rl
+++ b/lib/parser/lexer.rl
@@ -264,7 +264,7 @@ class Parser::Lexer
     # Ugly, but dependent on Ragel output. Consider refactoring it somehow.
     klass = self.class
     _lex_trans_keys         = klass.send :_lex_trans_keys
-    _lex_key_spans          = klass.send :_lex_key_spans
+    # _lex_key_spans          = klass.send :_lex_key_spans
     _lex_index_offsets      = klass.send :_lex_index_offsets
     _lex_indicies           = klass.send :_lex_indicies
     _lex_trans_targs        = klass.send :_lex_trans_targs
@@ -272,6 +272,10 @@ class Parser::Lexer
     _lex_to_state_actions   = klass.send :_lex_to_state_actions
     _lex_from_state_actions = klass.send :_lex_from_state_actions
     _lex_eof_trans          = klass.send :_lex_eof_trans
+    _lex_actions            = klass.send :_lex_actions
+    _lex_key_offsets        = klass.send :_lex_key_offsets
+    _lex_single_lengths     = klass.send :_lex_single_lengths
+    _lex_range_lengths      = klass.send :_lex_range_lengths
 
     pe = @source_pts.size + 2
     p, eof = @p, pe

So -F0 seems the most helpful, I'll try to make a PR for that.

eregon added a commit to eregon/parser that referenced this issue Dec 15, 2022
* F0 generates a much smaller Parser::Lexer#advance method which is much better for JITs.
* See whitequark#871
@eregon
Contributor Author

eregon commented Dec 15, 2022

Running on TruffleRuby JVM CE dev, with F0 and --engine.MaximumGraalGraphSize=200000 (which we still need to address), it's almost as fast as CRuby and the performance is more stable:

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=200000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
...
2.9317425930003083
3.039742467999531
[engine] opt done   id=5462  Lexer#advance                                      |Tier 1|Time 10912(2513+8399)ms|AST 15803|Inlined   0Y 192N|IR  32429/ 98063|CodeSize  476434|Addr 0x7fbb420c2b10|Timestamp 10859957891821|Src lexer-F0.rb:11487
1.1293470479995449
0.24392861399974208
0.21304583600067417
...
0.3103002779989765
0.3397589899996092
0.21339046700086328
[engine] opt done   id=5462  Lexer#advance                                      |Tier 2|Time  8073(2326+5747)ms|AST 15803|Inlined   0Y 192N|IR  29074/ 71859|CodeSize  340978|Addr 0x7fbb4267e290|Timestamp 10868030909213|Src lexer-F0.rb:11487
0.23619348099964554
0.19046759299999394
0.21681793099924107
0.22631248399920878

With the same limit and F1 it's 10x slower (e.g. 2.463123072999224).

With a high limit and F1 it's basically the same speed but F0 is much faster to compile:

F1|Tier 1|Time 20294(2702+17591)ms|AST 24539|Inlined   0Y 234N|IR  39084/128638|CodeSize  622866
F0|Tier 1|Time 10912(2513+ 8399)ms|AST 15803|Inlined   0Y 192N|IR  32429/ 98063|CodeSize  476434

F1|Tier 2|Time 13257(2559+10698)ms|AST 24539|Inlined   0Y 234N|IR  35208/ 96530|CodeSize  453682
F0|Tier 2|Time  8073(2326+ 5747)ms|AST 15803|Inlined   0Y 192N|IR  29074/ 71859|CodeSize  340978

@headius
Contributor

headius commented Dec 20, 2022

The F0 patch is slower on JRuby. Did anyone test that before merging?

@headius
Contributor

headius commented Dec 20, 2022

The F0 patch just merged degrades performance on JRuby and does not appear to help TruffleRuby except with experimental options: #897

I'm not sure why this was merged so quickly without more testing. It penalizes JRuby performance by almost 40% and only helps non-experimental TruffleRuby by about 3%.

@eregon
Contributor Author

eregon commented Dec 20, 2022

I'm not sure why this was merged so quickly without more testing.

There was a lot of testing; I already spent almost a day on it. But indeed, I forgot to test on JRuby. See #898.

@eregon
Contributor Author

eregon commented Dec 20, 2022

FWIW, I wonder if Ragel is a good choice for parser's lexer; it seems to generate more code and much bigger methods than other lexers such as Ripper's lexer (707 lines) or JRuby's lexer (256 lines, and it uses a lot of small methods, which is nice).
OTOH the Ragel lexer may be cleaner, depend less on state, or be better organized/more declarative, etc.
I think Ragel can be worth it for target languages that have goto and don't mind giant functions as much, but otherwise it's probably like typical lexer/parser generators, which produce a lot more code (and, well, lexing Ruby is really complex too).
Anyway, I suppose it's not realistic to change that for parser; it would most likely be a big effort.
I think we can look at it from that angle again once YARP is ready.

So let's keep trimming Parser::Lexer#advance.

@headius
Contributor

headius commented Dec 20, 2022

I wonder if Ragel is a good choice for parser's lexer

I wonder if Ragel's Ruby generator is good for anything, to be honest. It suffers from the same problems as the Java generator: giant methods with no gotos, so it is a big ugly state machine inside a case/when or switch.

The same problems plague the JRuby json library, which uses Ragel to generate part of the Java implementation. If it generated bytecode with gotos, it would be far more efficient, but it generates a giant .java method instead.

@eregon
Contributor Author

eregon commented Dec 21, 2022

I remeasured today, with the latest truffleruby-jvm-ce built locally.
There have been a bunch of changes: notably I went from Fedora 35 to 37, TruffleRuby is now built on JDK 17 instead of 11 (by default), TruffleRuby updated some files and gems to Ruby 3.1.3, and of course there are the latest changes in truffleruby & graal.
So a lot changed, and it's actually much faster now, which is quite surprising, but who would complain about "free" performance gains?
(Maybe I'll have time to bisect that another day.)

So the numbers from #871 (comment) look 3-4x faster now.
That's on 645e006, the same changes as then.

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=200000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
truffleruby 23.0.0-dev-0d398ebd, like ruby 3.1.3, GraalVM CE JVM [x86_64-linux]
...
[engine] opt done   id=5892  Lexer#advance                                      |Tier 1|Time 11943(2289+9654)ms|AST 15803|Inlined   0Y 192N|IR  32429/ 97657|CodeSize  449416|Addr 0x7f6603fc1590|Timestamp 6746487901246|Src lexer-F0.rb:11487
[engine] opt done   id=5892  Lexer#advance                                      |Tier 2|Time  7889(2162+5727)ms|AST 15803|Inlined   0Y 192N|IR  29074/ 71453|CodeSize  330810|Addr 0x7f66049de310|Timestamp 6754380753009|Src lexer-F0.rb:11487
0.06743356500010123
0.06744138500016561
0.06617268700028944
0.06570462899981067
0.0666100929993263

$ ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 30.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
0.23728379900057917

So about 3.5x as fast as CRuby, pretty cool!
That's still with the MaximumGraalGraphSize=200000 though.

Trying latest master 357fefa:

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=180000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
...
[engine] opt done   id=5891  Lexer#advance                                      |Tier 1|Time 10263(2051+8212)ms|AST 13751|Inlined   0Y 164N|IR  29655/ 89030|CodeSize  413571|Addr 0x7f2a7c95cc90|Timestamp 6885508689269|Src lexer-F0.rb:11494
[engine] opt done   id=5891  Lexer#advance                                      |Tier 2|Time  6939(1923+5016)ms|AST 13751|Inlined   0Y 164N|IR  26725/ 65054|CodeSize  302179|Addr 0x7f2a7d165910|Timestamp 6892447628841|Src lexer-F0.rb:11494
0.06599668899980315
0.06564158400033193
0.06605729000057181
0.06549458200061054
0.0652992590003123

CRuby 3.1.3:
$ ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 30.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
0.22782546100006584

About the same performance, and now it fits in MaximumGraalGraphSize=180000, progress!
Also the CodeSize is smaller.

Trying iliabylich/extract-actions 5b2135e:

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=210000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
...
[engine] opt done   id=5887  Lexer#advance                                      |Tier 1|Time 11245(2221+9024)ms|AST 14471|Inlined   0Y 109N|IR  30510/ 82868|CodeSize  370118|Addr 0x7fb16510a010|Timestamp 7535494382946|Src lexer-F0.rb:11494
[engine] opt done   id=5887  Lexer#advance                                      |Tier 2|Time  6766(2405+4361)ms|AST 14471|Inlined   0Y 109N|IR  28483/ 56131|CodeSize  259318|Addr 0x7fb165944b10|Timestamp 7542260320716|Src lexer-F0.rb:11494
0.06847213900073257
0.06988603299942042
0.06834740700014663
0.06822627499968803
0.06901649800056475

CRuby 3.1.3:
$ ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 30.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
0.22745593000036024

The performance is about the same, maybe a tiny bit slower, but I'm unsure if that's significant.
It needs MaximumGraalGraphSize=210000 though, which is quite a bit worse.
But interestingly the CodeSize is better, and we have fewer calls (109 instead of 164).
I think the reason we need a higher limit there is that ivar accesses generate a lot of Graal nodes during Partial Evaluation and so hit the limit faster.
We also see more AST nodes (14471 vs 13751 for 357fefa).

So I think using @p is problematic here.
So let's benchmark 8943a5d (p -> @p):

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=230000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
[engine] opt done   id=5891  Lexer#advance                                      |Tier 1|Time 13469(2489+10980)ms|AST 16561|Inlined   0Y 163N|IR  35260/ 97450|CodeSize  437037|Addr 0x7fd51abbf490|Timestamp 8089635899092|Src lexer-F0.rb:11494
[engine] opt done   id=5891  Lexer#advance                                      |Tier 2|Time  7602(2378+5224)ms|AST 16561|Inlined   0Y 163N|IR  32360/ 67299|CodeSize  310743|Addr 0x7fd51b47e410|Timestamp 8097237728126|Src lexer-F0.rb:11494
0.0640924660001474
0.06398455500038835
0.06492629800050054
0.06451737200040952
0.06376239000019268

CRuby 3.1.3:
$ ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 30.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
0.22695019999991928

Performance-wise @p does seem fine.
But it needs MaximumGraalGraphSize=230000 vs 180000 on the previous commit (357fefa), which is not good.
The CodeSize is also worse than on that previous commit.
The AST is much bigger too.
All of this is expected due to the @p.

On the bright side, 5b2135e manages to reduce the graph size by ~20000 (and 180k-20k almost fits in the default 150k!), the AST by 2090 nodes and the CodeSize by 51425 bytes (Tier 2).

So I think there are two conclusions:

  • 8943a5d is problematic, it increases the size of the method to compile too much.
  • 5b2135e is a significant gain in method size and so would be valuable to redo but without @p.

I personally think it looks nice to pass p and get it back: it makes those methods more functional and the changes to p more explicit, while not adding lots of noise.
Of course it's subjective, but given the gains it seems very much worth it to pass p instead of using @p.

Also, following this comment, I will try not having any block inside #advance to see the effect on MaximumGraalGraphSize and performance.
That might actually make local variable accesses much more efficient in terms of the number of nodes emitted during Partial Evaluation (because we then know they are not captured).

Note I won't be at this computer until January, so I won't be able to measure until then.
Have nice end-of-year holidays!

@eregon
Contributor Author

eregon commented Dec 21, 2022

So when moving all blocks outside of Lexer#advance in #902, I see this:

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=180000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
[engine] opt done   id=5889  Lexer#advance                                      |Tier 1|Time  5750(1873+3877)ms|AST 13711|Inlined   0Y 164N|IR  17741/ 53216|CodeSize  288166|Addr 0x7fda17623c90|Timestamp 10044204771385|Src lexer-F0.rb:11494
[engine] opt done   id=5889  Lexer#advance                                      |Tier 2|Time  5526(1989+3537)ms|AST 13711|Inlined   0Y 164N|IR  14806/ 45642|CodeSize  249476|Addr 0x7fda17e5b790|Timestamp 10049730549135|Src lexer-F0.rb:11494
0.06223252800009504
0.06208532599885075
0.06196834399997897
0.06247724199965887
0.06212161700022989

CRuby 3.1.3:
$ ruby -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 30.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }'
0.22506063400032872

That seems faster (0.062 vs 0.065 for 357fefa), which is a good start.
It's also a lot smaller in CodeSize (-30% in Tier 1, -17% in Tier 2) and in compilation time (~2x as fast in Tier 1!).
It also has a lot fewer Graal nodes after compilation (IR, second value).
However it's not enough to lower MaximumGraalGraphSize to 170000; 180000 is needed as before (maybe it improves by less than 10k, that's possible).
I will ask Graal/Truffle people about that when I can; maybe it's expected, maybe not.
In any case it seems like a big win on several metrics and doesn't hurt the rest of the metrics.

Update, after fixing #902 to pass the CI, here are the numbers, very similar:

$ truffleruby --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=180000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
[engine] opt done   id=5889  Lexer#advance                                      |Tier 1|Time  5806(1876+3929)ms|AST 13739|Inlined   0Y 164N|IR  17785/ 53477|CodeSize  288038|Addr 0x7f6b33373f10|Timestamp 11761157468441|Src lexer-F0.rb:11507
[engine] opt done   id=5889  Lexer#advance                                      |Tier 2|Time  5478(1819+3658)ms|AST 13739|Inlined   0Y 164N|IR  14841/ 45841|CodeSize  278046|Addr 0x7f6b33b0b590|Timestamp 11766662754434|Src lexer-F0.rb:11507
0.06206978000045638
0.0618902780006465
0.06219564200000605
0.06248685600075987
0.06214938099947176

eregon added a commit to eregon/parser that referenced this issue Dec 21, 2022
* This gives big compilation size and time benefit, because the local
  variables of Lexer#advance are no longer captured by those blocks
  (even if unused due to Ruby semantics and `Proc#binding`).
* See whitequark#871 (comment)
@iliabylich
Collaborator

I've extracted A LOT in #901 and it's still not enough? I'm not sure we can even reach the minimum threshold that you mentioned; there are still a few actions that could be extracted, but they are either small and repeated fewer than 5 times, or mid-size and repeated only once.

I'm not going to merge it yet, so no rush.

@eregon
Contributor Author

eregon commented Dec 21, 2022

I've extracted A LOT in #901 and it's still not enough? I'm not sure we can even reach the minimum threshold that you mentioned; there are still a few actions that could be extracted, but they are either small and repeated fewer than 5 times, or mid-size and repeated only once.

I am unsure; maybe it is already enough, the above is just an estimate.
For sure it's a big gain and a step in the right direction.
But it's hard to tell while @p is used instead of p, because that completely blows up any gain.
Could you please change it to not use @p (i.e. the equivalent of removing the first commit, though it might be easier to just revert @p on top, squash all 3 commits, and pass p around as necessary)?

@headius
Contributor

headius commented Jan 4, 2023

At this point I lean toward agreeing with @iliabylich that the gains from these changes may not warrant the additional code complexity. It's clear that even with the most aggressive outlining of code we cannot get the method body small enough for TruffleRuby or JRuby to compile out of the box. On TruffleRuby, experimental options can be passed to force the method to compile, but I assume those limits were put in place to prevent an explosion of graph size and native code across the runtime. In JRuby, we can force it to try to compile to JVM bytecode, but the resulting bytecode size is still far too large for the JVM's classfile format (64k of bytecode). Even if we could get the size under the 64k limit, we'd end up in the same place as TruffleRuby, with the HotSpot JIT refusing to native-compile such a large method.

The plan going forward for JRuby, as it always has been, is to enable JIT-level "chunking" of methods. We have some prototype code floating around that can automatically break a large method into smaller virtual methods, but work remains to make it generally useful. When combined with homogeneous "switch" optimizations for "case/when" we might be able to get those virtual method chunks to JIT individually in JRuby and then the JVM can inline and native-compile the hot paths as needed.

I think the PRs that pull out obvious code duplication should remain in place, so long as they do not affect CRuby performance. The other more exotic changes, such as moving local state variables into instance variables, probably do not help anyone enough to warrant the complexity.

@iliabylich
Collaborator

iliabylich commented Jan 4, 2023

I think we could try experimenting with extracting parts of the Lexer into sub-lexers. For example, string lexing could be moved to a separate class that is called whenever the main lexer's state is inside_string; they could share the buffer + all kinds of source cursors, but the stack of literals + other string-lexing-related state would move to that separate class.

Many other things (keywords, numeric literals, etc) could be moved if this approach doesn't make it significantly slower (I really hope it doesn't).

It's a major refactoring, so I need some time. I'll ping you once there's something to try.
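
For illustration, the sub-lexer split could be shaped roughly like this (a toy with made-up names such as MainLexer and StringSubLexer, not a proposal for the actual parser API):

# The main lexer owns the shared buffer and cursor; string scanning lives in
# a separate object that shares them, keeping each advance method small.
Cursor = Struct.new(:source, :pos)

class StringSubLexer
  def initialize(cursor)
    @cursor = cursor
  end

  # consume a double-quoted string and return its token
  def advance
    start = @cursor.pos
    @cursor.pos += 1 # opening quote
    @cursor.pos += 1 until @cursor.source[@cursor.pos] == '"'
    @cursor.pos += 1 # closing quote
    [:tSTRING, @cursor.source[start...@cursor.pos]]
  end
end

class MainLexer
  def initialize(source)
    @cursor  = Cursor.new(source, 0)
    @strings = StringSubLexer.new(@cursor)
  end

  def advance
    return nil if @cursor.pos >= @cursor.source.length
    if @cursor.source[@cursor.pos] == '"'
      @strings.advance # delegate while "inside a string"
    else
      char = @cursor.source[@cursor.pos]
      @cursor.pos += 1
      [:tCHAR, char]
    end
  end
end

lexer  = MainLexer.new('a"bc"d')
tokens = []
while (tok = lexer.advance)
  tokens << tok
end
p tokens # => [[:tCHAR, "a"], [:tSTRING, "\"bc\""], [:tCHAR, "d"]]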

@headius
Contributor

headius commented Jan 4, 2023

@iliabylich Thank you for your continued work on this! Yes, that seems like a good approach; basically you'll be manually "chunking" these lexers into smaller components that might be able to compile on their own. @enebo and I are standing by to help.

I also want to make it clear: we very much do want to be able to native-compile this code, but we're so far away from that goal at this point I think it's worth reevaluating our approach. I outlined a ton of code, reducing instruction counts by a large percentage. Those changes are probably worth keeping, since they just turn a few common blocks of code into utility methods, but the later changes started to have diminishing returns (and we're still miles away from native-compiling this monster on JRuby/JVM).

@eregon
Contributor Author

eregon commented Jan 4, 2023

I think for TruffleRuby we're getting pretty close to the default limit; once under it, we are able to compile the advance method, and that's a huge performance gain (like ~10x IIRC, EDIT: 38x, see below).
With #901 we'd be around 160k, just 10k above the default 150k limit (see #871 (comment)), and we started at 250k, so we've already done most of the outlining. We'll probably need a bit more for those final 10k.
Also, if it's that close, we might be able to shave off the last 10k in TruffleRuby or Graal itself by analyzing the Partial Evaluation graphs.

Of course anything that makes it even smaller is good for warmup time, but for TruffleRuby the only requirement to JIT it is that it fits within those 150k nodes.

@eregon
Contributor Author

eregon commented Jan 17, 2023

I tried #905, it looks great.

Before: 667d60c

truffleruby 23.0.0-dev-f3db7ba6, like ruby 3.1.3, GraalVM CE JVM [x86_64-linux]
$ truffleruby --jvm --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=180000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
[engine] opt done   id=5888  Lexer#advance                                      |Tier 1|Time  5677(1888+3789)ms|AST 13739|Inlined   0Y 164N|IR  17785/ 53477|CodeSize  294806|Addr 0x7fa679481dc0|Timestamp 12590656471429|Src lexer-F0.rb:11507
[engine] opt done   id=5888  Lexer#advance                                      |Tier 2|Time  5965(1823+4142)ms|AST 13739|Inlined   0Y 164N|IR  14841/ 45841|CodeSize  268312|Addr 0x7fa679db09e0|Timestamp 12596624059879|Src lexer-F0.rb:11507
0.06148607300019648
0.06154388299910352
0.06173742499959189
0.06237041299937118
0.06221564099905663
$ truffleruby --jvm --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=179000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
[engine] opt failed id=5888  Lexer#advance                                      |Tier 1|Time  817(-13485768+13486585)ms|Reason: org.graalvm.compiler.truffle.compiler.GraphTooBigBailoutException: Graph too big to safely compile. Node count: 141367. Graph Size: 179007. Limit: 179000.|Timestamp 13486585076560|Src lexer-F0.rb:11507
2.3860840289999032
2.3804523189992324
2.3851654150003014

After: 5b16cd0

$ truffleruby --jvm --engine.TraceCompilation --experimental-options --engine.MaximumGraalGraphSize=153000 -Ilib -rparser/current -rbenchmark -e 'code=File.read("test/test_lexer.rb"); 300.times { p Benchmark.realtime { Parser::CurrentRuby.parse(code) } }' |& grep -E 'Lexer#advance|^[0-9]'
[engine] opt done   id=5939  Lexer#advance                                      |Tier 1|Time  3999(1525+2474)ms|AST 10580|Inlined   0Y 108N|IR  14318/ 43010|CodeSize  252699|Addr 0x7f5fa963e5c0|Timestamp 13084675447003|Src lexer-F0.rb:8555
[engine] opt done   id=5939  Lexer#advance                                      |Tier 2|Time  3818(1430+2387)ms|AST 10580|Inlined   0Y 108N|IR  12251/ 36971|CodeSize  247420|Addr 0x7f5fa9c31040|Timestamp 13088505875861|Src lexer-F0.rb:8555
0.06172202100060531
0.06184316299913917
0.06209410700103035
0.06226966899885156
0.06194313399828388

So it fits in 153k graph nodes with that PR (vs 180k before)!
The default limit is 150k, so it's very close to compiling without extra flags (for test/test_lexer.rb).
Compilation time, AST size, IR nodes, CodeSize and the number of call sites are all significantly lower.
So this is a clear gain with no drawbacks, judging from the numbers on TruffleRuby.

Also, as we can see, the performance difference between compiled and not compiled is huge: it's 38 times as fast when compiled!

@iliabylich
Collaborator

iliabylich commented Jan 17, 2023

@eregon Thanks, I'll try to get this PR merged this week.

Another thing I wanted to try is moving keyword handling from compile time (i.e. Ragel rules) to runtime, by merging keywords into a generic identifier rule and comparing against the known keywords on the fly. I hope it will reduce the number of branches in the biggest case/when by a lot.
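
For illustration, the runtime keyword check boils down to something like this (a sketch; the kDEF/tIDENTIFIER-style token names follow parser's usual convention, but the real token set and emit path differ):

# Instead of a dedicated Ragel rule (and state-machine branches) per keyword,
# lex a generic identifier and look it up in a table afterwards.
KEYWORDS = {
  'def'   => :kDEF,
  'end'   => :kEND,
  'if'    => :kIF,
  'class' => :kCLASS,
  # ...
}.freeze

def identifier_or_keyword(value)
  [KEYWORDS.fetch(value, :tIDENTIFIER), value]
end

p identifier_or_keyword('def') # => [:kDEF, "def"]
p identifier_or_keyword('foo') # => [:tIDENTIFIER, "foo"]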

@iliabylich
Collaborator

I've merged #905; now I'll try to move keyword handling to runtime. Hopefully it won't hurt performance.
