
Evaluate using Profile-Guided Optimization (PGO) #1689

Open · zamazan4ik opened this issue May 4, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments

@zamazan4ik commented May 4, 2024

Hi!

I just read your article about optimizing graphql-lint's performance. Since I have tested one specific compiler optimization, Profile-Guided Optimization (PGO), on various projects with positive results (you can find all the benchmarks here: https://github.com/zamazan4ik/awesome-pgo), I decided to test the optimization on graphql-lint as well.

Test environment

  • Fedora 39
  • Linux kernel 6.8.7
  • AMD Ryzen 9 5900X
  • 48 GiB RAM
  • Samsung 980 Pro 2 TB SSD
  • Compiler: rustc 1.78
  • Project version: the latest from the main branch at commit 5605d62f69790f62a385e8155bddf838f977165b
  • Turbo Boost disabled

Benchmark

For the benchmark, I use the project's built-in benchmarks. For PGO optimization I use the cargo-pgo tool. The Release results were collected with the cargo bench command; the PGO training phase was done with cargo pgo bench, and the PGO optimization phase with cargo pgo optimize bench.
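
For reference, the full workflow looks roughly like this (a sketch; it assumes cargo-pgo and the llvm-tools-preview rustup component are installed):

```bash
# One-time setup for PGO builds
cargo install cargo-pgo
rustup component add llvm-tools-preview

# Baseline: plain release benchmarks
cargo bench

# PGO training phase: builds an instrumented binary and runs the
# benchmarks to collect .profraw profiles
cargo pgo bench

# PGO optimization phase: rebuilds using the collected profiles and
# re-runs the benchmarks, so Criterion reports the change vs. baseline
cargo pgo optimize bench
```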

All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

Results

I got the following results:

PGO optimized compared to Release:

```
     Running benches/benchmark.rs (/home/zamazan4ik/open_source/grafbase/target/x86_64-unknown-linux-gnu/release/deps/benchmark-107057c07804eda6)
Benchmarking lint schema
Benchmarking lint schema: Warming up for 3.0000 s
Benchmarking lint schema: Collecting 100 samples in estimated 5.0165 s (162k iterations)
Benchmarking lint schema: Analyzing
lint schema             time:   [30.072 µs 30.093 µs 30.110 µs]
                        change: [-20.761% -20.644% -20.540%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe
```

(just for reference) PGO instrumentation compared to Release:

```
     Running benches/benchmark.rs (/home/zamazan4ik/open_source/grafbase/target/x86_64-unknown-linux-gnu/release/deps/benchmark-107057c07804eda6)
Benchmarking lint schema
Benchmarking lint schema: Warming up for 3.0000 s
Benchmarking lint schema: Collecting 100 samples in estimated 5.3109 s (71k iterations)
Benchmarking lint schema: Analyzing
lint schema             time:   [75.332 µs 75.360 µs 75.389 µs]
                        change: [+98.777% +99.032% +99.272%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
```

According to the results, PGO measurably improves the tool's performance, at least in the benchmark above (roughly a 20% improvement). However, I think the benchmarks should be performed on more datasets.

Further steps

I can suggest the following action points:

  • Perform more PGO benchmarks with other test files. If they show improvements, add a note to the documentation (the README file?) about the possible improvements in the tool's performance with PGO.
  • Optimize the prebuilt binaries with PGO. As a training set, you can gather multiple real-life files, train PGO on them, and deliver PGO-pre-optimized binaries to users (a sketch of this workflow follows below).
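
A minimal sketch of that pre-optimization workflow, assuming a directory of representative real-life schemas; the schemas/ path, binary name, and lint invocation below are placeholders rather than the project's actual CLI:

```bash
# Build an instrumented binary (cargo-pgo builds into an explicit
# target-triple directory)
cargo pgo build

# Training phase: run the instrumented binary over real-life inputs
# (hypothetical paths and subcommand)
for schema in schemas/*.graphql; do
    ./target/x86_64-unknown-linux-gnu/release/grafbase lint "$schema"
done

# Rebuild with the collected profiles; this is the binary to ship
cargo pgo optimize
```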

Testing post-link optimization techniques (like LLVM BOLT) would be interesting too (Clang and rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.
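
For completeness, cargo-pgo also wraps BOLT; a sketch, assuming llvm-bolt is installed and available on PATH:

```bash
# Build a BOLT-instrumented binary (optionally on top of PGO)
cargo pgo bolt build --with-pgo

# ...run the training workload as above, then apply the BOLT profiles
cargo pgo bolt optimize --with-pgo
```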

I would be happy to answer your questions about PGO.

@zamazan4ik added the enhancement (New feature or request) label on May 4, 2024

@yoav-lavi (Collaborator) commented May 9, 2024

@zamazan4ik Thank you! Looks very interesting.

My understanding is that PGO optimizes based on the input in this case; however, do we know that we didn't optimize specifically for the benchmark schema? (As in, while there may be some commonalities between inputs, a schema is still somewhat random in terms of what it can contain.)

Do these tools output any sort of indication as to what changes they're making? Since what you're benchmarking is directly used for the optimization, it'd be hard to know whether the tool is generally faster or only faster for this specific benchmark.

Thank you!

@zamazan4ik (Author) commented

(Please excuse the very late response.)

> My understanding is that PGO optimizes based on the input in this case; however, do we know that we didn't optimize specifically for the benchmark schema? (As in, while there may be some commonalities between inputs, a schema is still somewhat random in terms of what it can contain.)

I'd guess that many inputs exercise similar internal code paths in the tool, so it should be safe to prepare a "real life" training dataset, use it during pre-optimization, and deliver a PGO-pre-optimized linter that works well for real users.

> Do these tools output any sort of indication as to what changes they're making?

If we are talking about PGO: it's not a dedicated tool, it's part of the compiler. In the general case, no, the compiler doesn't report the changes it makes to your program with PGO; that's internal to the compiler. You should expect different inlining decisions, hot/cold code splitting, and similar things. If you want to understand more, I suggest using a disassembler to compare the non-PGO and PGO assembly of the tool.
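
A quick way to eyeball those differences is to diff the disassembly of the two binaries (a sketch; the binary name and paths are illustrative):

```bash
# Build the non-PGO and PGO binaries
cargo build --release
cargo pgo optimize

# Disassemble both and diff (note that cargo-pgo builds into a
# target-triple subdirectory)
objdump -d target/release/grafbase > release.asm
objdump -d target/x86_64-unknown-linux-gnu/release/grafbase > pgo.asm
diff -u release.asm pgo.asm | less
```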

> Since what you're benchmarking is directly used for the optimization, it'd be hard to know whether the tool is generally faster or only faster for this specific benchmark.

Fair point. You can prepare one dataset for training and a separate one for evaluation, and then run the benchmarks again. I am 99% sure that PGO will bring the same performance benefits for this tool in that scenario too, since the code paths will be similar.
