
Evaluate LTO, CGU=1, Profile-Guided Optimization (PGO) and LLVM BOLT #834

Open
zamazan4ik opened this issue Oct 19, 2023 · 13 comments

@zamazan4ik

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. For example, PGO helps with optimizing the Envoy proxy. According to multiple tests, PGO can help improve performance in many other cases too. That's why I think trying to optimize Quilkin with PGO could be a good idea.

Setting codegen-units (CGU) to 1 and enabling LTO can also help optimize Quilkin's performance due to more aggressive inlining (and could also help reduce the binary size).

I can suggest the following action points:

  • Perform PGO benchmarks on Quilkin. If they show improvements, add a note about the possible performance gains to Quilkin's documentation.
  • Provide an easier way (e.g. a build option) to build Quilkin with PGO. That would help end users and maintainers optimize Quilkin for their own workloads.
  • Optimize the pre-built binaries with PGO.

Testing Post-Link Optimization techniques (like LLVM BOLT) could be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.

For Rust projects, I recommend starting PGO experiments with cargo-pgo.
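For reference, a typical cargo-pgo session looks roughly like the following sketch (subcommand names are from cargo-pgo's README; the workload command and target triple are just an illustration and depend on your setup):

```shell
# One-time setup: cargo-pgo relies on the llvm-tools-preview component.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build a PGO-instrumented binary.
cargo pgo build

# 2. Run a representative workload; profiles are collected while it runs.
./target/x86_64-unknown-linux-gnu/release/quilkin proxy --to 127.0.0.1:8078

# 3. Rebuild, optimized with the collected profiles.
cargo pgo optimize
```

The key point is the 2-stage flow: the workload you run in step 2 is what the optimized build in step 3 is tuned for.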

Here are some examples of how PGO optimization is integrated in other projects:

I have already tried to run PGO tests on my machine but hit a bug (more details in #833). I think we can either wait for the fix or run the benchmark some other way (e.g. with iperf).

@XAMPPRocky
Collaborator

Thank you for your issue! I definitely agree with adding it, as Quilkin is almost entirely CPU bound in send_to and recv_from, so the more we can optimize CPU time, the more clients a single proxy can handle. FWIW I've mostly been using fortio for benchmarking, mostly with flamegraphs, but you could also use perf. I've never managed to get iperf working because it requires both UDP and TCP, whereas fortio only needs UDP.

Server

fortio udp-echo

Quilkin

cargo run --release -- proxy --to 127.0.0.1:8078

Client

fortio load -c 3000 -qps 1000000 udp://127.0.0.1:7777/

@zamazan4ik
Author

@XAMPPRocky I just tried your instructions above, and on my Linux machines nothing happens - fortio does not generate the test load. Also, quilkin does not react properly to Ctrl+C in the terminal:

taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'
2023-10-20T09:50:41.791435Z  INFO quilkin::cli: src/cli.rs: Starting Quilkin version="0.8.0-dev" commit="aeb2871bbfa7144cc007a10afa3300f1f6ae1815"
2023-10-20T09:50:41.791571Z  INFO quilkin::cli::admin: src/cli/admin.rs: Starting admin endpoint address=[::]:8000
2023-10-20T09:50:41.791741Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Starting port=7777 proxy_id="fedora"
2023-10-20T09:50:41.791830Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Quilkin is ready
^C2023-10-20T09:53:50.908715Z  INFO quilkin::cli: src/cli.rs: shutting down from signal signal=SIGINT
2023-10-20T09:53:50.908821Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: waiting for active sessions to expire sessions=996
^C^C^C^C^C^C^C[1]    163477 killed     taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'

And the only way to close it is SIGKILL. The fortio instances were started exactly as you wrote above. Did I miss something obvious?

@zamazan4ik
Author

Oh, it seems to just be an overload issue (probably too many connections). The benchmark started fine when I lowered the connection count and target QPS. Sorry for the ping :)

@XAMPPRocky
Collaborator

Yeah, you need to adjust the -c to match your system, as fortio will try to spawn that many threads and sockets.

@zamazan4ik
Author

zamazan4ik commented Oct 20, 2023

I performed some benchmarks and want to share my results.

Test environment

  • Fedora 38
  • Linux kernel 6.5.6
  • AMD Ryzen 9 5900x
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.73
  • Quilkin version: latest main branch at commit aeb2871bbfa7144cc007a10afa3300f1f6ae1815
  • Disabled Turbo boost

Benchmark setup

For benchmarking purposes, I use the setup from #834 (comment) (suggested by @XAMPPRocky). The only addition from my side is using taskset to reduce the influence of the OS thread scheduling. So the actual commands are:

  • taskset -c 23 fortio udp-echo - Server
  • taskset -c 0 quilkin proxy --to '127.0.0.1:8078' - Quilkin
  • taskset -c 11-12 fortio load -c 300 -qps 80000 -t 120s udp://127.0.0.1:7777/ - Client

The QPS is tuned to make sure that Quilkin's CPU core is always at 100% load (so we can easily measure throughput improvements on the same hardware).

In this benchmark, I use 4 build configurations:

  • Release build
  • Release + codegen-units=1 + lto = fat build
  • Release + PGO build
  • Release + codegen-units=1 + lto = fat + PGO build

The Release build is done with cargo build --release; PGO builds are done with cargo-pgo. PGO profiles are collected from the benchmark workload itself. Unfortunately, Release + LTO + PGO optimized builds do not work due to the rust-lang/rust#115344 (comment) bug in Rustc (hopefully it will be fixed at some point).

All benchmarks were done multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course). Quilkin was restarted between runs. There is some variance between runs, but it's not critical.

Results

For the build configurations:

  • quilkin_release - Release build
  • quilkin_lto - Release + codegen-units=1 + lto = fat build
  • quilkin_release_pgo_optimized - Release + PGO optimized build
  • quilkin_lto_instrumented - Release + codegen-units=1 + lto = fat + PGO instrumentation
  • quilkin_release_instrumented - Release + PGO instrumentation

I got the following results:

According to the tests, it's possible to achieve improvements of several percent with LTO and/or PGO, at least in the benchmark above.

Binary sizes for all builds, measured with the size command (just for reference):

size quilkin_release quilkin_lto quilkin_release_pgo_optimized quilkin_lto_instrumented quilkin_release_instrumented
   text	   data	    bss	    dec	    hex	filename
20172458	 838016	   3664	21014138	140a67a	quilkin_release
16134916	 558568	   3576	16697060	 fec6e4	quilkin_lto
17604486	 848424	   3664	18456574	1199ffe	quilkin_release_pgo_optimized
45767668	10730544	  13288	56511500	35e4c0c	quilkin_lto_instrumented
59404083	15691328	  13376	75108787	47a11b3	quilkin_release_instrumented

I'd also like to share some numbers on how enabling LTO and PGO impacts build time:

  • Build time Quilkin Release: 1m 07s
  • Build time Quilkin Release + LTO: 4m 28s
  • Build time Quilkin Release + LTO + PGO Instrumentation: 6m 45s
  • Build time Quilkin Release + PGO Instrumented: 1m 16s
  • Build time Quilkin Release + PGO optimized: 53.49s

Possible further steps

  • Test LLVM BOLT applicability for Quilkin (can be done with cargo-pgo as well).
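If someone wants to try that, a rough sketch of the BOLT flow via cargo-pgo (subcommand names as documented in cargo-pgo's README; requires llvm-bolt installed, and the binary path here is illustrative):

```shell
# Build a BOLT-instrumented binary, optionally on top of PGO optimization.
cargo pgo bolt build --with-pgo

# Run the workload to collect BOLT profiles.
./target/x86_64-unknown-linux-gnu/release/quilkin-bolt-instrumented proxy --to 127.0.0.1:8078

# Produce the final BOLT-optimized binary from the collected profiles.
cargo pgo bolt optimize --with-pgo
```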

@XAMPPRocky
Collaborator

Thank you for working on this @zamazan4ik! It's a shame we can't get both right now. Is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

Are you interested in contributing the work to make this happen in our CI?

@zamazan4ik
Author

is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

I recommend enabling LTO (codegen-units = 1 plus lto = "fat", or ThinLTO) since it's much easier to integrate into the CI pipeline - it's just a matter of enabling a few compiler flags. Compare that to PGO, where you need to implement a 2-stage build pipeline. Later, when the LTO + PGO bug is fixed upstream, you can integrate PGO as an additional optimization step on top of LTO.

Are you interested in contributing the work to make this happen in our CI?

If you agree to start with LTO, the changes in general would be as simple as the following change to the Cargo.toml file:

[profile.release]
lto = "fat"
codegen-units = 1

Since LTO (especially the fat variant) greatly slows down the build (see my build time numbers above), you could enable LTO only when building actual releases, rather than on every CI build check. It's all up to you. I recommend just putting these lines into Cargo.toml to start with, and later, if you run into problems with build times or anything like that, think about separating different profiles, etc.

@markmandel
Member

Thanks also for doing this work - this is super interesting, and great to see the performance improvements.

Also, I would share some numbers about enabling LTO and PGO and its impact on the build time:

This doesn't seem like a huge jump. Even LTO taking ~4 minutes isn't the end of the world. So definitely not a blocker.

I've never managed to get iperf working because it requires UDP and TCP where as fortio only needs UDP.

Shall we switch out the iperf test for a fortio one? I'm not wedded to either, whatever is easiest to use!

@XAMPPRocky
Collaborator

XAMPPRocky commented Oct 20, 2023

Yeah, I think fortio is better; iperf has never worked for me, whether hosting a server locally or using a public one.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements stays fast. For CI, the extra time is worth the better performance.

@markmandel
Member

Yeah, I think fortio is better; iperf has never worked for me, whether hosting a server locally or using a public one.

Agreed. #835 filed.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements stays fast. For CI, the extra time is worth the better performance.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

quilkin/build/Makefile

Lines 56 to 59 in aeb2871

cargo_build_x86_64_linux := build --release --target x86_64-unknown-linux-gnu
cargo_build_x86_64_apple := build --release --target x86_64-apple-darwin
cargo_build_aarch64-apple := build --release --target aarch64-apple-darwin
cargo_build_x86_64_windows := build --release --target x86_64-pc-windows-gnu
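One way to flip these flags on only in the Makefile/CI builds, without touching Cargo.toml at all, is Cargo's environment-variable profile overrides. A sketch (the CARGO_PROFILE_<name>_<key> variables are standard Cargo; the target triple just mirrors the Makefile above):

```shell
# Release/CI builds only: enable fat LTO and a single codegen unit,
# overriding the release profile via environment variables.
CARGO_PROFILE_RELEASE_LTO=fat \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
cargo build --release --target x86_64-unknown-linux-gnu
```

A plain local `cargo build --release` without those variables keeps the fast default settings.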

@zamazan4ik
Author

This doesn't seem like a huge jump. Even LTO taking ~4 minutes isn't the end of the world. So definitely not a blocker.

Agreed. Just to highlight - some projects enable such "heavy" optimizations only when building actual release binaries. E.g. Vector implements this via a special release script. So if you decide to take that approach, there are already examples in the ecosystem to look at.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

Definitely! It's a good way to integrate PGO into the project.

@markmandel
Member

Definitely! It's a good way to integrate PGO into the project.

If you would love to show us how it's done 😃 @zamazan4ik - we'd definitely love your help in this area. Seems like an easy win to me 👍🏻

@zamazan4ik
Author

Sure. You can create an additional LTO-specific profile in Cargo.toml, like it's done in the G3 project, and then have the Makefile build Quilkin with that specific Cargo profile.
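As a sketch of that idea (Cargo custom profiles have been stable since Rust 1.57; the profile name here is arbitrary):

```toml
# Cargo.toml
[profile.release-lto]
inherits = "release"
lto = "fat"
codegen-units = 1
```

The Makefile's release targets would then run cargo build --profile release-lto, while a plain cargo build --release stays fast for local iteration.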
