
Evaluate LTO, CGU=1, Profile-Guided Optimization (PGO) and LLVM BOLT #834

Open
zamazan4ik opened this issue Oct 19, 2023 · 13 comments

@zamazan4ik

Hi!

Recently I checked Profile-Guided Optimization (PGO) improvements on multiple projects. The results are here. For example, PGO helps with optimizing the Envoy proxy. According to multiple tests, PGO can help improve performance in many other cases too. That's why I think trying to optimize Quilkin with PGO could be a good idea.

Setting codegen-units (CGU) to 1 and enabling LTO can also help optimize Quilkin's performance due to more aggressive inlining (and could also help reduce the binary size).

I can suggest the following action points:

  • Perform PGO benchmarks on Quilkin. If they show improvements, add a note about the possible performance gains to Quilkin's documentation.
  • Provide an easier way (e.g. a build option) to build Quilkin with PGO. That would help end users and maintainers optimize Quilkin for their own workloads.
  • Optimize the pre-built binaries with PGO.

Testing Post-Link Optimization techniques (like LLVM BOLT) could be interesting too (Clang and Rustc already use BOLT in addition to PGO), but I recommend starting with regular PGO.

For Rust projects, I recommend starting PGO experiments with cargo-pgo.
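For reference, a typical cargo-pgo session looks roughly like the following sketch (subcommand names are from cargo-pgo's README; the workload command and target triple are just an illustration and depend on your setup):

```shell
# One-time setup: cargo-pgo relies on the llvm-tools-preview component.
cargo install cargo-pgo
rustup component add llvm-tools-preview

# 1. Build a PGO-instrumented binary.
cargo pgo build

# 2. Run a representative workload; profiles are collected while it runs.
./target/x86_64-unknown-linux-gnu/release/quilkin proxy --to 127.0.0.1:8078

# 3. Rebuild, optimized with the collected profiles.
cargo pgo optimize
```

The key point is the 2-stage flow: the workload you run in step 2 is what the optimized build in step 3 is tuned for.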

Here are some examples of how PGO optimization is integrated in other projects:

I have already tried to run PGO tests on my machine but hit a bug (more details in #833). I think we can either wait for the fix or run the benchmark some other way (e.g. with iperf).

@XAMPPRocky
Collaborator

Thank you for your issue! I definitely agree with adding it, as Quilkin is almost entirely CPU bound in send_to and recv_from, so the more we can optimize CPU time, the more clients a single proxy can handle. FWIW I've mostly been using fortio for benchmarking, mostly with flamegraphs, but you could also use perf. I've never managed to get iperf working because it requires both UDP and TCP, whereas fortio only needs UDP.

Server

fortio udp-echo

Quilkin

cargo run --release -- proxy --to 127.0.0.1:8078

Client

fortio load -c 3000 -qps 1000000 udp://127.0.0.1:7777/

@zamazan4ik
Author

@XAMPPRocky I just tried your instructions above, and on my Linux machines nothing happens - fortio does not generate the test load. Also, quilkin does not react properly to Ctrl+C in the terminal:

taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'
2023-10-20T09:50:41.791435Z  INFO quilkin::cli: src/cli.rs: Starting Quilkin version="0.8.0-dev" commit="aeb2871bbfa7144cc007a10afa3300f1f6ae1815"
2023-10-20T09:50:41.791571Z  INFO quilkin::cli::admin: src/cli/admin.rs: Starting admin endpoint address=[::]:8000
2023-10-20T09:50:41.791741Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Starting port=7777 proxy_id="fedora"
2023-10-20T09:50:41.791830Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: Quilkin is ready
^C2023-10-20T09:53:50.908715Z  INFO quilkin::cli: src/cli.rs: shutting down from signal signal=SIGINT
2023-10-20T09:53:50.908821Z  INFO quilkin::cli::proxy: src/cli/proxy.rs: waiting for active sessions to expire sessions=996
^C^C^C^C^C^C^C[1]    163477 killed     taskset -c 0 target/release/quilkin proxy --to '127.0.0.1:8078'

And the only way to close it is SIGKILL. The fortio instances were started exactly as you wrote above. Did I miss something obvious?

@zamazan4ik
Author

Oh, it seems to just be an overload issue (probably too many connections). The benchmark started fine when I lowered the connection count and target QPS. Sorry for the ping :)

@XAMPPRocky
Collaborator

Yeah, you need to adjust the -c to match your system, as fortio will try to spawn that many threads and sockets.

@zamazan4ik
Author

zamazan4ik commented Oct 20, 2023

I performed some benchmarks and want to share my results.

Test environment

  • Fedora 38
  • Linux kernel 6.5.6
  • AMD Ryzen 9 5900x
  • 48 GiB RAM
  • SSD Samsung 980 Pro 2 TiB
  • Compiler - Rustc 1.73
  • Quilkin version: latest main branch at commit aeb2871bbfa7144cc007a10afa3300f1f6ae1815
  • Disabled Turbo boost

Benchmark setup

For benchmarking purposes, I use the setup from #834 (comment) (suggested by @XAMPPRocky). The only addition from my side is using taskset to reduce the influence of the OS thread scheduling. So the actual commands are:

  • taskset -c 23 fortio udp-echo - Server
  • taskset -c 0 quilkin proxy --to '127.0.0.1:8078' - Quilkin
  • taskset -c 11-12 fortio load -c 300 -qps 80000 -t 120s udp://127.0.0.1:7777/ - Client

The QPS is tuned to make sure that Quilkin's CPU core is always at 100% load (so we can easily measure throughput improvements on the same hardware).

In this benchmark, I use 4 build configurations:

  • Release build
  • Release + codegen-units=1 + lto = fat build
  • Release + PGO build
  • Release + codegen-units=1 + lto = fat + PGO build

The Release build is done with cargo build --release; PGO builds are done with cargo-pgo. PGO profiles are collected from the benchmark workload itself. Unfortunately, Release + LTO + PGO optimized builds do not work due to the rust-lang/rust#115344 (comment) bug in Rustc (hopefully it will be fixed at some point).

All benchmarks were done multiple times, on the same hardware/software setup, with the same background "noise" (as much as I can guarantee, of course). Quilkin was restarted between runs. There is some variance between runs, but it's not critical.

Results

For the build configurations:

  • quilkin_release - Release build
  • quilkin_lto - Release + codegen-units=1 + lto = fat build
  • quilkin_release_pgo_optimized - Release + PGO optimized build
  • quilkin_lto_instrumented - Release + codegen-units=1 + lto = fat + PGO instrumentation
  • quilkin_release_instrumented - Release + PGO instrumentation

I got the following results:

According to the tests, it's possible to achieve improvements of several percent with LTO and/or PGO, at least in the benchmark above.

Binary sizes for all builds, measured with the size command (just for reference):

size quilkin_release quilkin_lto quilkin_release_pgo_optimized quilkin_lto_instrumented quilkin_release_instrumented
   text	   data	    bss	    dec	    hex	filename
20172458	 838016	   3664	21014138	140a67a	quilkin_release
16134916	 558568	   3576	16697060	 fec6e4	quilkin_lto
17604486	 848424	   3664	18456574	1199ffe	quilkin_release_pgo_optimized
45767668	10730544	  13288	56511500	35e4c0c	quilkin_lto_instrumented
59404083	15691328	  13376	75108787	47a11b3	quilkin_release_instrumented

I'd also like to share some numbers on how enabling LTO and PGO impacts build time:

  • Build time Quilkin Release: 1m 07s
  • Build time Quilkin Release + LTO: 4m 28s
  • Build time Quilkin Release + LTO + PGO Instrumentation: 6m 45s
  • Build time Quilkin Release + PGO Instrumented: 1m 16s
  • Build time Quilkin Release + PGO optimized: 53.49s

Possible further steps

  • Test LLVM BOLT applicability for Quilkin (can be done with cargo-pgo as well).
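If someone wants to try that, a rough sketch of the BOLT flow via cargo-pgo (subcommand names as documented in cargo-pgo's README; requires llvm-bolt installed, and the binary path here is illustrative):

```shell
# Build a BOLT-instrumented binary, optionally on top of PGO optimization.
cargo pgo bolt build --with-pgo

# Run the workload to collect BOLT profiles.
./target/x86_64-unknown-linux-gnu/release/quilkin-bolt-instrumented proxy --to 127.0.0.1:8078

# Produce the final BOLT-optimized binary from the collected profiles.
cargo pgo bolt optimize --with-pgo
```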

@XAMPPRocky
Collaborator

Thank you for working on this @zamazan4ik! It's a shame we can't get both right now. Is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

Are you interested in contributing the work to make this happen in our CI?

@zamazan4ik
Author

is there one in particular that you'd recommend we adopt while we wait for it to be fixed?

I recommend enabling LTO (codegen-units = 1 plus lto = "fat", or ThinLTO) since it's much easier to integrate into the CI pipeline - it's just a matter of enabling a few compiler flags. Compare that to PGO, where you need to implement a 2-stage build pipeline. Later, when the LTO + PGO bug is fixed upstream, you can integrate PGO as an additional optimization step on top of LTO.

Are you interested in contributing the work to make this happen in our CI?

If you agree to start with LTO, the changes in general would be as simple as the following change to the Cargo.toml file:

[profile.release]
lto = "fat"
codegen-units = 1

Since LTO (especially the fat variant) greatly slows down the build (see my build time numbers above), you could enable LTO only when building actual releases, rather than on every CI build check. It's all up to you. I recommend just putting these lines into Cargo.toml to start with, and later, if you run into problems with build times or anything like that, think about separating different profiles, etc.

@markmandel
Member

Thanks also for doing this work - this is super interesting, and great to see the performance improvements.

Also, I would share some numbers about enabling LTO and PGO and its impact on the build time:

This doesn't seem like a huge jump. Even LTO taking ~4 minutes isn't the end of the world. So definitely not a blocker.

I've never managed to get iperf working because it requires UDP and TCP where as fortio only needs UDP.

Shall we switch out the iperf test for a fortio one? I'm not wedded to either, whatever is easiest to use!

@XAMPPRocky
Collaborator

XAMPPRocky commented Oct 20, 2023

Yeah, I think fortio is better; iperf has never worked for me, whether hosting a server locally or using a public one.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements stays fast. For CI, the extra time is worth the better performance.

@markmandel
Member

Yeah, I think fortio is better; iperf has never worked for me, whether hosting a server locally or using a public one.

Agreed. #835 filed.

Re: the release flags, I think I would lean towards enabling them only in CI, so that running benchmarks locally and iterating on improvements stays fast. For CI, the extra time is worth the better performance.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

quilkin/build/Makefile

Lines 56 to 59 in aeb2871

cargo_build_x86_64_linux := build --release --target x86_64-unknown-linux-gnu
cargo_build_x86_64_apple := build --release --target x86_64-apple-darwin
cargo_build_aarch64-apple := build --release --target aarch64-apple-darwin
cargo_build_x86_64_windows := build --release --target x86_64-pc-windows-gnu
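One way to flip these flags on only in the Makefile/CI builds, without touching Cargo.toml at all, is Cargo's environment-variable profile overrides. A sketch (the CARGO_PROFILE_<name>_<key> variables are standard Cargo; the target triple just mirrors the Makefile above):

```shell
# Release/CI builds only: enable fat LTO and a single codegen unit,
# overriding the release profile via environment variables.
CARGO_PROFILE_RELEASE_LTO=fat \
CARGO_PROFILE_RELEASE_CODEGEN_UNITS=1 \
cargo build --release --target x86_64-unknown-linux-gnu
```

A plain local `cargo build --release` without those variables keeps the fast default settings.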

@zamazan4ik
Author

This doesn't seem like a huge jump. Even LTO taking ~4 minutes isn't the end of the world. So definitely not a blocker.

Agreed. Just to highlight - some projects enable such "heavy" optimizations only when building actual release binaries. E.g. Vector implements this via a special release script. So if you decide to take that approach, there are already examples in the ecosystem to look at.

Yeah, that makes sense - We could add the optimisation when building out the images via the Makefile (links below) - which hooks into CI, but for a local cargo build keep them off. @zamazan4ik I assume that's possible?

Definitely! It's a good way to integrate PGO into the project.

@markmandel
Member

Definitely! It's a good way to integrate PGO into the project.

If you would love to show us how it's done 😃 @zamazan4ik - we'd definitely love your help in this area. Seems like an easy win to me 👍🏻

@zamazan4ik
Author

Sure. You can create an additional LTO-specific profile in Cargo.toml, like it's done in the G3 project, and then have the Makefile build Quilkin with that specific Cargo profile.
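As a sketch of that idea (Cargo custom profiles have been stable since Rust 1.57; the profile name here is arbitrary):

```toml
# Cargo.toml
[profile.release-lto]
inherits = "release"
lto = "fat"
codegen-units = 1
```

The Makefile's release targets would then run cargo build --profile release-lto, while a plain cargo build --release stays fast for local iteration.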
