Significant performance regression from 0.8.6 to 0.8.7 #194

Open
rodya-mirov opened this issue Dec 30, 2023 · 11 comments

@rodya-mirov

Right this second I don't have a great MVE for you, although I can link to the GitHub repo: https://github.com/rodya-mirov/twisty/blob/main/src/main.rs

Run cargo run --release -- skewb. With the ahash version pinned at 0.8.6, the script finishes in about 3.5s; with the bump to 0.8.7, it finishes in about 5s. This is, obviously, not great.

Under the hood this is basically a stress test for the ahash hashset; the hashed objects are somewhat deeply nested, but everything is Copy (no pointers or anything) and each object should be 20 bytes, assuming rustc did what I expected. It's generating millions of them, inserting them, and branching on whether or not the inserted object was new.
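
The hot loop is essentially this shape (a sketch with a hypothetical fixed-size state type, not the actual twisty code):

use ahash::HashSet;

// Hypothetical 20-byte, Copy-only state standing in for the nested puzzle
// structs in the linked repo.
#[derive(Copy, Clone, Hash, PartialEq, Eq)]
struct State([u8; 20]);

fn count_reachable(start: State, neighbors: impl Fn(State) -> Vec<State>) -> usize {
    let mut seen: HashSet<State> = HashSet::default();
    let mut frontier = vec![start];
    while let Some(state) = frontier.pop() {
        // insert() returns true only for never-before-seen states; this
        // insert-and-branch is the part that hammers the hasher.
        if seen.insert(state) {
            frontier.extend(neighbors(state));
        }
    }
    seen.len()
}

fn main() {
    // Toy neighbor relation over the first byte only, just to exercise the loop.
    let reached = count_reachable(State([0u8; 20]), |s| {
        if s.0[0] < 200 {
            let mut next = s.0;
            next[0] += 1;
            vec![State(next)]
        } else {
            vec![]
        }
    });
    println!("visited {reached} states");
}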

Correctness is unaffected; I'm getting the same answers before and after, but performance is obviously way down. Since the answers are unchanged and the rest of the code is the same, I have to assume the issue lies with the ahash changes.

My local environment:

I'm using rust 1.75.0 (stable).

Chip: Apple M1 Pro
OS: macOS Ventura 13.6.2

I can try to trim it down to an MVE if this isn't a known issue and that would be helpful (although the linked code is probably less than 400 lines all told, if you restrict to just skewb.rs and cubesearch.rs; everything else is dead code in this invocation).

tkaitchuck self-assigned this Jan 2, 2024
@tkaitchuck
Owner

Looking into this, I found the code was not using the proper fast path. I will correct this in the next major release, which will take some time. However, this is not the cause of the regression, as the old version was also using the slower path.

@rodya-mirov Can you run cargo asm with each of the versions and grep the result for "aes"? I suspect this change may have affected the build: https://github.com/tkaitchuck/aHash/pull/183/files
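
Alternatively, a quick way to sanity-check whether the aes target feature is even enabled for the build, without digging through assembly (a minimal sketch; aHash's AES path keys on this target feature, among other cfgs):

fn main() {
    // Compile-time view: was this crate built with the aes target feature?
    println!("aes enabled at compile time: {}", cfg!(target_feature = "aes"));

    // Runtime view: does the CPU actually support the instructions?
    #[cfg(target_arch = "aarch64")]
    println!(
        "aes available at runtime: {}",
        std::arch::is_aarch64_feature_detected!("aes")
    );
    #[cfg(target_arch = "x86_64")]
    println!(
        "aes available at runtime: {}",
        std::is_x86_feature_detected!("aes")
    );
}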

@rodya-mirov
Author

Hi, thanks for looking. I felt quite bad about dumping such a non-minimal example; here is a better one (I didn't realize it would be this easy).

Source code:

use std::time::Instant;
use ahash::HashSet;

fn main() {
    let start = Instant::now();

    // Clears the six bits that are set in 5324, so roughly 1 insert in 64
    // adds a new key; the rest hit existing entries.
    const FLAG: usize = !5324;

    let mut s: HashSet<usize> = HashSet::default();

    for i in 0..500_000_000 {
        s.insert(i & FLAG);
    }

    println!("Got {} rows (avoid optimizing away the whole exercise)", s.len());
    println!("Took {:?}", start.elapsed());
}

On 0.8.6, took 1.86s
On 0.8.7, took 3.06s

Timings are relatively stable.

I'm not sure I'm using cargo asm correctly; I ran cargo asm for both versions (no arguments) and got 326 lines of ... I have no idea what. I put in cargo asm ahash_bench::main (the name of my test crate is ahash_bench) and got 360 lines of what looks more like assembly language, but it was the same for both versions. I don't think it's what you meant. In neither case did I get any mention of aes.

Happy to try with different arguments but it's an unfamiliar tool and I'm not sure what I'm looking for.

@tkaitchuck
Owner

If cargo asm ahash_bench::main gives the same output for both versions, then there is no difference in the generated code between versions.

That leaves two possibilities. One is that there is a difference in flags between cargo asm and cargo run. Assuming that is not the case, it could be code alignment: I have found there are sometimes large perf differences in tiny benchmarks depending on totally unrelated changes in the same binary, as they shift code around.

I'll try this code on a few computers to see if I can reproduce it.

@rodya-mirov
Author

rodya-mirov commented Jan 17, 2024

Code alignment seems possible but dubious; these are fairly large benchmarks, and I was able to reproduce the behavior in the original project and the toy project with approximately the same proportional difference in runtime, suggesting there is really something there. I'm definitely not an expert, but I've also been told alignment "can account for up to 10% performance differences" (for instance). 40% seems like an awful lot.

My money is on me not using cargo asm correctly, unfortunately (maybe I need to pass some flags, or something).

@tkaitchuck
Owner

@rodya-mirov Can you check if this is fixed on the 0.9 prerelease branch?

@rodya-mirov
Author

@tkaitchuck The problem seems to be much, much worse with that branch.

Some measurements I just took on my computer (which is now running macOS Sonoma):

0.8.6 - 1.87s (3 trials)
0.8.7 - 2.97s (3 trials)
0.9.0 prerelease - 25.84s (3 trials)

That is to say, more than a 10x slowdown from the 0.8.6 branch.

This is so catastrophically bad that I spent some time trying to figure out whether I had done something wrong with the git dependency, but I don't think there's a lot of room for error here.

My Cargo.toml:

[package]
name = "ahash_bench"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
# ahash = "=0.8.6"
# ahash = "=0.8.7"
ahash = { git = "https://github.com/tkaitchuck/aHash.git", branch = "0.9-prerelease" }

@rodya-mirov
Author

rodya-mirov commented Mar 24, 2024

I dug into the commit history a bit:

At revision 8c3f257f3c62debd37ec0fc99069c2872d24e710 the performance is fine, equal to 0.8.6, at 1.87s or so.

Then at revision b424dc4e6c37585c9109442e75796ceb4e0ab645 (this PR: https://github.com/tkaitchuck/aHash/pull/217/files) the performance craters, all the way to the 20+ second range (the performance is fairly volatile, but consistently bad).

Note that I'm using Rust 1.76.

@tkaitchuck
Owner

Humm. That is the PR activating AES on ARM on stable.
My first thought was that the instruction itself was slow, but according to Arm it should have a latency of 2 cycles and a throughput of 1 per cycle.

I will have to look at the generated assembly. It may be the case that an #[inline] is missing somewhere critical.
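
For context on why a missing #[inline] matters (an illustrative sketch, not aHash's actual code): a small non-generic function in a library crate is normally not inlined into downstream crates unless it carries the attribute or LTO is enabled, so a per-key function call in the hash path would show up immediately.

// Illustrative only. Without #[inline], a downstream crate built without LTO
// typically calls a tiny helper like this through a real function call on
// every hashed word instead of inlining it.
#[inline]
pub fn mix(word: u64, key: u64) -> u64 {
    (word ^ key).rotate_left(23).wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn main() {
    println!("{:x}", mix(0xDEAD_BEEF, 42));
}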

@rodya-mirov
Author

I appreciate you prioritizing this.

@tkaitchuck
Owner

There was a pretty serious regression that was missed because the GitHub Actions script was actually running on Intel Macs instead of on ARM. Fortunately it was never merged to master, so there should be no harm.

With that fixed, running the above code I do see a perf regression, but it is 15%, not 10x.

@tkaitchuck
Owner

Humm, there is some other problem I am seeing now. It is somehow even slower...
