Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoiding struct method call overhead of extract_pyclass_ref_mut #3843

Open
Thell opened this issue Feb 15, 2024 · 4 comments · May be fixed by #4026
Open

Avoiding struct method call overhead of extract_pyclass_ref_mut #3843

Thell opened this issue Feb 15, 2024 · 4 comments · May be fixed by #4026

Comments

@Thell
Copy link

Thell commented Feb 15, 2024

Firstly, thanks for PyO3, it's great!

This issue is being opened following a [chat post on gitter].(https://matrix.to/#/!AAhjIWoaKSExrkkhlG:gitter.im/$9gOL75NMgSzKVrKugcQUMBUFcy5QMpLvxuJafAHLieU?via=gitter.im&via=matrix.org&via=nitro.chat)

In short, consider these performance ordered cargo bench results:

running 5 tests
test tests::bench_xoshiro_struct       ... bench:  56,930,651 ns/iter (+/- 145,472)
test tests::bench_lcg_struct           ... bench:  86,424,781 ns/iter (+/- 337,028)
test tests::bench_lcg_static_lazy      ... bench:  86,614,389 ns/iter (+/- 781,944)
test tests::bench_lcg_static           ... bench:  87,029,922 ns/iter (+/- 353,495)
test tests::bench_xoshiro_static_lazy  ... bench: 108,713,707 ns/iter (+/- 1,500,070)

and contrast that to these timeit results:

lcg_static:            2.2382071590000123
lcg_static_lazy:       2.2475717499983148
xoshiro_static_lazy:   2.333924032998766
lcg_struct:            2.603333091999957
xoshiro_struct:        2.6484997750012553

See something odd there? 🤓 Hey structs! What is going on!?!? Sure we have overhead but all of these test functions are returning a simple type to python and, yeah, the structs have to be mutable but for xoshiro_struct to go from being about twice as fast as the lazy to about 14% slower is unexpected, and to see both of the 'struct' versions leading in the Rust benches to trailing in the Python timeits is unexpected.

Running some long pytest loops on these to get enough samples to see what's going on reveals that our two structs are getting penalized for being structs:

image

The question is can this be avoided through some hint/annotation to short-circuit that extract/type-info stack?

Here's one of the minimal structs:

#[pyclass]
pub struct XoshiroStruct {
    state: Xoshiro256Plus,
}

impl XoshiroStruct {
    fn next_state(&mut self) -> u64 {
        self.state.next_u64()
    }
}

#[pymethods]
impl XoshiroStruct {
    #[new]
    pub fn new() -> Self {
        Self {
            state: Xoshiro256Plus::from_entropy(),
        }
    }

    pub fn do_something(&mut self) -> u64 {
        do_it(self.next_state())
    }
}

Here's the profile: pytest 2024-02-15 22.59 profile.json.gz

And here's a repo with these examples: https://github.com/Thell/struct_perf

@adamreichold
Copy link
Member

Generally speaking, some amount of overhead is unavoidable due pervasive shared mutability applied in Python which requires us to use interior mutability patterns to enable safe Rust code from obtaining a &mut Self at all.

You might be able to reduce that overhead by opting out of dynamic borrow check by marking your pyclass as frozen using #[pyclass(frozen)] (check the guide for details). This does imply that do_something must take &self and hence you need to manage the interior mutability yourself. But since you seem to be benchmarking small PRNG, using std::cell::Cell should be possible with doing reference checking, e.g.

let mut state = self.state.get();
let value = self.next_state();
self.state.set(state);

@Thell
Copy link
Author

Thell commented Feb 16, 2024

Thanks for the reply @adamreichold. Unfortunately that doesn't seem to have negated the "extract" cost, although it is now extract_pyclass_ref instead of extract_pyclass_ref_mut. :)

The image cut off the struct declaration:

#[pyclass(frozen)]
pub struct XoshiroStruct {
    state: Cell<Xoshiro256Plus>,
}

image

note: if anyone else tries to do Cell<Xoshiro256Plus> be warned you'll need to use a local rand_xoshiro and add the impl for Copy since it only has Clone in the official repo.

@davidhewitt
Copy link
Member

So from your flamegraphs it looks like the time is dominated by retrieval of the type object, which in general is a necessary part of the type check. I've got a couple of thoughts here:

  1. @Thell can you try using lto = true in your Cargo.toml if you are not already doing so? I wonder if more aggressive inlining can help here.
  2. I wonder if this is a hint that we should be exploring storing these type objects within in the module state; the "lazy type object" machinery seems to be doing a lot of work. (This would be a bigger refactor of PyO3).
  3. We don't currently have a flag to opt out of the type check. Depending on the protection that CPython gives us, this type check may actually not be needed at all.

@samuelcolvin might have hit the same thing when observing "methods overhead" in #3827

@Thell
Copy link
Author

Thell commented Feb 16, 2024

@davidhewitt It's set already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants