
add flag to skip reference pool mutex if the program doesn't use the pool #4174

Draft · wants to merge 2 commits into main

Conversation

davidhewitt
Member

This is a tiny idea I had to try to move ReferencePool work off the critical path for programs which have avoided using the reference pool.

The idea is that after #4095 the reference pool only contains deferred drops. We can use a boolean flag, written with only relaxed atomic operations, to record whether we have ever written to the pool; if the flag is false, we skip the pool work entirely. This should be extremely friendly to branch prediction: if the program never touches the pool, the branch is always predicted correctly and the (heavier) mutex work is skipped.

I wonder if this might be a gentler alternative to the config flag in #4095 that is good enough for most users. I would also support keeping that flag but making it abort-only, as a debugging tool for reaching the never-touches-the-pool state that this PR optimizes for. cc @adamreichold

Measuring the same sample from #3787 (comment) on my machine, this feature does indeed make quite a marked improvement:

# before - rust noop is slower
$ python test.py
py 2.3021166001853998e-08
rust 3.136266599904047e-08

# after - rust noop is slightly faster
$ python test.py
py 2.5457915995502846e-08
rust 2.360866600065492e-08

src/gil.rs Outdated
Comment on lines 285 to 295
#[inline]
pub(crate) fn update_deferred_reference_counts(py: Python<'_>) {
    // Justification for relaxed: worst case this causes already deferred drops to be
    // delayed slightly later, and this is also a one-time flag, so if the program is
    // using deferred drops it is highly likely that branch prediction will always
    // assume this is true and we don't need the atomic overhead.
    if POOL.ever_used.load(sync::atomic::Ordering::Relaxed) {
        POOL.update_counts(py)
    }
}

Member

If we do this, I would prefer to better encapsulate it, e.g. in impl ReferencePool

#[inline]
fn update_counts(&self, py: Python<'_>) {
    if self.ever_used.load(Ordering::Relaxed) {
        self.update_counts_impl(py);
    }
}

#[inline(never)]
fn update_counts_impl(&self, _py: Python<'_>) { /* ... */ }

Member

We might even get away with not implementing any of this ourselves and get better self-documenting code: just put the pool inside a once_cell::sync::Lazy, i.e.

static POOL: Lazy<ReferencePool> = Lazy::new(ReferencePool::new);

and call register_decref as before (going via impl Deref for Lazy, i.e. Lazy::force), but for update, do

if let Some(pool) = Lazy::get(&POOL) {
    pool.update_counts(py);
}
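
For illustration, a slightly fuller sketch of what this Lazy-based variant could look like (assuming once_cell as a dependency; ReferencePool and the ffi types are the ones already in src/gil.rs, and the function names mirror the snippets above, so treat this as a sketch rather than the final code):

use once_cell::sync::Lazy;

static POOL: Lazy<ReferencePool> = Lazy::new(ReferencePool::new);

pub(crate) fn register_decref(obj: NonNull<ffi::PyObject>) {
    // Deref (Lazy::force) initializes the pool on first use.
    POOL.register_decref(obj);
}

pub(crate) fn update_deferred_reference_counts(py: Python<'_>) {
    // Lazy::get returns None until the pool has been touched at least once,
    // so programs that never defer drops skip the mutex entirely.
    if let Some(pool) = Lazy::get(&POOL) {
        pool.update_counts(py);
    }
}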

Member

(This would also take care of not writing the flag again once it is set, but it might be a bit slower because Lazy tracks its initialization state inside an AtomicU8.)

Member Author

Yes, it's a good idea. I'll test out the performance of using Lazy when I get a chance later 👍

            pointer_ops: sync::Mutex::new((Vec::new(), Vec::new())),
        }
    }

    fn register_incref(&self, obj: NonNull<ffi::PyObject>) {
        self.ever_used.store(true, sync::atomic::Ordering::Relaxed);
Member

It might actually help to avoid this memory traffic entirely if the flag is already set, e.g.

let _ = self.ever_used.compare_exchange(false, true, Ordering::Relaxed, Ordering::Relaxed);
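
An equivalent check-then-set phrasing of the same idea, which also skips the write once the flag is set (a sketch of the suggestion above, not code from this PR; it relies on the flag only ever going from false to true):

// Only touch the cache line in exclusive mode on the first use.
if !self.ever_used.load(Ordering::Relaxed) {
    self.ever_used.store(true, Ordering::Relaxed);
}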

Member Author

Testing this locally (albeit with a single thread), the .store seems to perform better, so I will leave it as-is for now.

Member

albeit with a single thread

To be honest, a single-threaded test cannot really tell us anything about the effects of cache line bouncing, since only a single core's cache is involved. (In any case, I would be more keen on the Lazy-based approach than open-coding this.)

@davidhewitt
Member Author

davidhewitt commented May 11, 2024

Regarding the encapsulation, I fully agree with making this part of update_counts, but I wasn't keen on conflicting with #4095.

(I started with exactly the if / impl split you propose)

@adamreichold
Member

I fully agree with making this part of update_counts, but I wasn't keen on conflicting with #4095

I would be glad if we could finish #4095 first. It is a soundness fix, and even if we decide here to replace the flag entirely, I think doing it one step at a time would be preferable.

@alex
Contributor

alex commented May 11, 2024

In #1608 I introduced a similar bool, but it got removed at some point. May be worthwhile to figure out why we removed it.

@adamreichold
Member

adamreichold commented May 11, 2024

In #1608 I introduced a similar bool, but it got removed at some point. May be worthwhile to figure out why we removed it.

Note that this is slightly different: this bool is a one-time flag that is set when the program first touches the reference pool and never reset. Hence, its cache line will end up permanently in the shared state, provided the flag stays on a separate cache line (we should make sure to use alignment to prevent any false sharing here).
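
A minimal sketch of the alignment idea (the 128-byte alignment value and the CachePadded wrapper are assumptions for illustration, not code from this PR):

#[repr(align(128))]
struct CachePadded<T>(T);

struct ReferencePool {
    // Kept on its own (aligned) cache line so reading or setting it does not
    // contend with the mutex-protected pointer lists below.
    ever_used: CachePadded<std::sync::atomic::AtomicBool>,
    pointer_ops: std::sync::Mutex<(Vec<usize>, Vec<usize>)>, // element type simplified for the sketch
}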

The flag that was removed tracked whether the reference pool was dirty and hence had the same amount of coherence traffic as the atomic on which all futex-based mutexes depend in any case, i.e. in the uncontended case it would be just as fast/slow as the mutex access itself to check whether the pool is dirty.

Now, it might be that the mutex access got slower now that we have removed parking_lot, which guaranteed these futex-like performance characteristics, especially if @davidhewitt ran his benchmark on macOS, which IIRC (cf. rust-lang/rust#93740) still uses pthread mutexes behind the scenes.

@adamreichold
Member

adamreichold commented May 11, 2024

Looking at https://github.com/rust-lang/rust/blob/master/library/std/src/sys/sync/mutex/mod.rs, macOS really does still appear to use pthread, so I would assume that moving away from parking_lot to a fair PI mutex certainly regressed this. But I also don't think that this functionality is worth a dependency like parking_lot.

I really wonder whether we should just use a lock-free stack or queue here, as we really do not need mutual exclusion; we just need to steal a single pointer using a compare-exchange.
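
As a rough sketch of that direction (a Treiber-style stack where producers push with compare-exchange and the consumer steals the whole list with a single atomic swap; the node layout and names here are illustrative, not PR code):

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node {
    obj: usize, // stand-in for the deferred NonNull<ffi::PyObject>
    next: *mut Node,
}

static HEAD: AtomicPtr<Node> = AtomicPtr::new(ptr::null_mut());

// Producers (possibly without the GIL) push a deferred drop onto the stack.
fn push(obj: usize) {
    let node = Box::into_raw(Box::new(Node { obj, next: ptr::null_mut() }));
    let mut head = HEAD.load(Ordering::Relaxed);
    loop {
        unsafe { (*node).next = head };
        match HEAD.compare_exchange_weak(head, node, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => break,
            Err(actual) => head = actual,
        }
    }
}

// The consumer (holding the GIL) steals the whole list with one swap and then
// walks it without any further synchronization.
fn take_all() -> Vec<usize> {
    let mut node = HEAD.swap(ptr::null_mut(), Ordering::Acquire);
    let mut objs = Vec::new();
    while !node.is_null() {
        let boxed = unsafe { Box::from_raw(node) };
        objs.push(boxed.obj);
        node = boxed.next;
    }
    objs
}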
