
Prototype: New epoch algorithm #963

Open · wants to merge 20 commits into master

Conversation

@danielkeller danielkeller commented Feb 25, 2023

Hi! I've come up with a somewhat different epoch algorithm, which performs very similarly to the current one while being much simpler. (It also fixes #551 and might help with #869.) It might need some performance tuning on Linux, Windows, or weakly ordered architectures, but I'm curious to know what you think of the approach, or if you have any ideas to make it faster.

Unlike the current algorithm, it uses a fixed number of "pinned" indicators instead of one per thread. Each additional indicator helps less against contention, especially once there are more indicators than cores. (An interesting experiment would be to pick one based on sched_getcpu(). I didn't try this because my system doesn't support it.)

Also unlike the current algorithm, it uses the ordering of epochs to ensure that garbage can't be simultaneously added and removed for the same epoch. This greatly simplified storing the garbage, because these operations then don't have to be thread-safe with each other.

Finally, it doesn't use any memory ordering stronger than acquire or release. In my opinion this makes it easier to reason about. (It might help performance on ARM, but I don't have one to test it on.)

Internally it uses an approach similar to an RwLock, with reference counters that store the write reference in the high bit and read references in the low bits. Here's how it works in detail:

Steps

To pin a thread

  • Read current epoch
  • Increment the epoch's reference counter, with acquire ordering. If the counter was write-locked, try the next epoch instead.
  • Critical section: read or write the concurrent data structure or defer functions
  • Decrement the reference counter, with release ordering
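The pin/unpin steps above can be sketched roughly as follows. This is not the PR's actual code: `EPOCH`, `REFS`, `pin`, `unpin`, and the ring size `EPOCHS` are illustrative names, and the 16-way sharding of the counters is omitted for brevity (one counter word per epoch).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EPOCHS: usize = 4; // small illustrative ring of epochs
const WRITE_LOCKED: usize = 1 << (usize::BITS - 1); // high bit = write-locked

static EPOCH: AtomicUsize = AtomicUsize::new(0);
static REFS: [AtomicUsize; EPOCHS] = [
    AtomicUsize::new(0), AtomicUsize::new(0),
    AtomicUsize::new(0), AtomicUsize::new(0),
];

/// Pin: read the current epoch and acquire-increment its reference counter.
/// Returns the slot whose counter we hold.
fn pin() -> usize {
    let mut e = EPOCH.load(Ordering::Acquire);
    loop {
        let slot = e % EPOCHS;
        let prev = REFS[slot].fetch_add(1, Ordering::Acquire);
        if prev & WRITE_LOCKED == 0 {
            return slot; // successfully read-locked this epoch
        }
        // Counter was write-locked: undo our increment and try the next epoch.
        REFS[slot].fetch_sub(1, Ordering::Release);
        e += 1;
    }
}

/// Unpin: release-decrement the same counter.
fn unpin(slot: usize) {
    REFS[slot].fetch_sub(1, Ordering::Release);
}
```

Finding the counter write-locked means we raced with an advancing thread; moving on to the next epoch is how the pin step handles that, per the description above.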

Once the local buffer of deferred functions is full enough

  • Push local garbage to the global pile for the current epoch. This is a simple append-only lock free stack.
  • Attempt to increment the epoch
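The "simple append-only lock-free stack" can be sketched as a Treiber stack of per-thread bags. Again a hedged illustration, not the PR's code: `Bag`, `PILE`, `push_bag`, and `drain_pile` are hypothetical names, and plain `u64` values stand in for deferred closures.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// One thread's batch of garbage (values stand in for deferred closures).
struct Bag {
    items: Vec<u64>,
    next: *mut Bag,
}

// The global pile for one epoch: an append-only Treiber stack.
static PILE: AtomicPtr<Bag> = AtomicPtr::new(ptr::null_mut());

// Push a local batch onto the global pile, lock-free.
fn push_bag(items: Vec<u64>) {
    let bag = Box::into_raw(Box::new(Bag { items, next: ptr::null_mut() }));
    let mut head = PILE.load(Ordering::Relaxed);
    loop {
        unsafe { (*bag).next = head };
        // Release publishes the bag's contents to whoever later drains the pile.
        match PILE.compare_exchange_weak(head, bag, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return,
            Err(h) => head = h,
        }
    }
}

// Drain the pile. In the full algorithm this only happens while the epoch's
// counter is write-locked, so it can't race with pushes or other drains.
fn drain_pile() -> Vec<u64> {
    let mut head = PILE.swap(ptr::null_mut(), Ordering::Acquire);
    let mut all = Vec::new();
    while !head.is_null() {
        let Bag { items, next } = *unsafe { Box::from_raw(head) };
        all.extend(items);
        head = next;
    }
    all
}
```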

To advance the epoch (while pinned)

  • Attempt to write-lock the previous epoch's reference counter, with acquire ordering. If this fails, bail out.
  • Clear the garbage pile for the next epoch. This doesn't need to be thread safe.
  • Write-unlock the next epoch's reference counter, with release ordering.
  • Increment the epoch.
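The advance steps above might look like the sketch below, under the same simplifications as before (one counter word per epoch instead of 16 shards; `EPOCH`, `REFS`, and `try_advance` are illustrative names, not the PR's).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EPOCHS: usize = 4;
const WRITE_LOCKED: usize = 1 << (usize::BITS - 1);

// Start at epoch 1 so "previous" is a valid index from the beginning.
static EPOCH: AtomicUsize = AtomicUsize::new(1);
static REFS: [AtomicUsize; EPOCHS] = [
    AtomicUsize::new(0), AtomicUsize::new(0),
    AtomicUsize::new(0), AtomicUsize::new(0),
];

/// Try to advance from the current epoch `e` to `e + 1`; the caller is pinned
/// to `e`. Values in `garbage` stand in for deferred functions.
fn try_advance(garbage: &mut [Vec<u64>; EPOCHS]) -> bool {
    let e = EPOCH.load(Ordering::Acquire);
    let prev = (e + EPOCHS - 1) % EPOCHS;
    let next = (e + 1) % EPOCHS;
    // 1. Attempt to write-lock the previous epoch's counter (acquire).
    if REFS[prev]
        .compare_exchange(0, WRITE_LOCKED, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        return false; // readers still active, or another advancer won
    }
    // 2. Clear the next epoch's garbage pile. No thread-safety needed: in
    //    steady state that counter is still write-locked from an earlier advance.
    garbage[next].clear();
    // 3. Write-unlock the next epoch's counter (release).
    REFS[next].store(0, Ordering::Release);
    // 4. Publish the new epoch.
    EPOCH.store(e + 1, Ordering::Release);
    true
}
```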

Reference counter

The reference counter is divided into 16 shards.

To read-lock, pick a shard and acquire-increment it. If the previous value had the high bit set, fail; if it had the next-to-high bit set, panic (this indicates an overflow). To read-unlock, release-decrement the same shard.

To write-lock, attempt to acquire-CAS each counter from 0 to the high bit. For all but the final counter, if the original value wasn't 0 or HIGH_BIT, fail; for the final counter, if the original value wasn't 0, fail. This allows writers that failed after setting some counters to not cause deadlocks, while the final counter decides which writer wins. To write-unlock, release-set all counters to 0.
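The sharded counter described above can be sketched like this (a hedged illustration; `ShardedCounter` and its methods are names I made up, not the PR's):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const SHARDS: usize = 16;
const HIGH_BIT: usize = 1 << (usize::BITS - 1);     // write lock
const OVERFLOW_BIT: usize = 1 << (usize::BITS - 2); // overflow sentinel

struct ShardedCounter {
    shards: [AtomicUsize; SHARDS],
}

impl ShardedCounter {
    fn new() -> Self {
        ShardedCounter { shards: std::array::from_fn(|_| AtomicUsize::new(0)) }
    }

    /// Read-lock: acquire-increment one shard (chosen per thread).
    fn try_read_lock(&self, shard: usize) -> bool {
        let s = &self.shards[shard % SHARDS];
        let prev = s.fetch_add(1, Ordering::Acquire);
        assert!(prev & OVERFLOW_BIT == 0, "reference count overflow");
        if prev & HIGH_BIT != 0 {
            s.fetch_sub(1, Ordering::Release); // write-locked: undo and fail
            return false;
        }
        true
    }

    /// Read-unlock: release-decrement the same shard.
    fn read_unlock(&self, shard: usize) {
        self.shards[shard % SHARDS].fetch_sub(1, Ordering::Release);
    }

    /// Write-lock: acquire-CAS each shard from 0 to HIGH_BIT. A non-final
    /// shard already holding exactly HIGH_BIT (left over from a writer that
    /// failed later in the scan) is tolerated, so abandoned attempts can't
    /// deadlock; the final shard decides which writer wins.
    fn try_write_lock(&self) -> bool {
        for (i, s) in self.shards.iter().enumerate() {
            match s.compare_exchange(0, HIGH_BIT, Ordering::Acquire, Ordering::Relaxed) {
                Ok(_) => {}
                Err(v) if v == HIGH_BIT && i + 1 < SHARDS => {}
                Err(_) => return false, // an active reader, or we lost the race
            }
        }
        true
    }

    /// Write-unlock: release-set every shard back to 0.
    fn write_unlock(&self) {
        for s in &self.shards {
            s.store(0, Ordering::Release);
        }
    }
}
```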

Proof sketch

As with the classic epoch algorithm, each epoch overlaps the one before and after it (which is required for wait-freedom), but everything in epoch n happens-before everything in epoch n+2. This is because the advancing thread in epoch n+1 write-locks (acquire) epoch n then write-unlocks (release) epoch n+2. A thread in n+2 must advance the epoch to n+3, so n+3 happens-after n, and so on. Thus the latest epoch that could have observed pointers that epoch n unlinked from the data structure is n+1.

Since the advancing thread in n+2 write-locks n+1, it happens-after epoch n+1 as well, thus it happens-after any uses of those pointers, and they are safe to delete. In addition, the advancing thread in n+1 write-locked n, so n is already write-locked from the point of view of the advancing thread in n+2, and no one will touch n's garbage pile until it's unlocked. (I can draw a diagram if it helps.)

@powergee

Thanks for suggesting an interesting approach!

I ran the built-in benchmarks in crossbeam-epoch on my ARM machine and got the following results.

  • CPU: M1 Pro (8 cores)
  • OS: macOS 13.2.1

| Benchmark | Before (99ec614) | After (b933bc9) |
| --- | --- | --- |
| multi_alloc_defer_free | 18,525,504 ns/iter | 3,312,100 ns/iter |
| multi_defer | 1,436,637 ns/iter | 3,470,481 ns/iter |
| single_alloc_defer_free | 59 ns/iter | 28 ns/iter |
| single_defer | 11 ns/iter | 15 ns/iter |
| multi_flush | 8,149,443 ns/iter | 11,759,008 ns/iter |
| single_flush | 58 ns/iter | 12 ns/iter |
| multi_pin | 2,732,046 ns/iter | 3,116,602 ns/iter |
| single_pin | 4 ns/iter | 9 ns/iter |

As we can see, multi_alloc_defer_free became about 6 times faster than before. However, multi_defer and multi_flush became roughly 2.4 and 1.4 times slower, respectively.

I think defer and flush could be bottlenecks, and we need to conduct more robust testing/benchmarks to verify the implementation.

Successfully merging this pull request may close these issues: "The Local structure is 2104 bytes long and jemalloc rounds this to 4096 bytes."