
Prototype: New epoch algorithm #963

Open · wants to merge 20 commits into master

Conversation

@danielkeller danielkeller commented Feb 25, 2023

Hi! I've come up with a somewhat different epoch algorithm, which performs very similarly to the current one while being much simpler. (It also fixes #551 and might help with #869.) It might need some performance tuning on Linux, Windows, or weakly ordered architectures, but I'm curious to know what you think of the approach, or if you have any ideas to make it faster.

Unlike the current algorithm, it uses a fixed number of "pinned" indicators instead of one per thread. Each additional indicator helps less against contention, especially once there are more indicators than cores. (An interesting experiment would be to pick one based on sched_getcpu(). I didn't try this because my system doesn't support it.)

Also unlike the current algorithm, it uses the ordering of epochs to ensure that garbage can't be simultaneously added and removed for the same epoch. This greatly simplified storing the garbage, because these operations then don't have to be thread-safe with each other.

Finally, it doesn't use any memory ordering stronger than acquire or release. In my opinion this makes it easier to reason about. (It might help performance on ARM, but I don't have one to test it on.)

Internally it uses an approach similar to an RwLock, with reference counters that store the write reference in the high bit and read references in the low bits. Here's how it works in detail:

Steps

To pin a thread

  • Read current epoch
  • Increment the epoch's reference counter, with acquire ordering. If the counter was write-locked, try the next epoch instead.
  • Critical section: read or write the concurrent data structure or defer functions
  • Decrement the reference counter, with release ordering
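The pin/unpin steps above can be sketched roughly as follows. This is not the PR's actual code: `EPOCH`, `REFS`, `pin`, `unpin`, and the ring size `EPOCHS` are illustrative names, and the 16-way sharding of the counters is omitted for brevity (one counter word per epoch).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EPOCHS: usize = 4; // small illustrative ring of epochs
const WRITE_LOCKED: usize = 1 << (usize::BITS - 1); // high bit = write-locked

static EPOCH: AtomicUsize = AtomicUsize::new(0);
static REFS: [AtomicUsize; EPOCHS] = [
    AtomicUsize::new(0), AtomicUsize::new(0),
    AtomicUsize::new(0), AtomicUsize::new(0),
];

/// Pin: read the current epoch and acquire-increment its reference counter.
/// Returns the slot whose counter we hold.
fn pin() -> usize {
    let mut e = EPOCH.load(Ordering::Acquire);
    loop {
        let slot = e % EPOCHS;
        let prev = REFS[slot].fetch_add(1, Ordering::Acquire);
        if prev & WRITE_LOCKED == 0 {
            return slot; // successfully read-locked this epoch
        }
        // Counter was write-locked: undo our increment and try the next epoch.
        REFS[slot].fetch_sub(1, Ordering::Release);
        e += 1;
    }
}

/// Unpin: release-decrement the same counter.
fn unpin(slot: usize) {
    REFS[slot].fetch_sub(1, Ordering::Release);
}
```

Finding the counter write-locked means we raced with an advancing thread; moving on to the next epoch is how the pin step handles that, per the description above.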

Once the local buffer of deferred functions is full enough

  • Push local garbage to the global pile for the current epoch. This is a simple append-only lock free stack.
  • Attempt to increment the epoch
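The "simple append-only lock-free stack" can be sketched as a Treiber stack of per-thread bags. Again a hedged illustration, not the PR's code: `Bag`, `PILE`, `push_bag`, and `drain_pile` are hypothetical names, and plain `u64` values stand in for deferred closures.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// One thread's batch of garbage (values stand in for deferred closures).
struct Bag {
    items: Vec<u64>,
    next: *mut Bag,
}

// The global pile for one epoch: an append-only Treiber stack.
static PILE: AtomicPtr<Bag> = AtomicPtr::new(ptr::null_mut());

// Push a local batch onto the global pile, lock-free.
fn push_bag(items: Vec<u64>) {
    let bag = Box::into_raw(Box::new(Bag { items, next: ptr::null_mut() }));
    let mut head = PILE.load(Ordering::Relaxed);
    loop {
        unsafe { (*bag).next = head };
        // Release publishes the bag's contents to whoever later drains the pile.
        match PILE.compare_exchange_weak(head, bag, Ordering::Release, Ordering::Relaxed) {
            Ok(_) => return,
            Err(h) => head = h,
        }
    }
}

// Drain the pile. In the full algorithm this only happens while the epoch's
// counter is write-locked, so it can't race with pushes or other drains.
fn drain_pile() -> Vec<u64> {
    let mut head = PILE.swap(ptr::null_mut(), Ordering::Acquire);
    let mut all = Vec::new();
    while !head.is_null() {
        let Bag { items, next } = *unsafe { Box::from_raw(head) };
        all.extend(items);
        head = next;
    }
    all
}
```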

To advance the epoch (while pinned)

  • Attempt to write-lock the previous epoch's reference counter, with acquire ordering. If this fails, bail out.
  • Clear the garbage pile for the next epoch. This doesn't need to be thread safe.
  • Write-unlock the next epoch's reference counter, with release ordering.
  • Increment the epoch.
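The advance steps above might look like the sketch below, under the same simplifications as before (one counter word per epoch instead of 16 shards; `EPOCH`, `REFS`, and `try_advance` are illustrative names, not the PR's).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const EPOCHS: usize = 4;
const WRITE_LOCKED: usize = 1 << (usize::BITS - 1);

// Start at epoch 1 so "previous" is a valid index from the beginning.
static EPOCH: AtomicUsize = AtomicUsize::new(1);
static REFS: [AtomicUsize; EPOCHS] = [
    AtomicUsize::new(0), AtomicUsize::new(0),
    AtomicUsize::new(0), AtomicUsize::new(0),
];

/// Try to advance from the current epoch `e` to `e + 1`; the caller is pinned
/// to `e`. Values in `garbage` stand in for deferred functions.
fn try_advance(garbage: &mut [Vec<u64>; EPOCHS]) -> bool {
    let e = EPOCH.load(Ordering::Acquire);
    let prev = (e + EPOCHS - 1) % EPOCHS;
    let next = (e + 1) % EPOCHS;
    // 1. Attempt to write-lock the previous epoch's counter (acquire).
    if REFS[prev]
        .compare_exchange(0, WRITE_LOCKED, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {
        return false; // readers still active, or another advancer won
    }
    // 2. Clear the next epoch's garbage pile. No thread-safety needed: in
    //    steady state that counter is still write-locked from an earlier advance.
    garbage[next].clear();
    // 3. Write-unlock the next epoch's counter (release).
    REFS[next].store(0, Ordering::Release);
    // 4. Publish the new epoch.
    EPOCH.store(e + 1, Ordering::Release);
    true
}
```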

Reference counter

The reference counter is divided into 16 shards.

To read-lock, pick a shard and acquire-increment it. If the previous value had the high bit set, fail; if it had the next-to-high bit set, panic (this indicates an overflow). To read-unlock, release-decrement the same shard.

To write-lock, attempt to acquire-CAS each counter from 0 to the high bit. For all but the final counter, if the original value wasn't 0 or HIGH_BIT, fail; for the final counter, if the original value wasn't 0, fail. This allows writers that failed after setting some counters to not cause deadlocks, while the final counter decides which writer wins. To write-unlock, release-set all counters to 0.
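The sharded counter described above can be sketched like this (a hedged illustration; `ShardedCounter` and its methods are names I made up, not the PR's):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const SHARDS: usize = 16;
const HIGH_BIT: usize = 1 << (usize::BITS - 1);     // write lock
const OVERFLOW_BIT: usize = 1 << (usize::BITS - 2); // overflow sentinel

struct ShardedCounter {
    shards: [AtomicUsize; SHARDS],
}

impl ShardedCounter {
    fn new() -> Self {
        ShardedCounter { shards: std::array::from_fn(|_| AtomicUsize::new(0)) }
    }

    /// Read-lock: acquire-increment one shard (chosen per thread).
    fn try_read_lock(&self, shard: usize) -> bool {
        let s = &self.shards[shard % SHARDS];
        let prev = s.fetch_add(1, Ordering::Acquire);
        assert!(prev & OVERFLOW_BIT == 0, "reference count overflow");
        if prev & HIGH_BIT != 0 {
            s.fetch_sub(1, Ordering::Release); // write-locked: undo and fail
            return false;
        }
        true
    }

    /// Read-unlock: release-decrement the same shard.
    fn read_unlock(&self, shard: usize) {
        self.shards[shard % SHARDS].fetch_sub(1, Ordering::Release);
    }

    /// Write-lock: acquire-CAS each shard from 0 to HIGH_BIT. A non-final
    /// shard already holding exactly HIGH_BIT (left over from a writer that
    /// failed later in the scan) is tolerated, so abandoned attempts can't
    /// deadlock; the final shard decides which writer wins.
    fn try_write_lock(&self) -> bool {
        for (i, s) in self.shards.iter().enumerate() {
            match s.compare_exchange(0, HIGH_BIT, Ordering::Acquire, Ordering::Relaxed) {
                Ok(_) => {}
                Err(v) if v == HIGH_BIT && i + 1 < SHARDS => {}
                Err(_) => return false, // an active reader, or we lost the race
            }
        }
        true
    }

    /// Write-unlock: release-set every shard back to 0.
    fn write_unlock(&self) {
        for s in &self.shards {
            s.store(0, Ordering::Release);
        }
    }
}
```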

Proof sketch

As with the classic epoch algorithm, each epoch overlaps the one before and after it (which is required for wait-freedom), but everything in epoch n happens-before everything in epoch n+2. This is because the advancing thread in epoch n+1 write-locks (acquire) epoch n then write-unlocks (release) epoch n+2. A thread in n+2 must advance the epoch to n+3, so n+3 happens-after n, and so on. Thus the latest epoch that could have observed pointers that epoch n unlinked from the data structure is n+1.

Since the advancing thread in n+2 write-locks n+1, it happens-after epoch n+1 as well, thus it happens-after any uses of those pointers, and they are safe to delete. In addition, the advancing thread in n+1 write-locked n, so n is already write-locked from the point of view of the advancing thread in n+2, and no one will touch n's garbage pile until it's unlocked. (I can draw a diagram if it helps.)

@powergee

Thanks for suggesting an interesting approach!

I ran the built-in benchmarks in crossbeam-epoch on my ARM machine and got the following results.

  • CPU: M1 Pro (8 cores)
  • OS: macOS 13.2.1

| Benchmark | Before (99ec614) | After (b933bc9) |
| --- | --- | --- |
| multi_alloc_defer_free | 18,525,504 ns/iter | 3,312,100 ns/iter |
| multi_defer | 1,436,637 ns/iter | 3,470,481 ns/iter |
| single_alloc_defer_free | 59 ns/iter | 28 ns/iter |
| single_defer | 11 ns/iter | 15 ns/iter |
| multi_flush | 8,149,443 ns/iter | 11,759,008 ns/iter |
| single_flush | 58 ns/iter | 12 ns/iter |
| multi_pin | 2,732,046 ns/iter | 3,116,602 ns/iter |
| single_pin | 4 ns/iter | 9 ns/iter |

As we can see, multi_alloc_defer_free became about 6 times faster than before. However, multi_defer and multi_flush became roughly 2.4 and 1.4 times slower, respectively.

I think defer and flush could be bottlenecks, and we need to conduct more robust testing/benchmarks to verify the implementation.

Successfully merging this pull request may close these issues: "The Local structure is 2104 bytes long and jemalloc rounds this to 4096 bytes."