
Refactor #560 (Draft)

matheusd wants to merge 16 commits into main
Conversation

matheusd (Contributor) commented Apr 5, 2024

Requires #555, #556

Tests are elided and this is opened as a draft to get a first-pass review.

The diff is large(ish), so it may be easier to read the full code rather than the diff at the moment. The basic idea of the refactor is the following:

Split allocator strategy and segment management

Using the bufferpool should not be forced upon the caller. In fact, it is dangerous to use the current single/multi-segment arena implementations with a buffer that was allocated anywhere else, for example an mmapped file buffer.

Splitting the allocator into its own abstraction means we can define different allocation strategies (bufferpool, regular runtime allocation, a simpler caching strategy, read-only, etc.).

Unfortunately, due to some tests failing otherwise, I couldn't unify literally everything inside the allocator (see further below for discussion).
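
To make that concrete, here is a minimal sketch of what such an allocator abstraction could look like; the interface name and method signatures are illustrative, not the actual API in this PR:

```go
package arena

// Allocator is a hypothetical strategy interface: it hands out byte
// slices backing new segments and takes them back on release. Concrete
// implementations could be backed by the bufferpool, plain runtime
// allocation (make + GC), a simpler caching scheme, or a read-only
// no-op variant.
type Allocator interface {
	// Allocate returns a zeroed buffer with capacity for at least
	// minSize bytes.
	Allocate(minSize int) ([]byte, error)

	// Release hands a buffer obtained from Allocate back, so the
	// strategy can reuse or discard it.
	Release(buf []byte)
}
```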

Base arena implementation

SingleSegment and MultiSegment arenas have been unified into a single arena implementation that offloads the logic to the allocator and the segment list.

So the full matrix of Single/Multi and BufferPool/runtime-backed options can be exercised.
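
As a sketch (again, hypothetical names rather than the exact types in the diff), the unified arena is then little more than glue between an allocator and its segment list:

```go
package arena

// Arena sketches a single arena type that covers both the single- and
// multi-segment cases: segment bookkeeping lives in segs, and where
// the bytes come from is decided entirely by the injected Allocator.
type Arena struct {
	alloc Allocator // bufferpool-, runtime- or read-only-backed strategy
	segs  [][]byte  // one entry per segment; stays at one for single-segment use
}

// NewSegment asks the allocator for backing memory and registers it as
// the next segment, returning its id and data.
func (a *Arena) NewSegment(minSize int) (id int, data []byte, err error) {
	data, err = a.alloc.Allocate(minSize)
	if err != nil {
		return 0, nil, err
	}
	a.segs = append(a.segs, data)
	return len(a.segs) - 1, data, nil
}
```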

Reduce message complexity

Most decisions have been offloaded from Message into the arena, segment list, or allocator. This makes Message more generic and easier to reason about: in particular, it no longer cares how many segments there are during init (see further below for discussion on Release), nor about the roundabout way it used to initialize the first segment.

Test compatibility and issues

All the existing tests pass. A few that are no longer applicable (dealing with the concrete arena implementations) have been commented out. One test (TestPromiseOrdering) is skipped because it's flaky even in the current main branch.

Other than those, the code has been specifically designed not to require changes in the existing tests, and should therefore ensure full compatibility with existing code.

Tests for the new features haven't been written yet (but will be if this is deemed to be going in the right direction).

Message.Release is full of special cases

The main source of frustration during this rewrite is that Message.Release is full of special cases, mostly to deal with initializing a message for writing. These are cases I had to add to avoid touching the existing tests, and they are now documented in the code, after a FIXME(matheusd) line in that function.

Personally, I think my ReleaseForRead() should be the actual implementation of Release, but in the interest of not breaking client code, I opted to add a new function instead.

This is also somewhat the reason for having to add a ReadOnlySingleSegmentArena instead of using a read-only allocator: Release() is (currently) expected to check that the arena is clear and to re-allocate the first segment (i.e. the "Prove Reset() cannot be used to reset a read-only message" commit), so I had to go out of my way to create an arena that makes read-only use easier while reusing the Message struct.
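
For illustration, the intended split looks roughly like this; ReleaseForRead is the new method added in this PR, and the exact signatures here are assumed:

```go
package example

import capnp "capnproto.org/go/capnp/v3"

// releaseAfterRead sketches the read path: once the caller is done
// reading, the arena's buffers just go back to the allocator; no first
// segment needs to be re-allocated. ReleaseForRead is the method
// proposed in this PR, shown here with an assumed niladic signature.
func releaseAfterRead(msg *capnp.Message) {
	msg.ReleaseForRead()
}

// releaseForReuse keeps today's Release semantics: the message is left
// ready to be written to again, which is where the special cases
// described above come from.
func releaseForReuse(msg *capnp.Message) {
	msg.Release()
}
```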

This commit switches the bufferpool to use the zeropool implementation
for sync pools.

The stdlib sync.Pool implementation has an issue where it causes an
additional heap allocation per Put() call when used with byte slices.

The github.com/colega/zeropool package has been specifically designed to work around this issue, which reduces GC pressure and improves performance.

This also fixes the bufferpool package's benchmark to use a new pool per test (to avoid other tests influencing the benchmark's behavior) and sets it to report allocations.
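
For reference, a minimal sketch contrasting the two pools for byte slices, assuming the zeropool API (New, Get, Put) as documented by that package:

```go
package main

import (
	"sync"

	"github.com/colega/zeropool"
)

func main() {
	// sync.Pool stores interface{} values, so putting a []byte back
	// boxes the slice header, which escapes to the heap on every Put.
	stdPool := sync.Pool{New: func() any { return make([]byte, 0, 1024) }}
	buf := stdPool.Get().([]byte)
	stdPool.Put(buf) // one extra heap allocation per call

	// zeropool is generic over the element type, so []byte values are
	// stored without boxing and Put does not allocate.
	zp := zeropool.New(func() []byte { return make([]byte, 0, 1024) })
	buf2 := zp.Get()
	zp.Put(buf2)
}
```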
This fixes the Message's Reset() call to allow reuse of the first
segment.

Prior to this fix, the first segment was discarded after the first Reset
call, effectively causing a new segment to be initialized on every Reset
call.

By reusing the first segment, the number of heap allocations is reduced
and therefore performance is increased in use cases where the message
object is reused.

The fix involved associating the segment with the message and fixing checks to ensure the data of the segment is re-allocated after the reset.

A benchmark is included to show the current performance of this.
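
The included benchmark has roughly this shape (a sketch; the benchmark name and the return values of Reset are assumed here):

```go
package capnp_test

import (
	"testing"

	capnp "capnproto.org/go/capnp/v3"
)

// BenchmarkMessageReset sketches the benchmark described above: the
// same message and arena are reset each iteration, so with the first
// segment being reused the loop should report zero allocations per op.
func BenchmarkMessageReset(b *testing.B) {
	arena := capnp.SingleSegment(nil)
	msg, _, err := capnp.NewMessage(arena)
	if err != nil {
		b.Fatal(err)
	}

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Reset's signature is assumed; the point is that after this
		// fix the first segment's buffer is kept and reused.
		if _, err := msg.Reset(arena); err != nil {
			b.Fatal(err)
		}
	}
}
```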
Momentarily, while refactoring is going on.
This shows that the BenchmarkUnmarshal_Reuse is broken.
The prior version isn't commented and it's hard to reason about.
This makes the captable release more efficient, avoiding unnecessary allocations.
This avoids some duffcopy calls and improves perf.
matheusd (Contributor, Author) commented Apr 26, 2024

Here is another find.

NewStruct() followed by SetRoot() is pretty inefficient. All the struct copying going on around the calls is making them somewhat heavy. Unrolling the allocation and the setting of the root pointer into a single function more than triples the performance in CPU terms:

BenchmarkMessageSetRoot-7          3173535      380.7 ns/op    0 B/op     0 allocs/op
BenchmarkMessageAllocateAsRoot-7   9872070      121.0 ns/op    0 B/op     0 allocs/op

Additionally, this makes it much easier to reason about what "setting the root of the message" entails and the code is much flatter.

Also, this removes the need to have the root pointer allocated on Reset(), moving the message to an "allocate on first set" paradigm and getting rid of some of the quirks in Reset().

The new AllocateAsRoot() can easily replace the existing code by modifying the generator to use it for the NewRootXXXX calls.

The greatest benefit of this would be for code that continually instantiates new messages for writing (which necessarily implies setting the root of the new message).
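
As a sketch of the difference between the current two-step path (NewStruct followed by SetRoot) and the unrolled one, with AllocateAsRoot being the new method from this PR and its signature assumed for illustration:

```go
package example

import capnp "capnproto.org/go/capnp/v3"

// setRootToday shows the current two-step path: allocate a struct in a
// segment, then attach it as the message root in a separate call, with
// struct headers being copied around between the two.
func setRootToday(msg *capnp.Message, seg *capnp.Segment) (capnp.Struct, error) {
	st, err := capnp.NewStruct(seg, capnp.ObjectSize{DataSize: 8, PointerCount: 2})
	if err != nil {
		return capnp.Struct{}, err
	}
	return st, msg.SetRoot(st.ToPtr())
}

// setRootProposed is the unrolled path: a single call that allocates
// the struct directly as the message root. AllocateAsRoot is the new
// method proposed here; generated NewRootXXXX functions would call it.
func setRootProposed(msg *capnp.Message) (capnp.Struct, error) {
	return msg.AllocateAsRoot(capnp.ObjectSize{DataSize: 8, PointerCount: 2})
}
```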

matheusd (Contributor, Author) commented May 1, 2024

For the next update, I introduce an unrolled version of SetNewText, which shaves about 100ns of CPU time off setting a text field in a struct.

The SimpleSingleSegmentArena is a simpler implementation of the single-segment arena that gets rid of all the indirection and caching for allocation, so when it is given a correctly sized buffer it ends up faster than the standard arena implementation.
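
The core idea behind it, sketched with illustrative names (the real type implements the full arena interface):

```go
package arena

// simpleSingleSegmentArena sketches the idea described above: one
// pre-sized buffer, no bufferpool and no allocation caching. When the
// caller provides a buffer big enough for the whole message, every
// allocation is just a re-slice of buf.
type simpleSingleSegmentArena struct {
	buf  []byte // the single segment's data
	used int    // bytes already handed out
}

// allocate returns the next n bytes of the buffer, or reports that the
// buffer is too small and the caller must grow it (or fail).
func (a *simpleSingleSegmentArena) allocate(n int) ([]byte, bool) {
	if a.used+n > len(a.buf) {
		return nil, false
	}
	out := a.buf[a.used : a.used+n]
	a.used += n
	return out, true
}
```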

Bringing together everything so far, we get the following benchmark results, which can be compared to the ones in #554:

(Earlier attempts)
BenchmarkSetText01-7     6966074               174.9 ns/op            99 B/op          0 allocs/op
BenchmarkSetText02-7     1000000                1100 ns/op           352 B/op          5 allocs/op
BenchmarkSetText03-7     2315973               517.9 ns/op            72 B/op          2 allocs/op
BenchmarkSetText04-7     2661026               429.9 ns/op           260 B/op          0 allocs/op

(With the refactor)
BenchmarkSetTextFlat-7  26186197               50.69 ns/op             0 B/op          0 allocs/op

matheusd (Contributor, Author) commented May 3, 2024

And for the next update, I introduce an unrolled UpdateText version which reuses the storage for a string instead of always allocating a new one. This removes the need to reset the arena if you're rewriting a single field in a large structure.

This significantly reduces the latency for an operation that consists only of updating a field, down to only ~16ns:

BenchmarkSetTextUpdate
BenchmarkSetTextUpdate-7        80923449                15.90 ns/op            0 B/op          0 allocs/op
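
The reuse check amounts to something like the following hypothetical helper (the real UpdateText works on the struct's pointer section rather than on a raw slice):

```go
package example

// updateTextInPlace illustrates the idea above: if the new value fits
// in the storage already reserved for the text field (including the
// trailing NUL that Cap'n Proto text requires), copy over it instead
// of allocating a fresh text object in the arena.
func updateTextInPlace(dst []byte, s string) bool {
	if len(s)+1 > len(dst) {
		return false // does not fit; caller falls back to allocating
	}
	n := copy(dst, s)
	// Clear the remainder so no bytes of a longer, older value leak.
	for i := n; i < len(dst); i++ {
		dst[i] = 0
	}
	return true
}
```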

lthibault (Collaborator) commented:
@matheusd Thank you so much for all this incredible work! I'll review this asap and make sure it gets the attention it deserves. Probably won't be today, but know that you're on my radar! 🙂 🙏

matheusd (Contributor, Author) commented May 4, 2024

Sure thing. I still have at least one additional idea to test out to make this particular workflow faster.

Also note that these are all WIP, and especially the later pushes are experimental changes that I wouldn't expect to be merged as-is, but rather used as a baseline for refactoring the code to reach these benchmark results.

The TextField is a reference to a specific text field inside a struct.
It records both the pointer and value locations inside a struct, which
may be used to fetch or update the underlying value.
matheusd (Contributor, Author) commented May 6, 2024

Pushed a new experiment that adds a TextField definition, which keeps track of the pointer and value locations of a specific field (basically a slimmed down Struct).

By keeping these around, we can forgo the most costly operation in UpdateText (the resolveFarPointer call), verify directly whether we can copy the new string, and then replace it in place.

This gets to within 2.2x of a baseline benchmark that just copies the data into a slice:

BenchmarkSetTextAsField-7       281488827                4.248 ns/op           0 B/op          0 allocs/op
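
The shape of such a TextField, sketched with illustrative field names:

```go
package example

// textField sketches the TextField idea above: it caches where a
// specific text field's pointer word and value bytes live, so updates
// can skip pointer resolution (the resolveFarPointer call) entirely.
type textField struct {
	seg      []byte // segment data, resolved once when the field is captured
	valueOff int    // offset of the text's bytes within seg
	valueCap int    // capacity reserved for the value, including the NUL
	ptrOff   int    // offset of the pointer word referring to the value
}

// set copies s over the cached value location when it fits; otherwise
// it reports that the caller must fall back to a regular SetText.
func (f *textField) set(s string) bool {
	if len(s)+1 > f.valueCap {
		return false
	}
	n := copy(f.seg[f.valueOff:], s)
	// Zero the rest of the reserved space, including the terminator.
	for i := f.valueOff + n; i < f.valueOff+f.valueCap; i++ {
		f.seg[i] = 0
	}
	return true
}
```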

lthibault (Collaborator) commented:
@matheusd I'm getting ready to dive into this, and firstly wanted to thank you again for the detailed overview.

Now that you seem to be converging on an implementation, I'm wondering if there's any part of this that we can break off into a smaller PR and merge separately?
