
Refactor #560 (Draft)

matheusd wants to merge 16 commits into main
Conversation

matheusd (Contributor) commented Apr 5, 2024

Requires #555, #556

Tests are elided and this is opened as a draft to get a first-pass review.

The diff is large(ish), so it may be easier to read the full code rather than the diff at the moment. The basic idea of the refactor is the following:

Split allocator strategy and segment management

Using the bufferpool should not be forced upon the caller. In fact, it is dangerous to use the current single/multi-segment arena implementations with a buffer that was allocated anywhere else, for example an mmapped file buffer.

Splitting the allocator into its own abstraction means we can define different allocation strategies (bufferpool, regular runtime allocation, a simpler caching strategy, read-only, etc.).

Unfortunately, due to some tests failing otherwise, I couldn't unify literally everything inside the allocator (see further below for discussion).
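
To make that concrete, here is a minimal sketch of what such an allocator abstraction could look like; the interface name and method signatures are illustrative, not the actual API in this PR:

```go
package arena

// Allocator is a hypothetical strategy interface: it hands out byte
// slices backing new segments and takes them back on release. Concrete
// implementations could be backed by the bufferpool, plain runtime
// allocation (make + GC), a simpler caching scheme, or a read-only
// no-op variant.
type Allocator interface {
	// Allocate returns a zeroed buffer with capacity for at least
	// minSize bytes.
	Allocate(minSize int) ([]byte, error)

	// Release hands a buffer obtained from Allocate back, so the
	// strategy can reuse or discard it.
	Release(buf []byte)
}
```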

Base arena implementation

SingleSegment and MultiSegment arenas have been unified into a single arena implementation that offloads the logic to the allocator and the segment list.

So the full matrix of Single/Multi and BufferPool/runtime-backed options can be exercised.
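
As a sketch (again, hypothetical names rather than the exact types in the diff), the unified arena is then little more than glue between an allocator and its segment list:

```go
package arena

// Arena sketches a single arena type that covers both the single- and
// multi-segment cases: segment bookkeeping lives in segs, and where
// the bytes come from is decided entirely by the injected Allocator.
type Arena struct {
	alloc Allocator // bufferpool-, runtime- or read-only-backed strategy
	segs  [][]byte  // one entry per segment; stays at one for single-segment use
}

// NewSegment asks the allocator for backing memory and registers it as
// the next segment, returning its id and data.
func (a *Arena) NewSegment(minSize int) (id int, data []byte, err error) {
	data, err = a.alloc.Allocate(minSize)
	if err != nil {
		return 0, nil, err
	}
	a.segs = append(a.segs, data)
	return len(a.segs) - 1, data, nil
}
```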

Reduce message complexity

Most decisions have been offloaded from Message into the arena, segment list, or allocator. This makes Message more generic and easier to reason about: in particular, it no longer cares how many segments there are during init (see further below for discussion on Release), nor about the roundabout way it used to initialize the first segment.

Test compatibility and issues

All the existing tests pass. A few that are no longer applicable (dealing with the concrete arena implementations) have been commented out. One test (TestPromiseOrdering) is skipped because it's flaky even in the current main branch.

Other than those, the code has been specifically designed not to require changes in the existing tests, and should therefore ensure full compatibility with existing code.

Tests for the new features haven't been written yet (but will be if this is deemed to be going in the right direction).

Message.Release is full of special cases

The main source of frustration during this rewrite is that Message.Release is full of special cases, mostly to deal with initializing a message for writing. These are cases I had to add to avoid touching the existing tests, and they are now documented in the code, after a FIXME(matheusd) line in that function.

Personally, I think my ReleaseForRead() should be the actual implementation of Release, but in the interest of not breaking client code, I opted to add a new function instead.

This is also somewhat the reason for having to add a ReadOnlySingleSegmentArena instead of using a read-only allocator: Release() is (currently) expected to check that the arena is clear and to re-allocate the first segment (i.e. the "Prove Reset() cannot be used to reset a read-only message" commit), so I had to go out of my way to create an arena that makes read-only use easier while reusing the Message struct.
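
For illustration, the intended split looks roughly like this; ReleaseForRead is the new method added in this PR, and the exact signatures here are assumed:

```go
package example

import capnp "capnproto.org/go/capnp/v3"

// releaseAfterRead sketches the read path: once the caller is done
// reading, the arena's buffers just go back to the allocator; no first
// segment needs to be re-allocated. ReleaseForRead is the method
// proposed in this PR, shown here with an assumed niladic signature.
func releaseAfterRead(msg *capnp.Message) {
	msg.ReleaseForRead()
}

// releaseForReuse keeps today's Release semantics: the message is left
// ready to be written to again, which is where the special cases
// described above come from.
func releaseForReuse(msg *capnp.Message) {
	msg.Release()
}
```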

This commit switches the bufferpool to use the zeropool implementation
for sync pools.

The stdlib sync.Pool implementation has an issue where it causes an
additional heap allocation per Put() call when used with byte slices.

The github.com/colega/zeropool package has been specifically designed to work around this issue, which reduces GC pressure and improves performance.

This also fixes the bufferpool package's benchmark to use a new pool per test (to avoid other tests influencing the benchmark's behavior) and sets it to report allocations.
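
For reference, a minimal sketch contrasting the two pools for byte slices, assuming the zeropool API (New, Get, Put) as documented by that package:

```go
package main

import (
	"sync"

	"github.com/colega/zeropool"
)

func main() {
	// sync.Pool stores interface{} values, so putting a []byte back
	// boxes the slice header, which escapes to the heap on every Put.
	stdPool := sync.Pool{New: func() any { return make([]byte, 0, 1024) }}
	buf := stdPool.Get().([]byte)
	stdPool.Put(buf) // one extra heap allocation per call

	// zeropool is generic over the element type, so []byte values are
	// stored without boxing and Put does not allocate.
	zp := zeropool.New(func() []byte { return make([]byte, 0, 1024) })
	buf2 := zp.Get()
	zp.Put(buf2)
}
```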
This fixes the Message's Reset() call to allow reuse of the first
segment.

Prior to this fix, the first segment was discarded after the first Reset
call, effectively causing a new segment to be initialized on every Reset
call.

By reusing the first segment, the number of heap allocations is reduced
and therefore performance is increased in use cases where the message
object is reused.

The fix involved associating the segment with the message and fixing checks to ensure the data of the segment is re-allocated after the reset.

A benchmark is included to show the current performance of this.
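
The included benchmark has roughly this shape (a sketch; the benchmark name and the return values of Reset are assumed here):

```go
package capnp_test

import (
	"testing"

	capnp "capnproto.org/go/capnp/v3"
)

// BenchmarkMessageReset sketches the benchmark described above: the
// same message and arena are reset each iteration, so with the first
// segment being reused the loop should report zero allocations per op.
func BenchmarkMessageReset(b *testing.B) {
	arena := capnp.SingleSegment(nil)
	msg, _, err := capnp.NewMessage(arena)
	if err != nil {
		b.Fatal(err)
	}

	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		// Reset's signature is assumed; the point is that after this
		// fix the first segment's buffer is kept and reused.
		if _, err := msg.Reset(arena); err != nil {
			b.Fatal(err)
		}
	}
}
```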
Momentarily, while refactoring is going on.
This shows that the BenchmarkUnmarshal_Reuse is broken.
The prior version isn't commented and it's hard to reason about.
This makes the captable release more efficient, avoiding unnecessary allocations.
This avoids some duffcopy calls and improves perf.
matheusd (Contributor, Author) commented Apr 26, 2024

Here is another find.

NewStruct() followed by SetRoot() is pretty inefficient. All the struct copying going on around the calls is making them somewhat heavy. Unrolling the allocation and the setting of the root pointer into a single function more than triples the performance in CPU terms:

BenchmarkMessageSetRoot-7          3173535      380.7 ns/op    0 B/op     0 allocs/op
BenchmarkMessageAllocateAsRoot-7   9872070      121.0 ns/op    0 B/op     0 allocs/op

Additionally, this makes it much easier to reason about what "setting the root of the message" entails and the code is much flatter.

Also, this removes the need to have the root pointer allocated on Reset(), moving the message to an "allocate on first set" paradigm and getting rid of some of the quirks in Reset().

The new AllocateAsRoot() can easily replace the existing code by modifying the generator to use it for the NewRootXXXX calls.

The greatest benefit of this would be for code that continually instantiates new messages for writing (which necessarily implies setting the root of the new message).
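
As a sketch of the difference between the current two-step path (NewStruct followed by SetRoot) and the unrolled one, with AllocateAsRoot being the new method from this PR and its signature assumed for illustration:

```go
package example

import capnp "capnproto.org/go/capnp/v3"

// setRootToday shows the current two-step path: allocate a struct in a
// segment, then attach it as the message root in a separate call, with
// struct headers being copied around between the two.
func setRootToday(msg *capnp.Message, seg *capnp.Segment) (capnp.Struct, error) {
	st, err := capnp.NewStruct(seg, capnp.ObjectSize{DataSize: 8, PointerCount: 2})
	if err != nil {
		return capnp.Struct{}, err
	}
	return st, msg.SetRoot(st.ToPtr())
}

// setRootProposed is the unrolled path: a single call that allocates
// the struct directly as the message root. AllocateAsRoot is the new
// method proposed here; generated NewRootXXXX functions would call it.
func setRootProposed(msg *capnp.Message) (capnp.Struct, error) {
	return msg.AllocateAsRoot(capnp.ObjectSize{DataSize: 8, PointerCount: 2})
}
```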

matheusd (Contributor, Author) commented May 1, 2024

For the next update, I introduce an unrolled version of SetNewText, which shaves about 100ns of CPU time off setting a text field in a struct.

The SimpleSingleSegmentArena is a simpler implementation of the single-segment arena that gets rid of all the indirection and caching for allocation, so when it is given a correctly sized buffer it ends up faster than the standard arena implementation.
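
The core idea behind it, sketched with illustrative names (the real type implements the full arena interface):

```go
package arena

// simpleSingleSegmentArena sketches the idea described above: one
// pre-sized buffer, no bufferpool and no allocation caching. When the
// caller provides a buffer big enough for the whole message, every
// allocation is just a re-slice of buf.
type simpleSingleSegmentArena struct {
	buf  []byte // the single segment's data
	used int    // bytes already handed out
}

// allocate returns the next n bytes of the buffer, or reports that the
// buffer is too small and the caller must grow it (or fail).
func (a *simpleSingleSegmentArena) allocate(n int) ([]byte, bool) {
	if a.used+n > len(a.buf) {
		return nil, false
	}
	out := a.buf[a.used : a.used+n]
	a.used += n
	return out, true
}
```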

Bringing together everything so far, we get the following benchmark results, which can be compared to the ones in #554:

(Earlier attempts)
BenchmarkSetText01-7     6966074               174.9 ns/op            99 B/op          0 allocs/op
BenchmarkSetText02-7     1000000                1100 ns/op           352 B/op          5 allocs/op
BenchmarkSetText03-7     2315973               517.9 ns/op            72 B/op          2 allocs/op
BenchmarkSetText04-7     2661026               429.9 ns/op           260 B/op          0 allocs/op

(With the refactor)
BenchmarkSetTextFlat-7  26186197               50.69 ns/op             0 B/op          0 allocs/op

matheusd (Contributor, Author) commented May 3, 2024

And for the next update, I introduce an unrolled UpdateText version which reuses the storage for a string instead of always allocating a new one. This removes the need to reset the arena if you're rewriting a single field in a large structure.

This significantly reduces the latency for an operation that consists only of updating a field, down to only ~16ns:

BenchmarkSetTextUpdate
BenchmarkSetTextUpdate-7        80923449                15.90 ns/op            0 B/op          0 allocs/op
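
The reuse check amounts to something like the following hypothetical helper (the real UpdateText works on the struct's pointer section rather than on a raw slice):

```go
package example

// updateTextInPlace illustrates the idea above: if the new value fits
// in the storage already reserved for the text field (including the
// trailing NUL that Cap'n Proto text requires), copy over it instead
// of allocating a fresh text object in the arena.
func updateTextInPlace(dst []byte, s string) bool {
	if len(s)+1 > len(dst) {
		return false // does not fit; caller falls back to allocating
	}
	n := copy(dst, s)
	// Clear the remainder so no bytes of a longer, older value leak.
	for i := n; i < len(dst); i++ {
		dst[i] = 0
	}
	return true
}
```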

lthibault (Collaborator) commented:
@matheusd Thank you so much for all this incredible work! I'll review this asap and make sure it gets the attention it deserves. Probably won't be today, but know that you're on my radar! 🙂 🙏

matheusd (Contributor, Author) commented May 4, 2024

Sure thing. I still have at least one additional idea to test out to make this particular workflow faster.

Also note that these are all WIP, and especially the later pushes are experimental changes that I wouldn't expect to be merged as-is, but rather used as a baseline for refactoring the code to reach these benchmark results.

The TextField is a reference to a specific text field inside a struct.
It records both the pointer and value locations inside a struct, which
may be used to fetch or update the underlying value.
matheusd (Contributor, Author) commented May 6, 2024

Pushed a new experiment that adds a TextField definition, which keeps track of the pointer and value locations of a specific field (basically a slimmed down Struct).

By keeping these around, we can forgo the most costly operation in UpdateText (the resolveFarPointer call), verify directly whether we can copy the new string, and then replace it in place.

This gets to within 2.2x of a baseline benchmark that just copies the data into a slice:

BenchmarkSetTextAsField-7       281488827                4.248 ns/op           0 B/op          0 allocs/op
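
The shape of such a TextField, sketched with illustrative field names:

```go
package example

// textField sketches the TextField idea above: it caches where a
// specific text field's pointer word and value bytes live, so updates
// can skip pointer resolution (the resolveFarPointer call) entirely.
type textField struct {
	seg      []byte // segment data, resolved once when the field is captured
	valueOff int    // offset of the text's bytes within seg
	valueCap int    // capacity reserved for the value, including the NUL
	ptrOff   int    // offset of the pointer word referring to the value
}

// set copies s over the cached value location when it fits; otherwise
// it reports that the caller must fall back to a regular SetText.
func (f *textField) set(s string) bool {
	if len(s)+1 > f.valueCap {
		return false
	}
	n := copy(f.seg[f.valueOff:], s)
	// Zero the rest of the reserved space, including the terminator.
	for i := f.valueOff + n; i < f.valueOff+f.valueCap; i++ {
		f.seg[i] = 0
	}
	return true
}
```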

lthibault (Collaborator) commented:
@matheusd I'm getting ready to dive into this, and firstly wanted to thank you again for the detailed overview.

Now that you seem to be converging on an implementation, I'm wondering if there's any part of this that we can break off into a smaller PR and merge separately?
