Reuse byte buffer in WriteChunks and writeHash #653
Conversation
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Here are the benchmark results (the ns/op diff fluctuates; it's not always negative):
Rather than changing the interface, would it make sense to add the buffer to the Writer struct, and then initialize it to the correct size in NewWriter?
That is a good idea. I will make the change.
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Force-pushed from a371048 to b93ea11
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
Force-pushed from b93ea11 to 41bacfe
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
if _, err := h.Write([]byte{byte(cm.Chunk.Encoding())}); err != nil {
// 'buf' byte slice is used to avoid allocating new byte slice. Assumes that len(buf) is 0.
func (cm *Meta) writeHash(h hash.Hash, buf []byte) error {
	buf = append(buf[:0], byte(cm.Chunk.Encoding()))
Hm.. are we trying to save one byte of allocation per op here? (: Also, not sure it really assumes that len(buf) is 0, right?
I forgot to remove the old comment (removed now); we don't assume len(buf) to be 0 now.
Hm.. are we trying to save here one byte of allocation per ops
Nope. Slice header = 2 bytes (?), plus 1 byte for the data. And this is called per series per chunk, so it would be millions of allocs at a large scale. This is the benchmark difference just for writeHash (the base contains the optimization in WriteChunks):
benchmark old allocs new allocs delta
BenchmarkCompaction/type=normal,blocks=2,series=1000000,samplesPerSeriesPerBlock=51-8 14825915 12829175 -13.47%
Apparently, in my benchmark, about 60-67% of the allocs savings is from writeHash and the rest from re-using the buffer in WriteChunks.
Wouldn't new slice headers be created when passing the slice around, so the only saving is the single byte? That said, I can see how this being in an inner loop could save a lot of GC work.
Wouldn't new slice headers be created when passing the slice around
Interesting, I didn't know/think about it!
Nope. Slice header = 2 bytes (?),
https://golang.org/pkg/reflect/#SliceHeader
I can see 3 integers, so it's definitely not 2 bytes; rather 24 on a 64-bit platform, as each of the three fields is 8 bytes (:
Shows that I was in a hurry 😅. I considered 1 byte per variable without much thought!
The savings in allocs with this is good anyway, so WDYT?
Yup 👍
Signed-off-by: Ganesh Vernekar <cs15btech11018@iith.ac.in>
@@ -97,6 +98,7 @@ type Writer struct {
	wbuf  *bufio.Writer
	n     int64
	crc32 hash.Hash
	buf   [binary.MaxVarintLen32]byte
A thought here: by moving this array onto the struct, it will now always be allocated on the heap, rather than having a chance of being allocated on the stack when it was in the function, correct? Could that actually make things less efficient overall?
Not sure if it matters, but I am a bit curious to have a brief discussion.
I think this might be a valid point: it was likely on the stack in the previous version, and thus a no-op for GC, quickly cleared when the function ends.
But then an allocation for those 5 bytes had to happen on every call instead of just once at creation time. Are those allocs even visible in the allocs column of mem benchmarks? Aren't those heap-only? Or is it just improving latency, @codesome? I would be interested in why we see a gain in the benchmarks here as well (:
chunks.Writer is created only once per compaction, so I would not worry much about this 5-byte array being allocated on either the heap or the stack (would it make a big difference?). This looked cleaner, hence I kept it like that.
It's LGTM from my side, thanks for the explanation.
Some questions around the second improvement, as @csmarchbanks mentioned. (:
LGTM from me as well with a nit and a question.
Also, I haven't done much optimisation work myself, but these changes honestly look like micro-optimisations to me.
func (cm *Meta) writeHash(h hash.Hash) error {
if _, err := h.Write([]byte{byte(cm.Chunk.Encoding())}); err != nil {
func (cm *Meta) writeHash(h hash.Hash, buf []byte) error {
	buf = append(buf[:0], byte(cm.Chunk.Encoding()))
Can we be consistent here and do buf[0] = byte(...), as we do here: https://github.com/prometheus/tsdb/pull/653/files#diff-ba6ef14901e90aea742f3fd2909d7f07R315?
As writeHash is not a part of chunks.Writer, I would not make any size assumptions about the passed buffer; it is safer to append. The one you pointed to is a chunks.Writer method and is assured of its size. (But if you think it is fine since it is all internal, I can make the change.)
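The trade-off described here can be seen in a standalone snippet: append(buf[:0], b) works for a buffer of any capacity, while buf[0] = b panics if the caller passes a zero-length slice.

```go
package main

import "fmt"

func main() {
	empty := []byte{} // len 0, cap 0: no size guarantee from the caller

	// An indexed write would panic here:
	//   empty[0] = 1 // runtime error: index out of range

	// Append is safe regardless of capacity; it grows the slice
	// only when needed and reuses the storage otherwise.
	empty = append(empty[:0], 1)
	fmt.Println(empty) // → [1]
}
```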
for i := range chks {
	chk := &chks[i]

	chk.Ref = seq | uint64(w.n)

	n := binary.PutUvarint(b[:], uint64(len(chk.Chunk.Bytes())))
	n := binary.PutUvarint(w.buf[:], uint64(len(chk.Chunk.Bytes())))
Curious what the difference between passing w.buf and w.buf[:] is...
w.buf is an array, but PutUvarint expects a slice; w.buf[:] converts the array into a slice.
Nice!
~19% savings in allocs, I would not call that a micro-opt :)
👍
LGTM as well!
@codesome I would also prefer to run a prombench test with all these before merging, to justify the additional complexity.
@krasi-georgiev given that I have merged multiple optimizations, do you want me to run prombench for each of them separately?
I would guess each one would be a bit of a headache, so maybe all at once.
I will start prombench with everything together on Monday, as I wouldn't be able to track it over the weekend.
This is a broken-down piece of #627.
[Don't merge until Prometheus 2.11 is out]