This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Reuse byte buffer in WriteChunks and writeHash #653

Merged (6 commits, Jul 12, 2019)

Conversation

@codesome (Contributor) commented on Jul 5, 2019

This is a broken-down piece of #627.

  • CHANGELOG entry

[Don't merge until Prometheus 2.11 is out]

@codesome (Contributor, Author) commented on Jul 5, 2019

Here are the benchmark results (the ns/op delta fluctuates; it is not always negative):

benchmark                                                                                 old ns/op       new ns/op       delta
BenchmarkCompaction/type=normal,blocks=2,series=1000000,samplesPerSeriesPerBlock=51-8     13072562459     13007510801     -0.50%

benchmark                                                                                 old allocs     new allocs     delta
BenchmarkCompaction/type=normal,blocks=2,series=1000000,samplesPerSeriesPerBlock=51-8     15860945       12854710       -18.95%

benchmark                                                                                 old bytes      new bytes      delta
BenchmarkCompaction/type=normal,blocks=2,series=1000000,samplesPerSeriesPerBlock=51-8     2075024560     2063998112     -0.53%

@csmarchbanks (Contributor) left a comment

Rather than changing the interface, would it make sense to add the buffer to the Writer struct, and then initialize it to the correct size in NewWriter?

chunks/chunks.go (outdated)
@codesome (Contributor, Author) commented on Jul 9, 2019

Rather than changing the interface, would it make sense to add the buffer to the Writer struct, and then initialize it to the correct size in NewWriter?

That is a good idea. I will make the change.
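
A minimal sketch of the two shapes the suggestion could take, with assumed type names (the diff that eventually landed appears further down and uses the array form, which needs no initialization in NewWriter at all):

package chunks

import (
	"bufio"
	"encoding/binary"
	"hash"
)

// Option 1: a slice field, sized once in the constructor.
type writerWithSlice struct {
	wbuf  *bufio.Writer
	n     int64
	crc32 hash.Hash
	buf   []byte // set in NewWriter: make([]byte, binary.MaxVarintLen32)
}

// Option 2: a fixed-size array field; the zero value is ready to use,
// so no constructor change is needed.
type writerWithArray struct {
	wbuf  *bufio.Writer
	n     int64
	crc32 hash.Hash
	buf   [binary.MaxVarintLen32]byte
}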

-	if _, err := h.Write([]byte{byte(cm.Chunk.Encoding())}); err != nil {
+// 'buf' byte slice is used to avoid allocating new byte slice. Assumes that len(buf) is 0.
+func (cm *Meta) writeHash(h hash.Hash, buf []byte) error {
+	buf = append(buf[:0], byte(cm.Chunk.Encoding()))
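
A sketch of the full method after the change; everything past the first append is assumed from context rather than shown in the hunk above, and Meta here is a stand-in for the real chunks.Meta:

package chunks

import "hash"

// Meta is a stand-in with only the pieces this sketch needs.
type Meta struct {
	Chunk interface {
		Encoding() byte // the real Encoding returns a chunkenc.Encoding
		Bytes() []byte
	}
}

// writeHash writes the chunk's encoding byte and data into h, reusing the
// caller-supplied buffer instead of allocating a fresh one-byte slice.
func (cm *Meta) writeHash(h hash.Hash, buf []byte) error {
	buf = append(buf[:0], byte(cm.Chunk.Encoding()))
	if _, err := h.Write(buf); err != nil {
		return err
	}
	if _, err := h.Write(cm.Chunk.Bytes()); err != nil {
		return err
	}
	return nil
}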
Contributor:

Hm.. are we trying to save one byte of allocation per op here? (: Also not sure it really assumes that len(buf) is 0, right?

Contributor (Author):

I forgot to remove the old comment (removed now); we don't assume len(buf) to be 0 anymore.

Hm.. are we trying to save here one byte of allocation per ops

Nope. Slice header = 2 bytes (?), 1 byte for the data. And this is called per series, per chunk, so it would be millions of allocs at a large scale. This is the benchmark difference just for writeHash (the base already contains the WriteChunks optimization).

benchmark                                                                                 old allocs     new allocs     delta
BenchmarkCompaction/type=normal,blocks=2,series=1000000,samplesPerSeriesPerBlock=51-8     14825915       12829175       -13.47%

Apparently, in my benchmark, about 60-67% of the allocation savings come from writeHash, and the rest from reusing the buffer in WriteChunks.
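
A hypothetical micro-benchmark sketch of the pattern under discussion (names assumed, not from the PR): the []byte literal passed through the hash.Hash interface escapes to the heap, typically reporting 1 alloc/op, while the reused buffer reports 0.

package chunks_test

import (
	"hash/crc32"
	"testing"
)

// Old pattern: a fresh one-byte slice escapes on every call.
func BenchmarkEncodingByteLiteral(b *testing.B) {
	h := crc32.NewIEEE()
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		h.Write([]byte{byte(i)}) // typically 1 alloc/op
	}
}

// New pattern: one buffer allocated outside the loop and reused.
func BenchmarkEncodingByteReused(b *testing.B) {
	h := crc32.NewIEEE()
	buf := make([]byte, 0, 1) // allocated once
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		buf = append(buf[:0], byte(i))
		h.Write(buf) // 0 allocs/op
	}
}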

Contributor:

Wouldn't new slice headers be created when passing the slice around, so the only saving is the single byte? That said, I can see how this being in an inner loop could save a lot of GC work.

Contributor (Author):

Wouldn't new slice headers be created when passing the slice around

Interesting, I didn't know/think about it!
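
A minimal sketch illustrating the point (names hypothetical): passing a slice copies only its three-word header by value into the callee's frame; that copy is not a heap allocation, and the backing array is shared, not duplicated.

package main

import "fmt"

// takesSlice receives a copy of the slice header (pointer, len, cap) in its
// own stack frame; the backing array is shared, not copied or re-allocated.
func takesSlice(b []byte) int {
	return len(b) + cap(b)
}

func main() {
	buf := make([]byte, 0, 1) // one allocation, reused across calls
	buf = append(buf[:0], 0x01)
	fmt.Println(takesSlice(buf)) // copies only the header, no new alloc
}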

@bwplotka (Contributor) commented on Jul 11, 2019

Nope. Slice header = 2 bytes (?),

https://golang.org/pkg/reflect/#SliceHeader

I can see 3 fields, so it's definitely not 2 bytes -> rather 24 on a 64-bit platform, as each of those is 8 bytes (:
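
For reference, the definition behind that link:

// From the reflect package documentation linked above: three machine
// words, so 24 bytes on a 64-bit platform.
type SliceHeader struct {
	Data uintptr
	Len  int
	Cap  int
}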

Contributor (Author):

Shows that I was in a hurry 😅. I assumed 1 byte per field without much thought!

The savings in allocs are good anyway, so WDYT?

Contributor:

Yup 👍

@@ -97,6 +98,7 @@ type Writer struct {
 	wbuf  *bufio.Writer
 	n     int64
 	crc32 hash.Hash
+	buf   [binary.MaxVarintLen32]byte
Contributor:

A thought here: by moving this array into the struct, it will now always be allocated on the heap, rather than having a chance of being allocated on the stack when it was inside the function, correct? Could that actually make things less efficient overall?

Not sure if it matters, but I am a bit curious to have a brief discussion.

Contributor:

I think this might be a valid point - it was likely on the stack in the previous version, and thus a no-op for GC and quickly cleared when the function ends.

But still, an allocation for those 5 bytes had to happen on every call instead of just once at creation time. Are those allocs even visible in the mem benchmarks, though? Aren't those heap-only? Or is it just improving latency @codesome? Would be interested why we see a gain in the benchmarks here as well (:
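
A sketch bearing on the measurability question: Go's -benchmem / b.ReportAllocs counters report heap allocations only, so a local array that stays on the stack shows 0 allocs/op even though it is "allocated" on every call (the benchmark name here is hypothetical).

package chunks_test

import "testing"

func BenchmarkStackArray(b *testing.B) {
	b.ReportAllocs()
	var sink byte
	for i := 0; i < b.N; i++ {
		var buf [5]byte // does not escape, so it lives on the stack
		buf[0] = byte(i)
		sink += buf[0] // keep the array live so it isn't optimized away
	}
	_ = sink
}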

Contributor (Author):

chunks.Writer is created only once per compaction, so I would not worry much about the 5-byte array being allocated on the heap vs. the stack (would it make a big difference?). This looked cleaner, hence I kept it like that.

@bwplotka (Contributor) left a comment

LGTM from my side, thanks for the explanation.

Some questions around the second improvement, as @csmarchbanks mentioned. (:


@gouthamve (Collaborator) left a comment

LGTM from me as well with a nit and a question.

Also, I haven't done much optimisation work myself, but these changes honestly look like micro-optimisations to me.

-func (cm *Meta) writeHash(h hash.Hash) error {
-	if _, err := h.Write([]byte{byte(cm.Chunk.Encoding())}); err != nil {
+func (cm *Meta) writeHash(h hash.Hash, buf []byte) error {
+	buf = append(buf[:0], byte(cm.Chunk.Encoding()))
Collaborator:

Can we be consistent here and do buf[0] = byte(...) as we do here: https://github.com/prometheus/tsdb/pull/653/files#diff-ba6ef14901e90aea742f3fd2909d7f07R315

Contributor (Author):

As writeHash is not part of chunks.Writer, I would not make any assumptions about the size of the passed buffer; it is safer to append. The one you pointed to is a chunks.Writer method and is assured of its size. (But if you think it is fine since it is all internal, I can make the change.)
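
Hypothetical helpers contrasting the two styles (names are illustrative, not from the PR): indexed assignment assumes the caller already sized the buffer, while append is safe for any capacity.

package chunks

// putEncodingIndexed panics if len(buf) == 0; fine for the fixed-size
// w.buf[:] inside chunks.Writer, where the length is guaranteed.
func putEncodingIndexed(buf []byte, enc byte) {
	buf[0] = enc
}

// putEncodingAppend works for any caller-supplied buffer and only
// allocates when cap(buf) == 0.
func putEncodingAppend(buf []byte, enc byte) []byte {
	return append(buf[:0], enc)
}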

 for i := range chks {
 	chk := &chks[i]

 	chk.Ref = seq | uint64(w.n)

-	n := binary.PutUvarint(b[:], uint64(len(chk.Chunk.Bytes())))
+	n := binary.PutUvarint(w.buf[:], uint64(len(chk.Chunk.Bytes())))
Collaborator:

Curious what the difference between passing w.buf and w.buf[:] is...

Contributor (Author):

w.buf is an array, but PutUvarint expects a slice; w.buf[:] creates a slice backed by the array.
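
A small runnable illustration of that conversion:

package main

import (
	"encoding/binary"
	"fmt"
)

func main() {
	// A fixed-size array like w.buf in the PR; PutUvarint expects a
	// []byte, so the call site slices the array with [:].
	var buf [binary.MaxVarintLen32]byte
	n := binary.PutUvarint(buf[:], uint64(12345)) // buf[:] views the array as a slice
	fmt.Println(buf[:n])
}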

Contributor:

Nice!

@codesome (Contributor, Author):

but these changes honestly look like micro-optimisations to me

~19% savings in allocs; I would not call that a micro-optimization :)
All these optimizations add up to 76-77% saved allocs during compaction. #627 (comment)

@bwplotka (Contributor):

👍

@csmarchbanks (Contributor) left a comment

LGTM as well!

@codesome merged commit 3bc1ea3 into prometheus-junkyard:master on Jul 12, 2019
@krasi-georgiev (Contributor):

@codesome I would also prefer to run a prombench test with all of these before merging, to justify the additional complexity.

@codesome (Contributor, Author):

@krasi-georgiev given that I have merged multiple optimizations, do you want me to run prombench for each of them separately?

@krasi-georgiev (Contributor):

I would guess running each one separately would be a bit of a headache, so maybe all at once. However you prefer, I don't mind. The main point is to decide whether it is worth this extra mental baggage.

@codesome (Contributor, Author):

I will start prombench with everything together on Monday, as I wouldn't be able to track it over the weekend.
