Compress WAL #5592
0e17c43 to 0945873
/benchmark
@krasi-georgiev: Welcome to Prometheus Benchmarking Tool. The two Prometheus versions that will be compared are pr-5592 and master. The logs can be viewed at the links provided in the GitHub check blocks at the end of this conversation. After successful deployment, the benchmarking metrics can be viewed at :
The Prometheus servers being benchmarked can be viewed at :
To stop the benchmark process comment /benchmark cancel . In response to this:
If you have questions or suggestions related to my behavior, please file an issue against the prometheus/prombench repository.
Will run for a day and will also check the WAL size savings.
Looks like compression ratios are in the 1.8 - 2.8 range (prom query). Looks good to me so far; WAL fsync latencies are also down significantly. Is it possible to restart each of the Prometheus instances? It would be interesting to see how reading the WAL at startup differs.
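For readers unfamiliar with the metric: a compression ratio here is uncompressed size divided by compressed size, so 2.0 means the compressed WAL takes half the space. The query used in the thread is not preserved, so the following is only a minimal sketch of the arithmetic. It uses the standard library's `compress/flate` as a stand-in codec (an assumption made to keep the example self-contained, not the codec used by the PR), and `sampleRecord` is a hypothetical payload, not real WAL data.

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
	"log"
)

// compressionRatio returns len(raw) / len(compressed) for one payload.
// compress/flate is a stand-in codec used only for illustration.
func compressionRatio(raw []byte) (float64, error) {
	var buf bytes.Buffer
	zw, err := flate.NewWriter(&buf, flate.BestSpeed)
	if err != nil {
		return 0, err
	}
	if _, err := zw.Write(raw); err != nil {
		return 0, err
	}
	// Close flushes any pending compressed data into buf.
	if err := zw.Close(); err != nil {
		return 0, err
	}
	return float64(len(raw)) / float64(buf.Len()), nil
}

// sampleRecord builds a hypothetical, repetitive payload that loosely
// resembles label-heavy series data. It is not real WAL content.
func sampleRecord(n int) []byte {
	var b bytes.Buffer
	for i := 0; i < n; i++ {
		fmt.Fprintf(&b, "series{job=\"node\",instance=\"host-%d\"} ", i%10)
	}
	return b.Bytes()
}

func main() {
	ratio, err := compressionRatio(sampleRecord(1000))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("compression ratio: %.1f\n", ratio)
}
```

The ratio printed for this synthetic input says nothing about real workloads; the 1.8 - 2.8 range above comes from the benchmark run itself.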
Restarted. Not sure if there is a metric for the time it took to read the logs, so let me know and I can probably figure out the timings from the logs.
Thanks! I can get an estimate based on when samples start getting ingested again: almost no difference, so it seems decoding speed is not something to be worried about either.
amazing difference in the WAL size (screenshots: no compression, compressed)
do we need more data or can I shut it down?
That’s all the data I want, feel free to shut it down.
/benchmark cancel
/benchmark
@krasi-georgiev: Welcome to Prometheus Benchmarking Tool. The two Prometheus versions that will be compared are pr-5592 and master. The logs can be viewed at the links provided in the GitHub check blocks at the end of this conversation. After successful deployment, the benchmarking metrics can be viewed at :
The Prometheus servers being benchmarked can be viewed at :
To stop the benchmark process comment /benchmark cancel . In response to this:
If you have questions or suggestions related to my behavior, please file an issue against the prometheus/prombench repository.
/benchmark cancel
@csmarchbanks TSDB https://github.com/prometheus/tsdb/releases/tag/v0.9.1 is released. (:
0945873 to 9a3fab4
@brancz @bwplotka @krasi-georgiev This is ready for a review now. Did a quick smoke test of remote write in both compressed and non-compressed modes and everything is looking good.
I broke it into two commits for ease of review - one for a minimal tsdb update, one for adding the WAL compression flag.
@@ -78,6 +78,7 @@ var (
  },
  []string{queue},
  )
+ liveReaderMetrics = wal.NewLiveReaderMetrics(prometheus.DefaultRegisterer)
I'd love to see us refactor this eventually to not use global :) not a blocker here though
Would also love to see that, I think the struct has to be exported to enable that and then we would have to pass it down from somewhere pretty high. Seemed like a more invasive change than making it global for now.
Definitely. Just evangelizing this: the pattern is pretty widespread in the codebase right now, so it is definitely something to solve :)
I can do this as part of prometheus-junkyard/tsdb#606
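The refactor discussed in this thread (an exported metrics struct that is created once and passed down, instead of a package-level variable) could look roughly like the sketch below. Everything in it is hypothetical: `Registerer`, `memRegistry`, `readerMetrics`, `watcher`, and the metric name are stand-ins chosen so the example has no external dependencies. They are not the types used in Prometheus or tsdb.

```go
package main

import "fmt"

// Registerer is a hypothetical, minimal stand-in for a metrics registry
// interface. It is not the client library's interface.
type Registerer interface {
	Register(name string)
}

// memRegistry records registered metric names in memory.
type memRegistry struct{ names []string }

func (r *memRegistry) Register(name string) { r.names = append(r.names, name) }

// readerMetrics is a hypothetical metrics struct. It is built once and
// shared, rather than living in a package-level variable.
type readerMetrics struct{ reg Registerer }

func newReaderMetrics(reg Registerer) *readerMetrics {
	reg.Register("reader_corruptions_total") // hypothetical metric name
	return &readerMetrics{reg: reg}
}

// watcher receives its metrics through the constructor, so tests can pass
// a fresh registry instead of sharing global state.
type watcher struct {
	name    string
	metrics *readerMetrics
}

func newWatcher(name string, m *readerMetrics) *watcher {
	return &watcher{name: name, metrics: m}
}

func main() {
	reg := &memRegistry{}
	m := newReaderMetrics(reg) // created once, high up in the call chain
	a := newWatcher("queue-a", m)
	b := newWatcher("queue-b", m)
	// Both watchers share one metrics instance; it was registered once.
	fmt.Println(a.metrics == b.metrics, len(reg.names)) // prints "true 1"
}
```

The cost noted in the thread still applies: the struct has to be exported and threaded through every constructor between the top-level setup and the reader.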
storage/remote/wal_watcher_test.go
Outdated
@@ -505,5 +505,5 @@ func TestCheckpointSeriesReset(t *testing.T) {
  // If you modify the checkpoint and truncate segment #'s run the test to see how
  // many series records you end up with and change the last Equals check accordingly
  // or modify the Equals to Assert(len(wt.seriesLabels) < seriesCount*10)
- testutil.Equals(t, 14, wt.checkNumLabels())
+ testutil.Equals(t, 13, wt.checkNumLabels())
This change is due to the WAL compression causing a different number of segments.
TSDB changes LGTM
storage/remote/wal_watcher_test.go
Outdated
@@ -171,7 +171,7 @@ func TestReadToEndNoCheckpoint(t *testing.T) {
  err = os.Mkdir(wdir, 0777)
  testutil.Ok(t, err)

- w, err := wal.NewSize(nil, nil, wdir, 128*pageSize)
+ w, err := wal.NewSize(nil, nil, wdir, 128*pageSize, true)
is this intentional for making the wal compressed?
I wanted the tests to run using the compressed WAL since that is the future. Uncompressed is the default right now though so I could change them to false for now?
It would probably make sense to test both cases here. I do agree compressed should some day become the default, but for now uncompressed is the default and the default should be tested.
I don't mind either way.
(or for this PR I'd be ok with just testing the default and adding the compressed case as a follow up)
Adding the compressed case is pretty easy and will be done in a couple of minutes. If it looks too complicated I can quickly revert it.
Added t.Run statements for both compressed and uncompressed.
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
OK from my side.
In terms of tests: isn't it worth testing both, uncompressed and compressed (as @brancz mentioned)?
Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
9a3fab4 to 06f1ba7
Yep, I agree. I updated all the tests to test both cases similar to the tsdb tests.
This is a PR in order to run benchmarks for compressing the WAL (see prometheus-junkyard/tsdb#609).
@krasi-georgiev would you be willing to start a benchmark run when you get a chance?
Currently WALCompression defaults to true for the benchmarks; I will switch it to false before marking this ready to merge.