Compress WAL #5592

Merged: 2 commits merged into prometheus:master on Jul 3, 2019

Conversation

@csmarchbanks (Member)

This is a PR to run benchmarks for compressing the WAL (see prometheus-junkyard/tsdb#609).

@krasi-georgiev would you be willing to start a benchmark run when you get a chance?

Currently WALCompression is defaulted to true for the benchmarks; I will switch it to false before marking this ready to merge.
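
For reference, a minimal sketch (not this PR's exact wiring) of opening a WAL with compression turned on through the updated tsdb API. The directory, segment size, and the nil logger/registerer are placeholders for the example; only the wal.NewSize signature, whose final bool enables record compression, comes from the change under review.

// Minimal sketch: open a WAL with record compression enabled.
// The path and segment size are made-up values for illustration.
package main

import (
	"log"

	"github.com/prometheus/tsdb/wal"
)

func main() {
	const segmentSize = 128 * 32 * 1024 // hypothetical: 128 pages of 32 KiB

	// nil logger and registerer keep the example short; real code passes both.
	w, err := wal.NewSize(nil, nil, "data/wal", segmentSize, true /* compress */)
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()
}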

@krasi-georgiev (Contributor)

/benchmark

@prombot (Contributor) commented May 23, 2019

@krasi-georgiev: Welcome to Prometheus Benchmarking Tool.

The two prometheus versions that will be compared are pr-5592 and master

The logs can be viewed at the links provided in the GitHub check blocks at the end of this conversation.

After successful deployment, the benchmarking metrics can be viewed at:

The Prometheus servers being benchmarked can be viewed at:

To stop the benchmark process comment /benchmark cancel.

In response to this:

/benchmark

If you have questions or suggestions related to my behavior, please file an issue against the prometheus/prombench repository.

@krasi-georgiev (Contributor)

Will run for a day and will also check the WAL size savings.

@csmarchbanks (Member, Author)

Looks like compression ratios are in the 1.8 to 2.8 range (from a PromQL query).

Looks good to me so far; WAL fsync latencies are also down significantly. Is it possible to restart each of the Prometheus instances? It would be interesting to see how reading the WAL at startup differs.

@krasi-georgiev (Contributor)

Restarted.

Not sure if there is a metric for the time it took to read the WAL, so let me know and I can probably figure out the timings from the logs.

@csmarchbanks (Member, Author)

Thanks! I can get an estimate based on when samples start getting ingested again:
master: 00:10:10 -> 00:24:55 = 14 minutes, 45 seconds
pr: 00:18:10 -> 00:33:00 = 14 minutes, 50 seconds

Almost no difference, so it seems like decoding speed is not something to worry about either.

@krasi-georgiev (Contributor)

Amazing difference in the WAL size:

No compression:

7.9G    /mnt/disks/ssd0/wal/checkpoint.001516
63G     /mnt/disks/ssd0/wal/

Compressed:

828M    /mnt/disks/ssd0/wal/checkpoint.000798
30G     /mnt/disks/ssd0/wal/

@krasi-georgiev (Contributor)

Do we need more data, or can I shut it down?

@csmarchbanks (Member, Author)

That’s all the data I want, feel free to shut it down.

@krasi-georgiev (Contributor)

/benchmark cancel

@krasi-georgiev (Contributor)

/benchmark

@prombot (Contributor) commented May 29, 2019

@krasi-georgiev: Welcome to Prometheus Benchmarking Tool.

The two prometheus versions that will be compared are pr-5592 and master

The logs can be viewed at the links provided in the GitHub check blocks at the end of this conversation.

After successful deployment, the benchmarking metrics can be viewed at:

The Prometheus servers being benchmarked can be viewed at:

To stop the benchmark process comment /benchmark cancel.

In response to this:

/benchmark

If you have questions or suggestions related to my behavior, please file an issue against the prometheus/prombench repository.

@krasi-georgiev (Contributor)

/benchmark cancel

@bwplotka (Member) commented Jul 3, 2019

@csmarchbanks changed the title from "WIP: Compress WAL" to "Compress WAL" on Jul 3, 2019.
@csmarchbanks marked this pull request as ready for review on Jul 3, 2019 at 13:27.
@csmarchbanks (Member, Author)

@brancz @bwplotka @krasi-georgiev This is ready for a review now.

Did a quick smoke test of remote write in both compressed and non-compressed modes and everything is looking good.

@csmarchbanks (Member, Author)

I broke it into two commits for ease of review: one for a minimal tsdb update, and one for adding the WAL compression flag.

@@ -78,6 +78,7 @@ var (
 	},
 	[]string{queue},
 )
+	liveReaderMetrics = wal.NewLiveReaderMetrics(prometheus.DefaultRegisterer)

Member: I'd love to see us refactor this eventually to not use a global :) Not a blocker here, though.

Member (Author): Would also love to see that. I think the struct has to be exported to enable that, and then we would have to pass it down from somewhere pretty high. That seemed like a more invasive change than making it global for now.

Member: Definitely, just evangelizing this. This pattern is pretty spread around the codebase right now, so definitely something to solve here :)

Member: I can do this as part of prometheus-junkyard/tsdb#606.
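
For illustration, a self-contained sketch of the dependency-injection shape being discussed: build the metrics once near startup and pass them to whatever needs them, rather than reading a package-level global. None of the type or function names below come from Prometheus; they are hypothetical stand-ins.

// Hypothetical sketch: constructor injection of metrics instead of a global.
package main

import "github.com/prometheus/client_golang/prometheus"

// ReaderMetrics stands in for the (currently unexported) live-reader metrics.
type ReaderMetrics struct {
	ReadBytes prometheus.Counter
}

func NewReaderMetrics(reg prometheus.Registerer) *ReaderMetrics {
	m := &ReaderMetrics{
		ReadBytes: prometheus.NewCounter(prometheus.CounterOpts{
			Name: "example_reader_read_bytes_total",
			Help: "Bytes read by the example reader.",
		}),
	}
	if reg != nil {
		reg.MustRegister(m.ReadBytes)
	}
	return m
}

// Reader receives its metrics through its constructor rather than a global.
type Reader struct {
	metrics *ReaderMetrics
}

func NewReader(metrics *ReaderMetrics) *Reader {
	return &Reader{metrics: metrics}
}

func main() {
	metrics := NewReaderMetrics(prometheus.DefaultRegisterer) // created once
	_ = NewReader(metrics)                                    // injected where needed
}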

@@ -505,5 +505,5 @@ func TestCheckpointSeriesReset(t *testing.T) {
 	// If you modify the checkpoint and truncate segment #'s run the test to see how
 	// many series records you end up with and change the last Equals check accordingly
 	// or modify the Equals to Assert(len(wt.seriesLabels) < seriesCount*10)
-	testutil.Equals(t, 14, wt.checkNumLabels())
+	testutil.Equals(t, 13, wt.checkNumLabels())

Member (Author): This change is due to WAL compression causing a different number of segments.

@krasi-georgiev (Contributor) left a comment:

TSDB changes LGTM

@@ -171,7 +171,7 @@ func TestReadToEndNoCheckpoint(t *testing.T) {
 	err = os.Mkdir(wdir, 0777)
 	testutil.Ok(t, err)

-	w, err := wal.NewSize(nil, nil, wdir, 128*pageSize)
+	w, err := wal.NewSize(nil, nil, wdir, 128*pageSize, true)

Contributor: Is this intentionally making the WAL compressed?

Member (Author): I wanted the tests to run using the compressed WAL, since that is the future. Uncompressed is the default right now, though, so I could change them to false for now?

Member: It would probably make sense to test both cases here. I do agree compressed should some day become the default, but for now uncompressed is the default and the default should be tested.

Contributor: I don't mind either way.

Member: (Or for this PR I'd be OK with just testing the default and adding the compressed case as a follow-up.)

Member (Author): Adding the compressed case is pretty easy; it will be done in a couple of minutes. If it looks too complicated I can quickly revert it.

Member (Author): Added t.Run statements for both compressed and uncompressed.
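
A minimal sketch of that t.Run pattern, for reference; the test name, pageSize constant, and test body are assumptions for illustration, while the wal.NewSize call and testutil.Ok follow the diff above.

// Sketch: run the same WAL test for both uncompressed and compressed records.
package waltest

import (
	"fmt"
	"io/ioutil"
	"os"
	"testing"

	"github.com/prometheus/prometheus/util/testutil"
	"github.com/prometheus/tsdb/wal"
)

const pageSize = 32 * 1024 // assumed to match the wal package's page size

func TestReadToEndBothModes(t *testing.T) {
	for _, compress := range []bool{false, true} {
		t.Run(fmt.Sprintf("compress=%t", compress), func(t *testing.T) {
			dir, err := ioutil.TempDir("", "readToEnd")
			testutil.Ok(t, err)
			defer os.RemoveAll(dir)

			// The final bool selects WAL record compression (see the diff above).
			w, err := wal.NewSize(nil, nil, dir, 128*pageSize, compress)
			testutil.Ok(t, err)
			defer w.Close()

			// ... write records and assert they are read back identically ...
		})
	}
}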

[Commit pushed] Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
@bwplotka (Member) left a comment:

OK from my side.

In terms of tests, isn't it worth testing both, uncompressed and compressed? (As @brancz mentioned.)

[Commit pushed] Signed-off-by: Chris Marchbanks <csmarchbanks@gmail.com>
@csmarchbanks (Member, Author)

"Isn't it worth testing both?" Yep, I agree. I updated all the tests to cover both cases, similar to the tsdb tests.

@brancz mentioned this pull request on Jul 3, 2019.
@brancz merged commit 0727318 into prometheus:master on Jul 3, 2019.
@csmarchbanks deleted the compress-wal branch on Jul 3, 2019 at 15:41.