Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tsdb: check for context cancel before regex matching postings #14096

Merged

Conversation

krajorama
Copy link
Member

@krajorama krajorama commented May 14, 2024

Regex matching can be heavy if the regex takes a lot of cycles to evaluate and we can get stuck evaluating postings for a long time without this fix.

This PR adds a missing unit test for checking that the index reader honors context cancel when doing r.traversePostingOffsets iteration.

Regex matching can be heavy if the regex takes a lot of cycles to
evaluate and we can get stuck evaluating postings for a long time
without this fix.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama marked this pull request as draft May 14, 2024 05:50
@krajorama

This comment was marked as outdated.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama marked this pull request as ready for review May 14, 2024 07:08
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama
Copy link
Member Author

We could probably have more sophisticated generated tests, but these show the intent at least.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
tsdb/index/index_test.go Outdated Show resolved Hide resolved
tsdb/index/postings_test.go Outdated Show resolved Hide resolved
tsdb/querier_test.go Outdated Show resolved Hide resolved
tsdb/querier_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (modulo a couple of nits)

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Copy link
Member

@jesusvazquez jesusvazquez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a small question

tsdb/index/index.go Show resolved Hide resolved
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
tsdb/index/index_test.go Outdated Show resolved Hide resolved
Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM! (modulo @pracucci comments)

tsdb/index/index.go Show resolved Hide resolved
Comment on lines +55 to +56
// checkContextEveryNIterations is used in some tight loops to check if the context is done.
checkContextEveryNIterations = 100
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be 1024

  • I feel like 100 is too often. How often do we want to check? I'd say every 10ms or even 100ms would be enough. For 10ms, this would mean that each check takes 100 microseconds, and that sounds like a lot (do we have numbers?)
  • Can we make it a power of two like 1024? We don't really care if it's every 1000 or every 1024 (or 100 vs 128), but it's way cheaper for the CPU to do the latter operation. (I did the benchmarks, and there's a noticeable difference when switching to 128. Interestingly, same difference as checking count == count>>7<<7 but for some reason, checking count & (1<<7-1) == 0 is less performant)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also happy to follow up later with a separate PR and benchmarks that would support this comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed that checking every 100ms is enough.
We didn't have timings; I guessed that 100 was right.
The case that prompted this PR had a 7KB regexp with a 40-deep stack.
Also the CPU was throttled because it was well over its request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a favorite in this race, we've seen 1000 be effective in a test we did. @bboreham are you ok changing to 1024?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we put out the fire in our own infra, I would like to see one timing of one comparable case before changing this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's merge with 100 and I'll bring proper benchmarks to update the value later.

Copy link
Contributor

@colega colega May 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow up: #14118

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
@krajorama krajorama merged commit fdaafdb into prometheus:main May 15, 2024
25 checks passed
krajorama added a commit to grafana/mimir that referenced this pull request May 15, 2024
krajorama added a commit to krajorama/prometheus that referenced this pull request May 15, 2024
Followup to prometheus#14096

Unfortunately the previous PR introduced this bug by not releasing the
lock before returning.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
krajorama added a commit to grafana/mimir that referenced this pull request May 15, 2024
grafanabot pushed a commit to grafana/mimir that referenced this pull request May 15, 2024
* Update mimir-prometheus to e5b85c151ba8

grafana/mimir-prometheus@e5b85c1

prometheus/prometheus#14103
prometheus/prometheus#14096
prometheus/prometheus#14068
prometheus/prometheus#13669

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
(cherry picked from commit 307567a)
krajorama added a commit to grafana/mimir that referenced this pull request May 15, 2024
* Update mimir-prometheus to e5b85c151ba8

grafana/mimir-prometheus@e5b85c1

prometheus/prometheus#14103
prometheus/prometheus#14096
prometheus/prometheus#14068
prometheus/prometheus#13669

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
(cherry picked from commit 307567a)

Co-authored-by: George Krajcsovits <krajorama@users.noreply.github.com>
paulojmdias pushed a commit to paulojmdias/prometheus that referenced this pull request May 15, 2024
…heus#14096)

* tsdb: check for context cancel before regex matching postings

Regex matching can be heavy if the regex takes a lot of cycles to
evaluate and we can get stuck evaluating postings for a long time
without this fix. The constant checkContextEveryNIterations=100
may be changed later.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
paulojmdias pushed a commit to paulojmdias/prometheus that referenced this pull request May 15, 2024
Followup to prometheus#14096

Unfortunately the previous PR introduced this bug by not releasing the
lock before returning.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
paulojmdias added a commit to paulojmdias/prometheus that referenced this pull request May 15, 2024
Quote label name in matchers when needed

When the label name of a matcher contains non-standard characters, like
a dot, or starts with a digit, it should be quoted.

If it's not quoted, then `VectorSelector.String()` isn't a valid PromQL.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

Optimize Matcher.String()

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

Use bytes.Buffer from stack buf in Matcher.String()

Also removed the growing until there's a benchmark for that.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

tsdb: check for context cancel before regex matching postings (prometheus#14096)

* tsdb: check for context cancel before regex matching postings

Regex matching can be heavy if the regex takes a lot of cycles to
evaluate and we can get stuck evaluating postings for a long time
without this fix. The constant checkContextEveryNIterations=100
may be changed later.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

tsdb/index/postings: fix missing lock unlock

Followup to prometheus#14096

Unfortunately the previous PR introduced this bug by not releasing the
lock before returning.

Signed-off-by: György Krajcsovits <gyorgy.krajcsovits@grafana.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>

fix: rebase

bugfix: allow opting-out of multi-cluster setups

Allow users to opt-out of the multi-cluster setup for Prometheus
dashboard, in environments where it isn't applicable.

Refer: prometheus#13180.

Signed-off-by: Pranshu Srivastava <rexagod@gmail.com>
Signed-off-by: Paulo Dias <paulodias.gm@gmail.com>
colega added a commit to colega/prometheus that referenced this pull request May 17, 2024
Follow up on prometheus#14096

As promised, I bring a benchmark, which shows a very small improvement
if context is checked every 128 iterations of label instead of every
100.

It's much easier for a computer to check modulo 128 than modulo 100.
This is a very small 0-2% improvement but I'd say this is one of the
hottest paths of the app so this is still relevant.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
jesusvazquez pushed a commit that referenced this pull request May 21, 2024
Follow up on #14096

As promised, I bring a benchmark, which shows a very small improvement
if context is checked every 128 iterations of label instead of every
100.

It's much easier for a computer to check modulo 128 than modulo 100.
This is a very small 0-2% improvement but I'd say this is one of the
hottest paths of the app so this is still relevant.

Signed-off-by: Oleg Zaytsev <mail@olegzaytsev.com>
francoposa pushed a commit to grafana/mimir that referenced this pull request May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants