Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose paused and retired workers separately in prometheus #8613

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

phofl
Copy link
Collaborator

@phofl phofl commented Apr 11, 2024

Closes #xxxx

  • Tests added / passed
  • Passes pre-commit run --all-files

cc @ntabris for the grafana dashboards

@phofl phofl requested a review from fjetter as a code owner April 11, 2024 13:14
@ntabris
Copy link
Contributor

ntabris commented Apr 11, 2024

Having paused and retiring and paused_or_retiring makes things more complicated for me, because various things would need to include that iff paused and retiring are not present (and not include if they are, otherwise we'd be double counting).

Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.

@fjetter
Copy link
Member

fjetter commented Apr 11, 2024

Thoughts about removing paused_or_retiring? I know this would be a breaking change in some sense, but it's also kinda a breaking change to have more non-exclusive states.

I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?

@ntabris
Copy link
Contributor

ntabris commented Apr 11, 2024

I think we don't have any strong preferences about keeping/removing the paused_or_retiring metric. Can you elaborate how adding those would be a breaking change?

I said "kinda". It messes up anything that assumes states other than "connected" are exclusive, or that (eg) a chart of all states other than "connected" would make sense.

Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).

@fjetter
Copy link
Member

fjetter commented Apr 11, 2024

Instead, one would need logic that includes paused_or_retiring or [paused and retiring] but not both... which isn't very straightforward in Prometheus (I'm still thinking about how to do this).

Right, that makes sense. It's a shame we didn't introduce the split metrics earlier. For now, we're mostly interested in the paused metric because this may highlight a problematic cluster behavior (too small in size) while the retiring signal is a little bit of noise and is usually harmless.

So I don't have a solution for the "nice chart" problem but when using this as a tag, the retired metric as a standalone thing already makes sense.

Copy link
Contributor

github-actions bot commented Apr 11, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files      29 suites   11h 4m 38s ⏱️
 4 060 tests  3 948 ✅   109 💤 3 ❌
54 958 runs  52 546 ✅ 2 409 💤 3 ❌

For more details on these failures, see this check.

Results for commit 97675e9.

♻️ This comment has been updated with latest results.

@phofl
Copy link
Collaborator Author

phofl commented Apr 11, 2024

I'd be fine with removing the paused_or_retiring metric, my main concern was backward compatibility.

We mostly care about paused as @fjetter said

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants