Scheduler Metrics #2674

Open
d80tb7 opened this issue Jul 13, 2023 · 0 comments
Labels
component/scheduling Armada Server, Scheduler and Scheduler Injester type/design Design / Architecture suggestions

d80tb7 commented Jul 13, 2023

The new "Pulsar-backed" scheduler should expose a set of Prometheus metrics that shed light on its internal workings. An initial set of metrics would be:

  • Scheduler cycle time
  • Number of jobs considered (per queue?)
  • Number of jobs scheduled (per cluster etc.)
  • Number of jobs preempted
  • Number of clusters scheduled
  • Evaluated fair share of each queue
  • Delta between fair share and usage of each queue
  • Whether the cycle completed successfully (added 23/08)
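
To make the list above more concrete, the per-cycle values could be gathered into a single record before they are exported. This is only a sketch: the `CycleReport` type, its field names, and the `ShareDeltas` helper are hypothetical and not taken from the Armada codebase, and the sign convention (fair share minus usage) is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// CycleReport is a hypothetical record of one scheduling cycle.
// The field names are illustrative and are not Armada's actual types.
type CycleReport struct {
	Duration          time.Duration
	Succeeded         bool
	ConsideredByQueue map[string]int
	ScheduledByQueue  map[string]int
	Preempted         int
	FairShareByQueue  map[string]float64
	UsageByQueue      map[string]float64
}

// ShareDeltas returns fair share minus usage for every queue that has a fair share.
// Under this assumed convention, a positive delta means the queue is below its fair share.
func ShareDeltas(fairShare, usage map[string]float64) map[string]float64 {
	deltas := make(map[string]float64, len(fairShare))
	for queue, share := range fairShare {
		deltas[queue] = share - usage[queue] // a queue with no recorded usage counts as zero
	}
	return deltas
}

func main() {
	report := CycleReport{
		Duration:         2 * time.Second,
		Succeeded:        true,
		FairShareByQueue: map[string]float64{"queue-a": 0.5, "queue-b": 0.5},
		UsageByQueue:     map[string]float64{"queue-a": 0.75},
	}
	for queue, delta := range ShareDeltas(report.FairShareByQueue, report.UsageByQueue) {
		fmt.Printf("queue=%s delta=%.2f\n", queue, delta)
	}
}
```

Keeping the values in one plain struct means the scheduling code can be tested without touching any metrics library, and the export step becomes a single function over `CycleReport`.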

Note that because Prometheus samples metrics at scrape time (a gauge only exposes its latest value, so anything that happens between two scrapes is lost), we probably want to store some or all of these as histograms rather than gauges.
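
The difference can be illustrated without any metrics library. The sketch below uses hand-rolled `Gauge` and `Histogram` types (both hypothetical stand-ins, not the Prometheus client's types) and assumes four scheduling cycles complete between two scrapes, one of them slow.

```go
package main

import "fmt"

// Gauge keeps only the most recent value, so a scrape sees whatever the last cycle wrote.
type Gauge struct{ value float64 }

func (g *Gauge) Set(v float64) { g.value = v }

// Histogram counts every observation into cumulative buckets,
// similar in spirit to the "le" buckets of a Prometheus histogram.
type Histogram struct {
	UpperBounds []float64 // sorted ascending
	Counts      []uint64  // Counts[i] is the number of observations <= UpperBounds[i]
	Count       uint64    // total number of observations
	Sum         float64   // sum of all observed values
}

func NewHistogram(bounds ...float64) *Histogram {
	return &Histogram{UpperBounds: bounds, Counts: make([]uint64, len(bounds))}
}

func (h *Histogram) Observe(v float64) {
	h.Count++
	h.Sum += v
	for i, bound := range h.UpperBounds {
		if v <= bound {
			h.Counts[i]++
		}
	}
}

func main() {
	gauge := &Gauge{}
	hist := NewHistogram(1, 5, 10)

	// Cycle times in seconds; all four finish before the next scrape.
	for _, seconds := range []float64{0.5, 9, 0.5, 0.5} {
		gauge.Set(seconds)
		hist.Observe(seconds)
	}

	fmt.Println("gauge at scrape:", gauge.value)
	fmt.Println("cycles observed:", hist.Count)
	fmt.Println("cycles slower than 5s:", hist.Count-hist.Counts[1])
}
```

At scrape time the gauge only reflects the last, fast cycle, while the histogram still shows that one of the four cycles took longer than 5 seconds.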

There is already some prior art for exposing Prometheus metrics in Armada; see for example here and here (the latter of those being the new scheduler exposing which instance is leader). We use the official Prometheus library for this, but we've found it quite difficult to work with because:

  • It's hard to write unit tests
  • It is quite fiddly to use (lots of strings and array sizes that need to match up across different places in the code, with panics if they don't)
  • Quite a lot of boilerplate to write
  • Everything is asynchronous.

It might therefore be worth evaluating two possibilities here:

  • we're idiots and we're using this library incorrectly
  • there is another library that we can use which is more suited to our use case.
@theAntiYeti theAntiYeti self-assigned this Jul 13, 2023
@Sharpz7 Sharpz7 added type/design Design / Architecture suggestions component/scheduling Armada Server, Scheduler and Scheduler Injester labels Aug 24, 2023